Scholarly Literature as an Object of Computation: Implications for Libraries, Publishers, Authors

  • Clifford Lynch - Executive Director, Coalition for Networked Information
[msp: Clifford Lynch opened the conference with this keynote presentation on data mining and the implications for libraries, license agreements, authors and publishers, and related issues. As always, he presented new ideas and provocative arguments, which I tried to record accurately. I know I missed a lot, mainly because Clifford always blows me away with his brilliance and the way he is able to speak in complete sentences and paragraphs, without notes. I think I've captured the gist of his presentation, but believe me, his talk was much better than my attempt to record it. ]

Currently, the journal article still replicates the paper artifact. With digital articles, we are trying to emulate the same relationship we had to print articles. A scholar from the 1950s would still recognize an online article as a journal article.

A new generation of graduate students builds personal libraries on laptops. You can graduate with all the literature you used (or didn’t use) on a laptop. Instead of gleaning knowledge through osmosis from the piles of paper in your office, now osmosis happens from the multitude of files on your laptop. We need better search and discovery software for PCs. Librarians need to understand that faculty and graduate students are not just downloading articles; they are building personal databases and libraries. How can we help them manage these?

At first, Google search was disconnected from most of the scholarly literature. Google Scholar was a wake-up call on access to scholarly literature. Suddenly, it was easier. Suddenly you could perform computations on the literature, not just in the lab or on formulas.

Because of the high payoffs for “killer apps” in Big Pharma and biotech, there’s serious motivation to develop effective data mining tools in the life sciences, and this is the discipline where much development of data mining techniques is happening.

Data mining: gathering up big chunks of literature and performing computations on it.

Researchers combine the data mined from the published literature with their own data, or data from a gazetteer, or MESH terms, or atlases of genes, etc. Or they mine for specific formulas or algorithms, then perform their own computations on the newly mined data. There are researchers actively at work building effective combining tools.

Data mining is usually explicitly prohibited in publisher licenses. However, in most cases, publishers have been pretty cooperative once they are convinced that it’s a single faculty member conducting his or her research. Still, there are implications for the legal uses of scholarly literature and licensing.

Legal status on data mining.
There is very little case law, and it’s highly speculative at this point. There’s a notion in current copyright law: derivative works. Derivative works include translations, summaries, and extracts. Some derivative works are very mechanical: the first five pages, for example. A translation is more of an individual contribution, and a second layer of copyright is introduced.


Summaries are usually thought of as the property of the summary writer. The summary writer rephrases and restates the original thoughts. So, is a computation on an existing body of text a derivative work? If you apply a new computational formula to existing published data, is that derivative?

Say you perform a computation on a corpus of 50,000 articles and you get Result A. Then, remove 500 articles from the original set. If you rerun the computation and the result is the same as Result A, are those 500 removed articles implicated? Are they relevant to the original results? [msp: I don’t think I captured all the nuances of this argument.]
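The thought experiment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four-article corpus, the article IDs, and the helper names `top_terms` and `result_depends_on` are all hypothetical, and "most frequent words" stands in for whatever computation a researcher might actually run.

```python
from collections import Counter

def top_terms(corpus, k=3):
    """Toy 'computation': the k most frequent words across all articles."""
    counts = Counter()
    for text in corpus.values():
        counts.update(text.lower().split())
    # Sort by count (descending), then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]

def result_depends_on(corpus, removed_ids, compute=top_terms):
    """Return True if removing the given articles changes the result."""
    reduced = {k: v for k, v in corpus.items() if k not in removed_ids}
    return compute(corpus) != compute(reduced)

# Hypothetical miniature corpus standing in for the 50,000 articles.
corpus = {
    "a1": "gene expression in yeast",
    "a2": "gene regulation in yeast",
    "a3": "protein folding",
    "a4": "gene expression in mice",
}

print(top_terms(corpus))                        # result on the full corpus
print(result_depends_on(corpus, {"a3"}))        # does dropping a3 change it?
print(result_depends_on(corpus, {"a1", "a4"}))  # does dropping a1 and a4 change it?
```

A check like this shows only whether the output changed. It says nothing about the legal question of whether the removed articles are "implicated" in the result.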

Legal status of data mining: there are huge legal questions to resolve. Data mining is not viewed as research. Data mining is invisible, unless you observe/witness the application of a data mining process.

Should we change the nature of the scholarly literature to make text/data mining easier?

A lot of computational work is required to disambiguate text. How do you extract people, places, organizations? King John... which one? Manhattan... which one? Some disambiguation is done through context, some through basic lookups in reference resources. In the sciences there are some basic items such as chemical compounds, gene names, or species that have existing taxonomies, and new coding systems are being created. [msp: e.g., CML, Chemical Markup Language. Peter Murray-Rust]
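As a rough illustration of disambiguation through context plus a lookup in a reference resource, here is a minimal Python sketch. The tiny `GAZETTEER`, its cue words, the entry IDs, and the `disambiguate` function are assumptions made up for this example. Real systems use far richer resources and statistical models.

```python
# Hypothetical miniature gazetteer: each ambiguous name maps to candidate
# entries, each with a few context words that hint at that reading.
GAZETTEER = {
    "Manhattan": [
        {"id": "manhattan-ny", "cues": {"york", "borough", "hudson"}},
        {"id": "manhattan-ks", "cues": {"kansas", "prairie", "riley"}},
    ],
}

def disambiguate(mention, sentence, gazetteer=GAZETTEER):
    """Pick the candidate whose cue words overlap most with the sentence.

    Returns the candidate id, or None when the name is unknown or the
    context does not single out one candidate.
    """
    candidates = gazetteer.get(mention, [])
    if not candidates:
        return None
    words = {w.strip(".,;:!?").lower() for w in sentence.split()}
    scored = [(len(c["cues"] & words), c) for c in candidates]
    best_score = max(score for score, _ in scored)
    winners = [c for score, c in scored if score == best_score]
    if best_score == 0 or len(winners) > 1:
        return None  # context gives no clear answer
    return winners[0]["id"]

print(disambiguate("Manhattan", "The print shop stood near the Hudson in New York."))
print(disambiguate("Manhattan", "Manhattan, Kansas, sits on the prairie."))
print(disambiguate("Manhattan", "She moved to Manhattan."))
```

The last call returns `None` because the sentence gives no usable context. Explicit discipline-specific markup from the author or publisher would remove that kind of ambiguity.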

Data should be integrated with published articles to facilitate data mining, but it only works if it’s supplemented with discipline-specific tagging (aka markup). If specialized markup is present, you can be confident that the context is recognized.

Currently, very little markup is provided by authors. Authors typically provide Word files or camera-ready copy. Publishers supply XML markup. Publishers might provide markup to, say, Portico, but the markup files are not generally available.

The most important value-add that STM publishers provide is editorial markup. Can we rely on authors to provide markup in an open access environment?

The big question is how to construct scholarly literature for new forms of analysis. Who will perform the value-added service of providing context?
[msp: The big question facing open access is not whether it should or should not replace current publishing models. Who will format the literature to optimize data mining? Who will provide the markup?]

[msp: How different is data mining from traditional research? The literal definition of re-search is just that: searching the literature, searching the known body of knowledge, again and again. Data mining can quickly perform searches and analyses that could take a scholar a lifetime. Other than speed and quantity, however, is there an essential difference? ]

How is data mining different from traditional research? The scale is different; it’s beyond human scale. Data mining allows meta-analysis across disciplines. The “context” that we are adding to the literature through taxonomies and tagging schemes is to help machines do what humans do. Humans bring a spark of insight to the research process. Humans make rapid connections and create new relationships among ideas.

What's ahead for scholarly communication

[n.b., this entry is a mash-up of all presentations at this session.]
  • Greg Tananbaum – Consultant
  • James Mullins – Dean of Libraries, Purdue University
  • Ian Russell – Chief Executive, Association of Learned & Professional Society Publishers
Changes in scholarly communication
  • Integration of data in primary literature.
  • Preservation of digital archive.
  • Government intervention – unfunded mandates.
The need for authoritative scholarly literature that’s been vetted and peer reviewed is as great as ever. Cf., the endangered Northwest tree octopus.

The Dragon Economy: the next scientific revolution is coming from the East.

In China, R&D spending has tripled in the last 10 years. 1.23% of the GDP is spent on R&D. This is a huge amount, comparatively, and the Chinese government intends to invest 2.5% of the GDP by 2020. The Chinese are making a significant investment in the infrastructure. In the US, research funding is harder to come by.

Researchers trained in the States and other Western countries are returning to China, because of better funding, better salaries, and post-9/11 immigration problems.
Researchers and academics are revered in China. These scholars are bringing teaching and research methods from the West. Returning academics are forcing higher standards for research and publication.

University enrollment tripled between 1995 and 2003. China's current world share of all science papers is 7%.

English is compulsory in high school. There are 110 million Chinese studying English, compared to 50,000 Americans studying Chinese.

This article from JCB, Wells, 2007: 176:4 376-401 was referenced. It provides a good overview of the current research environment in China.


Espresso Book Publishing Machine

http://www.libraryjournal.com/article/CA6469274.html

Any book published in the last five years will never be out of print. Why buy and store books? The Espresso printer downloads and prints bound copies in 3 minutes. The copies are not archival quality, but so what? The archival copy is the digital version. New York Public Library has purchased an Espresso, and NYU is experimenting. At $20,000 it’s not out of reach. Will Google be the shared digital repository?

Librarians and the Research Process

Librarians traditionally consider the research process complete after examining tertiary or secondary sources. According to researchers, this is no longer adequate. The research article itself provides the best metadata to re-search the literature. Researchers are finding ways to automate the process of taking new terminology and language from newly published articles and using it to search the literature.
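One way to picture that automation is the Python sketch below. It is an assumption-laden toy, not a description of any researcher's actual tool: the stopword list, the "known vocabulary", the sample sentence, and the functions `new_terms` and `build_query` are all hypothetical.

```python
import re

# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "for", "we", "that", "with", "on"}

def new_terms(article_text, known_vocabulary):
    """Return words in the article that are not already in the known vocabulary.

    Order of first appearance is preserved and duplicates are dropped.
    """
    tokens = re.findall(r"[a-z][a-z\-]+", article_text.lower())
    seen = set()
    terms = []
    for token in tokens:
        if token in STOPWORDS or token in known_vocabulary or token in seen:
            continue
        seen.add(token)
        terms.append(token)
    return terms

def build_query(terms):
    """Join the terms into a simple OR query string for a literature search."""
    return " OR ".join(f'"{t}"' for t in terms)

# Hypothetical vocabulary already covered by earlier searches.
known = {"report", "new", "data", "plants", "describe", "pipeline"}
text = "We report new ionomics data for plants and describe a phenotyping pipeline."

terms = new_terms(text, known)
print(terms)
print(build_query(terms))
```

The resulting query string could then be passed to whatever search interface is available, so that the vocabulary of each newly published article feeds the next round of searching.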

Research in progress should be archived and preserved. Librarians should insert themselves further upstream in the research process. Don’t wait for the published article. Archive and preserve digital data produced in the research process.

Projects to Preserve Digital Data

Purdue Ionomics Information Management System
An information flow and data storage project funded by NSF. This is what a library is!

Sustainable Digital Data Preservation and Access Network Partners (DataNet)

The new role of librarians is to help researchers organize data. The library must insert itself into the research process before publication.

Thoughts on Unweaving the Web

  • Deborah E. Wiley – Senior Vice President, Corporate Communications, John Wiley & Sons, Inc.

[n.b., Just a few nuggets from Ms. Wiley's talk]

2007 was Wiley's bicentennial year. Ms. Wiley is part of the sixth generation to manage the company, and there is a seventh generation waiting in the wings. In 1807, Charles Wiley founded a print shop in lower Manhattan. At the time, he supplied books to private libraries and scholars.

Today, 50% of sales are outside the US.

What is the most over-discussed scholarly communication issue? OPEN ACCESS! Ms. Wiley observed that there is much repetition of principles and declarations. Not enough attention is focused on the business model, and scholars themselves are not that interested.

Today journals are records of research. Tomorrow, the record could be the conversation about research, or the research process itself.