- Clifford Lynch - Executive Director, Coalition for Networked Information
Currently, the journal article still replicates the paper artifact. With digital articles, we are trying to emulate the same relationship we had with print articles. A scholar from the 1950s would still recognize an online article as a journal article.
A new generation of graduate students builds personal libraries on laptops. You can graduate with all the literature you used (or didn’t use) on a laptop. Instead of gleaning knowledge by osmosis from the piles of paper in your office, osmosis now happens from the multitude of files on your laptop. We need better search and discovery software for PCs. Librarians need to understand that faculty and graduate students are not just downloading articles; they are building personal databases and libraries. How can we help them manage these?
At first, Google search was disconnected from most of the scholarly literature. Google Scholar was a wake-up call on access to scholarly literature. Suddenly, access was easier. Suddenly you could perform computations on the literature itself, not just in the lab or on formulas.
Because of the high payoffs for “killer apps” in Big Pharma and biotech, there’s serious motivation to develop effective data mining tools in the life sciences, and this is the discipline where much development of data mining techniques is happening.
Data mining: gathering up big chunks of literature and performing computations on them.
Researchers combine the data mined from the published literature with their own data, or data from a gazetteer, or MeSH terms, or atlases of genes, etc. Or they mine for specific formulas or algorithms, then perform their own computations on the newly mined data. There are researchers actively at work building effective combining tools.
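[msp: A toy sketch of what such a "combining tool" might do: mine a corpus for known terms, then join the hits with an external reference resource. Every name, article, and data value below is invented for illustration.]

```python
# Hypothetical sketch: mine a tiny "corpus" for known gene symbols, then
# combine the mined counts with a reference table (a gazetteer-style lookup).
# The corpus, gene symbols, and atlas entries are all invented.

from collections import Counter

corpus = [
    "Expression of BRCA1 was elevated in the sampled tissue.",
    "We observed no change in TP53 or BRCA1 levels.",
    "TP53 mutations were common in the cohort.",
]

# Toy reference resource mapping gene symbols to chromosome locations.
gene_atlas = {"BRCA1": "chr17", "TP53": "chr17"}

# Step 1: "mine" the text -- count occurrences of known gene symbols.
hits = Counter()
for article in corpus:
    for token in article.replace(".", "").replace(",", "").split():
        if token in gene_atlas:
            hits[token] += 1

# Step 2: combine the mined counts with the reference data.
combined = {gene: (count, gene_atlas[gene]) for gene, count in hits.items()}
print(combined)  # {'BRCA1': (2, 'chr17'), 'TP53': (2, 'chr17')}
```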
Data mining is usually explicitly prohibited in publisher licenses. However, in most cases, publishers have been pretty cooperative once they are convinced that it’s a single faculty member conducting his or her research. Still, there are implications for the legal uses of scholarly literature and licensing.
Legal status of data mining.
There is very little case law, and it’s highly speculative at this point. There’s a notion in current copyright law: derivative works. Derivative works include translations, summaries, and extracts. Some derivative works are very mechanical: the first five pages, for example. A translation involves more individual contribution, and a second layer of copyright is introduced.
Summaries are usually considered the property of the summary writer, who rephrases and restates the original thoughts. So, is a computation on an existing body of text a derivative work? If you apply a new computational formula to existing published data, is the result derivative?
Say you perform a computation on a corpus of 50,000 articles and get Result A. Then remove 500 articles from the original set. If you rerun the computation and the result is the same as Result A, are those 500 removed articles implicated? Are they relevant to the original result? [msp: I don’t think I captured all the nuances of this argument.]
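[msp: The ablation question above can be made concrete with a toy run: compute something over a corpus, drop a subset, and rerun. The corpus and the "computation" here are invented stand-ins.]

```python
# Toy illustration of the ablation question: does removing articles from the
# corpus change the result of the computation? All data here is invented.

def most_common_term(corpus):
    # The "computation": return the single most frequent word in the corpus.
    counts = {}
    for doc in corpus:
        for word in doc.split():
            counts[word] = counts.get(word, 0) + 1
    return max(counts, key=counts.get)

corpus = ["gene therapy trial", "gene expression", "protein folding", "gene map"]
result_a = most_common_term(corpus)   # computed on the full corpus

ablated = corpus[:-1]                 # remove some articles and rerun
result_b = most_common_term(ablated)

# If result_b equals result_a, were the removed articles "implicated" in
# producing Result A at all? That is the open legal question.
print(result_a, result_b, result_a == result_b)
```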
Legal status of data mining: there are huge legal questions to resolve. Data mining is not viewed as research, and it is invisible unless you observe or witness the application of a data mining process.
Should we change the nature of the scholarly literature to make text/data mining easier?
A lot of computational work is required to disambiguate text. How do you extract people, places, and organizations? King John… which one? Manhattan… which one? Some disambiguation is done through context, some through basic lookups in reference resources. In the sciences, some basic items such as chemical compounds, gene names, or species have existing taxonomies, and new coding systems are being created. [msp: e.g., CML, the Chemical Markup Language. Peter Murray-Rust]
Data should be integrated with published articles to facilitate data mining, but this only works if it is supplemented with discipline-specific tagging (i.e., markup). If specialized markup is present, you can be confident that the context is recognized.
Currently, very little markup is provided by authors. Authors typically provide Word files or camera-ready copy; publishers supply the XML markup. Publishers might deposit marked-up files with, say, Portico, but the markup files are not generally available.
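[msp: A sketch of why discipline-specific markup matters for mining. The element names below are invented, not a real publisher schema; the point is that with such tags, extraction becomes a simple query instead of a disambiguation problem.]

```python
# If publisher XML carried discipline-specific tags (invented schema below),
# extracting compounds and genes is a trivial tree query.

import xml.etree.ElementTree as ET

article = """
<article>
  <title>Binding studies of aspirin</title>
  <body>
    We synthesized <compound formula="C9H8O4">aspirin</compound> and measured
    its interaction with <gene symbol="PTGS2">COX-2</gene>.
  </body>
</article>
"""

root = ET.fromstring(article)
compounds = [(c.text, c.get("formula")) for c in root.iter("compound")]
genes = [(g.text, g.get("symbol")) for g in root.iter("gene")]
print(compounds)  # [('aspirin', 'C9H8O4')]
print(genes)      # [('COX-2', 'PTGS2')]
```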
The most important value-add that STM publishers provide is editorial markup. Can we rely on authors to provide markup in an open access environment?
The big question is how to construct scholarly literature for new forms of analysis. Who will perform the value-added service of providing context? [msp: The big question facing open access is not whether it should or should not replace current publishing models, but who will format the literature to optimize data mining. Who will provide the markup?]
[msp: How different is data mining from traditional research? The literal definition of re-search is just that: searching the literature, searching the known body of knowledge, again and again. Data mining can quickly perform searches and analyses that could take a scholar a lifetime. Other than speed and quantity, however, is there an essential difference?]
How is data mining different from traditional research? The scale is different; it’s beyond human scale. Data mining allows meta-analysis across disciplines. The “context” that we are adding to the literature through taxonomies and tagging schemes is meant to help machines do what humans do. Humans bring a spark of insight to the research process. Humans make rapid connections and create new relationships among ideas.