CrossRef Text and Data Mining

Rachael Lammey
Product Manager, CrossRef
28 October 2014

Not-for-profit association of scholarly publishers
All subjects, all business models
4,000+ organizations from all over the world
83 non-publisher affiliates, 2000 library affiliates
68 million content items

1. User clicks on CrossRef DOI reference link in Journal A:
Tani, N., N. Tomaru, M. Araki, AND K. Ohba. 1996. Genetic diversity and
differentiation in populations of Japanese stone pine (Pinus pumila) in
Japan. Canadian Journal of Forest Research 26: 1454–1462. [CrossRef]
2. DOI directory returns URL
3. User accesses cited article in Journal B

Services
• Cross-publisher reference linking
• Cross-publisher Cited-by linking
• Cross-publisher metadata feeds
• Cross-publisher plagiarism screening (powered by iThenticate)
• Cross-publisher update identification
• Cross-publisher funder identification
• Cross-publisher text and data mining

A Text and Data Mining Hub for Researchers

What is text and data mining?
Text Mining is an interdisciplinary field combining
techniques from linguistics, computer science and
statistics to build tools that can efficiently retrieve
and extract information from digital text.
http://blogs.plos.org/everyone/2013/04/17/announcing-the-plos-text-mining-collection/
It uses powerful computers to find links between
drugs and side effects, or genes and diseases, that
are hidden within the vast scientific literature.
These are discoveries that a person scouring
through papers one by one may never notice.
http://www.theguardian.com/science/2012/may/23/text-mining-research-tool-forbidden

http://www.jisc.ac.uk/media/documents/publications/textminingbp_rtf.rtf
Marc Weeber and colleagues used automated text mining tools to infer that the
drug thalidomide could treat several diseases it had not been associated with
before. Thalidomide was taken off the market 40 years ago, but is still the subject of
research because it seems to benefit leprosy patients via their immune systems.
Weeber and Grietje Molema, an immunologist, used text mining tools to search the
literature for papers on thalidomide and then pick out those containing concepts
related to immunology. One concept, concerning thalidomide’s ability to inhibit
Interleukin-12 (IL-12), a chemical involved in the launch of an immune response,
struck Molema as particularly interesting. A second automated search for diseases
that improve when the action of IL-12 is blocked, revealed several not previously
linked with thalidomide, including chronic hepatitis, myasthenia gravis and a type of
gastritis.
“Type in thalidomide and you get 2-3000 hits. Type in disease and you get 40,000
hits. With automated text mining tools we only had to read 100-200 abstracts and
20 or 30 full papers. We’ve created hypotheses for others to follow up” says
Weeber.
Weeber et al. J Am Med Inform Assoc. 2003; 10: 252-259.

http://www.forbes.com/sites/stevensalzberg/2014/03/23/why-google-flu-is-a-failure/

Why?
• Researchers find it impractical to negotiate multiple
bilateral agreements with hundreds of subscription-
based publishers in order to authorize TDM of
subscribed content.
• Subscription-based publishers find it impractical to
negotiate multiple bilateral agreements with thousands
of researchers and institutions in order to authorize TDM
of subscribed content.
• All parties would benefit from support of standard APIs
and data representations in order to enable TDM across
both open access and subscription-based publishers.

* Chinese Geoscience Union * Chinese Institute Of
Automation Engineers (Ciae) * Chinese Journal Of
Mechanical Engineering * Chinese Mathematical Society *
Chinese Physical Society * Chinese Physiological Society *
Chinese Society Of Theoretical And Applied Mechanics *
Chonnam National University Medical School (Kamje) *
Christ University Bangalore * Cic Edizioni Internazionali *
Cig Media Group * Cilip Information Literacy Group *
Civil-Comp, Ltd. * Claremont Colleges Library * Classical
Association Of The Middle West And South, Inc. (Camws)
* Clawar Association Limited * Clay Minerals Society *
Cleo Revues.Org * Cleveland Clinic Journal Of Medicine *
Clinical Autonomic Research Society * Clinical Laboratory
Publications * Clinics Cardive Publishing * Clockss Archive
* Cnps * Cnrs France * Cnu Journal Of Agricultural Science

Why use the DOI?
Using the DOI as the basis for a common text and data mining
API provides several benefits. For example, the DOI provides:
• An easy way to de-duplicate documents that may be found on
several sites.
• Persistent provenance information.
• An easy way to document, share and compare corpora without
having to exchange the actual documents.
• A mechanism to ensure the reproducibility of TDM results using
the source documents.
• A mechanism to track the impact of updates, corrections,
retractions and withdrawals on corpora.
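
As an illustration of the de-duplication benefit, here is a minimal sketch in Python. It assumes a corpus is held as (DOI, source) pairs; the function names and the sample corpus are hypothetical and not part of any CrossRef tool. DOIs are compared case-insensitively, and common resolver prefixes are stripped first.

```python
# Resolver and scheme prefixes that may precede a bare DOI.
DOI_PREFIXES = (
    "http://dx.doi.org/",
    "https://dx.doi.org/",
    "http://doi.org/",
    "https://doi.org/",
    "doi:",
)


def normalize_doi(doi: str) -> str:
    """Reduce a DOI or DOI URL to a bare, lower-cased DOI string."""
    doi = doi.strip()
    lowered = doi.lower()
    for prefix in DOI_PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.lower()


def dedupe_by_doi(documents):
    """Keep the first source seen for each DOI.

    `documents` is an iterable of (doi, source) pairs; the result maps
    each normalized DOI to the first source it was found on.
    """
    seen = {}
    for doi, source in documents:
        seen.setdefault(normalize_doi(doi), source)
    return seen


# Hypothetical corpus: the same article found on two sites, plus one other.
corpus = [
    ("http://dx.doi.org/10.5555/12345678", "publisher-site"),
    ("doi:10.5555/12345678", "repository-copy"),
    ("10.1155/2014/969265", "publisher-site"),
]
unique = dedupe_by_doi(corpus)
```

Because the result is keyed by DOI alone, two groups could compare or share such a corpus by exchanging the keys rather than the documents.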

http://dx.doi.org/10.5555/12345678
(Accept: text/html)

http://dx.doi.org/10.5555/12345678
(Accept: application/bibjson+json)
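
The two requests above differ only in their Accept header, which is the essence of content negotiation. A minimal sketch in Python, using only the standard library: the helper name `build_doi_request` is hypothetical, the example DOI is written in the usual prefix/suffix form, and the requests are built but not sent.

```python
from urllib.request import Request

DOI_RESOLVER = "http://dx.doi.org/"  # resolver host shown on the slides


def build_doi_request(doi: str, media_type: str) -> Request:
    """Build (but do not send) a GET request for a DOI.

    The Accept header tells the resolver which representation is wanted.
    """
    return Request(DOI_RESOLVER + doi, headers={"Accept": media_type})


# Same DOI URL, two different representations requested.
html_req = build_doi_request("10.5555/12345678", "text/html")
json_req = build_doi_request("10.5555/12345678", "application/bibjson+json")

print(html_req.full_url, html_req.get_header("Accept"))
print(json_req.full_url, json_req.get_header("Accept"))
```

To actually send a request, pass it to `urllib.request.urlopen`; what comes back depends on which media types the resolver and publisher support for that DOI.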

CrossRef TDM
HTTP Headers
CR-TDM-Rate-Limit: 1500
(the rate limit ceiling per window on requests)
CR-TDM-Rate-Limit-Remaining: 1387
(number of requests left for the current window)
CR-TDM-Rate-Limit-Reset: 1378072800
(the time, in UTC epoch seconds, at which the rate limit
resets and a new window is started)
*this is a technique used by many APIs, including Twitter’s
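
A TDM client could use these headers to pace itself. The sketch below is one possible approach, not a CrossRef-provided client: the function name is hypothetical, and it assumes that CR-TDM-Rate-Limit-Reset carries an absolute UTC epoch timestamp, as the sample value suggests.

```python
import time
from typing import Mapping, Optional


def seconds_to_wait(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how long to pause before the next request (0.0 if no pause is needed).

    `headers` is a plain mapping of response header names to values.
    Lookups are case-sensitive here, so names must match exactly.
    """
    remaining = headers.get("CR-TDM-Rate-Limit-Remaining")
    reset = headers.get("CR-TDM-Rate-Limit-Reset")
    if remaining is None or reset is None:
        return 0.0  # the headers are optional; absent means no guidance
    if int(remaining) > 0:
        return 0.0  # requests are still available in this window
    now = time.time() if now is None else now
    # Assumption: the reset value is an absolute time in UTC epoch seconds.
    return max(0.0, float(reset) - now)


sample = {
    "CR-TDM-Rate-Limit": "1500",
    "CR-TDM-Rate-Limit-Remaining": "1387",
    "CR-TDM-Rate-Limit-Reset": "1378072800",
}
pause = seconds_to_wait(sample)
```

With the sample values above, requests remain in the current window, so no pause is needed; the function only waits once the remaining count reaches zero.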

Common API Summary
• Content Negotiation (Required)
• New Metadata (Required)
• Full text URIs
• License URIs
• Rate Limiting Headers (optional)

1. Full Text Link
https://apps.crossref.org/docs/tdm/full-text-uris-technical-details/

2. License Information
https://apps.crossref.org/docs/tdm/license-uris-technical-details/

Example from Hindawi
<ai:program name="AccessIndicators">
  <ai:license_ref>http://creativecommons.org/licenses/by/3.0/</ai:license_ref>
</ai:program>
<doi_data>
  <doi>10.1155/2014/969265</doi>
  <timestamp>20140401090031</timestamp>
  <resource>http://www.hindawi.com/journals/aaa/2014/969265/</resource>
  <collection property="text-mining">
    <item>
      <resource mime_type="application/pdf">
        http://downloads.hindawi.com/journals/aaa/2014/969265.pdf
      </resource>
    </item>
    <item>
      <resource mime_type="application/xml">
        http://downloads.hindawi.com/journals/aaa/2014/969265.xml
      </resource>
    </item>
  </collection>
</doi_data>
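
A TDM tool could read the license and full-text links out of metadata like this along the following lines. This is a sketch under stated assumptions: the `<record>` wrapper element and the function name are hypothetical, and the namespace URI bound to the `ai:` prefix is assumed, since the slide does not show it.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI for the ai: prefix (not shown on the slide).
AI_NS = "http://www.crossref.org/AccessIndicators.xsd"

# The Hindawi fragment, wrapped in a hypothetical <record> root so that
# it parses as a single well-formed document.
SAMPLE = f"""<record xmlns:ai="{AI_NS}">
  <ai:program name="AccessIndicators">
    <ai:license_ref>http://creativecommons.org/licenses/by/3.0/</ai:license_ref>
  </ai:program>
  <doi_data>
    <doi>10.1155/2014/969265</doi>
    <collection property="text-mining">
      <item>
        <resource mime_type="application/pdf">
          http://downloads.hindawi.com/journals/aaa/2014/969265.pdf
        </resource>
      </item>
      <item>
        <resource mime_type="application/xml">
          http://downloads.hindawi.com/journals/aaa/2014/969265.xml
        </resource>
      </item>
    </collection>
  </doi_data>
</record>"""


def extract_tdm_info(xml_text: str):
    """Return (license URIs, {mime type: full-text URI}) from a metadata record."""
    root = ET.fromstring(xml_text)
    licenses = [el.text.strip() for el in root.iter(f"{{{AI_NS}}}license_ref")]
    links = {}
    for collection in root.iter("collection"):
        if collection.get("property") != "text-mining":
            continue  # only the text-mining collection holds TDM links
        for resource in collection.iter("resource"):
            links[resource.get("mime_type")] = resource.text.strip()
    return licenses, links


licenses, links = extract_tdm_info(SAMPLE)
```

A tool would typically check the license URIs against the licenses it is allowed to mine under before fetching any of the full-text links.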

Stop here if
• You are an open access publisher
• You include TDM as a part of
your subscription license/T&Cs.

Click-Through
Service
(Optional)

1. Researcher queries DOI using content negotiation (CN) + API token
2. Publisher verifies API token
3. If token verified AND access control allows, publisher returns full text
(frequency at publisher discretion)
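
The publisher-side decision in this flow can be sketched as below. Everything here is illustrative: the header name, the status codes and the in-memory sets of tokens and entitled DOIs are assumptions standing in for a publisher's real token verification and access control.

```python
from typing import Mapping, Set

# Assumption: name of the request header that carries the researcher's API token.
TOKEN_HEADER = "CR-Clickthrough-Client-Token"


def respond_to_tdm_request(
    headers: Mapping[str, str],
    doi: str,
    verified_tokens: Set[str],
    entitled_dois: Set[str],
) -> int:
    """Return an HTTP status code for a full-text request (illustrative choice of codes).

    `verified_tokens` stands in for the publisher's token verification step,
    and `entitled_dois` stands in for its existing access control.
    """
    token = headers.get(TOKEN_HEADER)
    if token is None or token not in verified_tokens:
        return 401  # token missing or not verified
    if doi not in entitled_dois:
        return 403  # token is fine, but access control does not allow this content
    return 200  # token verified AND access control allows: return full text


status = respond_to_tdm_request(
    {TOKEN_HEADER: "example-token"},
    "10.5555/12345678",
    verified_tokens={"example-token"},
    entitled_dois={"10.5555/12345678"},
)
```

Both checks must pass before full text is returned, which mirrors the "token verified AND access control allows" condition above.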

Benefits
• Streamlines researcher access to distributed
full text for TDM
• Enables machine-to-machine, automated
access for recognized TDM (i.e. researchers won’t be
locked out of publisher sites)
• Enables article-level licensing info and easy
mechanism for supplemental T&Cs for text
and data mining (publishers discussing
model license via STM)

What do researchers, publishers and tools developers need to do?

Publishers
There are two additional metadata elements that publishers will
need to deposit to support TDM via CrossRef. These are:
•Full Text URIs: One or more URIs that point to full text
representations of the content identified by your CrossRef DOIs.
•License URIs: One or more URIs pointing at licenses that govern
how the full text content can be used.
•OPTIONAL: Add publisher TDM terms and conditions to the
click-through service

Researchers
• Modify TDM tools to make use of the API token
• Modify TDM tools to look for <license_ref>
elements
• Register with the click-through service and
accept/decline licenses (if applicable)

http://tdmsupport.crossref.org/

Progress to date
• DOI content negotiation
• CrossRef support for recording links to full text
• CrossRef metadata support for:
• ORCIDs
• FundRef
• License information
• CrossRef Metadata Search for Discovery:
http://search.labs.crossref.org/
• Click-through license service
• Publisher API for verifying and managing tokens
• Launched as live service 29 May 2014

Publishers
Articles with full-text links and license information deposited:
998,416
Cost? Free to researchers and the public
No cost for publishers through 2014, 2015 tbc
Register interest at:
http://www.crossref.org/tdm/contact_form.html

Usable as is:
https://blogs.nd.edu/emorgan/

www.crossref.org
http://www.crossref.org/tdm/index.html
tdm@crossref.org
