Crossref for Text &
Data Mining
Rachael Lammey
Product Manager, CrossRef
December 2015

Introduction to CrossRef Text and Data Mining Webinar


Introduction to CrossRef Text and Data Mining Webinar held on December 10, 2015. Presented by Rachael Lammey.


  1. Crossref for Text & Data Mining
     Rachael Lammey, Product Manager, CrossRef
     December 2015
  2. About Crossref
     • Not-for-profit association of scholarly publishers
     • All subjects, all business models
     • 5,000+ organizations from all over the world
     • 83 non-publisher affiliates, 2,000 library affiliates
     • 76 million content items
  3. 10.1098/rstl.1665.0001
  4. User clicks on Crossref DOI reference link in Journal A:
     Tani, N., N. Tomaru, M. Araki, and K. Ohba. 1996. Genetic diversity and differentiation in populations of Japanese stone pine (Pinus pumila) in Japan. Canadian Journal of Forest Research 26: 1454–1462. [CrossRef]
     Crossref DOI directory returns URL.
     User accesses cited article in Journal B.
  5. 100,000,000
  6. Crossref Services
     • Cross-publisher reference linking
     • Cross-publisher Cited-by linking
     • Cross-publisher metadata feeds
     • Cross-publisher plagiarism screening
     • Cross-publisher update identification
     • Cross-publisher funder identification
     • Cross-publisher text and data mining
  7. Using Crossref for text mining
  8. What is text and data mining?
     Text Mining is an interdisciplinary field combining techniques from linguistics, computer science and statistics to build tools that can efficiently retrieve and extract information from digital text.
     http://blogs.plos.org/everyone/2013/04/17/announcing-the-plos-text-mining-collection/
     It uses powerful computers to find links between drugs and side effects, or genes and diseases, that are hidden within the vast scientific literature. These are discoveries that a person scouring through papers one by one may never notice.
     http://www.theguardian.com/science/2012/may/23/text-mining-research-tool-forbidden
  9. Marc Weeber and colleagues used automated text mining tools to infer that the drug thalidomide could treat several diseases it had not been associated with before. Thalidomide was taken off the market 40 years ago, but is still the subject of research because it seems to benefit leprosy patients via their immune systems. Weeber and Grietje Molema, an immunologist, used text mining tools to search the literature for papers on thalidomide and then pick out those containing concepts related to immunology. One concept, concerning thalidomide’s ability to inhibit Interleukin-12 (IL-12), a chemical involved in the launch of an immune response, struck Molema as particularly interesting. A second automated search, for diseases that improve when the action of IL-12 is blocked, revealed several not previously linked with thalidomide, including chronic hepatitis, myasthenia gravis and a type of gastritis.
     “Type in thalidomide and you get 2-3000 hits. Type in disease and you get 40,000 hits. With automated text mining tools we only had to read 100-200 abstracts and 20 or 30 full papers. We’ve created hypotheses for others to follow up,” says Weeber.
     Weeber et al. J Am Med Inform Assoc. 2003 10 252-259
     http://www.jisc.ac.uk/media/documents/publications/textminingbp_rtf.rtf
  10. http://www.forbes.com/sites/stevensalzberg/2014/03/23/why-google-flu-is-a-failure/
  11. Why?
      • Researchers find it impractical to negotiate multiple bilateral agreements with hundreds of subscription-based publishers in order to authorize TDM of subscribed content.
      • Subscription-based publishers find it impractical to negotiate multiple bilateral agreements with thousands of researchers and institutions in order to authorize TDM of subscribed content.
      • All parties would benefit from support of standard APIs and data representations in order to enable TDM across both open access and subscription-based publishers.
  12. Botanical Publishing Board * Fisheries Sciences.Com * Florida Entomological Society * Fondazione Annali Die Matematica Pura Ed Applicata * Fondazione Eni Enrico Mattei (Feem) * Fondazione Pro Herbario Mediterraneo * Food And Agriculture Organization Of The United Nations (Fao) * Food Safety Commission, Cabinet Office * Foot And Ankle Online Journal * Fordham University Press * Forest Products Society * Forschungsinstitut Freie Berufe * Forum: Carbohydrates Coming Of Age * Foundation Compositio Mathematica * Foundation For Cellular And Molecular Medicine * Foundation For Sickle Cell Disease Research * Foundation Of Computer Science * Franco Angeli * Fraunhofer-Institut Fur Materialfluss Und Logistik * French Chemistry Society * French Physical Society * French-Vietnamese Association Of Pulmonology
  13. Why use the DOI?
      Using the DOI as the basis for a common text and data mining API provides several benefits. For example, the DOI provides:
      • An easy way to de-duplicate documents that may be found on several sites.
      • Persistent provenance information.
      • An easy way to document, share and compare corpora without having to exchange the actual documents.
      • A mechanism to ensure the reproducibility of TDM results using the source documents.
      • A mechanism to track the impact of updates, corrections, retractions and withdrawals on corpora.
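The de-duplication benefit can be illustrated with a short Python sketch. It assumes that the same article may be harvested with its DOI written in different ways (bare, with a resolver prefix, in mixed case); the function names and the list of prefixes are hypothetical and not part of any Crossref tool.

```python
def normalize_doi(raw: str) -> str:
    """Return a bare, lower-cased DOI so equivalent spellings compare equal."""
    doi = raw.strip()
    # Strip common resolver prefixes (an assumed, non-exhaustive list).
    for prefix in ("https://doi.org/", "http://doi.org/",
                   "https://dx.doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # DOI names are case-insensitive, so fold case before comparing.
    return doi.lower()


def deduplicate(dois):
    """Keep the first occurrence of each DOI, preserving input order."""
    seen = set()
    unique = []
    for raw in dois:
        doi = normalize_doi(raw)
        if doi and doi not in seen:
            seen.add(doi)
            unique.append(doi)
    return unique


harvested = [
    "10.5555/12345678",
    "http://dx.doi.org/10.5555/12345678",  # same article, found on another site
    "doi:10.5555/12345678 ",
    "10.1155/12345682",
]
corpus = deduplicate(harvested)
```

Because the corpus is then just a list of DOIs, it can be shared or compared without exchanging the documents themselves.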
  14. The TDM Workflow
  15. Step 1: A researcher identifies the articles they are interested in.
      The search engines they use bring back results from lots of different publishers. They can also use Crossref to search.
  16. The searches they run bring back results showing publications from a range of publishers, in different locations and using different business models. The challenge is to harvest all these articles in order to be able to mine them, without engaging in individual transactions with each publisher.
  17. How to do that?
      Each of those articles has a DOI, or digital object identifier. Each DOI is unique and identifies the paper. Researchers are familiar with DOIs and are used to working with them.
  18. Step 2: The researcher takes the DOIs that correspond to the articles they are interested in.
      Search engines will allow them to download this as a list; the researcher does not need to go to each paper to extract the DOI from it:
      10.5555/12345678
      10.5556/12345679
      10.1016/12345680
      10.8080/12345681
      10.1155/12345682
      10.1100/12345683
      10.5555/12345684
      10.1007/12345685
      10.1111/12345686
      10.2406/12345687
      10.3994/12345688
      10.5006/12345689
  19. Step 3: The researcher gives this list to the Crossref REST API, which tells them:
      • Where the full text is located
      • What they are allowed to do with it
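As a rough illustration of this step, the Python sketch below builds a Crossref REST API URL for one DOI and pulls the full-text links and license URLs out of a parsed response. The sample record is hand-written for this example, shaped like the `message` object the API returns and filled with the Hindawi values shown elsewhere in this deck; it is not real API output, and the helper names are hypothetical.

```python
from urllib.parse import quote

API_ROOT = "https://api.crossref.org/works/"


def work_url(doi: str) -> str:
    """Build the REST API URL for one DOI (percent-encoding unsafe characters)."""
    return API_ROOT + quote(doi, safe="/")


def tdm_info(record: dict) -> dict:
    """Extract text-mining links and license URLs from a parsed API response."""
    message = record.get("message", {})
    links = [
        (link.get("content-type", "unspecified"), link["URL"])
        for link in message.get("link", [])
        if link.get("intended-application") == "text-mining"
    ]
    licenses = [lic["URL"] for lic in message.get("license", [])]
    return {"full_text": links, "licenses": licenses}


# Hand-written sample (an assumption), not a response fetched from the API.
SAMPLE_RECORD = {
    "status": "ok",
    "message": {
        "DOI": "10.1155/2014/969265",
        "license": [{"URL": "http://creativecommons.org/licenses/by/3.0/"}],
        "link": [
            {"URL": "http://downloads.hindawi.com/journals/aaa/2014/969265.pdf",
             "content-type": "application/pdf",
             "intended-application": "text-mining"},
            {"URL": "http://downloads.hindawi.com/journals/aaa/2014/969265.xml",
             "content-type": "application/xml",
             "intended-application": "text-mining"},
            {"URL": "http://example.org/other-purpose",  # filtered out below
             "content-type": "unspecified",
             "intended-application": "similarity-checking"},
        ],
    },
}

info = tdm_info(SAMPLE_RECORD)
```

In a real tool the record would come from fetching `work_url(doi)` for each DOI in the list and decoding the JSON body.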
  20. What are they allowed to do with it?
      This is communicated by license information that publishers give to Crossref. Some publishers ask researchers to agree to an additional license to be able to use their content for mining. Crossref TDM allows researchers to log in with their ORCID iD and to view and accept publisher licenses all in one place. Again, this saves multiple transactions on the part of the researcher. The publishers do not charge researchers for this, and Crossref does not charge researchers for the service.
  21. Step 4: The researcher uses that information to go directly to each publisher via Crossref. It is a central channel for them to visit thousands of publishers via one request or transaction. There they will be identified in one of several ways:
      • No identification (Open Access content)
      • IP recognition/log-in credentials
      • IP recognition/log-in credentials + Crossref token (API key) from the TDM service
  22. Step 5: The full text is then returned to the researcher, and they can use their tools to mine it.
  23. Researchers: Common API
  24. DOI Content Negotiation
  25. http://dx.doi.org/10.5555-12345678
      (Accept: text/html)
  26. http://dx.doi.org/10.5555-12345678
      (Accept: application/bibjson+json)
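Content negotiation means the same DOI URL is requested with different `Accept` headers to obtain different representations. The Python sketch below only constructs the two requests and does not send them; the helper name is hypothetical, and the DOI is written in its slash form, 10.5555/12345678, as it appears in the DOI list earlier in this deck.

```python
import urllib.request


def negotiated_request(doi: str, media_type: str) -> urllib.request.Request:
    """Build (but do not send) a DOI request asking for one media type."""
    url = "http://dx.doi.org/" + doi
    return urllib.request.Request(url, headers={"Accept": media_type})


# Same URL, two different representations requested.
html_req = negotiated_request("10.5555/12345678", "text/html")
json_req = negotiated_request("10.5555/12345678", "application/bibjson+json")

# Sending either request needs network access, for example:
#   with urllib.request.urlopen(json_req) as response:
#       body = response.read()
```

Which media types a given DOI actually supports depends on the registration agency and the publisher, so a tool should be prepared for an error response.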
  27. Rate Limiting (optional)
  28. Crossref TDM HTTP Headers
      CR-TDM-Rate-Limit: 1500 (the ceiling on requests per window)
      CR-TDM-Rate-Limit-Remaining: 1387 (the number of requests left in the current window)
      CR-TDM-Rate-Limit-Reset: 1378072800 (the time, in UTC epoch seconds, at which the rate limit resets and a new window starts)
      * This is a technique used by many APIs, including Twitter’s.
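A mining client can use these headers to pace itself. The Python sketch below is one possible reading, assuming the Reset value is an absolute UTC epoch timestamp (as the example value 1378072800 suggests) and that a publisher that does not send the optional headers imposes no wait. The function name is hypothetical.

```python
def seconds_until_allowed(headers: dict, now: float) -> float:
    """Return how long to wait before the next request, given response headers.

    `headers` maps header names to string values; `now` is the current time
    in UTC epoch seconds (for example, time.time()).
    """
    # The headers are optional, so fall back to "no limit reached".
    remaining = int(headers.get("CR-TDM-Rate-Limit-Remaining", 1))
    reset_at = float(headers.get("CR-TDM-Rate-Limit-Reset", 0))
    if remaining > 0:
        return 0.0
    # Window exhausted: wait until the reset time, never a negative duration.
    return max(0.0, reset_at - now)


slide_headers = {
    "CR-TDM-Rate-Limit": "1500",
    "CR-TDM-Rate-Limit-Remaining": "1387",
    "CR-TDM-Rate-Limit-Reset": "1378072800",
}
exhausted = dict(slide_headers, **{"CR-TDM-Rate-Limit-Remaining": "0"})

wait_now = seconds_until_allowed(slide_headers, now=1378072000)
wait_later = seconds_until_allowed(exhausted, now=1378072800 - 90)
```

Real HTTP header names are case-insensitive, so a production client should look them up through its HTTP library's header object rather than a plain dictionary.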
  29. Common API Summary
      • Content Negotiation (Required)
      • New Metadata (Required)
          • Full text URIs
          • License URIs
      • Rate Limiting Headers (optional)
  30. New metadata
  31. New metadata (1): Full-text links
      https://apps.crossref.org/docs/tdm/full-text-uris-technical-details/
  32. New metadata (2): License information
      https://apps.crossref.org/docs/tdm/license-uris-technical-details/
  33. Example: Hindawi
      <ai:program name="AccessIndicators">
        <ai:license_ref>http://creativecommons.org/licenses/by/3.0/</ai:license_ref>
      </ai:program>
      <doi_data>
        <doi>10.1155/2014/969265</doi>
        <timestamp>20140401090031</timestamp>
        <resource>http://www.hindawi.com/journals/aaa/2014/969265/</resource>
        <collection property="text-mining">
          <item>
            <resource mime_type="application/pdf">http://downloads.hindawi.com/journals/aaa/2014/969265.pdf</resource>
          </item>
          <item>
            <resource mime_type="application/xml">http://downloads.hindawi.com/journals/aaa/2014/969265.xml</resource>
          </item>
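To show how the two metadata elements fit together, the Python sketch below reads a deposit fragment like the Hindawi example and extracts the license URI and the text-mining links. Several things are assumptions made so the fragment parses on its own: the wrapping `<record>` element, the closing tags missing from the slide, and the `ai` namespace URI. Real deposits also place the other elements in a Crossref schema namespace, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "ai" (AccessIndicators) prefix.
AI_NS = "http://www.crossref.org/AccessIndicators.xsd"

# The slide's fragment, wrapped and closed so that it is well-formed XML.
SAMPLE = f"""
<record xmlns:ai="{AI_NS}">
  <ai:program name="AccessIndicators">
    <ai:license_ref>http://creativecommons.org/licenses/by/3.0/</ai:license_ref>
  </ai:program>
  <doi_data>
    <doi>10.1155/2014/969265</doi>
    <resource>http://www.hindawi.com/journals/aaa/2014/969265/</resource>
    <collection property="text-mining">
      <item>
        <resource mime_type="application/pdf">
          http://downloads.hindawi.com/journals/aaa/2014/969265.pdf
        </resource>
      </item>
      <item>
        <resource mime_type="application/xml">
          http://downloads.hindawi.com/journals/aaa/2014/969265.xml
        </resource>
      </item>
    </collection>
  </doi_data>
</record>
"""


def parse_deposit(xml_text: str):
    """Return (license URIs, {mime type: full-text URI}) from a deposit fragment."""
    root = ET.fromstring(xml_text)
    licenses = [el.text.strip() for el in root.iter(f"{{{AI_NS}}}license_ref")]
    links = {}
    for collection in root.iter("collection"):
        # Only the text-mining collection carries the TDM full-text links.
        if collection.get("property") != "text-mining":
            continue
        for resource in collection.iter("resource"):
            links[resource.get("mime_type")] = resource.text.strip()
    return licenses, links


licenses, links = parse_deposit(SAMPLE)
```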
  34. Stop here if
      • You are an open access publisher
      • You include TDM as a part of your subscription license/T&Cs.
  35. Click-through service (optional)
  36. Researcher View
  37. Publisher View
  38. Click-through flow:
      • Researcher queries DOI using content negotiation (CN) + API token
      • Publisher verifies API token
      • If token verified AND access control allows, publisher returns full text (frequency at publisher discretion)
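The exchange above can be sketched from both sides in Python. On the researcher side, the token travels as an extra HTTP header next to the `Accept` header; the header name used here, `CR-Clickthrough-Client-Token`, is not given on the slides and should be treated as an assumption to check against Crossref's documentation. On the publisher side, the decision function is a hypothetical model of the three identification cases described in the workflow, not any publisher's actual logic.

```python
from typing import Optional

# Assumed header name; verify against the click-through service documentation.
TOKEN_HEADER = "CR-Clickthrough-Client-Token"


def fulltext_headers(media_type: str, token: Optional[str] = None) -> dict:
    """Researcher side: headers for a full-text request, with optional token."""
    headers = {"Accept": media_type}
    if token:
        headers[TOKEN_HEADER] = token
    return headers


def may_return_full_text(open_access: bool,
                         has_access: bool,
                         needs_clickthrough: bool,
                         token_verified: bool) -> bool:
    """Publisher side: decide whether to return the full text."""
    if open_access:
        return True             # no identification needed
    if not has_access:
        return False            # IP recognition / log-in credentials failed
    if needs_clickthrough:
        return token_verified   # credentials + verified Crossref token
    return True                 # credentials alone are enough


headers = fulltext_headers("application/pdf", token="example-token")
```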
  39. Benefits
      • Streamlines researcher access to distributed full text for TDM
      • Enables machine-to-machine, automated access for recognized TDM (i.e. researchers won’t be locked out of publisher sites)
      • Enables article-level licensing info and easy mechanism for supplemental T&Cs for text and data mining (publishers discussing model license via STM)
  40. Implementation
  41. Publishers
      There are two additional metadata elements that publishers will need to deposit to support TDM via CrossRef. These are:
      • Full Text URIs: One or more URIs that point to full text representations of the content identified by your CrossRef DOIs.
      • License URIs: One or more URIs pointing at licenses that govern how the full text content can be used.
      In addition:
      • A .csv upload option is available to populate backfiles
      • OPTIONAL: Add publisher TDM terms and conditions to the click-through service
  42. Researchers
      • Modify TDM tools to make use of the API token
      • Modify TDM tools to look for <lic_ref> elements
      • Register with the click-through service and accept/decline licenses (if applicable)
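One way a TDM tool might act on license information is to compare each article's license URIs against a list the researcher has already reviewed. The Python sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, URIs are compared after ignoring scheme, case and trailing slash, and an article counts as minable if at least one of its licenses is on the accepted list. A stricter policy may be appropriate in practice.

```python
def normalize_license(uri: str) -> str:
    """Reduce a license URI to a comparable form (scheme, case, slash ignored)."""
    uri = uri.strip().lower()
    for scheme in ("https://", "http://"):
        if uri.startswith(scheme):
            uri = uri[len(scheme):]
            break
    return uri.rstrip("/")


def can_mine(license_uris, accepted_uris) -> bool:
    """True if any of the article's licenses is one the researcher accepts."""
    accepted = {normalize_license(u) for u in accepted_uris}
    return any(normalize_license(u) in accepted for u in license_uris)


ACCEPTED = ["https://creativecommons.org/licenses/by/3.0/"]

allowed = can_mine(["http://creativecommons.org/licenses/by/3.0/"], ACCEPTED)
unknown = can_mine(["http://example.org/publisher-terms"], ACCEPTED)
```

Articles whose licenses are not recognized can then be set aside for manual review or for the click-through service rather than being mined automatically.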
  43. http://tdmsupport.crossref.org/
  44. Publishers
      Articles with full-text links and license information deposited: 15 million from over 200 DOI prefixes
      Cost?
      • Free to researchers and the public
      • No cost for publishers for 2015
      Register interest at: http://www.crossref.org/tdm/contact_form.html
  45. Usable as is:
      https://blogs.nd.edu/emorgan/
      https://github.com/ropensci/rcrossref
  46. Thank you!
      www.crossref.org
      http://www.crossref.org/tdm/index.html
      tdm@crossref.org
