Many policies and projects now encourage investigators to share their raw research data with other scientists. Unfortunately, it is difficult to measure the effectiveness of these initiatives because data can be shared through so many mechanisms and in so many locations. We propose a novel approach to finding shared datasets: using natural language processing (NLP) techniques to identify declarations of dataset sharing within the full text of primary research articles. Using regular expression patterns and machine learning algorithms on open access biomedical literature, our system identified 61% of articles with shared datasets, with 80% precision. A simpler version of our classifier achieved higher recall (86%) but lower precision (49%). We believe our results demonstrate the feasibility of this approach and hope to inspire further study of dataset retrieval techniques and policy evaluation.
4. Our aim:
Identify research articles for which the authors
have shared their datasets
For this research:
sharing = submitted to centralized databases
13. 1. Through database citations:
When authors upload data to a database, they
have the opportunity to cite the paper that
describes the data collection
16. Unfortunately, the citation is often left blank
because the data is submitted before
the paper is published
17. 2. Through hyperlink urls in the text
Authors often reference their datasets within
their paper with a website url
19. But the meaning of the hyperlinks is ambiguous.
Sometimes they point to datasets that have been
accessed, rather than submitted.
30. Our aim:
Identify research articles for which the authors
have shared their raw datasets.
Proposed approach:
Develop a system to
identify statements of shared data
from an article’s full text.
31. Materials:
Full text from a subset of the open access literature
Database submission citations from five databases:
• Genbank
• Protein Data Bank
• Gene Expression Omnibus
• ArrayExpress
• Stanford Microarray Database
32. Our Gold standard:
An article was considered to have a “shared dataset” if
the article was cited within the primary submission
field of a database entry
(+ a small amount of manual screening to find additional
positives based on full text)
33. Approach:
For those articles that mention database names,
• Extract a 300-character window around every
mention of a database name
• Apply various mining algorithms to decide if
there is evidence that the authors deposited
data from this study in the database
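The excerpt-extraction step above can be sketched as follows. This is a minimal Python sketch, not the original implementation: the function name `extract_windows`, the case-insensitive matching, and the reading of "300-character window" as up to 150 characters on each side of the mention are all assumptions.

```python
import re

# Database names from the Materials slide. The exact spellings and
# abbreviations matched by the original system are not given (assumption).
DATABASE_NAMES = [
    "Genbank",
    "Protein Data Bank",
    "Gene Expression Omnibus",
    "ArrayExpress",
    "Stanford Microarray Database",
]

NAME_PATTERN = re.compile(
    "|".join(re.escape(name) for name in DATABASE_NAMES), re.IGNORECASE
)


def extract_windows(full_text, width=300):
    """Return one (matched name, excerpt) pair per database mention.

    Assumption: the window extends width // 2 characters to the left and
    right of the mention, clipped at the start and end of the text.
    """
    half = width // 2
    windows = []
    for match in NAME_PATTERN.finditer(full_text):
        start = max(0, match.start() - half)
        end = min(len(full_text), match.end() + half)
        windows.append((match.group(0), full_text[start:end]))
    return windows
```

Each excerpt can then be passed to whichever classifier is being evaluated, so an article that mentions two databases yields at least two excerpts.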
34. Results:
• queried 24 000 articles across 27 journals
• 25% of all open access articles mentioned one
of the database names (50% Genbank)
• development set of 4434 articles
training set of 2000
test set of 1028
35. True positives:
23% of the articles that mentioned a
database were cited from within a database
submission field
= evidence that article shared its data!
36. Three simple methods
for identifying sharing
Does the excerpt surrounding the database name
contain:
1. the word “accession”
2. an accession number
3. a URL
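The three simple checks can be sketched in a few lines of Python. The accession-number shapes below are illustrative assumptions (GEO-style, ArrayExpress-style, and GenBank-style identifiers), not the patterns used in the study, and PDB identifiers are deliberately left out; `simple_cues` is a hypothetical helper name.

```python
import re

# Illustrative accession-number shapes (assumptions, not the study's patterns):
#   GSE1234 / GSM1234 / GPL1234 / GDS1234   GEO-style
#   E-MEXP-123                              ArrayExpress-style
#   AF123456                                GenBank-style (1-2 letters + 5-6 digits)
ACCESSION_NUMBER = re.compile(
    r"\b(?:G(?:SE|SM|PL|DS)\d+|E-[A-Z]{4}-\d+|[A-Z]{1,2}\d{5,6})\b"
)
URL = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def simple_cues(excerpt):
    """Report which of the three simple cues occur in an excerpt."""
    return {
        "accession_word": "accession" in excerpt.lower(),
        "accession_number": ACCESSION_NUMBER.search(excerpt) is not None,
        "url": URL.search(excerpt) is not None,
    }
```

Each cue is treated as a separate classifier: an excerpt is flagged as evidence of sharing by method 1, 2, or 3 when the corresponding value is true.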
37. Two complex methods:
4. A manually-derived regular expression to
match lexical cues that suggest sharing
5. An automatically-derived bag of words
decision tree
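To make the bag-of-words idea concrete, here is a deliberately tiny Python sketch: a depth-one decision tree (a decision stump) that picks the single word whose presence or absence best separates positive from negative excerpts. The slides do not say which learning algorithm, tree depth, or features were used, so the functions `bag_of_words`, `fit_stump`, and `predict` are hypothetical stand-ins rather than the system described here.

```python
import re


def bag_of_words(excerpt):
    """Lower-cased set of alphabetic tokens; word order and counts are ignored."""
    return set(re.findall(r"[a-z]+", excerpt.lower()))


def fit_stump(excerpts, labels):
    """Learn a one-node decision tree from labelled excerpts.

    labels are booleans (True = data shared). Returns a rule
    (word, positive_if_present); ties go to the alphabetically first word.
    """
    bags = [bag_of_words(e) for e in excerpts]
    vocabulary = sorted(set().union(*bags))
    best_rule, best_correct = None, -1
    for word in vocabulary:
        for positive_if_present in (True, False):
            # Count training excerpts that this candidate rule labels correctly.
            correct = sum(
                ((word in bag) == positive_if_present) == label
                for bag, label in zip(bags, labels)
            )
            if correct > best_correct:
                best_rule, best_correct = (word, positive_if_present), correct
    if best_rule is None:
        raise ValueError("no words found in the training excerpts")
    return best_rule


def predict(rule, excerpt):
    """Apply a learned rule to a new excerpt."""
    word, positive_if_present = rule
    return (word in bag_of_words(excerpt)) == positive_if_present
```

A usage example on invented toy excerpts (not data from the study):

```python
rule = fit_stump(
    [
        "Data were deposited in GEO",
        "Sequences have been deposited in Genbank",
        "Sequences were downloaded from Genbank",
        "We searched GEO for related data",
    ],
    [True, True, False, False],
)
print(rule)  # ('deposited', True)
```

A full decision tree would repeat this split recursively on the excerpts that fall on each side of the chosen word.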
38. Snippet of manually-developed
regular expression
(we | have | has | is | are | was | were | be | been)
+
(accessioned | added | archived | assigned | deposited | entered | imported | included | inserted | loaded | lodged | placed | posted | provided | registered | reported to | stored | submitted | uploaded to)
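The two word lists on this slide can be combined into a single pattern. The sketch below is one plausible reading, assuming the auxiliary and the verb are adjacent and separated only by whitespace; the slide does not specify how the two parts are joined, and `has_sharing_cue` is a hypothetical helper name.

```python
import re

AUXILIARIES = ["we", "have", "has", "is", "are", "was", "were", "be", "been"]
VERBS = [
    "accessioned", "added", "archived", "assigned", "deposited", "entered",
    "imported", "included", "inserted", "loaded", "lodged", "placed",
    "posted", "provided", "registered", "reported to", "stored",
    "submitted", "uploaded to",
]

# Assumption: auxiliary + whitespace + verb, matched case-insensitively.
# Spaces inside two-word verbs are also allowed to be any whitespace run.
SHARING_CUE = re.compile(
    r"\b(?:" + "|".join(AUXILIARIES) + r")\s+(?:"
    + "|".join(verb.replace(" ", r"\s+") for verb in VERBS)
    + r")\b",
    re.IGNORECASE,
)


def has_sharing_cue(excerpt):
    """True if the excerpt contains an auxiliary followed by a sharing verb."""
    return SHARING_CUE.search(excerpt) is not None
```

Under this reading, "have been deposited" matches through "been deposited", while "were retrieved" does not match because "retrieved" signals reuse rather than submission.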
39. How accurately were these methods able to
identify papers with evidence of public
database submissions?
40. Recall: % of papers cited in database submission fields
that were found by our methods
41. Best method for recall depends on database
42. “accession” good for some, <url> for others
43. Lexical regular expressions do well overall
44. Precision: % of papers found by our methods
that were cited in database submission fields
45. Lexical regular expressions do well overall, bag-of-words does even better
46. Precision of simple patterns depends on database
47. Simple patterns do poorly on the most popular databases (those with the most statements of reuse?)
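The two measures defined on these slides can be computed from sets of article identifiers. This is a small Python sketch; the function name and the use of plain identifier sets are assumptions made for illustration.

```python
def precision_recall(found, gold):
    """Compute (precision, recall) for one method.

    found: articles the method flagged as sharing data
    gold:  articles cited in database submission fields (the gold standard)
    """
    found, gold = set(found), set(gold)
    true_positives = len(found & gold)
    # Precision: share of flagged articles that are in the gold standard.
    precision = true_positives / len(found) if found else 0.0
    # Recall: share of gold-standard articles that the method flagged.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

For example, a method that flags four articles, two of which are among three gold-standard articles, has precision 2/4 and recall 2/3.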
49. Relative strength of methods for this task
across databases
bag of words
<lexical patterns>
<accession>
<url>
“accession”
50. Limitations:
• bias due to manual screening of negatives
• database-centric classifier
• approach requires computational access to
literature full text!
51. Impact:
• A recent version that runs in PubMed Central:
• could increase GEO article links by 2.6%
• by 5.5% annually once all NIH-funded articles are in PMC
• doubling the recall (to 80%) would
double these estimates
• 40 links already added by GEO staff!
52. Ongoing work:
1. Continue focusing on methods that use existing
full-text query interfaces, like PubMed Central
2. Use this tool to evaluate
the patterns and prevalence of
biomedical research data sharing and reuse
53. Thanks to
the Dept of Biomedical Informatics at the U of Pittsburgh,
the NLM for funding through training grant 5 T15 LM007059-22,
and everyone who publishes “gold” open access,
thereby facilitating reuse of article full text for studies like this.
My shared data: www.dbmi.pitt.edu/piwowar
Share your research data too!
55. Our manual filter for additional positive classifications
identified more cases in some databases than others: we
reclassified 19% of [article,database] cases from
ArrayExpress as positive despite an omitted literature
link, compared to 11%, 7%, 2%, and 1% for GEO, Genbank,
PDB, and SMD respectively (see Table 2 for raw number
of cases). The most common situations included: the
database entry listed a citation for another paper by the
same authors, the entry listed an erroneous PubMed ID,
the entry included a citation without a PubMed ID, or the
entry had a blank citation field.
56. Usage?
• scientists looking for datasets for reuse
• curators looking for primary citations
• researchers studying data sharing behaviour
58. Precise Regular expression
• (we|have|has|is|are|was|were|be|been) + (accessioned|added|archived|assigned|deposited|entered|imported|included|inserted|loaded|lodged|placed|posted|provided|registered|reported.to|stored|submitted|uploaded.to)
• (is|are|will.be|made).{0,20}(available|accessible)
• (be).(accessed|browsed|downloaded|found|obtained|queried|retrieved|searched|viewed)
• (through|under|as).{0,20}accession
• (given|new|received|assigned).{0,20}(accession)
• (data.{0,20}availability|for public distribution|for.{0,20}release upon publication|for the.{0,20}data.{0,20}
generated|from this study have.{0,20}accession|data.{0,10}from this study|access to.{0,20}data.
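As a rough illustration of how the availability and accession patterns on this slide behave, the Python sketch below compiles the four patterns that appear complete. Their parentheses are repaired here on the assumption that the unbalanced ones are transcription damage, the truncated final pattern is left out, and `matched_cues` is a hypothetical helper name.

```python
import re

# Patterns copied from the slide with parentheses balanced (assumption).
CUE_PATTERNS = [
    r"(is|are|will.be|made).{0,20}(available|accessible)",
    r"(be).(accessed|browsed|downloaded|found|obtained|queried|retrieved|searched|viewed)",
    r"(through|under|as).{0,20}accession",
    r"(given|new|received|assigned).{0,20}(accession)",
]
COMPILED = [re.compile(pattern, re.IGNORECASE) for pattern in CUE_PATTERNS]


def matched_cues(excerpt):
    """Return the indices of the patterns that match somewhere in the excerpt."""
    return [i for i, pattern in enumerate(COMPILED) if pattern.search(excerpt)]
```

Because the patterns carry no word boundaries, short alternatives such as "is" or "as" can also match inside longer words like "this" or "was"; the bounded gaps (`.{0,20}`) keep such accidental matches close to the cue word.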
65. Research data
PAST MEDICAL HISTORY:
Past medical history showed she had superficial
phlebitis times two in the past, had non-insulin
dependent diabetes mellitus for four years.
She had been hypothyroid for three years.
HISTORY OF PRESENT ILLNESS:
The patient is a 58-year-old female, …
http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png;
http://en.wikipedia.org/wiki/Image:Microarray2.gif;
http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441