Identifying data sharing
 in the biomedical literature

       Heather Piwowar and Wendy Chapman
Department of Biomedical Informatics, U of Pittsburgh
Our full paper:




Visualized as a “Wordle”
  (font size ~ word frequency, location and orientation are random)
Created at IBM’s data sharing and visualization site Many Eyes
Our aim:
Identify research articles for which the authors
have shared their datasets


For this research:
sharing = submitted to centralized databases
Links between article and data
         are important
The data provides detail for the
      results of the article
The article provides detail for the
               data
Specialized searching methods help us find articles
  OR data...
but what about when we want articles WITH data?
How can we find articles that have
      shared their datasets?
Sometimes the links are easy to discover
1. Through database citations:
When authors upload data to a database, they
have the opportunity to cite the paper that
describes the data collection
Unfortunately, the citation is often left blank
   because the data is submitted before
          the paper is published
2. Through hyperlink urls in the text
Authors often reference their datasets within
their paper with a website url
But the meaning of the hyperlinks is ambiguous.
Sometimes they point to datasets that have been
         accessed, rather than submitted.
And often the text contains no hyperlinks at all:
3. Through text mining
What if we could extract phrases like

“data of the experiment can be accessed at”
full-text phrases containing “... accessed”
“can be accessed” suggests data is shared
BUT “was/were accessed” suggests data reuse!
full-text phrases containing “... downloaded”
“was/were downloaded” suggests data reuse
while “can be downloaded” suggests data sharing
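The contrast between these cues can be sketched in a few lines of Python. This is only an illustration built from the four example phrases on the slides; the function name `classify_phrase` and the label strings are assumptions, not part of the system described in the talk.

```python
import re

# Cue patterns taken from the slide examples; the labels are illustrative.
SHARING_CUE = re.compile(r"\bcan be (accessed|downloaded)\b", re.IGNORECASE)
REUSE_CUE = re.compile(r"\b(was|were) (accessed|downloaded)\b", re.IGNORECASE)


def classify_phrase(text):
    """Return 'sharing', 'reuse', or 'unknown' for a text excerpt."""
    if SHARING_CUE.search(text):
        return "sharing"
    if REUSE_CUE.search(text):
        return "reuse"
    return "unknown"
```

The verb is the same in both cases; only the auxiliary ("can be" versus "was/were") separates a statement of sharing from a statement of reuse.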
Our aim:
Identify research articles for which the authors
have shared their raw datasets.


Proposed approach:
Develop a system to
identify statements of shared data
from an article’s full text.
Materials:
Full text from a subset of the open access literature
Database submission citations from five databases:
    • Genbank
    • Protein Data Bank
    • Gene Expression Omnibus
    • ArrayExpress
    • Stanford Microarray Database
Our Gold standard:


An article was considered to have a “shared dataset” if
the article was cited within the primary submission
field of a database entry

(+ a small amount of manual screening to find additional
positives based on full text)
Approach:
For those articles that mention database names,
    • Extract a 300-character window around every
     mention of a database name
    • Apply various mining algorithms to decide if
     there is evidence that the authors deposited
     data from this study in the database
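The extraction step might look roughly like the sketch below. It assumes the 300 characters are centred on the mention (the slides do not say whether 300 is the total width or the width on each side), and the names `DATABASE_NAMES` and `extract_windows` are hypothetical.

```python
import re

# Database names as listed on the Materials slide.
DATABASE_NAMES = [
    "Genbank",
    "Protein Data Bank",
    "Gene Expression Omnibus",
    "ArrayExpress",
    "Stanford Microarray Database",
]


def extract_windows(full_text, names=DATABASE_NAMES, width=300):
    """Return (name, excerpt) pairs, one per mention of a database name."""
    half = width // 2
    windows = []
    for name in names:
        for match in re.finditer(re.escape(name), full_text, re.IGNORECASE):
            # Centre the window on the middle of the mention (an assumption).
            centre = (match.start() + match.end()) // 2
            start = max(0, centre - half)
            end = min(len(full_text), centre + half)
            windows.append((name, full_text[start:end]))
    return windows
```

Each excerpt can then be passed to whichever method decides whether the authors deposited data in that database.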
Results:
• queried 24 000 articles across 27 journals
• 25% of all open access articles mentioned one
  of the database names (50% Genbank)
• development set of 4434 articles
  training set of 2000
  test set of 1028
True positives:

23% of the articles that mentioned a
database were cited from within a database
submission field

= evidence that article shared its data!
Three simple methods
  for identifying sharing
Does the excerpt surrounding the database name
contain:
1. the word “accession”
2. an accession number
3. a URL
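As a rough sketch, the three checks could be written as follows. The accession-number pattern covers only GEO-style identifiers (GSE, GSM, GDS followed by digits) as an illustrative assumption; other databases use different formats, and the patterns actually used in the study are not shown on the slides.

```python
import re

ACCESSION_WORD = re.compile(r"\baccession\b", re.IGNORECASE)
# Illustrative only: GEO-style identifiers such as GSE1234.
ACCESSION_NUMBER = re.compile(r"\b(GSE|GSM|GDS)\d+\b")
URL = re.compile(r"\b(https?://|www\.)\S+", re.IGNORECASE)


def simple_features(excerpt):
    """Report which of the three simple cues appear in an excerpt."""
    return {
        "accession_word": bool(ACCESSION_WORD.search(excerpt)),
        "accession_number": bool(ACCESSION_NUMBER.search(excerpt)),
        "url": bool(URL.search(excerpt)),
    }
```

Each feature is a yes/no answer, so each can be evaluated on its own as a one-rule classifier.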
Two complex methods:

4. A manually-derived regular expression to
   match lexical cues that suggest sharing

5. An automatically-derived bag of words
   decision tree
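To make the bag-of-words idea concrete, here is a minimal sketch of the simplest possible decision tree, a one-level "stump" that picks a single word to split on. It is not the tree learned in the study, and the toy excerpts are made up for illustration.

```python
import re


def bag_of_words(excerpt):
    """Lower-cased set of words in one excerpt (word order is discarded)."""
    return set(re.findall(r"[a-z]+", excerpt.lower()))


def train_stump(examples):
    """Pick the single word whose presence best separates the labels.

    `examples` is a list of (excerpt, shared) pairs with boolean labels.
    An excerpt is later predicted 'shared' if it contains the chosen word.
    """
    bags = [(bag_of_words(text), shared) for text, shared in examples]
    vocabulary = sorted(set().union(*(bag for bag, _ in bags)))

    def correct(word):
        # Number of training excerpts the one-word rule classifies correctly.
        return sum((word in bag) == shared for bag, shared in bags)

    return max(vocabulary, key=correct)


def predict(word, excerpt):
    return word in bag_of_words(excerpt)


# Made-up toy excerpts for illustration, not data from the study.
TOY_EXAMPLES = [
    ("microarray data were deposited in ArrayExpress", True),
    ("sequences have been deposited in Genbank", True),
    ("structures in the Protein Data Bank were downloaded", False),
    ("profiles were obtained from Gene Expression Omnibus", False),
]
```

A full decision tree repeats this word selection recursively on each branch; the stump shows only the first split.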
Snippet of manually-developed
     regular expression
 we, have, has, is, are, was, were, be, been
              +
 accessioned, added, archived, assigned, deposited, entered, imported,
 included, inserted, loaded, lodged, placed, posted, provided, registered,
 reported to, stored, submitted, uploaded to
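The two word lists in this snippet can be combined into a single pattern, sketched below. Joining the lists with whitespace (`\s+`) is an assumption made for this example, and the name `SUBMISSION_CUE` is hypothetical.

```python
import re

# Word lists copied from the slide.
AUXILIARIES = ["we", "have", "has", "is", "are", "was", "were", "be", "been"]
PARTICIPLES = [
    "accessioned", "added", "archived", "assigned", "deposited", "entered",
    "imported", "included", "inserted", "loaded", "lodged", "placed",
    "posted", "provided", "registered", "reported to", "stored",
    "submitted", "uploaded to",
]

# An auxiliary, whitespace, then one of the submission verbs.
SUBMISSION_CUE = re.compile(
    r"\b(" + "|".join(AUXILIARIES) + r")\s+("
    + "|".join(re.escape(p) for p in PARTICIPLES) + r")\b",
    re.IGNORECASE,
)
```

With these lists, "have been deposited" and "were uploaded to" match, while "were downloaded" does not, because "downloaded" is not among the submission verbs.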
How accurately were these methods able to
identify papers with evidence of public
database submissions?
Recall:   % of papers cited in database submission fields
          that were found by our methods

• Best method for recall depends on database
• “accession” good for some, <url> for others
• lexical regular expressions do well overall
Precision:      % of papers found by our methods
  that were cited in database submission fields

• lexical regular expressions do well overall,
  bag-of-words does even better
• Precision of simple patterns depends on database
• Simple patterns do poorly on the most popular databases
  (those with the most statements of reuse?)
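The recall and precision definitions above can be written out as a small helper. This is a generic sketch; the function name and the article identifiers in the usage note are hypothetical.

```python
def precision_recall(found, gold):
    """Compute (precision, recall) for one method.

    found: articles the method flagged as having shared data
    gold:  articles cited in database submission fields
    """
    found, gold = set(found), set(gold)
    true_positives = len(found & gold)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

For example, if a method flags articles a, b, c, and d while the gold standard contains a, b, and e, two of the four flagged articles are correct (precision 0.5) and two of the three gold articles are found (recall 2/3).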
Precision vs. Recall plot of all methods
               for each database.




Diverse!
Relative strength of methods for this task
             across databases

[Figure labels: bag of words; <lexical patterns>; <accession>; <url>; “accession”]
Limitations:


• bias due to manual screening of negatives
• database-centric classifier
• approach requires computational access to
  literature full text!
Impact:


• A recent version that runs in PubMed Central:
    • could increase GEO article links by 2.6%
    • by 5.5% annually once all NIH-funded articles are in PMC
    • doubling the recall (to 80%) would
      double these estimates
• 40 links already added by GEO staff!
Ongoing work:


1. Continue focusing on methods that use existing
   full-text query interfaces, like PubMed Central
2. Use this tool to evaluate
   the patterns and prevalence of
   biomedical research data sharing and reuse
Thanks to
 the Dept of Biomedical Informatics at the U of Pittsburgh,

 the NLM for funding through training grant 5 T15 LM007059-22,

 and everyone who publishes “gold” open access,
   thereby facilitating reuse of article full text for studies like this.


       My shared data: www.dbmi.pitt.edu/piwowar
               Share your research data too!
Our manual filter for additional positive classifications identified
more cases in some databases than others: we reclassified 19% of
[article, database] cases from ArrayExpress as positive despite an
omitted literature link, compared to 11%, 7%, 2%, and 1% for GEO,
Genbank, PDB, and SMD respectively (see Table 2 for raw number of
cases). The most common situations included: the database entry listed
a citation for another paper by the same authors, the entry listed an
erroneous PubMed ID, the entry included a citation without a PubMed ID,
or the entry had a blank citation field.
Usage?


• scientists looking for datasets for reuse
• curators looking for primary citations
• researchers studying data sharing behaviour
Regular expression

Precise one +

    r"(\b(accession.{0,20}(for|at).{0,100}(is|are)))",
    r"(\b(raw|original|our|complete|detailed).{0,20}data)",
    r"(\b(we|have|is|was|were|is|are|be|have|has|been).(exported|gave|given|listed|provided|reported))"
    ]) + ")"
Precise Regular expression

    (we|have|has|is|are|was|were|be|been)
    (accessioned|added|archived|assigned|deposited|entered|imported|included|inserted|loaded|lodged|placed|posted|provided|registered|reported.to|stored|submitted|uploaded.to)

    (is|are|will.be|made).{0,20}(available|accessible)

    (be).(accessed|browsed|downloaded|found|obtained|queried|retrieved|searched|viewed)

    (through|under|as).{0,20}accession

    (given|new|received|assigned).{0,20}(accession)

    (data.{0,20}availability|for public distribution|for.{0,20}release upon publication|for the.{0,20}data.{0,20}generated|from this study have.{0,20}accession|data.{0,10}from this study|access to.{0,20}data.
Stopwords are important!
Research data




Sources:
http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg
http://en.wikipedia.org/wiki/Image:Helices.png
http://en.wikipedia.org/wiki/Image:Heatmap.png
http://en.wikipedia.org/wiki/Image:Microarray2.gif
http://zellig.cpmc.columbia.edu/medlee/demo/
http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
                PAST MEDICAL HISTORY:
     Past medical history showed she had superficial
     phlebitis times two in the past, had non-insulin
       dependent diabetes mellitus for four years.
        She had been hypothyroid for three years.
            HISTORY OF PRESENT ILLNESS:
          The patient is a 58-year-old female, …



Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Software-Native metrics: Depsy lessons learned
 
submission summary for #WSSSPE Policy session on Credit, Citation, and Impact
submission summary for #WSSSPE Policy session on Credit, Citation, and Impactsubmission summary for #WSSSPE Policy session on Credit, Citation, and Impact
submission summary for #WSSSPE Policy session on Credit, Citation, and Impact
 
Building Skyscrapers with our Scholarship
Building Skyscrapers with our ScholarshipBuilding Skyscrapers with our Scholarship
Building Skyscrapers with our Scholarship
 
Right time, right place, to change the world
Right time, right place, to change the worldRight time, right place, to change the world
Right time, right place, to change the world
 
No more waiting! Tools that work Today to reveal dataset use
No more waiting!  Tools that work Today to reveal dataset useNo more waiting!  Tools that work Today to reveal dataset use
No more waiting! Tools that work Today to reveal dataset use
 
Analyzing data about our data
Analyzing data about our dataAnalyzing data about our data
Analyzing data about our data
 

Último

Pondicherry Call Girls Book Now 9630942363 Top Class Pondicherry Escort Servi...
Pondicherry Call Girls Book Now 9630942363 Top Class Pondicherry Escort Servi...Pondicherry Call Girls Book Now 9630942363 Top Class Pondicherry Escort Servi...
Pondicherry Call Girls Book Now 9630942363 Top Class Pondicherry Escort Servi...GENUINE ESCORT AGENCY
 
Book Paid Powai Call Girls Mumbai 𖠋 9930245274 𖠋Low Budget Full Independent H...
Book Paid Powai Call Girls Mumbai 𖠋 9930245274 𖠋Low Budget Full Independent H...Book Paid Powai Call Girls Mumbai 𖠋 9930245274 𖠋Low Budget Full Independent H...
Book Paid Powai Call Girls Mumbai 𖠋 9930245274 𖠋Low Budget Full Independent H...Call Girls in Nagpur High Profile
 
Call Girls Bangalore Just Call 8250077686 Top Class Call Girl Service Available
Call Girls Bangalore Just Call 8250077686 Top Class Call Girl Service AvailableCall Girls Bangalore Just Call 8250077686 Top Class Call Girl Service Available
Call Girls Bangalore Just Call 8250077686 Top Class Call Girl Service AvailableDipal Arora
 
Call Girls Nagpur Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Nagpur Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Nagpur Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Nagpur Just Call 9907093804 Top Class Call Girl Service AvailableDipal Arora
 
O898O367676 Call Girls In Ahmedabad Escort Service Available 24×7 In Ahmedabad
O898O367676 Call Girls In Ahmedabad Escort Service Available 24×7 In AhmedabadO898O367676 Call Girls In Ahmedabad Escort Service Available 24×7 In Ahmedabad
O898O367676 Call Girls In Ahmedabad Escort Service Available 24×7 In AhmedabadGENUINE ESCORT AGENCY
 
Top Quality Call Girl Service Kalyanpur 6378878445 Available Call Girls Any Time
Top Quality Call Girl Service Kalyanpur 6378878445 Available Call Girls Any TimeTop Quality Call Girl Service Kalyanpur 6378878445 Available Call Girls Any Time
Top Quality Call Girl Service Kalyanpur 6378878445 Available Call Girls Any TimeCall Girls Delhi
 
Call Girls Faridabad Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Faridabad Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Faridabad Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Faridabad Just Call 9907093804 Top Class Call Girl Service AvailableDipal Arora
 
Night 7k to 12k Navi Mumbai Call Girl Photo 👉 BOOK NOW 9833363713 👈 ♀️ night ...
Night 7k to 12k Navi Mumbai Call Girl Photo 👉 BOOK NOW 9833363713 👈 ♀️ night ...Night 7k to 12k Navi Mumbai Call Girl Photo 👉 BOOK NOW 9833363713 👈 ♀️ night ...
Night 7k to 12k Navi Mumbai Call Girl Photo 👉 BOOK NOW 9833363713 👈 ♀️ night ...aartirawatdelhi
 
VIP Hyderabad Call Girls Bahadurpally 7877925207 ₹5000 To 25K With AC Room 💚😋
VIP Hyderabad Call Girls Bahadurpally 7877925207 ₹5000 To 25K With AC Room 💚😋VIP Hyderabad Call Girls Bahadurpally 7877925207 ₹5000 To 25K With AC Room 💚😋
VIP Hyderabad Call Girls Bahadurpally 7877925207 ₹5000 To 25K With AC Room 💚😋TANUJA PANDEY
 
Call Girls Siliguri Just Call 8250077686 Top Class Call Girl Service Available
Call Girls Siliguri Just Call 8250077686 Top Class Call Girl Service AvailableCall Girls Siliguri Just Call 8250077686 Top Class Call Girl Service Available
Call Girls Siliguri Just Call 8250077686 Top Class Call Girl Service AvailableDipal Arora
 
Call Girls Visakhapatnam Just Call 8250077686 Top Class Call Girl Service Ava...
Call Girls Visakhapatnam Just Call 8250077686 Top Class Call Girl Service Ava...Call Girls Visakhapatnam Just Call 8250077686 Top Class Call Girl Service Ava...
Call Girls Visakhapatnam Just Call 8250077686 Top Class Call Girl Service Ava...Dipal Arora
 
Premium Call Girls In Jaipur {8445551418} ❤️VVIP SEEMA Call Girl in Jaipur Ra...
Premium Call Girls In Jaipur {8445551418} ❤️VVIP SEEMA Call Girl in Jaipur Ra...Premium Call Girls In Jaipur {8445551418} ❤️VVIP SEEMA Call Girl in Jaipur Ra...
Premium Call Girls In Jaipur {8445551418} ❤️VVIP SEEMA Call Girl in Jaipur Ra...parulsinha
 
Mumbai ] (Call Girls) in Mumbai 10k @ I'm VIP Independent Escorts Girls 98333...
Mumbai ] (Call Girls) in Mumbai 10k @ I'm VIP Independent Escorts Girls 98333...Mumbai ] (Call Girls) in Mumbai 10k @ I'm VIP Independent Escorts Girls 98333...
Mumbai ] (Call Girls) in Mumbai 10k @ I'm VIP Independent Escorts Girls 98333...Ishani Gupta
 
Call Girls Jabalpur Just Call 8250077686 Top Class Call Girl Service Available
Call Girls Jabalpur Just Call 8250077686 Top Class Call Girl Service AvailableCall Girls Jabalpur Just Call 8250077686 Top Class Call Girl Service Available
Call Girls Jabalpur Just Call 8250077686 Top Class Call Girl Service AvailableDipal Arora
 
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Service
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort ServicePremium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Service
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Servicevidya singh
 
Best Rate (Patna ) Call Girls Patna ⟟ 8617370543 ⟟ High Class Call Girl In 5 ...
Best Rate (Patna ) Call Girls Patna ⟟ 8617370543 ⟟ High Class Call Girl In 5 ...Best Rate (Patna ) Call Girls Patna ⟟ 8617370543 ⟟ High Class Call Girl In 5 ...
Best Rate (Patna ) Call Girls Patna ⟟ 8617370543 ⟟ High Class Call Girl In 5 ...Dipal Arora
 
(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...
(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...
(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...Taniya Sharma
 
Top Rated Hyderabad Call Girls Erragadda ⟟ 9332606886 ⟟ Call Me For Genuine ...
Top Rated  Hyderabad Call Girls Erragadda ⟟ 9332606886 ⟟ Call Me For Genuine ...Top Rated  Hyderabad Call Girls Erragadda ⟟ 9332606886 ⟟ Call Me For Genuine ...
Top Rated Hyderabad Call Girls Erragadda ⟟ 9332606886 ⟟ Call Me For Genuine ...chandars293
 
Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...Dipal Arora
 
VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...
VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...
VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...jageshsingh5554
 

Último (20)

Pondicherry Call Girls Book Now 9630942363 Top Class Pondicherry Escort Servi...
Pondicherry Call Girls Book Now 9630942363 Top Class Pondicherry Escort Servi...Pondicherry Call Girls Book Now 9630942363 Top Class Pondicherry Escort Servi...
Pondicherry Call Girls Book Now 9630942363 Top Class Pondicherry Escort Servi...
 
Book Paid Powai Call Girls Mumbai 𖠋 9930245274 𖠋Low Budget Full Independent H...
Book Paid Powai Call Girls Mumbai 𖠋 9930245274 𖠋Low Budget Full Independent H...Book Paid Powai Call Girls Mumbai 𖠋 9930245274 𖠋Low Budget Full Independent H...
Book Paid Powai Call Girls Mumbai 𖠋 9930245274 𖠋Low Budget Full Independent H...
 
Call Girls Bangalore Just Call 8250077686 Top Class Call Girl Service Available
Call Girls Bangalore Just Call 8250077686 Top Class Call Girl Service AvailableCall Girls Bangalore Just Call 8250077686 Top Class Call Girl Service Available
Call Girls Bangalore Just Call 8250077686 Top Class Call Girl Service Available
 
Call Girls Nagpur Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Nagpur Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Nagpur Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Nagpur Just Call 9907093804 Top Class Call Girl Service Available
 
O898O367676 Call Girls In Ahmedabad Escort Service Available 24×7 In Ahmedabad
O898O367676 Call Girls In Ahmedabad Escort Service Available 24×7 In AhmedabadO898O367676 Call Girls In Ahmedabad Escort Service Available 24×7 In Ahmedabad
O898O367676 Call Girls In Ahmedabad Escort Service Available 24×7 In Ahmedabad
 
Top Quality Call Girl Service Kalyanpur 6378878445 Available Call Girls Any Time
Top Quality Call Girl Service Kalyanpur 6378878445 Available Call Girls Any TimeTop Quality Call Girl Service Kalyanpur 6378878445 Available Call Girls Any Time
Top Quality Call Girl Service Kalyanpur 6378878445 Available Call Girls Any Time
 
Call Girls Faridabad Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Faridabad Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Faridabad Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Faridabad Just Call 9907093804 Top Class Call Girl Service Available
 
Night 7k to 12k Navi Mumbai Call Girl Photo 👉 BOOK NOW 9833363713 👈 ♀️ night ...
Night 7k to 12k Navi Mumbai Call Girl Photo 👉 BOOK NOW 9833363713 👈 ♀️ night ...Night 7k to 12k Navi Mumbai Call Girl Photo 👉 BOOK NOW 9833363713 👈 ♀️ night ...
Night 7k to 12k Navi Mumbai Call Girl Photo 👉 BOOK NOW 9833363713 👈 ♀️ night ...
 
VIP Hyderabad Call Girls Bahadurpally 7877925207 ₹5000 To 25K With AC Room 💚😋
VIP Hyderabad Call Girls Bahadurpally 7877925207 ₹5000 To 25K With AC Room 💚😋VIP Hyderabad Call Girls Bahadurpally 7877925207 ₹5000 To 25K With AC Room 💚😋
VIP Hyderabad Call Girls Bahadurpally 7877925207 ₹5000 To 25K With AC Room 💚😋
 
Call Girls Siliguri Just Call 8250077686 Top Class Call Girl Service Available
Call Girls Siliguri Just Call 8250077686 Top Class Call Girl Service AvailableCall Girls Siliguri Just Call 8250077686 Top Class Call Girl Service Available
Call Girls Siliguri Just Call 8250077686 Top Class Call Girl Service Available
 
Call Girls Visakhapatnam Just Call 8250077686 Top Class Call Girl Service Ava...
Call Girls Visakhapatnam Just Call 8250077686 Top Class Call Girl Service Ava...Call Girls Visakhapatnam Just Call 8250077686 Top Class Call Girl Service Ava...
Call Girls Visakhapatnam Just Call 8250077686 Top Class Call Girl Service Ava...
 
Premium Call Girls In Jaipur {8445551418} ❤️VVIP SEEMA Call Girl in Jaipur Ra...
Premium Call Girls In Jaipur {8445551418} ❤️VVIP SEEMA Call Girl in Jaipur Ra...Premium Call Girls In Jaipur {8445551418} ❤️VVIP SEEMA Call Girl in Jaipur Ra...
Premium Call Girls In Jaipur {8445551418} ❤️VVIP SEEMA Call Girl in Jaipur Ra...
 
Mumbai ] (Call Girls) in Mumbai 10k @ I'm VIP Independent Escorts Girls 98333...
Mumbai ] (Call Girls) in Mumbai 10k @ I'm VIP Independent Escorts Girls 98333...Mumbai ] (Call Girls) in Mumbai 10k @ I'm VIP Independent Escorts Girls 98333...
Mumbai ] (Call Girls) in Mumbai 10k @ I'm VIP Independent Escorts Girls 98333...
 
Call Girls Jabalpur Just Call 8250077686 Top Class Call Girl Service Available
Call Girls Jabalpur Just Call 8250077686 Top Class Call Girl Service AvailableCall Girls Jabalpur Just Call 8250077686 Top Class Call Girl Service Available
Call Girls Jabalpur Just Call 8250077686 Top Class Call Girl Service Available
 
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Service
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort ServicePremium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Service
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Service
 
Best Rate (Patna ) Call Girls Patna ⟟ 8617370543 ⟟ High Class Call Girl In 5 ...
Best Rate (Patna ) Call Girls Patna ⟟ 8617370543 ⟟ High Class Call Girl In 5 ...Best Rate (Patna ) Call Girls Patna ⟟ 8617370543 ⟟ High Class Call Girl In 5 ...
Best Rate (Patna ) Call Girls Patna ⟟ 8617370543 ⟟ High Class Call Girl In 5 ...
 
(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...
(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...
(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...
 
Top Rated Hyderabad Call Girls Erragadda ⟟ 9332606886 ⟟ Call Me For Genuine ...
Top Rated  Hyderabad Call Girls Erragadda ⟟ 9332606886 ⟟ Call Me For Genuine ...Top Rated  Hyderabad Call Girls Erragadda ⟟ 9332606886 ⟟ Call Me For Genuine ...
Top Rated Hyderabad Call Girls Erragadda ⟟ 9332606886 ⟟ Call Me For Genuine ...
 
Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...
 
VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...
VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...
VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...
 

Piwowar AMIA 2008: Identifying data sharing in biomedical literature

  • 1. Identifying data sharing in the biomedical literature Heather Piwowar and Wendy Chapman Department of Biomedical Informatics, U of Pittsburgh
  • 2. Our full paper: Visualized as a “Wordle” (font size ~ word frequency, location and orientation are random)
  • 3. Created at IBM’s data sharing and visualization site Many Eyes
  • 4. Our aim: Identify research articles for which the authors have shared their datasets For this research: sharing = submitted to centralized databases
  • 5.
  • 6.
  • 7. Links between article and data are important
  • 8. The data provides detail for the results of the article
  • 9. The article provides detail for the data
  • 10. Specialized searching methods help us find articles OR data... but what about when we want articles WITH data?
  • 11. How can we find articles that have shared their datasets?
  • 12. Sometimes the links are easy to discover
  • 13. 1. Through database citations: When authors upload data to a database, they have the opportunity to cite the paper that describes the data collection
  • 14.
  • 15.
  • 16. Unfortunately, the citation is often left blank because the data is submitted before the paper is published
  • 17. 2. Through hyperlink urls in the text Authors often reference their datasets within their paper with a website url
  • 18.
  • 19. But the meaning of the hyperlinks is ambiguous. Sometimes they point to datasets that have been accessed, rather than submitted.
  • 21. And often the text contains no hyperlinks at all:
  • 22. 3. Through text mining
  • 23. What if we could extract phrases like “data of the experiment can be accessed at”
  • 24. full-text phrases containing “... accessed”
  • 25. “can be accessed” suggests data is shared
  • 26. BUT “was/were accessed” suggests data reuse!
  • 27. full-text phrases containing “... downloaded”
  • 28. “was/were downloaded” suggests data reuse
  • 29. while “can be downloaded” suggests data sharing
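The contrast between the cue phrases above can be sketched in a few lines of Python. This is only an illustration of the idea, not the system described in the talk: the two patterns and the function name `classify_phrase` are assumptions that cover just the four example phrasings from the slides.

```python
import re

# Hypothetical cue patterns taken from the slide examples only;
# a real system would need far more lexical variety.
SHARING_CUE = re.compile(r"\bcan be (accessed|downloaded)\b", re.IGNORECASE)
REUSE_CUE = re.compile(r"\b(was|were) (accessed|downloaded)\b", re.IGNORECASE)


def classify_phrase(text: str) -> str:
    """Label a full-text phrase as 'sharing', 'reuse', or 'unknown'."""
    if SHARING_CUE.search(text):
        return "sharing"
    if REUSE_CUE.search(text):
        return "reuse"
    return "unknown"


if __name__ == "__main__":
    print(classify_phrase("data of the experiment can be accessed at"))
    print(classify_phrase("The sequences were downloaded from Genbank"))
```

The point of the sketch is that the verb alone ("accessed", "downloaded") does not decide the label; the auxiliary in front of it does.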
  • 30. Our aim: Identify research articles for which the authors have shared their raw datasets. Proposed approach: Develop a system to identify statements of shared data from an article’s full text.
  • 31. Materials: Full text from a subset of the open access literature Database submission citations from five databases: • Genbank • Protein Data Bank • Gene Expression Omnibus • ArrayExpress • Stanford Microarray Database
  • 32. Our Gold standard: An article was considered to have a “shared dataset” if the article was cited within the primary submission field of a database entry (+ a small amount of manual screening to find additional positives based on full text)
  • 33. Approach: For those articles that mention database names, • Extract a 300-character window around every mention of a database name • Apply various mining algorithms to decide if there is evidence that the authors deposited data from this study in the database
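A minimal sketch of the window-extraction step, under stated assumptions: the slide does not say whether the 300 characters are a total or a per-side figure, so this version assumes a 300-character total centred on the mention. The function name `extract_windows` and the case-insensitive matching are also assumptions; the database names are the five listed under Materials.

```python
import re

# The five databases named in the Materials slide.
DATABASE_NAMES = [
    "Genbank",
    "Protein Data Bank",
    "Gene Expression Omnibus",
    "ArrayExpress",
    "Stanford Microarray Database",
]


def extract_windows(full_text, names=DATABASE_NAMES, width=300):
    """Return (mention, excerpt) pairs, one per database-name mention.

    Assumption: `width` is the total excerpt length, centred on the
    mention and clipped at the start and end of the article text.
    """
    pattern = re.compile("|".join(re.escape(n) for n in names), re.IGNORECASE)
    half = width // 2
    windows = []
    for match in pattern.finditer(full_text):
        center = (match.start() + match.end()) // 2
        start = max(0, center - half)
        end = min(len(full_text), center + half)
        windows.append((match.group(0), full_text[start:end]))
    return windows
```

Each excerpt can then be passed to whichever classifier is being evaluated, so the classifiers never see the whole article.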
  • 34. Results: • queried 24 000 articles across 27 journals • 25% of all open access articles mentioned one of the database names (50% Genbank) • development set of 4434 articles, training set of 2000, test set of 1028
  • 35. True positives: 23% of the articles that mentioned a database were cited from within a database submission field = evidence that article shared its data!
  • 36. Three simple methods for identifying sharing Does the excerpt surrounding the database name contain: 1. the word “accession” 2. an accession number 3. a URL
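The three simple checks could look roughly like the sketch below. The patterns are illustrative assumptions, not the ones used in the study: in particular, the accession-number pattern only recognises GEO series identifiers of the form `GSE` plus digits, and each database would need its own identifier format.

```python
import re

# 1. The word "accession" (singular or plural).
ACCESSION_WORD = re.compile(r"\baccessions?\b", re.IGNORECASE)
# 2. An accession number. Assumption: GEO series IDs only (e.g. GSE1234).
ACCESSION_NUMBER = re.compile(r"\bGSE\d+\b")
# 3. A URL, recognised loosely by its prefix.
URL = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def simple_features(excerpt):
    """Report which of the three simple cues appear in an excerpt."""
    return {
        "accession_word": bool(ACCESSION_WORD.search(excerpt)),
        "accession_number": bool(ACCESSION_NUMBER.search(excerpt)),
        "url": bool(URL.search(excerpt)),
    }
```

Each feature is a yes/no answer about one excerpt, so any of the three can serve as a standalone classifier and be scored separately.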
  • 37. Two complex methods: 4. A manually-derived regular expression to match lexical cues that suggest sharing 5. An automatically-derived bag of words decision tree
  • 38. Snippet of manually-developed regular expression: (we | have | has | is | are | was | were | be | been) + (accessioned | added | archived | assigned | deposited | entered | imported | included | inserted | loaded | lodged | placed | posted | provided | registered | reported to | stored | submitted | uploaded to)
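As a rough sketch, the two word lists in this snippet can be combined into one compiled pattern. The word lists come from the slide; the way they are joined here (a word boundary, whitespace between the two groups, case-insensitive matching) and the name `build_lexical_pattern` are assumptions, and the authors' full expression contains further alternatives.

```python
import re

AUXILIARIES = ["we", "have", "has", "is", "are", "was", "were", "be", "been"]
DEPOSIT_VERBS = [
    "accessioned", "added", "archived", "assigned", "deposited", "entered",
    "imported", "included", "inserted", "loaded", "lodged", "placed",
    "posted", "provided", "registered", "reported to", "stored",
    "submitted", "uploaded to",
]


def build_lexical_pattern(auxiliaries=AUXILIARIES, verbs=DEPOSIT_VERBS):
    """Compile '<auxiliary> <deposit verb>' into a single regex."""
    aux_alt = "|".join(auxiliaries)
    # Allow any run of whitespace inside two-word verbs such as "uploaded to".
    verb_alt = "|".join(v.replace(" ", r"\s+") for v in verbs)
    return re.compile(rf"\b({aux_alt})\s+({verb_alt})\b", re.IGNORECASE)


LEXICAL_CUE = build_lexical_pattern()

if __name__ == "__main__":
    hit = LEXICAL_CUE.search("The microarray data have been deposited in GEO")
    print(hit.group(0) if hit else None)
```

Keeping the lists as plain Python data makes it easy to add or remove a verb and re-run the evaluation.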
  • 39. How accurately were these methods able to identify papers with evidence of public database submissions?
  • 40. Recall: % of papers cited in database submission fields that were found by our methods
  • 41. Recall: % of papers cited in database submission fields that were found by our methods Best method for recall depends on database
  • 42. Recall: % of papers cited in database submission fields that were found by our methods “accession” good for some, <url> for others
  • 43. Recall: % of papers cited in database submission fields that were found by our methods lexical regular expressions do well overall
  • 44. Precision: % of papers found by our methods that were cited in database submissions fields
  • 45. Precision: % of papers found by our methods that were cited in database submissions fields lexical regular expressions do well overall, bag-of- words does even better
  • 46. Precision: % of papers found by our methods that were cited in database submissions fields Precision of simple patterns depends on database
  • 47. Precision: % of papers found by our methods that were cited in database submissions fields Simple patterns do poorly on the most popular databases (those with the most statements of reuse?)
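The two definitions above translate directly into set arithmetic. In this sketch, `found` stands for the papers a method flagged and `gold` for the papers cited in database submission fields; both names and the article identifiers in the example are hypothetical.

```python
def precision_recall(found, gold):
    """Return (precision, recall) for one method.

    precision: share of papers found by the method that are in the gold set
    recall:    share of gold-set papers that the method found
    """
    found, gold = set(found), set(gold)
    true_positives = len(found & gold)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical article identifiers, for illustration only.
    flagged = {"a", "b", "c", "d"}
    cited_in_database = {"a", "b", "e"}
    print(precision_recall(flagged, cited_in_database))
```

Computing both numbers per database, rather than pooled, is what exposes the database-dependent differences the slides describe.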
  • 48. Precision vs. Recall plot of all methods for each database. Diverse!
  • 49. Relative strength of methods for this task across databases: bag of words, <lexical patterns>, <accession>, <url>, “accession”
  • 50. Limitations: • bias due to manual screening of negatives • database-centric classifier • approach requires computational access to literature full text!
  • 51. Impact: • A recent version that runs in PubMed Central: • could increase GEO article links by 2.6% • by 5.5% annually once all NIH-funded articles are in PMC • doubling the recall (to 80%) would double these estimates • 40 links already added by GEO staff!
  • 52. Ongoing work: 1. Continue focusing on methods that use existing full-text query interfaces, like PubMed Central 2. Use this tool to evaluate the patterns and prevalence of biomedical research data sharing and reuse
  • 53. Thanks to the Dept of Biomedical Informatics at the U of Pittsburgh, the NLM for funding through training grant 5 T15 LM007059-22, and everyone who publishes “gold” open access, thereby facilitating reuse of article full text for studies like this. My shared data: www.dbmi.pitt.edu/piwowar Share your research data too!
  • 54.
  • 55. Our manual filter for additional positive classifications identified more cases in some databases than others: we reclassified 19% of [article,database] cases from ArrayExpress as positive despite an omitted literature link, compared to 11%, 7%, 2%, and 1% for GEO, Genbank, PDB, and SMD respectively (see Table 2 for raw number of cases). The most common situations included: the database entry listed a citation for another paper by the same authors, the entry listed an erroneous PubMed ID, the entry included a citation without a PubMed ID, or the entry had a blank citation field.
  • 56. Usage? • scientists looking for datasets for reuse • curators looking for primary citations • researchers studying data sharing behaviour
  • 57. Regular expression • Precise one + • r"(\b(accession.{0,20}(for|at).{0,100}(is|are)))", • r"(\b(raw|original|our|complete|detailed).{0,20}data)", • r"(\b(we|have|is|was|were|is|are|be|have|has|been).(exported|gave|given|listed|provided|reported))" • ]) + ")"
  • 58. Precise Regular expression • (we|have|has|is|are|was|were|be|been) + (accessioned|added|archived|assigned|deposited|entered|imported|included|inserted|loaded|lodged|placed|posted|provided|registered|reported.to|stored|submitted|uploaded.to) • (is|are|will.be|made).{0,20}(available|accessible) • (be).(accessed|browsed|downloaded|found|obtained|queried|retrieved|searched|viewed) • (through|under|as).{0,20}accession • (given|new|received|assigned).{0,20}(accession) • (data.{0,20}availability|for public distribution|for.{0,20}release upon publication|for the.{0,20}data.{0,20}generated|from this study have.{0,20}accession|data.{0,10}from this study|access to.{0,20}data.
  • 62. Evaluation • queried 24 000 articles across 27 journals • 25% mentioned one of the database names • development set of 4434, training set of 2000, test set of 1028
  • 63. Research data http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
  • 65. Research data PAST MEDICAL HISTORY: Past medical history showed she had superficial phlebitis times two in the past, had non-insulin dependent diabetes mellitus for four years. She had been hypothyroid for three years. HISTORY OF PRESENT ILLNESS: The patient is a 58-year-old female, … http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441