Linking database submissions to primary citations with PubMed Central
Heather A. Piwowar and Wendy W. Chapman
Department of Biomedical Informatics, University of Pittsburgh


Background: Dataset submissions are growing exponentially. Links between dataset submissions and primary literature that describe the data collection are useful for many reasons: rich documentation, proper attribution, improved information retrieval, and enhanced text/data integration for analysis. Unfortunately, many database submissions do not include primary citation links, as database submissions are often made prior to publication. We suggest that automated tools can be developed to help identify links between dataset submissions and the primary literature. These tools require full text to differentiate cases of data sharing from data reuse and other contexts. In this study, we explore the possibility that deep analysis of full text may not be necessary, thereby enabling the querying of all reports in PubMed Central.

Methods: We trained machine learning tree and rule-based classifiers on full-text open-access article unigram vectors, with the existence of a primary citation link from NCBI's Gene Expression Omnibus (GEO) database submission records as the binary output class. We manually combined and simplified the classifier trees and rules to create a query compatible with the interface for PubMed Central.

Results: The query identified 40% of non-OA articles with dataset submission links from GEO (recall), and 65% of the returned articles without dataset submission links were manually judged to include statements of dataset deposit despite having no link from the database (applicable precision).

Conclusion: We hope this work inspires future enhancements, and highlights the opportunities for simple full-text queries in PubMed Central given the mandated influx of NIH-funded research reports.

Introduction

The expected deluge of full-text biomedical research articles into PubMed Central (PMC), as mandated by recent NIH policy [1], creates many opportunities for improving research tools and processes. Most biomedical text mining and natural language processing (NLP) has been limited to titles and abstracts: these are available in abundance in PubMed. Analysis of machine-readable full text would permit a much deeper and wider scope of study, but assembling a corpus has been hindered by the complex, disparate, and decentralized access processes and licenses of publisher websites. While PMC does not permit automated downloading of non-Open Access full text (as per publisher licenses), full text can be queried from the PMC interface. Integrating the ability to query the full text of all future NIH-funded research reports in combination with MeSH terms and other NCBI Entrez database links offers exciting possibilities.

In this report, we explore the potential of one such application: linking articles that describe data collection to their database submission entries. Databases that store research datasets often include citation links to the articles that describe the initial generation and use of the datasets. As we discuss below, these links are valuable, often missing, and time-consuming to manually derive. We previously developed several NLP systems to identify declarations of database submission within research articles [2]; however, these systems required access to complete full text for feature extraction. To take advantage of the PMC resource, here we develop a system restricted to rules that can be expressed within the PMC query interface.

We apply our system to gene expression microarray studies deposited in NCBI's Gene Expression Omnibus (GEO) database [3]. Gene expression data are expensive to collect, often but not always shared, and valuable for reuse. The GEO database is the largest repository for gene expression datasets, is well integrated with PMC query results, and contains links from submitted datasets to primary citation reports.

Methods

Our goal was to develop a PMC query for retrieving articles that mention depositing a dataset into GEO. We developed the query using a selection of Open Access (OA) articles, and evaluated it on non-OA articles.

We used a gold standard based on our previous work [2]. Positive cases came from two sources:
all OA articles that were linked from the GEO DataSet primary submission field, plus articles without a primary citation link from the GEO database that were nonetheless judged to have deposited data into GEO. Manual judgment for the selected OA articles was based on reviewing the full-text reports. Negative cases were those articles that were not linked from GEO DataSets and were manually classified as lacking any indication within their full text that they had deposited a dataset into GEO.

We used NCBI's Entrez E-Utilities, PubMed Central, Python, TagHelper Tools [4], and Weka [5] to remove rare words (<40 occurrences) and stopwords, create unigram bag-of-words vectors, automatically select features, and build tree (J48) and rule (PART) machine learning classifiers for a variety of parameter values. We manually derived a PMC-compatible query based on the most robust feature selection and classifier results.

Recall was calculated by determining what percentage of the non-OA (since OA was used in training) PMC articles with links to GEO DataSets were found by the query. We evaluated applicable precision by manually reviewing the non-OA query hits for articles that are not currently linked to GEO DataSets and determining whether they indeed included statements of dataset submission to GEO.

Finally, we compared the current count of NIH-funded, GEO-linked articles in PubMed to those currently within PMC to project the possible impact our query might have once all NIH-funded datasets are deposited in PMC.

Results

The training set was composed of open-access articles, including 550 positive examples (articles that had links from the GEO primary citation fields or were manually determined to have shared data in GEO) and 165 negatives (articles without links from GEO). We combined the rules and tree branches that occurred most frequently across the trained machine learning classifiers to compose the following PubMed Central query:

(geo OR omnibus)
AND microarray
AND "gene expression"
AND accession
NOT (databases
     OR user OR users
     OR (public AND accessed)
     OR (downloaded AND published))

This query retrieved 772 articles, of which 455 were not open access. The results included 385 of the 966 PubMed Central non-OA articles with links to the GEO DataSets ("pmc gds"[filter] NOT "open access"[filter]), for a recall of 40%.

Next, we limited the query to non-OA articles without a PMC link to the GEO DataSets. We manually determined that 44 of the 68 results included a statement of dataset submission to GEO within their full-text report. This indicates an overall query precision of 94% ((385+44)/455) for retrieving articles that have deposited datasets into GEO and an applicable precision of 65% (44/68) for retrieving articles that do not have PMC links but should. Our error analysis of the 24 false positives found that 13 of the articles referenced GEO datasets in the context of dataset reuse rather than submission (including 2 reusing their own work), 4 referenced GEO in the context of platform descriptions rather than datasets, and 5 did not reference the GEO database at all but rather mentioned the word "geo" for another purpose, usually the beta-geo gene.

The PubMed database contains 4291 articles with links from GEO DataSets (in PubMed: "pubmed gds"[filter]). Thus, the addition of an estimated 115 (177*65%) novel true positive links would increase the current number of dataset-submission-to-primary-citation links by about 2.6%.

We also estimated how the query impact might increase once new NIH-funded articles are deposited in PMC. PMC contains 202 articles published in 2007, funded by the NIH, and linked from GEO DataSets. In comparison, the PubMed database contains 596 such articles, almost three times as many. Our query returned 39 hits for NIH-funded articles published in 2007 that were not linked from GEO DataSets. If all NIH-funded articles were in PMC, and if similar patterns exist for microarray papers that share their data but do not currently have links from
the GEO database, we estimate our query could return roughly 117 (39*3) new articles per year identifying data sharing not included in primary citations, of which 76 (117*65%) might be true positives. This would increase the annual count of primary citation links by about 5.5% (1310 NIH and non-NIH PubMed articles with GEO Database links in 2007 + 76 projected additions).

The trivial query, "gene expression omnibus" AND (submitted OR deposited), resulted in a 34% recall and 90% overall precision.

Discussion

Database submissions often include a link to the research article that describes the original data collection conditions and interpretations. Our results suggest a simple query on full text can automatically identify database submission primary citation links with a precision of 94% and recall of 40%. A trivial full-text query identified articles with 90% precision and 34% recall. Precision for the subset of articles without existing links from the GEO database was 65%. The methods we describe can be used to develop queries for identifying primary citations across a wide variety of datatypes and databases.

The approach outlined in this study is much more practical than a complex regular-expression classifier running on article full text. Processing full-text articles requires not only access licenses and reuse permissions (or a limitation to open access content) but also the maintenance of a text repository and classification system. Querying full text through PubMed Central, in contrast, is publicly available, requires no infrastructure beyond an internet connection, and covers all OA and non-OA articles within PMC.

We imagine this query could be used in two ways. It could be used by dataset-seekers, by appending it onto PubMed or PMC queries to find articles with shared datasets. Alternatively, it could be used by biocurators as a tool for identifying primary citations that may be missing from their database submission fields. This latter use has broad implications, which we discuss further below.

Links between shared datasets and primary citations have many purposes. First, the citation serves as rich documentation for the dataset, whether as free text or as meta-data mark-up as illustrated by the BioLit PDB Clone (http://biolit.ucsd.edu/pdb/). Second, the citation provides a crucial mechanism for attributing recognition to the originators of the dataset upon reuse [6]. Third, the citation provides a link for enhanced information retrieval or text/data integration pathways [7,8].

Unfortunately, links to primary citations are often missing from database submission entries because datasets are usually submitted before publication details are known [9]. Evidence suggests that a significant number of links from database submissions to the primary literature may be missing. For example, the PDB data uniformity project of 2000 found that 33% of submission entries lacked a citation. Half of these were recovered automatically using the list of submitter names, 40% through manual searches of PubMed and the Thomson ISI databases, and 10% (3% of total) were presumed to represent work that was never published [10]. More recently, another large-scale PDB remediation project looked at improving the quality of many fields, including primary citations. As of May 2005, 8508 (27%) of the 31663 database submissions required remediation due to inconsistent or missing PubMed IDs and citation information. A report near the end of the remediation process [11] estimated that manual searches found PubMed IDs for 1226 entries, citation information without PubMed IDs for 387 entries, and about 700 (2% of 31663) were presumed unpublished. These examples suggest that a sizeable number of entries may be missing citation fields, and that most of them are recoverable. Unfortunately, these efforts are time-intensive and thus difficult to incorporate into the workflow of busy biocurators [12]. NLP is already being used to aid database curation in a variety of tasks [13], and we believe it can also help biocurators identify missing links to primary citations.

Procedurally, our query results could be manually confirmed and then used to update database records. GEO asks for omitted citations (http://www.ncbi.nlm.nih.gov/geo/info/ucitations.html); we have sent them our findings and they have updated their database to include the missing links identified in this study. Other databases, however, consider the submission record the property of the submitter [14] and are
thus unlikely to add citations without permission. Perhaps in these situations an automated system could be developed to email submitters requesting they add or permit the addition of the citation.

The performance of our query could undoubtedly be improved through systematic refinement [15]. Future work could involve deriving additional cues through bootstrapping and semi-supervised learning, including stemmed words with wildcards, and refining the query based on error analysis (for example, excluding hits on beta-geo). Additional improvements could be achieved if the PMC query capabilities were enhanced. For this application, it would be particularly useful to remove negation and modal verbs from the stop word list (http://www.ncbi.nlm.nih.gov/books/bv.fcgi?highlight=stopwords&rid=helppubmed.table.pubmedhelp.T43) so they could be included within query n-grams.

Linking shared datasets to primary citations increases their value: the datasets become easier to find, easier to understand, easier to responsibly acknowledge, and easier to integrate with other information. Dataset deposits are growing exponentially. As NIH-funded research makes its way to PMC, the opportunity for creating links between datasets and full-text articles increases enormously. We hope this study provides a useful preliminary tool and inspires further research in this area.

Our manual annotation results are available at http://www.dbmi.pitt.edu/piwowar .

Funding

National Library of Medicine (5T15-LM007059-19 to HAP, 1R01-LM009427-01 to WWC)

References

1. NOT-OD-08-033 Revised Policy on Enhancing Public Access to Archived Publications Resulting from NIH-Funded Research.
2. Piwowar, H.A. & Chapman, W.W. Identifying Data Sharing in Biomedical Literature. Available from Nature Precedings <http://hdl.handle.net/10101/npre.2008.1721.1> (2008).
3. Barrett, T., et al. NCBI GEO: mining tens of millions of expression profiles--database and tools update. Nucleic Acids Res 35 (2007).
4. Rose, C.P., et al. Analyzing Collaborative Learning Processes Automatically: Exploiting the Advances of Computational Linguistics in Computer-Supported Collaborative Learning. International Journal of Computer Supported Collaborative Learning (In Press).
5. Witten, I.H. & Frank, E. Data Mining: Practical machine learning tools and techniques, 2nd Edition. Morgan Kaufmann, San Francisco (2005).
6. Compete, collaborate, compel. Nat Genet 39 (2007).
7. Butte, A.J. & Chen, R. Finding disease-related genomic experiments within an international repository: first steps in translational bioinformatics. AMIA Annu Symp Proc, 106-110 (2006).
8. Muller, H.M., Kenny, E.E. & Sternberg, P.W. Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol 2 (2004).
9. Piwowar, H.A. & Chapman, W.W. A review of journal policies for sharing research data. Available from Nature Precedings <http://hdl.handle.net/10101/npre.2008.1700.1> (2008).
10. Bhat, T.N., et al. The PDB data uniformity project. Nucleic Acids Res 29, 214-218 (2001).
11. PDBj News Letter. Volume 7, March 2006 <http://www.pdbj.org/NewsLetter/newsletter_vol7_e.pdf> (2006).
12. Burkhardt, K., Schneider, B. & Ory, J. A biocurator perspective: annotation at the Research Collaboratory for Structural Bioinformatics Protein Data Bank. PLoS Computational Biology 2 (2006).
13. Karamanis, N., et al. Natural Language Processing in aid of FlyBase curators. BMC Bioinformatics 9 (2008).
14. Pennisi, E. DNA DATA: Proposal to 'Wikify' GenBank Meets Stiff Resistance. Science 319, 1598-1599 (2008).
15. Zhang, L., Ajiferuke, I. & Sampson, M. Optimizing search strategies to identify randomized controlled trials in MEDLINE. BMC Medical Research Methodology 6, 23 (2006).
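
Supplementary sketch: composing the query. The PubMed Central query reported in the Results, and its restriction to non-OA articles without GEO DataSets links, could be assembled programmatically along the following lines. This is a minimal illustration and not the code used in the study: the helper names are hypothetical, the endpoint and parameters are those of the public NCBI E-utilities ESearch service, and whether the filter names quoted in the paper still behave the same way in the current PMC interface is an assumption. The sketch only builds the request URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The query reported in the Results section, written on one line.
GEO_DEPOSIT_QUERY = (
    '(geo OR omnibus) AND microarray AND "gene expression" AND accession '
    "NOT (databases OR user OR users OR (public AND accessed) "
    "OR (downloaded AND published))"
)


def non_oa_unlinked(query):
    """Restrict a query to non-OA articles lacking a GEO DataSets link.

    Hypothetical helper; the filter names are the ones quoted in the paper.
    """
    return f'({query}) NOT "open access"[filter] NOT "pmc gds"[filter]'


def esearch_url(term, retmax=0):
    """Build (but do not send) an ESearch request against the PMC database."""
    params = {"db": "pmc", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(esearch_url(non_oa_unlinked(GEO_DEPOSIT_QUERY)))
```

With retmax set to 0 the response would carry only the hit count, which is all that the recall and precision estimates need; the hits themselves would still have to be reviewed manually, as in the study.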

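
Supplementary sketch: checking the reported figures. The recall, precision, and projection percentages in the Results can be re-derived from the counts given there. The short script below does so; all counts are copied from the text, while the variable names and the choice of denominator for the two "increase" percentages (existing links plus projected additions, which is what reproduces the stated 2.6% and 5.5%) are assumptions made for illustration.

```python
# Counts as reported in the Results section.
NON_OA_HITS = 455           # non-OA articles returned by the query
LINKED_NON_OA_TOTAL = 966   # non-OA PMC articles with GEO DataSets links
LINKED_HITS = 385           # of those, returned by the query
UNLINKED_HITS = 68          # non-OA hits without a GEO DataSets link
UNLINKED_TRUE = 44          # of those, judged to state a GEO submission


def pct(numerator, denominator, digits=0):
    """Percentage rounded to the given number of decimal places."""
    return round(100 * numerator / denominator, digits)


recall = pct(LINKED_HITS, LINKED_NON_OA_TOTAL)
overall_precision = pct(LINKED_HITS + UNLINKED_TRUE, NON_OA_HITS)
applicable_precision = pct(UNLINKED_TRUE, UNLINKED_HITS)

# Current impact: 177 candidate articles at 65% applicable precision,
# against 4291 PubMed articles already linked from GEO DataSets.
new_links_now = round(177 * 0.65)
increase_now = pct(new_links_now, 4291 + new_links_now, digits=1)

# Yearly projection: 39 hits scaled by the roughly threefold PubMed/PMC ratio,
# against 1310 PubMed articles with GEO links in 2007.
projected_hits = 39 * 3
projected_true = round(projected_hits * 0.65)
increase_yearly = pct(projected_true, 1310 + projected_true, digits=1)

if __name__ == "__main__":
    print(recall, overall_precision, applicable_precision)
    print(new_links_now, increase_now)
    print(projected_hits, projected_true, increase_yearly)
```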
More Related Content

What's hot

Semantics-based clustering approach for similar research area detection
Semantics-based clustering approach for similar research area detectionSemantics-based clustering approach for similar research area detection
Semantics-based clustering approach for similar research area detectionTELKOMNIKA JOURNAL
 
NRNB Annual Report 2018
NRNB Annual Report 2018NRNB Annual Report 2018
NRNB Annual Report 2018Alexander Pico
 
Context Driven Technique for Document Classification
Context Driven Technique for Document ClassificationContext Driven Technique for Document Classification
Context Driven Technique for Document ClassificationIDES Editor
 
An Efficient Approach for Requirement Traceability Integrated With Software R...
An Efficient Approach for Requirement Traceability Integrated With Software R...An Efficient Approach for Requirement Traceability Integrated With Software R...
An Efficient Approach for Requirement Traceability Integrated With Software R...IOSR Journals
 
DNA Query Language DNAQL: A Novel Approach
DNA Query Language DNAQL: A Novel ApproachDNA Query Language DNAQL: A Novel Approach
DNA Query Language DNAQL: A Novel ApproachEditor IJCATR
 
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...Scaling Down Dimensions and Feature Extraction in Document Repository Classif...
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...ijdmtaiir
 
NRNB Annual Report 2012
NRNB Annual Report 2012NRNB Annual Report 2012
NRNB Annual Report 2012Alexander Pico
 
A Survey on Bioinformatics Tools
A Survey on Bioinformatics ToolsA Survey on Bioinformatics Tools
A Survey on Bioinformatics Toolsidescitation
 
Statistical Analysis based Hypothesis Testing Method in Biological Knowledge ...
Statistical Analysis based Hypothesis Testing Method in Biological Knowledge ...Statistical Analysis based Hypothesis Testing Method in Biological Knowledge ...
Statistical Analysis based Hypothesis Testing Method in Biological Knowledge ...ijcsa
 
Gene Ontology Enrichment Network Analysis -Tutorial
Gene Ontology Enrichment Network Analysis -TutorialGene Ontology Enrichment Network Analysis -Tutorial
Gene Ontology Enrichment Network Analysis -TutorialDmitry Grapov
 
EMPLOYING THE CATEGORIES OF WIKIPEDIA IN THE TASK OF AUTOMATIC DOCUMENTS CLUS...
EMPLOYING THE CATEGORIES OF WIKIPEDIA IN THE TASK OF AUTOMATIC DOCUMENTS CLUS...EMPLOYING THE CATEGORIES OF WIKIPEDIA IN THE TASK OF AUTOMATIC DOCUMENTS CLUS...
EMPLOYING THE CATEGORIES OF WIKIPEDIA IN THE TASK OF AUTOMATIC DOCUMENTS CLUS...IJCI JOURNAL
 
Overall Vision for NRNB: 2015-2020
Overall Vision for NRNB: 2015-2020Overall Vision for NRNB: 2015-2020
Overall Vision for NRNB: 2015-2020Alexander Pico
 
Data Mining in Rediology reports
Data Mining in Rediology reportsData Mining in Rediology reports
Data Mining in Rediology reportsSaeed Mehrabi
 

What's hot (19)

Ceis 1
Ceis 1Ceis 1
Ceis 1
 
Semantics-based clustering approach for similar research area detection
Semantics-based clustering approach for similar research area detectionSemantics-based clustering approach for similar research area detection
Semantics-based clustering approach for similar research area detection
 
NRNB EAC Meeting 2012
NRNB EAC Meeting 2012NRNB EAC Meeting 2012
NRNB EAC Meeting 2012
 
OpenTox Europe 2013
OpenTox Europe 2013OpenTox Europe 2013
OpenTox Europe 2013
 
NRNB Annual Report 2018
NRNB Annual Report 2018NRNB Annual Report 2018
NRNB Annual Report 2018
 
Context Driven Technique for Document Classification
Context Driven Technique for Document ClassificationContext Driven Technique for Document Classification
Context Driven Technique for Document Classification
 
An Efficient Approach for Requirement Traceability Integrated With Software R...
An Efficient Approach for Requirement Traceability Integrated With Software R...An Efficient Approach for Requirement Traceability Integrated With Software R...
An Efficient Approach for Requirement Traceability Integrated With Software R...
 
DNA Query Language DNAQL: A Novel Approach
DNA Query Language DNAQL: A Novel ApproachDNA Query Language DNAQL: A Novel Approach
DNA Query Language DNAQL: A Novel Approach
 
NETTAB 2013
NETTAB 2013NETTAB 2013
NETTAB 2013
 
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...Scaling Down Dimensions and Feature Extraction in Document Repository Classif...
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...
 
NRNB Annual Report 2012
NRNB Annual Report 2012NRNB Annual Report 2012
NRNB Annual Report 2012
 
A Survey on Bioinformatics Tools
A Survey on Bioinformatics ToolsA Survey on Bioinformatics Tools
A Survey on Bioinformatics Tools
 
E0322035037
E0322035037E0322035037
E0322035037
 
Statistical Analysis based Hypothesis Testing Method in Biological Knowledge ...
Statistical Analysis based Hypothesis Testing Method in Biological Knowledge ...Statistical Analysis based Hypothesis Testing Method in Biological Knowledge ...
Statistical Analysis based Hypothesis Testing Method in Biological Knowledge ...
 
Gene Ontology Enrichment Network Analysis -Tutorial
Gene Ontology Enrichment Network Analysis -TutorialGene Ontology Enrichment Network Analysis -Tutorial
Gene Ontology Enrichment Network Analysis -Tutorial
 
NETTAB 2012
NETTAB 2012NETTAB 2012
NETTAB 2012
 
EMPLOYING THE CATEGORIES OF WIKIPEDIA IN THE TASK OF AUTOMATIC DOCUMENTS CLUS...
EMPLOYING THE CATEGORIES OF WIKIPEDIA IN THE TASK OF AUTOMATIC DOCUMENTS CLUS...EMPLOYING THE CATEGORIES OF WIKIPEDIA IN THE TASK OF AUTOMATIC DOCUMENTS CLUS...
EMPLOYING THE CATEGORIES OF WIKIPEDIA IN THE TASK OF AUTOMATIC DOCUMENTS CLUS...
 
Overall Vision for NRNB: 2015-2020
Overall Vision for NRNB: 2015-2020Overall Vision for NRNB: 2015-2020
Overall Vision for NRNB: 2015-2020
 
Data Mining in Rediology reports
Data Mining in Rediology reportsData Mining in Rediology reports
Data Mining in Rediology reports
 

Similar to Full text

BIOLINK 2008: Linking database submissions to primary citations with PubMe...
BIOLINK 2008:    Linking database submissions to primary citations with PubMe...BIOLINK 2008:    Linking database submissions to primary citations with PubMe...
BIOLINK 2008: Linking database submissions to primary citations with PubMe...Heather Piwowar
 
Automatically Generating Wikipedia Articles: A Structure-Aware Approach
Automatically Generating Wikipedia Articles:  A Structure-Aware ApproachAutomatically Generating Wikipedia Articles:  A Structure-Aware Approach
Automatically Generating Wikipedia Articles: A Structure-Aware ApproachGeorge Ang
 
RDA-WDS Publishing Data Interest Group
RDA-WDS Publishing Data Interest GroupRDA-WDS Publishing Data Interest Group
RDA-WDS Publishing Data Interest GroupAnita de Waard
 
NLP_BioAssayPoster
NLP_BioAssayPosterNLP_BioAssayPoster
NLP_BioAssayPosterSuman Lama
 
Pub med+
Pub med+Pub med+
Pub med+RyanRM
 
IJERD(www.ijerd.com)International Journal of Engineering Research and Develop...
IJERD(www.ijerd.com)International Journal of Engineering Research and Develop...IJERD(www.ijerd.com)International Journal of Engineering Research and Develop...
IJERD(www.ijerd.com)International Journal of Engineering Research and Develop...IJERD Editor
 
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD Editor
 
Nucl. Acids Res.-2014-Howe-nar-gku1244
Nucl. Acids Res.-2014-Howe-nar-gku1244Nucl. Acids Res.-2014-Howe-nar-gku1244
Nucl. Acids Res.-2014-Howe-nar-gku1244Yasel Cruz
 
Novel Database-Centric Framework for Incremental Information Extraction
Novel Database-Centric Framework for Incremental Information ExtractionNovel Database-Centric Framework for Incremental Information Extraction
Novel Database-Centric Framework for Incremental Information Extractionijsrd.com
 
A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS
A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS
A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS AIRCC Publishing Corporation
 
A Semantic Retrieval System for Extracting Relationships from Biological Corpus
A Semantic Retrieval System for Extracting Relationships from Biological CorpusA Semantic Retrieval System for Extracting Relationships from Biological Corpus
A Semantic Retrieval System for Extracting Relationships from Biological CorpusAIRCC Publishing Corporation
 
A Semantic Retrieval System for Extracting Relationships from Biological Corpus
A Semantic Retrieval System for Extracting Relationships from Biological CorpusA Semantic Retrieval System for Extracting Relationships from Biological Corpus
A Semantic Retrieval System for Extracting Relationships from Biological Corpusijcsit
 
Towards Ontology Development Based on Relational Database
Towards Ontology Development Based on Relational DatabaseTowards Ontology Development Based on Relational Database
Towards Ontology Development Based on Relational Databaseijbuiiir1
 
Investigating plant systems using data integration and network analysis
Investigating plant systems using data integration and network analysisInvestigating plant systems using data integration and network analysis
Investigating plant systems using data integration and network analysisCatherine Canevet
 
bio data
bio databio data
bio data007dcp
 

Full text

Linking database submissions to primary citations with PubMed Central
Heather A. Piwowar and Wendy W. Chapman
Department of Biomedical Informatics, University of Pittsburgh

Background: Dataset submissions are growing exponentially. Links between dataset submissions and primary literature that describe the data collection are useful for many reasons: rich documentation, proper attribution, improved information retrieval, and enhanced text/data integration for analysis. Unfortunately, many database submissions do not include primary citation links, as database submissions are often made prior to publication. We suggest that automated tools can be developed to help identify links between dataset submissions and the primary literature. These tools require full text to differentiate cases of data sharing from data reuse and other contexts. In this study, we explore the possibility that deep analysis of full text may not be necessary, thereby enabling the querying of all reports in PubMed Central.

Methods: We trained machine learning tree and rule-based classifiers on full-text open-access article unigram vectors, with the existence of a primary citation link from NCBI's Gene Expression Omnibus (GEO) database submission records as the binary output class. We manually combined and simplified the classifier trees and rules to create a query compatible with the interface for PubMed Central.

Results: The query identified 40% of non-OA articles with dataset submission links from GEO (recall), and 65% of the returned articles without dataset submission links were manually judged to include statements of dataset deposit despite having no link from the database (applicable precision).

Conclusion: We hope this work inspires future enhancements, and highlights the opportunities for simple full-text queries in PubMed Central given the mandated influx of NIH-funded research reports.

Introduction

The expected deluge of full-text biomedical research articles into PubMed Central (PMC), as mandated by recent NIH policy [1], creates many opportunities for improving research tools and processes. Most biomedical text mining and natural language processing (NLP) has been limited to titles and abstracts: these are available in abundance in PubMed. Analysis of machine-readable full text would permit a much deeper and wider scope of study, but assembling a corpus has been hindered by the complex, disparate, and decentralized access processes and licenses of publisher websites. While PMC does not permit automated downloading of non-Open Access full text (as per publisher licenses), full text can be queried from the PMC interface. Integrating the ability to query the full text of all future NIH-funded research reports in combination with MeSH terms and other NCBI Entrez database links offers exciting possibilities.

In this report, we explore the potential of one such application: linking articles that describe data collection to their database submission entries. Databases that store research datasets often include citation links to the articles that describe the initial generation and use of the datasets. As we discuss below, these links are valuable, often missing, and time-consuming to manually derive. We previously developed several NLP systems to identify declarations of database submission within research articles [2]; however, these systems required access to complete full text for feature extraction. To take advantage of the PMC resource, here we develop a system restricted to rules that can be expressed within the PMC query interface.

We apply our system to gene expression microarray studies deposited in NCBI's Gene Expression Omnibus (GEO) database [3]. Gene expression data are expensive to collect, often but not always shared, and valuable for reuse. The GEO database is the largest repository for gene expression datasets, is well integrated with PMC query results, and contains links from submitted datasets to primary citation reports.

Methods

Our goal was to develop a PMC query for retrieving articles that mention depositing a dataset into GEO. We developed the query using a selection of Open Access (OA) articles, and evaluated it on non-OA articles.

We used a gold standard based on our previous work [2]. Positive cases came from two sources:
all OA articles that were linked from the GEO DataSet primary submission field, plus articles without a primary citation link from the GEO database that were nonetheless judged to have deposited data into GEO. Manual judgment for the selected OA articles was based on reviewing the full-text reports. Negative cases were considered those articles that were not linked from GEO Datasets and were manually classified as lacking any indication within their full text that they had deposited a dataset into GEO.

We used NCBI's Entrez E-Utilities, PubMed Central, Python, TagHelper Tools [4], and Weka [5] to remove rare words (<40 occurrences) and stopwords, create unigram bag-of-word vectors, automatically select features, and build tree (J48) and rule (PART) machine learning classifiers for a variety of parameter values. We manually derived a PMC-compatible query based on the most robust feature selection and classifier results.

Recall was calculated by determining what percentage of the non-OA (since OA was used in training) PMC articles with links to Gene Expression Datasets were found by the query. We evaluated applicable precision by manually reviewing the non-OA query hits for articles that are not currently linked to GEO datasets and determining whether they indeed included statements of dataset submission to GEO.

Finally, we compared the current count of NIH-funded, GEO-linked articles in PubMed to those currently within PMC to project the possible impact our query might have once all NIH-funded datasets are deposited in PMC.

Results

The training set was composed of open-access articles, including 550 positive examples (articles that had links from the GEO primary citation fields or were manually determined to have shared data in GEO) and 165 negatives (articles without links from GEO). We combined the rules and tree branches that occurred most frequently across the trained machine learning classifiers to compose the following PubMed Central query:

    (geo OR omnibus)
    AND microarray
    AND "gene expression"
    AND accession
    NOT (databases
         OR user OR users
         OR (public AND accessed)
         OR (downloaded AND published))

This query retrieved 772 articles, of which 455 were not open access. The results included 385 of the 966 PubMed Central non-OA articles with links to the GEO Datasets ("pmc gds"[filter] NOT "open access"[filter]), for a recall of 40%.

Next, we limited the query to non-OA articles without a PMC link to the GEO Datasets. We manually determined that 44 of the 68 results included a statement of dataset submission to GEO within their full-text report. This indicates an overall query precision of 94% ((385+44)/455) for retrieving articles that have deposited datasets into GEO and an applicable precision of 65% (44/68) for retrieving articles that don't have PMC links but should. Our error analysis of the 24 false positives found that 13 of the articles referenced GEO datasets in the context of dataset reuse rather than submission (including 2 reusing their own work), 4 referenced GEO in the context of platform descriptions rather than datasets, and 5 didn't reference the GEO database at all but rather mentioned the word "geo" for another purpose, usually the beta-geo gene.

The PubMed database contains 4291 articles with links from GEO DataSets (in PubMed: "pubmed gds"[filter]). Thus, the addition of an estimated 115 (177*65%) novel true positive links would increase the current number of dataset-submission-to-primary-citation links by about 2.6%.

We also estimated how the query impact might increase once new NIH-funded articles are deposited in PMC. PMC contains 202 articles published in 2007, funded by the NIH, and linked from GEO DataSets. In comparison, the PubMed database contains 596 such articles, almost three times as many. Our query returned 39 hits for NIH-funded articles published in 2007 that were not linked from GEO DataSets. If all NIH-funded articles were in PMC, and if similar patterns exist for microarray papers that share their data but do not currently have links from
the GEO database, we estimate our query could return roughly 117 (39*3) new articles per year identifying data sharing not included in primary citations, of which 76 (117*65%) might be true positives. This would increase the annual count of primary citation links by about 5.5% (1310 NIH and non-NIH PubMed articles with GEO Database links in 2007 + 76 projected additions).

The trivial query, "gene expression omnibus" AND (submitted OR deposited), resulted in a 34% recall and 90% overall precision.

Discussion

Database submissions often include a link to the research article that describes the original data collection conditions and interpretations. Our results suggest a simple query on full text can automatically identify database submission primary citation links with a precision of 94% and recall of 40%. A trivial full-text query identified articles with 90% precision and 34% recall. Precision for the subset of articles without existing links from the GEO database was 65%. The methods we describe can be used to develop queries for identifying primary citations across a wide variety of datatypes and databases.

The approach outlined in this study is much more practical than a complex regular-expression classifier running on article full text. Processing full-text articles requires not only access licenses and reuse permissions (or a limitation to open access content) but also the maintenance of a text repository and classification system. Querying full text through PubMed Central, in contrast, is publicly available, requires no infrastructure beyond an internet connection, and covers all OA and non-OA articles within PMC.

We imagine this query could be used in two ways. It could be used by dataset-seekers, by appending it onto PubMed or PMC queries to find articles with shared datasets. Alternatively, it could be used by biocurators as a tool for identifying primary citations that may be missing from their database submission fields. This latter use has broad implications, which we discuss further below.

Links between shared datasets and primary citations have many purposes. First, the citation serves as rich documentation for the dataset, whether as free text or as meta-data mark-up as illustrated by the BioLit PDB Clone (http://biolit.ucsd.edu/pdb/). Second, the citation provides a crucial mechanism for attributing recognition to the originators of the dataset upon reuse [6]. Third, the citation provides a link for enhanced information retrieval or text/data integration pathways [7,8].

Unfortunately, links to primary citations are often missing from database submission entries because datasets are usually submitted before publication details are known [9]. Evidence suggests that a significant number of links from database submissions to the primary literature may be missing. For example, the PDB data uniformity project of 2000 found that 33% of submission entries lacked a citation. Half of these were recovered automatically using the list of submitter names, 40% through manual searches of PubMed and the Thomson ISI databases, and 10% (3% of total) were presumed to represent work that was never published [10]. More recently, another large-scale PDB remediation project looked at improving the quality of many fields, including primary citations. As of May 2005, 8508 (27%) of the 31663 database submissions required remediation due to inconsistent or missing PubMed IDs and citation information. A report near the end of the remediation process [11] estimated that manual searches found PubMed IDs for 1226 entries, citation information without PubMed IDs for 387 entries, and about 700 (2% of 31663) were presumed unpublished. These examples suggest that a sizeable number of entries may be missing citation fields, and that most of them are recoverable. Unfortunately, these efforts are time-intensive and thus difficult to incorporate into the workflow of busy biocurators [12]. NLP is already being used to aid database curation in a variety of tasks [13], and we believe it can also help biocurators identify missing links to primary citations.

Procedurally, our query results could be manually confirmed and then used to update database records. GEO asks for omitted citations (http://www.ncbi.nlm.nih.gov/geo/info/ucitations.html); we have sent them our findings and they have updated their database to include the missing links identified in this study. Other databases, however, consider the submission record the property of the submitter [14] and are
thus unlikely to add citations without permission. Perhaps in these situations an automated system could be developed to email submitters requesting they add or permit the addition of the citation.

The performance of our query could undoubtedly be improved through systematic refinement [15]. Future work could involve deriving additional cues through bootstrapping and semi-supervised learning, including stemmed words with wildcards, and refining the query based on error analysis (for example, excluding hits on beta-geo). Additional improvements could be achieved if the PMC query capabilities were enhanced. For this application, it would be particularly useful to remove negation and modal verbs from the stop word list (http://www.ncbi.nlm.nih.gov/books/bv.fcgi?highlight=stopwords&rid=helppubmed.table.pubmedhelp.T43) so they could be included within query n-grams.

Linking shared datasets to primary citations increases their value: the datasets become easier to find, easier to understand, easier to responsibly acknowledge, and easier to integrate with other information. Dataset deposits are growing exponentially. As NIH-funded research makes its way to PMC, the opportunity for creating links between datasets and full-text articles increases enormously. We hope this study provides a useful preliminary tool and inspires further research in this area.

Our manual annotation results are available at http://www.dbmi.pitt.edu/piwowar .

Funding

National Library of Medicine (5T15-LM007059-19 to HAP, 1R01-LM009427-01 to WWC)

References

1. NOT-OD-08-033 Revised Policy on Enhancing Public Access to Archived Publications Resulting from NIH-Funded Research.
2. Piwowar, H.A. & Chapman, W.W. Identifying Data Sharing in Biomedical Literature. Available from Nature Precedings <http://hdl.handle.net/10101/npre.2008.1721.1> (2008).
3. Barrett, T., et al. NCBI GEO: mining tens of millions of expression profiles--database and tools update. Nucleic Acids Res 35 (2007).
4. Rose, C.P., et al. Analyzing Collaborative Learning Processes Automatically: Exploiting the Advances of Computational Linguistics in Computer-Supported Collaborative Learning. International Journal of Computer Supported Collaborative Learning (In Press).
5. Witten, I.H. & Frank, E. Data Mining: Practical machine learning tools and techniques, 2nd Edition. Morgan Kaufmann, San Francisco (2005).
6. Compete, collaborate, compel. Nat Genet 39 (2007).
7. Butte, A.J. & Chen, R. Finding disease-related genomic experiments within an international repository: first steps in translational bioinformatics. AMIA Annu Symp Proc, 106-110 (2006).
8. Muller, H.M., Kenny, E.E. & Sternberg, P.W. Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol 2 (2004).
9. Piwowar, H.A. & Chapman, W.W. A review of journal policies for sharing research data. Available from Nature Precedings <http://hdl.handle.net/10101/npre.2008.1700.1> (2008).
10. Bhat, T.N., et al. The PDB data uniformity project. Nucleic Acids Res 29, 214-218 (2001).
11. PDBj News Letter. Volume 7, March 2006 <http://www.pdbj.org/NewsLetter/newsletter_vol7_e.pdf> (2006).
12. Burkhardt, K., Schneider, B. & Ory, J. A biocurator perspective: annotation at the Research Collaboratory for Structural Bioinformatics Protein Data Bank. PLoS Computational Biology 2 (2006).
13. Karamanis, N., et al. Natural Language Processing in aid of FlyBase curators. BMC Bioinformatics 9 (2008).
14. Pennisi, E. DNA DATA: Proposal to 'Wikify' GenBank Meets Stiff Resistance. Science 319, 1598-1599 (2008).
15. Zhang, L., Ajiferuke, I. & Sampson, M. Optimizing search strategies to identify randomized controlled trials in MEDLINE. BMC Medical Research Methodology 6, 23 (2006).
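
Worked check of the reported figures. The recall, precision, and projection percentages in the Results section follow from the stated counts, and the sketch below recomputes them in Python. It rests on a few assumptions that are not spelled out in the text: the variable and function names are illustrative, the figure of 177 candidate articles is taken as given, and the "about 2.6%" increase is assumed to be computed the same way as the 5.5% figure, that is, new links divided by existing plus new links.

```python
# Counts as reported in the Results section.
non_oa_hits = 455          # query hits that were not open access
linked_hits = 385          # non-OA hits already linked from GEO DataSets
linked_total = 966         # all non-OA PMC articles linked from GEO DataSets
unlinked_reviewed = 68     # non-OA hits without a GEO link, manually reviewed
unlinked_confirmed = 44    # reviewed hits that did state a GEO submission

recall = linked_hits / linked_total
overall_precision = (linked_hits + unlinked_confirmed) / non_oa_hits
applicable_precision = unlinked_confirmed / unlinked_reviewed


def relative_increase(new_links, existing_links):
    """New links as a share of existing plus new links (assumed formula)."""
    return new_links / (existing_links + new_links)


# Current impact: 177 candidates (taken as given) at 65% applicable precision.
new_links_now = round(177 * 0.65)
increase_now = relative_increase(new_links_now, 4291)

# Yearly projection: 39 hits scaled by about 3 (596 vs. 202 articles for 2007).
projected_hits = 39 * 3
new_links_yearly = round(projected_hits * 0.65)
increase_yearly = relative_increase(new_links_yearly, 1310)

print(f"recall:               {recall:.1%}")
print(f"overall precision:    {overall_precision:.1%}")
print(f"applicable precision: {applicable_precision:.1%}")
print(f"new links now:        {new_links_now} ({increase_now:.1%})")
print(f"new links per year:   {new_links_yearly} ({increase_yearly:.1%})")
```

Under these assumptions the recomputed values round to the percentages quoted in the text. Note that the overall precision only reaches 94% when the sum 385+44 is divided by 455 as a whole.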