Improving Interoperability of Text Mining Tools with BioC
1. We would like to thank Don Comeau, Rezarta Doğan, and John Wilbur
for their discussion and help with the BioC tools. This research was
supported by the Intramural Research Program of the NIH, National
Library of Medicine.
Acknowledgments
Wei, C. H., Harris, B. R., Kao, H. Y., Lu, Z. (2013) tmVar: A text mining approach for extracting sequence variants in biomedical literature, Bioinformatics, 29 (11), 1433-1439.
Wei, C.H., Kao, H.Y., Lu, Z. (2012) SR4GN: a species recognition software tool for gene normalization. PloS one, 7, e38460.
Leaman, R., Islamaj Dogan, R., Lu, Z. (2013) DNorm: Disease Name Normalization with Pairwise Learning to Rank. Bioinformatics.
Wei, C.H., Kao, H.Y., Lu, Z. (2013) PubTator: a web-based text mining tool for assisting biocuration. Nucleic acids research, 41, W518-522.
Wei, C.H., Kao, H.Y. (2011) Cross-species gene normalization by species inference. BMC bioinformatics, 12 Suppl 8, S5.
References
The lack of interoperability among text mining tools is a major bottleneck in creating more complex applications. Despite the availability of numerous methods and techniques for various text
mining tasks, combining different tools requires substantial efforts and time. In response, BioC offers a minimalistic approach to tool interoperability by stipulating minimal changes to
existing tools and applications. In this study, we introduce several state-of-the-art text mining tools developed at the NCBI, and modify these tools to make them BioC compatible. Our
toolkit can be accessed at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/.
The NCBI Text Mining Toolkit
Improving Interoperability of Text Mining Tools with BioC
Ritu Khare, Chih-Hsuan Wei, Yuqing Mao, Robert Leaman, Zhiyong Lu
National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD, 20894
Abstract
DNormDNorm
tmVartmVar
SR4GNSR4GN
tmChemtmChem
GenNormGenNorm
PubMed
Abstract
Disease
Mentions
with
MEDIC
IDs
Mutation
Mentions
Species
Mentions
with
Taxonomy
IDs
Chemical
Mentions
Gene
Mentions
with
Entrez
IDs
Annotations
for
Various
BioConcepts
Concept
Recognition
and
Annotation
Toolkit
PubMed
Abstracts
or
Full-‐Text
Articles
NER tools Bioconcept
Programming
Language(s)
Method F-measure
tmChem Chemical Java, Perl, C++ CRF 88.27%
DNorm Disease Java CRF 80.90%
tmVar Mutation Perl, C++ CRF 91.39%
SR4GN Species Perl Rule based 85.42%
GenNorm Gene Perl Statistical 92.89%
Tools Bioconcept
PubMed/
PMC XML
BioC Free Text PubTator GenNorm
tmChem Chemical √ √ √
DNorm Disease √ √ √
tmVar Mutation √ √ √ √
SR4GN Species √ √ √ √
GenNorm Gene √ √ √ √
PubTator N/A √ √ √
Table 1. Summary of Concept Recognition Tools
Table 2. Compatible input/output formats
Building BioC Compatible Versions
Figure 1. The NCBI Toolkit Figure 2. Tool Features
² tmChem achieved the best performance in BioCreative IV CHEMDNER task on
chemical entity mention recognition.
² GenNorm achieved the best performance in BioCreative III Gene Normalization task
² DNorm achieved the best performance in 2013 ShARe/CLEF shared task for
normalizing disease names in clinical notes
Conclusions
BioC was easy to learn and straightforward to implement. Only minimal changes were required
to re-package the NCBI toolkit with BioC. Our tools are now interoperable with each other, and
with several other tools to build more powerful text mining applications and offer wider usage.
Figure 3. BioC Input and Output Format
BioC comprises: XML format describing how to present text documents and annotations, and
functions to read/write documents in the BioC XML format.
Steps to build BioC compatible tools: Modify the input/output format of the tool and create a
key file to interpret the BioC annotation file.
Table 2. Input/Output formats supported by our tools
Offset
Identifiers
Mentions
Types