Acknowledgments
We would like to thank Don Comeau, Rezarta Doğan, and John Wilbur for their discussion and help with the BioC tools. This research was supported by the Intramural Research Program of the NIH, National Library of Medicine.
References
Wei, C.H., Harris, B.R., Kao, H.Y., Lu, Z. (2013) tmVar: a text mining approach for extracting sequence variants in biomedical literature. Bioinformatics, 29 (11), 1433-1439.
Wei, C.H., Kao, H.Y., Lu, Z. (2012) SR4GN: a species recognition software tool for gene normalization. PLoS ONE, 7, e38460.
Leaman, R., Islamaj Dogan, R., Lu, Z. (2013) DNorm: disease name normalization with pairwise learning to rank. Bioinformatics.
Wei, C.H., Kao, H.Y., Lu, Z. (2013) PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Research, 41, W518-522.
Wei, C.H., Kao, H.Y. (2011) Cross-species gene normalization by species inference. BMC Bioinformatics, 12 Suppl 8, S5.
Improving Interoperability of Text Mining Tools with BioC
Ritu Khare, Chih-Hsuan Wei, Yuqing Mao, Robert Leaman, Zhiyong Lu
National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD, 20894

Abstract
The lack of interoperability among text mining tools is a major bottleneck in creating more complex applications. Despite the availability of numerous methods and techniques for various text mining tasks, combining different tools requires substantial effort and time. In response, BioC offers a minimalistic approach to tool interoperability that requires only minimal changes to existing tools and applications. In this study, we introduce several state-of-the-art text mining tools developed at the NCBI and modify them to make them BioC compatible. Our toolkit can be accessed at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/.

The NCBI Text Mining Toolkit
[Figure: PubMed abstracts or full-text articles are the input to the concept recognition and annotation toolkit (DNorm, tmVar, SR4GN, tmChem, GenNorm), which outputs annotations for various bioconcepts: disease mentions with MEDIC IDs, mutation mentions, species mentions with Taxonomy IDs, chemical mentions, and gene mentions with Entrez IDs.]
Table 1. Summary of Concept Recognition Tools

NER tool   Bioconcept   Programming language(s)   Method        F-measure
tmChem     Chemical     Java, Perl, C++           CRF           88.27%
DNorm      Disease      Java                      CRF           80.90%
tmVar      Mutation     Perl, C++                 CRF           91.39%
SR4GN      Species      Perl                      Rule based    85.42%
GenNorm    Gene         Perl                      Statistical   92.89%

Table 2. Compatible input/output formats

Tools   Bioconcept   PubMed/PMC XML   BioC   Free Text   PubTator   GenNorm
tmChem Chemical √ √ √
DNorm Disease √ √ √
tmVar Mutation √ √ √ √
SR4GN Species √ √ √ √
GenNorm Gene √ √ √ √
PubTator N/A √ √ √
Building BioC Compatible Versions
Figure 1. The NCBI Toolkit
Figure 2. Tool Features
• tmChem achieved the best performance in the BioCreative IV CHEMDNER task on chemical entity mention recognition.
• GenNorm achieved the best performance in the BioCreative III Gene Normalization task.
• DNorm achieved the best performance in the 2013 ShARe/CLEF shared task for normalizing disease names in clinical notes.
Conclusions
BioC was easy to learn and straightforward to implement. Only minimal changes were required to re-package the NCBI toolkit with BioC. Our tools are now interoperable with each other and with several other tools, which makes it possible to build more powerful text mining applications and supports wider usage.
Figure 3. BioC Input and Output Format
BioC comprises two parts: an XML format that describes how to represent text documents and annotations, and functions to read and write documents in the BioC XML format.
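As an illustration of the read side, the following is a minimal sketch that pulls mentions, types, identifiers, and offsets out of a small BioC XML document using only the Python standard library. It is not the BioC library itself: the sample document, the function name `read_annotations`, and the infon keys `type` and `identifier` are assumptions made for the example, and a real tool's key file defines which infon keys are actually used.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the element layout follows BioC XML
# (collection > document > passage > annotation), the values are illustrative.
SAMPLE_BIOC = """<collection>
  <source>PubMed</source>
  <date>20130101</date>
  <key>example.key</key>
  <document>
    <id>12345</id>
    <passage>
      <infon key="type">abstract</infon>
      <offset>0</offset>
      <text>BRCA1 mutations are linked to breast cancer.</text>
      <annotation id="0">
        <infon key="type">Gene</infon>
        <infon key="identifier">672</infon>
        <location offset="0" length="5"/>
        <text>BRCA1</text>
      </annotation>
      <annotation id="1">
        <infon key="type">Disease</infon>
        <infon key="identifier">D001943</infon>
        <location offset="30" length="13"/>
        <text>breast cancer</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def read_annotations(xml_text):
    """Return one dict per annotation found in a BioC XML string."""
    root = ET.fromstring(xml_text)
    rows = []
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for annotation in document.iter("annotation"):
            # Infons are key/value pairs; the keys used here are assumed.
            infons = {i.get("key"): i.text for i in annotation.findall("infon")}
            location = annotation.find("location")
            rows.append({
                "document": doc_id,
                "mention": annotation.findtext("text"),
                "type": infons.get("type"),
                "identifier": infons.get("identifier"),
                "offset": int(location.get("offset")),
                "length": int(location.get("length")),
            })
    return rows


if __name__ == "__main__":
    for row in read_annotations(SAMPLE_BIOC):
        print(row)
```

Each returned dict carries the four pieces of information a downstream tool usually needs from an annotation: where it is (offset and length), what text it covers, its concept type, and its normalized identifier.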
Building a BioC-compatible tool takes two steps: modify the input/output format of the tool, and create a key file that explains how to interpret the BioC annotation file.
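The first step can be sketched as a thin output adapter. The example below assumes a tool that already emits one tab-separated line per annotation (start, end, mention, type, identifier, similar in spirit to PubTator-style output) and wraps those lines in BioC XML. The function name `to_bioc`, the default `source`, `date`, and `key` values, and the infon keys are hypothetical; the key file itself is a separate description that this sketch does not generate.

```python
import xml.etree.ElementTree as ET


def to_bioc(doc_id, text, tab_lines, source="PubMed", date="20130101",
            key="example.key"):
    """Wrap tab-separated annotations for one passage in BioC XML.

    Each line in tab_lines is assumed to be: start, end, mention, type,
    identifier. All default argument values are placeholders.
    """
    collection = ET.Element("collection")
    ET.SubElement(collection, "source").text = source
    ET.SubElement(collection, "date").text = date
    ET.SubElement(collection, "key").text = key

    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = doc_id

    passage = ET.SubElement(document, "passage")
    ET.SubElement(passage, "offset").text = "0"
    ET.SubElement(passage, "text").text = text

    for number, line in enumerate(tab_lines):
        start, end, mention, concept_type, identifier = line.split("\t")
        annotation = ET.SubElement(passage, "annotation", id=str(number))
        # The infon keys below are assumptions; the key file documents them.
        ET.SubElement(annotation, "infon", key="type").text = concept_type
        ET.SubElement(annotation, "infon", key="identifier").text = identifier
        ET.SubElement(annotation, "location", offset=start,
                      length=str(int(end) - int(start)))
        ET.SubElement(annotation, "text").text = mention

    return ET.tostring(collection, encoding="unicode")


if __name__ == "__main__":
    sentence = "BRCA1 mutations are linked to breast cancer."
    print(to_bioc("12345", sentence, ["0\t5\tBRCA1\tGene\t672"]))
```

Keeping the adapter separate from the recognition code is one way to honor the "minimal changes" goal: the tool's core logic stays untouched and only its output is re-serialized.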
Table 2. Input/Output formats supported by our tools
[Figure 3 callouts: Offset, Identifiers, Mentions, Types]
