Ritu Khare, Chih-Hsuan Wei, Yuqing Mao, Robert Leaman, Zhiyong Lu
National Center for Biotechnology Information (NCBI)
National Institutes of Health
1	
  
• Motivation
• Our Text Mining Tools
• Building BioC Compatible Tools
• Results and Conclusions
2	
  
• Building complex text mining applications requires combining different tools developed by different groups
• Each tool is developed independently
  - Group conventions: data representation, programming, execution environments
• Heterogeneity in data/text representations limits and slows down
  - tool interoperability, application development, and research and innovation.
3	
  
EXISTING SOLUTIONS
• Unstructured Information Management Architecture (UIMA), 2004
• General Architecture for Text Engineering (GATE), 2009
• Steep learning curve
• Substantial development and re-development time

BIOC
• Minimal change requirement to existing applications and datasets
• BioC family
  - XML formats to present text documents and annotations
  - Functions (C++, Java) to read/write documents in BioC format
4	
  
• Motivation
• Our Text Mining Tools
• Building BioC Compatible Tools
• Results and Conclusions
5	
  
6	
  
Concept Recognition and Annotation Toolkit
Input: PubMed abstracts or full-text articles
• DNorm: disease mentions with MEDIC IDs (F-measure = 80.90%)
• tmVar: mutation mentions (F-measure = 91.39%)
• SR4GN: species mentions with Taxonomy IDs (F-measure = 85.42%)
• tmChem: chemical mentions (F-measure = 88.27%)
• GenNorm: gene mentions with Entrez IDs (F-measure = 92.89%)
Output: annotations for various bioconcepts
NER tool | Programming language | Method | Supported formats (among PubMed/PMC XML, Free Text, PubTator Format, GenNorm Format)
tmChem (Chemical) | Java, Perl, C++ | CRF | √ √
DNorm (Disease) | Java | CRF | √ √
tmVar (Mutation) | Perl, C++ | CRF | √ √ √
SR4GN (Species) | Perl | Rule-based | √ √ √
GenNorm (Gene) | Perl | Statistical | √ √ √
PubTator | Perl, JavaScript | Web server | √ √
7	
  
8	
  
• Official corpus for BioCreative IV GO Task
• 200 full-text articles along with their Gene Ontology (GO) annotations
  - evidence sentences
  - gene/protein entities, GO terms, GO evidence codes
• Developed by expert GO curators via a web-based annotation tool.
9	
  
• Motivation
• The NCBI Text Mining Toolkit
• Building BioC Compatible Tools
• Results and Conclusions
10	
  
• The BioC family
  - XML DTD
    - how to present text documents and annotations (higher-level semantics)
  - C++ and Java libraries
    - functions/classes to read/write documents in BioC format
• BioC recommendations
  - Full-text articles and annotations
    - Present in BioC XML format
    - Keep in separate files
  - Key file
    - describes how data should be interpreted in the annotation file (lower-level semantics)
    - needs to be created for a specific type of data.
11	
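As a rough illustration of the structure outlined above, the following sketch reads a small BioC-style collection with Python's standard `xml.etree.ElementTree` module rather than the C++ or Java libraries named on the slide. The element names follow the published BioC DTD; the PMID, the key file name, the text, and the function name `read_annotations` are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical BioC collection: the PMID, key file name, and text are made up.
BIOC_XML = """<collection>
  <source>PubMed</source>
  <date>2013-10-07</date>
  <key>example.key</key>
  <document>
    <id>12345678</id>
    <passage>
      <infon key="type">title</infon>
      <offset>0</offset>
      <text>BRCA1 mutations in breast cancer</text>
      <annotation id="0">
        <infon key="type">Gene</infon>
        <location offset="0" length="5"/>
        <text>BRCA1</text>
      </annotation>
      <annotation id="1">
        <infon key="type">Disease</infon>
        <location offset="19" length="13"/>
        <text>breast cancer</text>
      </annotation>
    </passage>
  </document>
</collection>
"""


def read_annotations(xml_text):
    """Return (document id, concept type, offset, length, mention) tuples."""
    root = ET.fromstring(xml_text)
    rows = []
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for passage in document.iter("passage"):
            passage_offset = int(passage.findtext("offset"))
            passage_text = passage.findtext("text")
            for annotation in passage.iter("annotation"):
                location = annotation.find("location")
                start = int(location.get("offset"))
                length = int(location.get("length"))
                concept_type = annotation.find("infon[@key='type']").text
                # Annotation offsets are document-level, so subtract the
                # passage offset before slicing the passage text.
                local = start - passage_offset
                mention = passage_text[local:local + length]
                rows.append((doc_id, concept_type, start, length, mention))
    return rows


for row in read_annotations(BIOC_XML):
    print(row)
```

The mention is re-sliced from the passage text instead of being read from the annotation's own `<text>` element, which doubles as a quick consistency check on the offsets.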
  
• Steps taken to make our tools comply with BioC
  - Created the key file
  - Modified the input/output formats of the tools
    - Added the BioC format as a new option for input/output
• Challenges
  - Defining an appropriate key file
  - Offset calculation
  - Translating the web-based annotation file to a BioC annotation file (Unicode to ASCII conversion)
12	
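The offset and Unicode-to-ASCII challenges listed above interact: once characters are replaced or expanded, positions in the converted text no longer line up with the original. The sketch below shows one possible way to keep them aligned. It is not the authors' procedure; the function names, the `?` placeholder, and the sample string are assumptions made for illustration.

```python
import unicodedata


def to_ascii_with_map(text):
    """Convert text to ASCII and record where each output character came from.

    Returns (ascii_text, index_map), where index_map[i] is the position in
    `text` of the character that produced ascii_text[i].
    """
    out = []
    index_map = []
    for pos, ch in enumerate(text):
        # NFKD splits accented letters into a base letter plus combining
        # marks and expands compatibility characters such as ligatures.
        decomposed = unicodedata.normalize("NFKD", ch)
        ascii_part = decomposed.encode("ascii", "ignore").decode("ascii")
        if not ascii_part:
            ascii_part = "?"  # assumed placeholder for unmappable characters
        for c in ascii_part:
            out.append(c)
            index_map.append(pos)
    return "".join(out), index_map


def map_span_back(index_map, offset, length):
    """Map an (offset, length) span on the ASCII text back to the original."""
    start = index_map[offset]
    end = index_map[offset + length - 1] + 1
    return start, end - start


# U+FB01 is the "fi" ligature; U+011F is g with breve.
original = "\ufb01brosis in Do\u011fan"
ascii_text, index_map = to_ascii_with_map(original)
start, length = map_span_back(index_map, 0, 8)  # span of "fibrosis" in ASCII
print(ascii_text, (start, length), original[start:start + length])
```

Whether offsets count characters or bytes is a convention that a key file would need to state; this sketch counts Python string positions (code points).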
  
• Motivation
• Our Text Mining Tools
• Building BioC Compatible Tools
• Results and Conclusions
13	
  
• Common key file for all tools, since they are designed for similar types of data
14	
  
• id: PubMed ID
• passage: e.g., title, abstract
• Offset of the passage
• ID of the bioconcept
• Offset of the bioconcept
• Length of the bioconcept
• Mention of the bioconcept
• date: the time the annotation was created
NER tool | Bioconcept | Supported formats (among PubMed/PMC XML, BioC, Free Text, PubTator, GenNorm)
tmChem | Chemical | √ √ √
DNorm | Disease | √ √ √
tmVar | Mutation | √ √ √ √
SR4GN | Species | √ √ √ √
GenNorm | Gene | √ √ √ √
PubTator | N/A | √ √ √
15	
  
Our text mining toolkit is available for public access:
http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/
16	
  
Input: BioC article file
• DNorm: identifying disease
• tmVar: identifying mutation
• tmChem: identifying chemical
• SR4GN: identifying species
• GenNorm: identifying gene
Output: BioC annotation file
17	
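The flow on this slide, one article file passed through several concept taggers whose results end up in a single annotation file, can be sketched as follows. The taggers here are toy dictionary matchers standing in for DNorm, GenNorm, and SR4GN; they do not reproduce those tools, and every class, function, and term list is hypothetical. The example also assumes that the abstract's offset is the title length plus one separator character.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    concept_type: str
    offset: int  # document-level offset
    length: int
    text: str


@dataclass
class Passage:
    kind: str    # e.g. "title" or "abstract"
    offset: int  # document-level offset of the passage start
    text: str
    annotations: list = field(default_factory=list)


def make_dictionary_tagger(concept_type, terms):
    """Build a toy tagger that finds case-insensitive dictionary matches."""
    def tag(text):
        hits = []
        lowered = text.lower()
        for term in terms:
            needle = term.lower()
            start = lowered.find(needle)
            while start != -1:
                mention = text[start:start + len(needle)]
                hits.append((start, len(needle), mention, concept_type))
                start = lowered.find(needle, start + 1)
        return hits
    return tag


def run_pipeline(passages, taggers):
    """Apply every tagger to every passage and collect the annotations."""
    for passage in passages:
        for tag in taggers:
            for start, length, mention, concept_type in tag(passage.text):
                # Taggers report passage-local positions; shift them by the
                # passage offset so all annotations share one coordinate system.
                passage.annotations.append(
                    Annotation(concept_type, passage.offset + start, length, mention)
                )
        passage.annotations.sort(key=lambda a: a.offset)
    return passages


title = "BRCA1 mutations in breast cancer"
abstract = "Mice lacking BRCA1 develop tumors."
passages = [
    Passage("title", 0, title),
    Passage("abstract", len(title) + 1, abstract),  # assumed one-character gap
]
taggers = [
    make_dictionary_tagger("Gene", ["BRCA1"]),
    make_dictionary_tagger("Disease", ["breast cancer"]),
    make_dictionary_tagger("Species", ["mice"]),
]
for passage in run_pipeline(passages, taggers):
    for a in passage.annotations:
        print(passage.kind, a.concept_type, a.offset, a.length, a.text)
```

Because every tagger reads and writes the same passage structure, adding another bioconcept type only means appending one more function to the `taggers` list.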
  
• id: PubMed ID
• passage: title
• date: the time the file was downloaded
• passage: abstract
18	
  
• ID of the bioconcept
• Offset of the bioconcept
• Length of the bioconcept
• Mention of the bioconcept
• Type of the bioconcept
• Time: the time the annotation was created
• ID: PMID of the article
• GO term: e.g., receptor-mediated endocytosis
• GO evidence code: e.g., Inferred from Mutant Phenotype (IMP)
• Curatable entity: i.e., gene or gene product
• Text: GO evidence text
• Our experience with BioC
  - Minimal changes required to prepare BioC versions
  - Easy to learn and use
  - Improved interoperability within the toolkit
• Implications
  - Improved interoperability
    - With other tools to build sophisticated applications
  - The key file could evolve as a standard for concept recognition and normalization tasks
  - Anticipate broader usage of our tools as BioC gains popularity
20	
  
• BioC Developers
  - W. John Wilbur
  - Rezarta Islamaj Doğan
  - Donald Comeau
• Intramural Research Program of the NIH, National Library of Medicine
21	
  
• Chih-Hsuan Wei
  - weic4@ncbi.nlm.nih.gov
  - +1 301-594-5290
22	
  


NCBI Text Mining Tools Compatible with BioC Format

  • 1. Ritu  Khare,  Chih-­‐Hsuan  Wei,  Yuqing  Mao,  Robert  Leaman,  Zhiyong  Lu   National  Center  for  Biotechnology  Information  (NCBI)   National  Institutes  of  Health     1  
  • 2. Motivation; Our Text Mining Tools; Building BioC Compatible Tools; Results and Conclusions
  • 3.
      - Building complex text mining applications requires combining different tools developed by different groups.
      - Each tool is developed independently, following its group's conventions for data representation, programming, and execution environments.
      - Heterogeneity in data/text representations limits and slows down tool interoperability, application development, and research and innovation.
  • 4. Existing solutions vs. BioC
      - Existing solutions
        - Unstructured Information Management Architecture (UIMA), 2004
        - General Architecture for Text Engineering (GATE), 2009
        - Steep learning curve
        - Substantial development and re-development time
      - BioC
        - Minimal change requirement to existing applications and datasets
        - BioC family: XML formats to present text documents and annotations; functions (C++, Java) to read/write documents in BioC format
  • 5. Motivation; Our Text Mining Tools; Building BioC Compatible Tools; Results and Conclusions
  • 6. Concept Recognition and Annotation Toolkit. Input: PubMed abstracts or full-text articles. Output: annotations for various bioconcepts.
      - DNorm: disease mentions with MEDIC IDs (F-measure = 80.90%)
      - tmVar: mutation mentions (F-measure = 91.39%)
      - SR4GN: species mentions with Taxonomy IDs (F-measure = 85.42%)
      - tmChem: chemical mentions (F-measure = 88.27%)
      - GenNorm: gene mentions with Entrez IDs (F-measure = 92.89%)
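  • 7. NER tools: programming language, method, and supported formats. Format columns on the slide: PubMed/PMC XML, Free Text, PubTator Format, GenNorm Format (the transcript keeps the number of check marks per tool but not their columns).
      - tmChem (Chemical): Java, Perl, C++; CRF; 2 of the 4 formats
      - DNorm (Disease): Java; CRF; 2 of the 4 formats
      - tmVar (Mutation): Perl, C++; CRF; 3 of the 4 formats
      - SR4GN (Species): Perl; rule-based; 3 of the 4 formats
      - GenNorm (Gene): Perl; statistical; 3 of the 4 formats
      - PubTator: Perl, JavaScript; web server; 2 of the 4 formats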
  • 9.
      - Official corpus for the BioCreative IV GO Task
      - 200 full-text articles along with their Gene Ontology (GO) annotations: evidence sentences; gene/protein entities, GO terms, and GO evidence codes
      - Developed by expert GO curators via a web-based annotation tool
  • 10. Motivation; The NCBI Text Mining Toolkit; Building BioC Compatible Tools; Results and Conclusions
  • 11.
      - The BioC family
        - XML DTD: how to present text documents and annotations (higher-level semantics)
        - C++ and Java libraries: functions/classes to read/write documents in BioC format
      - BioC recommendations
        - Full-text articles and annotations: present in BioC XML format; keep in separate files
        - Key file: describes how data should be interpreted in the annotation file (lower-level semantics); needs to be created for a specific type of data
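To make the XML side of slide 11 concrete, here is a minimal sketch of a BioC collection built with Python's standard library. The element names (`collection`, `source`, `date`, `key`, `document`, `id`, `passage`, `infon`, `offset`, `text`) follow the BioC DTD; the function name, the key file name, the date, and the one-character separator between passages are assumptions made for illustration, not details taken from the slides.

```python
import xml.etree.ElementTree as ET


def build_bioc_collection(pmid, title, abstract, key_file="example.key"):
    """Build a minimal BioC collection holding one document with two passages."""
    collection = ET.Element("collection")
    ET.SubElement(collection, "source").text = "PubMed"
    ET.SubElement(collection, "date").text = "2013-10-01"  # placeholder date
    ET.SubElement(collection, "key").text = key_file  # hypothetical key file name

    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = pmid

    offset = 0
    for passage_type, text in (("title", title), ("abstract", abstract)):
        passage = ET.SubElement(document, "passage")
        ET.SubElement(passage, "infon", key="type").text = passage_type
        ET.SubElement(passage, "offset").text = str(offset)
        ET.SubElement(passage, "text").text = text
        # Assumption: passages are separated by a single character.
        offset += len(text) + 1
    return collection


collection = build_bioc_collection("12345", "A title.", "An abstract sentence.")
xml_text = ET.tostring(collection, encoding="unicode")
```

The key file itself is a plain description for human readers, so it is not generated here; only its name is recorded in the `key` element.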
  • 12.
      - Steps taken to make our tools BioC-compliant
        - Created the key file
        - Modified the input/output formats of the tools: added the BioC format as a new option for input/output
      - Challenges
        - Defining an appropriate key file
        - Offset calculation
        - Translating the web-based annotation file into a BioC annotation file (Unicode to ASCII conversion)
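The two technical challenges on slide 12, offset calculation and Unicode to ASCII conversion, interact: a conversion that drops characters shifts every later offset. The sketch below shows one possible approach using only the standard library. It is not the authors' implementation; `to_ascii` and `locate_mentions` are hypothetical helper names, and the sample sentence is made up.

```python
import unicodedata


def to_ascii(text):
    """Approximate ASCII conversion: decompose accents, drop what is still non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def locate_mentions(passage_text, passage_offset, mentions):
    """Return (mention, document_offset, length) for the first hit of each mention."""
    located = []
    for mention in mentions:
        start = passage_text.find(mention)
        if start == -1:
            continue  # mention does not occur in this passage
        located.append((mention, passage_offset + start, len(mention)))
    return located


text = "Sjögren syndrome and HLA-DR3"
ascii_text = to_ascii(text)
hits = locate_mentions(ascii_text, 0, ["HLA-DR3"])
```

Accented letters such as "ö" map to a single ASCII letter, so the length is unchanged, but a character with no ASCII decomposition (for example "β") is dropped and shortens the text. Offsets are therefore only valid against the same form of the text that is stored in the passage.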
  • 13. Motivation; Our Text Mining Tools; Building BioC Compatible Tools; Results and Conclusions
  • 14. Common key file for all tools, since they are designed for similar types of data. Labeled fields:
      - id: PubMed ID
      - passage: e.g., title, abstract
      - offset of the passage
      - ID of the bioconcept
      - offset of the bioconcept
      - length of the bioconcept
      - mention of the bioconcept
      - date: time the annotation was created
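  • 15. NER tools, bioconcepts, and supported formats after adding BioC. Format columns on the slide: PubMed/PMC XML, BioC, Free Text, PubTator, GenNorm (the transcript keeps the number of check marks per tool but not their columns).
      - tmChem (Chemical): 3 of the 5 formats
      - DNorm (Disease): 3 of the 5 formats
      - tmVar (Mutation): 4 of the 5 formats
      - SR4GN (Species): 4 of the 5 formats
      - GenNorm (Gene): 4 of the 5 formats
      - PubTator (bioconcept: N/A): 3 of the 5 formats
      Our Text Mining Toolkit is available for public access: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/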
  • 16. Figure: a BioC article file passes through the tools, producing a BioC annotation file.
      - DNorm: identifying disease
      - tmVar: identifying mutation
      - tmChem: identifying chemical
      - SR4GN: identifying species
      - GenNorm: identifying gene
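The flow on slide 16 can be sketched as a chain of taggers that share one document representation, which is the interoperability benefit a common format provides. Everything below is illustrative: the toy lexicon taggers are stand-ins for the real tools (whose command-line interfaces are not shown in the slides), and the dictionary layout is an assumption rather than the BioC data model.

```python
def make_lexicon_tagger(concept_type, lexicon):
    """Return a toy tagger that marks exact lexicon hits (stand-in for a real NER tool)."""
    def tag(document):
        for term in lexicon:
            start = document["text"].find(term)
            if start != -1:
                document["annotations"].append(
                    {"type": concept_type, "offset": start,
                     "length": len(term), "mention": term}
                )
        return document
    return tag


def run_pipeline(document, taggers):
    """Apply each tagger in turn; all of them read and extend the same document."""
    for tag in taggers:
        document = tag(document)
    return document


document = {"id": "12345", "text": "BRCA1 mutations in breast cancer", "annotations": []}
taggers = [
    make_lexicon_tagger("Disease", ["breast cancer"]),  # plays the role of a disease tagger
    make_lexicon_tagger("Gene", ["BRCA1"]),             # plays the role of a gene tagger
]
result = run_pipeline(document, taggers)
```

Because every tagger reads and writes the same structure, adding or reordering tools needs no format conversion between steps.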
  • 17. Labeled fields: id (PubMed ID); date (time the file was downloaded); passage (title); passage (abstract)
  • 18. Labeled fields: ID of the bioconcept; offset of the bioconcept; length of the bioconcept; mention of the bioconcept; type of the bioconcept
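The annotation fields listed on slide 18 (ID, offset, length, mention, type) can be read back from a BioC file with a few lines of standard-library Python. This is a sketch under assumptions: the sample XML is hand-written for illustration, `read_annotations` is a hypothetical helper, and storing the concept type in an `infon` with `key="type"` is one common choice that a real key file would have to confirm.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; not output from any of the tools.
SAMPLE = """<collection>
  <source>PubMed</source>
  <date>2013-10-01</date>
  <key>example.key</key>
  <document>
    <id>12345</id>
    <passage>
      <infon key="type">title</infon>
      <offset>0</offset>
      <text>BRCA1 mutations in breast cancer</text>
      <annotation id="0">
        <infon key="type">Gene</infon>
        <location offset="0" length="5"/>
        <text>BRCA1</text>
      </annotation>
      <annotation id="1">
        <infon key="type">Disease</infon>
        <location offset="19" length="13"/>
        <text>breast cancer</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def read_annotations(xml_text):
    """Return one dict per annotation with the fields named on the slide."""
    root = ET.fromstring(xml_text)
    rows = []
    for document in root.iter("document"):
        pmid = document.findtext("id")
        for annotation in document.iter("annotation"):
            location = annotation.find("location")
            rows.append({
                "pmid": pmid,
                "id": annotation.get("id"),
                "type": annotation.findtext("infon[@key='type']"),
                "offset": int(location.get("offset")),
                "length": int(location.get("length")),
                "mention": annotation.findtext("text"),
            })
    return rows


rows = read_annotations(SAMPLE)
```

A useful sanity check when producing such files is that slicing the passage text at `offset` for `length` characters gives back the stored mention.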
  • 19. Labeled fields:
      - Time: time the annotation was created
      - ID: PMID of the article
      - GO term: e.g., receptor-mediated endocytosis
      - GO evidence code: e.g., Inferred from Mutant Phenotype (IMP)
      - Curatable entity: i.e., gene or gene product
      - Text: GO evidence text
  • 20.
      - Our experience with BioC
        - Minimal changes required to prepare BioC versions
        - Easy to learn and use
        - Improved interoperability within the toolkit
      - Implications
        - Improved interoperability with other tools to build sophisticated applications
        - The key file could evolve as a standard for concept recognition and normalization tasks
        - Anticipate broader usage of our tools as BioC gains popularity
  • 21.
      - BioC developers: W. John Wilbur, Rezarta Islamaj Doğan, Donald Comeau
      - Intramural Research Program of the NIH, National Library of Medicine
  • 22. Chih-Hsuan Wei; weic4@ncbi.nlm.nih.gov; +1 301-594-5290