NCBI Text Mining Tools Compatible with BioC Format
1. Ritu Khare, Chih-Hsuan Wei, Yuqing Mao, Robert Leaman, Zhiyong Lu
National Center for Biotechnology Information (NCBI)
National Institutes of Health
2. Outline
- Motivation
- Our Text Mining Tools
- Building BioC Compatible Tools
- Results and Conclusions
3. - Building complex text mining applications requires combining different tools developed by different groups
- Each tool is developed independently
  - Group conventions: data representation, programming, execution environments
- Heterogeneity in data/text representations limits and slows down
  - tool interoperability, application development, and research and innovation.
4. EXISTING SOLUTIONS
- Unstructured Information Management Architecture (UIMA), 2004
- General Architecture for Text Engineering (GATE), 2009
- Steep learning curve
- Substantial development and re-development time
BIOC
- Minimal changes required to existing applications and datasets
- BioC family
  - XML formats to present text documents and annotations
  - Functions (C++, Java) to read/write documents in BioC format
5. Outline
- Motivation
- Our Text Mining Tools
- Building BioC Compatible Tools
- Results and Conclusions
6. Concept Recognition and Annotation Toolkit
Input: PubMed abstracts or full-text articles
- DNorm: disease mentions with MEDIC IDs (F-measure = 80.90%)
- tmVar: mutation mentions (F-measure = 91.39%)
- SR4GN: species mentions with Taxonomy IDs (F-measure = 85.42%)
- tmChem: chemical mentions (F-measure = 88.27%)
- GenNorm: gene mentions with Entrez IDs (F-measure = 92.89%)
Output: annotations for various bioconcepts
7. NER tools: programming language, method, and supported formats
Format columns: PubMed/PMC XML, Free Text, PubTator Format, GenNorm Format
NER tool | Programming language | Method | Formats supported
tmChem (Chemical) | Java, Perl, C++ | CRF | √ √
DNorm (Disease) | Java | CRF | √ √
tmVar (Mutation) | Perl, C++ | CRF | √ √ √
SR4GN (Species) | Perl | Rule-based | √ √ √
GenNorm (Gene) | Perl | Statistical | √ √ √
PubTator | Perl, JavaScript | Web server | √ √
9. - Official corpus for the BioCreative IV GO Task
- 200 full-text articles along with their Gene Ontology (GO) annotations
  - evidence sentences
  - gene/protein entities, GO terms, GO evidence codes
- Developed by expert GO curators via a web-based annotation tool.
10. Outline
- Motivation
- The NCBI Text Mining Toolkit
- Building BioC Compatible Tools
- Results and Conclusions
11. - The BioC family
  - XML DTD
    - how to present text documents and annotations (higher-level semantics)
  - C++ and Java libraries
    - functions/classes to read/write documents in BioC format
- BioC recommendations
  - Full-text articles and annotations
    - Present in BioC XML format
    - Keep in separate files
  - Key file
    - describes how data should be interpreted in the annotation file (lower-level semantics)
    - needs to be created for a specific type of data.
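To make the XML layout on this slide concrete, the following is a minimal sketch in Python of a BioC-style collection with a title passage and an abstract passage. It is an illustration under stated assumptions, not the BioC C++/Java library: the function name `build_collection`, the key file name `example.key`, the date, the PMID, and the concept identifier are all hypothetical placeholders, and for brevity the sketch keeps text and annotations in one tree even though the recommendation above is to store them in separate files.

```python
import xml.etree.ElementTree as ET


def build_collection(pmid, title, abstract, annotations):
    """Build a BioC-style collection with a title and an abstract passage.

    annotations: list of (concept_type, concept_id, offset, length) tuples.
    Offsets count from the start of the document, taken here to be
    title + one separator character + abstract (an assumption of this sketch).
    """
    collection = ET.Element("collection")
    ET.SubElement(collection, "source").text = "PubMed"
    ET.SubElement(collection, "date").text = "2013-01-01"  # placeholder date
    ET.SubElement(collection, "key").text = "example.key"  # hypothetical key file

    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = pmid

    full_text = title + " " + abstract
    passages = (("title", 0, title), ("abstract", len(title) + 1, abstract))
    for passage_type, offset, text in passages:
        passage = ET.SubElement(document, "passage")
        ET.SubElement(passage, "infon", key="type").text = passage_type
        ET.SubElement(passage, "offset").text = str(offset)
        ET.SubElement(passage, "text").text = text
        # Attach each annotation to the passage that fully contains it.
        for i, (ctype, cid, a_off, a_len) in enumerate(annotations):
            if offset <= a_off and a_off + a_len <= offset + len(text):
                ann = ET.SubElement(passage, "annotation", id=str(i))
                ET.SubElement(ann, "infon", key="type").text = ctype
                ET.SubElement(ann, "infon", key="identifier").text = cid
                ET.SubElement(ann, "location",
                              offset=str(a_off), length=str(a_len))
                ET.SubElement(ann, "text").text = full_text[a_off:a_off + a_len]
    return collection


if __name__ == "__main__":
    col = build_collection(
        "12345",  # placeholder PMID
        "BRCA1 and breast cancer",
        "Mutations in BRCA1 raise breast cancer risk.",
        [("Disease", "EXAMPLE:0001", 10, 13)],  # placeholder concept ID
    )
    print(ET.tostring(col, encoding="unicode"))
```

The element names (`collection`, `document`, `passage`, `infon`, `annotation`, `location`) follow the BioC DTD as commonly published; the meaning of each `infon` key is exactly what a key file is expected to document.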
12. - Steps taken to make our tools BioC-compliant
  - Created the key file
  - Modified the input/output formats of the tools
    - Added the BioC format as a new option for input/output
- Challenges
  - Defining an appropriate key file
  - Offset calculation
  - Translating the web-based annotation file to a BioC annotation file (Unicode to ASCII conversion)
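The two mechanical challenges listed above, offset calculation and Unicode to ASCII conversion, interact: any conversion that changes the length of the text shifts every later offset. The sketch below shows one way this could be handled in Python; it is not the procedure the authors used. The helper names `to_ascii` and `locate_mentions` are hypothetical, and replacing unconvertible characters with `?` is an assumption made here so that offsets stay stable for most characters.

```python
import unicodedata


def to_ascii(text):
    """Approximate Unicode text with ASCII, character by character.

    Accented letters lose their accents (NFKD decomposition); characters
    with no ASCII equivalent become "?" so that they still occupy one
    position. Some characters (for example ligatures) may expand to more
    than one ASCII character, so lengths are not guaranteed to match.
    """
    out = []
    for ch in text:
        decomposed = unicodedata.normalize("NFKD", ch)
        ascii_part = decomposed.encode("ascii", "ignore").decode("ascii")
        out.append(ascii_part if ascii_part else "?")
    return "".join(out)


def locate_mentions(text, mentions):
    """Return (offset, length, mention) for each mention, searching left to right.

    Each mention is searched for after the end of the previous match, so
    repeated mentions listed in document order get distinct offsets.
    Mentions that cannot be found are skipped.
    """
    results = []
    cursor = 0
    for mention in mentions:
        start = text.find(mention, cursor)
        if start == -1:
            continue
        results.append((start, len(mention), mention))
        cursor = start + len(mention)
    return results


if __name__ == "__main__":
    converted = to_ascii("TNF-\u03b1 binds TNFR1")
    print(converted)
    print(locate_mentions(converted, ["TNFR1"]))
```

Recomputing offsets against the converted text, rather than reusing offsets from the original Unicode text, keeps the annotation file consistent with the text file it refers to.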
13. Outline
- Motivation
- Our Text Mining Tools
- Building BioC Compatible Tools
- Results and Conclusions
14. - Common key file for all tools, since they are designed for similar types of data
Fields described in the key file:
- id: PubMed ID
- passage: e.g., title, abstract
- Offset of the passage
- ID of the bioconcept
- Offset of the bioconcept
- Length of the bioconcept
- Mention of the bioconcept
- date: the time the annotation was created
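15. NER tools: bioconcepts and supported formats after adding BioC
Format columns: PubMed/PMC XML, BioC, Free Text, PubTator, GenNorm
NER tool | Bioconcept | Formats supported
tmChem | Chemical | √ √ √
DNorm | Disease | √ √ √
tmVar | Mutation | √ √ √ √
SR4GN | Species | √ √ √ √
GenNorm | Gene | √ √ √ √
PubTator | N/A | √ √ √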
Our Text Mining Toolkit is available for public access: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/
17. Labels on the example BioC text file:
- id: PubMed ID
- passage: title
- date: the time the file was downloaded
- passage: abstract
18. Labels on the example BioC annotation file:
- ID of the bioconcept
- Offset of the bioconcept
- Length of the bioconcept
- Mention of the bioconcept
- Type of the bioconcept
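The fields labelled on this slide (ID, offset, length, mention, and type of each bioconcept) can be read back from a BioC-style annotation file with a few lines of Python. The sketch below is illustrative only: the sample XML, the `infon` key names `type` and `identifier`, the PMID, and the concept identifier are assumptions for the example, and a real file should be interpreted according to its own key file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; values are placeholders, not real data.
SAMPLE = """<collection>
  <source>PubMed</source>
  <date>2013-01-01</date>
  <key>example.key</key>
  <document>
    <id>12345</id>
    <passage>
      <infon key="type">title</infon>
      <offset>0</offset>
      <text>BRCA1 and breast cancer</text>
      <annotation id="0">
        <infon key="type">Disease</infon>
        <infon key="identifier">EXAMPLE:0001</infon>
        <location offset="10" length="13"/>
        <text>breast cancer</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def read_annotations(xml_text):
    """Return one dict per annotation with the fields named on the slide."""
    root = ET.fromstring(xml_text)
    rows = []
    for document in root.iter("document"):
        pmid = document.findtext("id")
        for passage in document.iter("passage"):
            for ann in passage.iter("annotation"):
                # Collect infon key/value pairs, e.g. type and identifier.
                infons = {i.get("key"): i.text for i in ann.findall("infon")}
                location = ann.find("location")
                rows.append({
                    "pmid": pmid,
                    "concept_id": infons.get("identifier"),
                    "type": infons.get("type"),
                    "offset": int(location.get("offset")),
                    "length": int(location.get("length")),
                    "mention": ann.findtext("text"),
                })
    return rows


if __name__ == "__main__":
    for row in read_annotations(SAMPLE):
        print(row)
```

Because every tool in the toolkit shares one key file, a single reader of this shape could in principle consume the output of any of them.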
19. Labels on the example GO annotation file:
- Time: the time the annotation was created.
- ID: PMID of the article.
- GO term: e.g., receptor-mediated endocytosis
- GO evidence code: e.g., Inferred from Mutant Phenotype (IMP)
- Curatable entity: i.e., gene or gene product
- Text: GO evidence text
20. - Our experience with BioC
  - Minimal changes required to prepare BioC versions
  - Easy to learn and use
  - Improved interoperability within the toolkit
- Implications
  - Improved interoperability
    - With other tools to build sophisticated applications
  - The key file could evolve as a standard for concept recognition and normalization tasks
  - Anticipate broader usage of our tools as BioC gains popularity
21. - BioC developers
  - W. John Wilbur
  - Rezarta Islamaj Doğan
  - Donald Comeau
- Intramural Research Program of the NIH, National Library of Medicine