Concept recognition and its
application for protein function
prediction
Christopher Funk, Ph.D Candidate
University of Colorado School of Medicine
3/18/2015
Ph.D Committee:
Dr. Larry Hunter
Dr. Kevin Cohen
Dr. Karin Verspoor
Dr. Asa Ben-Hur
Dr. Joan Hooper
0
Growth in PubMed
[Figure: publications per year in PubMed, 1914 to 2014; y-axis ranges from 0 to 1,200,000]
1
Biomedical Knowledge Lifecycle
Medline
PMC
GenBank
Pfam
GEO
FlyBase
UniProt/SwissProt
Experimental Data
Literature
Biomedical Databases
2
Manual Curation is a Bottleneck
Medline
PMC
GenBank
Pfam
GEO
FlyBase
UniProt/SwissProt
Experimental Data
Literature
Biomedical Databases
(Baumgartner et al. 2007)
6
My dissertation
Medline
PMC
GenBank
Pfam
GEO
FlyBase
UniProt/SwissProt
Experimental Data
Literature
Biomedical Databases
Natural Language Processing Pipeline
8
My dissertation
Medline
PMC
GenBank
Pfam
GEO
FlyBase
UniProt/SwissProt
Experimental Data
Literature
Biomedical Databases
Text-mined data
Data
Predictions/Hypothesis
Machine Learning
Validation
11
Biomedical ontologies
• Great enabling technology of bioinformatics
• Contain concepts linked with hierarchical
relationships
• Over 400 different ontologies in NCBO BioPortal
12
Concept/
Term
Gene Ontology
• Represents a standardized way to refer to
functions
– UniProt-GOA
• Three branches:
– Cellular Component
– Biological Process
– Molecular Function
13
Named entity recognition
Previous in vitro experiments using renal
cell lines suggest recessive Aqp2
mutations result in improper trafficking
of the mutant water pore.
Entity types labeled: cell type, protein, sequenceEntity (twice), molecular function, biological process, chemical
14
Concept recognition/normalization
Previous in vitro experiments using renal
cell lines suggest recessive Aqp2
mutations result in improper trafficking
of the mutant water pore.
GO:0005623 – “cell”
CL:0000000 – “cell”
PR:000004182 – “aquaporin-2”
EG:359 – “Aqp2”
SO:0001059 – “sequence_alteration”
GO:0006810 – “transport”
SO:0001059 – “sequence_alteration”
GO:0015250 – “water channel activity”
CHEBI:15377 – “water”
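The mapping from text spans to ontology identifiers can be illustrated with a minimal dictionary lookup. This is only a sketch, not the pipeline used in the dissertation: the `DICTIONARY` contents are copied from the slide, while the `recognize` function and the whole-word tokenization are assumptions made for illustration.

```python
import re

# Toy dictionary: lowercase surface string -> list of (identifier, label).
# Entries are copied from the slide; a real system would load full ontologies.
DICTIONARY = {
    "cell": [("GO:0005623", "cell"), ("CL:0000000", "cell")],
    "aqp2": [("PR:000004182", "aquaporin-2"), ("EG:359", "Aqp2")],
    "water": [("CHEBI:15377", "water")],
}

def recognize(text):
    """Return (start, end, identifier) for every whole-word dictionary hit."""
    annotations = []
    for match in re.finditer(r"[A-Za-z0-9]+", text):
        for identifier, _label in DICTIONARY.get(match.group().lower(), []):
            annotations.append((match.start(), match.end(), identifier))
    return annotations

SENTENCE = ("Previous in vitro experiments using renal cell lines suggest "
            "recessive Aqp2 mutations result in improper trafficking of the "
            "mutant water pore.")
ANNOTATIONS = recognize(SENTENCE)
for start, end, identifier in ANNOTATIONS:
    print(SENTENCE[start:end], "->", identifier)
```

Exact lookup of this kind finds "cell", "Aqp2", and "water", but not "trafficking" (GO:0006810, transport) or "water pore" (GO:0015250, water channel activity), because those mentions do not match the ontology labels verbatim.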
15
Link to vast knowledge sources
Previous in vitro experiments using renal
cell lines suggest recessive Aqp2
mutations result in improper trafficking
of the mutant water pore.
PR:000004182 – “aquaporin-2”
EG:359 – “Aqp2”
GO:0006810 – “transport”
GO:0015250 – “water channel activity”
CHEBI:15377 – “water”
16
Allows linking to a vast range of other data
Previous in vitro experiments using renal
cell lines suggest recessive Aqp2
mutations result in improper trafficking
of the mutant water pore.
PR:000004182 – “aquaporin-2”
EG:359 – “Aqp2”
GO:0006810 – “transport”
GO:0015250 – “water channel activity”
CHEBI:15377 – “water”
18
Outline of talk
• Biomedical concept recognition
– Comprehensive evaluation of prominent systems
– Improving recognition of complex Gene Ontology
concepts
• Application to protein function
prediction
– Exploring types of literature features that will aid
in identification of function from text
20
I hypothesize that…
• Performance among prominent concept
recognition systems will vary widely depending
on parameter combination and ontology
• Automatic rule-based generation of synonyms for
Gene Ontology concepts can improve recognition
• Literature-mined features, including recognized
concepts, will be useful for prediction of protein
function
21
Outline of talk
• Biomedical concept recognition
– Comprehensive evaluation of prominent systems
– Improving recognition of complex Gene Ontology
concepts
• Application to protein function
prediction
– Exploring types of literature features that will aid
in identification of function from text
22
How well can we perform at this task?
• BioCreative I – genes (yeast): F-measure 0.92 (Hirschman et al 2005)
• BioCreative II – genes: F-measure 0.81 (Morgan et al 2008)
• Mgrep – biological processes: Precision 60% (Shah et al 2009)
• MetaMap – biological processes: Precision 63% (Shah et al 2009)
• MetaMap – diseases: F-measure 0.61 (Kang et al 2013)
• Peregrine – diseases: F-measure 0.64 (Kang et al 2013)
• Whatizit – diseases: F-measure 0.55 (Jimeno et al 2008)
• Lucene – diseases: F-measure 0.78 (Dogan et al 2012)
• MetaMap – diseases: F-measure 0.75 (Dogan et al 2012)
23
Colorado Richly Annotated Full Text
Corpus (CRAFT)
• 97 full-text documents, 67 of which are in the
public release.
• Expertly annotated
• Ontologies
– Cell type
– Sequence Ontology
– NCBI Taxonomy
– ChEBI
– Protein Ontology
– Gene Ontology (3 branches)
24
Experimental setup
• Three dictionary-based systems:
– NCBO Annotator (96 combinations)
• wholeWordOnly, filterNumber, stopWords, stopWordsCaseSensitive, minTermSize, withSynonyms
– MetaMap (864 combinations)
• model, gaps, wordOrder, acronymAbb, derivationalVars, scoreFilter, minTermSize
– ConceptMapper (576 combinations)
• searchStrategy, caseMatch, stemmer, orderIndependentLookup, findAllMatches, stopWords, synonyms
• Performance: precision, recall, and F-measure
• Exact match of both text span and ontological identifier
– very strict standard!
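The strict matching criterion can be sketched as follows. This is an illustrative implementation under stated assumptions, not the evaluation code used in the study: annotations are represented as `(start, end, identifier)` tuples, and the `GOLD` and `PREDICTED` values are hypothetical.

```python
def evaluate(gold, predicted):
    """Strict-match precision, recall, and F-measure.

    Each annotation is a (start, end, identifier) tuple. A prediction is
    correct only if span boundaries and identifier all equal a gold entry.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical annotations for illustration only.
GOLD = [(0, 4, "CL:0000000"), (10, 14, "PR:000004182"), (20, 29, "GO:0006810")]
PREDICTED = [(0, 4, "CL:0000000"),   # exact match: counts as correct
             (10, 14, "EG:359"),     # right span, different identifier
             (20, 30, "GO:0006810")] # right identifier, span off by one
PRECISION, RECALL, F_MEASURE = evaluate(GOLD, PREDICTED)
print(PRECISION, RECALL, F_MEASURE)
```

Only one of the three predictions survives the strict criterion, which shows why near misses on either the span or the identifier are penalized as heavily as outright errors.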
25
Maximum
F-measure per
ontology and
system
26
[Figure: Best performance for all tools on all ontologies. Precision plotted against recall (both 0.0 to 1.0) with F-measure contours from f=0.1 to f=0.9. Systems: MetaMap, ConceptMapper, NCBO Annotator. Ontologies: GO_CC, GO_MF, GO_BP, SO, CL, PR, NCBITAXON, CHEBI.]
Parameter selection matters
27
[Figure: Cell Type Ontology. Precision plotted against recall (both 0.0 to 1.0) with F-measure contours from f=0.1 to f=0.9. Legend: MetaMap, ConceptMapper, NCBO Annotator, default parameters.]
A pipeline for OBO concept
recognition
• ConceptMapper-based pipeline
• Utilizes the best-performing parameter combination for each evaluated ontology
• http://sourceforge.net/projects/bionlp-uima/files/nlp-pipelines/v0.5/
• Input: any text and an OBO file
• Output: list of concepts from the ontology found in the text, written in multiple output formats (xml, a1, inline)
28
Recognition of Gene Ontology terms is poor
(Funk et al. 2014)
29
[Figure: Performance of GO on CRAFT. Precision plotted against recall (both 0.0 to 1.0) with F-measure contours from f=0.1 to f=0.9. Systems: MetaMap, ConceptMapper, NCBO Annotator. Ontologies: CC, MF, BP.]
30
[Figure: Performance of GO on CRAFT. Precision plotted against recall (both 0.0 to 1.0) with F-measure contours from f=0.1 to f=0.9. Systems: MetaMap, ConceptMapper, NCBO Annotator. Ontologies: CC, MF, BP. Additional systems shown: Neji (Campos et al 2013), Whatizit (Rebholz-Schuhmann et al 2008), and case sensitivity and information gain (Groza et al, accepted).]
Concept variation within text
GO:0006900 – membrane budding
Variation in PMID: 12925238
• Lipid rafts play a key role in
membrane budding…
• …involvement of annexin A7 in
budding of vesicles…
• …Ca2+-mediated vesiculation
process was not impaired.
• Red blood cells which lack the
ability to vesiculate cause…
• Having excluded a direct role
in vesicle formation…
31
Gene Ontology vs. natural language
GO:0006900 – membrane budding
[Term]
id: GO:0006900
name: membrane budding
…
def: "The evagination of a membrane, resulting in formation of a vesicle."
…
synonym: "membrane evagination"
synonym: "nonselective vesicle assembly"
synonym: "vesicle biosynthesis"
synonym: "vesicle formation"
…
Variation in PMID: 12925238
• Lipid rafts play a key role in
membrane budding…
• …involvement of annexin A7 in
budding of vesicles…
• …Ca2+-mediated vesiculation
process was not impaired.
• Red blood cells which lack the
ability to vesiculate cause…
• Having excluded a direct role
in vesicle formation…
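The gap between the ontology's labels and the wording in the article can be checked with a simple exact-match test. This is a minimal sketch: the term name, synonyms, and mentions are taken from the slide, while the `matched_label` helper and the case-insensitive substring test are assumptions chosen for illustration.

```python
TERM = {
    "id": "GO:0006900",
    "name": "membrane budding",
    "synonyms": ["membrane evagination", "nonselective vesicle assembly",
                 "vesicle biosynthesis", "vesicle formation"],
}

# The five mentions of the concept listed for PMID 12925238.
MENTIONS = [
    "Lipid rafts play a key role in membrane budding",
    "involvement of annexin A7 in budding of vesicles",
    "Ca2+-mediated vesiculation process was not impaired",
    "Red blood cells which lack the ability to vesiculate",
    "Having excluded a direct role in vesicle formation",
]

def matched_label(sentence, term):
    """Return the first term label found verbatim in the sentence, else None."""
    lowered = sentence.lower()
    for label in [term["name"]] + term["synonyms"]:
        if label in lowered:
            return label
    return None

RESULTS = [matched_label(sentence, TERM) for sentence in MENTIONS]
for sentence, label in zip(MENTIONS, RESULTS):
    print(f"{label!s:20} <- {sentence}")
```

Only the first and last mentions are found; "budding of vesicles", "vesiculation", and "vesiculate" are missed, which is the kind of variation that additional synonyms are meant to cover.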
32
Related work
• TermGenie allows for on-the-fly creation of
concepts and has modules for automatic
synonym generation for a few classes of
concepts (Dietze et al 2014)
• Hamon et al 2008 automatically generate
synonym sets from Gene Ontology concepts
– {F-actin, actin filament}
34
GO concepts are built compositionally
35
Simple processes
cell differentiation
cell proliferation
cell activation
Specific cells
T-cell
Leukocyte
…
Different types of regulation
Regulation of biological process
Negative regulation of BP
Positive regulation of BP
GO concepts are built compositionally
36
Simple processes
cell differentiation
cell proliferation
cell activation
Specific cells
T-cell
Leukocyte
…
Different types of regulation
Regulation of biological process
Negative regulation of BP
Positive regulation of BP
T-cell differentiation
T-cell proliferation
T-cell activation
Regulation of cell differentiation
Regulation of cell proliferation
Positive regulation of cell differentiation
Positive regulation of cell proliferation
Regulation of T-cell differentiation
Regulation of T-cell proliferation
Positive regulation of T-cell differentiation
Positive regulation of T-cell proliferation
…
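The compositional pattern can be sketched by combining the three building blocks. This is an illustration only: the lists below are abbreviated from the slide, the `compose` function is hypothetical, and not every generated string is guaranteed to be an actual GO concept.

```python
from itertools import product

# Building blocks abbreviated from the slide.
PROCESSES = ["differentiation", "proliferation", "activation"]
CELLS = ["cell", "T-cell"]
REGULATION = ["", "regulation of ", "positive regulation of ",
              "negative regulation of "]

def compose():
    """Combine regulation prefix, cell type, and simple process into names."""
    return [f"{prefix}{cell} {process}"
            for prefix, cell, process in product(REGULATION, CELLS, PROCESSES)]

CONCEPTS = compose()
for name in CONCEPTS:
    print(name)
```

Three small lists already yield 4 x 2 x 3 = 24 names, which suggests why rules that operate on the components can reach many concepts at once.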
Decompositional rules
• Obol was designed to parse ontology terms
and identify missing terms/relationships (Mungall
et al 2004)
• 11 decompositional rules adapted from Obol
grammars
Obol: process(P that positively regulates(F)) =>
[positive],regulation(P),[of],biological process(P)
Mine: Biological Process concept =>
“positive regulation of”, Biological Process concept
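A decompositional rule of this shape can be approximated with a regular expression. This is a hedged sketch, not the Obol grammar or the rules used in the dissertation: `REGULATION_RULE` and `decompose` are hypothetical names, and only the "regulation of" family is handled.

```python
import re

# Matches "regulation of X", "positive regulation of X",
# and "negative regulation of X".
REGULATION_RULE = re.compile(r"^(?:(positive|negative) )?regulation of (.+)$")

def decompose(name):
    """Split a concept name into regulation modifiers and the core process.

    Returns (modifiers, core), where modifiers lists the direction of each
    nested "regulation of" layer ("positive", "negative", or "unspecified").
    """
    modifiers = []
    current = name.lower()
    while True:
        match = REGULATION_RULE.match(current)
        if match is None:
            break
        modifiers.append(match.group(1) or "unspecified")
        current = match.group(2)
    return modifiers, current

print(decompose("positive regulation of cell differentiation"))
print(decompose("cell activation"))
```

Once a name is split this way, synonyms can be generated separately for the regulation part and for the core process, then recombined.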
40
Syntactic and derivational rules
• External ontological mappings (Cell type ontology for now)
• Input from biologist and ontologist
• Derivations from WordNet and Lexical Variant Generator
• Manual analysis of CRAFT annotations
– GO:0050729 positive regulation of inflammatory response
• proinflammatory
• pro-inflammatory
– GO:0045597 positive regulation of cell differentiation
• differentiation-promoting
– GO:0043065 positive regulation of apoptosis
• up-regulation of apoptosis
• pro-apoptotic
– GO:0007131 meiotic recombination
• recombination in meiosis
– GO:0040020 regulation of meiosis
• meiotic regulatory
41
Example application of rules
42
Generated synonyms:
cyclic AMP biosynthesis (current synonym)
adenosine 3’,5’-cyclophosphate biosynthesis (current synonym)
formation of cAMP
cAMP production
generation of cAMP
…
Example application of rules
46
Generated synonyms:
activation of cyclic AMP biosynthesis
adenosine 3’,5’-cyclophosphate biosynthesis enhancement
formation of cAMP activation
stimulation of cAMP production
stimulation of generation of cAMP
…
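The way such synonyms arise can be sketched with string templates applied to the synonyms of the underlying process. This is an assumption-laden illustration: the base synonyms and the three phrasings are taken from the examples on the slides, but the `POSITIVE_TEMPLATES` list and the function name are hypothetical and do not reproduce the actual rule set.

```python
# Synonyms of the underlying process, taken from the slide.
BASE_SYNONYMS = ["cyclic AMP biosynthesis", "cAMP production",
                 "generation of cAMP"]

# Hypothetical templates expressing "positive regulation of X".
POSITIVE_TEMPLATES = ["activation of {}", "stimulation of {}", "{} enhancement"]

def positive_regulation_synonyms(base_synonyms):
    """Cross every base synonym with every positive-regulation template."""
    return [template.format(base)
            for base in base_synonyms
            for template in POSITIVE_TEMPLATES]

GENERATED = positive_regulation_synonyms(BASE_SYNONYMS)
for synonym in GENERATED:
    print(synonym)
```

Three base synonyms and three templates give nine candidates, including "stimulation of cAMP production", a phrase that also occurs in the literature examples that follow.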
Synonyms appear in the literature
• A drug-like antagonist inhibits thyrotropin
receptor-mediated stimulation of cAMP
production in Graves' orbital fibroblasts.
– PMCID: 3407388
• These data suggest that ethanol treatment
increases in vitro hCG production in human
placental trophoblasts by enhancing cAMP
production.
– PMID: 9413929
47
18 rules generated 291k synonyms for 16k GO concepts (66%).
Increase in F-measure for all GO of 0.14.
Overall performance 0.64.
48
[Figure: precision plotted against recall (both 0.0 to 1.0) with F-measure contours from f=0.1 to f=0.9. Ontologies: CC, MF, BP, GO.]
Rules improve performance on the CRAFT corpus
[Figure: precision plotted against recall (both 0.0 to 1.0) with F-measure contours from f=0.1 to f=0.9. Ontologies: CC, MF, BP, GO.]
49
Previously reported systems shown for comparison: Neji (Campos et al 2013), Whatizit (Rebholz-Schuhmann et al 2008), and case sensitivity and information gain (Groza et al, accepted).
Compositional rules show higher performance than any previously reported numbers.
On a corpus of 1 million full-text documents, the rules
produced 42% more annotations and
18% more concepts
50
[Figure: number of annotations (in millions, axis from 0 to 70) by information content (undefined, low, high). Series: OBO only, With rules.]
Examples of new concepts identified
• GO:0032342 - aldosterone biosynthetic process
– aldosterone biosynthesis
– formation of aldosterone
– …
• GO:0050926 - regulation of positive chemotaxis
– chemoattractant stimulation
– upregulation of chemoattractants
– …
• GO:0048672 - positive regulation of collateral
sprouting
– promotion of collateral sprouting
– stimulation of axon branches
– …
51
Manual evaluation of random samples
reveals reduction in accuracy
52
[Figure: accuracy (axis from 0.00 to 1.00) by information content (undefined, low, high, overall). Series: OBO only, With rules.]
Manual error analysis
• 3 main types of errors introduced through compositional
rules:
1. Stemming/lemmatization creating incorrect concepts (60%)
• collagen binding activation => collagen binding activity
• glycine import => importance of glycine
• positive regulation of NK T cell activation => “Natural Killer T Cell
Activation Promotes…”
2. Incorporation of non-exact synonyms (25%)
• negative regulation of ryanodine-sensitive calcium-release channel
activity => anti-ryanodine receptor
3. Inclusion of incorrect punctuation (15%)
• negative regulation of transcription regulator activity =>
“transcriptional regulator; inhibits”
• Two simple fixes removed ~850k total errors and
increased the accuracy of annotations produced by rules from
0.74 to 0.82 on the random sample.
54
Outline of talk
• Biomedical concept recognition
– Comprehensive evaluation of prominent systems
– Improving recognition of complex Gene Ontology
concepts
• Application to protein function
prediction
– Exploring types of literature features that will aid
in identification of function from text
55
Growth of sequence databases and
functional annotations
Source: http://gorbi.irb.hr/en/method/growth-of-sequence-databases/
56
Protein function prediction
• Experimentally determining
function is time consuming
and expensive
• Task: Given a protein, what
are the functions it performs?
• Function is everything that
happens to or through a
protein (Rost et al. 2003)
• Specify function by the Gene
Ontology (GO)
57
Commonly used methods/features
• Transfer of function based on
homology (Bork et al 1998, Rost et al 2003, Xin et al 2013)
• Amino acid sequence (Jensen et al 2003, Martin et
al 2004, Clark et al 2011)
• 3D structure (Pal et al 2005, Laskowski et al 2005)
• Co-localization (Walker et al 1999, Klomp et al 2012)
• Protein interaction networks (Deng et al
2003, Nabieva et al 2005)
• Microarray experiments (Huttenhower et al
2006, Sokolov et al 2013)
• Combinations of all above (Costello et al
2009, Sokolov et al 2010)
58
Literature based function prediction
• Which literature features?
• How to combine them?
• Exploit document similarity to establish relationships between
genes (Shatkay et al 2000, Raychaudhuri et al 2002, Chaussabel et al 2002, Gobeill et al 2014)
• BioCreative IV (2014) - <protein,document> -> <protein,document,GOterm>
– K-nearest neighbor based on similar abstracts was the best-performing approach
(F-measure 0.13) (Gobeill et al 2014)
• Protein-protein co-occurrence supplements PPI (Gabow et al 2008)
• Critical Assessment of Functional Annotation (2011)
– Wong and Shatkay 2014 train and test a classifier that characterizes
proteins using key terms from related abstracts. Only predicts concepts
at the 2nd level of GO.
– Bjorne et al 2011 use biomedical events to predict
385 GO concepts, with an F-measure of 0.09.
60
Literature based function prediction
• Co-mentions – co-occurring entities within a
specified span of text.
– Protein-Protein
– GO-GO
– Protein-GO
• Intra-sentence
• Inter-sentence
• Bag of words (BoW)
– All words in sentence where protein is mentioned
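Protein-GO co-mention extraction can be sketched as below. This is a simplified illustration rather than the extraction pipeline from the dissertation. It assumes each sentence has already been annotated with protein and GO identifiers, and it assumes that "intra-sentence" means the GO term shares a sentence with the protein while "inter-sentence" means the term appears elsewhere in a document that mentions the protein. The `DOCUMENT` contents are hypothetical.

```python
from collections import Counter

def protein_go_comentions(sentences, protein):
    """Count GO terms co-mentioned with a protein in one document.

    sentences: list of dicts with "proteins" (set of ids) and "go" (list of ids).
    Returns (intra, inter) Counters keyed by GO identifier.
    """
    intra, inter = Counter(), Counter()
    if not any(protein in sentence["proteins"] for sentence in sentences):
        return intra, inter  # protein never mentioned: no co-mentions
    for sentence in sentences:
        bucket = intra if protein in sentence["proteins"] else inter
        bucket.update(sentence["go"])
    return intra, inter

# Hypothetical pre-annotated document for illustration only.
DOCUMENT = [
    {"proteins": {"P50281"}, "go": ["GO:0008237", "GO:0006508"]},
    {"proteins": set(), "go": ["GO:0010467"]},
    {"proteins": {"P50281"}, "go": ["GO:0031012"]},
]
INTRA, INTER = protein_go_comentions(DOCUMENT, "P50281")
print(dict(INTRA), dict(INTER))
```

Keeping the two counters separate lets a downstream classifier weight same-sentence evidence differently from looser document-level evidence.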
62
Feature Extraction
Target: P50281 – Matrix metalloproteinase 14 (MMP14)
63
Feature Extraction
Target: P50281 – Matrix metalloproteinase 14 (MMP14)
Bag of words:
WordsSent1(membrane, otherwise, known, … , proteolytic, enzyme, known,
extracellular, invasion, … , progression)
WordsSent2(protein, and, message, levels, of, was , …)
65
Feature Extraction
Target: P50281 – Matrix metalloproteinase 14 (MMP14)
Protein GO term co-mentions:
intra_comen(P50281, GO:0008237), intra_comen(P50281, GO:0006508),
intra_comen(P50281, GO:0009056), intra_comen(P50281, GO:0031012),
inter_comen(P50281, GO:0010467), inter_comen(P50281, GO:0005623)
66
Feature Extraction
Target: P50281 – Matrix metalloproteinase 14 (MMP14)
Protein GO term co-mentions:
inter_comen(P50281, GO:0008237), inter_comen(P50281, GO:0006508),
inter_comen(P50281, GO:0009056), inter_comen(P50281, GO:0031012),
inter_comen(P50281, GO:0010467), inter_comen(P50281, GO:0005623)
67
Feature Representation
Target: P50281 – Matrix metalloproteinase 14 (MMP14)
Bag of Words:
P50281, known=2, membrane=1, protein=1, proteolytic=1, enzyme=1, …
Protein GO term co-mentions (intra-sentence):
P50281, GO:0008237=1, GO:0006508=1, GO:0009056=1, GO:0031012=1,…
Protein GO term co-mentions (inter-sentence):
P50281, GO:0010467=2, GO:0005623=2,…
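The count-based representation above can be sketched as a mapping from feature names to counts, then to a fixed-order vector. This is a minimal sketch under stated assumptions: the two example sentences are hypothetical stand-ins for sentences mentioning the protein, and the helper names `bag_of_words` and `to_vector` are illustrative.

```python
from collections import Counter

def bag_of_words(sentences):
    """Count lowercase whitespace-separated words over all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(word.lower() for word in sentence.split())
    return counts

def to_vector(counts, vocabulary):
    """Project counts onto a fixed feature order; absent features become 0."""
    return [counts.get(feature, 0) for feature in vocabulary]

# Hypothetical sentences in which the target protein is mentioned.
SENTENCES = ["membrane protein known", "proteolytic enzyme known"]
BOW = bag_of_words(SENTENCES)
VOCABULARY = sorted(BOW)
VECTOR = to_vector(BOW, VOCABULARY)
print(VOCABULARY)
print(VECTOR)
```

The same `to_vector` step applies to co-mention counts if GO identifiers are used as the vocabulary, so each feature type can be supplied to a classifier as its own view.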
68
Evaluate literature features within
GOstruct framework (Sokolov et al 2010)
• Multi-view hierarchical support vector machine (SVM) framework designed to predict the entire Gene Ontology at once
• Combines many different types of features for prediction
• One of the top-performing systems in CAFA 2011
(Figure adapted from Sokolov et al 2013)
69
Extraction & Analysis pipeline
70
Literature alone is useful for
function prediction
71
[Figure: macro-averaged F-measure (axis from 0 to 0.7) by Gene Ontology branch (MF, BP, CC). Series: Baseline (co-mentions as predictions), Co-mentions, BoW, Co-mentions + BoW.]
Literature features approach performance of
commonly used biological features
(Sokolov et al 2013, Kahanda et al unpublished)
72
[Figure: macro-averaged F-measure (axis from 0 to 0.8) by Gene Ontology branch (MF, BP, CC). Series: Trans/Localization, Homology, Network, Literature, All Combined.]
Top predicted GO concepts using synonym
rules have higher information content
73
Some false positives have literature
support
• GCNT1 – carbohydrate metabolic process
(Q02742 - GO:0005975)
– “Genes related to carbohydrate metabolism
include PPP1R3C, B3GNT1, and GCNT1…” -
PMID:23646466
• CERS2 – ceramide biosynthetic process
(Q96G23 - GO:0046513)
– “…CersS2, which uses C22-CoA for ceramide
synthesis…” -PMID:22144673
74
Future directions
• Explore interaction of dictionary and machine
learning based methods for concept recognition
• Extend and refine GO synonym generation rules
• Use already created gold standard of functionally
annotated co-mentions to reduce the high false
positive rate (70%)
• Provide “noisy” large collection of extracted co-
mentions to biologists to explore interactively
75
Contributions
• Performed a comprehensive evaluation of
prominent general concept recognition systems
for eight biomedical ontologies against a gold
standard corpus.
• Created a more variable set of Gene Ontology
synonyms utilizing concept compositionality and
used them to improve recognition of GO
concepts within the literature.
• Showed the utility of literature mined features,
including mined concepts, for automated protein
function prediction and validation.
76
Publications
First author
• Christopher Funk, K Bretonnel Cohen, Lawrence Hunter, Karin Verspoor “Simple Gene ontology synonym
generation rules lead to increase in biomedical concept recognition” (almost submitted 2015)
• Christopher Funk, Indika Kahanda, Asa Ben-Hur, and Karin Verspoor (2014) “Evaluating a variety of text-
mined features for automatic protein function prediction” Journal of Biomedical Semantics (accepted).
• Christopher Funk, Indika Kahanda, Asa Ben-Hur, and Karin Verspoor (2014) “Evaluating a variety of text-
mined features for automatic protein function prediction” BioOntologies Special Interest Group ISMB
2014.
• Christopher Funk, Lawrence E. Hunter, and K. Bretonnel Cohen “Combining heterogeneous data for
prediction of disease related and pharmacogenes” Pacific Symposium on Biocomputing 2014.
• Christopher Funk, William Baumgartner Jr., Benjamin Garcia, Christophe Roeder, Michael Bada, K.
Bretonnel Cohen, Lawrence E. Hunter, and Karin Verspoor “Large-scale biomedical concept recognition:
An evaluation of current automatic annotators and their parameters” BMC Bioinformatics 2014.
Co-author
• Artem Sokolov, Christopher Funk, Kiley Graim, Karin Verspoor, Asa Ben-Hur “Combining Heterogeneous
Data Sources for Protein Function Prediction” BMC Bioinformatics 2013.
• Radivojac, Predrag and Clark, Wyatt T and Oron, Tal Ronnen and Schnoes, Alexandra M and Wittkop,
Tobias and Sokolov, Artem and Graim, Kiley and Funk, Christopher and Verspoor, Karin and Ben-Hur, Asa
and others “A large-scale evaluation of computational protein function prediction” Nature Methods 2013.
• Karin Verspoor, K. Bretonnel Cohen, Arrick Lanfranchi, Colin Warner, Helen L. Johnson, Christophe Roeder,
Jinho D. Choi, Christopher Funk, Yuriy Malenkiy, Miriam Eckert, Nianwen Xue, William A. Baumgartner Jr.,
Michael Bada, Martha Palmer, and Lawrence E. Hunter “A corpus of full-text journal articles is a robust
evaluation tool for revealing differences in performance of biomedical natural language processing tools”
BMC Bioinformatics 2012.
• K. Bretonnel Cohen, Karin Verspoor, Michael Bada, Christopher Funk, and Lawrence E. Hunter (accepted)
“The Colorado Richly Annotated Full Text (CRAFT) corpus: Multi-model annotation in the biomedical
domain.” In Nancy Ide and James Pustejovsky, editors, Handbook of Linguistic Annotation.
77
Acknowledgements
• All paper co-authors
– William Baumgartner Jr.
– Benjamin Garcia
– Christophe Roeder
– Michael Bada
– Kevin Cohen
– Lawrence E. Hunter
– Karin Verspoor
• CPBS program
– David Knox
– Mike Hinterberg
– Mike Bada
– Meg Pirrung
– Charlotte Siska
– Natalya Panteleyva
– Negacy Hailu
• Committee
– Larry Hunter
– Karin Verspoor
– Kevin Cohen
– Joan Hooper
– Asa Ben-Hur
• CSU grad students
– Artem Sokolov
– Kiley Graim
– Indika Kahanda
– Fahad Ullah
• Funding
– NIH 2T15LM009451
78
References
• Dogan, Rezarta and Lu, Zhiyong “An Inference Method for Disease Name Normalization” 2012
• Blaschke, Christian et al “Evaluation of BioCreative assessment of task 2” 2005
• Morgan, Alexander et al “Overview of BioCreative II gene normalization” 2008
• Mao, Yuqing et al “Overview of the gene ontology task at BioCreative IV” 2014
• Hirschman, Lynette et al “Overview of BioCreative task 1B: normalized gene list” 2005
• Kang et al “Using rule-based natural language processing to improve disease normalization in biomedical text” 2013
• Jimeno, Antonio et al “Assessment of disease named entity recognition on a corpus of annotated sentences” 2008
• Shah, Nigam et al “Comparison of concept recognizers for building the Open Biomedical Annotator” 2009
• Mungall et al “Obol: integrating language and meaning in bio-ontologies” 2004
• Rost et al “Automatic prediction of protein function” 2003
• Shore et al “Fibrodysplasia ossificans progressiva: a human genetic disorder of extraskeletal bone formation, or—how does one tissue become another?” 2012
• Goichi et al “Cartilage differentiation regulating gene” https://www.google.com/patents/WO2003087375A1?cl=en 2003
• Van der Borght et al “Reduced neurogenesis in the rat hippocampus following high fructose consumption” 2011
• Goncalves et al “The cox-2 inhibitors, meloxicam and nimesulide, suppress neurogenesis in the adult mouse brain” 2010
• Bork et al “Predicting function: from genes to genomes and back” 1999
• Xin et al “Computational methods for identification of functional residues within protein structures” 2003
• Jensen et al “Prediction of human protein function from post-translational modification and localization features” 2003
• Martin et al “Gotcha: a new method for prediction of protein function assessed by the annotation of seven genomes” 2004
• Clark et al “Analysis of protein function and its prediction from amino acid sequence” 2011
• Pal et al “Inference of protein function from protein structure” 2005
• Laskowski et al “Protein function prediction using local 3D templates” 2005
• Walker et al “Prediction of gene function by genome-scale expression analysis: prostate cancer associated genes” 1999
• Klomp et al “Genome-wide matching of genes to cellular roles using guilt-by-association models derived from single sample analysis” 2012
• Deng et al “Prediction of protein function from protein/protein interaction data: a probabilistic approach” 2003
• Nabieva et al “Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps” 2005
• Huttenhower et al “A scalable method for integration and functional analysis of multiple microarray datasets” 2006
• Sokolov et al “Hierarchical classification of gene ontology terms using the GOstruct method” 2010
• Sokolov et al “Combining heterogeneous data sources for accurate functional annotation of proteins” 2013
• Costello et al “Gene networks in Drosophila melanogaster: integrating experimental data to predict protein function” 2009
• Shatkay et al “Texts-as-data: using text-based features for proteins representation and for computational prediction of their characteristics” 2014
• Bjorne et al “A machine learning model and evaluation of text mining protein function prediction” 2011
• Shatkay et al “Finding themes in Medline documents: probabilistic similarity search” 2000
• Chaussabel et al “Mining microarray expression data by literature profiling” 2002
• Raychaudhuri et al “Associating Gene Ontology codes with genes using a maximum entropy analysis of biomedical literature” 2002
• Gabow et al “Improving protein function prediction methods with integration of literature data” 2008
79

Computational Biology thesis defense

  • 1. Concept recognition and its application for protein function prediction. Christopher Funk, Ph.D. Candidate, University of Colorado School of Medicine, 3/18/2015. Ph.D. Committee: Dr. Larry Hunter, Dr. Kevin Cohen, Dr. Karin Verspoor, Dr. Asa Ben-Hur, Dr. Joan Hooper
  • 2. Growth in PubMed [figure: publications per year (axis 0 to 1,200,000) by year, 1914 to 2014]
  • 7. Manual Curation is a Bottleneck (Baumgartner et al. 2007) [diagram: Experimental Data; Literature: Medline, PMC; Biomedical Databases: GenBank, Pfam, GEO, FlyBase, UniProt/SwissProt]
  • 13. Biomedical ontologies
      • Great enabling technology of bioinformatics
      • Contain concepts linked with hierarchical relationships
      • Over 400 different ontologies in NCBO BioPortal
      [figure label: Concept/Term]
  • 14. Gene Ontology
      • Represents standardized way to refer to functions
          – UniProt-GOA
      • Three branches:
          – Cellular Component
          – Biological Process
          – Molecular Function
  • 15. Named entity recognition
      “Previous in vitro experiments using renal cell lines suggest recessive Aqp2 mutations result in improper trafficking of the mutant water pore.”
      Entity types labeled: cell type, protein, sequenceEntity, molecular function, biological process, chemical, sequenceEntity
  • 16. Concept recognition/normalization
      “Previous in vitro experiments using renal cell lines suggest recessive Aqp2 mutations result in improper trafficking of the mutant water pore.”
      • GO:0005623 – “cell”
      • CL:0000000 – “cell”
      • PR:000004182 – “aquaporin-2”
      • EG:359 – “Aqp2”
      • SO:0001059 – “sequence_alteration”
      • GO:0006810 – “transport”
      • SO:0001059 – “sequence_alteration”
      • GO:0015250 – “water channel activity”
      • CHEBI:15377 – “water”
  • 17. Link to vast knowledge sources
      “Previous in vitro experiments using renal cell lines suggest recessive Aqp2 mutations result in improper trafficking of the mutant water pore.”
      PR:000004182 – “aquaporin-2”; EG:359 – “Aqp2”; GO:0006810 – “transport”; GO:0015250 – “water channel activity”; CHEBI:15377 – “water”
  • 18. Allows linking to vast other data
      “Previous in vitro experiments using renal cell lines suggest recessive Aqp2 mutations result in improper trafficking of the mutant water pore.”
      PR:000004182 – “aquaporin-2”; EG:359 – “Aqp2”; GO:0006810 – “transport”; GO:0015250 – “water channel activity”
  • 19. Allows linking to vast other data
      “Previous in vitro experiments using renal cell lines suggest recessive Aqp2 mutations result in improper trafficking of the mutant water pore.”
      PR:000004182 – “aquaporin-2”; EG:359 – “Aqp2”; GO:0006810 – “transport”; GO:0015250 – “water channel activity”; CHEBI:15377 – “water”
  • 20. Link to vast knowledge sources
      “Previous in vitro experiments using renal cell lines suggest recessive Aqp2 mutations result in improper trafficking of the mutant water pore.”
      PR:000004182 – “aquaporin-2”; EG:359 – “Aqp2”; GO:0006810 – “transport”; GO:0015250 – “water channel activity”; CHEBI:15377 – “water”
  • 21. Outline of talk
      • Biomedical concept recognition
          – Comprehensive evaluation of prominent systems
          – Improving recognition of complex Gene Ontology concepts
      • Application to protein function prediction
          – Exploring types of literature features that will aid in identification of function from text
  • 22. I hypothesize that…
      • Performance among prominent concept recognition systems will vary widely depending on parameter combination and ontology
      • Automatic rule-based generation of synonyms for Gene Ontology concepts can improve recognition
      • Literature-mined features, including recognized concepts, will be useful for prediction of protein function
  • 23. Outline of talk
      • Biomedical concept recognition
          – Comprehensive evaluation of prominent systems
          – Improving recognition of complex Gene Ontology concepts
      • Application to protein function prediction
          – Exploring types of literature features that will aid in identification of function from text
  • 24. How well can we perform at this task?
      • BioCreative I – genes (yeast): F-measure 0.92 (Hirschman et al 2005)
      • BioCreative II – genes: F-measure 0.81 (Morgan et al 2008)
      • Mgrep – biological processes: Precision 60% (Shah et al 2009)
      • MetaMap – biological processes: Precision 63% (Shah et al 2009)
      • MetaMap – diseases: F-measure 0.61 (Kang et al 2013)
      • Peregrine – diseases: F-measure 0.64 (Kang et al 2013)
      • Whatizit – diseases: F-measure 0.55 (Jimeno et al 2008)
      • Lucene – diseases: F-measure 0.78 (Dogan et al 2012)
      • MetaMap – diseases: F-measure 0.75 (Dogan et al 2012)
  • 25. Colorado Richly Annotated Full Text Corpus (CRAFT)
      • 97 full-text documents, 67 of which are in the public release
      • Expertly annotated
      • Ontologies
          – Cell type
          – Sequence Ontology
          – NCBI Taxonomy
          – ChEBI
          – Protein Ontology
          – Gene Ontology (3 branches)
  • 26. Experimental setup
      • Three dictionary-based systems:
          – NCBO Annotator (96 combinations): wholeWordOnly, filterNumber, stopWords, stopWordsCaseSensitive, minTermSize, withSynonyms
          – MetaMap (864 combinations): model, gaps, wordOrder, acronymAbb, derivationalVars, scoreFilter, minTermSize
          – Concept Mapper (576 combinations): searchStrategy, caseMatch, stemmer, orderIndependentLookup, findAllMatches, stopWords, synonyms
      • Performance: precision, recall, and F-measure
      • Exact match of both text span and ontological identifier – very strict standard!
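The strict matching criterion on this slide can be sketched as follows. This is only an illustration under stated assumptions, not the evaluation code used in the study: the `Annotation` tuple, the `evaluate` function, and the character offsets in the toy data are all hypothetical. A prediction counts as correct only when document, span start, span end, and ontology identifier all agree with a gold annotation.

```python
from typing import NamedTuple, Set, Tuple


class Annotation(NamedTuple):
    """Hypothetical annotation record: document, character span, concept ID."""
    doc_id: str
    start: int
    end: int
    concept_id: str


def evaluate(gold: Set[Annotation], predicted: Set[Annotation]) -> Tuple[float, float, float]:
    """Precision, recall, and F-measure under exact span + identifier matching."""
    true_positives = len(gold & predicted)  # identical tuples only
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy data with made-up offsets.
gold = {
    Annotation("doc1", 10, 25, "GO:0006900"),
    Annotation("doc1", 40, 45, "CHEBI:15377"),
    Annotation("doc1", 60, 69, "GO:0006810"),
}
predicted = {
    Annotation("doc1", 10, 25, "GO:0006900"),   # exact match: counted
    Annotation("doc1", 40, 46, "CHEBI:15377"),  # span off by one: not counted
}

precision, recall, f_measure = evaluate(gold, predicted)
print(precision, recall, f_measure)
```

The second prediction has the right identifier but a span that is one character too long, so under this criterion it is both a false positive and a missed gold annotation.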
  • 27. Maximum F-measure per ontology and system [figure: “Best performance for all tools on all ontologies”, precision/recall plot with F-measure contours f=0.1 to f=0.9; systems: MetaMap, Concept Mapper, NCBO Annotator; ontologies: GO_CC, GO_MF, GO_BP, SO, CL, PR, NCBITAXON, CHEBI]
  • 28. Parameter selection matters [figure: “Cell Type Ontology”, precision/recall plot with F-measure contours f=0.1 to f=0.9; systems: MetaMap, Concept Mapper, NCBO Annotator; default parameters marked]
  • 29. A pipeline for OBO concept recognition
      • ConceptMapper-based pipeline
      • Utilizes best performing combination for evaluated ontologies
      • http://sourceforge.net/projects/bionlp-uima/files/nlp-pipelines/v0.5/
      • Input: any text and OBO file
      • Output: list of concepts from ontology contained within the text in multiple output files (xml, a1, inline)
  • 30. Recognition of Gene Ontology terms is poor (Funk et al. 2014) [figure: “Performance of GO on CRAFT”, precision/recall plot with F-measure contours f=0.1 to f=0.9; systems: MetaMap, Concept Mapper, NCBO Annotator; ontologies: CC, MF, BP]
  • 31. [figure: “Performance of GO on CRAFT”, precision/recall plot with F-measure contours f=0.1 to f=0.9; systems: MetaMap, Concept Mapper, NCBO Annotator; ontologies: CC, MF, BP; added comparisons: Neji (Campos et al 2013), Whatizit (Rebholz-Schuhmann et al 2008), case sensitivity and information gain (Groza et al accepted)]
  • 32. Concept variation within text
      GO:0006900 – membrane budding
      Variation in PMID: 12925238
      • Lipid rafts play a key role in membrane budding…
      • …involvement of annexin A7 in budding of vesicles…
      • …Ca2+-mediated vesiculation process was not impaired.
      • Red blood cells which lack the ability to vesiculate cause…
      • Having excluded a direct role in vesicle formation…
  • 33. Gene Ontology vs. natural language
      GO:0006900 – membrane budding
      [Term]
      id: GO:0006900
      name: membrane budding
      …
      def: "The evagination of a membrane, resulting in formation of a vesicle."
      …
      synonym: "membrane evagination"
      synonym: "nonselective vesicle assembly"
      synonym: "vesicle biosynthesis"
      synonym: "vesicle formation"
      …
      Variation in PMID: 12925238
      • Lipid rafts play a key role in membrane budding…
      • …involvement of annexin A7 in budding of vesicles…
      • …Ca2+-mediated vesiculation process was not impaired.
      • Red blood cells which lack the ability to vesiculate cause…
      • Having excluded a direct role in vesicle formation…
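The gap between the ontology's labels and the wording in the paper can be illustrated with a small dictionary lookup. This is a sketch under stated assumptions: the stanza is abridged from the slide (real OBO synonym lines also carry a scope and cross-references), and `labels_from_stanza` and `matched_sentences` are hypothetical helpers that do plain case-insensitive substring matching rather than what any of the evaluated systems actually do.

```python
import re
from typing import List

# Abridged stanza for GO:0006900 as shown on the slide (assumption: straight quotes).
OBO_STANZA = '''[Term]
id: GO:0006900
name: membrane budding
synonym: "membrane evagination"
synonym: "nonselective vesicle assembly"
synonym: "vesicle biosynthesis"
synonym: "vesicle formation"
'''

# The five mentions from PMID 12925238 quoted on the slide.
SENTENCES = [
    "Lipid rafts play a key role in membrane budding",
    "involvement of annexin A7 in budding of vesicles",
    "Ca2+-mediated vesiculation process was not impaired",
    "Red blood cells which lack the ability to vesiculate",
    "Having excluded a direct role in vesicle formation",
]


def labels_from_stanza(stanza: str) -> List[str]:
    """Collect the term name and all quoted synonyms from one [Term] stanza."""
    labels = []
    for line in stanza.splitlines():
        if line.startswith("name: "):
            labels.append(line[len("name: "):].strip())
        elif line.startswith("synonym: "):
            match = re.search(r'"([^"]*)"', line)
            if match:
                labels.append(match.group(1))
    return labels


def matched_sentences(labels: List[str], sentences: List[str]) -> List[bool]:
    """For each sentence, report whether any label occurs in it verbatim."""
    lowered_labels = [label.lower() for label in labels]
    return [any(label in s.lower() for label in lowered_labels) for s in sentences]


labels = labels_from_stanza(OBO_STANZA)
hits = matched_sentences(labels, SENTENCES)
for sentence, hit in zip(SENTENCES, hits):
    print("MATCH" if hit else "miss ", sentence)
```

Only the first and last sentences contain a label verbatim; "budding of vesicles", "vesiculation", and "vesiculate" are missed, which is the kind of variation the synonym-generation rules target.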
  • 35. Related work
      • TermGenie allows for on-the-fly creation of concepts and has modules for automatic synonym generation for a few classes of concepts (Dietze et al 2014)
      • Hamon et al 2008 automatically generate synonym sets from Gene Ontology concepts
          – {F-actin, actin filament}
  • 36. GO concepts are built compositionally
      • Simple processes: cell differentiation, cell proliferation, cell activation
      • Specific cells: T-cell, Leukocyte, …
      • Different types of regulation: Regulation of biological process, Negative regulation of BP, Positive regulation of BP
  • 37. GO concepts are built compositionally
      • Simple processes: cell differentiation, cell proliferation, cell activation
      • Specific cells: T-cell, Leukocyte, …
      • Different types of regulation: Regulation of biological process, Negative regulation of BP, Positive regulation of BP
      • Resulting concepts:
          – T-cell differentiation, T-cell proliferation, T-cell activation
          – Regulation of cell differentiation, Regulation of cell proliferation
          – Positive regulation of cell differentiation, Positive regulation of cell proliferation
          – Regulation of T-cell differentiation, Regulation of T-cell proliferation
          – Positive regulation of T-cell differentiation, Positive regulation of T-cell proliferation
          – …
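The compositional pattern on these slides can be sketched as a Cartesian product over three small vocabularies. The lists below contain only items named on the slides, and `compose_names` is a hypothetical helper; this shows the combinatorics only, and not every generated string is necessarily an actual GO term.

```python
from itertools import product
from typing import List

# Building blocks taken from the slides (assumption: this small subset only).
REGULATION_PREFIXES = ["", "regulation of ", "positive regulation of ", "negative regulation of "]
CELLS = ["cell", "T-cell"]
PROCESSES = ["differentiation", "proliferation", "activation"]


def compose_names() -> List[str]:
    """Combine every regulation prefix with every cell type and process."""
    return [
        f"{prefix}{cell} {process}"
        for prefix, cell, process in product(REGULATION_PREFIXES, CELLS, PROCESSES)
    ]


names = compose_names()
print(len(names))  # 4 prefixes x 2 cell types x 3 processes
for name in names[:6]:
    print(name)
```

Three short vocabularies already yield 24 concept names, which suggests why decomposing a concept into its parts and generating variants per part scales better than writing synonyms for each composed concept by hand.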
  • 41. Decompositional rules • Obol was designed to parse ontology terms and identify missing terms/relationships (Mungall et al 2004). • 11 decompositional rules adapted from Obol grammars • Obol: process(P that positively regulates(F)) => [positive],regulation(P),[of],biological process(P) • Mine: Biological Process concept => “positive regulation of”, Biological Process concept
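A decompositional rule of the "Mine" form above can be sketched as a pattern that peels one regulation prefix off a GO term name and leaves a simpler Biological Process concept. This is a minimal illustration, not the talk's actual grammar: the `RULES` table and the `decompose` function are hypothetical names, and only three of the eleven rules are imitated.

```python
import re

# Hypothetical rule table: each pattern strips one "regulation" layer
# from a GO term name. More specific prefixes are listed first.
RULES = [
    (re.compile(r"^positive regulation of (?P<inner>.+)$"), "positive regulation of"),
    (re.compile(r"^negative regulation of (?P<inner>.+)$"), "negative regulation of"),
    (re.compile(r"^regulation of (?P<inner>.+)$"), "regulation of"),
]


def decompose(term):
    """Return (prefixes stripped in order, innermost concept name)."""
    prefixes = []
    name = term.strip().lower()
    changed = True
    while changed:
        changed = False
        for pattern, prefix in RULES:
            match = pattern.match(name)
            if match:
                prefixes.append(prefix)
                name = match.group("inner")
                changed = True
                break
    return prefixes, name


if __name__ == "__main__":
    print(decompose("Positive regulation of T-cell proliferation"))
```

Looping until no rule matches lets nested terms be reduced to their most basic concept, which is the point at which synonyms are easiest to generate.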
  • 42. Syntactic and derivational rules • External ontological mappings (Cell Type Ontology for now) • Input from biologists and ontologists • Derivations from WordNet and Lexical Variant Generator • Manual analysis of CRAFT annotations – GO:0050729 positive regulation of inflammatory response • proinflammatory • pro-inflammatory – GO:0045597 positive regulation of cell differentiation • differentiation-promoting – GO:0043065 positive regulation of apoptosis • up-regulation of apoptosis • pro-apoptotic – GO:0007131 meiotic recombination • recombination in meiosis – GO:0040020 regulation of meiosis • meiotic regulatory
  • 46. Example application of rules • Generated synonyms: cyclic AMP biosynthesis (current synonym) • adenosine 3’,5’-cyclophosphate biosynthesis (current synonym) • formation of cAMP • cAMP production • generation of cAMP • …
  • 47. Example application of rules • Generated synonyms: activation of cyclic AMP biosynthesis • adenosine 3’,5’-cyclophosphate biosynthesis enhancement • formation of cAMP activation • stimulation of cAMP production • stimulation of generation of cAMP • …
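The two example slides suggest a compositional step: synonyms of the inner concept are combined with phrasings for the regulation layer. A minimal sketch follows, assuming a hypothetical `BASE_SYNONYMS` table and hypothetical templates; the dictionary key is only an illustrative label, and the real rule set is not shown in the talk.

```python
from itertools import product

# Illustrative synonyms taken from the slide; the key is a label chosen
# for this sketch, not necessarily the official GO term name.
BASE_SYNONYMS = {
    "cAMP biosynthesis": [
        "cyclic AMP biosynthesis",
        "formation of cAMP",
        "cAMP production",
        "generation of cAMP",
    ],
}

# Hypothetical templates for "positive regulation of X"; {x} is replaced
# by each synonym of the inner concept.
POSITIVE_REGULATION_TEMPLATES = [
    "activation of {x}",
    "stimulation of {x}",
    "{x} enhancement",
]


def compose_synonyms(inner_concept):
    """Cross every template with every synonym of the inner concept."""
    inner = BASE_SYNONYMS.get(inner_concept, [inner_concept])
    combined = {
        template.format(x=synonym)
        for template, synonym in product(POSITIVE_REGULATION_TEMPLATES, inner)
    }
    return sorted(combined)


if __name__ == "__main__":
    for synonym in compose_synonyms("cAMP biosynthesis"):
        print(synonym)
```

The cross product is why a small number of rules can yield a very large number of synonyms, and also why some generated strings will be unnatural or wrong.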
  • 48. Synonyms appear in the literature • A drug-like antagonist inhibits thyrotropin receptor-mediated stimulation of cAMP production in Graves' orbital fibroblasts. – PMCID: 3407388 • These data suggest that ethanol treatment increases in vitro hCG production in human placental trophoblasts by enhancing cAMP production. – PMID: 9413929 47
  • 49. Rules improve performance on the CRAFT corpus • 18 rules generated 291k synonyms for 16k GO concepts (66%). • Increase in F-measure for all GO of 0.14; overall performance 0.64. • [Figure: precision vs. recall with F-measure isolines (f=0.1 to f=0.9); series by ontology: CC, MF, BP, GO]
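For readers unfamiliar with the metric on this plot, the F-measure is the harmonic mean of precision and recall. The sketch below uses the standard balanced definition; how annotations were matched against CRAFT (exact span, identifier, and so on) is not stated here, so the counts are simply taken as inputs.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and balanced F-measure from raw counts.

    tp: annotations that match the gold standard
    fp: annotations produced but absent from the gold standard
    fn: gold-standard annotations that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    print(precision_recall_f(tp=8, fp=2, fn=8))
```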
  • 50. Compositional rules show higher performance than any reported numbers. • Other systems plotted: Neji (Campos et al 2013), Whatizit (Rebholz-Schuhmann et al 2008), case sensitivity and information gain (Groza et al accepted) • [Figure: precision vs. recall with F-measure isolines (f=0.1 to f=0.9); series by ontology: CC, MF, BP, GO]
  • 51. 1 million full text corpus – rules produced 42% more annotations and 18% more concepts • [Figure: number of annotations (in millions) by information content (undefined, low, high); OBO only vs. with rules]
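The binning by information content can be illustrated with the standard definition, IC(c) = -log2 p(c), where p(c) is the fraction of curated annotations that use concept c. This is a sketch under assumptions: the talk does not give its exact formula, the `threshold` separating "low" from "high" is hypothetical, and "undefined" is taken to mean that the concept has no curated annotations.

```python
import math


def information_content(annotation_counts):
    """Map each concept to -log2 of its share of all curated annotations."""
    total = sum(annotation_counts.values())
    return {
        concept: -math.log2(count / total)
        for concept, count in annotation_counts.items()
        if count > 0
    }


def ic_bin(concept, ic, threshold=5.0):
    """Assign a concept to 'undefined', 'low' or 'high' (threshold is hypothetical)."""
    if concept not in ic:
        return "undefined"
    return "high" if ic[concept] >= threshold else "low"


if __name__ == "__main__":
    counts = {"GO:A": 512, "GO:B": 1, "GO:C": 511}  # toy identifiers and counts
    ic = information_content(counts)
    print({c: ic_bin(c, ic) for c in ["GO:A", "GO:B", "GO:D"]})
```

Rarely annotated concepts receive high information content, which matches the intuition that they are more specific.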
  • 52. Examples of new concepts identified • GO:0032342 - aldosterone biosynthetic process – aldosterone biosynthesis – formation of aldosterone – … • GO:0050926 - regulation of positive chemotaxis – chemoattractant stimulation – upregulation of chemoattractants – … • GO:0048672 - positive regulation of collateral sprouting – promotion of collateral sprouting – stimulation of axon branches – … 51
  • 53. Manual evaluation of random samples reveals reduction in accuracy • [Figure: accuracy by information content (undefined, low, high, overall); OBO only vs. with rules]
  • 55. Manual error analysis • 3 main types of errors introduced through compositional rules: 1. Stemming/lemmatization creating incorrect concepts (60%) • collagen binding activation => collagen binding activity • glycine import => importance of glycine • positive regulation of NK T cell activation => “Natural Killer T Cell Activation Promotes…” 2. Incorporation of non-exact synonyms (25%) • negative regulation of ryanodine-sensitive calcium-release channel activity => anti-ryanodine receptor 3. Inclusion of incorrect punctuation (15%) • negative regulation of transcription regulator activity => “transcriptional regulator; inhibits” • Two simple fixes removed ~850k total errors and increased accuracy of annotations produced by rules from 0.74 => 0.82 on the random sample.
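The punctuation errors (type 3) lend themselves to a simple filter; the editor's notes describe one of the fixes as removing annotations with unmatched quotes, parentheses, or semicolons. The sketch below is one possible reading of that fix: `has_suspect_punctuation` is a hypothetical helper, and rejecting any span that contains a semicolon is an assumption.

```python
def has_suspect_punctuation(span):
    """Flag matched text spans whose punctuation suggests a bad match."""
    # Assumption: a semicolon inside a span means it crosses a clause boundary.
    if ";" in span:
        return True
    # An odd number of straight double quotes means one is unmatched.
    if span.count('"') % 2 == 1:
        return True
    # Parentheses must be balanced and never close before they open.
    depth = 0
    for char in span:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                return True
    return depth != 0


if __name__ == "__main__":
    spans = ["transcriptional regulator; inhibits", "cAMP production"]
    print([s for s in spans if not has_suspect_punctuation(s)])
```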
  • 56. Outline of talk • Biomedical concept recognition – Comprehensive evaluation of prominent systems – Improving recognition of complex Gene Ontology concepts • Application to protein function prediction – Exploring types of literature features that will aid in identification of function from text 55
  • 57. Growth of sequence databases and functional annotations • Source: http://gorbi.irb.hr/en/method/growth-of-sequence-databases/
  • 58. Protein function prediction • Experimentally determining function is time consuming and expensive • Task: Given a protein, what are the functions it performs? • Function is everything that happens to or through a protein (Rost et al. 2003) • Specify function by the Gene Ontology (GO) 57
  • 59. Commonly used methods/features • Transfer of function based on homology (Bork et al 1998, Rost et al 2003, Xin et al 2013) • Amino acid sequence (Jensen et al 2003, Martin et al 2004, Clark et al 2011) • 3D structure (Pal et al 2005, Laskowski et al 2005) • Co-localization (Walker et al 1999, Klomp et al 2012) • Protein interaction networks (Deng et al 2003, Nabieva et al 2005) • Microarray experiments (Huttenhower et al 2006, Sokolov et al 2013) • Combinations of all above (Costello et al 2009, Sokolov et al 2010)
  • 61. Literature based function prediction • Which literature features? • How to combine them? • Exploit document similarity to establish relationships between genes (Shatkay et al 2000, Raychaudhuri et al 2002, Chaussabel et al 2002, Gobeill et al 2014) • BioCreative IV (2014) - <protein,document> -> <protein,document,GOterm> – K-nearest neighbor based on similar abstracts was best performing (F-measure 0.13) (Gobeill et al 2014) • Protein-protein co-occurrence supplements PPI (Gabow et al 2008) • Critical Assessment of Functional Annotation (2011) – Wong and Shatkay 2014 train and test a classifier and characterize proteins using key-terms from related abstracts; they only predict concepts at the 2nd level of GO. – Bjorne et al 2011 use biomedical events to predict 385 GO concepts with an F-measure of 0.09.
  • 63. Literature based function prediction • Co-mentions – co-occurring entities within a specified span of text. – Protein-Protein – GO-GO – Protein-GO • Intra-sentence (both entities in the same sentence) • Inter-sentence (non-sentence; entities in different sentences) • Bag of words (BoW) – All words in sentences where the protein is mentioned
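The protein-GO co-mention features can be sketched as a counting step over text in which proteins and GO concepts have already been recognized. This is a minimal illustration under assumptions: the input is a list of per-sentence entity sets for one span of text (how large that span is, for example a paragraph or a document, is not specified in the talk), and `co_mentions` is a hypothetical name.

```python
from collections import Counter


def co_mentions(sentences):
    """Count protein-GO co-mentions within one span of text.

    sentences: list of (protein_ids, go_ids) pairs, one per sentence.
    Returns (intra, inter): pairs seen in the same sentence, and pairs
    whose protein and GO term sit in different sentences of the span.
    """
    intra, inter = Counter(), Counter()
    for i, (proteins, _) in enumerate(sentences):
        for j, (_, go_terms) in enumerate(sentences):
            target = intra if i == j else inter
            for protein in proteins:
                for go_term in go_terms:
                    target[(protein, go_term)] += 1
    return intra, inter


if __name__ == "__main__":
    span = [
        ({"P50281"}, {"GO:0008237", "GO:0006508"}),
        (set(), {"GO:0010467"}),
    ]
    intra, inter = co_mentions(span)
    print(dict(intra))
    print(dict(inter))
```

Keeping the two counters separate mirrors the slides, where intra- and inter-sentence co-mentions are treated as distinct feature sets.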
  • 64. Feature Extraction Target: P50281 – Matrix metalloproteinase 14 (MMP14) 63
  • 66. Feature Extraction Target: P50281 – Matrix metalloproteinase 14 (MMP14) Bag of words: WordsSent1(membrane, otherwise, known, … , proteolytic, enzyme, known, extracellular, invasion, … , progression) WordsSent2(protein, and, message, levels, of, was , …) 65
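A bag-of-words extraction along these lines might look like the sketch below. The editor's notes say that text was lowercased and common stop words removed, without stemming or lemmatization; everything else here is assumed for illustration: the tiny `STOP_WORDS` set, the tokenizer pattern, and finding protein mentions by plain substring match.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; the list actually used is not given.
STOP_WORDS = {"a", "an", "and", "as", "is", "of", "the", "was"}


def bag_of_words(sentences, protein_names):
    """Count words in every sentence that mentions one of the protein names."""
    counts = Counter()
    names = [name.lower() for name in protein_names]
    for sentence in sentences:
        lowered = sentence.lower()
        # Crude mention detection by substring; a real pipeline would use
        # a protein name recognizer.
        if not any(name in lowered for name in names):
            continue
        tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", lowered)
        counts.update(token for token in tokens if token not in STOP_WORDS)
    return counts


if __name__ == "__main__":
    text = ["MMP14 is a known proteolytic enzyme.", "Unrelated sentence here."]
    print(bag_of_words(text, ["MMP14"]))
```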
  • 67. Feature Extraction Target: P50281 – Matrix metalloproteinase 14 (MMP14) Protein GO term co-mentions: intra_comen(P50281, GO:0008237), intra_comen(P50281, GO:0006508), intra_comen(P50281, GO:0009056), intra_comen(P50281, GO:0031012), inter_comen(P50281, GO:0010467), inter_comen(P50281, GO:0005623) 66
  • 68. Feature Extraction Target: P50281 – Matrix metalloproteinase 14 (MMP14) Protein GO term co-mentions: inter_comen(P50281, GO:0008237), inter_comen(P50281, GO:0006508), inter_comen(P50281, GO:0009056), inter_comen(P50281, GO:0031012), inter_comen(P50281, GO:0010467), inter_comen(P50281, GO:0005623) 67
  • 69. Feature Representation Target: P50281 – Matrix metalloproteinase 14 (MMP14) • Bag of Words: P50281, known=2, membrane=1, protein=1, proteolytic=1, enzyme=1, … • Protein GO term co-mentions (intra-sentence): P50281, GO:0008237=1, GO:0006508=1, GO:0009056=1, GO:0031012=1,… • Protein GO term co-mentions (inter-sentence): P50281, GO:0010467=2, GO:0005623=2,…
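The one-line-per-protein representation shown on this slide can be produced from a dictionary of counts. The sketch below assumes a plain `feature=count` text format modeled on the slide; the actual file format consumed downstream is not specified, and `format_feature_line` is a hypothetical helper.

```python
def format_feature_line(protein_id, counts):
    """Serialize one protein's feature counts as 'ID, feature=count, ...'."""
    # Sorting keeps the output deterministic across runs.
    parts = [f"{feature}={count}" for feature, count in sorted(counts.items())]
    return ", ".join([protein_id] + parts)


if __name__ == "__main__":
    inter_sentence = {"GO:0010467": 2, "GO:0005623": 2}
    print(format_feature_line("P50281", inter_sentence))
```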
  • 70. Evaluate literature features within GOstruct framework (Sokolov et al 2010) • Multi-view hierarchical support vector machine (SVM) framework designed to predict entire Gene Ontology at once • Combine many different types of features for prediction • One of top performing systems in CAFA 2011 (Adapted from Sokolov et al 2013) 69
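Predicting the entire ontology at once implies that each protein's label is a binary vector over GO terms that respects the hierarchy: if a protein has a term, it also has every ancestor of that term. The sketch below illustrates only that labeling idea with a toy hierarchy; it is not GOstruct's implementation, and the term names and the `parents` mapping are made up.

```python
def ancestors_closure(terms, parents):
    """Return the given terms plus all of their ancestors."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, ()))
    return seen


def label_vector(terms, parents, ordered_terms):
    """Binary vector over ordered_terms, with annotations propagated upward."""
    closed = ancestors_closure(terms, parents)
    return [1 if term in closed else 0 for term in ordered_terms]


if __name__ == "__main__":
    # Toy hierarchy (child -> parents); identifiers are illustrative only.
    parents = {"leaf_a": ["mid"], "leaf_b": ["mid"], "mid": ["root"]}
    order = ["root", "mid", "leaf_a", "leaf_b"]
    print(label_vector({"leaf_a"}, parents, order))
```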
  • 71. Extraction & Analysis pipeline 70
  • 72. Only using literature is useful for function prediction • [Figure: macro-averaged F-measure by Gene Ontology branch (MF, BP, CC); series: baseline (co-mentions as predictions), co-mentions, BoW, co-mentions + BoW]
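Macro-averaging computes an F-measure per unit and then takes the unweighted mean, so rare and frequent units count equally. The sketch below assumes the averaging is done per GO term; the talk does not say whether terms or proteins were the unit, and `macro_f_measure` is a hypothetical helper.

```python
def macro_f_measure(gold, predicted):
    """Unweighted mean of per-term F-measures.

    gold, predicted: dict mapping GO term -> set of protein IDs.
    Only terms present in the gold standard are scored.
    """
    scores = []
    for term, gold_proteins in gold.items():
        predicted_proteins = predicted.get(term, set())
        tp = len(gold_proteins & predicted_proteins)
        precision = tp / len(predicted_proteins) if predicted_proteins else 0.0
        recall = tp / len(gold_proteins) if gold_proteins else 0.0
        if precision + recall == 0:
            scores.append(0.0)
        else:
            scores.append(2 * precision * recall / (precision + recall))
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    gold = {"GO:1": {"A", "B"}, "GO:2": {"C"}}
    predicted = {"GO:1": {"A"}, "GO:2": {"D"}}
    print(macro_f_measure(gold, predicted))
```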
  • 73. Literature features approach performance of commonly used biological features (Sokolov et al 2013, Kahanda et al unpublished) • [Figure: macro-averaged F-measure by Gene Ontology branch (MF, BP, CC); series: Trans/Localization, Homology, Network, Literature, All Combined]
  • 74. Top predicted GO concepts using synonym rules have higher information content 73
  • 75. Some false positives have literature support • GCNT1 – carbohydrate metabolic process (Q02742 - GO:0005975) – “Genes related to carbohydrate metabolism include PPP1R3C, B3GNT1, and GCNT1…” - PMID:23646466 • CERS2 – ceramide biosynthetic process (Q96G23 - GO:0046513) – “…CersS2, which uses C22-CoA for ceramide synthesis…” - PMID:22144673
  • 76. Future directions • Explore interaction of dictionary and machine learning based methods for concept recognition • Extend and refine GO synonym generation rules • Use the already created gold standard of functionally annotated co-mentions to reduce the high false positive rate (70%) • Provide a “noisy” large collection of extracted co-mentions to biologists to explore interactively
  • 77. Contributions • Performed a comprehensive evaluation of prominent general concept recognition systems for eight biomedical ontologies against a gold standard corpus. • Created a more variable set of Gene Ontology synonyms utilizing concept compositionality and used them to improve recognition of GO concepts within the literature. • Showed the utility of literature-mined features, including mined concepts, for automated protein function prediction and validation.
  • 78. Publications • First author • Christopher Funk, K Bretonnel Cohen, Lawrence Hunter, Karin Verspoor “Simple Gene Ontology synonym generation rules lead to increase in biomedical concept recognition” (almost submitted 2015) • Christopher Funk, Indika Kahanda, Asa Ben-Hur, and Karin Verspoor (2014) “Evaluating a variety of text-mined features for automatic protein function prediction” Journal of Biomedical Semantics (accepted). • Christopher Funk, Indika Kahanda, Asa Ben-Hur, and Karin Verspoor (2014) “Evaluating a variety of text-mined features for automatic protein function prediction” BioOntologies Special Interest Group ISMB 2014. • Christopher Funk, Lawrence E. Hunter, and K. Bretonnel Cohen “Combining heterogeneous data for prediction of disease related and pharmacogenes” Pacific Symposium on Biocomputing 2014. • Christopher Funk, William Baumgartner Jr., Benjamin Garcia, Christophe Roeder, Michael Bada, K. Bretonnel Cohen, Lawrence E. Hunter, and Karin Verspoor “Large-scale biomedical concept recognition: An evaluation of current automatic annotators and their parameters” BMC Bioinformatics 2014. • Co-author • Artem Sokolov, Christopher Funk, Kiley Graim, Karin Verspoor, Asa Ben-Hur “Combining Heterogeneous Data Sources for Protein Function Prediction” BMC Bioinformatics 2013. • Predrag Radivojac, Wyatt T Clark, Tal Ronnen Oron, Alexandra M Schnoes, Tobias Wittkop, Artem Sokolov, Kiley Graim, Christopher Funk, Karin Verspoor, Asa Ben-Hur, and others “A large-scale evaluation of computational protein function prediction” Nature Methods 2013. • Karin Verspoor, K. Bretonnel Cohen, Arrick Lanfranchi, Colin Warner, Helen L. Johnson, Christophe Roeder, Jinho D. Choi, Christopher Funk, Yuriy Malenkiy, Miriam Eckert, Nianwen Xue, William A. Baumgartner Jr., Michael Bada, Martha Palmer, and Lawrence E. Hunter “A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools” BMC Bioinformatics 2012. • K. Bretonnel Cohen, Karin Verspoor, Michael Bada, Christopher Funk, and Lawrence E. Hunter (accepted) “The Colorado Richly Annotated Full Text (CRAFT) corpus: Multi-model annotation in the biomedical domain.” In Nancy Ide and James Pustejovsky, editors, Handbook of Linguistic Annotation.
  • 79. Acknowledgements • All paper co-authors – William Baumgartner Jr. – Benjamin Garcia – Christophe Roeder – Michael Bada – Kevin Cohen – Lawrence E. Hunter – Karin Verspoor • CPBS program – David Knox – Mike Hinterberg – Mike Bada – Meg Pirrung – Charlotte Siska – Natalya Panteleyva – Negacy Hailu • Committee – Larry Hunter – Karin Verspoor – Kevin Cohen – Joan Hooper – Asa Ben-Hur • CSU grad students – Artem Sokolov – Kiley Graim – Indika Kahanda – Fahad Ullah • Funding – NIH 2T15LM009451 78
  • 80. References • Dogan, Rezarta and Lu, Zhiyong “An Inference Method for Disease Name Normalization” 2012 • Blaschke, Christian et al “Evaluation of BioCreative assessment of task 2” 2005 • Morgan, Alexander et al “Overview of BioCreative II gene normalization” 2008 • Mao, Yuqing et al “Overview of the gene ontology task at BioCreative IV” 2014 • Hirschman, Lynette et al “Overview of BioCreative task 1B: normalized gene list” 2005 • Kang et al “Using rule-based natural language processing to improve disease normalization in biomedical text” 2013 • Jimeno, Antonio et al “Assessment of disease named entity recognition on a corpus of annotated sentences” 2008 • Shah, Nigam et al “Comparison of concept recognizers for building the Open Biomedical Annotator” 2009 • Mungall et al “Obol: integrating language and meaning in bio-ontologies” 2004 • Rost et al “Automatic prediction of protein function” 2003 • Shore et al “Fibrodysplasia ossificans progressiva: a human genetic disorder of extraskeletal bone formation, or—how does one tissue become another?” 2012 • Goichi et al “Cartilage differentiation regulating gene” https://www.google.com/patents/WO2003087375A1?cl=en 2003 • Van der Borght et al “Reduced neurogenesis in the rat hippocampus following high fructose consumption” 2011 • Goncalves et al “The cox-2 inhibitors, meloxicam and nimesulide, suppress neurogenesis in the adult mouse brain” 2010 • Bork et al “Predicting function: from genes to genomes and back” 1999 • Xin et al “Computational methods for identification of functional residues within protein structures” 2003 • Jensen et al “Prediction of human protein function from post-translational modification and localization features” 2003 • Martin et al “Gotcha: a new method for prediction of protein function assessed by the annotation of seven genomes” 2004 • Clark et al “Analysis of protein function and its prediction from amino acid sequence” 2011 • Pal et al “Inference of protein function from protein structure” 2005 • Laskowski et al “Protein function prediction using local 3D templates” 2005 • Walker et al “Prediction of gene function by genome-scale expression analysis: prostate cancer associated genes” 1999 • Klomp et al “Genome-wide matching of genes to cellular roles using guilt-by-association models derived from single sample analysis” 2012 • Deng et al “Prediction of protein function from protein/protein interaction data: a probabilistic approach” 2003 • Nabieva et al “Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps” 2005 • Huttenhower et al “A scalable method for integration and functional analysis of multiple microarray datasets” 2006 • Sokolov et al “Hierarchical classification of gene ontology terms using the GOstruct method” 2010 • Sokolov et al “Combining heterogeneous data sources for accurate functional annotation of proteins” 2013 • Costello et al “Gene networks in Drosophila melanogaster: integrating experimental data to predict protein function” 2009 • Shatkay et al “Texts-as-data: using text-based features for proteins representation and for computational prediction of their characteristics” 2014 • Bjorne et al “A machine learning model and evaluation of text mining protein function prediction” 2011 • Shatkay et al “Finding themes in Medline documents: probabilistic similarity search” 2000 • Chaussabel et al “Mining microarray expression data by literature profiling” 2002 • Raychaudhuri et al “Associating Gene Ontology codes with genes using a maximum entropy analysis of biomedical literature” 2002 • Gabow et al “Improving protein function prediction methods with integration of literature data” 2008

Editor's Notes

  1. Explosion in literature, much information is contained within these articles. -> Dire need for automatic methods to access, extract information, summarize, answer questions, etc from the biomedical literature. These methods need to be able to scale to handle large amounts of data.
  2. General flowchart of biological knowledge.
  3. Publish results, deposit sequences & expression data
  4. Hypotheses are generated and this is a continual process <NEXT> The process by which most annotations appear in DB is human curation.
  5. Literature -> DB is mostly done through human curation/verification
  6. The BOTTLENECK lies in curation from literature into databases. MANUAL CURATION CANNOT KEEP UP. My thesis fits into the chart in two different places.
  7. Most people evaluate against databases…WHICH ARE INCOMPLETE!!!
  8. Ontologies have become a great enabling technology of modern bioinformatics 1) Database curation -- Linking of data across organisms 2) Play important role in natural language processing – terminology and semantic constraints on entities 3) Formal representation/reasoning TERM = CONCEPT
  9. An important first task for many NLP pipelines is named entity recognition – identification of entities within text. This is about as good as it gets – can't really do much else with the entities. Not very helpful for computers to understand and perform sophisticated tasks.
  10. The task of CR has not received as much effort -> much more difficult problem. See many different concepts. When I say recognition I also imply normalization…to an ontological identifier. Semantically lossless.
  11. CR allows us to link to vast knowledge sources MUCH MORE INTEROPERABLE WITH LARGE CURATED DATABASES.
  12. More information about the protein, sequence, related proteins in other organisms.
  13. Concept recognition – specifically that of the Gene Ontology. 2 applications where mining gene ontology terms from the literature has been used. EVIDENCE for DB CURATION, along with other applications (humans, formal representation)
  15. Very few formal evaluations, no gold standard – no recall/F-measure. No other comprehensive evaluations. DIFFERENTIATE ME:
  16. CRAFT corpus, gold standard annotated with multiple biomedical ontologies. What I compared the prominent systems against.
  17. PROMINENT GENERAL CONCEPT RECOGNITION SYSTEMS – using ontologies and can ground to their identifiers. There are some systems designed for a specific ontology – mostly genes/proteins. Systems have a variety of parameters – describe & talk.
  18. Describe plot. See widely varying performance – Range of 0.1 – 0.83
  19. Defaults are not always the best – most of the time they are not. Provide possible parameter values based upon characteristics of the ontology. Many other conclusions and manual analysis that I don’t have time to talk about here….
  20. Useful for the community – many applications besides the work presented here. ConceptMapper
  21. Important to be able to recognize these complex functions and processes in the literature. The better we can identify them automatically and correctly in text, the more useful and impactful the many downstream tasks that use this type of conceptual information extracted from the literature become – such as manual curation.
  22. Since the original comparison was done, a few other systems have been evaluated on CRAFT. Due to an artifact of CRAFT, BP and MF were combined, so those systems were evaluated on the combined set. We separated the branches as the concepts within are different.
  23. GO was not designed for text-mining but for representation of biological knowledge/function. So a gap exists between what is in the ontology and how concepts are expressed in text.
  26. TermGenie has some templates for synonym generation – basically generates synonyms that are already contained. Sets of rules could be added to this program. Hamon learns synonym sets from GO concepts – unfortunately these are not available, but they could easily be incorporated into these rules.
  27. OBO terms are built from simpler terms. Incorporate more basic synonyms within the more complex terms -> compositionally combining them.
  28. More difficult to generate variability from a larger concept – I want to exploit this underlying principle to decompose into the most basic concepts, generate synonyms, then compositionally build, taking advantage of the synonyms generated.
  29. Logical def -- Blue_car = car AND <has_color_blue>
  30. How many total synonyms?
  31. Arrows represent the improvement made with the additional synonyms generated by the compositional rules.
  32. Since the original comparison was done, a few other systems have been evaluated on CRAFT. Due to an artifact of CRAFT, BP and MF were combined, so those systems were evaluated on the combined set. We separated the branches as the concepts within are different. CRAFT is a smaller corpus, with only 1100 unique GO concepts. Would like to see the impact my rules have on a larger corpus.
  33. Binned by information content (based upon curated annotations); LOW – less useful. Overall see 42.5% increase in annotations produced. Also ~2,100 new concepts identified (12k vs 14k).
  34. Quality control check…
  35. No GOLD STANDARD…must manually evaluate. Evaluated whether the text span identified conveyed the same semantic meaning as the original concept. Evaluated 1% of OBO concepts and 10% of new concepts identified.
  36. Removed the rule that generates X activation, and removed annotations with unmatched quotes, parentheses, semicolons. Overall accuracy for annotations having the same meaning as the original concept is 0.88.
  37. Concept recognition – specifically that of the Gene Ontology. 2 applications where mining gene ontology terms from the literature has been used. EVIDENCE for DB CURATION, along with other applications (humans, formal representation)
  38. Experimentally determining the function of a protein is time consuming and expensive. Computational approaches can help biologists gain insight into what functions proteins perform.
  39. USING LITERATURE One way could specify function is using GENE ONTOLOGY
  40. Most computational methods use those described in the previous slide. What about incorporating the information from the exponentially growing literature? CAFA 1 – of 54 methods, only 3 utilized literature-based features. Key-terms = informative terms derived from Z-score statistic
  41. Most computational methods use those described in the previous slide. What about incorporating the information from the exponentially growing literature? BC had a simpler task, relate protein document pairs to GO terms within that document – methods had trouble with this. CAFA 1 – of 54 methods, only 3 utilized literature-based features. WHAT DIFFERENTIATES MY WORK IS SIZE OF LITERATURE MINED, DO NOT RELY ON CURATED PROTEIN LINKS.
  42. Proxy to true relationships. Captures context around protein mentions, not specific to function. Hypothesize that protein mentions within similar contexts have similar function.
  43. Proxy to true relationships. PREV EXPERIMENTS: protein-GO are most useful within GOstruct framework. Captures context around protein mentions, not specific to function. Hypothesize that protein mentions within similar contexts have similar function.
  44. Walk through an example of pulling out features from literature…
  45. We see 2 examples of
  46. Take all the words in a sentence where proteins are mentioned. We lowercased (LC) and removed common stop words but did not use stemming/lemmatization.
  47. GO dictionary that identifies GO terms. See some sentence and non-sentence co-mentions.
  48. In terms of this protein, all are non-sentence.
  49. Each protein is represented on a single line. Intra and inter are distinct sets. AGGREGATED OVER THE ENTIRE LITERATURE. TONS OF DATA!!!! 4 million unique co-mentions and 230 million TOTAL!!!! NEED MACHINE LEARNING TO HELP IDENTIFY PATTERNS IN NOISY DATA
  50. Too much data for humans to deal with -- ML to help learn patterns from the data itself. Labels are binary representation of GeneOntology hierarchy. DIFFERENTIATE THIS WORK: NUMBER OF CONCEPTS PREDICTED
  51. Literature = All MEDLINE and PMCOA (24 million abstracts & 600k full text documents) DIFFERENTIATES THIS WORK: AMOUNT OF LITERATURE, NUMBER OF IMPLIED RELATIONSHIPS – unlike other methods we
  52. Baseline – only co-mentions as predictions. ABSTRACTED AWAY BECAUSE OF LEVEL OF RECALL is IMPRESSIVE. Surprising that BoW performed so well. VALUE in combining targeted information extraction methods along with word-level features STRESS: UNDERLYING DATA IS DIFFERENT, PROTEINS PREDICTED BETTER BY SOME THAN OTHERS
  53. Top performing GO terms predicted from enhanced co-mentions are of higher information content than the top performing GO terms from original co-mentions. And have higher performance – more specific & MORE USEFUL!!!!
  54. DID THIS MANUALLY – would prefer to have this generated/ranked automatically. GOA annotations have strict requirement for annotation – must be primary experimental evidence/support. Plenty of supporting sentences that don’t meet the criteria to include within GOA, but appear to be correct – still incredibly useful.
  55. Evaluated parameter performance, much more manual error analysis, and presented CR pipeline for use within the community. Showed they improved performance on CRAFT, performed manual accuracy and error analysis from annotations on a large collection of full text, show promise in real world applications – AFP. Simple scalable co-mention/bag-of-words features extracted from the literature at large scale approach performance of commonly used biological features. Literature has use beyond input for validation of false positives.