Annotation Systems & Implementation Issues - Suzanna Lewis
1. Motivation
Your research is valuable
All advances in knowledge are incremental, with
each new idea ultimately building on earlier
knowledge such as you are gathering.
2. Losing data at a rapid rate
up to 80% unavailable after 20 years
http://www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416
3. Data valuation
Information is infinitely shareable without any loss of
value
Reuse increases the value derived from the original
investment
By combining data, their value increases
The more these assets are used, the more additional
knowledge can be gathered (data science)
As a corollary, unshared or insufficiently documented
information is less valuable
The more accurate and complete the information is,
the more useful, and therefore valuable, it is
Moody and Walsh 1999
7. A biological ontology is:
A machine interpretable representation of
some aspect of biological reality
what kinds of things exist?
what are the relationships between these things?
[Figure: eye, ommatidium, sense organ and eye disc linked by is_a, part_of and develops from]
8. Ontology defined
October 25, 2016
The science of what is: of the kinds and
structures of the objects, and their properties
and relations in every area of reality.
The classification of entities and the relations
between them.
Defined by a scientific field's vocabulary and by
the canonical formulations of its theories.
Seeks to solve problems which arise in these
domains.
10. Ontologies help with decision
making
A handy ontology tells us what’s there…
Where should I eat…?
11. Ontologies don’t just organize data; they
also facilitate inference,
and that creates new knowledge, often
unconsciously in the user.
(Presumable)
country of origin
Type of cuisine
12. What a 5 year old child (or a computer) will likely infer
about the world from this helpful ontology…
Flag of fresh juice
‘Frozen Yogurt’ cuisine in
search of a national identity?
Where delicatessen food
hails from…
Fresh Juice is a national cuisine…
13. Information retrieval is not straightforward
18-day pregnant females
female (lactating)
individual female
worker caste (female)
2 yr old female
female (pregnant)
lgb*cc females
sex: female
400 yr. old female
female (outbred)
mare
female, other
adult female
female parent
female (worker)
female child
asexual female
female plant
monosex female
femal
femlale
diploid female
female(gynoecious)
remale
metafemale
f
femele
semi-engorged female
sterile female
famale
female, pooled
sexual oviparous female
normal female
femail
femalen
sterile female worker
sf
female
females
strictly female
vitellogenic replete
female
female - worker
females only
tetraploid female
worker
female (alate sexual)
gynoecious
thelytoky
hexaploid female
female (calf)
healthy female
female (gynoecious)
female (f-o)
hen
probably female (based
on morphology)
castrate female
female with eggs
ovigerous female
3 female
cf.female
female worker
oviparous sexual
females
female (phenotype)
cystocarpic female
female, 6-8 weeks old
worker bee
female mice
dikaryon
female, virgin
female enriched
female, spayed
dioecious female
female, worker
pseudohermaprhoditic
female
Courtesy of N. Silvester and S. Orchard, European Nucleotide Archive, EMBL-EBI
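The variants listed above suggest why retrieval needs normalization before search. As a rough sketch only, the following Python maps free-text values onto a small controlled vocabulary. The vocabulary, the synonym table, the function name `normalize_sex` and the 0.8 cutoff are all assumptions made for illustration; a production system would map to ontology terms and be curated by hand.

```python
import difflib
import re

# Hypothetical controlled vocabulary; a real system would use ontology terms and IDs.
CANONICAL = ["female", "male"]

# Hand-curated synonyms that no spelling heuristic could recover (illustrative only).
SYNONYMS = {"f": "female", "mare": "female", "hen": "female"}


def normalize_sex(raw, cutoff=0.8):
    """Map a free-text value to a canonical label, or None if nothing matches."""
    for token in re.findall(r"[a-z]+", raw.lower()):
        if token in SYNONYMS:
            return SYNONYMS[token]
        # Fuzzy matching absorbs misspellings such as 'femlale' or 'femal'.
        close = difflib.get_close_matches(token, CANONICAL, n=1, cutoff=cutoff)
        if close:
            return close[0]
    return None


for value in ["femlale", "18-day pregnant females", "mare", "dikaryon"]:
    print(value, "->", normalize_sex(value))
```

Values that carry extra meaning (caste, ploidy, age) lose it under this mapping, so a sketch like this only covers the "sex" facet; the remaining facets would need their own terms.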
14. Motivation is to represent biology
accurately
Inferences and decisions we make are
based upon what we know of the
biological reality.
An ontology is a computable
representation of this underlying biological
reality.
Enables a computer to reason over the
data in (some of) the ways that we do.
15. Annotation bottleneck
Even the best research will be for naught if
data can never be found again.
An active lab can easily generate 10-100GB of
data per month, and it is very difficult to
manage on this scale.
Must be annotated at the rate at which it is
generated
And the data must be integrated with other data
Furthermore, the effort put into generating this data
will be utterly wasted if the curated data cannot be
reliably computed upon.
17. Ontologies must be shared
Communities form scientific theories
that seek to explain all of the existing evidence
and can be used for prediction
The computable representation must also be
shared
Thus ontology development is inherently
collaborative
18. Ontologies must be used
Usage feeds back on ontology development and
improves the ontology
It improves even more when these data are used
to answer research questions
There will be fewer problems in the ontology and
more commitment to fixing remaining problems
when important research data is involved that
scientists depend upon
19. Why do we need rules for good
ontology?
Ontologies must be intelligible
To humans (for annotation) and
To machines (for searching, reasoning and error-checking)
Makes it easier to find the most accurate term(s) to use
Avoids annotation errors
Makes it easier for new curators to learn and understand
Makes it easier to combine with other ontologies and terminologies
Makes automatic reasoning possible for searching & inference
Bottom line:
Following basic rules makes more useful ontologies
20. First Rule: Univocity
Terms (including those describing
relations) should have the same meanings
on every occasion of use.
In other words, they should refer to the
same kinds of entities in reality
22. Comparison is difficult, especially across species or across
databases that each use one of these different variants
Disambiguation
Use a single term, and
plenty of synonyms
Gluconeogenesis
Synonyms:
Glucose synthesis
Glucose biosynthesis
Glucose formation
Glucose anabolism
24. Classification rule: Disambiguation
tooth bud initiation
cellular bud initiation
flower bud initiation
Include plain “bud initiation” as a synonym for each of
these terms
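One way to picture the "single term, plenty of synonyms" rule is a lookup index from every label to the primary term(s) it may name. The sketch below uses the labels from the slides; the function names `build_index` and `lookup` and the dictionary layout are assumptions for illustration, not an actual ontology API.

```python
from collections import defaultdict

# Each primary term lists its synonyms; labels are taken from the slides.
TERMS = {
    "gluconeogenesis": ["glucose synthesis", "glucose biosynthesis",
                        "glucose formation", "glucose anabolism"],
    "tooth bud initiation": ["bud initiation"],
    "cellular bud initiation": ["bud initiation"],
    "flower bud initiation": ["bud initiation"],
}


def build_index(terms):
    """Index every primary label and synonym to the set of primary terms it names."""
    index = defaultdict(set)
    for primary, synonyms in terms.items():
        for label in [primary, *synonyms]:
            index[label.lower()].add(primary)
    return index


def lookup(index, query):
    """Return the sorted primary terms a query could refer to."""
    return sorted(index.get(query.strip().lower(), set()))


index = build_index(TERMS)
print(lookup(index, "Glucose formation"))  # resolves to a single primary term
print(lookup(index, "bud initiation"))     # ambiguous: the curator must choose
```

A synonym shared by several terms returns several candidates, which is the point: the ambiguous phrase stays searchable while each primary term keeps exactly one meaning.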
25. Second Rule: Positivity
Complements of classes are not
themselves classes.
Terms such as ‘non-mammal’ or ‘non-
membrane’ do not designate genuine
classes.
26. The Challenge of Positivity
Some organelles are membrane-bound.
A centrosome is not a membrane-bound organelle,
but it may still be considered an organelle.
27. Positivity
Note the logical difference between
“non-membrane-bound organelle” and
“not a membrane-bound organelle”
The latter includes everything that is not a
membrane-bound organelle!
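The distinction can be made concrete with sets. This is a toy illustration only: the universe of entities and the membership of each set are assumptions chosen to make the contrast visible.

```python
# Toy universe of entities; membership of each set is an illustrative assumption.
UNIVERSE = {"mitochondrion", "nucleus", "centrosome", "ribosome", "glucose", "eye"}
ORGANELLES = {"mitochondrion", "nucleus", "centrosome", "ribosome"}
MEMBRANE_BOUND = {"mitochondrion", "nucleus"}

# "non-membrane-bound organelle": still an organelle, just lacking a membrane.
non_membrane_bound_organelle = ORGANELLES - MEMBRANE_BOUND

# "not a membrane-bound organelle": the complement over everything that exists.
not_membrane_bound_organelle = UNIVERSE - MEMBRANE_BOUND

print(sorted(non_membrane_bound_organelle))
print(sorted(not_membrane_bound_organelle))
```

The first set stays within organelles; the second also sweeps in glucose and the eye, which is why a bare complement does not designate a genuine class.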
28. Third Rule: Objectivity
Which classes exist is not a function of our
biological knowledge.
Terms such as ‘unknown’ or
‘unclassified’ or ‘unlocalized’ do not
designate biological natural kinds.
29. Objectivity
How can we annotate when we know that
we don’t have any information?
Annotate to root nodes and use the ND (no data)
evidence code
Similar strategies can be used for any
situation where more specific information is not
yet known
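A minimal sketch of what such a record could look like, assuming Gene Ontology annotation: the three identifiers are the GO root terms, while the `Annotation` class, the `annotate_no_data` helper and the gene name `geneX` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

# Root terms of the three Gene Ontology aspects.
GO_ROOTS = {
    "molecular_function": "GO:0003674",
    "biological_process": "GO:0008150",
    "cellular_component": "GO:0005575",
}


@dataclass(frozen=True)
class Annotation:
    gene: str
    term_id: str
    evidence: str


def annotate_no_data(gene, aspect):
    """Record that a gene was curated but nothing is known for this aspect."""
    return Annotation(gene=gene, term_id=GO_ROOTS[aspect], evidence="ND")


# 'geneX' is a hypothetical gene identifier.
print(annotate_no_data("geneX", "biological_process"))
```

The lack of knowledge lives in the evidence code, not in the ontology, so no 'unknown' class is needed.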
31. Fourth Rule: Use defined relationships
Ontologies are graphs, where the nodes (terms in the
ontology) are connected by edges (relationships
between the terms)
[Figure: Cell, membrane, chloroplast, mitochondrial membrane and chloroplast membrane linked by is-a and part-of edges]
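As a rough sketch of the graph view, terms can be stored as typed edges and traversed by relation. The specific edges below are assumptions built from the labels on the slide, and `build_graph` and `ancestors` are hypothetical helper names.

```python
from collections import defaultdict

# (child, relation, parent) edges; the exact edges are illustrative assumptions.
EDGES = [
    ("chloroplast membrane", "is_a", "membrane"),
    ("mitochondrial membrane", "is_a", "membrane"),
    ("chloroplast membrane", "part_of", "chloroplast"),
    ("chloroplast", "part_of", "cell"),
]


def build_graph(edges):
    """Group outgoing (relation, parent) pairs by child term."""
    graph = defaultdict(list)
    for child, relation, parent in edges:
        graph[child].append((relation, parent))
    return graph


def ancestors(graph, term, relations=("is_a", "part_of")):
    """Collect every term reachable by following the given relation types upward."""
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        for relation, parent in graph.get(current, []):
            if relation in relations and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


graph = build_graph(EDGES)
print(sorted(ancestors(graph, "chloroplast membrane")))
print(sorted(ancestors(graph, "chloroplast membrane", relations=("is_a",))))
```

Because every edge carries a defined relation, a query can choose which relations to follow; with untyped edges the two results could not be told apart.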
32. Reasoning is critical
Prokaryotic and
Eukaryotic cell are
declared disjoint
Fungal cell is a
Eukaryotic cell
Spore is a Fungal cell
and a Prokaryotic cell
Satisfiable?
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0022006
[Figure: Spore is a Fungal Cell and a Prokaryotic Cell; Fungal Cell is a Eukaryotic Cell; Prokaryotic Cell and Eukaryotic Cell are disjoint]
33. Reasoning is critical
Solution: clarify spore
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0022006
[Figure: Prokaryotic Cell and Eukaryotic Cell remain disjoint; Fungal Cell is a Eukaryotic Cell; Spore is replaced by Actinomycete Type Spore and Mycetozoa Type Spore]
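The kind of check a reasoner performs here can be approximated in a few lines. This is a simplified sketch, not a description logic reasoner: it only follows is_a links and declared disjoint pairs. The parent assignments in the repaired hierarchy are an assumption based on general biology, not read off the slide, and the function names are hypothetical.

```python
def superclasses(is_a, cls):
    """Return cls together with all of its transitive is_a ancestors."""
    seen, stack = {cls}, [cls]
    while stack:
        for parent in is_a.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def unsatisfiable(is_a, disjoint_pairs):
    """List classes that inherit from both members of some disjoint pair."""
    classes = set(is_a) | {p for parents in is_a.values() for p in parents}
    bad = []
    for cls in sorted(classes):
        supers = superclasses(is_a, cls)
        if any(a in supers and b in supers for a, b in disjoint_pairs):
            bad.append(cls)
    return bad


DISJOINT = [("prokaryotic cell", "eukaryotic cell")]

before = {
    "fungal cell": ["eukaryotic cell"],
    "spore": ["fungal cell", "prokaryotic cell"],
}

# Assumed repair: each spore type gets a single, biologically plausible parent.
after = {
    "fungal cell": ["eukaryotic cell"],
    "actinomycete type spore": ["prokaryotic cell"],
    "mycetozoa type spore": ["eukaryotic cell"],
}

print(unsatisfiable(before, DISJOINT))
print(unsatisfiable(after, DISJOINT))
```

In the first hierarchy "spore" can have no instances, since nothing is both prokaryotic and eukaryotic; after the split the list of unsatisfiable classes is empty.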
34. Fifth Rule: Intelligibility of Definitions
The terms used in a definition should be
simpler (more intelligible) than the term to
be defined
otherwise the definition provides no
assistance
to human understanding
for machine processing
35. Sixth Rule: Keep it Real
When building or maintaining an ontology,
always think carefully about how classes
(types, kinds, species) relate to instances
in reality
36. The Rules
1. Univocity: Terms should have the same meanings
on every occasion of use
2. Positivity: Terms such as ‘non-mammal’ or ‘non-
membrane’ do not designate genuine classes.
3. Objectivity: Terms such as ‘unknown’ or
‘unclassified’ or ‘unlocalized’ do not designate
biological natural kinds.
4. Single Inheritance: No class in a classification
hierarchy should have more than one is_a parent
on the immediate higher level
5. Intelligibility of Definitions: The terms used in a
definition should be simpler (more intelligible) than
the term to be defined
6. Basis in Reality: When building or maintaining an
ontology, always think carefully about how classes
relate to instances in reality
7. Distinguish Universals and Instances
37. How to best describe biology?
Natural Language
+ Large existing body of information
+ Highly expressive
- Ambiguous (making it difficult and unreliable to compute on)
Computable Ontology
- Less expressive
+ Logical
+ Precise
41. Once a genome is sequenced…
What are the parts? (sequence features)
Protein coding genes (coding sequence)
Non-coding RNAs (rRNA, snoRNA, tRNA, microRNA,
antisense RNA)
Promoters and regulatory regions
Transposons
Recombination hotspots, origins of replication
Centromeres & telomeres
…
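Sequence features like these are commonly exchanged as GFF3, a tab-separated format with nine columns. The sketch below parses a single feature line; the `Feature` class, the function name and the example record are hypothetical, and details such as URL-escaped characters and multi-valued attributes are deliberately ignored.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    seqid: str
    type: str
    start: int
    end: int
    strand: str
    attributes: dict


def parse_gff3_line(line):
    """Parse one tab-separated GFF3 feature line; return None for comments and blanks."""
    if not line.strip() or line.startswith("#"):
        return None
    (seqid, _source, ftype, start, end,
     _score, strand, _phase, attrs) = line.rstrip("\n").split("\t")
    # Attributes are 'key=value' pairs separated by semicolons.
    attributes = dict(pair.split("=", 1) for pair in attrs.split(";") if "=" in pair)
    return Feature(seqid, ftype, int(start), int(end), strand, attributes)


# Hypothetical record: a gene on a scaffold, forward strand.
example = "\t".join(["scaffold1", "example", "gene", "1000", "2000",
                     ".", "+", ".", "ID=gene0001;Name=hypothetical"])
print(parse_gff3_line(example))
```

The third column holds the feature type, which is where a controlled vocabulary of sequence features does its work.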
47. APOLLO
annotation editing environment
BECOMING ACQUAINTED WITH APOLLO
Color by CDS frame,
toggle strands, set color
scheme and highlights.
Upload evidence files
(GFF3, BAM, BigWig),
add combination and
sequence search
tracks.
Query the genome using
BLAT.
Navigation and zoom.
Search for a gene
model or a scaffold.
Get coordinates and “rubber
band” selection for zooming.
Login
User-created
annotations.
Annotator
panel.
Evidence
Tracks
Stage and
cell-type
specific
transcription
data.
http://genomearchitect.org/web_apollo_user_guide
55. For example – ontologically described
genotypes/variants
intrinsic genotype = genomic variation complement + genomic background
[Figure: the genotype apchu745/+; fgf8ati282/ti282 (AB) decomposed through has_part links into genomic variation complement (apchu745/+; fgf8ati282/ti282), variant single locus complement (apchu745/+), variant allele (apchu745) and sequence alteration (hu745), shown alongside reference and altered sequences such as CGTAGC and CGTACC]
59. Ancestral inference
• Integration at points of common ancestry
• Infer “hidden” character of living organisms
• Explicitly leverage evolutionary relationships
[Figure: gene tree with leaves E.c., A.t. MTHFR1, A.t. MTHFR2, D.d., S.p., S.c. MET13, D.m., A.g., S.p., S.c. MET12, C.e., D.r., G.g., H.s. MTHFR, R.n., M.m., drawn along a divergence axis]
Biochemistry: purification and assay
Genetics: mutant phenotypes
60. What is transitive annotation?
Related genes have a common function because their common
ancestor had that function.
Not just an inference about one gene. It is also an inference for
The most recent common ancestor (MRCA)
Continuous inheritance since the MRCA
Potential inheritance by other descendants of the MRCA
Gene in
Yeast
Gene in
Mouse
Function X
Gene in
Opisthokont
MRCA
Function X
Function X
Gene in
Zebrafish
Function X
Function X
Gene in
Human
Function X
Function X
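The inference chain above can be sketched on a toy gene tree: find the MRCA of the experimentally annotated genes, then treat the function as inherited by every node below it. The tree shape, including the internal `vertebrate_ancestor` node, and all function names are assumptions for illustration; a curator would additionally judge where functions were gained or lost.

```python
# Child -> parent map for a toy gene tree; the internal 'vertebrate_ancestor'
# node is a hypothetical addition for illustration.
PARENT = {
    "yeast_gene": "opisthokont_mrca",
    "vertebrate_ancestor": "opisthokont_mrca",
    "mouse_gene": "vertebrate_ancestor",
    "zebrafish_gene": "vertebrate_ancestor",
    "human_gene": "vertebrate_ancestor",
}


def lineage(node):
    """Return the path from a node up to the root, inclusive."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def mrca(nodes):
    """Most recent common ancestor: first node on one lineage shared by all others."""
    shared = set.intersection(*(set(lineage(n)) for n in nodes))
    return next(n for n in lineage(nodes[0]) if n in shared)


def propagate(experimental):
    """Infer the function at the MRCA of annotated genes and at all its descendants."""
    root = mrca(sorted(experimental))
    all_nodes = set(PARENT) | set(PARENT.values())
    inferred = {n for n in all_nodes
                if root in lineage(n) and n not in experimental}
    return root, inferred


root, inferred = propagate({"yeast_gene", "mouse_gene"})
print(root)
print(sorted(inferred))
```

With experimental data only in yeast and mouse, the function is inferred for the ancestor and, through it, for the zebrafish and human genes that have no direct evidence.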
61.
• Green indicates experimental
• Black dot indicates direct experimental data.
• dot indicates a more general functional class inferred from ontology
• Red indicates NOT function for the gene
All nodes have persistent identifiers
which are retained across different
builds of the protein family trees.
cholinesterase
carboxylic ester hydrolase
Evolutionary event type:
duplication
speciation
62. • PAINTed nodes –
• 3 steps carried out by
curator
• Gain & Loss of function
• Inferred By Descendants
• Experimental annotations
provide evidence
• Inferred by Ancestry
• Propagation to
unannotated leaves
carboxylic ester
hydrolase
Node with loss of function
Gaudet, P., et al. (2011). Phylogenetic-based propagation of functional annotations within the Gene Ontology consortium. Briefings in Bioinformatics, 12(5), 449–62. doi:10.1093/bib/bbr042
Node with gain of function: cholinesterase
66. Motivation: multi-scale knowledge
models of mechanistic biology
Bai, J. P. F., & Abernethy, D. R. (n.d.). Systems Pharmacology to Predict Drug Toxicity: Integration Across Levels of Biological Organization, 451–473. doi:10.1146/annurev-pharmtox-011112-140248
67. A data model for causal ontology
annotations: “LEGO”
Activity
GO:nnnnnnn
What: <molecule>
68. A data model for causal ontology
annotations: “LEGO”
Activity
GO:nnnnnnn
What: <molecule>
Where: GO/CL/Uberon
69. A data model for causal ontology
annotations: “LEGO”
Activity
GO:nnnnnnn
What: <molecule>
Where: GO/CL/Uberon
Activity
GO:nnnnnnn
What: <molecule>
Where: GO/CL/Uberon
Relationship
RO:nnnnnnn
70. A data model for causal ontology
annotations: “LEGO”
Activity
GO:nnnnnnn
What: <molecule>
Where: GO/CL/Uberon
Activity
GO:nnnnnnn
What: <molecule>
Where: GO/CL/Uberon
Relationship
RO:nnnnnnn
Evidence: ECO, SEPIO
Source: PMID, ORCID, ...
71. Process
GO:nnnnnnn
A data model for causal ontology
annotations: “LEGO”
Activity
GO:nnnnnnn
What: <molecule>
Where: GO/CL/Uberon
Activity
GO:nnnnnnn
What: <molecule>
Where: GO/CL/Uberon
Relationship
RO:nnnnnnn
72. A data model for causal ontology
annotations: “LEGO”
GTPase activity
GO:0003924
What: TEM1 S000004529
Where: spindle pole
GO:0000922
GTPase inhibitor activity
GO:0005095
What: BFA1
S000003814
Where: spindle pole
GO:0000922
73. Exit from mitosis
GO:0010458
A data model for causal ontology
annotations: “LEGO”
GTPase activity
GO:0003924
What: TEM1 S000004529
Where: spindle pole
GO:0000922
GTPase inhibitor activity
GO:0005095
What: BFA1
S000003814
Where: spindle pole
GO:0000922
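The boxes in these slides map naturally onto a small record structure. The following is a sketch of that shape only, not the actual LEGO implementation: the class and field names are assumptions, the direction of the edge (BFA1 acting on TEM1) and the attachment of both activities to the process are my reading of the diagram, and the relation is left as the placeholder the slides use.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Activity:
    go_id: str                     # molecular function term
    label: str
    what: str                      # the molecule carrying out the activity
    where: str                     # location term (GO/CL/Uberon)
    part_of: Optional[str] = None  # enclosing process term, if any


@dataclass
class CausalEdge:
    subject: Activity
    relation: str                                      # an RO identifier
    obj: Activity
    evidence: List[str] = field(default_factory=list)  # ECO/SEPIO terms
    source: List[str] = field(default_factory=list)    # PMID, ORCID, ...


tem1 = Activity("GO:0003924", "GTPase activity", "TEM1 S000004529",
                "GO:0000922", part_of="GO:0010458")
bfa1 = Activity("GO:0005095", "GTPase inhibitor activity", "BFA1 S000003814",
                "GO:0000922", part_of="GO:0010458")

# The slides leave the relation as 'RO:nnnnnnn'; the placeholder is kept here,
# and evidence and source stay empty because the slides give no values.
edge = CausalEdge(subject=bfa1, relation="RO:nnnnnnn", obj=tem1)
print(edge.subject.label, edge.relation, edge.obj.label)
```

Keeping evidence and source on the edge rather than on the activities means each causal claim can be traced to its own supporting data.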