1. THE WORLD OF BIOCURATION
OPTIMIZING ITS IMPACT
April 7, 2014—Seventh International Biocuration Conference
2. WHAT IS A BIOCURATOR?
Someone who is responsible for the care and supervision of biological knowledge resources and their use
3. WHAT DO BIOCURATORS DO TODAY?
• Credits to Kaveh Bazargan ᔥ
• @kaveh1000
21. ISB
[Cycle diagram: designing experiments, collecting data, writing up results, reviewing conclusions, capturing knowledge]
22. BIOCURATION INVERSION
~300 BIOCURATORS
DESIGNING EXPERIMENTS
COLLECTING DATA
WRITING UP RESULTS
REVIEWING CONCLUSIONS
CAPTURING KNOWLEDGE
http://www.nsf.gov/statistics/nsf13331/pdf/nsf13331.pdf
HUNDREDS OF THOUSANDS OF GRAD STUDENTS
POST-DOCS
LABORATORIES
JOURNALS
24. EARLY INTERVENTION: SUPPORTING STANDARDS
• Promote community-accepted identifiers, ontologies,
& formats
25. SUPPORT STANDARDS, THEY’RE OUR FRIEND
• November, 1999
• 45 biologists
• 14 days
• 140 megabases of Drosophila genome
• Published in March 2000
GENE ONTOLOGY, ET AL.
30. QUEST FOR ORTHOLOGS
• 30 phylogenomic databases
• Vary in # of species, taxonomic range, sampling density,
and methodology
• Joint benchmarking effort
• Only possible through the use of shared reference
proteomes and formats
questfororthologs.org/ — www.ebi.ac.uk/reference_proteomes
32. EARLY INTERVENTION: SUPPORTING STANDARDS
• Promote community-accepted identifiers, ontologies, & formats
• Develop and follow guidelines (paper and web-based)
• e.g. Gaudet, P., et al. Towards BioDBcore: a community-defined
information specification for biological databases. Database
2011. PMCID: PMC3017395
• Resource Identification Initiative
• www.force11.org/Resource_identification_initiative
• Vasilevsky NA, et al. On the reproducibility of science: unique
identification of research resources in the biomedical literature.
PeerJ. 2013 Sep 5;1:e148. doi: 10.7717/peerj.148. PubMed
PMID: 24032093; PubMed Central PMCID: PMC3771067.
33. EARLY INTERVENTION: SUPPORTING STANDARDS
• Promote community-accepted identifiers, ontologies,
& formats
• Embed community accepted standards in the lab
environment
35. KNOCKOUT MOUSE PROJECT 2
• Broad standardized phenotyping of knockout mice on a
standard genetic background
• Data collection from many centres
• www.mousephenotype.org
Cindy Smith
36. PROTOCOLS ARE STANDARDIZED
REQUIRE USE OF PARTICULAR ONTOLOGY TERMS TO DESCRIBE PHENOTYPE
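As a rough illustration of what "require use of particular ontology terms" could mean in a data pipeline, the sketch below rejects phenotype records whose term is not in a protocol's allowed set. Everything here is an assumption made for illustration: the `validate_record` function, the record layout, and the `EX:` term IDs are hypothetical placeholders, not the project's actual software or real ontology entries.

```python
import re

# Hypothetical allowed-term table for one phenotyping protocol.
# The IDs and labels are placeholders, not real ontology entries.
ALLOWED_TERMS = {
    "EX:0000001": "increased body weight",
    "EX:0000002": "decreased grip strength",
}

# Assumed ID shape: a letter prefix, a colon, and seven digits.
TERM_ID_PATTERN = re.compile(r"^[A-Za-z]+:\d{7}$")


def validate_record(record):
    """Return a list of problems found in one phenotype record."""
    problems = []
    term_id = record.get("term_id", "")
    if not TERM_ID_PATTERN.match(term_id):
        problems.append(f"malformed term ID: {term_id!r}")
    elif term_id not in ALLOWED_TERMS:
        problems.append(f"term not allowed by protocol: {term_id}")
    return problems


records = [
    {"animal": "m1", "term_id": "EX:0000001"},
    {"animal": "m2", "term_id": "heavy"},
    {"animal": "m3", "term_id": "EX:0000009"},
]
report = {r["animal"]: validate_record(r) for r in records}
```

Checking at entry time, rather than after submission, is one way a lab could keep free-text phenotype descriptions out of the dataset.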
37. EARLY INTERVENTION: SUPPORTING STANDARDS
• Promote community-accepted identifiers, ontologies,
& formats
• Embed community accepted standards in the lab
environment
• Work with labs to embed standards into their data
generation pipeline
38. EARLY INTERVENTION: SUPPORTING STANDARDS
• Promote community-accepted identifiers, ontologies,
& formats
• Embed community accepted standards in the lab
environment
• Stealth standards
40. STANDARDS THROUGH UTILITY: APOLLO
CSIRO video: demo at genomearchitect.org
46. TOOLS FOR THE COMMUNITY
• Web-based so researchers anywhere have access
• Concurrent access supports real-time collaboration
• Built-in support for standards (transparently compliant)
• Automatic generation of ready-made computable
data
• Client-side application relieves server bottleneck and
supports privacy
47. EARLY INTERVENTION: SUPPORTING STANDARDS
• Promote community-accepted identifiers, ontologies, & formats
• Embed community accepted standards in the lab environment
• Stealth standards
• Re-purpose internal curation tools for external users
• Provide on-line documentation, hands-on training and rapid-response user
help
• Work with educators to make these tools an integral part of the curriculum
• e.g. CACAO (Critical Assessment of Community Annotation using
Ontologies), ecoliwiki.net/colipedia/index.php/CACAO_0.1
• DNA subway (Apollo)
49. SUBMITTING DATA IN A STRUCTURED WAY
• CANTO: curation.pombase.org
• Structured Digital Abstracts
• Identifiers for all named genes, proteins, metabolites or other objects in the article
• Main results described in simple ontology terms
• Experimental evidence types
• Not only a synopsis of the results but computer-readable
• Gerstein, M., et al. Structured digital abstract makes text mining easy. Nature 447, 142 (10 May 2007) | doi:10.1038/447142a.
• Minimal Information reporting guidelines
• http://mibbi.sourceforge.net/portal.shtml
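To make the idea of a structured digital abstract more concrete, here is a minimal sketch of one as a data structure holding identifiers, ontology-term results, and evidence types, serialized to JSON. The class names, field names, and `EXAMPLE:` identifiers are all hypothetical; no journal or the cited proposal is assumed to use this layout.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Finding:
    subject_id: str      # identifier of the gene, protein, or metabolite
    ontology_term: str   # ontology term describing the main result
    evidence_type: str   # kind of experimental evidence


@dataclass
class StructuredAbstract:
    article_id: str
    entity_ids: list = field(default_factory=list)  # all named objects
    findings: list = field(default_factory=list)    # main results

    def to_json(self):
        # asdict() also converts the nested Finding objects to dicts.
        return json.dumps(asdict(self), sort_keys=True)


# Placeholder identifiers, for illustration only.
abstract = StructuredAbstract(
    article_id="EXAMPLE:article-1",
    entity_ids=["EXAMPLE:gene-1"],
    findings=[Finding("EXAMPLE:gene-1", "EXAMPLE:term-1", "EXAMPLE:evidence-1")],
)
record = json.loads(abstract.to_json())
```

Because the record is plain JSON, a database could load it directly instead of re-extracting the same facts from the article text.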
55. PUBLISHING
• First there were letters
• Then Henry Oldenburg created the first scientific journal in 1665
• Result: too much to absorb
Washed away on the sea of information
56. CONSEQUENTLY…
PEER AND EDITORIAL REVIEW BECAME A FILTER
57. THE MEDIUM OF PUBLICATION IS CHANGING
• Figshare: figshare.org
• iDigBio: www.idigbio.org
• Dryad: datadryad.org
• eLife: www.elifesciences.org
• Unlike journal articles, the scale of web-native publishing may overwhelm attempts at manual curation (using current strategies)
59. SOME SAY NO…
“…powerful, online filters will distill communities’ impact judgements algorithmically”
Scholarship: Beyond the paper. Jason Priem. Nature 495, 437–440 (28 March 2013)
60. DO WE NEED TO CURATE?
• Resolution of differences
• Clarity, eliminating noise
• Validation & design of automated methods
61. EVEN A PLACE LIKE GOOGLE USES CURATORS (*AND SOFTWARE)
• Hundreds of operators per country
• Multiple kinds of errors: overlapping jurisdictions, accidental merges, mismatches between road maps and satellite images, etc.
• Every road that you see has been hand-massaged
http://www.theatlantic.com/technology/archive/2012/09/how-google-builds-its-maps-and-what-it-means-for-the-future-of-everything/261913/
62. DO WE NEED TO CURATE?
• Resolution of differences
• Clarity, eliminating noise
• Validation & design of automated methods
67. CLARITY
• Answer boxes: Quick answers to concrete questions
• Much of this information comes from Freebase, which is structured in terms of entities and properties
Robert West, et al. Knowledge Base Completion via Search-Based Question Answering. http://www.cs.ubc.ca/~murphyk/Papers/www14.pdf
WWW’14 April 7–11, 2014, Seoul, Korea. ACM 978-1-4503-2744-2/14/04.
DOI:2568032
68. DO WE NEED TO CURATE?
• Resolution of differences
• Clarity, eliminating noise
• Validation & design of automated methods
69. LITERATURE IS INFORMATIVE BUT IS NOT INFORMATION
• PDF is still the dominant form of distribution
• PDF “Annotation”
• UTOPIA, www.utopiadocs.com
• DOMEO, swan.mindinformatics.org
• Textpresso, www.textpresso.org
• All of these are still lacking domain specifics (or need to be taught)
• FORCE11, www.force11.org
• Common goal is advancing scientific communications
• Beyond the PDF
76. VALIDATION AND DESIGN OF AUTOMATED METHODS
• Requires trusted reference datasets!
• Biocurators are partners with developers!
[Cycle diagram: write/modify software, run the algorithm, evaluate results]
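The "evaluate results" step of this cycle only works if there is a curated reference to compare against. A minimal sketch, assuming annotations can be represented as (gene, term) pairs, is to score an algorithm's output by precision and recall against the trusted set. The `evaluate` function and the example pairs are hypothetical, not drawn from any particular resource.

```python
def evaluate(predicted, reference):
    """Compare predicted annotations with a trusted reference set.

    Returns (precision, recall); both are 0.0 when undefined.
    """
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical curated reference and algorithm output.
reference = {("geneA", "term1"), ("geneB", "term2"), ("geneC", "term3")}
predicted = {("geneA", "term1"), ("geneB", "term9")}
precision, recall = evaluate(predicted, reference)
```

Rerunning the same comparison after each software change gives developers and curators a shared number to discuss.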
77. DO WE NEED TO CURATE?
“…powerful, online filters will distill communities’ impact judgements algorithmically”
Scholarship: Beyond the paper. Jason Priem. Nature 495, 437–440 (28 March 2013)
78. DO WE NEED TO CURATE?
“‘Big data hubris’ is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis.”
The Parable of Google Flu: Traps in Big Data Analysis. David Lazer et al. Science 14 March 2014: Vol. 343 no. 6176 pp. 1203-1205
80. DO WE NEED TO CURATE?
• Yes
• But…
81. SYSTEMATIC REVIEW & CRITICISM IS REQUIRED
OUR STRENGTH IS IN QUALITY OF THE INFORMATION WE CAN PROVIDE
82. HOW CAN BIOCURATORS ADDRESS CRITICISMS?
“…literature curated datasets have inherent reliability difficulties…”
Cusick, M., et al. Literature-curated protein interaction datasets. Nat Methods. Jan 2009; 6(1): 39–46. PMCID: PMC2683745
83. THE RISK (BY ANALOGY)
Greenberg, S., How citation distortions create unfounded authority: analysis of a citation network. BMJ July 2009; 339 doi: http://dx.doi.org/10.1136/
84. WE'RE RESPONSIBLE FOR THE QUALITY
• “Reviewing the quality of the data is an obligation of
any entity that assumes responsibility over the data.”
• Limor Peer et al., IDCC 2014
85. PAINT APOPTOSIS - SUMMARY
• 52 families annotated:
- 8 were participants in execution phase of apoptosis;
• 44 others are either:
A. upstream of apoptosis
B. phenotypes
C. targets
88. Example 1: Protein (cytochrome c) upstream of
apoptosis execution
Cytochrome c is directly involved in apoptotic DNA fragmentation
➢ [Cells] – [cytochrome c] = No apoptotic DNA fragmentation
➢ [Cells] – [cytochrome c] + [cytochrome c] = apoptotic DNA fragmentation
89. Example 2: Phenotype of reduced cell survival and
increased DNA fragmentation
• E3 ubiquitin-protein ligase TRAF7
was annotated to execution phase of apoptosis
➢ Exogenous expression of TRAF7
➢ No other data in terms of where
in apoptosis this may be.
➢ All we know is altering TRAF7
levels affects apoptosis.
92. Example 3: Target
DSG2 was annotated to execution phase of
apoptosis
DSG2 is a *target* of a protease (caspase), and
although its degradation indeed seems to be a part of
apoptosis it does not *mediate* apoptosis.
93. PROVE THE NEED FOR BIOCURATION
• Publish: Quantitative improvements before/after
• Publish: Curator consistency studies
• Publish: Independent external reviews
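One common way to quantify curator consistency is Cohen's kappa, which corrects the raw agreement between two annotators for agreement expected by chance. The sketch below is an illustration only: the `cohen_kappa` helper and the two label lists are hypothetical, and the slides do not say which statistic a consistency study should use.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two curators for the same four protein families.
curator_1 = ["execution", "upstream", "target", "upstream"]
curator_2 = ["execution", "upstream", "upstream", "upstream"]
kappa = cohen_kappa(curator_1, curator_2)
```

Here the curators agree on three of four items, yet kappa is about 0.56, because "upstream" is common enough that some agreement would occur by chance.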
94. RECOGNITION & CREDIT
ORCID.ORG
101. WHAT IS A BIOCURATOR?
• A highly skilled and trained keeper of our biological
heritage of knowledge.
• A content specialist who understands the research and
can succinctly distill biological research results into
computable data
• Considers the ease of finding this information, its
relatedness to other information, and its research and
educational usability
102. MODELS RECAPITULATE VARIOUS PHENOTYPIC ASPECTS OF DISEASE
GENOTYPE: PHENOTYPE
• B6.Cg-Alms1foz/foz/J: increased weight, adipose tissue volume, glucose homeostasis altered
• ALMS1(NM_015120.4) [c.10775delC] + [-]: obesity, diabetes mellitus, insulin resistance
• kcnj11c14/c14; insrt143/+(AB): increased food intake, hyperglycemia, insulin resistance
103. MODELS RECAPITULATE VARIOUS PHENOTYPIC ASPECTS OF DISEASE
GENOTYPE: PHENOTYPE
• B6.Cg-Alms1foz/foz/J: increased weight, adipose tissue volume, glucose homeostasis altered
• ?: obesity, diabetes mellitus, insulin resistance
• kcnj11c14/c14; insrt143/+(AB): increased food intake, hyperglycemia, insulin resistance
104. RESEARCH RESOURCES
Doelken S C et al. Dis. Model. Mech. 2013;6:358-372
105. CROSS-SPECIES PHENOTYPE COMPARISONS BY SEMANTIC SIMILARITY
Smedley D et al. Database. 2013; bat025
Mungall CJ et al. Genome Biol. 2010; 11(1):R2
Washington N et al. Plos Biol 2009; e1000247
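A simplified way to see how phenotype profiles from different species can be compared is to expand each profile with its ancestor terms in a shared hierarchy and then measure set overlap. The sketch below uses plain Jaccard similarity over a tiny, made-up hierarchy. It is a stand-in for illustration: the `PARENTS` table and both functions are hypothetical, and the cited work may use different measures, such as ones based on information content.

```python
# Hypothetical toy hierarchy: child term -> parent terms.
PARENTS = {
    "obesity": ["increased body mass"],
    "increased weight": ["increased body mass"],
    "increased body mass": ["growth phenotype"],
    "hyperglycemia": ["glucose homeostasis altered"],
    "glucose homeostasis altered": ["metabolic phenotype"],
}


def with_ancestors(terms):
    """Return the terms plus all of their ancestors in PARENTS."""
    closure, stack = set(), list(terms)
    while stack:
        term = stack.pop()
        if term not in closure:
            closure.add(term)
            stack.extend(PARENTS.get(term, []))
    return closure


def jaccard_similarity(profile_a, profile_b):
    """Shared terms divided by all terms, after ancestor expansion."""
    a, b = with_ancestors(profile_a), with_ancestors(profile_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


human = ["obesity", "hyperglycemia"]
mouse = ["increased weight", "glucose homeostasis altered"]
score = jaccard_similarity(human, mouse)
```

The two profiles share no term directly, but they overlap through common ancestors, which is the point of comparing by semantic similarity rather than by exact string match.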
107. PHENOTYPIC INTERPRETATION OF VARIANTS IN EXOMES (PHIVE)
• Whole exome
• Remove off-target and common variants
• Variant score from allele freq and pathogenicity
• Phenotype score from phenotypic similarity
• PhenIX/PhIVE score to give final candidates
http://monarchinitiative.org
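The steps on this slide can be sketched as a small ranking function. This is not the published PhIVE or PhenIX scoring: the `Variant` fields, the frequency cutoff, the variant-score formula, and the plain average used to combine the two scores are all assumptions chosen to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str
    allele_freq: float     # population allele frequency, 0..1
    pathogenicity: float   # predicted pathogenicity, 0..1
    on_target: bool


def rank_candidates(variants, phenotype_scores, max_freq=0.01):
    """Filter variants, score them, and return (gene, score) best first."""
    ranked = []
    for v in variants:
        # Remove off-target and common variants.
        if not v.on_target or v.allele_freq > max_freq:
            continue
        # Variant score from allele frequency and pathogenicity
        # (assumed formula: rarer and more pathogenic scores higher).
        variant_score = v.pathogenicity * (1.0 - v.allele_freq / max_freq)
        # Phenotype score from phenotypic similarity, looked up per gene.
        phenotype_score = phenotype_scores.get(v.gene, 0.0)
        # Combined score (assumed: plain average of the two).
        ranked.append((v.gene, (variant_score + phenotype_score) / 2))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical input data.
variants = [
    Variant("GENE_A", 0.0, 0.9, True),
    Variant("GENE_B", 0.2, 0.9, True),    # common, filtered out
    Variant("GENE_C", 0.0, 0.95, True),
    Variant("GENE_D", 0.0, 0.99, False),  # off-target, filtered out
]
phenotype_scores = {"GENE_A": 0.8, "GENE_C": 0.1}
ranked = rank_candidates(variants, phenotype_scores)
```

In this toy input GENE_C has the more damaging variant, but GENE_A ranks first because its known phenotypes match the patient better, which is the contribution curated phenotype data makes to the ranking.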
108. CONFIRMED DIAGNOSES
• Infantile Parkinsonism-dystonia
• Wiedemann Steiner syndrome
• de novo SYNGAP1 mutation leading to autosomal dominant mental retardation
• Frank-ter Haar syndrome
• Infantile hypophosphatasia
• … (~28%)
114. RELATEDNESS ACROSS BIOLOGY
• Bio-Curator, not bio-Archivist
• Actively trying to represent current best understanding
• Support interoperability
• Support research and educational usability
• Support inference
• Not just for supporting searches, not just for finding
PDF/online papers!
122. BIODIVERSITY DATA JOURNAL
From writing, submission, peer-review, editing, publication to dissemination!
127. WHAT CAN ISB DO?
• Tangible support of standards efforts
• QfO, RII, MI, publish guidelines, validators …
• Create a curation mindset across the entire life cycle
• Support embedded/repurposed software, education, actively
engage with text-miners, provide on-line support …
• Prove the necessity for curation
• Publish studies, greater emphasis on review and quality (assessment)
• Work with traditional publishers
• FORCE11, structured submissions
128. WHAT CAN YOU DO?
• Consider
• The ease of finding information
• Its relatedness to other information
• Its research and educational usability
129. RESEARCH??
YOU, THE BIOCURATOR
ISB
130. ACKNOWLEDGEMENTS AND THANKS
YOU ARE NOT ALONE