This document discusses the need for a digital infrastructure for biomedicine to facilitate access to biomedical data and knowledge. It describes existing linked open data resources like Bio2RDF that integrate data from sources like DrugBank, CTD, and UniProt. It also discusses the use of ontologies like the Semantic Science Integrated Ontology and the Translational Medicine Ontology to semantically integrate these resources. Finally, it introduces tools and frameworks like HyQue that use this digital infrastructure to facilitate knowledge discovery and hypothesis evaluation by leveraging semantic web technologies.
From Biological Data to Clinical Applications: Positioning a digital infrastructure for the future of biomedicine.
1. From biological data to clinical applications:
positioning a digital infrastructure for the
future of biomedicine
Michel Dumontier, Ph.D.
Associate Professor of Bioinformatics, Department of Biology, School of
Computer Science, Institute of Biochemistry, Carleton University
Professeur Associé, Université Laval
Ottawa Institute of Systems Biology
Ottawa-Carleton Institute of Biomedical Engineering
1 DERI::Digital Infrastructure for Biomedicine
5. uncovering a sufficient amount of evidence to support/refute
a hypothesis is becoming increasingly difficult
it requires a lot of digging around
6. continuous growth in research literature
Source: http://www.nlm.nih.gov/bsd/stats/cit_added.html
7. access to increasing amounts of biomedical data
8. access to the most effective software to
predict, compare and evaluate
9. ultimately, we answer questions by building
sophisticated workflows
10. What if we could automatically answer a
question using available data and services?
11. The Semantic Web
is the new global web of knowledge
It involves standards for publishing, sharing and querying
facts, expert knowledge and services
It is a scalable approach to the
discovery of independently formulated
and distributed knowledge
12. Link all the
data!!!
13. something you can search,
lookup, link to, query for
and check consistency and
veracity of
14. an emerging linked data network
“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”
15. Life Science Data Contributors
• Bio2RDF
• Chem2Bio2RDF
• LODD (HCLS)
16. • > 40 biological datasets from independent
providers
• > 3 billion triples
17. linked data for the life sciences
An Open Source Project for the Provision of
Scalable, Decentralized Data with Global Mirroring
and Customizable Query Resolution
Francois Belleau, Laval University
Marc-Alexandre Nolin, Laval University
Peter Ansell, Queensland University of Technology
Michel Dumontier, Carleton University
18. Bio2RDF resources are identified using IRIs
• Data providers’ record identifiers are
maintained from source
http://bio2rdf.org/namespace:identifier
• E.g.: DrugBank’s resource IRI for
Leucovorin
http://bio2rdf.org/drugbank:DB00650
19. vocabulary and resource namespaces are used
to describe auxiliary resources
• Vocabulary namespaces are used for dataset
specific types and predicates
http://bio2rdf.org/drugbank_vocabulary:Drug
• Entities arising from n-ary relations are
identified in the resource namespace
http://bio2rdf.org/drugbank_resource:DB00440_DB00650
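The IRI conventions from slides 18 and 19 can be sketched in a few lines of Python. The helper names (`resource_iri`, `vocabulary_iri`, `relation_iri`) are hypothetical and not part of any Bio2RDF tooling; only the URL patterns come from the slides, and joining the identifiers of an n-ary relation with an underscore is inferred from the single example shown.

```python
BASE = "http://bio2rdf.org/"


def resource_iri(namespace: str, identifier: str) -> str:
    """IRI for a record, keeping the data provider's own identifier."""
    return f"{BASE}{namespace}:{identifier}"


def vocabulary_iri(namespace: str, term: str) -> str:
    """IRI for a dataset-specific type or predicate."""
    return f"{BASE}{namespace}_vocabulary:{term}"


def relation_iri(namespace: str, *identifiers: str) -> str:
    """IRI for an entity arising from an n-ary relation.

    Assumption: the participating identifiers are joined with '_'.
    """
    return f"{BASE}{namespace}_resource:{'_'.join(identifiers)}"


print(resource_iri("drugbank", "DB00650"))
print(vocabulary_iri("drugbank", "Drug"))
print(relation_iri("drugbank", "DB00440", "DB00650"))
```

Running the sketch reproduces the three example IRIs given on the slides.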
21. Every Bio2RDF dataset now contains
provenance metadata
22. Bio2RDF types include biological,
information content & processual entities
CTD: Chemical, Disease, Chemical-Disease Interaction,
Chemical-Gene Interaction
Entrez Gene: Gene, Model Organism, Publication
HGNC: Accession Number, Gene, Gene Symbol
iRefIndex: Protein Complex, Protein Interaction
MGI: Gene Marker, Gene Symbol
PharmGKB: Association, Disease, Drug, Gene
SGD: Enzyme, Pathway, Protein, RNA, Reaction,
Location, Experiment
23. Heterogeneous biological data on the
semantic web is difficult to query
Question: Find all proteins that interact with beta amyloid (uniprot:P05067)
SELECT * WHERE {
  ?protein a bio2rdf:Protein .
  ?protein bio2rdf:interacts_with uniprot:P05067 .
}
Which protein type is meant: UniProt Protein, PDB Protein or iRefIndex Protein?
Which kind of interaction: physical, genetic or pathway?
24. Uncertainty in what is being said
with a simple triple
imagine a statement between two types, C1 and C2
C1 R C2
nucleus part-of cell
does it mean
For every C1 there is a C2 that is related by R?
For every C2 there is a C1 that is related by R?
For some C1, there is a C2 that is related by R, or vice versa?
Every C1 is a kind of C2? or vice versa?
C1s and C2s are the same kind?
There is no C1 that is also a C2?
we need to commit to a particular meaning that can be universally
interpreted – this formalization will then hold across datasets
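The candidate readings listed above can be written out in first-order logic, which makes the differences between them explicit. This is one possible formalization, offered as a sketch; the slide itself does not commit to a notation.

```latex
\begin{align*}
&\forall x\,\bigl(C_1(x) \rightarrow \exists y\,(C_2(y) \wedge R(x,y))\bigr)
  && \text{every } C_1 \text{ is related by } R \text{ to some } C_2 \\
&\forall y\,\bigl(C_2(y) \rightarrow \exists x\,(C_1(x) \wedge R(x,y))\bigr)
  && \text{every } C_2 \text{ has some } C_1 \text{ related to it by } R \\
&\exists x\,\exists y\,\bigl(C_1(x) \wedge C_2(y) \wedge R(x,y)\bigr)
  && \text{some } C_1 \text{ is related by } R \text{ to some } C_2 \\
&\forall x\,\bigl(C_1(x) \rightarrow C_2(x)\bigr)
  && \text{every } C_1 \text{ is a kind of } C_2 \\
&\forall x\,\bigl(C_1(x) \leftrightarrow C_2(x)\bigr)
  && C_1 \text{ and } C_2 \text{ are the same kind} \\
&\neg\exists x\,\bigl(C_1(x) \wedge C_2(x)\bigr)
  && \text{no } C_1 \text{ is also a } C_2
\end{align*}
```

For "nucleus part-of cell", the first reading says every nucleus is part of some cell, while the second says every cell has some nucleus as a part, which does not hold for prokaryotic cells. Choosing one reading and stating it formally is what lets the statement be interpreted the same way across datasets.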
25. RDF-based Linked Data is a great
first step, but it’s not enough.
From linked data to linked knowledge through syntactic and semantic normalization.
26. ontology as a
strategy to
formally represent
and integrate
knowledge
27. Have you heard of OWL?
28. The Web Ontology Language
(OWL) Has Explicit Semantics
Can therefore be used to capture knowledge in a
machine understandable way
29. SIO provides an OWL ontology for the
representation of diverse biomedical knowledge
31. Semantic data integration, consistency checking
and query answering over Bio2RDF with the
Semanticscience Integrated Ontology (SIO)
[Diagram: in the dataset layer, uniprot:P05067 is a uniprot:Protein and refseq:NP_009225.1 is a refseq:Protein; in the ontology layer, uniprot:Protein and refseq:Protein are each a sio:protein. Dataset and ontology together form the knowledge base.]
Querying Bio2RDF Linked Open Data with a Global Schema. Alison Callahan, José Cruz-Toledo and Michel Dumontier. To be presented at Bio-ontologies 2012.
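The effect of mapping dataset-specific types to a shared SIO class can be illustrated with a toy in-memory example. The type assertions mirror the identifiers on the slide; the function names, the single-parent class hierarchy and the absence of cycles are simplifying assumptions, and a real deployment would rely on an OWL reasoner or SPARQL rather than this code.

```python
# Asserted types of two records, one per dataset.
TYPE_OF = {
    "uniprot:P05067": "uniprot:Protein",
    "refseq:NP_009225.1": "refseq:Protein",
}

# Dataset-specific types mapped to the shared ontology class.
SUBCLASS_OF = {
    "uniprot:Protein": "sio:protein",
    "refseq:Protein": "sio:protein",
}


def superclasses(cls):
    """Return cls followed by every class reachable through SUBCLASS_OF."""
    chain = [cls]
    while chain[-1] in SUBCLASS_OF:
        chain.append(SUBCLASS_OF[chain[-1]])
    return chain


def instances_of(target):
    """Instances whose asserted type is target or one of its subclasses."""
    return sorted(i for i, t in TYPE_OF.items() if target in superclasses(t))


print(instances_of("sio:protein"))      # both records
print(instances_of("uniprot:Protein"))  # only the UniProt record
```

A query phrased against `sio:protein` retrieves records from both datasets, whereas a query against a dataset-specific type retrieves only that dataset's records.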
32. Use CTD & SGD to find all chemicals and proteins
that participate in the same GO process
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX sio: <http://semanticscience.org/resource/>
SELECT *
FROM <http://bio2rdf.org/ctd>
WHERE {
  ?chemical a sio:SIO_010004 .               # 'chemical entity'
  ?chemical rdfs:label ?chemicalLabel .
  ?chemical sio:SIO_000062 ?process .        # 'is participant in'
  ?process rdfs:label ?processLabel .
  # evaluate the protein and gene patterns at the remote SGD endpoint
  SERVICE <http://sgd.bio2rdf.org/sparql> {
    ?protein a sio:SIO_010043 .              # 'protein'
    ?protein sio:SIO_000062 ?process .
    ?gene sio:SIO_010078 ?protein .          # 'encodes'
    ?gene rdfs:label ?geneLabel .
  }
}
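Conceptually, the SERVICE clause evaluates its pattern at the remote endpoint and joins the resulting bindings with the local ones on their shared variables, here `?process`. The sketch below shows that join on made-up bindings; the `ex:` identifiers are placeholders and not real CTD or SGD records, and actual SPARQL engines may optimize the join differently.

```python
def join(left, right):
    """Natural join of two lists of variable bindings (dicts)."""
    joined = []
    for a in left:
        for b in right:
            shared = a.keys() & b.keys()
            # Keep the pair only if all shared variables agree.
            if all(a[v] == b[v] for v in shared):
                joined.append({**a, **b})
    return joined


# Hypothetical bindings from the local CTD graph.
ctd_rows = [
    {"chemical": "ex:chemA", "process": "ex:process1"},
    {"chemical": "ex:chemB", "process": "ex:process2"},
]

# Hypothetical bindings returned by the remote SGD endpoint.
sgd_rows = [
    {"protein": "ex:protX", "gene": "ex:geneX", "process": "ex:process1"},
    {"protein": "ex:protY", "gene": "ex:geneY", "process": "ex:process3"},
]

rows = join(ctd_rows, sgd_rows)
for row in rows:
    print(row["chemical"], row["protein"], row["process"])
```

Only the chemical and protein that share `ex:process1` survive the join, which is the behaviour the query relies on to pair chemicals and proteins in the same GO process.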
33. More sophisticated OWL-based Data Integration,
Consistency Checking and Discovery
• Checking the consistency of semantic annotations [1]
– Formalized semantic annotations in SBML models as OWL axioms.
Automated reasoning uncovered inconsistencies in 16 models.
• e.g. alpha-D-glucose phosphate is not the required ATP in an ATP-dependent
reaction (GO + ChEBI + disjoint + closure axioms)
• Finding significant biomedical associations [2]
– found significant associations between genes, drugs, diseases and
pathways using DrugBank, PharmGKB, CTD, PID across categories
of drugs (ChEBI, ATC, MeSH) and diseases (DO, MeSH)
– 22,653 pathway-disease type associations (6,304 over; 16,349 under)
• carcinosarcoma (DOID:4236) and Zidovudine Pathway (PharmGKB:PA165859361)
– 13,826 pathway-chemical type associations (12,564 over; 1,262 under)
• drug clopidogrel (CHEBI:37941) with Endothelin signaling pathway
(PharmGKB:PA164728163)
http://pharmgkb-owl.googlecode.com
1. Integrating systems biology models and biomedical ontologies. BMC Systems Biology. 2011. 5 : 124
2. Identifying aberrant pathways through integrated analysis of knowledge in pharmacogenomics. Bioinformatics. 2012. in press
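The kind of inconsistency described in the first bullet can be illustrated with a deliberately simplified check: an annotation is flagged when the class of the annotated entity is declared disjoint with the class the reaction requires. This is a toy stand-in with hypothetical names, not the OWL reasoning used in the cited work; a real reasoner also takes class hierarchies and closure axioms into account.

```python
# Hypothetical disjointness declarations between chemical classes.
DISJOINT = {frozenset({"ATP", "alpha-D-glucose phosphate"})}


def inconsistent(required_class: str, annotated_class: str) -> bool:
    """True when the annotated class is declared disjoint with the required one."""
    return frozenset({required_class, annotated_class}) in DISJOINT


# A model annotates the reactant of an ATP-dependent reaction.
print(inconsistent("ATP", "alpha-D-glucose phosphate"))  # flagged
print(inconsistent("ATP", "ATP"))                        # not flagged
```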
35. Integration of patient record data with Linked Open Data
through the Translational Medicine Ontology
223 mappings : 60 TMO classes to 201 target classes
from over 40 ontologies and 8 datasets
36. Formalization of the Dubois
AD diagnostic criteria for
decision support
# the panel is a textual entity
dubois:panel2 a iao:IAO_0000300 .
dubois:panel2 rdfs:label "Alzheimer Disease diagnostic criteria as reported in panel 2 of dubois et al - pubmed:17616482 [dubois:panel2]" .
# the panel is about alzheimer disease
dubois:panel2 iao:is_about diseasome:74 .
# the panel is from the article
dubois:panel2 ro:part_of <http://bio2rdf.org/pubmed:17616482> .
# the panel is about diagnostic criterion
dubois:panel2 iao:is_about tmo:TMO_0068 .
# inclusion criterion
dubois:10 rdfs:label "Proven AD autosomal dominant mutation within the immediate family [dubois:10]" ;
    a tmo:TMO_0069 ;
    ro:part_of dubois:panel2 ;
    iao:is_about diseasome:74 .
# exclusion criterion
dubois:16 rdfs:label "Major depression [dubois:16]" ;
    a tmo:TMO_0070 ;
    ro:part_of dubois:panel2 ;
    iao:is_about diseasome:74 .
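Once criteria are formalized this way, a decision-support tool can match them against a patient's recorded findings. The sketch below is only an illustration of that idea: the `Criterion` class and `evaluate` function are hypothetical, the two criteria are the ones shown above, and the code reports matches without attempting to apply the full diagnostic logic of the Dubois criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    identifier: str
    label: str
    kind: str  # "inclusion" or "exclusion"


CRITERIA = [
    Criterion(
        "dubois:10",
        "Proven AD autosomal dominant mutation within the immediate family",
        "inclusion",
    ),
    Criterion("dubois:16", "Major depression", "exclusion"),
]


def evaluate(findings):
    """Group the criteria matched by a patient's findings by kind."""
    hits = [c for c in CRITERIA if c.identifier in findings]
    return {
        "inclusion": [c.identifier for c in hits if c.kind == "inclusion"],
        "exclusion": [c.identifier for c in hits if c.kind == "exclusion"],
    }


# Hypothetical patient whose record matches both criteria.
print(evaluate({"dubois:10", "dubois:16"}))
```

Keeping the matched criterion identifiers in the result lets a clinician trace each flag back to the panel and article it came from.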
37. TMKB for pharmaceutical and clinical
research, and health care
Pharmaceutical Research
• Which existing marketed drugs might potentially be re-purposed for
AD because they are known to modulate genes that are implicated
in the disease?
– 57 compounds or classes of compounds that are used to treat 45 diseases,
including AD, hyper/hypotension, diabetes and obesity
Clinical research
• Identify an AD clinical trial for a drug with a different mechanism of
action (MOA) than the drug that the patient is currently taking
– Of the 438 drugs linked to AD trials, only 58 are in active trials and only 2
(Doxorubicin and IL-2) have a documented MOA. 78 AD-associated drugs have
an established MOA.
Health care
• Have any of my AD patients been treated for other neurological
conditions as this might impact their diagnosis?
– Patient 2 is also being treated for depression.
http://esw.w3.org/topic/HCLSIG/PharmaOntology/Queries
38. Personal Health Lens
Observation: Patients often look up new/alternative drugs to treat their
condition or alleviate side effects.
Opportunity: A patient-centric health care application that identifies
contraindications for drugs mentioned on web pages using the patient’s
own health data
Components:
• RDFized patient data
• Bio2RDF semantically annotated data
• SADI semantic web services to process the page and retrieve data
• SHARE automatic workflow composition
39. SADI enables discovery and access
to Semantic Web Services
The Semantic Automated Discovery
and Integration (SADI) framework
makes it easy to create Semantic
Web Services using OWL classes as
service inputs and outputs
http://sadiframework.org
~700 bioinformatic services as of May 29, 2012
Mark Wilkinson, UBC
Michel Dumontier, Carleton University
Christopher Baker, UNB
43. The SADI+SHARE workflow and reasoning
was personalized to YOUR medical data
uses the patient’s data
contraindication
rationale
sources
44. so how do we get at the supporting evidence?
45. HyQue
HyQue is the Hypothesis query and evaluation system
• A platform for knowledge discovery
• Facilitates hypothesis formulation and evaluation
• Leverages Semantic Web technologies to provide access to
facts, expert knowledge and web services
• Conforms to a simplified event-based model
• Supports evaluation against positive and negative findings
• Transparent and reproducible evidence prioritization
• Provenance across all elements of hypothesis testing
– trace a hypothesis to its evaluation, including the data and rules used
Evaluating scientific hypotheses using the SPARQL Inferencing Notation. Extended Semantic Web Conference
(ESWC 2012). Heraklion, Crete. May 27-31, 2012.
HyQue: evaluating hypotheses using Semantic Web technologies. J Biomed Semantics. 2011 May 17;2 Suppl 2:S3.
46. HyQue Architecture
Ontologies
Services
47. Event-based data model
HyQue events denote a phenomenon involving two
objects: ‘agent’ and ‘target’. In addition, we can specify the
location of this event (e.g. located in nucleus, or under
some genetic background)
Event
• ‘has agent’ agent
• ‘has target’ target
• ‘is located in’ location
• ‘is negated’ boolean
Currently supported events
1. protein-protein binding
2. protein-nucleic acid binding
3. molecular activation
4. molecular inhibition
5. gene induction
6. gene repression
7. transport
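The event model can be sketched as a small Python data structure. HyQue itself represents events in RDF, so the `Event` class and its field names are hypothetical; only the four properties and the seven event types come from the slide, and the location value in the example is illustrative.

```python
from dataclasses import dataclass
from typing import Optional

SUPPORTED_EVENTS = {
    "protein-protein binding",
    "protein-nucleic acid binding",
    "molecular activation",
    "molecular inhibition",
    "gene induction",
    "gene repression",
    "transport",
}


@dataclass
class Event:
    event_type: str
    agent: str                      # 'has agent'
    target: str                     # 'has target'
    location: Optional[str] = None  # 'is located in'
    negated: bool = False           # 'is negated'

    def __post_init__(self):
        # Reject event types outside the currently supported list.
        if self.event_type not in SUPPORTED_EVENTS:
            raise ValueError(f"unsupported event type: {self.event_type}")


e1 = Event("gene induction", agent="sgd:Gal4p", target="sgd:GAL1", location="nucleus")
print(e1)
```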
48. HyQue domain rules CALCULATE a quantitative
measure of evidence for an event
‘induce’ rule (maximum score: 5):
– Is event negated?
• If yes, subtract 2
– Is event of type ‘induce’ (GO:0010628)?
• If yes, add 1; if no, subtract 1
– Is agent of type ‘protein’ (CHEBI:36080) or ‘RNA’?
• If yes, add 1; if type ‘gene’, subtract 1
– Is target of type ‘gene’ (SO:0000236)?
• If yes, add 1; if no, subtract 1
– Does agent have known ‘transcription factor activity’ (GO:0003700)?
• If yes, add 1
– Is event located in the ‘nucleus’ (GO:0005634)?
• If yes, add 1; if no, subtract 1
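The scoring steps of the 'induce' rule translate directly into a short function. HyQue expresses its rules in SPIN over RDF data, so this Python version is only a sketch of the arithmetic; the dictionary keys (`type`, `agent_type`, `target_type`, `agent_has_tf_activity`, `location`, `negated`) are hypothetical names chosen for the example.

```python
def score_induce(event: dict) -> int:
    """Score an event against the 'induce' rule; the maximum is 5."""
    score = 0
    # Negated events are penalized.
    if event.get("negated", False):
        score -= 2
    # Event type must be 'induce'.
    score += 1 if event.get("type") == "induce" else -1
    # Agent should be a protein or RNA; a gene as agent is penalized.
    agent_type = event.get("agent_type")
    if agent_type in ("protein", "RNA"):
        score += 1
    elif agent_type == "gene":
        score -= 1
    # Target should be a gene.
    score += 1 if event.get("target_type") == "gene" else -1
    # Known transcription factor activity adds support, with no penalty otherwise.
    if event.get("agent_has_tf_activity", False):
        score += 1
    # The event should be located in the nucleus.
    score += 1 if event.get("location") == "nucleus" else -1
    return score


best = {
    "type": "induce",
    "agent_type": "protein",
    "target_type": "gene",
    "agent_has_tf_activity": True,
    "location": "nucleus",
    "negated": False,
}
print(score_induce(best))  # 5, the maximum stated on the slide
```

An event that satisfies every condition reaches the stated maximum of 5, and the same event marked as negated scores 3.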
49. Combination of system and domain rules to
retrieve and score data, and add new triples
Event - induction
:e1 a go:0010628 ;
    hyque:agent sgd:Gal4p ;
    hyque:target sgd:GAL1 ;
    hyque:is_negated "0" .
SPIN induction rule (not captured in this transcript)
50. Customization of rules/data sources will generate
different evidence-based evaluations
51. Reproducible eScience
LOD for Hypothesis, Rules, Data and Evaluation
53. A digital infrastructure
for the future of biomedicine
• Semantic Web technologies offer a powerful integrative
platform across facts, expert knowledge and services
• The ability to publish, link to, retrieve, check the consistency
of, and query biomedical knowledge will yield an explosion of
health-related applications.
• By formalizing biomedical data, we can integrate
molecular to clinical data, and gain insight into how living
systems respond to chemical agents
– implications for drug discovery & delivery of health care
54. Acknowledgements
Bio2RDF: Peter Ansell, Francois Belleau, Alison Callahan, Jacques Corbeil, Jose Cruz-Toledo, Alex De Leon, Steve Etlinger, James Hogan, Nichealla Keath, Jean Morissette, Marc-Alexandre Nolin, Nicole Tourigny, Philippe Rigault and Paul Roe
OWL-Based Data Integration: Robert Hoehndorf, John Gennari, Sarah Wimalaratne, Bernard de Bono, Daniel Cook, and George Gkoutos
SADI: Christopher Baker, Melanie Courtot, Jose Cruz-Toledo, Steve Etlinger, Nichealla Keath, Artjom Klein, Luke McCarthy, Silvane Paixao, Ben Vandervalk, Natalia Villanueva-Rosales, Mark Wilkinson
HyQue: Alison Callahan
Lab: Glen Newton (NLP), Gordana Lenert (PGx), Dana Klassen @ DERI, Leonid Chepelev @ UoO, Natalia Villanueva-Rosales @ UoTexas, Xueying Chen @ IBM China, Mykola Konyk
W3C HCLS: J Luciano, B Andersson, C Batchelor, O Bodenreider, T Clark, C Denney, C Domarew, T Gambet, L Harland, A Jentzsch, V Kashyap, P Kos, J Kozlovsky, T Lebo, SM Marshall, JP McCusker, DL McGuinness, C Ogbuji, E Pichler, R Powers, E Prud'hommeaux, M Samwald, L Schriml, PJ Tonellato, PL Whetzel, J Zhao, S Stephens, C Denney, J Luciano, J McGurk, Lynn Schriml, and Peter J. Tonellato.