When are two entries about a concept in different datasets the same? Is it when they share the same name, the same properties, or meet some other criterion? The choice depends upon the application to which the data will be put. However, existing Linked Data approaches provide a single global view over the data with no way of varying the notion of equivalence to be applied.
In this presentation, I will introduce Scientific Lenses, an approach that enables applications to vary the equivalence conditions between linked datasets. Lenses have been deployed in the Open PHACTS Discovery Platform – a large-scale data integration platform for drug discovery. To cater for different use cases, the platform allows the application of different lenses, which vary the equivalence rules to be applied based on the context and interpretation of the links.
Scientific Lenses over Linked Data: An approach to support multiple integrated views
1. Scientific Lenses over Linked Data
An approach to support multiple integrated views
Alasdair J G Gray
A.J.G.Gray@hw.ac.uk
alasdairjggray.co.uk
@gray_alasdair
2. Open PHACTS Use Case
“Let me compare MW, logP and PSA for launched inhibitors of human & mouse oxidoreductases”
Chemical Properties (Chemspider)
Launched drugs (Drugbank)
Human => Mouse (Homologene)
Protein Families (Enzyme)
Bioactivity Data (ChEMBL)
… other info (Uniprot/Entrez etc.)
16 October 2014 Scientific Lenses – A. J. G. Gray 1
3. Discovery Platform
Apps
Method calls
Domain API
Drug Discovery Platform
Interactive responses
Production quality integration platform
4. App Ecosystem An “App Store”?
Explorer Explorer2 ChemBioNavigator Target Dossier Pharmatrek Helium
MOE Collector Cytophacts Utopia Garfield SciBite
KNIME Mol. Data Sheets PipelinePilot scinav.it Taverna
http://www.openphactsfoundation.org/apps.html
5. API Hits
April 2013 – March 2014: 15.8m
April 2014 – Sept 2014: 14m
Total: 29.8 million
6. Linked Data API
Drug
Target Pathway
Disease (1.4)
https://dev.openphacts.org/
7. Open PHACTS Data
Source         Initial Records   Triples          Properties
ChEMBL         1,481,473         304,360,749      77
DrugBank       19,628            517,584          74
UniProt        564,246           405,473,138      82
ENZYME         6,187             73,838           2
ChEBI          40,575            1,673,863        2
GeneOntology   38,137            2,447,682        26
GOA            661,232           1,765,622,393    15
ChemSpider     1,361,568         215,193,441      23
ConceptWiki    2,828,966         4,291,131        1
WikiPathways   946               1,949,074        34
8. Dataset Descriptions in the Open Pharmacological Space
14 January 2013
Being replaced by W3C HCLS community profile
http://tiny.cc/hcls-datadesc-ed
OPS Dataset Descriptions – A. J. G. Gray
9. OPS Discovery Platform
[Architecture diagram]
Apps
Linked Data API (RDF/XML, TTL, JSON)
Core Platform: Semantic Workflow Engine; Data Cache (Virtuoso Triple Store); Domain Specific Services; Identity Resolution Service; Identifier Management Service; Chemistry Registration, Normalisation & Q/C; Indexing
Example identifiers: “Adenosine receptor 2a”, P12374, EC2.43.4, CS4532
Data sources (VoID descriptions, databases, nanopublications): Public Content, Commercial, Public Ontologies, User Annotations
10. Multiple Identities
Andy Law's Third Law
“The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study”
http://bioinformatics.roslin.ac.uk/lawslaws/
GB:29384
P12047
X31045
Are these the same thing?
11. Gleevec®: Imatinib Mesylate
Imatinib
Imatinib Mesylate
Mesylate
YLMAHDNUQAMNNX-UHFFFAOYSA-N
ChemSpider Drugbank PubChem
12. Gleevec®: Imatinib Mesylate
Imatinib
Are these records the same?
It depends upon your task!
Imatinib Mesylate
Mesylate
YLMAHDNUQAMNNX-UHFFFAOYSA-N
ChemSpider Drugbank PubChem
13. Genes == Proteins?
BRCA1: Chromosome 17
Breast cancer type 1 susceptibility protein
http://en.wikipedia.org/wiki/File:Protein_BRCA1_PDB_1jm7.png
http://en.wikipedia.org/wiki/File:BRCA1_en.png
14. Genes == Proteins?
BRCA1: Chromosome 17
Breast cancer type 1 susceptibility protein
http://en.wikipedia.org/wiki/File:Protein_BRCA1_PDB_1jm7.png
http://en.wikipedia.org/wiki/File:BRCA1_en.png
Are these records the same?
It depends upon your task!
15. Example Use Cases
I need to perform an analysis, give me details of the active compound in Gleevec.
Which targets are known to interact with Gleevec?
16. Structure Lens
I need to perform an analysis, give me details of the active compound in Gleevec.
Strict Relaxed
Analysing Browsing
skos:exactMatch (InChI)
17. Name Lens
Which targets are known to interact with Gleevec?
Strict Relaxed
Analysing Browsing
skos:closeMatch (Drug Name)
skos:exactMatch (InChI)
skos:closeMatch (Drug Name)
18. What is a Scientific Lens?
A lens defines a conceptual view over the data
Specifies operational equivalence conditions
Consists of:
Identifier (URI)
Title (dct:title)
Description (dct:description)
Documentation link (dcat:landingPage)
Creator (pav:createdBy)
Timestamp (pav:createdOn)
Equivalence rules (bdb:linksetJustification)
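As an illustration of the lens components listed on this slide, the following is a minimal sketch of how a lens description might be serialised as Turtle. The class name, the example.org URIs and the example values are hypothetical, and the bdb namespace URI is an assumption; only the property names come from the slide. String escaping is deliberately ignored.

```python
from dataclasses import dataclass
from typing import List

PREFIXES = {
    "dct": "http://purl.org/dc/terms/",
    "dcat": "http://www.w3.org/ns/dcat#",
    "pav": "http://purl.org/pav/",
    "bdb": "http://vocabularies.bridgedb.org/ops#",  # assumed namespace URI
}


@dataclass
class LensDescription:
    """Metadata for one lens, mirroring the fields listed on the slide."""
    uri: str
    title: str
    description: str
    landing_page: str
    created_by: str
    created_on: str            # ISO 8601 timestamp, kept as a plain literal
    justifications: List[str]  # URIs of the enabled equivalence rules

    def to_turtle(self) -> str:
        lines = [f"@prefix {p}: <{ns}> ." for p, ns in PREFIXES.items()]
        lines.append("")
        lines.append(f"<{self.uri}>")
        props = [
            f'    dct:title "{self.title}"',
            f'    dct:description "{self.description}"',
            f"    dcat:landingPage <{self.landing_page}>",
            f"    pav:createdBy <{self.created_by}>",
            f'    pav:createdOn "{self.created_on}"',
        ]
        props += [f"    bdb:linksetJustification <{j}>" for j in self.justifications]
        # Separate properties with ';' and close the statement with '.'
        lines.append(" ;\n".join(props) + " .")
        return "\n".join(lines)


# Hypothetical example values, for illustration only.
example = LensDescription(
    uri="http://example.org/lens/default",
    title="Default",
    description="Matches compounds on identical structure only",
    landing_page="http://example.org/docs/lens/default",
    created_by="http://example.org/people/editor",
    created_on="2014-10-16T00:00:00Z",
    justifications=[
        "http://example.org/justification/inchi-key",
        "http://example.org/justification/database-cross-reference",
    ],
)
print(example.to_turtle())
```

Enabling a further equivalence rule under this sketch would only mean adding another entry to `justifications`.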
19. Lens Effects: Ibuprofen
Ibuprofen consists of two equally active stereoisomers.
• Stereoisomers not always represented in data
Users wish to retrieve information for any stereoisomer.
CHEMBL427526
CHEMBL521
CHEMBL175
20. Default Lens
Ibuprofen consists of two equally active stereoisomers.
• Stereoisomers not always represented in data
Users wish to retrieve information for any stereoisomer.
21. Stereoisomer Lens
Ibuprofen consists of two equally active stereoisomers.
• Stereoisomers not always represented in data
Users wish to retrieve information for any stereoisomer.
22. Mapping Generation
✔
ops:OPS437281
has_stereoundefined_parent
[ci:CHEMINF_000456]
ops:OPS380297
is_stereoisomer_of
[ci:CHEMINF_000461]
ops:OPS380292
Other relationships
• has part
• is tautomer of
• uncharged counterpart
• isotope
…
23. Initial Connectivity
Datasets 37
Linksets 104
Links 7,096,712
Justifications 7
26. Co-reference Computation
Rules ensure:
Unrestricted transitivity within conceptual type
Restrict crossing conceptual types
Based on justifications
Provenance captured
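The rules on this slide can be sketched roughly as follows: links whose justification stays within one conceptual type are closed transitively, while links that cross conceptual types are left unchained. This is a minimal union-find sketch, not the platform's implementation; the justification labels and the ex: identifiers are hypothetical, and treating cross-type links as simply excluded from the closure is an assumption.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Link = Tuple[str, str, str]  # (source URI, target URI, justification)


def find(parent: Dict[str, str], node: str) -> str:
    """Return the representative of node's class, compressing the path."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def coreference_classes(links: Iterable[Link],
                        transitive_justifications: Set[str]) -> List[FrozenSet[str]]:
    """Group URIs connected by links whose justification allows transitivity."""
    parent: Dict[str, str] = {}
    for source, target, justification in links:
        if justification not in transitive_justifications:
            continue  # cross-type link: not chained into the closure
        parent.setdefault(source, source)
        parent.setdefault(target, target)
        root_s, root_t = find(parent, source), find(parent, target)
        if root_s != root_t:
            parent[root_s] = root_t
    classes = defaultdict(set)
    for node in parent:
        classes[find(parent, node)].add(node)
    return [frozenset(members) for members in classes.values()]


# Hypothetical links: two structure-based links and one gene-to-protein link.
links = [
    ("ex:cs1", "ex:chembl1", "same-structure"),
    ("ex:chembl1", "ex:db1", "same-structure"),
    ("ex:gene1", "ex:protein1", "gene-to-protein"),
]
classes = coreference_classes(links, {"same-structure"})
print(classes)
```

In this example the three compound identifiers end up in one class even though ex:cs1 and ex:db1 are never linked directly, while the gene and protein identifiers are not merged into any class.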
27. Initial Connectivity
Datasets 37
Linksets 104
Links 7,096,712
Justifications 7
28. Inferred Connectivity
Datasets 37
Linksets 883
Links 17,383,846
Justifications 7
30. Lenses: Under the hood
GRAPH <http://rdf.chemspider.com> {
  cw:979b545d-f9a9 cheminf:logd ?logd .
  ?iri cheminf:logd ?logd .
  FILTER (?iri = cw:979b545d-f9a9 ||
          ?iri = cs:2157 ||
          ?iri = chembl:1280 ||
          ?iri = db:db00945 )
}
GRAPH <http://…
[Diagram]
Query Expander Service: Q, L1 → Q’
Identity Mapping Service (BridgeDB), with Mappings and Profiles:
cw:979b545d-f9a9, L1 → [cw:979b545d-f9a9, cs:2157, chembl:1280, db:db00945]
• Can also be achieved through UNION
• IMS call adds overhead
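To make the FILTER and UNION alternatives concrete, here is a small sketch that generates both forms for a single triple pattern. The function names are hypothetical; the URIs and the predicate are the ones shown on the slide, and the list of equivalents is assumed to have already been returned by the mapping service.

```python
from typing import List


def expand_with_filter(predicate: str, obj: str, uris: List[str],
                       var: str = "?iri") -> str:
    """Replace the subject with a variable constrained by a FILTER."""
    conditions = " ||\n        ".join(f"{var} = {uri}" for uri in uris)
    return f"{var} {predicate} {obj} .\nFILTER ({conditions})"


def expand_with_union(predicate: str, obj: str, uris: List[str]) -> str:
    """Repeat the triple pattern once per equivalent URI, joined by UNION."""
    branches = [f"{{ {uri} {predicate} {obj} . }}" for uri in uris]
    return "\nUNION\n".join(branches)


equivalents = ["cw:979b545d-f9a9", "cs:2157", "chembl:1280", "db:db00945"]
filter_form = expand_with_filter("cheminf:logd", "?logd", equivalents)
union_form = expand_with_union("cheminf:logd", "?logd", equivalents)
print(filter_form)
print(union_form)
```

Both forms select the same subjects; which one a given triple store evaluates faster is what the experiment on the following slides measures.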
31. Experiment
Is it feasible to use a stand-off mapping service?
Baselines (no external call):
“Perfect” URIs
Linked data querying
Expansion approaches (external service call):
FILTER by Graph
UNION by Graph
C. Y. A. Brenninkmeijer, C. A. Goble, A. J. G. Gray, P. T. Groth, A. Loizou, S. Pettifer: Including Co-referent URIs in a SPARQL Query. COLD 2013.
http://ceur-ws.org/Vol-1034/BrenninkmeijerEtAl_COLD2013.pdf
32. “Perfect” URI Baseline
WHERE {
  GRAPH <chemspider> {
    cs:2157 cheminf:logp ?logp .
  }
  GRAPH <chembl> {
    chembl_mol:m1280 cheminf:mw ?mw .
  }
}
33. Linked Data Baseline
WHERE {
  GRAPH <chemspider> {
    cs:2157 cheminf:logp ?logp .
  }
  GRAPH <chembl> {
    ?chemblid cheminf:mw ?mw .
  }
  cs:2157 skos:exactMatch ?chemblid .
}
34. Queries
Drawn from Open PHACTS API:
1. Simple compound information (1)
2. Compound information (1)
3. Compound pharmacology (M)
4. Simple target information (1)
5. Target information (1)
6. Target pharmacology (M)
36. Experiment Data
Data: 167,783,592 triples
Mappings: 2,114,584 triples
Lenses: 1
42. Conclusions
Scientific data is complex and messy
Requires flexibility in linking
Equivalence depends upon context
Lenses provide support for operational equivalence
Chemical structures support automatic computation of links with justification
43. Acknowledgements
Royal Society of Chemistry
Colin Batchelor
Karen Karapetyan
Jon Steele
Valery Tkachenko
Antony Williams
University of Manchester
Christian Brenninkmeijer
Ian Dunlop
Carole Goble
Steve Pettifer
Robert Stevens
Swiss Institute of Bioinformatics
Christine Chichester
European Bioinformatics Institute
Mark Davies
Anna Gaulton
John Overington
University of Vienna
Daniela Digles
Maastricht University
Chris Evelo
Andra Waagmeester
Egon Willighagen
VU University of Amsterdam
Paul Groth
Antonis Loizou
Connected Discovery
Lee Harland
44. Questions
Alasdair J G Gray
A.J.G.Gray@hw.ac.uk
alasdairjggray.co.uk
@gray_alasdair
Open PHACTS
pmu@openphacts.org
openphacts.org
@open_phacts
Editor's notes
1 of 83 business driver questions
Took a team of 5 experienced researchers 6 hours to manually gather the answer
A platform for integrated pharmacology data
Relied upon by pharma companies
Public domain, commercial, and private data sources
Provides domain specific API
Making it easy to build multiple drug discovery applications: examples developed in the project
Actively being used
Since launch (April 2013): 30 million hits
Linked data API: multiple response formats (JSON, RDF, XML, CSV …)
3scale deployment, extensive memcaching
Public dataset
Provenance of data returned in response
Import data into cache
API calls populate SPARQL queries
Integration approach
Data kept in original model
Data cached in central triple store
API call translated to SPARQL query
Query expressed in terms of original data
Queries expanded by IMS to cover URIs of original datasets
Concept appears in multiple datasets, each with its own identifier
This talk is about supporting the multiple identities that exist
Rather than define a single approach, we want to support the use of multiple identifiers
Example drug: Gleevec, a cancer drug for leukemia
Lookup in three popular public chemical databases gives different results
Chemistry is complicated, often simplified for convenience
Data is messy!
Are these records the same? It depends on what you are doing with the data!
Each captures a subtly different view of the world
Do genes == proteins? Different conceptual types: gene and protein
Biological data is complicated, often simplified for convenience
----
But if you are explaining when genes = proteins, you may also want to be prepared for questions about when genes != proteins. Splice variation is a common example. In the Fas receptor (http://en.wikipedia.org/wiki/Alternative_splicing#Exon_definition:_Fas_receptor) there is one gene, but it can be made into two distinct proteins which have different biological effects, so you can easily mix biological data that should not be mixed by integrating these two functions under the same ID. [We currently do not handle this well in OPS.]
The most used example here is the ghrelin gene: it is expressed as a protein which is cleaved in two to form two completely different hormones, ghrelin and obestatin, which do very different things but come from the same gene (http://en.wikipedia.org/wiki/Ghrelin#Synthesis_and_variants).
Often used as a shortcut for retrieval: BRCA1 easier to remember and type!
Require the ability to equate them in the IMS
Analysis requires precise knowledge of the form of the compound across datasets
Finding targets is a search activity; some entries are likely to be mis-entered
We use lenses to change the links between the data
Interested in physicochemical properties of Gleevec
Interested in biomedical and pharmacological properties
sameAs != sameAs: it depends on your point of view
Links relate individual data instances: source, target, predicate, reason.
Links are grouped into Linksets, each of which has a VoID header providing provenance and justification for its links.
Lens enables certain relationships and disables others
Alters links between the data
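The link, linkset and lens relationship described in these notes can be sketched as plain data structures. This is only an illustrative model: the class and field names, the justification labels and the ex: identifiers are hypothetical, and a lens is reduced here to the set of justifications it enables.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Linkset:
    """A group of links sharing one predicate and one justification."""
    source_dataset: str
    target_dataset: str
    predicate: str
    justification: str                 # the reason the links hold
    links: Tuple[Tuple[str, str], ...]  # (source URI, target URI) pairs


@dataclass(frozen=True)
class LensView:
    """A lens, modelled as the set of justifications it enables."""
    name: str
    enabled_justifications: FrozenSet[str]


def active_links(lens: LensView, linksets: List[Linkset]) -> List[Tuple[str, str]]:
    """Return only the links whose linkset justification the lens enables."""
    return [
        link
        for linkset in linksets
        if linkset.justification in lens.enabled_justifications
        for link in linkset.links
    ]


# Hypothetical linksets: one structure match and one stereoisomer relationship.
linksets = [
    Linkset("ex:datasetA", "ex:datasetB", "skos:exactMatch",
            "same-structure", (("ex:A1", "ex:B1"),)),
    Linkset("ex:datasetB", "ex:datasetB", "skos:closeMatch",
            "is-stereoisomer-of", (("ex:B1", "ex:B2"),)),
]
default_lens = LensView("default", frozenset({"same-structure"}))
stereo_lens = LensView("stereoisomer",
                       frozenset({"same-structure", "is-stereoisomer-of"}))

print(active_links(default_lens, linksets))
print(active_links(stereo_lens, linksets))
```

Switching from the default lens to the stereoisomer lens changes which links are visible without changing the underlying linksets, which is the effect described for Ibuprofen.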
Default lens matches structures
Only get back data associated with the structure you entered with
Really want all information about Ibuprofen
Need a different lens
Validate structure: Source data is messy!
Identify common problems:
Charge imbalance
Stereochemistry
Compute physicochemical properties
Identify related properties based on structure
17 relationship types
Can enter with IDs from any of the supported datasets
Platform extracts data from certain datasets
These need to be connected
Here there is no issue in computing transitive links, as they are all the same compound based on InChI key
Would compute the full set of links
Insulin Receptor
Issue when linking through PDB due to the way that proteins are crystallised
Can enter with IDs from any of the supported datasets
These are the 1.3 figures
In 1.4:
130 raw linksets with 6,985,278 links
40,802 computed linksets with 25,584,293 links
Implementation available
IMS takes query and expands URIs
Query with URIs
Extract URIs
Find equivalents under a certain lens (Isolates lens behaviour)
Expand query
Optimise based on context
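The expansion steps in these notes (extract URIs, find equivalents under a lens, expand the query) can be sketched end to end as below. This is a simplified illustration, not the IMS implementation: the in-memory `MAPPINGS` table stands in for the external mapping service call, the regular expression only recognises the prefixes used on the slides, and the function names are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Stand-in for the mapping service: (URI, lens) -> equivalent URIs.
MAPPINGS: Dict[Tuple[str, str], List[str]] = {
    ("cw:979b545d-f9a9", "L1"): [
        "cw:979b545d-f9a9", "cs:2157", "chembl:1280", "db:db00945",
    ],
}

# Prefixed names for the dataset prefixes that appear on the slides (assumed set).
PNAME = re.compile(r"\b(?:cw|cs|chembl|db):[\w-]+")


def extract_uris(query: str) -> List[str]:
    """Step 1: find the dataset URIs mentioned in the query."""
    return sorted(set(PNAME.findall(query)))


def expand(query: str, lens: str) -> str:
    """Steps 2 and 3: look up equivalents under the lens and rewrite the query."""
    variables: Dict[str, str] = {}
    filters: List[str] = []

    def substitute(match: "re.Match[str]") -> str:
        uri = match.group(0)
        equivalents = MAPPINGS.get((uri, lens), [uri])
        if len(equivalents) < 2:
            return uri  # nothing to expand under this lens
        if uri not in variables:
            var = f"?iri{len(variables)}"
            variables[uri] = var
            conditions = " || ".join(f"{var} = {e}" for e in equivalents)
            filters.append(f"FILTER ({conditions})")
        return variables[uri]

    body = PNAME.sub(substitute, query)
    return "\n".join([body] + filters)


pattern = "cw:979b545d-f9a9 cheminf:logd ?logd ."
print(extract_uris(pattern))
print(expand(pattern, "L1"))
print(expand(pattern, "L2"))  # no mappings known for this hypothetical lens
```

Because the lookup is keyed on both the URI and the lens, the same input query yields different expansions under different lenses, which is how lens behaviour is isolated from the rest of the platform in this sketch.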
Result size in brackets
Orange are actual OPS queries
Subset of the OPS data
Linked data approach performs badly with query 6 due to the query construction
Name being bound to the chemical structure returned
Focus on other queries
In general, expansion is slower than the baselines
Worst case delta: 0.01842 s (under 20 ms)
Human perception is 0.05 to 0.2 s (50 to 200 ms)
Focus on query 6
No linked data as it performed very poorly on this query
Size of result obliterates external call cost