When are two entries about a small molecule in different datasets the same? Is it when they share a drug name, a chemical structure, or some other criterion? The choice depends upon the application to which the data will be put. However, existing Linked Data approaches provide a single global view over the data, with no way of varying the notion of equivalence to be applied.
In this paper, we present an approach that enables applications to choose the equivalence criteria to apply between datasets, thus supporting multiple dynamic views over the Linked Data. For chemical data, we show that multiple sets of links can be automatically generated according to different equivalence criteria and published with semantic descriptions capturing their context and interpretation. This approach has been applied within a large-scale public-private data integration platform for drug discovery. To cater for different use cases, the platform allows the application of different lenses, which vary the equivalence rules to be applied based on the context and interpretation of the links.
Scientific lenses to support multiple views over linked chemistry data
1. Scientific lenses to support multiple views over linked chemistry data
Alasdair J G Gray
A.J.G.Gray@hw.ac.uk
alasdairjggray.co.uk
@gray_alasdair
Open PHACTS
pmu@openphacts.org
openphacts.org
@open_phacts
3. Gleevec®: Imatinib Mesylate
Imatinib
Imatinib Mesylate
Mesylate
YLMAHDNUQAMNNX-UHFFFAOYSA-N
ChemSpider Drugbank PubChem
21 October 2014 Scientific Lenses – A. J. G. Gray 2
4. Gleevec®: Imatinib Mesylate
Imatinib
Are these records the same?
It depends upon your task!
Imatinib Mesylate
Mesylate
YLMAHDNUQAMNNX-UHFFFAOYSA-N
ChemSpider Drugbank PubChem
5. Example Use Cases
I need to perform an analysis, give me details of the active compound in Gleevec.
Which targets are known to interact with Gleevec?
6. Structure Lens
I need to perform an analysis, give me details of the active compound in Gleevec.
Strict Relaxed
Analysing Browsing
skos:exactMatch (InChI)
7. Name Lens
Which targets are known to interact with Gleevec?
Strict Relaxed
Analysing Browsing
skos:closeMatch (Drug Name)
skos:exactMatch (InChI)
skos:closeMatch (Drug Name)
8. What is a Scientific Lens?
A lens defines a conceptual view over the data
Specifies operational equivalence conditions
Consists of:
• Identifier (URI)
• Title (dct:title)
• Description (dct:description)
• Documentation link (dcat:landingPage)
• Creator (pav:createdBy)
• Timestamp (pav:createdOn)
• Equivalence rules (bdb:linksetJustification)
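As a rough illustration, the lens description above could be held in a small record like the one below. The property names are the ones shown on the slide; the class name `Lens`, the method `as_properties`, the example URI, and all example values are hypothetical and not taken from the platform.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Lens:
    """Hypothetical container for the lens metadata listed on the slide."""
    uri: str                    # Identifier (URI)
    title: str                  # dct:title
    description: str            # dct:description
    landing_page: str           # dcat:landingPage
    created_by: str             # pav:createdBy
    created_on: datetime        # pav:createdOn
    justifications: tuple = ()  # bdb:linksetJustification (equivalence rules)

    def as_properties(self):
        """Return (property, value) pairs using the vocabulary terms from the slide."""
        pairs = [
            ("dct:title", self.title),
            ("dct:description", self.description),
            ("dcat:landingPage", self.landing_page),
            ("pav:createdBy", self.created_by),
            ("pav:createdOn", self.created_on.isoformat()),
        ]
        # One pair per equivalence rule the lens accepts.
        pairs += [("bdb:linksetJustification", j) for j in self.justifications]
        return pairs


# Example values are invented for illustration only.
structure_lens = Lens(
    uri="http://example.org/lens/structure",
    title="Structure lens",
    description="Treat records as equivalent only when their InChI matches.",
    landing_page="http://example.org/docs/structure-lens",
    created_by="http://example.org/people/editor",
    created_on=datetime(2014, 10, 21),
    justifications=("InChI",),
)
```

Keeping the equivalence rules as plain data means a different lens is just a different record, with no change to the code that consumes it.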
9. Lens Effects: Ibuprofen
Ibuprofen consists of two equally active stereoisomers.
• Stereoisomers not always represented in data
Users wish to retrieve information for any stereoisomer.
CHEMBL427526
CHEMBL521
CHEMBL175
10. Default Lens
Ibuprofen consists of two equally active stereoisomers.
• Stereoisomers not always represented in data
Users wish to retrieve information for any stereoisomer.
11. Stereoisomer Lens
Ibuprofen consists of two equally active stereoisomers.
• Stereoisomers not always represented in data
Users wish to retrieve information for any stereoisomer.
12. Mapping Generation
ops:OPS437281
has_stereoundefined_parent
[ci:CHEMINF_000456]
ops:OPS380297
is_stereoisomer_of
[ci:CHEMINF_000461]
ops:OPS380292
Other relationships
• has part
• is tautomer of
• uncharged counterpart
• isotope
…
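One way to picture how a lens uses these relationships is to treat them as typed edges and collect everything reachable through the edge types the lens accepts. The sketch below does that with a breadth-first search. The compound identifiers (`ex:...`), the relationship spellings, and the choice to treat every relationship as symmetric and transitive are assumptions made for illustration; the platform's actual behaviour is not described in the slides.

```python
from collections import deque

# Hypothetical relationship links: (source, relationship, target).
RELATIONS = [
    ("ex:S-ibuprofen", "has_stereoundefined_parent", "ex:ibuprofen"),
    ("ex:R-ibuprofen", "has_stereoundefined_parent", "ex:ibuprofen"),
    ("ex:S-ibuprofen", "is_stereoisomer_of", "ex:R-ibuprofen"),
    ("ex:ibuprofen-anion", "uncharged_counterpart", "ex:ibuprofen"),
]


def equivalents(start, relations, allowed):
    """Collect every compound reachable from `start` via allowed relationship types.

    Assumption: each relationship is followed in both directions.
    """
    neighbours = {}
    for source, relationship, target in relations:
        if relationship in allowed:
            neighbours.setdefault(source, set()).add(target)
            neighbours.setdefault(target, set()).add(source)

    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# A default lens that follows no structural relationships, and a stereoisomer lens.
DEFAULT_LENS = frozenset()
STEREO_LENS = frozenset({"has_stereoundefined_parent", "is_stereoisomer_of"})

default_view = equivalents("ex:ibuprofen", RELATIONS, DEFAULT_LENS)
stereo_view = equivalents("ex:ibuprofen", RELATIONS, STEREO_LENS)
```

Under the default lens only the starting compound is returned; under the stereoisomer lens both stereoisomers are added, while the charged form stays out because its relationship type is not accepted.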
15. OPS Discovery Platform
[Architecture diagram. Apps call the Linked Data API (RDF/XML, TTL, JSON), which sits on the Semantic Workflow Engine. Core Platform components: Data Cache (Virtuoso Triple Store), Domain Specific Services, Identity Resolution Service, Identifier Management Service, Indexing, and Chemistry Registration, Normalisation & Q/C. Data sources, described with VoID and provided as databases or nanopublications: Public Content, Commercial, Public Ontologies, User Annotations. Example identifiers shown: “Adenosine receptor 2a”, P12374, EC2.43.4, CS4532.]
16. Lenses: Under the hood
GRAPH <http://rdf.chemspider.com> {
cw:979b545d-f9a9 cheminf:logd ?logd .
?iri cheminf:logd ?logd .
FILTER (?iri = cw:979b545d-f9a9 ||
?iri = cs:2157 ||
?iri = chembl:1280 ||
?iri = db:db00945 )
}
GRAPH <http://…
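[Diagram. The Query Expander Service takes a query Q and a lens L1 and returns the expanded query Q’. It passes (cw:979b545d-f9a9, L1) to the Identity Mapping Service (BridgeDB), which holds the Mappings and Profiles and returns [cw:979b545d-f9a9, cs:2157, chembl:1280, db:db00945].]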
• IMS call adds overhead
• Call time below human perception [1]
• Can also be achieved through UNION
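The two expansion styles mentioned above, a FILTER over a variable and a UNION of patterns, can be sketched as simple string builders. The function names `filter_clause` and `union_pattern` are hypothetical, and the code only renders text in the shape of the slide's example; it is not the Query Expander Service itself.

```python
def filter_clause(variable, iris):
    """Render FILTER (?v = a || ?v = b ...) over the given IRIs."""
    if not iris:
        raise ValueError("need at least one IRI")
    tests = " || ".join(f"{variable} = {iri}" for iri in iris)
    return f"FILTER ({tests})"


def union_pattern(iris, predicate, obj):
    """Render the same expansion as a UNION of one triple pattern per IRI."""
    if not iris:
        raise ValueError("need at least one IRI")
    branches = [f"{{ {iri} {predicate} {obj} . }}" for iri in iris]
    return " UNION ".join(branches)


# Equivalent IRIs as returned for lens L1 in the slide's example.
EQUIVALENTS = ["cw:979b545d-f9a9", "cs:2157", "chembl:1280", "db:db00945"]

# FILTER style: the fixed subject becomes a variable constrained to the equivalents.
filter_style = "?iri cheminf:logd ?logd .\n" + filter_clause("?iri", EQUIVALENTS)

# UNION style: one copy of the triple pattern per equivalent IRI.
union_style = union_pattern(EQUIVALENTS, "cheminf:logd", "?logd")
```

The FILTER form keeps a single triple pattern, whereas the UNION form grows with the number of equivalent IRIs; which one performs better depends on the triple store.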
[1] C. Y. A. Brenninkmeijer, C. Goble, A. J. G. Gray, P. Groth, A. Loizou, and S. Pettifer, “Including Co-referent URIs in a SPARQL Query,” COLD2013, http://ceur-ws.org/Vol-1034/
17. API Hits
April 2013 – March 2014: 15.8m
April 2014 – Sept 2014: 14m
Total: 29.8 million
18. Conclusions
Scientific data is complex and messy
Requires flexibility in linking
Equivalence depends upon context
Lenses provide support for operational equivalence
Chemical structures support automatic computing of links with justification
19. Co-authors
Royal Society of Chemistry
Colin Batchelor
Karen Karapetyan
Jon Steele
Valery Tkachenko
Antony Williams
University of Manchester
Christian Brenninkmeijer
Ian Dunlop
Carole Goble
Steve Pettifer
Robert Stevens
Swiss Institute of Bioinformatics
Christine Chichester
European Bioinformatics Institute
Mark Davies
Anna Gaulton
John Overington
University of Vienna
Daniela Digles
Maastricht University
Chris Evelo
Andra Waagmeester
Egon Willighagen
VU University of Amsterdam
Paul Groth
Antonis Loizou
Connected Discovery
Lee Harland
20. Questions
Demo at stall 33 this evening!
21. Open PHACTS Data
Source        Initial Records        Triples  Properties
ChEMBL              1,481,473    304,360,749          77
DrugBank               19,628        517,584          74
UniProt               564,246    405,473,138          82
ENZYME                  6,187         73,838           2
ChEBI                  40,575      1,673,863           2
GeneOntology           38,137      2,447,682          26
GOA                   661,232  1,765,622,393          15
ChemSpider          1,361,568    215,193,441          23
ConceptWiki         2,828,966      4,291,131           1
WikiPathways              946      1,949,074          34
22. App Ecosystem An “App Store”?
Explorer Explorer2 ChemBioNavigator Target Dossier Pharmatrek Helium
MOE Collector Cytophacts Utopia Garfield SciBite
KNIME Mol. Data Sheets PipelinePilot scinav.it Taverna
http://www.openphactsfoundation.org/apps.html
23. Discovery Platform
[Diagram. Apps make method calls to the Domain API of the Drug Discovery Platform and receive interactive responses.]
Production quality integration platform
24. Linked Data API
Drug
Target Pathway
Disease (1.4)
https://dev.openphacts.org/
Editor's notes
Concept appears in multiple datasets, each with its own identifier
This talk is about supporting the multiple identities that exist
Rather than define a single approach, we want to support the use of multiple identifiers
Example drug: Gleevec, a cancer drug for leukemia
Lookup in three popular public chemical databases
Different results
Are these records the same? It depends on what you are doing with the data!
Each captures a subtly different view of the world
Data is messy!
Analysis requires precise knowledge of the form of the compound across datasets
Finding targets is a search activity; some entries are likely to be mis-entered
Interested in physicochemical properties of Gleevec
Interested in biomedical and pharmacological properties
sameAs != sameAs: it depends on your point of view
Links relate individual data instances: source, target, predicate, reason.
Links are grouped into Linksets, each of which has a VoID header providing provenance and justification for its links.
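A minimal sketch of this data model follows, assuming a lens is simply the set of justifications it accepts. The class names, the `ex:` identifiers, and the reading that the relaxed name lens accepts both InChI and drug-name links are assumptions for illustration, not the platform's implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Link:
    source: str      # data instance in one dataset
    target: str      # data instance in another dataset
    predicate: str   # e.g. skos:exactMatch or skos:closeMatch
    reason: str      # why the two instances were linked


@dataclass
class Linkset:
    provenance: str     # where the links came from (part of the VoID header)
    justification: str  # the equivalence criterion shared by all links in the set
    links: List[Link]


def links_for_lens(linksets, accepted_justifications):
    """Return the links from every linkset whose justification the lens accepts."""
    return [
        link
        for linkset in linksets
        if linkset.justification in accepted_justifications
        for link in linkset.links
    ]


# Hypothetical linksets for the Gleevec example.
LINKSETS = [
    Linkset("computed from structure (hypothetical)", "InChI", [
        Link("ex:chemspider/imatinib", "ex:chembl/imatinib", "skos:exactMatch", "InChI"),
    ]),
    Linkset("matched on name (hypothetical)", "Drug Name", [
        Link("ex:chemspider/imatinib", "ex:drugbank/imatinib-mesylate",
             "skos:closeMatch", "Drug Name"),
    ]),
]

STRUCTURE_LENS = {"InChI"}           # strict, for analysis
NAME_LENS = {"InChI", "Drug Name"}   # relaxed, for browsing

strict_links = links_for_lens(LINKSETS, STRUCTURE_LENS)
relaxed_links = links_for_lens(LINKSETS, NAME_LENS)
```

Because the justification lives on the linkset, switching lens only changes which linksets are consulted; the links themselves are never rewritten.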
Validate structure: Source data is messy!
Identify common problems:
Charge imbalance
Stereochemistry
Compute physicochemical properties
Identify related properties based on structure
17 relationship types
Pharmacology count 2370 3044
Import data into cache
API calls populate SPARQL queries
Integration approach
Data kept in original model
Data cached in central triple store
API call translated to SPARQL query
Query expressed in terms of original data
Queries expanded by IMS to cover URIs of original datasets
Query with URIs
Extract URIs
Find equivalents
Expand query
Optimise based on context
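The first steps of this pipeline (take a query, extract its URIs, find their equivalents under a lens) might look roughly like the sketch below. The `MAPPINGS` table is a hypothetical stand-in for the IMS, the prefix list in the regular expression is assumed for this example, and the final rewriting and optimisation steps are left out.

```python
import re

# Hypothetical mapping table keyed by (IRI, lens name); a stand-in for the IMS.
MAPPINGS = {
    ("cw:979b545d-f9a9", "L1"): [
        "cw:979b545d-f9a9", "cs:2157", "chembl:1280", "db:db00945",
    ],
}

# Matches prefixed names for a fixed set of dataset prefixes (assumed here).
DATASET_IRI = re.compile(r"\b(?:cw|cs|chembl|db):[\w-]+")


def extract_iris(query):
    """Step 2: pull the dataset IRIs out of the query text."""
    return sorted(set(DATASET_IRI.findall(query)))


def find_equivalents(iri, lens):
    """Step 3: look up equivalents under the lens; fall back to the IRI itself."""
    return MAPPINGS.get((iri, lens), [iri])


def expansion_plan(query, lens):
    """Map each IRI in the query to the IRIs that should replace it."""
    return {iri: find_equivalents(iri, lens) for iri in extract_iris(query)}


QUERY = "cw:979b545d-f9a9 cheminf:logd ?logd ."
plan = expansion_plan(QUERY, "L1")
```

An IRI with no mapping under the chosen lens maps only to itself, so the query is left unchanged for that IRI.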
OPS Discovery Platform is actively being used
Lenses under active evaluation and refinement within the OPS consortium
Statistics to be added
1,030,727,289 triples
Hosted on beefy hardware; data in memory (aim)
A platform for integrated pharmacology data
Relied upon by pharma companies
Public domain, commercial, and private data sources
Provides domain specific API
Making it easy to build multiple drug discovery applications: examples developed in the project
Linked data API: multiple response formats (JSON, RDF, XML, CSV …)
3scale deployment
Public dataset