Scientific Lenses over Linked Data: Identity Management in the Open PHACTS project
1. Scientific Lenses over Linked Data:
Identity Management in the
Open PHACTS project
Alasdair J G Gray
A.J.G.Gray@hw.ac.uk
www.alasdairjggray.co.uk
@gray_alasdair
http://c745.r45.cf2.rackcdn.com/img/2009/lens_filter_coasters.jpg
2. Open PHACTS Use Case
“Let me compare MW, logP
and PSA for launched
inhibitors of human &
mouse oxidoreductases”
Chemical Properties (Chemspider)
Launched drugs (Drugbank)
Human => Mouse (Homologene)
Protein Families (Enzyme)
Bioactivity Data (ChEMBL)
… other info (Uniprot/Entrez etc.)
21/05/2014 Brighton Seminar 1
3. Literature
PubChem
Genbank
Patents
Databases
Downloads
Data Integration Data Analysis
Firewalled Databases
Repeat @ each
company
x
Lowering industry firewalls: pre-competitive informatics in drug discovery
Nature Reviews Drug Discovery (2009) 8, 701-708 doi:10.1038/nrd2944
A single, shared
solution.
Funded under
• IMI: 2011-14
• ENSO: 2014-16
Pre-competitive Informatics
4. Open PHACTS Discovery Platform
Drug Discovery Platform
Apps
Domain API
Interactive
responses
Production quality
integration platform
Method
Calls
5. (April 2013 – March 2014)
15.8 million total hits
API Hits
8. OPS Discovery Platform
[Architecture diagram; labels recovered from the figure]
Apps
Linked Data API (RDF/XML, TTL, JSON)
Domain Specific Services
Core Platform: Semantic Workflow Engine; Identity Resolution Service; Identifier Management Service; Chemistry Registration, Normalisation & Q/C; Indexing
Data Cache (Virtuoso Triple Store)
Example inputs: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”
Sources: Public Content, Commercial, Public Ontologies, User Annotations (labelled VoID, Db and Nanopub)
11. Multiple Identities
Andy Law's Third Law
“The number of unique identifiers assigned to an individual is
never less than the number of Institutions involved in the study”
http://bioinformatics.roslin.ac.uk/lawslaws/
P12047
X31045
GB:29384
Are these the same thing?
15. Multiple Links: Different Reasons
Link: skos:closeMatch
Reason: non-salt form
Link: skos:exactMatch
Reason: drug name
20. Genes == Proteins?
BRCA1
Breast cancer type 1
susceptibility protein
http://en.wikipedia.org/wiki/File:Protein_BRCA1_PDB_1jm7.png
http://en.wikipedia.org/wiki/File:BRCA1_en.png
27. OPS Discovery Platform
[Architecture diagram, repeated from slide 8]
28. Query Expansion
Query Q, with lens L1:
    GRAPH <http://rdf.chemspider.com> {
      cw:979b545d-f9a9 cheminf:logd ?logd .
    }
Query Expander Service: sends (cw:979b545d-f9a9, L1) to the Identity Mapping Service (BridgeDB), which holds the Profiles and Mappings and returns
    [cw:979b545d-f9a9,
     cs:2157,
     chembl:1280,
     db:db00945]
Expanded query Q’:
    GRAPH <http://rdf.chemspider.com> {
      ?iri cheminf:logd ?logd .
      FILTER (?iri = cw:979b545d-f9a9 ||
              ?iri = cs:2157 ||
              ?iri = chembl:1280 ||
              ?iri = db:db00945 )
    }
Can also be achieved through UNION
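The two rewriting styles on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the project's Query Expander: the function names expand_with_filter and expand_with_union are hypothetical, and the list of equivalent IRIs is copied from the slide rather than fetched from a mapping service.

```python
from typing import Sequence


def expand_with_filter(variable: str, predicate: str, obj: str,
                       equivalents: Sequence[str]) -> str:
    """Replace the fixed subject with a variable restricted by a FILTER."""
    conditions = " ||\n        ".join(f"?{variable} = {iri}" for iri in equivalents)
    return f"?{variable} {predicate} {obj} .\nFILTER ({conditions})"


def expand_with_union(predicate: str, obj: str,
                      equivalents: Sequence[str]) -> str:
    """Repeat the triple pattern once per equivalent IRI, joined by UNION."""
    branches = [f"{{ {iri} {predicate} {obj} . }}" for iri in equivalents]
    return "\nUNION\n".join(branches)


# Equivalent IRIs as listed on the slide (normally returned by the mapping service).
EQUIVALENTS = ["cw:979b545d-f9a9", "cs:2157", "chembl:1280", "db:db00945"]

filter_query = expand_with_filter("iri", "cheminf:logd", "?logd", EQUIVALENTS)
union_query = expand_with_union("cheminf:logd", "?logd", EQUIVALENTS)
print(filter_query)
print(union_query)
```

Both forms ask for the same bindings; which one runs faster depends on the triple store.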
29. Experiment
Is it feasible to use a stand-off
mapping service?
• Base lines (no external call):
– “Perfect” URIs
– Linked data querying
• Expansion approaches (external service call):
– FILTER by Graph
– UNION by Graph
C. Y. A. Brenninkmeijer, C. A. Goble, A. J. G. Gray, P. T. Groth, A. Loizou, S. Pettifer: Including Co-referent URIs in a SPARQL Query. COLD 2013.
http://ceur-ws.org/Vol-1034/BrenninkmeijerEtAl_COLD2013.pdf
38. Conclusions
• Computing co-reference advantageous
– Requires fewer raw linksets
– Larger coverage across datasets
• Rules ensure control
– Genes can equal proteins
– Compounds never equal proteins
• Provenance captured throughout
39. Conclusions
• Query expansion slower in general
– Due to separate service call
– Difference below human perception
– UNION faster than FILTER on Virtuoso
• Stand-off mappings feasible
• Infrastructure can support lenses
Strict Relaxed
Analysing Browsing
Pharma are all accessing, processing, storing & re-processing external research data
OPS: 29 partners
A platform for integrated pharmacology data
Relied upon by pharma companies
Public domain, commercial, and private data sources
Provides domain specific API
Making it easy to build multiple drug discovery applications: examples developed in the project
Public launch April 2013
17 apps
5 external
1 in partnership
Linked data API: multiple response formats (JSON, RDF, XML, CSV …)
3scale deployment
Public dataset
Import data into cache
API calls populate SPARQL queries
Integration approach
Data kept in original model
Data cached in central triple store
API call translated to SPARQL query
Query expressed in terms of original data
Queries expanded by IMS to cover URIs of original datasets
Example using Explorer application, see Ian’s demo of the new version in the demo session
User starts typing
Server sends back suggestions – User selects one
URI sent to platform
Integrated Information returned including provenance
Each captures a subtly different view of the world
Are they the same? … depends on your point of view
Example drug: Gleevec, a cancer drug for leukemia
Lookup in three popular public chemical databases
Different results
Data is messy!
Enter with ChemSpider URI for Imatinib
This is not Gleevec
sameAs != sameAs: it depends on your point of view
Links relate individual data instances: source, target, predicate, reason.
Links are grouped into Linksets which have VoID header providing provenance and justification for the link.
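A rough sketch of how a lens could select links by their reason follows. Everything in it is an assumption made for illustration: the Link class, the ex: identifiers, and the choice of which reasons the STRICT and RELAXED lenses accept are hypothetical and are not taken from the IMS implementation.

```python
from collections import deque
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    predicate: str
    reason: str  # justification recorded for the link


def equivalents(start: str, links: Iterable[Link], lens: FrozenSet[str]) -> Set[str]:
    """Collect everything reachable from start over links the lens accepts."""
    active = [link for link in links if link.reason in lens]
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for link in active:
            # Links are followed in both directions.
            if link.source == current:
                neighbour = link.target
            elif link.target == current:
                neighbour = link.source
            else:
                continue
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


# Hypothetical links; identifiers are placeholders, not real dataset URIs.
LINKS = [
    Link("ex:imatinib_mesylate", "ex:imatinib", "skos:closeMatch", "non-salt form"),
    Link("ex:imatinib", "ex:drug_entry", "skos:exactMatch", "drug name"),
]

# Assumed lens definitions: each lens is the set of reasons it activates.
STRICT = frozenset({"drug name"})
RELAXED = frozenset({"drug name", "non-salt form"})

print(equivalents("ex:imatinib", LINKS, STRICT))
print(equivalents("ex:imatinib", LINKS, RELAXED))
```

The same starting identifier yields a smaller or larger equivalence set depending on the lens, which is the point-of-view idea from the slides.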
Interested in physicochemical properties of Gleevec
Interested in biomedical and pharmacological properties
Can enter with IDs from any of the supported datasets
Platform extracts data from certain datasets
These need to be connected
Here there is no issue in computing transitive links, as they are all the same compound based on InChI key
Would compute the full set of links
Do genes == proteins?
Different conceptual types: gene and protein
Often used as a shortcut for retrieval: BRCA1 easier to remember and type!
Require the ability to equate them in the IMS
----
But if you’re explaining why genes = proteins, you should also be prepared for questions about when genes != proteins. Splice variation is a common example. In the FAS receptor (http://en.wikipedia.org/wiki/Alternative_splicing#Exon_definition:_Fas_receptor) there is one gene, but it can be made into two distinct proteins which have different biological effects, so you can mix bio data that shouldn’t be mixed by integrating these two functions on the same ID. [We currently don’t handle this well in OPS]
And the most used example here: the ghrelin gene encodes a protein which is cleaved in two to form two completely different hormones, ghrelin and obestatin, which do very different things but come from the same gene. http://en.wikipedia.org/wiki/Ghrelin#Synthesis_and_variants
Insulin Receptor
Issue when linking through PDB due to the way that proteins are crystallised
Can enter with IDs from any of the supported datasets
These are 1.3 figures
In 1.4
130 raw linksets with 6,985,278 links
40,802 computed linksets with 25,584,293 links
Implementation available
IMS takes query and expands URIs
Retinoic Acid
Reminder: enter with method and URI, implemented as a query
Challenge: can we efficiently support lenses
Lenses require stand-off mappings, implemented as extra service call
Query with URIs
Extract URIs
Find equivalents
Expand query
Optimise based on context
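The first steps of this pipeline (extract URIs, find equivalents through a stand-off lookup) might look like the sketch below. It is only an illustration under stated assumptions: DATA_PREFIXES, extract_uris, find_equivalents and the in-memory MAPPINGS table are hypothetical stand-ins, and the real service call is replaced by a dictionary lookup keyed on URI and lens.

```python
import re
from typing import Callable, Dict, List

# Assumption: only prefixed names with these dataset prefixes are expanded.
DATA_PREFIXES = {"cw", "cs", "chembl", "db"}

PREFIXED_NAME = re.compile(r"\b([A-Za-z]\w*):([\w-]+)")


def extract_uris(query: str) -> List[str]:
    """Return the dataset URIs (as prefixed names) mentioned in the query."""
    found: List[str] = []
    for match in PREFIXED_NAME.finditer(query):
        if match.group(1) in DATA_PREFIXES and match.group(0) not in found:
            found.append(match.group(0))
    return found


def find_equivalents(uris: List[str], lens: str,
                     lookup: Callable[[str, str], List[str]]) -> Dict[str, List[str]]:
    """Ask the mapping lookup for the equivalents of each URI under a lens."""
    return {uri: lookup(uri, lens) for uri in uris}


# In-memory stand-in for the external mapping service (values from the slide).
MAPPINGS = {
    ("cw:979b545d-f9a9", "L1"): [
        "cw:979b545d-f9a9", "cs:2157", "chembl:1280", "db:db00945",
    ],
}


def lookup(uri: str, lens: str) -> List[str]:
    # With no mapping for this lens, a URI is only equivalent to itself.
    return MAPPINGS.get((uri, lens), [uri])


QUERY = "GRAPH <http://rdf.chemspider.com> { cw:979b545d-f9a9 cheminf:logd ?logd . }"
uris = extract_uris(QUERY)
mapping = find_equivalents(uris, "L1", lookup)
print(mapping)
```

Keeping the lookup behind a function argument mirrors the stand-off design: the query rewriting does not need to know where the mappings live.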
Result size in brackets
Orange are actual OPS queries
Subset of the OPS data
Linked data approach performs badly with query 6 due to the query construction
Name being bound to the chemical structure returned
Focus on other queries
In general expansion is slower than base lines
Worst-case delta: 0.01842 s (under 20 ms)
Human perception threshold is 0.050 to 0.2 s (50-200 ms)
Focus on query 6
No linked data as it performed very poorly on this query
Size of result obliterates external call cost