Linked data relies on instance-level links between potentially differing representations of concepts in multiple datasets. However, in large, complex domains such as pharmacology, the inter-relationship of data instances needs to consider the context (e.g. task, role) of the user and the assumptions they want to apply to the data. Such context is not taken into account in most linked data integration procedures. In this paper, we argue that dataset links should be stored in a stand-off fashion, thus enabling different assumptions to be applied to the data links during query execution. We present the infrastructure developed for the Open PHACTS Discovery Platform to enable this and show through evaluation that the incurred performance cost is below the threshold of user perception.
http://ceur-ws.org/Vol-1034/BrenninkmeijerEtAl_COLD2013.pdf
1. Including Co-referent URIs in a SPARQL Query
Christian Y A Brenninkmeijer, Carole Goble, Alasdair J G Gray, Paul Groth, Antonis Loizou, and Steve Pettifer
www.openphacts.org
@open_phacts
A.J.G.Gray@hw.ac.uk
@gray_alasdair
2. Multiple Identities
Andy Law's Third Law
“The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study”
http://bioinformatics.roslin.ac.uk/lawslaws.html
GB:29384, P12047, X31045: are these the same thing?
22/10/2013
COLD 2013
1
9. Open PHACTS Discovery Platform
Apps
Method calls / interactive responses
Domain API
Drug Discovery Platform: production quality integration platform
10. Integration Approach
• Data kept in original model
• Data cached in central triple store
• API call translated to SPARQL query
• Query expressed in terms of original data
11. OPS Discovery Platform
Architecture diagram (labels recovered from the figure):
• Apps
• Linked Data API (RDF/XML, TTL, JSON)
• Core Platform: Semantic Workflow Engine, Identity Resolution Service, Identifier Management Service, Domain Specific Services (Chemistry Registration, Normalisation & Q/C), Indexing
• Example identifiers shown: “Adenosine receptor 2a”, P12374, EC2.43.4, CS4532
• Data Cache (Virtuoso Triple Store)
• Sources: Public Content, Commercial, Public Ontologies, User Annotations (Db and Nanopub inputs, each with a VoID description)
12. Platform Interaction
1. Resolve user input:
– User enters search text
– Resolve to a URI for concept
2. Request data for URI
– Expand URI to equivalent for each dataset
– Run resulting SPARQL query
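The two steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the lookup tables, the search label "example compound", and the function names `resolve` and `expand` are hypothetical stand-ins for the platform's resolution and mapping services, not its actual API. The URIs are the prefixed names that appear on the Query Expansion slide.

```python
from typing import Dict, List

# Hypothetical label index for step 1: search text -> concept URI.
CONCEPT_INDEX: Dict[str, str] = {
    "example compound": "cw:979b545d-f9a9",
}

# Hypothetical co-reference table for step 2: concept URI -> dataset URIs.
EQUIVALENTS: Dict[str, List[str]] = {
    "cw:979b545d-f9a9": ["cs:2157", "chembl:1280", "db:db00945"],
}


def resolve(search_text: str) -> str:
    """Step 1: resolve user input to a single concept URI."""
    key = search_text.strip().lower()
    if key not in CONCEPT_INDEX:
        raise KeyError(f"no concept found for {search_text!r}")
    return CONCEPT_INDEX[key]


def expand(uri: str) -> List[str]:
    """Step 2: expand a URI to itself plus its equivalent in each dataset."""
    return [uri] + EQUIVALENTS.get(uri, [])


# The expanded list is what would be substituted into the SPARQL query.
EXPANDED = expand(resolve("Example Compound"))
print(EXPANDED)
```

Keeping resolution and expansion as separate functions mirrors the slide: the first is driven by user text, the second by whichever mappings are in force at query time.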
13. Query Expansion
Original pattern (Q):
GRAPH <http://rdf.chemspider.com> {
  cw:979b545d-f9a9 cheminf:logd ?logd .
}
Expanded pattern (Q’):
GRAPH <http://rdf.chemspider.com> {
  ?iri cheminf:logd ?logd .
  FILTER (?iri = cw:979b545d-f9a9 ||
          ?iri = cs:2157 ||
          ?iri = chembl:1280 ||
          ?iri = db:db00945 )
}
Can also be achieved through UNION
Diagram: (Q, L1) → Query Expander Service → Q’
Query Expander Service → (cw:979b545d-f9a9, L1) → Identity Mapping Service (BridgeDB) → [cw:979b545d-f9a9, cs:2157, chembl:1280, db:db00945]
Identity Mapping Service inputs: Mappings, Profiles
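The FILTER-based rewrite on this slide can be sketched as a small string-building function. This is a minimal sketch, not the Query Expander Service itself: the function name `expand_with_filter` and its parameters are hypothetical, and the list of URIs is assumed to have already been returned by the mapping service.

```python
from typing import Sequence


def expand_with_filter(graph: str, predicate: str, obj_var: str,
                       uris: Sequence[str], iri_var: str = "?iri") -> str:
    """Build a GRAPH pattern that matches any of the co-referent URIs."""
    if not uris:
        raise ValueError("at least one URI is required")
    # One equality test per co-referent URI, joined with logical OR.
    conditions = " || ".join(f"{iri_var} = {uri}" for uri in uris)
    return (
        f"GRAPH <{graph}> {{\n"
        f"  {iri_var} {predicate} {obj_var} .\n"
        f"  FILTER ({conditions})\n"
        f"}}"
    )


# Prefixed names as they appear on the slide; prefix declarations are omitted.
URIS = ["cw:979b545d-f9a9", "cs:2157", "chembl:1280", "db:db00945"]
QUERY = expand_with_filter("http://rdf.chemspider.com",
                           "cheminf:logd", "?logd", URIS)
print(QUERY)
```

The fixed subject is replaced by a variable so that a single triple pattern covers every equivalent URI, with the FILTER restricting the variable to the mapped set.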
14. Experiment
Is it feasible to use a stand-off mapping service?
• Baselines (no external call):
– “Perfect” URIs
– Linked data querying
• Expansion approaches (external service call):
– FILTER by Graph
– UNION by Graph
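For comparison with the FILTER form, the UNION approach can be sketched as one branch per co-referent URI. The slides name the approach but do not show the UNION query text, so the exact shape below is an assumption, as are the function name `expand_with_union` and its parameters.

```python
from typing import Sequence


def expand_with_union(graph: str, predicate: str, obj_var: str,
                      uris: Sequence[str]) -> str:
    """Build a GRAPH pattern with one UNION branch per co-referent URI."""
    if not uris:
        raise ValueError("at least one URI is required")
    # Each branch repeats the triple pattern with a different fixed subject.
    branches = [f"  {{ {uri} {predicate} {obj_var} . }}" for uri in uris]
    body = "\n  UNION\n".join(branches)
    return f"GRAPH <{graph}> {{\n{body}\n}}"


# Prefixed names taken from the Query Expansion slide.
UNION_URIS = ["cw:979b545d-f9a9", "cs:2157", "chembl:1280", "db:db00945"]
UNION_QUERY = expand_with_union("http://rdf.chemspider.com",
                                "cheminf:logd", "?logd", UNION_URIS)
print(UNION_QUERY)
```

Both forms return the same bindings for `?logd`; they differ only in how the triple store evaluates them, which is what the experiment measures.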
23. Conclusions
• Query expansion slower in general
– Due to separate service call
– Difference below human perception
– UNION faster than FILTER on Virtuoso
• Stand-off mappings feasible
• Infrastructure can support lenses
– Lens labels shown: Strict, Relaxed, Analysing, Browsing
Each captures a subtly different view of the world. Are they the same? That depends on your point of view.
Example drug: Gleevec, a cancer drug for leukemia. Lookup in three popular public chemical databases gives different results. Data is messy!
Enter with the ChemSpider URI for Imatinib. This is not Gleevec.
sameAs != sameAs: it depends on your point of view. Links relate individual data instances: source, target, predicate, reason. Links are grouped into Linksets, each with a VoID header providing provenance and justification for the links.
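The link model in this note (source, target, predicate, reason, grouped into Linksets) can be sketched as a small data structure. Everything named here is a hypothetical illustration: the `ex:` and `ex2:` URIs, the reason labels, the provenance strings, and the idea of modelling a lens as a set of accepted reasons are assumptions, not the platform's actual schema. The lookup is a single hop and does not follow links transitively.

```python
from dataclasses import dataclass
from typing import FrozenSet, Sequence, Set, Tuple


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    predicate: str
    reason: str  # justification for treating source and target as linked


@dataclass(frozen=True)
class Linkset:
    provenance: str  # stands in for the VoID header of the linkset
    links: Tuple[Link, ...]


def equivalents(uri: str, linksets: Sequence[Linkset],
                lens: FrozenSet[str]) -> Set[str]:
    """Return uri plus every URI linked to it by a reason the lens accepts."""
    result = {uri}
    for linkset in linksets:
        for link in linkset.links:
            if link.reason not in lens:
                continue  # this lens does not accept the justification
            if link.source == uri:
                result.add(link.target)
            elif link.target == uri:
                result.add(link.source)
    return result


# Hypothetical example data: two linksets with different justifications.
LINKSETS = [
    Linkset("hypothetical linkset A", (
        Link("ex:imatinib", "ex2:imatinib", "skos:exactMatch",
             "exact-structure"),
    )),
    Linkset("hypothetical linkset B", (
        Link("ex:imatinib", "ex:gleevec", "skos:closeMatch",
             "same-active-ingredient"),
    )),
]
STRICT = frozenset({"exact-structure"})
RELAXED = STRICT | {"same-active-ingredient"}

print(sorted(equivalents("ex:imatinib", LINKSETS, STRICT)))
print(sorted(equivalents("ex:imatinib", LINKSETS, RELAXED)))
```

Because the links are held apart from the data, switching from the strict to the relaxed lens changes which URIs count as the same without touching any dataset.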
A platform for integrated pharmacology data. Relied upon by pharma companies. Public domain, commercial, and private data sources. Provides a domain-specific API, making it easy to build multiple drug discovery applications: examples developed in the project.
Step 2 requires expansion of the URI to cover those used in the datasets. Performed by the query expansion service and IMS.
Import data into cache. Domain-specific API. API calls populate SPARQL queries. Queries expanded by IMS to cover URIs of original datasets.
Query with URIs. Extract URIs. Find equivalents. Expand query. Optimise based on context.
Result size in brackets
Subset of the OPS data
Linked data approach performs badly with query 6 due to the query construction. Name being bound to the chemical structure returned.
Focus on other queries. In general, expansion is slower than the baselines. Worst-case delta: 0.01842 s (under 20 ms). Human perception threshold is 0.050 to 0.2 s.
Focus on query 6. No linked data, as it performed very poorly on this query. Size of result obliterates external call cost.