This document discusses data integration challenges in a big data context using the Open PHACTS case study. Open PHACTS aims to integrate multiple biomedical data resources into a single open access point. It has developed a cloud-based production level system that provides semantic web-based APIs to access integrated data on diseases, tissues, targets, compounds and pathways. The system addresses issues like identity resolution, data quality, provenance and licensing to enable complex queries across diverse data sources.
Data Integration in a Big Data Context: An Open PHACTS Case Study
1. Data Integration in a Big Data Context: Open PHACTS Case Study
Alasdair J G Gray
A.J.G.Gray@hw.ac.uk
alasdairjggray.co.uk
@gray_alasdair
2. Big Data
@gray_alasdair Big Data Integration 2
Volume Velocity
Variety Veracity
http://i.kinja-img.com/gawker-media/image/upload/lvzm0afp8kik5dctxiya.jpg
3. Open PHACTS Use Case
“Let me compare MW, logP and PSA for launched inhibitors of human & mouse oxidoreductases”
Chemical Properties (Chemspider)
Launched drugs (Drugbank)
Human => Mouse (Homologene)
Protein Families (Enzyme)
Bioactivity Data (ChEMBL)
… other info (Uniprot/Entrez etc.)
4. Open PHACTS Mission: Integrate Multiple Research Biomedical Data Resources Into A Single Open & Free Access Point
8. OPS Discovery Platform
Drug Discovery Platform
Apps
Domain API
Interactive responses
Production quality integration platform
Method Calls
Standard Web Technologies
9. App Ecosystem
An “App Store”?
Explorer Explorer2 ChemBioNavigator Target Dossier Pharmatrek Helium
MOE Collector Cytophacts Utopia Garfield SciBite
KNIME Mol. Data Sheets PipelinePilot scinav.it Taverna
https://www.openphacts.org/2/sci/apps.html
13. API Hits
Figure: chart of the number of API hits per month, in millions (axis from 0 to 60), from January 2013 to June 2015. Annotations mark the public launch of the 1.2 API and the 1.3, 1.4 and 1.5 API releases.
14. OPS Discovery Platform
Figure: architecture of the OPS Discovery Platform.
Data sources: Public Content, Commercial, Public Ontologies, User Annotations (source labels: Db, Nanopub, VoID)
Core Platform: Data Cache (Virtuoso Triple Store), Semantic Workflow Engine, Identity Resolution Service, Identifier Management Service, Chemistry Registration Normalisation & Q/C, Indexing, Domain Specific Services
Example inputs: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”
Linked Data API (RDF/XML, TTL, JSON)
Apps
16. Data Licensing (Or Lack Of!)
John Wilbanks consulted for us
A framework built around STANDARD well-understood Creative Commons licences, and how they interoperate
Deal with the problems by:
Interoperable licences
Appropriate terms
Declare expectations to users and data publishers
One size won't fit all requirements
20. Identity Mapping
Example identifiers: P12047, X31045, GB:29384
Andy Law's Third Law
“The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study”
http://bioinformatics.roslin.ac.uk/lawslaws/
22. Gleevec®: Imatinib Mesylate
Drugbank, ChemSpider, PubChem
Record labels: Imatinib, Mesylate, Imatinib Mesylate
YLMAHDNUQAMNNX-UHFFFAOYSA-N
Are these records the same?
It depends upon your task!
23. Structure Lens
“I need to perform an analysis, give me details of the active compound in Gleevec.”
Lens scale: Strict (Analysing) to Relaxed (Browsing)
Link: skos:exactMatch (InChI)
24. Name Lens
“Which targets are known to interact with Gleevec?”
Lens scale: Strict (Analysing) to Relaxed (Browsing)
Links: two skos:closeMatch (Drug Name) links and one skos:exactMatch (InChI) link
29. Open PHACTS Approach
1. Know your audience
Web developers
2. Understand your use cases
Prioritised business questions
3. Identify access pathways
Identify data
Identify connections
Implement API
30. Questions
Alasdair J G Gray
A.J.G.Gray@hw.ac.uk
alasdairjggray.co.uk
@gray_alasdair
Open PHACTS
contact@openphacts.org
openphacts.org
@open_phacts
Editor's notes
Deriving value from the data
Volume: More data than you can process – relative term; complexity of processing
Velocity: Data constantly being generated
Variety: Multiple sources, formats, models
Veracity: Accuracy of the data
Open PHACTS has not dealt with Velocity, although it is a challenge for us
1 of 83 business driver questions
It took a team of 5 experienced researchers 6 hours to gather the answer manually
At the start of the project it could not be answered by a computer system
6 months in, the prototype answered it in 30 s
It is now answered in under a second
Pharma companies are all accessing, processing, storing & re-processing external research data
A big waste of resources
No competitive advantage
OPS: 29 partners including many major pharma
83 questions ranked and top 20 taken as target
18 of top 20
A platform for integrated pharmacology data
Relied upon by pharma companies
Public domain, commercial, and private data sources
Provides domain specific API
Making it easy to build multiple drug discovery applications: examples developed in the project
Not just in-house apps
Actively being used for different purposes
Public launch April 2013
Averaging 20 million hits a month from the start of 2015
38 million in the last 30 days
Heavy usage from pharma, academia, and biotech
500+ registered users
Import data into cache
Integration approach
Data kept in original model but cached centrally
API call translated to SPARQL query
Query expressed in terms of original data
Queries expanded by IMS to cover URIs of original datasets
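The expansion step described in these notes can be sketched as follows. This is only an illustration under stated assumptions: the `EQUIVALENTS` table is a hypothetical stand-in for the IMS, the `example.org` URIs are placeholders, and the generated query shape is not taken from the Open PHACTS code base.

```python
from typing import Dict, List, Set

# Hypothetical equivalence table standing in for the IMS.
# The URIs are made-up placeholders, not real dataset identifiers.
EQUIVALENTS: Dict[str, Set[str]] = {
    "http://example.org/chemspider/123": {
        "http://example.org/drugbank/DB0001",
        "http://example.org/chembl/C42",
    },
}


def expand_uri(uri: str) -> List[str]:
    """Return the input URI plus every URI mapped to it, sorted for stability."""
    return sorted({uri} | EQUIVALENTS.get(uri, set()))


def build_query(uri: str) -> str:
    """Build a SPARQL query whose VALUES clause covers all equivalent URIs."""
    values = " ".join(f"<{u}>" for u in expand_uri(uri))
    return (
        "SELECT ?s ?p ?o WHERE {\n"
        f"  VALUES ?s {{ {values} }}\n"
        "  ?s ?p ?o .\n"
        "}"
    )


if __name__ == "__main__":
    print(build_query("http://example.org/chemspider/123"))
```

Because each dataset keeps its original model and URIs, a query written against one identifier only reaches the other datasets once the equivalent URIs are added to it.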
Data provided by many publishers
Originally in many formats: relational, SD files and RDF
Worked closely with publishers
Data licensing was a major issue
Over 3 billion triples – 12 datasets
Hosted on powerful hardware; the aim is to hold the data in memory
Extensive memcaching
Pose complex queries to extract data
Interactions needed to satisfy use cases
Gradually added additional types of data and interactions
No standard units
Even in curated sources!
Feedback issues to data providers
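A minimal sketch of unit normalisation, assuming concentration values as the example. The conversion table and the unit spellings are assumptions made for illustration; the notes do not say which units or rules the platform uses.

```python
# Conversion factors to nanomolar. The unit spellings listed here are
# assumptions; real sources may use other variants.
TO_NANOMOLAR = {
    "M": 1e9,
    "mM": 1e6,
    "uM": 1e3,
    "µM": 1e3,
    "nM": 1.0,
    "pM": 1e-3,
}


def to_nanomolar(value: float, unit: str) -> float:
    """Convert a concentration to nM, raising on units we do not recognise."""
    try:
        factor = TO_NANOMOLAR[unit.strip()]
    except KeyError:
        raise ValueError(f"unrecognised unit: {unit!r}") from None
    return value * factor


if __name__ == "__main__":
    print(to_nanomolar(2.0, "uM"))
```

Raising an error for an unknown unit, rather than guessing, leaves a record of the problem that can be fed back to the data provider.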
Validation & Standardization Platform
Developed by Royal Society of Chemistry
http://bit.ly/NZF5VB
Example drug: Gleevec, a cancer drug for leukemia
Looked up in three popular public chemical databases: different results
Chemistry is complicated, often simplified for convenience
Data is messy!
Are these records the same? It depends on what you are doing with the data!
Each captures a subtly different view of the world
Interested in physicochemical properties of Gleevec
Interested in biomedical and pharmacological properties
sameAs != sameAs: it depends on your point of view
Links relate individual data instances: source, target, predicate, reason.
Links are grouped into Linksets which have VoID header providing provenance and justification for the link.
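The link model and the lens idea from the slides can be sketched as below. The class names, the `LENSES` table and the prefixed identifiers are all hypothetical; the sketch only follows direct links (one hop) and filters them by the justification recorded on each linkset.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    predicate: str  # e.g. "skos:exactMatch" or "skos:closeMatch"


@dataclass
class Linkset:
    justification: str  # why the links hold, e.g. "InChI" or "Drug Name"
    provenance: str     # who produced the linkset (free text in this sketch)
    links: List[Link] = field(default_factory=list)


# Hypothetical lenses: each lists the justifications it accepts.
LENSES: Dict[str, Set[str]] = {
    "structure": {"InChI"},            # strict, for analysis
    "name": {"InChI", "Drug Name"},    # relaxed, for browsing
}


def equivalents(uri: str, linksets: List[Linkset], lens: str) -> Set[str]:
    """Return identifiers directly linked to uri under the chosen lens."""
    allowed = LENSES[lens]
    found: Set[str] = set()
    for linkset in linksets:
        if linkset.justification not in allowed:
            continue
        for link in linkset.links:
            if link.source == uri:
                found.add(link.target)
            elif link.target == uri:
                found.add(link.source)
    return found


if __name__ == "__main__":
    sets = [
        Linkset("InChI", "example", [Link("cs:1", "db:1", "skos:exactMatch")]),
        Linkset("Drug Name", "example", [Link("cs:1", "pc:2", "skos:closeMatch")]),
    ]
    print(equivalents("cs:1", sets, "structure"))
    print(equivalents("cs:1", sets, "name"))
```

The same pair of records is treated as equivalent under the name lens and as distinct under the structure lens, which is the point of the "it depends upon your task" slides.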
Open for anybody
API grouped into theme areas
Two phase interaction:
Resolve thing to identifier
Retrieve data about the identifier
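The two-phase interaction can be illustrated with an in-memory sketch. The lookup tables, function names and URIs here are hypothetical stand-ins for the real API calls, which are not shown in the slides.

```python
from typing import Dict, List

# Hypothetical in-memory stand-ins for the resolution index and the data cache.
LABEL_INDEX: Dict[str, List[str]] = {
    "gleevec": ["http://example.org/compound/1"],
}
RECORDS: Dict[str, Dict[str, str]] = {
    "http://example.org/compound/1": {"label": "Gleevec", "type": "compound"},
}


def resolve(text: str) -> List[str]:
    """Phase 1: map free text to zero or more candidate identifiers."""
    return LABEL_INDEX.get(text.strip().lower(), [])


def retrieve(uri: str) -> Dict[str, str]:
    """Phase 2: fetch the data held about one identifier."""
    return RECORDS[uri]


def lookup(text: str) -> List[Dict[str, str]]:
    """Run both phases: resolve the text, then retrieve each match."""
    return [retrieve(uri) for uri in resolve(text)]


if __name__ == "__main__":
    print(lookup("Gleevec"))
```

Keeping the phases separate lets a client show the candidate identifiers to the user and let them choose before any data is retrieved.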