Enabling Real-time Genome Data Research with In-memory Database Technology (SAP Life Science Forum 2013)
1. In-Memory Database Technology Enables
Real-Time Genome Data Research
SAP Life Science Forum, Dublin
June 04, 2013
Dr. Matthieu Schapranow
Hasso Plattner Institute
2. Agenda
■ Numbers You Should Know
■ Personalized Medicine
■ High-Performance In-Memory Genome (HIG) Project
■ Outlook
In-Memory Technology Enables Genome Data Research, Dr. Schapranow, June 04, 2013
4. Numbers You Should Know
Conventional Cancer Therapies
[Figure: bars on a 0% to 100% scale. Men and women: share who will develop cancer vs. will never develop cancer (American Cancer Society, Surveillance Research, 2012). Chemotherapies: share that fail vs. work.]
5. Numbers You Should Know
The Human Genome Project
■ 1990: Human Genome (HG) project
started with 3B USD funding
■ 2000: 1st draft of the HG announced
■ 10 years until first HG version;
thousands of institutes involved
http://www.molecularecologist.com/next-gen-table-3a/
■ 2013: Latest Next-Generation Sequencing (NGS) device
“Illumina HiSeq 2500” costs ≈700k USD and enables whole-genome
sequencing in <2 days for <10k USD per run
■ But: analysis takes up to several weeks
■ What’s next? Real-time analysis of genome data!
6. Numbers You Should Know
Comparison of Costs
[Figure: Comparison of Costs for Main Memory and Genome Sequencing. Y-axis: costs in USD, logarithmic from 0.001 to 10,000. X-axis: date, January 2001 to January 2013. Series: costs per megabyte of RAM, costs per megabase of sequencing.]
7. Agenda
■ Numbers You Should Know
■ Personalized Medicine
■ High-Performance In-Memory Genome (HIG) Project
■ Outlook
8. Personalized Medicine
Our Motivation
■ Today, analysis of genome data, e.g. for personalized treatment,
takes 4-6 weeks (incl. biopsy, biological preparation, sequencing,
alignment, variant calling, full analysis, and evaluation)
■ In-memory technology is well suited to accelerating genome analysis
□ Highly parallel alignment / variant calling (data preparation)
□ Real-time analysis of individual patient and cohort data
□ Combined search in structured / unstructured data
■ Challenge: Can we analyze all of a patient’s data,
incl. Electronic Medical Record (EMR) and genome
data, during a doctor’s visit?
10. Personalized Medicine
Our Vision
Desirability
■ Integrated portfolio of specialized services
for clinicians, researchers, and patients
■ Include latest research results, e.g. most
effective therapies
Viability
■ Share data via the Internet to get
feedback from worldwide experts (cost-saving)
■ Combine research data (publications,
annotations, genome data) from
international databases in a single
knowledge base
■ Enable personalized medicine also in
far-off regions and developing countries
Feasibility
■ Allele frequency count of 12B
records in < 1s
■ Identification of relevant
annotations out of 80M in < 1s
■ Integrated alignment and
variant calling within hours
instead of days
11. Personalized Medicine
User Requirements
For researchers
■ Enable real-time analysis of genome data
■ Automatic scan of pathways to identify cellular
impact of mutations
■ Free-text search in publications, diagnoses, and EMR
data (structured and unstructured data)
For clinicians
■ Preventive diagnostics to identify at-risk patients
■ Indicate pharmacokinetic correlations
■ Scan for comparable patient cases
For patients
■ Identify relevant clinical trials / experts
■ Start most appropriate therapy early based on all
evidence and the latest knowledge
12. Agenda
■ Numbers You Should Know
■ Personalized Medicine
■ High-Performance In-Memory Genome (HIG) Project
■ Outlook
13. High-Performance In-Memory Genome Project
Integration of Genomic Data
■ Once DNA sequences
are generated by NGS
devices, HIG comes
into play
■ Preprocessing of DNA
(alignment, variant
calling) can be
modeled and
executed as an integrated
process
■ Results are stored in
an in-memory database
to enable instant
analysis
14. High-Performance In-Memory Genome Project
The In-Memory Technology Toolbox
■ Any attribute as index
■ Insert only for time travel
■ Combined column and row store
■ No aggregate tables
■ Minimal projections
■ Partitioning
■ Analytics on historical data
■ Single and multi-tenancy
■ SQL interface on columns & rows
■ Reduction of layers
■ Lightweight compression
■ Multi-core/parallelization
■ On-the-fly extensibility
■ Active/passive data store
■ Bulk load
■ Text retrieval and extraction
■ Object to relational mapping
■ Dynamic multi-threading within nodes
■ Map reduce
■ No disk
■ Group key
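The toolbox names lightweight compression without saying which scheme is meant. Dictionary encoding is one scheme commonly used in column stores: each distinct value in a column is replaced by a small integer code. The sketch below is illustrative only; the function names and the toy column of bases are assumptions, not part of the HIG project.

```python
def dictionary_encode(column):
    """Replace each value with an integer code into a sorted dictionary."""
    dictionary = sorted(set(column))
    code_of = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [code_of[value] for value in column]


def dictionary_decode(dictionary, codes):
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[code] for code in codes]


# Hypothetical toy column: one base per row.
bases = ["A", "G", "A", "T", "A", "G"]
dictionary, codes = dictionary_encode(bases)
restored = dictionary_decode(dictionary, codes)
```

The fewer distinct values a column holds, the fewer bits each code needs, which is why this approach suits columns such as bases or genotypes.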
15. High-Performance In-Memory Genome Project
Challenges of Genome Data Analysis
Analysis of Genomic Data | Alignment and Variant Calling | Analysis of Annotations in World-wide DBs
Bound To                 | CPU Performance               | Memory Capacity
Duration                 | Hours – Days                  | Weeks
HPI                      | Minutes                       | Real-time
In-Memory Technology     | Multi-Core                    | Partitioning & Compression
16. High-Performance In-Memory Genome Project
Challenges of Genome Data Analysis
Analysis of Genomic Data | Alignment and Variant Calling | Analysis of Annotations in World-wide DBs
Bound To                 | CPU Performance               | Memory Capacity
Duration                 | Hours – Days                  | Weeks
HPI & SAP                | Minutes – Hours               | Interactively
In-Memory Technology     | Multi-Core                    | Partitioning & Compression
17. High-Performance In-Memory Genome Project
Selected Research Topics
Improving Analyses:
■ Clustering of patient cohorts, e.g. k-means clustering
■ Combined search, e.g. in clinical trials and side-effect databases
■ Ad-hoc analysis of genetic pathways, e.g. to identify cause/effect
Improving Data Preparations:
■ Graphical modeling of Genome Data Processing (GDP) pipelines
■ Scheduling and execution of multiple GDP pipelines in parallel
■ App store for medical knowledge (bring algorithms to data)
■ Exchange of sensitive data, e.g. history-based access control
■ Billing processes for intellectual property and services
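As a rough illustration of running several GDP pipelines in parallel, the sketch below models a pipeline as an ordered list of stages and executes one pipeline per sample on a thread pool. The stage functions are placeholders, not real alignment or variant-calling tools, and all names and sample data are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def align(reads):
    # Placeholder for a real aligner: tags each read as aligned.
    return [f"aligned:{read}" for read in reads]


def call_variants(alignments):
    # Placeholder for a real variant caller: keeps reads marked with "*".
    return [a for a in alignments if a.endswith("*")]


# A pipeline model is simply an ordered list of stages.
PIPELINE = [align, call_variants]


def run_pipeline(reads, stages=PIPELINE):
    """Feed the output of each stage into the next one."""
    data = reads
    for stage in stages:
        data = stage(data)
    return data


def run_in_parallel(samples, max_workers=4):
    """Run one pipeline per sample concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(run_pipeline, samples.values())
        return dict(zip(samples.keys(), results))


# Hypothetical samples: reads ending in "*" stand in for variant sites.
samples = {"patient_1": ["ACGT", "ACGA*"], "patient_2": ["TTGA"]}
variants = run_in_parallel(samples)
```

CPU-bound stages would more likely be dispatched to separate processes or cluster nodes; threads are used here only to keep the sketch short.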
18. High-Performance In-Memory Genome Project
Genomics Analysis
Loaded part of the 1,000 Genomes pre-phase 1 dataset
■ Chromosome 1 of 629 individuals from the 1,000 Genomes project
■ 12 billion entries in largest database table
■ 293 GB of data (compressed in HANA)
Results
■ Report SNPs failing quality control
UCSC 102.47 sec | SAP HANA 1.25 sec – 82x faster
■ Compute the alternative allele frequency for each variant/region
VCFtools 259 sec | SAP HANA 0.43 sec – 600x faster
■ Compute the total number of missing genotypes per individual
VCFtools 548 sec | SAP HANA 2 sec – 270x faster
Supported by Dr. Carlos Bustamante lab
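To make the last two queries concrete, here is a minimal sketch of what they compute on a tiny genotype table held in columnar form. It is not how VCFtools or SAP HANA implement them. The column names, sample data, and helper functions are hypothetical; genotypes follow VCF-style notation (0 = reference allele, 1 = alternative allele, "." = missing call).

```python
from collections import defaultdict

# Hypothetical toy table in columnar form: one list per attribute.
variant_col = ["rs1", "rs1", "rs1", "rs2", "rs2", "rs2"]
individual_col = ["A", "B", "C", "A", "B", "C"]
genotype_col = ["0|1", "1|1", "./.", "0|0", "./.", "0|1"]


def split_alleles(genotype):
    """Split a phased ("|") or unphased ("/") genotype into its alleles."""
    return genotype.replace("/", "|").split("|")


def alt_allele_frequency(variants, genotypes):
    """Alternative alleles divided by called alleles, per variant."""
    alt = defaultdict(int)
    called = defaultdict(int)
    for variant, genotype in zip(variants, genotypes):
        for allele in split_alleles(genotype):
            if allele == ".":
                continue  # missing calls do not count toward the total
            called[variant] += 1
            if allele != "0":
                alt[variant] += 1
    return {variant: alt[variant] / called[variant] for variant in called}


def missing_per_individual(individuals, genotypes):
    """Number of fully missing genotypes, per individual."""
    missing = defaultdict(int)
    for individual, genotype in zip(individuals, genotypes):
        missing[individual] += 0  # make sure every individual is listed
        if all(allele == "." for allele in split_alleles(genotype)):
            missing[individual] += 1
    return dict(missing)


frequencies = alt_allele_frequency(variant_col, genotype_col)
missing = missing_per_individual(individual_col, genotype_col)
```

Both functions are single scans over a few columns, the access pattern a column store is designed for.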
19. High-Performance In-Memory Genome Project
Working With Big Data
Loaded entire 1,000 genomes pre-phase 1 dataset
■ Queries on all chromosomes for all 629 individuals
■ 136 billion entries in largest database table
■ ≈1.2TB (compressed in HANA)
[Figure: query results using R connectivity: report all variations in BRCA1 and BRCA2. Axis labels: Chromosome, Absolute frequency, Number of Alleles.]
Supported by Dr. Carlos Bustamante lab
20. High-Performance In-Memory Genome Project
Analysis of Patient Cohorts
■ Columnar storage optimizes
space requirements while
enhancing calculation
performance
■ Single k-means clustering:
R 470ms vs. HANA 30ms (15:1)
■ >60k clusters are calculated in
<2s on a 1,000-core cluster
■ → Interactive exploration of
clusters becomes possible
Why is a therapy only working in 80% of the patient cases?
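For readers unfamiliar with k-means, the following is a minimal pure-Python sketch of the standard Lloyd iteration, not the R or HANA implementation benchmarked above. The two-dimensional patient features and the deterministic choice of starting centroids are assumptions made to keep the example reproducible.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    # Deterministic start for the sketch: the first k points are the centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # empty cluster stays put
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, assignment


# Hypothetical patient features, e.g. two normalized lab values per patient.
patients = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 10.0)]
centroids, assignment = kmeans(patients, k=2)
```

In practice the starting centroids are usually chosen at random or with a seeding heuristic, and the clustering is repeated for several values of k.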
21. High-Performance In-Memory Genome Project
Integration of Genetic Pathways
■ Storing and accessing graph data
within the in-memory database (Active
Information Store)
■ 263 KEGG pathways with
6,481 genetic components, 32,784
vertices, and 90,682 edges
■ Rank all pathways by evaluation of
node connections: IMDB <350ms
■ >5.5k rankings can be calculated in
<2s on a 1,000-core cluster
What are known effects for a somatic mutation?
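The slide does not spell out how node connections are evaluated. One simple reading, used here purely as an assumption, is to score each pathway by the number of edges that touch a mutated gene and to rank by that score. The pathway names, gene names, and edges below are made up and do not reflect KEGG content.

```python
def rank_pathways(pathways, affected_genes):
    """Rank pathways by the number of edges touching an affected gene."""
    affected = set(affected_genes)
    scores = {
        name: sum(1 for src, dst in edges if src in affected or dst in affected)
        for name, edges in pathways.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical pathways stored as edge lists (source gene, target gene).
pathways = {
    "pathway_a": [("GENE_1", "GENE_2"), ("GENE_2", "GENE_3"), ("GENE_1", "GENE_4")],
    "pathway_b": [("GENE_5", "GENE_6"), ("GENE_6", "GENE_7")],
    "pathway_c": [("GENE_1", "GENE_8")],
}
ranking = rank_pathways(pathways, ["GENE_1"])
```

Storing pathways as edge lists keeps the ranking a single scan per pathway, so rankings for different mutations can be computed independently.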
22. High-Performance In-Memory Genome Project
Search in Structured / Unstructured Data
■ In-memory technology enables entity extraction, e.g. age,
genes, and drugs
■ Integrated 30k free-text documents from clinicaltrials.gov
■ Relational search on entities enables interactive comparison
■ Results are rated by relevant search criteria
What clinical trials are relevant for an individual patient?
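The idea of extracting entities from free text and then searching them relationally can be sketched as follows. This is a simplified illustration with regular expressions and tiny hand-made vocabularies, not the text analysis used in the project. The vocabularies, the sample sentence, and the function names are assumptions.

```python
import re

# Hypothetical vocabularies; a real system would load curated dictionaries.
GENE_VOCAB = {"BRCA1", "BRCA2", "EGFR"}
DRUG_VOCAB = {"tamoxifen", "cisplatin"}
# Matches age ranges such as "18 to 65 years" or "18-65 years".
AGE_RANGE = re.compile(r"(\d{1,3})\s*(?:-|to)\s*(\d{1,3})\s*years")


def extract_entities(text):
    """Turn one free-text document into a structured row of entities."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return {
        "genes": sorted({t.upper() for t in tokens if t.upper() in GENE_VOCAB}),
        "drugs": sorted({t.lower() for t in tokens if t.lower() in DRUG_VOCAB}),
        "age_ranges": [(int(lo), int(hi)) for lo, hi in AGE_RANGE.findall(text)],
    }


def matches_patient(row, age, genes):
    """Relational-style filter: age inside a listed range and a shared gene."""
    age_ok = any(lo <= age <= hi for lo, hi in row["age_ranges"])
    gene_ok = bool(set(row["genes"]) & set(genes))
    return age_ok and gene_ok


# Hypothetical trial description, not taken from clinicaltrials.gov.
trial_text = "Women aged 18 to 65 years with a BRCA1 mutation receive Tamoxifen."
row = extract_entities(trial_text)
relevant = matches_patient(row, age=54, genes=["BRCA1"])
```

Once every document is reduced to such a row, comparing trials against a patient becomes an ordinary filter over structured columns.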
23. High-Performance In-Memory Genome Project
Architectural Overview
[Figure: architecture diagram. Labeled components: In-Memory Database; Cohort Analysis, Pathway Finder, Paper Search, Clinical Trial Finder, Pipeline Editor, ...; Extensions: App Store, Access Control, Billing, ...; Pipeline Data, Genome Data, Pathways, Genome Metadata, Papers, Pipeline Models, Analytical Tools, ...]
24. Agenda
■ Numbers You Should Know
■ Personalized Medicine
■ High-Performance In-Memory Genome (HIG) Project
■ Outlook
25. The Vision
Combined Data and Expert’s Knowledge
26. The Future
Combined Information
Enable clinicians to:
■ Make evidence-based therapy
decisions at the patient’s bed
■ Exchange latest patient data
with international experts
Enable researchers to:
■ Investigate genomes of
patient cohorts to derive new
knowledge
■ Analyze results in
real-time
Enable patients to:
■ Identify risk factors long
before they turn into diseases
■ Identify experts and similar
patient cases to bring up
alternatives for individual
therapies
27. Thank you for your interest!
Keep in contact with us.
Hasso Plattner Institute
Enterprise Platform & Integration Concepts
Dr. Matthieu-P. Schapranow
August-Bebel-Str. 88
14482 Potsdam, Germany
schapranow@hpi.uni-potsdam.de
http://j.mp/schapranow