tranSMART Community Meeting 5-7 Nov 13 - Session 3: tranSMART a Data Warehouse for Translational Medicine at Takeda Pharmaceuticals
International
Dave Merberg, Takeda
We have used the tranSMART platform to construct a warehouse containing data from several
Takeda clinical trials, proprietary preclinical drug activity studies, 1600 Gene Expression
Omnibus studies, and data from TCGA, CCLE, and other sources. All gene expression data has
been globally normalized. We extended the tranSMART platform with a set of R function calls
to enable cross-study queries and analysis via the rich toolset available in R. The utility of the
data warehouse is exemplified by a study in which we built a predictive model for drug
sensitivities. The model was trained on gene expression and IC50 data from cell lines and was
found to correctly predict drug activity in oncology indications.
1. tranSMART: a data warehouse for Translational Medicine
at Takeda Pharmaceuticals International Co.
tranSMART Community Workshop
November 2013
David Merberg
Bin Li
William Trepicchio
2. Outline
• Takeda’s tranSMART instance
– Goal
– Data content
– Enhancements
• Case Studies – Models for predicting erlotinib and sorafenib efficacy
3. Takeda rationale for implementing tranSMART
• To provide a large, well-organized, and integrated dataset consisting
of MPI/Takeda proprietary data, outsourced data, and valuable public
data.
• To provide an integrated environment for accessing clinical data and
molecular profiling data
– Low dimensional data – age, sex, weight, previous treatments, survival,
etc.
– High dimensional data – gene expression microarray, SNP, mutation,
NGS
• To provide tools that will enable Medical and Discovery scientists to
use this data warehouse for biomarker identification, patient
stratification, drug targeting, disease prediction, etc.
4. Public data currently in Takeda tranSMART
• Gene Expression Omnibus (GEO)
– Approximately 1600 studies
– Approximately 200 key cancer studies manually curated; another ~150
cancer studies curated via text mining
– Most GEO datasets are cancer studies, but there are also samples from
cardiovascular disease, metabolic diseases, hematopoietic diseases,
and many others.
• The Cancer Genome Atlas (TCGA)
– Gene expression, SNP, and clinical data from close to 1000 patients
(brain, lung, and ovarian cancer)
• Large cell line panels
– The CCLE dataset, ~ 1000 cell lines, screened for 24 SOC drugs
– The Sanger dataset, ~ 1000 cell lines, screened on > 100 SOC drugs
5. Proprietary data currently in Takeda tranSMART
• Velcade Trials
– Clinical observations
– Gene expression results
– Mutation data
• Commissioned Studies
– Oncopanel 240 – cell line response to Takeda and SOC compounds
• Drug response (IC50, EC50, cell cycle blocks, apoptosis induction, etc.)
• Mutation status
• Gene expression
– Oncotest – xenograft response to Takeda and SOC compounds
• Drug response (IC50)
• Mutation status
• Gene expression
• SNP
6. OncoPanel 240 (Ricerca/Eurofins Panlabs)
• 240 well-defined tumor cell lines representing diverse tumor types
• Drug sensitivity screen results (IC50, EC50)
– for 13 Standard of Care anti-tumor compounds
– for 8 Takeda compounds targeting diverse pathways
• Baseline gene expression
• Mutation data
7. Normalization of information in the data warehouse
• Gene expression data
– Globally normalized GEO gene expression data using frozen Robust
Multiarray Analysis (fRMA)
• Quantile based normalization
• Currently, only selected Affymetrix platforms are globally normalized
– Enabled grouping gene expression results from different labs and
different studies by disease
• Clinical information
– Curate clinical information to create consistent vocabulary
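The quantile-based step mentioned above can be illustrated with a minimal base R sketch. This is plain quantile normalization on toy data, not fRMA itself: fRMA maps each array onto a frozen, precomputed reference distribution, whereas the sketch derives the reference from the samples at hand. The function name `quantile_normalize` and the toy matrix are hypothetical.

```r
# Plain quantile normalization of a genes x samples matrix (base R only).
# NOT fRMA: fRMA uses a frozen reference distribution estimated in advance.
quantile_normalize <- function(x) {
  stopifnot(is.matrix(x), !anyNA(x))
  # Rank of each value within its own sample (column)
  ranks <- apply(x, 2, rank, ties.method = "first")
  # Reference distribution: mean of the sorted values across samples
  reference <- unname(rowMeans(apply(x, 2, sort)))
  # Replace every value with the reference value at its rank
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

set.seed(1)
# Hypothetical toy data: 6 genes, 3 samples on different scales
expr <- cbind(s1 = rnorm(6, 8, 1), s2 = rnorm(6, 9, 2), s3 = rnorm(6, 7, 0.5))
rownames(expr) <- paste0("gene", 1:6)
norm_expr <- quantile_normalize(expr)
print(round(norm_expr, 2))
```

After this step every sample shares the same distribution of values while the within-sample gene ordering is preserved, which is what makes pooling samples from different labs and studies reasonable.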
8. R interface
• Enable direct access to tranSMART database tables
– Eliminates some limitations of the web interface, e.g., the inability to perform
multi-study queries and analyses.
– Provides a connection to the R environment, including its diverse analysis
packages
• Sample functions
– getDistinctConcepts – given a keyword/string, returns study codes for
matching clinical concepts in the tranSMART database
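– getGEXData – given study codes, gets gene expression data from the
tranSMART database.
> # Find clinical concepts matching a keyword, then collect their study codes
> br_concepts <- transmart.getDistinctConcepts(,'Breast_Cancer')
> study_list <- unique(br_concepts$STUDYCODE)
> # Fetch ITGB2 expression across all matching studies
> ITGB2_GEP_BR2 <- transmart.getGEXData(study_list,
+     gene.list='ITGB2', data.pivot=FALSE)
> # Plot the pooled distribution of log intensities
> hist(ITGB2_GEP_BR2$LOG_INTENSITY, breaks=50, xlim=c(5,12),
+     main="All ITGB2 GEP", xlab="GEP")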
9. Summary
• A data warehouse with a large store of gene expression, SNP, and
phenotypic data
– Clinical samples and cell lines
– Data normalized so that comparisons across studies are meaningful
– Vocabulary standardized across studies
• An R interface to facilitate cross-study analysis using a large
collection of methods from statistics and machine learning
• A “toolbox” for achieving key Translational Medicine goals
– Bridging the gap between “omic” data generated in preclinical studies
and clinical results
– Predicting drug efficacy using clinical and pre-clinical information
collected for different purposes
• Case studies in using this toolbox follow . . .
10. Building and using a model to predict drug sensitivity
Can we identify a relationship between baseline gene expression and drug sensitivity in
cell lines . . . and then extrapolate from that relationship to use gene expression to predict
drug efficacy in the clinic?
[Figure: MLN7243 IC50 distribution on Ricerca panel; x-axis: cell lines, y-axis: IC50s]
11. Building the predictive models
[Figure: Oncopanel 240 drug sensitivity (MLN7243 IC50 distribution on Ricerca panel; x-axis: cell lines, y-axis: IC50s) paired with Oncopanel 240 expression data]
• Normalize all Oncopanel 240 expression data
• Remove low-intensity and low-variance genes (to get robust signal)
• Correlation based feature selection (gene expression vs IC50s)
• Develop a methodology for deriving drug sensitivity models
– Based on Partial Least Squares Regression (PLSR)
– Captures consensus information from cancer cell line panel data
• Use two SOC drugs as proof of concept for methodology
– Predict erlotinib (inhibits EGFR) sensitivity
– Predict sorafenib (inhibits VEGFR and PDGFR) sensitivity
– Use PFS from BATTLE trial to evaluate performance of models
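The modeling steps listed on this slide can be sketched roughly as follows. Everything here is an assumption made for illustration: the data are simulated rather than Oncopanel 240 measurements, the intensity and variance thresholds and the number of selected genes are arbitrary, and the PLSR step uses the CRAN `pls` package (only if installed), which may differ from the authors' actual implementation.

```r
set.seed(42)
n_lines <- 60
n_genes <- 200
# Hypothetical simulated data standing in for cell line expression and IC50s
expr <- matrix(rnorm(n_lines * n_genes, mean = 8, sd = 1), nrow = n_lines,
               dimnames = list(paste0("line", 1:n_lines), paste0("g", 1:n_genes)))
log2_ic50 <- as.numeric(expr[, 1:5] %*% rep(0.8, 5) + rnorm(n_lines, sd = 0.5))

# 1. Remove low-intensity and low-variance genes (thresholds are assumptions)
keep <- colMeans(expr) > 6 & apply(expr, 2, var) > 0.5
expr_f <- expr[, keep, drop = FALSE]

# 2. Correlation-based feature selection: rank genes by |cor| with log2(IC50)
r <- cor(expr_f, log2_ic50)[, 1]
selected <- names(sort(abs(r), decreasing = TRUE))[1:20]

# 3. PLSR on the selected genes, only if the 'pls' package is available
if (requireNamespace("pls", quietly = TRUE)) {
  dat <- data.frame(y = log2_ic50, X = I(expr_f[, selected]))
  fit <- pls::plsr(y ~ X, ncomp = 3, data = dat, validation = "CV")
  pred <- drop(predict(fit, ncomp = 3, newdata = dat))
  cat("Correlation of re-predicted vs. observed log2(IC50):",
      round(cor(pred, log2_ic50), 2), "\n")
}
```

Re-predicting the training data, as in the last step, gives an optimistic estimate; the cross-validated error stored in the fit is the more conservative measure.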
12. Accuracy of the erlotinib sensitivity model
Re-predicting Oncopanel 240 log2(IC50)
Accuracy estimation:
Upper boundary: 91%
Lower boundary: 77%
13. Signature genes in the Erlotinib model reflect known
drug mechanism
Signature genes are over-represented in pathways
that contain an EGFR node
Signature genes over-connected to EGFR
EGFR
• Also, EGFR ligand NRG1 is among the signature genes
14. Real data tests of the models
• Test 1: The BATTLE clinical trial
– 255 lung cancer (NSCLC) patients, 131 with gene expression profile
data (GSE33072)
• 25 patients in erlotinib arm
• 39 patients in sorafenib arm
– Are the predictions of the PLSR models consistent with the results of the
BATTLE trial?
• Test 2: Predicting drug sensitivity across indications
– Use model to predict erlotinib and sorafenib sensitivity based on gene
expression data from 484 Gene Expression Omnibus datasets in Takeda
tranSMART instance
• 11,331 samples grouped into 19 major oncology indications
• Calculate the percentage of tumors predicted to be drug sensitive for each indication
• Compare predictions to results of phase III clinical trials and FDA approvals
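The per-indication summary in Test 2 amounts to a grouped proportion. A minimal sketch, assuming each sample has a predicted log2(IC50) and an indication label; the data frame, its column names, and the sensitivity cutoff are all hypothetical, since the slides do not state how a tumor was called sensitive.

```r
# Hypothetical inputs: one predicted log2(IC50) and one indication per sample
predictions <- data.frame(
  indication = c("lung", "lung", "lung", "lung",
                 "kidney", "kidney", "liver", "liver"),
  pred_log2_ic50 = c(-1.2, 0.4, -0.8, 1.5, 2.1, 1.7, 0.9, -0.3)
)
# Assumed cutoff: samples below this predicted value are called sensitive
sensitivity_cutoff <- 0

predictions$sensitive <- predictions$pred_log2_ic50 < sensitivity_cutoff
# Percentage of predicted-sensitive samples within each indication
pct_sensitive <- 100 * tapply(predictions$sensitive, predictions$indication, mean)
n_samples <- table(predictions$indication)
summary_table <- data.frame(
  indication = names(pct_sensitive),
  n_samples = as.integer(n_samples[names(pct_sensitive)]),
  pct_predicted_sensitive = as.numeric(pct_sensitive)
)
print(summary_table)
```

The resulting table has the same shape as the deck's closing summary of samples and predicted-sensitive percentages per cancer type.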
15. Test 1 – The BATTLE Trial: Survival analysis of groups
predicted to be drug sensitive/resistant by PLSR model
[Figure: four survival-curve panels (A)-(D); y-axis: proportion of cases, x-axis: months from start of therapy. Panel titles: E_model pred E_PFS; S_model pred S_PFS; S_model pred E_PFS; E_model pred S_PFS. Statistics shown on the panels: P = 0.09, HR = 0.43; P = 0.006, HR = 0.32; P = 0.32, HR = 1.87; P = 0.54, HR = 1.32.]
E: Erlotinib; S: Sorafenib; red: predicted sensitive; green: predicted resistant
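A comparison of progression-free survival between predicted-sensitive and predicted-resistant groups could be run along these lines. This is a sketch under assumptions: the slides do not say which tests produced the reported P and HR values, so a log-rank test and a Cox model from the `survival` package are used as common choices, and the cohort is simulated, not the BATTLE data.

```r
if (requireNamespace("survival", quietly = TRUE)) {
  set.seed(7)
  n <- 40
  # Hypothetical simulated cohort; group labels mimic model predictions
  group <- factor(rep(c("resistant", "sensitive"), each = n / 2),
                  levels = c("resistant", "sensitive"))
  pfs_months <- rexp(n, rate = ifelse(group == "sensitive", 0.2, 0.5))
  progressed <- rbinom(n, 1, 0.9)  # 1 = progression observed, 0 = censored

  surv_obj <- survival::Surv(pfs_months, progressed)
  # Log-rank test for a PFS difference between the predicted groups
  logrank <- survival::survdiff(surv_obj ~ group)
  p_value <- 1 - pchisq(logrank$chisq, df = 1)
  # Cox model: hazard ratio of predicted-sensitive vs. predicted-resistant
  cox <- survival::coxph(surv_obj ~ group)
  hazard_ratio <- unname(exp(coef(cox)))
  cat(sprintf("log-rank P = %.3f, HR = %.2f\n", p_value, hazard_ratio))
}
```

With "resistant" as the reference level, a hazard ratio below 1 indicates slower progression in the predicted-sensitive group.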
16. Test 2: Are predictions of erlotinib sensitivity, grouped
by indication, consistent with clinical results?
• Kidney cancer is predicted to be erlotinib insensitive; a phase III clinical trial failed
• Lung cancer is predicted to be erlotinib sensitive; a phase III clinical trial succeeded
(companion diagnostic available)
• Potential new indication? Multiple head and neck cancer trials are under way
17. Test 2: Are predictions of sorafenib sensitivity,
grouped by indication, consistent with clinical results?
• Potential new indication?
• Kidney and liver cancers are predicted to be sorafenib sensitive; sorafenib has been
approved for kidney and liver cancers
18. Conclusions
• Using tranSMART, we created a large data warehouse to provide
computational support for biomarker identification, patient
stratification, and other Translational Medicine goals.
• Patient and cell line data can be grouped across studies by
indication or other attributes to increase statistical power. Grouping is
enabled by:
– Global normalization of numeric data
– Standardization of vocabulary
– An R interface that provides direct access to database tables
• Using erlotinib and sorafenib as case studies, we demonstrated that
the data warehouse and the R interface enable us to predict patient
stratification and drug efficacy in cancer indications.
19. Acknowledgements
Takeda
Andy Dorner
Gene Shin
Andrew Krueger
Seema Grover
Jike Cui (now at Sanofi)
Thomson Reuters
Elona Kolpakova-Hart
Recombinant by Deloitte
Jinlei Liu
Mike McDuffie
Hiaping Xia
21. Model test 2: How well do the models
predict the drug-indication efficacy profile?
Cancer Type     Successful Phase III trial /   Number of   % tumors predicted    % tumors predicted
                FDA approval                   samples     Erlotinib sensitive   Sorafenib sensitive
Lung Cancer     Erlotinib                      329         15.81                 0.61
Liver Cancer    Sorafenib                      85          0.00                  31.76
Kidney Cancer   Sorafenib                      218         0.46 *                24.77
* Erlotinib failed to show efficacy for kidney cancer in a phase III trial