Methods for High Dimensional Interactions
Sahir Rai Bhatnagar, PhD Candidate – McGill Biostatistics
Joint work with Yi Yang, Mathieu Blanchette and Celia Greenwood
Ludmer Center – May 19, 2016
Underlying objective of this talk
Motivation
one predictor variable at a time
[Diagram: each predictor variable is tested against the phenotype separately (Test 1 to Test 5)]
a network based view
[Diagram: the predictor variables are viewed as a network and related to the phenotype with a single test (Test 1)]
system level changes due to environment
[Diagram: predictor variables, phenotype and environment; the predictor variables are shown under environments A and B, with a single test (Test 1)]
Motivating Dataset: Newborn epigenetic adaptations to gestational diabetes exposure (Luigi Bouchard, Sherbrooke)
• Environment: gestational diabetes
• Large Data: child's epigenome (p ≈ 450k)
• Phenotype: obesity measures
Differential Correlation between environments
(a) Gestational diabetes affected pregnancy (b) Controls
Gene Expression: COPD patients
(a) Gene Exp.: Never Smokers (b) Gene Exp.: Current Smokers
(c) Correlations: Never Smokers (d) Correlations: Current Smokers
Imaging Data: Topological properties and Age
Correlations differ between Age groups
NIH MRI brain study
• Environment: age
• Large Data: cortical thickness (p ≈ 80k)
• Phenotype: intelligence
Differential Networking
formal statement of initial problem
• n: number of subjects
• p: number of predictor variables
• $X_{n \times p}$: high dimensional data set (p >> n)
• $Y_{n \times 1}$: phenotype
• $E_{n \times 1}$: environmental factor that has a widespread effect on X and can modify the relation between X and Y
Objective
• For which elements of X does the association with Y depend on E?
conceptual model
• Environment (maternal care, age, diet): E = 0 or E = 1
• Large Data (p >> n), shown for each environment: gene expression, DNA methylation, brain imaging
• Phenotype (behavioral development, IQ scores, death)
• Link labels shown in the diagram: epidemiological study; (epi)genetic/imaging associations
Is this mediation analysis?
• No
• We are not making any causal claims, i.e. about the direction of the arrows
• There are many untestable assumptions required for such analysis → not well understood for high-dimensional (HD) data
Methods
analysis strategies
Single-Marker or Single Variable Tests
• marginal correlations (univariate p-value)
• multiple testing adjustment
Multivariate Regression Approaches Including Penalization Methods
• LASSO (convex penalty with one tuning parameter)
• MCP, SCAD (non-convex penalties with two tuning parameters)
• Dantzig selector (convex, one tuning parameter)
• Group level penalization (group LASSO, SCAD and MCP)
Clustering Together with Regression
• cluster features based on Euclidean distance, correlation, connectivity
• regression with group level summary (PCA, average)
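ECLUST - our proposed method: 3 phases
Original Data, split by environment (E = 0 and E = 1)
1) Gene Similarity, computed within E = 0 and within E = 1
2) Cluster Representation: each cluster is summarized by an n × 1 vector
3) Penalized Regression: $Y_{n \times 1} \sim$ cluster representations + cluster representations × E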
the objective of statistical methods is the reduction of data. A quantity of data . . . is to be replaced by relatively few quantities which shall adequately represent . . . the relevant information contained in the original data.
- Sir R. A. Fisher, 1922
Underlying model
$Y = \beta_0 + \beta_1 U + \beta_2 U \cdot E + \varepsilon$ (1)
$X \sim F(\alpha_0 + \alpha_1 U, \Sigma_E)$ (2)
• U: unobserved latent variable
• X: observed data which is a function of U
• $\Sigma_E$: environment sensitive correlation matrix
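A minimal simulation sketch of models (1) and (2), to make the role of the latent variable U and the environment-dependent correlation concrete. Everything beyond the two equations is an assumption made for illustration: F is taken to be Gaussian, the environment dependence of the correlation is produced by a shared factor that is active only when E = 1, and the function names and parameter values (`simulate`, `mean_pairwise_corr`, the coefficients) are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=1000, p=5, beta=(0.0, 1.0, 2.0), alpha=(0.0, 0.3), seed=1):
    """Draw (X, Y, E) from models (1) and (2), assuming Gaussian noise."""
    rng = random.Random(seed)
    b0, b1, b2 = beta
    a0, a1 = alpha
    X, Y, E = [], [], []
    for i in range(n):
        e = i % 2                     # alternate unexposed / exposed subjects
        u = rng.gauss(0, 1)           # latent variable U
        shared = rng.gauss(0, 1) * e  # assumed common factor, active only if E = 1
        # Each column has mean a0 + a1 * U; the shared factor makes the
        # correlation between columns depend on E.
        X.append([a0 + a1 * u + shared + rng.gauss(0, 1) for _ in range(p)])
        Y.append(b0 + b1 * u + b2 * u * e + rng.gauss(0, 1))  # model (1)
        E.append(e)
    return X, Y, E


def mean_pairwise_corr(rows):
    """Average correlation over all pairs of columns."""
    p = len(rows[0])
    cols = [[r[j] for r in rows] for j in range(p)]
    cors = [pearson(cols[j], cols[k]) for j in range(p) for k in range(j + 1, p)]
    return sum(cors) / len(cors)


X, Y, E = simulate()
corr0 = mean_pairwise_corr([x for x, e in zip(X, E) if e == 0])
corr1 = mean_pairwise_corr([x for x, e in zip(X, E) if e == 1])
print(f"mean pairwise correlation: E=0 {corr0:.2f}, E=1 {corr1:.2f}")
```

Under these assumptions the columns of X are more strongly correlated among exposed subjects, which is the kind of environment-sensitive structure the method is meant to exploit.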
advantages and disadvantages
General Approach | Advantages | Disadvantages
Single-Marker | simple, easy to implement | multiple testing burden, power, interpretability
Penalization | multivariate, variable selection, sparsity, efficient optimization algorithms | poor sensitivity with correlated data, ignores structure in design matrix, interpretability
Environment Cluster with Regression | multivariate, flexible implementation, group structure, takes advantage of correlation, interpretability | difficult to identify relevant clusters, clustering is unsupervised
Methods to detect gene clusters
Table 1: Methods to detect gene clusters
General Approach | Formula
Correlation | pearson, spearman, biweight midcorrelation
Correlation Scoring | $\lvert \rho_{E=1} - \rho_{E=0} \rvert$
Weighted Correlation Scoring | $c \, \lvert \rho_{E=1} - \rho_{E=0} \rvert$
Fisher’s Z Transformation | $\dfrac{\lvert z_{ij0} - z_{ij1} \rvert}{\sqrt{1/(n_0 - 3) + 1/(n_1 - 3)}}$
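The two scoring rules in Table 1 can be sketched for a single pair of features as follows. This is an illustrative sketch, not the eclust implementation: the function names are hypothetical, and it assumes that $z_{ij0}$ and $z_{ij1}$ denote the Fisher z-transform (inverse hyperbolic tangent) of the correlation between features i and j in the E = 0 and E = 1 groups, with group sizes $n_0$ and $n_1$.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_score(r0, r1):
    """Correlation scoring: |rho_{E=1} - rho_{E=0}|."""
    return abs(r1 - r0)


def fisher_z_score(r0, n0, r1, n1):
    """Fisher's Z scoring: difference of z-transformed correlations,
    scaled by its approximate standard error."""
    z0, z1 = math.atanh(r0), math.atanh(r1)  # Fisher z-transform
    return abs(z0 - z1) / math.sqrt(1 / (n0 - 3) + 1 / (n1 - 3))


# A pair of features correlated at 0.1 in controls and 0.8 in exposed subjects.
print(correlation_score(0.1, 0.8))
print(fisher_z_score(0.1, 50, 0.8, 50))
```

Applying either function to every pair (i, j) gives a p × p matrix of differential-correlation scores that can be passed to a clustering routine.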
Cluster Representation
Table 2: Methods to create cluster representations
General Approach | Type
Unsupervised | average; K principal components
Supervised | partial least squares
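The two unsupervised representations in Table 2 can be sketched as below: each reduces the columns of one cluster to a single n × 1 vector. This is a sketch under assumptions: the function names are hypothetical, only the first principal component is computed (K = 1), and it is obtained by plain power iteration rather than a numerical library.

```python
import math


def cluster_average(X, cluster):
    """Mean over the columns in `cluster`: one value per subject."""
    return [sum(row[j] for j in cluster) / len(cluster) for row in X]


def first_principal_component(X, cluster, iters=200):
    """Scores on the first principal component of the cluster's columns,
    computed by power iteration on the centered data."""
    n, k = len(X), len(cluster)
    means = [sum(row[j] for row in X) / n for j in cluster]
    Z = [[row[j] - m for j, m in zip(cluster, means)] for row in X]  # centered n x k
    v = [1.0] + [0.0] * (k - 1)  # start from the first coordinate axis
    for _ in range(iters):
        scores = [sum(z * w for z, w in zip(row, v)) for row in Z]          # Z v
        v = [sum(Z[i][c] * scores[i] for i in range(n)) for c in range(k)]  # Z^T Z v
        norm = math.sqrt(sum(w * w for w in v))
        if norm == 0.0:
            raise ValueError("no variance along the starting direction")
        v = [w / norm for w in v]
    return [sum(z * w for z, w in zip(row, v)) for row in Z]


# Four subjects, three features; features 0 and 1 form one cluster.
X = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [3.0, 6.0, 0.0], [4.0, 8.0, 0.0]]
avg = cluster_average(X, [0, 1])
pc1 = first_principal_component(X, [0, 1])
print(avg)
print(pc1)
```

The average weights every feature equally, while the principal component weights features by how much of the cluster's shared variation they carry; the sign of a principal component is arbitrary.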
Simulation Studies
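Simulation Study 1
[Figure panels: (a) $\mathrm{Corr}(X_{E=0})$ (b) $\mathrm{Corr}(X_{E=1})$ (c) $\lvert \mathrm{Corr}(X_{E=1}) - \mathrm{Corr}(X_{E=0}) \rvert$ (d) $\mathrm{Corr}(X_{all})$]
Results: Jaccard Index and test set MSE
Simulation Study 2
[Figure: TOM (topological overlap matrix) based on all subjects, $\mathrm{TOM}(X_{all})$]
[Figure: TOM based on unexposed subjects, $\mathrm{TOM}(X_{E=0})$]
[Figure: TOM based on exposed subjects, $\mathrm{TOM}(X_{E=1})$]
[Figure: Difference of TOMs, $\lvert \mathrm{TOM}(X_{E=1}) - \mathrm{TOM}(X_{E=0}) \rvert$]
Results: Test set MSE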
Strong Heredity Models
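Model
$g(\mu) = \underbrace{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \beta_E E}_{\text{main effects}} + \underbrace{\alpha_{1E}(X_1 E) + \cdots + \alpha_{pE}(X_p E)}_{\text{interactions}}$
• $g(\cdot)$ is a known link function
• $\mu = \mathbb{E}[Y \mid X, E, \beta, \alpha]$
• $\beta = (\beta_1, \beta_2, \ldots, \beta_p, \beta_E) \in \mathbb{R}^{p+1}$
• $\alpha = (\alpha_{1E}, \ldots, \alpha_{pE}) \in \mathbb{R}^{p}$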
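Variable Selection
$\arg\min_{\beta_0, \beta, \alpha} \; \frac{1}{2} \lVert Y - g(\mu) \rVert^2 + \lambda \left( \lVert \beta \rVert_1 + \lVert \alpha \rVert_1 \right)$
• $\lVert Y - g(\mu) \rVert^2 = \sum_i (y_i - g(\mu_i))^2$
• $\lVert \beta \rVert_1 = \sum_j \lvert \beta_j \rvert$
• $\lVert \alpha \rVert_1 = \sum_j \lvert \alpha_{jE} \rvert$
• $\lambda \ge 0$: tuning parameter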
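A small sketch of how this lasso criterion could be minimized for a continuous outcome. It assumes the identity link, so the fitted value is the linear predictor itself, leaves the intercept unpenalized, and uses plain coordinate descent with soft-thresholding. The function names and the toy data are hypothetical; in practice an established solver would be used.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator S(z, lam)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def design_with_interactions(X, E):
    """Columns: X_1..X_p, E, X_1*E..X_p*E (main effects, then interactions)."""
    return [list(row) + [e] + [x * e for x in row] for row, e in zip(X, E)]


def lasso_objective(D, y, b0, b, lam):
    """0.5 * residual sum of squares + lam * L1 norm of the coefficients."""
    rss = sum((yi - b0 - sum(d * c for d, c in zip(row, b))) ** 2
              for row, yi in zip(D, y))
    return 0.5 * rss + lam * sum(abs(c) for c in b)


def lasso_cd(D, y, lam, sweeps=500):
    """Coordinate descent for the lasso with an unpenalized intercept."""
    n, m = len(D), len(D[0])
    b0, b = sum(y) / n, [0.0] * m
    resid = [yi - b0 for yi in y]
    for _ in range(sweeps):
        shift = sum(resid) / n                 # update the intercept
        b0 += shift
        resid = [r - shift for r in resid]
        for j in range(m):
            col = [row[j] for row in D]
            ss = sum(c * c for c in col)
            if ss == 0.0:
                continue
            # Inner product of column j with the residual that excludes b[j].
            rho = sum(c * (r + c * b[j]) for c, r in zip(col, resid))
            new = soft_threshold(rho, lam) / ss
            resid = [r - c * (new - b[j]) for c, r in zip(col, resid)]
            b[j] = new
    return b0, b


# Toy data: one predictor, binary E, outcome y = 1 + x + 2 * x * E.
X = [[1.0], [2.0], [3.0], [4.0], [1.0], [2.0], [3.0], [4.0]]
E = [0, 0, 0, 0, 1, 1, 1, 1]
y = [1 + x[0] + 2 * x[0] * e for x, e in zip(X, E)]
D = design_with_interactions(X, E)
b0, b = lasso_cd(D, y, lam=0.1)
print(b0, b)
```

Note that this penalty treats main effects and interactions symmetrically, so nothing prevents an interaction from entering the model without its main effects.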
Why Strong Heredity?
• Statistical Power: large main effects are more likely to lead to detectable interactions than small ones
• Interpretability: Assuming a model with interaction only is generally not biologically plausible
• Practical Sparsity: X1, E, X1 · E vs. X1, E, X2 · E (the first model involves two measured variables, the second three)
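Model
$g(\mu) = \underbrace{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \beta_E E}_{\text{main effects}} + \underbrace{\alpha_{1E}(X_1 E) + \cdots + \alpha_{pE}(X_p E)}_{\text{interactions}}$
Reparametrization¹: $\alpha_{jE} = \gamma_{jE} \beta_j \beta_E$.
Strong heredity principle²: $\hat{\alpha}_{jE} \neq 0 \Rightarrow \hat{\beta}_j \neq 0$ and $\hat{\beta}_E \neq 0$
¹ Choi et al. 2010, JASA
² Chipman 1996, Canadian Journal of Statistics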
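A short sketch of why the reparametrization enforces strong heredity: because each interaction coefficient is the product of gamma_jE, beta_j and beta_E, it can only be non-zero when both main effects are non-zero. The function names and the numbers are hypothetical and chosen only for illustration.

```python
def interaction_coefficients(beta, beta_E, gamma):
    """alpha_jE = gamma_jE * beta_j * beta_E for every predictor j."""
    return [g * b * beta_E for g, b in zip(gamma, beta)]


def satisfies_strong_heredity(beta, beta_E, alpha):
    """True if every non-zero interaction has both main effects non-zero."""
    return all(a == 0 or (b != 0 and beta_E != 0) for a, b in zip(alpha, beta))


beta = [1.5, 0.0, -2.0]   # main effects of X_1, X_2, X_3
beta_E = 0.5              # main effect of E
gamma = [2.0, 3.0, 0.0]   # interaction parameters gamma_jE

alpha = interaction_coefficients(beta, beta_E, gamma)
print(alpha)
print(satisfies_strong_heredity(beta, beta_E, alpha))
```

Here X_2 has no main effect, so its interaction is zero whatever the value of gamma_2E, and X_3 keeps its main effect while its interaction is switched off by gamma_3E = 0.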
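Strong Heredity Model with Penalization
$\arg\min_{\beta_0, \beta, \gamma} \; \frac{1}{2} \lVert Y - g(\mu) \rVert^2 + \lambda_\beta \left( w_1 \lvert \beta_1 \rvert + \cdots + w_q \lvert \beta_q \rvert + w_E \lvert \beta_E \rvert \right) + \lambda_\gamma \left( w_{1E} \lvert \gamma_{1E} \rvert + \cdots + w_{qE} \lvert \gamma_{qE} \rvert \right)$
$w_j = \left\lvert \dfrac{1}{\hat{\beta}_j} \right\rvert, \quad w_{jE} = \left\lvert \dfrac{\hat{\beta}_j \hat{\beta}_E}{\hat{\alpha}_{jE}} \right\rvert$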
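The adaptive weights can be computed from a set of initial coefficient estimates as sketched below. The slide does not say which initial fit supplies the estimates, so that step is left out; the weight for the environment main effect is taken by analogy with the other main effects, which is an assumption, and the sketch also assumes that none of the initial estimates is exactly zero. The function name is hypothetical.

```python
def adaptive_weights(beta_hat, beta_E_hat, alpha_hat):
    """Adaptive penalty weights from initial (non-zero) estimates.

    w_j  = |1 / beta_hat_j|
    w_E  = |1 / beta_E_hat|   (assumed by analogy with w_j)
    w_jE = |beta_hat_j * beta_E_hat / alpha_hat_jE|
    """
    w_main = [abs(1.0 / b) for b in beta_hat]
    w_E = abs(1.0 / beta_E_hat)
    w_inter = [abs(b * beta_E_hat / a) for b, a in zip(beta_hat, alpha_hat)]
    return w_main, w_E, w_inter


w_main, w_E, w_inter = adaptive_weights([2.0, -4.0], 0.5, [0.25, 1.0])
print(w_main, w_E, w_inter)
```

Large initial estimates receive small weights and are therefore penalized less, which is the usual motivation for adaptive weighting.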
Open source software
• Software implementation in R: http://sahirbhatnagar.com/eclust/
• Allows user specified interaction terms
• Automatically determines the optimal tuning parameters through cross validation
• Can also be applied to genetic data (SNPs)
Feature Screening and Non-linear associations
The most popular way of feature screening
How to fit statistical models when you have over 100,000 features?
Marginal correlations, t-tests
• for each feature, calculate the correlation between X and Y
• keep all features with correlation greater than some threshold
• However, this procedure assumes a linear relationship between X and Y
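A minimal sketch of marginal correlation screening, together with a toy example of the linearity caveat. The function names, the threshold and the data are hypothetical; the second feature determines the outcome exactly through a quadratic relationship, yet its Pearson correlation with the outcome is zero, so it is screened out.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def marginal_screen(X, y, threshold):
    """Indices of features whose absolute correlation with y exceeds threshold."""
    p = len(X[0])
    keep = []
    for j in range(p):
        col = [row[j] for row in X]
        if abs(pearson(col, y)) > threshold:
            keep.append(j)
    return keep


b = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [v * v for v in b]            # y depends on b only through b squared
c = [4.1, 0.9, 0.2, 1.1, 3.8]     # a feature that tracks y almost linearly
X = [[ci, bi] for ci, bi in zip(c, b)]

kept = marginal_screen(X, y, threshold=0.5)
print(kept)
```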
Non-linear feature screening: Kolmogorov-Smirnov Test
Mai & Zou (2012) proposed using the Kolmogorov-Smirnov (KS) test statistic
$\hat{K}_j = \sup_x \lvert \hat{F}_j(x \mid Y = 1) - \hat{F}_j(x \mid Y = 0) \rvert$ (3)
Figure 8: Depiction of KS statistic
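The KS screening statistic for one feature can be sketched as follows, assuming a binary outcome coded 0/1. The function names are hypothetical. Because empirical distribution functions only change at observed values, the supremum is evaluated over the observed values of the feature.

```python
def ecdf(sample, x):
    """Empirical distribution function of `sample` evaluated at x."""
    return sum(v <= x for v in sample) / len(sample)


def ks_statistic(x, y):
    """Largest gap between the empirical CDFs of a feature in the
    Y = 1 and Y = 0 groups. x: feature values, y: 0/1 labels."""
    x1 = [v for v, lab in zip(x, y) if lab == 1]
    x0 = [v for v, lab in zip(x, y) if lab == 0]
    return max(abs(ecdf(x1, t) - ecdf(x0, t)) for t in set(x))


# Y = 1 for extreme values of the feature: a relationship that is not linear.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [1, 1, 0, 0, 0, 1, 1]
print(ks_statistic(x, y))
```

Features would then be ranked by this statistic and the top-ranked ones retained, in place of ranking by marginal correlation.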
Non-linear Interaction Models
After feature screening, we can fit non-linear relationships between X and Y
$Y_i = \beta_0 + f(X_{ij}) + f(X_{ij}, E_i) + \varepsilon_i$ (4)
Conclusions
Conclusions and Contributions
• Large system-wide changes are observed in many environments
• This assumption can possibly be exploited to aid analysis of large data
• We develop and implement a multivariate penalization procedure for predicting a continuous or binary disease outcome while detecting interactions between high dimensional data (p >> n) and an environmental factor.
• Dimension reduction is achieved through leveraging the environmental-class-conditional correlations
• Also, we develop and implement a strong heredity framework within the penalized model
• R software: http://sahirbhatnagar.com/eclust/
Limitations
• There must be a high-dimensional signature of the exposure
• Clustering is unsupervised
• Two tuning parameters
What type of data is required to use these methods?
ECLUST method
1. environmental exposure (currently only binary)
2. a high dimensional dataset that can be affected by the exposure
3. a single phenotype (continuous or binary)
4. there must be a high-dimensional signature of the exposure
Strong Heredity and Non-linear Models
1. a single phenotype (continuous or binary)
2. environment variable (continuous or binary)
3. any number of predictor variables
Check out our Lab’s Software!
http://greenwoodlab.github.io/software/
acknowledgements
• Dr. Celia Greenwood
• Dr. Blanchette and Dr. Yang
• Dr. Luigi Bouchard, André Anne Houde
• Dr. Steele, Dr. Kramer, Dr. Abrahamowicz
• Maxime Turgeon, Kevin McGregor, Lauren Mokry, Marie Forest, Pablo Ginestet
• Greg Voisin, Vince Forgetta, Kathleen Klein
• Mothers and children from the study

Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Types of different blotting techniques.pptx
Types of different blotting techniques.pptxTypes of different blotting techniques.pptx
Types of different blotting techniques.pptx
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real time
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...
 
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 

High Dimensional Interactions and Environmental Effects

  • 1. Methods for High Dimensional Interactions. Sahir Rai Bhatnagar, PhD Candidate – McGill Biostatistics. Joint work with Yi Yang, Mathieu Blanchette and Celia Greenwood. Ludmer Center – May 19, 2016
  • 5. one predictor variable at a time [diagram labels: Predictor Variable, Phenotype; Test 1, Test 2, Test 3, Test 4, Test 5]
  • 8. a network based view [diagram labels: Predictor Variable, Phenotype; Test 1]
  • 10. system level changes due to environment [diagram labels: Predictor Variable, Phenotype, Environment (A, B); Test 1]
  • 11. Motivating Dataset: Newborn epigenetic adaptations to gestational diabetes exposure (Luigi Bouchard, Sherbrooke). Environment: Gestational Diabetes. Large Data: Child’s epigenome (p ≈ 450k). Phenotype: Obesity measures
  • 12. Differential Correlation between environments [figure panels: (a) Gestational diabetes affected pregnancy, (b) Controls]
  • 13. Gene Expression: COPD patients [figure panels: (a) Gene Exp.: Never Smokers, (b) Gene Exp.: Current Smokers, (c) Correlations: Never Smokers, (d) Correlations: Current Smokers]
  • 14. Imaging Data: Topological properties and Age
  • 16. NIH MRI brain study. Environment: Age. Large Data: Cortical Thickness (p ≈ 80k). Phenotype: Intelligence
  • 23. formal statement of initial problem
      • n: number of subjects
      • p: number of predictor variables
      • $X_{n \times p}$: high dimensional data set (p >> n)
      • $Y_{n \times 1}$: phenotype
      • $E_{n \times 1}$: environmental factor that has widespread effect on X and can modify the relation between X and Y
      • Objective: Which elements of X that are associated with Y depend on E?
  • 30. conceptual model [diagram: Environment (Maternal care, Age, Diet), with E = 0 and E = 1; Large Data (p >> n): Gene Expression, DNA Methylation, Brain Imaging, shown once per environment; Phenotype (Behavioral development, IQ scores, Death). Builds 27 to 29 add the labels "epidemiological study" and "(epi)genetic/imaging associations"]
  • 34. Is this mediation analysis?
      • No
      • We are not making any causal claims, i.e. about the direction of the arrows
      • There are many untestable assumptions required for such an analysis → not well understood for HD data
  • 38. analysis strategies
      • Single-Marker or Single Variable Tests: marginal correlations (univariate p-value); multiple testing adjustment
      • Multivariate Regression Approaches Including Penalization Methods: LASSO (convex penalty with one tuning parameter); MCP, SCAD (non-convex penalties with two tuning parameters); Dantzig selector; group level penalization (group LASSO, SCAD and MCP)
      • Clustering Together with Regression: cluster features based on Euclidean distance, correlation, connectivity; regression with group level summary (PCA, average)
  • 44. ECLUST - our proposed method: 3 phases [diagram: Original Data, split into E = 0 and E = 1 → 1) Gene Similarity → 2) Cluster Representation (n × 1 vectors) → 3) Penalized Regression: $Y_{n \times 1}$ ∼ cluster representations + cluster representations × E]
  • 45. "the objective of statistical methods is the reduction of data. A quantity of data . . . is to be replaced by relatively few quantities which shall adequately represent . . . the relevant information contained in the original data." - Sir R. A. Fisher, 1922
  • 48. Underlying model
      $Y = \beta_0 + \beta_1 U + \beta_2 U \cdot E + \varepsilon$ (1)
      $X \sim F(\alpha_0 + \alpha_1 U, \Sigma_E)$ (2)
      • U: unobserved latent variable
      • X: observed data which is a function of U
      • $\Sigma_E$: environment sensitive correlation matrix
  • 49. ECLUST - our proposed method: 3 phases [diagram repeated from slide 44]
  • 50. advantages and disadvantages
      • Single-Marker. Advantages: simple, easy to implement. Disadvantages: multiple testing burden, power, interpretability
      • Penalization. Advantages: multivariate, variable selection, sparsity, efficient optimization algorithms. Disadvantages: poor sensitivity with correlated data, ignores structure in design matrix, interpretability
      • Environment Cluster with Regression. Advantages: multivariate, flexible implementation, group structure, takes advantage of correlation, interpretability. Disadvantages: difficult to identify relevant clusters, clustering is unsupervised
  • 51. Methods to detect gene clusters (Table 1: general approach and formula)
      • Correlation: Pearson, Spearman, biweight midcorrelation
      • Correlation Scoring: $|\rho_{E=1} - \rho_{E=0}|$
      • Weighted Correlation Scoring: $c\,|\rho_{E=1} - \rho_{E=0}|$
      • Fisher’s Z Transformation: $\dfrac{|z_{ij0} - z_{ij1}|}{\sqrt{1/(n_0 - 3) + 1/(n_1 - 3)}}$
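The correlation scoring and Fisher's Z rows of Table 1 can be illustrated with a short sketch. This is not the eclust implementation; the function names and toy data are hypothetical, Pearson correlation is assumed, and $z$ is taken to be the usual Fisher transform $\operatorname{atanh}(\rho)$ computed within each environment group of size $n_0$ or $n_1$.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_score(rho0, rho1):
    """Correlation scoring: |rho_{E=1} - rho_{E=0}|."""
    return abs(rho1 - rho0)

def fisher_z_score(rho0, n0, rho1, n1):
    """|z_0 - z_1| / sqrt(1/(n0-3) + 1/(n1-3)), with z = atanh(rho).
    Requires |rho| < 1 and group sizes above 3."""
    z0, z1 = math.atanh(rho0), math.atanh(rho1)
    return abs(z0 - z1) / math.sqrt(1 / (n0 - 3) + 1 / (n1 - 3))

# Hypothetical toy data: features i and j measured in each environment group.
xi = [1, 2, 3, 4, 5, 6]
xj_unexposed = [2, 1, 4, 3, 6, 5]   # positively correlated with xi when E = 0
xj_exposed = [5, 6, 3, 4, 1, 2]     # negatively correlated with xi when E = 1

rho0 = pearson(xi, xj_unexposed)
rho1 = pearson(xi, xj_exposed)
diff_score = correlation_score(rho0, rho1)
z_score = fisher_z_score(rho0, len(xi), rho1, len(xi))
```

A pair of features whose correlation flips sign between environments gets a large score under both measures, while a pair with the same correlation in both groups scores zero.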
  • 52. Cluster Representation (Table 2: Methods to create cluster representations)
      • Unsupervised: average; K principal components
      • Supervised: partial least squares
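As a minimal sketch of the simplest option in Table 2, the per-subject average of the features in a cluster gives one n × 1 summary per cluster. The function name, the dictionary layout of the cluster assignments and the toy data are assumptions for illustration, not part of the authors' software.

```python
def cluster_averages(X, clusters):
    """Summarize each cluster by its per-subject average.

    X: list of n rows (subjects), each a list of p feature values.
    clusters: dict mapping a cluster label to the column indices it contains.
    Returns a dict mapping each label to a list of n averages.
    """
    return {
        label: [sum(row[j] for j in cols) / len(cols) for row in X]
        for label, cols in clusters.items()
    }

# Hypothetical example: 2 subjects, 3 features, 2 clusters.
X = [[1.0, 3.0, 10.0],
     [2.0, 4.0, 20.0]]
clusters = {"A": [0, 1], "B": [2]}
reps = cluster_averages(X, clusters)
```

The resulting vectors would then enter the penalized regression in place of the original p features.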
  • 54. Simulation Study 1 [figure panels: (a) $\mathrm{Corr}(X_{E=0})$, (b) $\mathrm{Corr}(X_{E=1})$, (c) $|\mathrm{Corr}(X_{E=1}) - \mathrm{Corr}(X_{E=0})|$, (d) $\mathrm{Corr}(X_{all})$]
  • 55. Results: Jaccard Index and test set MSE
  • 57. TOM based on all subjects [figure: $\mathrm{TOM}(X_{all})$]
  • 58. TOM based on unexposed subjects [figure: $\mathrm{TOM}(X_{E=0})$]
  • 59. TOM based on exposed subjects [figure: $\mathrm{TOM}(X_{E=1})$]
  • 60. Difference of TOMs [figure: $|\mathrm{TOM}(X_{E=1}) - \mathrm{TOM}(X_{E=0})|$]
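  • 63. Model
      $g(\mu) = \underbrace{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \beta_E E}_{\text{main effects}} + \underbrace{\alpha_{1E}(X_1 E) + \cdots + \alpha_{pE}(X_p E)}_{\text{interactions}}$
      • $g(\cdot)$ is a known link function
      • $\mu = \mathrm{E}[Y \mid X, E, \beta, \alpha]$
      • $\beta = (\beta_1, \beta_2, \ldots, \beta_p, \beta_E) \in \mathbb{R}^{p+1}$
      • $\alpha = (\alpha_{1E}, \ldots, \alpha_{pE}) \in \mathbb{R}^{p}$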
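  • 64. Variable Selection
      $\arg\min_{\beta_0, \beta, \alpha} \; \frac{1}{2} \lVert Y - g(\mu) \rVert^2 + \lambda \left( \lVert \beta \rVert_1 + \lVert \alpha \rVert_1 \right)$
      • $\lVert Y - g(\mu) \rVert^2 = \sum_i (y_i - g(\mu_i))^2$
      • $\lVert \beta \rVert_1 = \sum_j |\beta_j|$
      • $\lVert \alpha \rVert_1 = \sum_j |\alpha_j|$
      • $\lambda \geq 0$: tuning parameter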
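To make the penalized objective concrete, the sketch below evaluates it for a given set of coefficients. It assumes an identity link, so that $g(\mu)$ is the linear predictor with main effects and interactions; the function names and toy numbers are hypothetical, and no optimization is performed.

```python
def linear_predictor(x_row, e, beta0, beta, beta_e, alpha):
    """Main effects plus X-by-E interactions for one subject (identity link assumed)."""
    main = beta0 + sum(b * x for b, x in zip(beta, x_row)) + beta_e * e
    interactions = sum(a * x * e for a, x in zip(alpha, x_row))
    return main + interactions

def lasso_objective(X, E, y, beta0, beta, beta_e, alpha, lam):
    """0.5 * residual sum of squares + lam * (||beta||_1 + ||alpha||_1).
    The intercept is not penalized; beta_E counts as part of beta."""
    rss = sum(
        (yi - linear_predictor(row, e, beta0, beta, beta_e, alpha)) ** 2
        for row, e, yi in zip(X, E, y)
    )
    l1_beta = sum(abs(b) for b in beta) + abs(beta_e)
    l1_alpha = sum(abs(a) for a in alpha)
    return 0.5 * rss + lam * (l1_beta + l1_alpha)

# Hypothetical toy data: 2 subjects, 2 predictors, binary environment.
X = [[1.0, 2.0], [0.0, 1.0]]
E = [0, 1]
y = [1.0, 2.0]
value = lasso_objective(X, E, y, beta0=0.0, beta=[1.0, 0.0],
                        beta_e=0.5, alpha=[0.0, 1.0], lam=0.1)
```

Here the fitted values are 1.0 and 1.5, so the loss term is 0.125 and the penalty term is 0.1 × 2.5 = 0.25. Note that this single-λ penalty treats main effects and interactions identically, which is what motivates the heredity constraint discussed next.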
  • 67. Why Strong Heredity?
      • Statistical Power: large main effects are more likely to lead to detectable interactions than small ones
      • Interpretability: Assuming a model with interaction only is generally not biologically plausible
      • Practical Sparsity: $X_1, E, X_1 \cdot E$ vs. $X_1, E, X_2 \cdot E$
  • 70. Model
      $g(\mu) = \underbrace{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \beta_E E}_{\text{main effects}} + \underbrace{\alpha_{1E}(X_1 E) + \cdots + \alpha_{pE}(X_p E)}_{\text{interactions}}$
      • Reparametrization [1]: $\alpha_{jE} = \gamma_{jE}\, \beta_j\, \beta_E$
      • Strong heredity principle [2]: $\hat{\alpha}_{jE} \neq 0 \Rightarrow \hat{\beta}_j \neq 0$ and $\hat{\beta}_E \neq 0$
      [1] Choi et al. 2010, JASA
      [2] Chipman 1996, Canadian Journal of Statistics
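The reparametrization $\alpha_{jE} = \gamma_{jE}\, \beta_j\, \beta_E$ builds the heredity constraint into the model: an interaction coefficient can be nonzero only when both of its main effects are nonzero. A minimal sketch with hypothetical names and toy coefficient values:

```python
def interaction_coefficients(beta, beta_e, gamma):
    """alpha_jE = gamma_jE * beta_j * beta_E for each predictor j."""
    return [g * b * beta_e for g, b in zip(gamma, beta)]

def satisfies_strong_heredity(beta, beta_e, alpha):
    """True if every nonzero interaction has both main effects nonzero."""
    return all(a == 0 or (b != 0 and beta_e != 0) for a, b in zip(alpha, beta))

# Hypothetical coefficients for three predictors.
beta = [1.5, 0.0, -2.0]     # main effects of X1, X2, X3
beta_e = 0.5                # main effect of E
gamma = [2.0, 3.0, 0.0]     # reparametrized interaction terms

alpha = interaction_coefficients(beta, beta_e, gamma)
ok = satisfies_strong_heredity(beta, beta_e, alpha)
```

In this example only the first interaction survives: the second is removed because the main effect of X2 is zero, and the third because its gamma is zero. Setting `beta_e` to zero would remove every interaction at once.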
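  • 71. Strong Heredity Model with Penalization
      $\arg\min_{\beta_0, \beta, \gamma} \; \frac{1}{2} \lVert Y - g(\mu) \rVert^2 + \lambda_\beta \left( w_1 |\beta_1| + \cdots + w_q |\beta_q| + w_E |\beta_E| \right) + \lambda_\gamma \left( w_{1E} |\gamma_{1E}| + \cdots + w_{qE} |\gamma_{qE}| \right)$
      $w_j = \left| \dfrac{1}{\hat{\beta}_j} \right|, \quad w_{jE} = \left| \dfrac{\hat{\beta}_j \hat{\beta}_E}{\hat{\alpha}_{jE}} \right|$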
  • 72. Open source software
      • Software implementation in R: http://sahirbhatnagar.com/eclust/
      • Allows user specified interaction terms
      • Automatically determines the optimal tuning parameters through cross validation
      • Can also be applied to genetic data (SNPs)
  • 77. The most popular way of feature screening
      How to fit statistical models when you have over 100,000 features?
      Marginal correlations, t-tests:
      • for each feature, calculate the correlation between X and Y
      • keep all features with correlation greater than some threshold
      • However, this procedure assumes a linear relationship between X and Y
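The screening steps above can be sketched as follows. The function names, the toy data and the use of the absolute Pearson correlation are assumptions made for illustration; the third feature is a U-shaped function of the outcome and is included to show the linearity limitation.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (neither may be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def screen_by_correlation(X, y, threshold):
    """Return indices of features whose |correlation with y| exceeds the threshold."""
    kept = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        if abs(pearson(column, y)) > threshold:
            kept.append(j)
    return kept

# Hypothetical data: 5 subjects, 3 features.
y = [1, 2, 3, 4, 5]
X = [
    [2, 2, 4],   # feature 0: linear in y
    [4, 1, 1],   # feature 1: unrelated to y
    [6, 2, 0],   # feature 2: depends on y, but through (y - 3)^2
    [8, 1, 1],
    [10, 2, 4],
]
kept = screen_by_correlation(X, y, threshold=0.5)
```

Feature 2 is fully determined by the outcome yet has zero marginal correlation with it, so a correlation threshold discards it along with the unrelated feature.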
  • 78. Non-linear feature screening: Kolmogorov-Smirnov Test
      Mai & Zou (2012) proposed using the Kolmogorov-Smirnov (KS) test statistic
      $\hat{K}_j = \sup_x \left| \hat{F}_j(x \mid Y = 1) - \hat{F}_j(x \mid Y = 0) \right|$ (3)
      Figure 8: Depiction of KS statistic
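A minimal sketch of the KS screening statistic for one feature and a binary outcome, using empirical distribution functions. The function names and toy samples are hypothetical. Because both empirical CDFs are step functions that jump only at observed values, the supremum can be found by checking the pooled observed values.

```python
def ecdf(sample, x):
    """Empirical CDF of `sample` evaluated at x."""
    return sum(1 for v in sample if v <= x) / len(sample)

def ks_statistic(x_y1, x_y0):
    """sup_x |F_hat(x | Y=1) - F_hat(x | Y=0)| for one feature."""
    points = sorted(set(x_y1) | set(x_y0))
    return max(abs(ecdf(x_y1, t) - ecdf(x_y0, t)) for t in points)

# Hypothetical feature values: same mean in both classes, different spread.
x_cases = [-2.0, -1.0, 1.0, 2.0]        # subjects with Y = 1
x_controls = [-0.5, -0.25, 0.25, 0.5]   # subjects with Y = 0

k_hat = ks_statistic(x_cases, x_controls)
```

Both groups have mean zero, so a t-test on this feature would see no difference, while the KS statistic picks up the difference in spread. Features would then be ranked by this statistic and the top ones retained.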
  • 79. Non-linear Interaction Models
      After feature screening, we can fit non-linear relationships between X and Y
      $Y_i = \beta_0 + f(X_{ij}) + f(X_{ij}, E_i) + \varepsilon_i$ (4)
  • 86. Conclusions and Contributions
      • Large system-wide changes are observed in many environments
      • This assumption can possibly be exploited to aid analysis of large data
      • We develop and implement a multivariate penalization procedure for predicting a continuous or binary disease outcome while detecting interactions between high dimensional data (p >> n) and an environmental factor
      • Dimension reduction is achieved through leveraging the environmental-class-conditional correlations
      • Also, we develop and implement a strong heredity framework within the penalized model
      • R software: http://sahirbhatnagar.com/eclust/
  • 89. Limitations
      • There must be a high-dimensional signature of the exposure
      • Clustering is unsupervised
      • Two tuning parameters
  • 90. What type of data is required to use these methods
  • 91. ECLUST method
      1. environmental exposure (currently only binary)
      2. a high dimensional dataset that can be affected by the exposure
      3. a single phenotype (continuous or binary)
      4. there must be a high-dimensional signature of the exposure
  • 92. Strong Heredity and Non-linear Models
      1. a single phenotype (continuous or binary)
      2. environment variable (continuous or binary)
      3. any number of predictor variables
  • 93. Check out our Lab’s Software! http://greenwoodlab.github.io/software/
  • 94. acknowledgements
      • Dr. Celia Greenwood
      • Dr. Blanchette and Dr. Yang
      • Dr. Luigi Bouchard, André Anne Houde
      • Dr. Steele, Dr. Kramer, Dr. Abrahamowicz
      • Maxime Turgeon, Kevin McGregor, Lauren Mokry, Marie Forest, Pablo Ginestet
      • Greg Voisin, Vince Forgetta, Kathleen Klein
      • Mothers and children from the study