Integration of biological annotations using hierarchical modeling

Using Biological Knowledge To
Discover Higher Order Interactions
In Genetic Association Studies
Gary K. Chen
Duncan C. Thomas
Department of Preventive Medicine
USC

May 19, 2010

Outline
1. Motivation
2. The algorithm: Incorporating biological priors
into an MCMC sampler

3. Simulation 1: Performance of the method
4. Simulation 2: Detecting interactions in a known
pathway

5. Application to data from a GWAS

6. Future Extensions

Common diseases have complex etiology

GWAS have had great success in searching for
genetic variants for common diseases
Recent successes: AMD, BMI/obesity, Type 2
diabetes, breast cancer, prostate cancer
Marginal effects from single-SNP analyses do
not explain all heritability. Can we move
beyond the low-hanging fruit (e.g. CNVs, rare
variants, epistatic interactions)?
Ideally we would fit a model for all SNPs (and
interactions too)

Analyzing all SNPs simultaneously
Difficult for GWAS: predictors far outnumber
observations
Shrinkage methods: LASSO, ridge regression,
elastic net, ...
LASSO method (Tibshirani, J Royal Stat. Soc. 96)
penalizes likelihood based on tuning parameter λ
produces sparse (interpretable) models
In GWAS settings:
Double exponential (Laplace) prior on β (Wu and Lange,
Bioinf. 2009)
Normal exponential gamma prior on β (Hoggart et al.,
PLoS Genet 2008)
Fast! Provides the maximum a posteriori (MAP)
estimates

Fully Bayesian methods for variable
selection
Bayesian model averaging assesses uncertainty
Probabilistically proposes sub-models from a
posterior distribution
Summarize statistics of parameters averaged across
all proposed models
Controls for multiple comparisons
Disadvantage: Computationally expensive
P(β) has normal distribution for conjugacy
“Spike and slab” ensures parsimony
Example: Stochastic Search Variable Selection
via Gibbs sampling (George and McCulloch
JASA 93)
$\beta_j \mid \gamma_j \sim (1 - \gamma_j)\,N(0, \tau_j^2) + \gamma_j\,N(0, c_j^2 \tau_j^2)$
e.g., $f(\gamma) = \prod_j p_j^{\gamma_j} (1 - p_j)^{(1 - \gamma_j)}$
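The spike-and-slab mixture above can be illustrated by drawing from it. The following is a minimal sketch, not the authors' implementation: the function name `sample_ssvs_prior` and the example values of p, τ and c are assumptions chosen only for illustration.

```python
import numpy as np

def sample_ssvs_prior(p, tau, c, seed=0):
    """Draw (gamma, beta) from the SSVS mixture prior, one entry per variable j.

    gamma_j ~ Bernoulli(p_j)
    beta_j | gamma_j = 0 ~ N(0, tau_j^2)          (spike)
    beta_j | gamma_j = 1 ~ N(0, c_j^2 tau_j^2)    (slab)
    """
    rng = np.random.default_rng(seed)
    p, tau, c = (np.asarray(a, dtype=float) for a in (p, tau, c))
    gamma = rng.random(p.shape) < p            # inclusion indicators
    sd = np.where(gamma, c * tau, tau)         # slab sd if included, spike sd otherwise
    beta = rng.normal(0.0, sd)
    return gamma.astype(int), beta

# Hypothetical settings: 5 variables, 10% prior inclusion, narrow spike, wide slab
gamma, beta = sample_ssvs_prior(p=[0.1] * 5, tau=[0.01] * 5, c=[100.0] * 5)
```

With a small τ and a large c, excluded variables are held close to zero while included ones are free to take sizeable values, which is how the prior encourages parsimony.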

Searching for interactions
SSVS via Gibbs Sampling
For 1000 SNPs, length of γ:
500,500 = 1000 + (1000)(999)/2
Iterating through each parameter is slow
Reversible jump MCMC
In contrast to SSVS, the “model” is
$M = \{j : \gamma_j \neq 0\}$
Model size changes at each iteration (similar to
stepwise regression)
Informative priors
Incorporating biological information at the level of
each variable
These priors can be used to build a proposal
function in a Metropolis-Hastings algorithm
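The length of γ quoted for SSVS on this slide is the number of main effects plus the number of unordered SNP pairs. A short check of that arithmetic (the function name is hypothetical):

```python
def n_ssvs_indicators(n_snps):
    """Main effects plus all unordered pairwise interactions."""
    return n_snps + n_snps * (n_snps - 1) // 2

print(n_ssvs_indicators(1000))  # 500500
```

The count grows quadratically in the number of SNPs, which is why sweeping through every indicator in a Gibbs sampler becomes slow.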

Posterior density as a two-level
hierarchical model

Posterior density:
$L(Y \mid \beta, X, M)\,P(\beta \mid \pi, \tau, \sigma, M, Z, A)$
First level as likelihood: a GLM at the subject
level
$\mathrm{logit}(P(Y = 1 \mid \beta, X)) = \beta_0 + \sum_{k=1}^{K} \beta_k X_k$
X can be G, E, GxG, GxE, etc.
Second level as prior: βk as mixed model
$\beta_k \sim \pi^T Z_k + \phi_k + \theta_k$
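The two levels can be sketched numerically. This is an illustrative sketch only: the function names are hypothetical, the first function evaluates the logistic log-likelihood of the first level, and the second returns the deterministic part of the second-level mean (π^T Z_k + φ_k), leaving out the residual θ_k.

```python
import numpy as np

def logistic_loglik(beta0, beta, X, y):
    """First level: log-likelihood of a logistic GLM with linear predictor beta0 + X @ beta."""
    eta = beta0 + X @ beta
    # log P(y | eta) = y * eta - log(1 + exp(eta)), evaluated stably
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))

def prior_mean(pi, Z, phi):
    """Second level: prior mean of each beta_k, pi^T Z_k + phi_k (residual theta_k omitted)."""
    return np.asarray(Z) @ np.asarray(pi) + np.asarray(phi)

# Toy inputs (assumed values, for illustration only)
X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 0.0], [1.0, 1.0]])
y = np.array([0, 1, 1, 0])
ll = logistic_loglik(-0.5, np.array([0.2, 0.1]), X, y)
```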

Prior mean on variable in Z

Table: The Z matrix
Intercept  Conservation  Missense  eQTL
1          20            0         5
1          10            1         0.01
1          5             0         1
1          10            1         4.1
1          5             0         1.4

$\hat{\pi}$: regress $\hat{\beta}$ on $Z$; $\pi \sim N(\hat{\pi}, \Sigma_\pi)$

Variable connectivity in A matrix

Table: Example A matrix for SNP variables
Variable  1  2  3
1         0  1  0
2         1  0  1
3         0  1  0

One approach for populating the A matrix

Table: The Z matrix
   Intercept  Conservation  Missense  eQTL
→  1          20            0         5
   1          10            1         0.01
→  1          5             0         1
   1          10            1         4.1
   1          5             0         1.4

Define entry $A_{1,3}$ as corr($Z_{1,\cdot}$, $Z_{3,\cdot}$), the correlation between
rows 1 and 3 of Z (marked →), then dichotomize A
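The row-correlation approach can be sketched with the Z table shown on this slide. The cutoff used to dichotomize is not given in the slides, so the `threshold=0.9` below is purely an assumption, as is the function name.

```python
import numpy as np

# Z matrix from the slide: Intercept, Conservation, Missense, eQTL
Z = np.array([
    [1, 20, 0, 5],
    [1, 10, 1, 0.01],
    [1, 5, 0, 1],
    [1, 10, 1, 4.1],
    [1, 5, 0, 1.4],
])

def adjacency_from_Z(Z, threshold=0.9):
    """A[j, k] = 1 if rows j and k of Z correlate above a (hypothetical) threshold."""
    corr = np.corrcoef(Z)                 # Pearson correlation between rows
    A = (corr > threshold).astype(int)    # dichotomize
    np.fill_diagonal(A, 0)                # a variable is not its own neighbor
    return A

A = adjacency_from_Z(Z)
```

Under this assumed cutoff, the two rows marked with arrows (variables 1 and 3) have very similar annotation profiles and are linked in A.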

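φk as mean across k’s neighbors
Table: Example A matrix for SNP variables
Variable  1  2  3
1         0  1  0
2         1  0  1
3         0  1  0

$\phi_k \sim N(\bar{\phi}_{-k}, \tau^2 / \nu_k)$
$\bar{\phi}_{-k} = \frac{\sum_{j=1}^{m} \phi_j A_{jk}}{\sum_{j=1}^{m} A_{jk}}$, $\nu_k$ = number of neighbors of variable k
We set $\phi_j = \hat{\beta}_j$
Example: If $\hat{\beta} = (0.2, 0.5, 0.4)$, $\bar{\phi}_{-2} = (0.2 + 0.4)/2 = 0.3$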
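The neighbor mean in the example above can be reproduced directly from the A matrix on this slide. A minimal sketch (the function name is hypothetical, and it assumes every variable has at least one neighbor so that no division by zero occurs):

```python
import numpy as np

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
beta_hat = np.array([0.2, 0.5, 0.4])

def neighbor_mean(phi, A):
    """Return (phi_bar, nu): mean of phi over each variable's neighbors, and neighbor counts."""
    nu = A.sum(axis=0)            # nu_k = sum_j A[j, k]
    phi_bar = (phi @ A) / nu      # sum_j phi_j A[j, k] / nu_k
    return phi_bar, nu

phi_bar, nu = neighbor_mean(beta_hat, A)
```

Here variable 2 has neighbors 1 and 3, so its entry of `phi_bar` is the average of 0.2 and 0.4.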

How the parameters fit together
$L(Y \mid \beta, X, M)\,P(\beta \mid Z, \pi, A, \tau, \sigma, M)$

A reversible jump MCMC algorithm

Propose a swap, addition, or deletion of a
variable
Perform reversible jump Metropolis-Hastings
step comparing posterior probabilities
$r = \frac{L(Y \mid \beta', X', M')\,P(\beta' \mid Z, \pi, A, \tau, \sigma, M')\,P(M \to M')}{L(Y \mid \beta, X, M)\,P(\beta \mid Z, \pi, A, \tau, \sigma, M)\,P(M' \to M)}$
Accept move with probability min(1, r)
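The acceptance step can be sketched in log space to avoid numerical underflow. This is an illustrative sketch, not the authors' code: the function and argument names are hypothetical, and the placement of the two transition terms simply mirrors the ratio r as it appears on this slide.

```python
import math
import random

def accept_move(log_post_new, log_post_old, log_p_fwd, log_p_rev, rng=random):
    """Accept a proposed model M' with probability min(1, r).

    log_post_new / log_post_old: log L + log P(beta | ...) under M' and M.
    log_p_fwd: log P(M -> M'); log_p_rev: log P(M' -> M).
    """
    log_r = (log_post_new + log_p_fwd) - (log_post_old + log_p_rev)
    if log_r >= 0:                      # r >= 1: always accept
        return True
    return rng.random() < math.exp(log_r)

# Toy values (assumed): the proposed model fits better by 2 log units
accepted = accept_move(-100.0, -102.0, math.log(0.3), math.log(0.5))
```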

Model transition proposal density

Suppose model M′ has 1 newly proposed
variable:
$P(M \to M') = \Phi^{-1}(z_k)$
$z_k \sim N(\mu_k - \mu_{baseline}, 1)$
The variable-specific tuning parameter µk
A function of the components of β’s prior
standardized by their residual variances
$\mu_k = \frac{|\pi^T Z_k + \bar{\phi}_{-k}|}{\sigma^2 + \tau^2/\nu_k}$
Weak empirical support for priors leads to a small
numerator and a large denominator

The global penalty tuning parameter
Emulate the BIC
$BIC(M') - BIC(M) = \chi_1(\ln(n))$
Probability of accepting M′ is $F_\chi^{-1}(\ln(n))$
$\mu_{baseline} = \Phi(F_\chi^{-1}(\ln(n)))$

Using external information to enhance
power and specificity
Disease model: 4 GxG interactions jointly
cause disease through 4 endophenotypes
Genotypes simulated for 14 independent SNPs
$y_{ik} = (1 - b)\,N(s_{ia} \cdot s_{ib}, 1) + b\,U(0, 1)$
$b \sim \mathrm{Bernoulli}(p)$, p is proportion of noise
24 endophenotypes y used only in the prior
Disease status determined using a logistic
model
$\mathrm{logit}(P(Y_i = 1)) = \beta_0 + \beta_1 y_{i,01} + \beta_2 y_{i,02} + \beta_3 y_{i,34} + \beta_4 y_{i,35}$
First 8000 persons reserved as case-control
dataset, remaining 2000 for constructing priors
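The endophenotype mixture above can be sketched as follows. Several details are assumptions not stated in the slides: the 0/1/2 genotype coding, the allele frequency of 0.3, and the function name.

```python
import numpy as np

def simulate_endophenotype(s_a, s_b, p, rng):
    """y_i = N(s_ia * s_ib, 1) with probability 1 - p, else U(0, 1) noise."""
    n = s_a.shape[0]
    b = rng.random(n) < p                      # b_i ~ Bernoulli(p)
    signal = rng.normal(s_a * s_b, 1.0)        # driven by the SNP-pair product
    noise = rng.uniform(0.0, 1.0, size=n)      # uninformative draw
    return np.where(b, noise, signal)

rng = np.random.default_rng(0)
# Assumption: additive 0/1/2 coding, allele frequency 0.3, 8000 + 2000 persons
s = rng.binomial(2, 0.3, size=(10_000, 14))
y_01 = simulate_endophenotype(s[:, 0], s[:, 1], p=0.2, rng=rng)
```

Raising p toward 1 replaces more of the signal with uniform noise, which is how the weakly informative prior variants in the simulation are generated.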

Constructing the Z and the A matrices

Z matrix
Measures correlation between a model variable and
each endophenotype among the 2000 individuals
reserved for the prior
$Z_{kq} = \mathrm{corr}(g_k, y_q)$
A matrix
Measures similarity between two variables by
comparing correlation profiles (rows) in Z
$A_{jk} = \mathrm{corr}(Z_{j\cdot}, Z_{k\cdot})$
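The construction of Z and A described above can be sketched with plain NumPy. The function name and the toy data are hypothetical; the sketch assumes no column of G or Y is constant, since a constant column has no defined correlation.

```python
import numpy as np

def build_Z(G, Y):
    """Z[k, q] = Pearson correlation between model variable k (column of G)
    and endophenotype q (column of Y), computed over the same individuals."""
    Gs = (G - G.mean(axis=0)) / G.std(axis=0)
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    return Gs.T @ Ys / G.shape[0]

rng = np.random.default_rng(0)
G = rng.normal(size=(2000, 6))     # toy stand-in for model variables
Y = rng.normal(size=(2000, 24))    # toy stand-in for 24 endophenotypes
Z = build_Z(G, Y)                  # shape (6, 24)
A_corr = np.corrcoef(Z)            # similarity of correlation profiles (rows of Z)
```

`A_corr` holds the raw row correlations; it would still need to be dichotomized to obtain a 0/1 adjacency matrix.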

Question 1: How do the priors affect
power and specificity?
The A matrix contains information across all
24 endophenotypes
Set up 3 variants of the original Z matrix
4 causal endophenotypes only (noise parameter
p = 0)
4 intermediate endophenotypes only (noise
parameter p = 0.2)
4 weakly correlated endophenotypes only (noise
parameter p = 0.8)
Models tested: both A and Z, no A or Z, A
only, Z only (with 3 variants)

At RR=1.5, all prior models perform very well

At RR=1.4, prior models with A, Z, or both
outperform others

At RR=1.3, prior models with A, Z, or both have
> 5% power

At RR=1.2, fully informative prior still retains 80%
power

At RR=1.1, all prior models perform poorly (∼ 55%
power)

Question 2: How do the priors affect
posterior estimates (shrinkage)?
Posterior estimates of β vs MLE
Posterior estimates of SE of β vs MLE

Question 3: How do the priors improve
rankings?
6,441 interactions tested. 4 causal.

Question 3: How do the priors improve
rankings?
513,591 interactions tested. 4 causal.

Summary of simulation

Sensitivity analysis
All methods perform well at high RRs
Informative priors improve power at lower RRs but
not at extremely low RRs
Like LASSO, shrinkage improves interpretability
Model averaging can improve robustness of
rankings

Discovering interactions in a known
pathway: Folate

Simulated data set
14 genes, 2 environmental variables
8000 individuals in case-control data, remaining
2000 for constructing priors
Used a pathway simulation program to
generate steady-state concentrations
Reed et al., J Nutr. 2006 Oct;136(10):2653-61
Enzyme kinetic parameters (Km, Vmax) are genotype
specific
3 mechanisms believed to be related to disease
etiology
Homocysteine concentration
Pyrimidine synthesis
Purine synthesis

Estimates of π
Construct Z and A in same manner as previous
simulation:
Z stores genotype-metabolite correlations
A stores dichotomized correlations between rows of
Z
True log relative risk: 0.18 (RR=1.2)

Simulated second-level coefficients π
Mechanism      homocysteine   pyrimidine      purine
homocysteine   0.18(0.13)     -0.09(0.536)    0.002(0.38)
pyrimidine     -0.04(0.22)    0.22(0.066)     -0.01(0.06)
purine         -0.01(0.36)    0.16(0.327)     0.19(0.07)

Comparison of BMA results to stepwise
regression
   Interaction   BF     MLE p-value
   FTD*MAT-II    15     0.038
   FTD*MTHFR     20     0.046
   MTCH*MS       534    0.006
   PGT*MS        14     0.018
→  SHMT*CBS      1254   0.133
→  SHMT*Fol      2324   0.036
   TS*MTHFR      227    0.022
→  TS*SHMT       1091   N/S

[Figure panels: SHMT*CBS, SHMT*Fol, SHMT*TS]

Comparison of BMA results to stepwise
regression

Purine synthesis
   Interaction   BF     MLE p-value
→  MTCH*MS       1130   0.008
→  MTCH*PGT      1416   0.026
→  PGT*CBS       1022   0.069
→  PGT*MS        2851   0.007
→  SHMT*Fol      1398   0.022
   SHMT*MAT-II   646    0.012
   TS*MTHFR      57     0.024

[Figure: Purine synthesis; panels: MTCH*MS, MTCH*PGT, PGT*CBS, PGT*MS, SHMT*Fol]

Comparison of BMA results to stepwise
regression

Homocysteine
   Interaction   BF     MLE p-value
   CBS*MAT-II    77     0.045
→  CBS*Met       1072   N/S
   FTD*MAT-II    38     0.045
   FTD*MTHFR     213    0.015
→  MS*Met        1129   N/S
   MTCH*MS       978    0.006
   PGT*MS        75     0.044
   TS*MTHFR      41     0.022

[Figure: Homocysteine levels; panels: CBS*Met, MS*Met]

Summary of folate pathway simulation

Pathway knowledge can inform model search
Simulated three plausible disease mechanisms
Effect of causal metabolite on disease revealed
in corresponding element of π
Revealed plausible interactions not found
through a stepwise regression

Using gene annotations to inform a search
for interactions
Proof of concept: GWAS of breast cancer
Publicly available data from NCI
(https://caintegrator.nci.nih.gov/cgems/)
1,145 cases and 1,142 controls of European
ancestry
The 22 Gene Ontology terms from Biological
Process used to define priors in A and Z
Included 6,078 SNPs: each SNP had a GO
annotation and the lowest p-value in its gene

Top 10 interactions found
                 Non-inf prior       Inf prior
Interaction      β(SE)         BF    β(SE)         BF
PARK2*SORCS1     0.22(0.06)    1e4   0.27(0.06)    5e4
AK5*ARHGAP26     0.16(0.05)    427   0.17(0.05)    903
FGFR2*MAML2      -0.11(0.04)   1     -0.16(0.05)   686
SHC3*KIF13B      N/A           N/A   0.17(0.05)    621
PCLO*ME3         N/A           N/A   0.18(0.05)    528
CNGA3*CNN1       -0.16(0.05)   41    -0.17(0.05)   462
FGFR2*CDT1       N/A           N/A   -0.16(0.05)   445
SHC3*CXCL16      N/A           N/A   -0.18(0.05)   403
FGFR2*ABCA1      -0.1(0.05)    158   -0.11(0.05)   268
CYP2J2*SORCS1    -0.11(0.05)   74    -0.14(0.05)   266
FGFR2*SCG5       N/A           N/A   0.21(0.05)    235

Enrichment analysis
Are the top interactions (BF > 100) enriched
for certain GO terms?
Compute empirical p-value for enrichment
For each iteration, permute within bins representative of
non-independence in observed interactions
Pool bins, compute frequency of a GO term in the
pool
p-value: number of iterations in which the permuted
frequency exceeded the observed frequency, divided by 1 million
Enriched terms: biological regulation (p=.008), growth
(p=1e-6), metabolic process (p=.008), and
regulation of biological process (p=.003).
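One way to read the permutation procedure above is sketched below. This is an interpretation under stated assumptions, not the authors' code: the function name, the representation of GO membership as a boolean flag per item, and the use of "greater than or equal" for the exceedance count are all assumptions. The slides use 1 million iterations; the default here is far smaller to keep the sketch quick.

```python
import numpy as np

def empirical_enrichment_p(term_flags, bins, top_idx, n_iter=10_000, seed=0):
    """Empirical p-value for enrichment of one GO term among the top items.

    term_flags: True where an item carries the GO term.
    bins: bin label per item; labels are shuffled only within a bin.
    top_idx: indices of the top-ranked items (e.g. BF > 100).
    """
    rng = np.random.default_rng(seed)
    term_flags = np.asarray(term_flags, dtype=bool)
    bins = np.asarray(bins)
    obs = term_flags[top_idx].mean()            # observed frequency among top items
    bin_members = [np.flatnonzero(bins == b) for b in np.unique(bins)]
    exceed = 0
    for _ in range(n_iter):
        perm = term_flags.copy()
        for members in bin_members:             # permute within each bin
            perm[members] = rng.permutation(term_flags[members])
        if perm[top_idx].mean() >= obs:         # pooled frequency vs observed
            exceed += 1
    return exceed / n_iter
```

Permuting within bins keeps whatever dependence structure defines the bins intact under the null, so the reference distribution is not artificially narrow.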

Incorporate gene-expression data into
GWAS analyses
Developing priors
Should be more informative (e.g. empirical) and
granular (e.g. SNP level) than GO
Obtain genotype-expression paired data: HapMap?
Apply WGCNA to infer pathway modules
Genotype-module correlations used in Z matrix
Incorporate more advanced MCMC techniques
Evolutionary Monte Carlo
Multiple-try Metropolis
Brute-force search for MAP. Use MAP for initial
values?

Acknowledgements

James Baurley
David Conti
Angela Presson (thanks in advance!)
Funding: R01 ES016813 and R01 ES015090.
