Integration of biological annotations using hierarchical modeling
1. Using Biological Knowledge To
Discover Higher Order Interactions
In Genetic Association Studies
Gary K. Chen
Duncan C. Thomas
Department of Preventive Medicine
USC
May 19, 2010
2. Outline
1. Motivation
2. The algorithm: Incorporating biological priors
into an MCMC sampler
3. Simulation 1: Performance of the method
4. Simulation 2: Detecting interactions in a known
pathway
5. Application to data from a GWAS
6. Future Extensions
3. Common diseases have complex etiology
GWAS have had great success in searching for
genetic variants for common diseases
Recent successes: AMD, BMI/obesity, Type 2
diabetes, breast cancer, prostate cancer
Marginal effects from single SNP analyses do
not explain all heritability. Can we move
beyond the low-hanging fruit? (e.g. CNVs, rare
variants, epistatic interactions, etc.)
Ideally we would fit a model for all SNPs (and
interactions too)
4. Analyzing all SNPs simultaneously
Difficult for GWAS: predictors far exceed
observations
Shrinkage methods: LASSO, ridge regression,
elastic net,...
LASSO method (Tibshirani, J Royal Stat. Soc. 96)
penalizes likelihood based on tuning parameter λ
produces sparse (interpretable) models
In GWAS settings:
Double Exp (Laplace) prior on β (Wu and Lange,
Bioinf. 2009)
Normal Exp Gamma prior on β (Hoggart et al.,
PLoS Genet. 2008)
Fast! Provides the maximum a posteriori (MAP)
estimates
5. Fully Bayesian methods for variable
selection
Bayesian model averaging assesses uncertainty
Probabilistically proposes sub-models from a
posterior distribution
Summarize statistics of parameters averaged across
all proposed models
Controls for multiple comparisons
Disadvantage: Computationally expensive
P(β) has normal distribution for conjugacy
“Spike and slab” ensures parsimony
Example: Stochastic Search Variable Selection
via Gibbs sampling (George and McCulloch
JASA 93)
$\beta_j \mid \gamma_j \sim (1 - \gamma_j)\,N(0, \tau_j^2) + \gamma_j\,N(0, c_j^2 \tau_j^2)$
e.g., $f(\gamma) = \prod_j p_j^{\gamma_j} (1 - p_j)^{(1 - \gamma_j)}$
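The spike-and-slab mixture above can be illustrated by drawing from the prior. This is a minimal sketch, not the SSVS Gibbs sampler itself; the function name and the values chosen for p, τ and c are assumptions for illustration.

```python
import numpy as np

def sample_ssvs_prior(p, tau, c, rng):
    """Draw (gamma, beta) from the spike-and-slab prior shown above.

    gamma_j ~ Bernoulli(p_j)
    beta_j | gamma_j = 0 ~ N(0, tau^2)        (spike)
    beta_j | gamma_j = 1 ~ N(0, c^2 tau^2)    (slab)
    """
    p = np.asarray(p, dtype=float)
    gamma = rng.random(p.shape) < p
    sd = np.where(gamma, c * tau, tau)   # slab sd where gamma = 1, spike sd otherwise
    beta = rng.normal(0.0, sd)
    return gamma.astype(int), beta

rng = np.random.default_rng(0)
# Hypothetical settings: 10% prior inclusion, narrow spike, wide slab
gamma, beta = sample_ssvs_prior(np.full(10_000, 0.1), tau=0.01, c=100.0, rng=rng)
```

Most draws sit in the narrow spike near zero, and only the variables with γ = 1 receive an appreciable effect size, which is how the prior encourages parsimony.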
6. Searching for interactions
SSVS via Gibbs Sampling
For 1000 SNPs, length of γ:
$1000 + \frac{(1000)(999)}{2} = 500{,}500$
Iterating through each parameter is slow
Reversible jump MCMC
In contrast to SSVS, the “model” is
$M = \{j : \gamma_j \neq 0\}$
Model size changes at each iteration (similar to
stepwise regression)
Informative priors
Incorporating biological information at the level of
each variable
These priors can be used towards a proposal
function in a Metropolis Hastings algorithm
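The size of the search space quoted on this slide (main effects plus all pairwise interactions) can be checked with a short calculation. A minimal sketch; the function name is hypothetical.

```python
from math import comb

def n_candidate_terms(n_snps):
    """Number of main effects plus all pairwise (GxG) interaction terms."""
    return n_snps + comb(n_snps, 2)

n_terms = n_candidate_terms(1000)
print(n_terms)  # 500500
```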
7. Outline
1. Motivation
2. The algorithm: Incorporating biological priors
into an MCMC sampler
3. Simulation 1: Performance of the method
4. Simulation 2: Detecting interactions in a known
pathway
5. Application to data from a GWAS
6. Future Extensions
8. Posterior density as a two-level
hierarchical model
Posterior density:
L(Y |β, X , M)P(β|π, τ, σ, M, Z , A)
First level as likelihood: a GLM at the subject
level
$\operatorname{logit}(P(Y = 1 \mid \beta, X)) = \beta_0 + \sum_{k=1}^{K} \beta_k X_k$
X can be G, E, GxG, GxE, etc.
Second level as prior: βk as mixed model
$\beta_k \sim \pi^T Z_k + \phi_k + \theta_k$
9. Prior mean on variable in Z
Table: The Z matrix
Intercept   Conservation   Missense   eQTL
1           20             0          5
1           10             1          0.01
1           5              0          1
1           10             1          4.1
1           5              0          1.4
$\beta_k \sim \pi^T Z_k + \phi_k + \theta_k$
$\hat{\pi}$: regress $\hat{\beta}$ on $Z$, $\pi \sim N(\hat{\pi}, \Sigma_\pi)$
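The regression of the first-level estimates on Z can be sketched with ordinary least squares. The Z values are taken from the table above; the β̂ values are hypothetical placeholders, and plain least squares is an assumption about how the regression might be carried out.

```python
import numpy as np

# Z matrix from the table above (columns: intercept, conservation, missense, eQTL)
Z = np.array([
    [1, 20, 0, 5.0],
    [1, 10, 1, 0.01],
    [1, 5, 0, 1.0],
    [1, 10, 1, 4.1],
    [1, 5, 0, 1.4],
])

# Hypothetical first-level estimates, for illustration only
beta_hat = np.array([0.30, 0.10, 0.05, 0.25, 0.08])

# pi_hat: least-squares regression of beta_hat on Z
pi_hat, *_ = np.linalg.lstsq(Z, beta_hat, rcond=None)

# Prior mean contribution pi^T Z_k for each variable k
prior_mean = Z @ pi_hat
```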
10. Variable connectivity in A matrix
Table: Example A matrix for SNP variables
Variable 1 2 3
1 0 1 0
2 1 0 1
3 0 1 0
11. One approach for populating the A matrix
Table: The Z matrix
    Intercept   Conservation   Missense   eQTL
→   1           20             0          5
    1           10             1          0.01
→   1           5              0          1
    1           10             1          4.1
    1           5              0          1.4
Define entry $A_{1,3}$ as $\mathrm{corr}(Z_{1,\cdot}, Z_{3,\cdot})$,
then dichotomize A
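One way to carry out this step is to correlate the rows of Z and threshold the result. A minimal sketch using the Z table above; the 0.5 cutoff and the function name are assumptions, since the slide does not state how A is dichotomized.

```python
import numpy as np

# Z matrix from the table above (one row per variable)
Z = np.array([
    [1, 20, 0, 5.0],
    [1, 10, 1, 0.01],
    [1, 5, 0, 1.0],
    [1, 10, 1, 4.1],
    [1, 5, 0, 1.4],
])

def adjacency_from_Z(Z, threshold=0.5):
    """Dichotomized row-by-row correlation of Z (threshold is hypothetical)."""
    corr = np.corrcoef(Z)               # entry (j, k) = corr(Z[j, :], Z[k, :])
    A = (corr > threshold).astype(int)  # 1 if the two annotation profiles are similar
    np.fill_diagonal(A, 0)              # a variable is not its own neighbor
    return A

A = adjacency_from_Z(Z)
```

With this cutoff, variables 1 and 3 (the two rows marked with arrows) come out as neighbors because their annotation profiles are strongly correlated.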
12. φk as mean across k’s neighbors
Table: Example A matrix for SNP variables
Variable 1 2 3
1 0 1 0
2 1 0 1
3 0 1 0
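$\beta_k \sim \pi^T Z_k + \phi_k + \theta_k$
$\phi_k \sim N(\bar{\phi}_{-k}, \tau^2 / \nu_k)$
$\bar{\phi}_{-k} = \frac{\sum_{j=1}^{m} \phi_j A_{jk}}{\sum_{j=1}^{m} A_{jk}}$, $\nu_k$ neighbors of variable k
We set $\phi_j = \hat{\beta}_j$
Example: If $\hat{\beta} = (0.2, 0.5, 0.4)$, $\bar{\phi}_{-2} = (0.2 + 0.4)/2 = 0.3$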
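The neighbor mean in the example above can be reproduced with the example A matrix. A minimal sketch; the function name is hypothetical, and it assumes every variable has at least one neighbor so the denominator is nonzero.

```python
import numpy as np

# Example A matrix from the slide (symmetric, zero diagonal)
A = np.array([
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
])
beta_hat = np.array([0.2, 0.5, 0.4])

def neighbor_means(phi, A):
    """phi_bar[k] = sum_j phi[j] * A[j, k] / sum_j A[j, k]."""
    nu = A.sum(axis=0)          # nu_k: number of neighbors of variable k
    return (phi @ A) / nu

# phi_j is set to beta_hat_j, as on the slide
phi_bar = neighbor_means(beta_hat, A)   # approximately [0.5, 0.3, 0.5]
```

The second entry is the average of the estimates for variables 1 and 3, the two neighbors of variable 2.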
14. A reversible jump MCMC algorithm
Propose a swap, addition or deletion of a
variable
Perform reversible jump Metropolis Hastings
step comparing posterior probabilities
$r = \frac{L(Y \mid \beta', X, M')\,P(\beta' \mid Z, \pi, A, \tau, \sigma, M')\,P(M' \to M)}{L(Y \mid \beta, X, M)\,P(\beta \mid Z, \pi, A, \tau, \sigma, M)\,P(M \to M')}$
Accept move with probability min(1, r )
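The accept/reject step can be sketched on the log scale, which avoids underflow when likelihoods are small. This is a simplified illustration, not the authors' implementation: the function and argument names are hypothetical, and any dimension-matching terms a full reversible jump sampler may need are assumed to be folded into the inputs.

```python
import math
import random

def accept_move(log_post_new, log_post_old, log_q_reverse, log_q_forward, rng):
    """Metropolis-Hastings acceptance with probability min(1, r).

    log_post_*    : log likelihood + log prior for proposed (new) and current (old) model
    log_q_forward : log P(M -> M'), probability of proposing the new model
    log_q_reverse : log P(M' -> M), probability of proposing the move back
    """
    log_r = (log_post_new + log_q_reverse) - (log_post_old + log_q_forward)
    accept_prob = math.exp(min(0.0, log_r))   # min(1, r), computed on the log scale
    return rng.random() < accept_prob

rng = random.Random(0)
# Hypothetical log-posterior values for a proposed and a current model
accepted = accept_move(-1040.2, -1042.5, math.log(0.3), math.log(0.2), rng)
```

In this example the proposed model has the higher posterior and the reverse move is more likely than the forward one, so r > 1 and the move is always accepted.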
15. Model transition proposal density
Suppose model $M'$ has 1 newly proposed
variable:
$P(M \to M') = \Phi^{-1}(z_k)$
$z_k \sim N(\mu_k - \mu_{\text{baseline}}, 1)$
The variable-specific tuning parameter µk
A function of the components of β’s prior
standardized by their residual variances
$\mu_k = \frac{|\pi^T Z_k + \tau \bar{\phi}_{-k}|}{\sigma^2 + \nu_k^2}$
Weak empirical support for priors leads to a small
numerator and a large denominator
16. Model transition proposal density
Suppose model $M'$ has 1 newly proposed
variable:
$P(M \to M') = \Phi^{-1}(z_k)$
$z_k \sim N(\mu_k - \mu_{\text{baseline}}, 1)$
The global penalty tuning parameter
Emulate the BIC
$\mathrm{BIC}(M') - \mathrm{BIC}(M) = \chi_1(\ln(n))$
Probability of accepting $M'$ is $F_\chi^{-1}(\ln(n))$
$\mu_{\text{baseline}} = \Phi(F_\chi^{-1}(\ln(n)))$
17. Outline
1. Motivation
2. The algorithm: Incorporating biological priors
into an MCMC sampler
3. Simulation 1: Performance of the method
4. Simulation 2: Detecting interactions in a known
pathway
5. Application to data from a GWAS
6. Future Extensions
18. Using external information to enhance
power and specificity
Disease model: 4 GxG interactions jointly
cause disease through 4 endophenotypes
Genotypes simulated for 14 independent SNPs
$y_{ik} = (1 - b)\,N(s_{ia} \cdot s_{ib}, 1) + b\,U(0, 1)$
b ∼ Bernoulli(p), p is proportion of noise
24 endophenotypes y used only in the prior
Disease status determined using a logistic
model
$\operatorname{logit}(Y_i = 1) = \beta_0 + \beta_1 y_{i,01} + \beta_2 y_{i,02} + \beta_3 y_{i,34} + \beta_4 y_{i,35}$
First 8000 persons reserved as case control
dataset, remaining 2000 for constructing priors
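The endophenotype model on this slide can be sketched as a two-component mixture: a signal driven by the product of two genotypes, replaced by uniform noise with probability p. A minimal sketch; the 0/1/2 genotype coding, the allele frequency of 0.3, and the function name are assumptions not stated on the slide.

```python
import numpy as np

def simulate_endophenotype(s_a, s_b, p, rng):
    """y = (1 - b) * N(s_a * s_b, 1) + b * U(0, 1), with b ~ Bernoulli(p)."""
    s_a = np.asarray(s_a, dtype=float)
    s_b = np.asarray(s_b, dtype=float)
    b = rng.random(s_a.shape) < p                  # noise indicator
    signal = rng.normal(s_a * s_b, 1.0)            # GxG-driven component
    noise = rng.uniform(0.0, 1.0, size=s_a.shape)  # uninformative component
    return np.where(b, noise, signal)

rng = np.random.default_rng(0)
# Hypothetical genotypes: allele counts for 2 SNPs in 10,000 persons
s = rng.binomial(2, 0.3, size=(10_000, 2))
y = simulate_endophenotype(s[:, 0], s[:, 1], p=0.2, rng=rng)
```

Setting p = 0 gives a fully informative endophenotype, while values near 1 make it nearly uncorrelated with the genotypes.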
19. Constructing the Z and the A matrices
Z matrix
Measures correlation between a model variable and
each endophenotype among 2000 individuals in the
prior
$Z_{kq} = \mathrm{corr}(g_k, y_q)$
A matrix
Measures similarity between two variables by
comparing correlation profiles in Z
$A_{jk} = \mathrm{corr}(Z_{jq}, Z_{kq})$
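Both matrices can be built from plain Pearson correlations. A minimal sketch on made-up data; the function names, array shapes, and the simulated G and Y are assumptions for illustration only.

```python
import numpy as np

def build_Z(G, Y):
    """Z[k, q] = corr(g_k, y_q) for variables g_k (columns of G)
    and endophenotypes y_q (columns of Y)."""
    Gs = (G - G.mean(axis=0)) / G.std(axis=0)
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    return Gs.T @ Ys / G.shape[0]

def build_A(Z):
    """A[j, k] = correlation between rows j and k of Z (diagonal set to 0)."""
    A = np.corrcoef(Z)
    np.fill_diagonal(A, 0.0)
    return A

rng = np.random.default_rng(1)
# Hypothetical prior-construction sample: 2000 persons, 5 variables, 4 endophenotypes
G = rng.binomial(2, 0.3, size=(2000, 5)).astype(float)
Y = rng.normal(size=(2000, 4)) + G[:, :4]

Z = build_Z(G, Y)
A = build_A(Z)
```

Two variables with similar correlation profiles across the endophenotypes get a large entry in A even if they are uncorrelated with each other at the genotype level.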
20. Question 1: How do the priors affect
power and specificity?
The A matrix contains information across all
24 endophenotypes
Set up 3 variants of the original Z matrix
4 causal endophenotypes only (noise parameter
p = 0)
4 intermediate endophenotypes only (noise
parameter p = 0.2)
4 weakly correlated endophenotypes only (noise
parameter p = 0.8)
Models tested: both A and Z, no A or Z, A
only, Z only (with 3 variants)
21. Question 1: How do the priors affect
power and specificity?
At RR=1.5, all prior models perform very well
22. Question 1: How do the priors affect
power and specificity?
At RR=1.4, prior models with A, Z, or both
outperform others
23. Question 1: How do the priors affect
power and specificity?
At RR=1.3, prior models with A, Z, or both have
> 5% power
24. Question 1: How do the priors affect
power and specificity?
At RR=1.2, fully informative prior still retains 80%
power
25. Question 1: How do the priors affect
power and specificity?
At RR=1.1, all prior models perform poorly (∼ 55%
power)
26. Question 2: How do the priors affect
posterior estimates (shrinkage)?
Posterior estimates of β vs MLE
27. Question 2: How do the priors affect
posterior estimates (shrinkage)?
Posterior estimates of SE of β vs MLE
28. Question 3: How do the priors improve
rankings?
6,441 interactions tested. 4 causal.
29. Question 3: How do the priors improve
rankings?
513,591 interactions tested. 4 causal.
30. Summary of simulation
Sensitivity analysis
All methods perform well at high RRs
Informative priors improve power at lower RRs but
not at extremely low RRs
Like LASSO, shrinkage improves interpretability
Model averaging can improve robustness of
rankings
31. Outline
1. Motivation
2. The algorithm: Incorporating biological priors
into an MCMC sampler
3. Simulation 1: Performance of the method
4. Simulation 2: Detecting interactions in a known
pathway
5. Application to data from a GWAS
6. Future Extensions
33. Simulated data set
14 genes, 2 environmental variables
8000 individuals in case-control data, remaining
2000 for constructing priors
Used a pathway simulation program to
generate steady-state concentrations
Reed et al J Nutr. 2006 Oct;136(10):2653-61
Enzyme kinetics parameters (Km , Vmax ) genotype
specific
3 mechanisms believed to be related to disease
etiology
Homocysteine concentration
Pyrimidine synthesis
Purine synthesis
34. Estimates of π
Construct Z and A in same manner as previous
simulation:
Z stores genotype-metabolite correlations
A stores dichotomized correlations between rows of
Z
True log relative risk: .18 (RR=1.2)
Simulated second-level coefficients π
mechanism      homocysteine   pyrimidine      purine
homocysteine   0.18 (0.13)    -0.09 (0.536)   0.002 (0.38)
pyrimidine     -0.04 (0.22)   0.22 (0.066)    -0.01 (0.06)
purine         -0.01 (0.36)   0.16 (0.327)    0.19 (0.07)
41. Summary of folate pathway simulation
Pathway knowledge can inform model search
Simulated three plausible disease mechanisms
Effect of causal metabolite on disease revealed
in corresponding element of π
Revealed plausible interactions not found
through a stepwise regression
42. Outline
1. Motivation
2. The algorithm: Incorporating biological priors
into an MCMC sampler
3. Simulation 1: Performance of the method
4. Simulation 2: Detecting interactions in a known
pathway
5. Application to data from a GWAS
6. Future Extensions
43. Using gene annotations to inform a search
for interactions
Proof of concept: GWAS of breast cancer
Publicly available data from NCI
(https://caintegrator.nci.nih.gov/cgems/)
1,145 cases and 1,142 controls of European
ancestry
The 22 Gene Ontology terms from Biological
Process used to define priors in A and Z
Included 6,078 SNPs, where each SNP had GO
annotation and had lowest p-value in gene
45. Enrichment analysis
Are the top interactions (BF > 100) enriched
for certain GO terms?
Compute empiric p-value for enrichment
For each iteration, permute within bins representative of
non-independence in observed interactions
Pool bins, compute frequency of a GO term in the
pool
p-value: number of iterations in which the pooled frequency
exceeded the observed frequency, divided by 1 million
biological regulation (p=.008), growth
(p=1e-6), metabolic process (p=.008), and
regulation of biological process (p=.003).
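The empirical p-value described above is the fraction of permutation iterations whose GO-term frequency exceeds the observed one. A minimal sketch of that final counting step only; the function name and the toy numbers are hypothetical, and the binned permutation scheme itself is not reproduced here.

```python
import numpy as np

def empirical_p_value(observed_freq, null_freqs):
    """Fraction of permutation iterations whose frequency exceeded the observed one."""
    null_freqs = np.asarray(null_freqs)
    return np.count_nonzero(null_freqs > observed_freq) / null_freqs.size

# Toy example: 5 permutation iterations, observed frequency of 3
p_toy = empirical_p_value(3, [1, 2, 3, 4, 5])   # 2 of 5 iterations exceed 3 -> 0.4
```

With 1 million iterations, the smallest nonzero p-value this estimator can return is 1e-6.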
46. Outline
1. Motivation
2. The algorithm: Incorporating biological priors
into an MCMC sampler
3. Simulation 1: Performance of the method
4. Simulation 2: Detecting interactions in a known
pathway
5. Application to data from a GWAS
6. Future Extensions
47. Incorporate gene-expression data into
GWAS analyses
Developing priors
Should be more informative (e.g. empirical) and
granular (e.g. SNP level) than GO
Obtain genotype-expression paired data: HapMap?
Apply WGCNA to infer pathway modules
Genotype-module correlations used in Z matrix
Incorporate more advanced MCMC techniques
Evolutionary Monte Carlo
Multiple-try Metropolis
Brute-force search for MAP. Use MAP for initial
values?
48. Acknowledgements
James Baurley
David Conti
Angela Presson (thanks in advance!)
Funding: R01 ES016813 and R01 ES015090.