Model Uncertainty and Uncertainty Quantification
Merlise Clyde
Duke University
http://stat.duke.edu/~clyde
SAMSI MUMS - August 20, 2018
Uncertainty Quantification

Wikipedia: Uncertainty Quantification (UQ) is the science of
quantitative characterization and reduction of uncertainties . . .

parameter: parameters θ in the model that are unknown
inputs: measurement error in model inputs x
algorithmic: uncertainty induced by approximating the model
experimental: observation error in the response Y at input x
model: structural uncertainty about the model/data generating process
predictive: interpolation or extrapolation of the model at new x

Predictive uncertainty: reducible error + irreducible error

Rumsfeld's "Known Unknowns" versus "Unknown Unknowns"
Model Uncertainty

Entertain a collection of models M = {Mm, m ∈ M}.

Each model corresponds to a parametric (although possibly
infinite-dimensional) distribution of the data Y:

    pm(y | θm, x) = p(y | θm, Mm, x)

where θm corresponds to unknown parameters in the distribution
for Y under Mm.

Objective: obtain predictive distributions or summaries at
inputs x∗:

    p(y∗ | y, x, x∗)

WLOG drop dependence on inputs: p(y∗ | y).
Multiple Models
Bayesian Perspectives on Model Uncertainty

M-Closed: the true data generating model MT is one of the
Mm ∈ M but is unknown to researchers.

M-Complete: the true model MT exists but is not included in the
model list M. We still wish to use the models in M because of
tractability of computation or communication of results,
compared with the actual belief model.

M-Open: we know the true model MT is not in M, but we cannot
specify the explicit form p(y∗ | y) because it is too difficult
conceptually or computationally, we lack the time to do so, or
do not have the expertise, etc.

Bernardo & Smith (1994), Clyde & Iversen (2013)
Predictive Distributions under M-Closed

A Bayesian would assign a prior probability, p(Mm), representing
their belief that each model Mm is the true model.

Distributions p(θm | Mm) characterize a priori uncertainty about
the parameters within each model.

Bayes Theorem gives the posterior probability of each model:

    p(Mm | Y) = p(Y | Mm) p(Mm) / Σ_{m ∈ M} p(Y | Mm) p(Mm),  m ∈ M

where p(Y | Mm) = ∫ p(Y | θm, Mm) p(θm | Mm) dθm.

Predictive distribution:

    p(y∗ | y) = Σ_{m ∈ M} p(y∗ | Mm, y) p(Mm | y)
              = Σ_{m ∈ M} [ ∫ p(y∗ | Mm, θm, y) p(θm | y, Mm) dθm ] p(Mm | y)
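As a numerical illustration of these formulas (the marginal likelihoods and per-model predictive means below are hypothetical numbers, since the slide leaves the models abstract), the posterior model probabilities and the resulting model-averaged predictive mean can be computed directly:

```python
import numpy as np

# Hypothetical marginal likelihoods p(Y | Mm) and prior probabilities p(Mm)
# for three candidate models (illustrative numbers only).
marginal_lik = np.array([2.0e-4, 5.0e-4, 3.0e-4])
prior_prob = np.array([1 / 3, 1 / 3, 1 / 3])

# Bayes Theorem: posterior model probabilities
unnorm = marginal_lik * prior_prob
post_prob = unnorm / unnorm.sum()

# Per-model predictive means E[Y* | Mm, y] (again hypothetical)
pred_means = np.array([1.2, 1.5, 0.9])

# BMA predictive mean: mixture of per-model means weighted by p(Mm | Y)
bma_mean = np.sum(post_prob * pred_means)

print(post_prob)  # with a uniform prior, weights are proportional to marginal_lik
print(bma_mean)
```

With a uniform prior the prior probabilities cancel, so the posterior weights here are just the normalized marginal likelihoods.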
Estimation and Prediction

Consider the decision problem of estimation/prediction under
squared error loss:

    u(Y∗, a) = −(Y∗ − a)²

where a is a possible action (u is utility, i.e., negative loss)
and Y∗ is an unknown.

From a Bayesian perspective, the solution is to find the action
that maximizes expected utility given the observed data Y:

    E_{Y∗|Y}[u(Y∗, a)] = −∫ (y∗ − a)² p(y∗ | y) dy∗

where the expectation is taken with respect to the predictive
distribution of Y∗ given the observed data y.
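A quick Monte Carlo check of this decision problem (using a hypothetical Gaussian predictive distribution): under squared error, the action maximizing expected utility is the predictive mean.

```python
import numpy as np

rng = np.random.default_rng(5)
# Draws from a (hypothetical) predictive distribution p(y* | y)
y_star = rng.normal(loc=2.0, scale=1.5, size=100_000)

def expected_utility(a):
    # Monte Carlo estimate of -E[(Y* - a)^2] under the predictive draws
    return -np.mean((y_star - a) ** 2)

# Maximize expected utility over a grid of candidate actions
grid = np.linspace(0, 4, 401)
best = grid[np.argmax([expected_utility(a) for a in grid])]
print(best, y_star.mean())  # the optimal action sits at the predictive mean
```

This is just the familiar fact that E[(Y∗ − a)²] = Var(Y∗) + (a − E[Y∗])², minimized at a = E[Y∗].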
Bayesian Model Averaging

Under the M-Closed perspective, the optimal solution for
prediction is Bayesian Model Averaging:

    a∗ = E_{Y∗}[Y∗ | Y] = Σ_{m ∈ M} p(Mm | Y) Ŷ∗_{Mm}

where Ŷ∗_{Mm} is the predictive mean of Y∗ under model Mm.

Use the joint posterior distribution of θ | M and M to obtain
prediction intervals.

Full propagation of all "known" uncertainties.

Extensive literature for regression and generalized linear
models [Hoeting et al. 1999, Clyde & George 2004, Bayarri et
al. 2012] with invariant priors/spike-and-slab priors, plus
software; more complex models via RJ-MCMC, SMC, ABC.
Potential Problem with BMA

Two models in M:

    M1: Y = X1 β1 + e
    M2: Y = X2 β2 + e

True model: Y = X1 β1T + X2 β2T + e

BMA: Ŷ∗ = p(M1 | Y) X1 β̂1 + p(M2 | Y) X2 β̂2

BMA model weights converge to 1 for the model that is "closest"
to the true model in Kullback-Leibler divergence, so BMA
eventually uses predictions from that model alone.

In the limit, BMA is not consistent if MT ∉ M.
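This concentration of the weights can be sketched numerically. The toy below approximates posterior model probabilities with BIC weights (a rough stand-in for marginal likelihoods, and an assumption of this sketch): the data are generated from a model outside {M1, M2}, yet the weight on the KL-closer model still tends to 1 as n grows.

```python
import numpy as np

rng = np.random.default_rng(6)

def bic_weights(y, X_list):
    # Approximate posterior model probabilities via normalized BIC weights
    # exp(-BIC/2) -- a crude stand-in for the marginal likelihoods p(Y | Mm)
    n = len(y)
    bics = []
    for X in X_list:
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = np.sum((y - X @ beta) ** 2)
        bics.append(n * np.log(rss / n) + X.shape[1] * np.log(n))
    w = np.exp(-(np.array(bics) - min(bics)) / 2)
    return w / w.sum()

for n in (50, 500, 5000):
    x1 = rng.normal(size=(n, 1))
    x2 = rng.normal(size=(n, 1))
    # True model uses BOTH predictors; neither candidate model does
    y = (1.0 * x1 + 0.4 * x2 + rng.normal(size=(n, 1))).ravel()
    w = bic_weights(y, [x1, x2])
    print(n, np.round(w, 3))
# The weight on M1 (the KL-closer model here) tends to 1 as n grows,
# even though M1 is not the true data generating model.
```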
Potential Solutions

Expand the list of models (prior specification on M):

    M1: Y = X1 β1 + e
    M2: Y = X2 β2 + e
    M3: Y = X1 β1 + X2 β2 + e

Expand the list of models: frames or overcomplete dictionaries
(kernel regressions, boosting, BART, deep learning)

Flexible nonparametric or infinite-dimensional models:
theoretical properties, computational challenges, choice of priors

Other model ensembles?
Combining Models as a Decision Problem

In the M-Complete or M-Open viewpoints, if MT is not in the list
of models M, then p(Mm) = 0 for all Mm ∈ M.

George Box: "All models are wrong, but some may be useful."

    a(w, Y) = Σ_m wm Ŷ∗_m

Treat the weights {wm, m ∈ M} as part of the action space
(rather than an unknown) and maximize posterior expected utility:

    E_{Y∗|y,MT}[u(Y∗, a(w, y))] = ∫ u(y∗, a(w, y)) p(y∗ | y, MT) dy∗

For negative squared error:

    −E_{Y∗|y,MT}[(Y∗ − a(w, y))²] = −∫ (y∗ − Σ_m wm Ŷ∗_m)² p(y∗ | y, MT) dy∗

Focus on the M-Open case.
M-Open Predictive Distribution

No explicit form for p(y∗ | y, MT) under MT.

Partition the data: Y = (Yj, Y(−j)), where
    Yj is a proxy for Y∗ (the future observation(s)), and
    Y(−j) is a proxy for Y (the observed data).

Randomly select J partitions; then

    ∫ u(y∗, a(w, y)) p(y∗ | y, MT) dy∗ ≈ (1/J) Σ_{j=1}^{J} u(Yj, a(w, Y(−j)))

Key et al. and Clyde & Iversen give a justification of
cross-validation to approximate expected posterior utility.

Gutierrez-Peña & Walker: approximation via a (limiting)
Dirichlet process model for estimating the unknown distribution
F for MT,

    ∫ u(y∗, a∗(w, y)) dFn(y∗) → (1/n) Σ_{i=1}^{n} u(yi, a∗(w, Y(−i)))
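A minimal sketch of the leave-one-out version of this approximation, with hypothetical data and two deliberately simple point-prediction "models":

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=40)  # observed data (proxy pool)

def predict_m1(y_train):
    # Model 1: predict the training-sample mean
    return y_train.mean()

def predict_m2(y_train):
    # Model 2: predict a fixed value (deliberately misspecified)
    return 0.0

def loo_expected_utility(y, predict, utility=lambda yj, a: -(yj - a) ** 2):
    # Leave-one-out: each held-out y_j proxies Y*, the rest proxy Y
    n = len(y)
    total = 0.0
    for j in range(n):
        y_minus_j = np.delete(y, j)
        total += utility(y[j], predict(y_minus_j))
    return total / n

u1 = loo_expected_utility(y, predict_m1)
u2 = loo_expected_utility(y, predict_m2)
print(u1, u2)  # model 1 scores better (less negative) on these data
```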
Optimization Problem under Approximation

Find weights:

    ŵ = argmax_w −(1/J) Σ_{j=1}^{J} (Yj − Σ_{m ∈ M} wm Ŷ(−j),Mm)²

Constrained solution: find weights

    ŵ = argmax_w −(1/J) Σ_{j=1}^{J} (Yj − Σ_{m ∈ M} wm Ŷ(−j),Mm)²

    subject to  Σ_{m ∈ M} wm = 1  and  wm ≥ 0 ∀ m ∈ M

Equivalent representation (Lagrangian):

    −(1/J) Σ_{j=1}^{J} (Yj − Σ_{m ∈ M} wm Ŷ(−j),Mm)²
        − λ0 (Σ_{m ∈ M} wm − 1) + Σ_{m ∈ M} λm wm
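Assuming the cross-validated predictions are already in hand, the constrained problem can be solved with an off-the-shelf optimizer (here SciPy's SLSQP; the data below are simulated purely for illustration):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
# Hypothetical cross-validated predictions: J held-out points, M = 3 models,
# each model's prediction = truth + noise of a different scale
J, M = 50, 3
y_true = rng.normal(size=J)
preds = np.column_stack([y_true + rng.normal(scale=s, size=J)
                         for s in (0.5, 1.0, 2.0)])

def neg_utility(w):
    # Average squared error of the weighted combination (negative utility)
    return np.mean((y_true - preds @ w) ** 2)

# Constrained stacking: weights non-negative and summing to one
res = minimize(neg_utility, x0=np.full(M, 1 / M), method="SLSQP",
               bounds=[(0.0, None)] * M,
               constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1}])
w_hat = res.x
print(np.round(w_hat, 3))  # most weight on the lowest-noise model
```

Under the sum-to-one constraint the combined residual is a weighted sum of the individual model noises, so the optimizer favors the low-variance members, roughly in proportion to inverse MSE when the noises are independent.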
Comments on Solutions

Let e = [e]ji = Yj − Ŷ(−j),Mi denote the n × M matrix of
residuals for predicting Yj under model Mi.

With the sum-to-one constraint alone, ŵ ∝ (eᵀe)⁻¹ 1.

If the residuals from the models are uncorrelated, then the
weights are proportional to the inverse of the cross-validation
MSE for model Mi, MSEi = Σ_j e²ji.

With highly correlated predictions/residuals, the weights may be
negative and highly unstable.

The non-negativity (lasso-like) constraint stabilizes the
weights and may drive the weights to 0 for similar models.

Provides a Bayesian justification for classical stacking
(Wolpert 1992, Breiman 1996).

Super-Learners! h2oEnsemble
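The closed form ŵ ∝ (eᵀe)⁻¹1 and its instability under correlated residuals can be checked on simulated residual matrices (hypothetical numbers, a rough sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
n, M = 100, 3

# Nearly uncorrelated residuals with different scales per model
e = rng.normal(size=(n, M)) * np.array([0.5, 1.0, 2.0])

# Sum-to-one constraint alone: w_hat proportional to (e'e)^{-1} 1
w = np.linalg.solve(e.T @ e, np.ones(M))
w_hat = w / w.sum()

# Compare with weights proportional to inverse cross-validation MSE
mse = (e ** 2).mean(axis=0)
w_inv_mse = (1 / mse) / (1 / mse).sum()
print(np.round(w_hat, 3), np.round(w_inv_mse, 3))

# Highly correlated residuals: the same formula can produce negative,
# unstable weights, motivating the non-negativity constraint
base = rng.normal(size=n)
e_corr = np.column_stack([base + 0.05 * rng.normal(size=n) for _ in range(M)])
w_corr = np.linalg.solve(e_corr.T @ e_corr, np.ones(M))
print(np.round(w_corr / w_corr.sum(), 2))  # may fall well outside [0, 1]
```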
Ovarian Cancer Example

Predict short- vs. long-term survival given the primary tumor's
molecular phenotype.

Retrospective sample of survivors of advanced-stage serous
ovarian cancer:
    n = 30 short-term (< 3 years)
    n = 24 long-term (> 7 years)

Eleven early-stage (I/II) cases for external validation.

Affymetrix U133A expression microarray; 22,283 genes.

6 clinical variables.
Models for M-Open Model Averaging (MOMA)

Three classes of models:

Clinical Trees (5 models): prospective classification and
regression tree models using only clinical variables such as
age, post-treatment CA125 levels, etc.

Expression Trees (4 models): prospective classification and
regression tree models using only expression data.

Expression LDA (4 models): retrospective discriminant models
built using expression data given survival.

Find the MOMA (M-Open Model Averaging) weights ŵ using all
long-term and short-term survivors, under the sum-to-one
constraint, with and without the added non-negativity constraint.
MOMA Weights

[Figure: MOMA weights for the Clinical Tree, Expression Tree,
and LDA models under the Sum=1, Non-Neg + Sum=1, and expanded
Non-Neg + Sum=1 settings; the sign-unconstrained weights span
roughly −40 to 40, while the non-negative weights lie in [0, 1].]
Validation Experiment

5-fold cross-validation; 5 splits of the data into two groups:
training Y and validation Y∗.

Use the training data to obtain model weights ŵ via LOO.

Construct M-Open Model Averaging (MOMA) estimates of the
probability of long-term survival, p̂j = Σ_i ŵi Ŷ∗_Mi(Y), for
the validation samples.

Classify as a long-term survivor if p̂j ≥ 1/2.

Compute classification accuracy over the 5 splits.
MOMA with Constraints

            set1  set2  set3  set4  set5
clin1       0.00  0.07  0.00  0.00  0.00
clin2       0.00  0.00  0.00  0.00  0.00
clin3       0.00  0.00  0.00  0.00  0.00
clin4       0.00  0.11  0.00  0.00  0.00
clin5       0.30  0.17  0.07  0.41  0.00
tree1       0.00  0.00  0.00  0.00  0.77
tree2       0.00  0.00  0.00  0.00  0.21
tree3       0.23  0.44  0.21  0.00  0.01
tree4       0.00  0.00  0.00  0.00  0.01
lda100.P1   0.00  0.00  0.00  0.00  0.00
lda100.P2   0.22  0.00  0.30  0.00  0.00
lda200.P1   0.00  0.00  0.00  0.00  0.00
lda200.P2   0.26  0.21  0.41  0.58  0.00
Accuracy    0.82  0.73  0.55  0.73  0.60
Comments

Can relax the sum-to-one constraint (overshrinkage).

Justification for the constraints?

Other decision problems lead to other weights.

Uncertainty?
More General Utilities - Yao et al (2018)

Under negative squared error loss, we only obtain optimal point
predictions.

Stacking probabilistic forecasts P ∈ P:

    a(w, y) = Σ_m wm p(y∗ | y, Mm)

Proper scoring rules: S(Q, Q) ≥ S(P, Q) for P, Q ∈ P, where

    S(P, Q) ≡ ∫ S(P, ω) dQ(ω)

Strictly proper: S(Q, Q) ≥ S(P, Q) with equality only when
P = Q almost surely.

Negative quadratic loss is a proper, but not strictly proper,
scoring rule.

Logarithmic score: S(P, y∗) = log(p(y∗))
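Stacking under the logarithmic score can be sketched as follows, assuming the leave-one-out predictive densities p(y_i | y_(−i), Mm) have already been computed (here they are simulated Gaussians; the softmax parametrization is just one convenient way to enforce the simplex constraint):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)
# Hypothetical LOO predictive densities for n points under M models;
# each model is a Gaussian predictive with a different (fixed) center
n, M = 60, 3
y = rng.normal(loc=1.0, size=n)
centers = np.array([1.0, 0.0, 3.0])
dens = norm.pdf(y[:, None], loc=centers[None, :], scale=1.0)  # n x M

def neg_log_score(z):
    # Softmax keeps the weights non-negative and summing to one
    w = np.exp(z - z.max())
    w = w / w.sum()
    # Average log score of the stacked predictive mixture
    return -np.mean(np.log(dens @ w))

res = minimize(neg_log_score, x0=np.zeros(M), method="Nelder-Mead")
w_hat = np.exp(res.x - res.x.max())
w_hat = w_hat / w_hat.sum()
print(np.round(w_hat, 3))  # concentrates on the well-centered model
```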
Probabilistic Forecasting

Ensemble BMA (Raftery and coauthors, 2005, . . . ) for weather
forecasting:

    argmax_{w, σ²} Σ_i log( Σ_{m ∈ M} wm p(y∗i | y, Mm, σ²) )

    pm: Gaussian distributions centered at am + bm Ŷ∗m
    allows for bias correction and calibration of computer model output
    common unknown variance σ² in each component
    weights evolve with time
    multivariate outcomes

West and coauthors: Dynamic Linear Models (economic forecasting)
with dynamic weights.

Gaussian Process emulators for computer models and statistical
models.
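A bare-bones EM sketch of the ensemble-BMA mixture (simulated forecasts; the bias-correction terms am + bm are omitted here for brevity, so this is only a skeleton of the Raftery et al. procedure, not the full method):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
# Hypothetical member forecasts f[i, m] for n verification days, M members
n, M = 200, 3
truth = rng.normal(size=n)
f = truth[:, None] + rng.normal(scale=[0.3, 0.6, 1.5], size=(n, M))

# EM for the mixture weights w and the common variance sigma^2 in
# y_i ~ sum_m w_m N(f[i, m], sigma^2)
w = np.full(M, 1 / M)
sigma2 = 1.0
for _ in range(200):
    # E-step: responsibility of member m for observation i
    dens = norm.pdf(truth[:, None], loc=f, scale=np.sqrt(sigma2))
    z = w * dens
    z /= z.sum(axis=1, keepdims=True)
    # M-step: update weights and the shared variance
    w = z.mean(axis=0)
    sigma2 = np.sum(z * (truth[:, None] - f) ** 2) / n
print(np.round(w, 3), round(float(sigma2), 3))
```

The sharper ensemble members end up with the larger mixture weights, mirroring the stacking behavior on the earlier slides.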
Discussion
Model Averaging for Uncertainty Quantification under
different perspectives
Dependence on Choice of Utility Functions
squared error loss - point estimates
proper scoring rules - distributions (global)
select quantiles (multiple objectives)
Mixture Models and Mixtures of Experts
Partitions of data for approximation? LOO, k-fold, sequential
Allow weights to Depend on Inputs
Incorporation of Model Complexity/Regularization in Utility
(sum to one?)
Optimization: Quadratic programming, EM, variational, ABC
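The quadratic-programming route to ensemble weights mentioned above can be sketched for the two-model case, where the stacking weight minimizing held-out squared error over the simplex has a closed form. All responses and predictions below are hypothetical numbers invented for illustration.

```python
import numpy as np

# Stacking sketch for two models (all numbers hypothetical): choose
# w in [0, 1] minimizing || y - (w*p1 + (1-w)*p2) ||^2 over held-out
# predictions. With two models this quadratic program has a closed form:
# the unconstrained minimizer, clipped to the unit interval.
y  = np.array([1.0, 2.0, 3.0, 4.0])   # held-out responses
p1 = np.array([1.1, 1.9, 3.2, 3.8])   # model 1 held-out predictions
p2 = np.array([0.5, 2.5, 2.5, 4.5])   # model 2 held-out predictions

d = p1 - p2
w = float(np.clip(np.dot(y - p2, d) / np.dot(d, d), 0.0, 1.0))
ensemble = w * p1 + (1 - w) * p2       # weights (w, 1-w): nonnegative, sum to one
print(w)
```

Because w = 0 and w = 1 are both feasible, the optimized convex combination can do no worse in held-out squared error than either model alone; with more than two models the same problem is handed to a quadratic-programming solver.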


MUMS Opening Workshop - Model Uncertainty and Uncertainty Quantification - Merlise Clyde, August 20, 2018

  • 1. Model Uncertainty and Uncertainty Quantification Merlise Clyde Duke University http://stat.duke.edu/∼clyde SAMSI MUMS - August 20, 2018
  • 2. Uncertainty Quantification Wikipedia: Uncertainty Quantification (UQ) is the science of quantitative characterizing and reduction of uncertainties . . .
  • 3. Uncertainty Quantification Wikipedia: Uncertainty Quantification (UQ) is the science of quantitative characterizing and reduction of uncertainties . . . parameter parameters θ in the model that are unknown
  • 4. Uncertainty Quantification Wikipedia: Uncertainty Quantification (UQ) is the science of quantitative characterizing and reduction of uncertainties . . . parameter parameters θ in the model that are unknown inputs measurement error in model inputs x
  • 5. Uncertainty Quantification Wikipedia: Uncertainty Quantification (UQ) is the science of quantitative characterizing and reduction of uncertainties . . . parameter parameters θ in the model that are unknown inputs measurement error in model inputs x algorithmic induced by approximating model
  • 6. Uncertainty Quantification Wikipedia: Uncertainty Quantification (UQ) is the science of quantitative characterizing and reduction of uncertainties . . . parameter parameters θ in the model that are unknown inputs measurement error in model inputs x algorithmic induced by approximating model experimental observation error in response Y at input x
  • 7. Uncertainty Quantification Wikipedia: Uncertainty Quantification (UQ) is the science of quantitative characterizing and reduction of uncertainties . . . parameter parameters θ in the model that are unknown inputs measurement error in model inputs x algorithmic induced by approximating model experimental observation error in response Y at input x model structural uncertainty about the model/data generating process
  • 8. Uncertainty Quantification Wikipedia: Uncertainty Quantification (UQ) is the science of quantitative characterizing and reduction of uncertainties . . . parameter parameters θ in the model that are unknown inputs measurement error in model inputs x algorithmic induced by approximating model experimental observation error in response Y at input x model structural uncertainty about the model/data generating process predictive interpolation or extrapolation of model at new x
  • 9. Uncertainty Quantification Wikipedia: Uncertainty Quantification (UQ) is the science of quantitative characterizing and reduction of uncertainties . . . parameter parameters θ in the model that are unknown inputs measurement error in model inputs x algorithmic induced by approximating model experimental observation error in response Y at input x model structural uncertainty about the model/data generating process predictive interpolation or extrapolation of model at new x Predictive uncertainty: reducible error + irreducible error
  • 10. Uncertainty Quantification Wikipedia: Uncertainty Quantification (UQ) is the science of quantitative characterizing and reduction of uncertainties . . . parameter parameters θ in the model that are unknown inputs measurement error in model inputs x algorithmic induced by approximating model experimental observation error in response Y at input x model structural uncertainty about the model/data generating process predictive interpolation or extrapolation of model at new x Predictive uncertainty: reducible error + irreducible error Rumsfeld’s “Known Unknowns” versus “Unknown Unknowns”
  • 11. Model Uncertainty Entertain a collection of models M = {Mm, m ∈ M}
  • 12. Model Uncertainty Entertain a collection of models M = {Mm, m ∈ M} Each model corresponds to a parametric (although possibly infinite-dimensional) distribution of the data Y: pm(y | θm, x) = p(y | θm, Mm, x) where θm corresponds to unknown parameters in the distribution for Y under Mm
  • 13. Model Uncertainty Entertain a collection of models M = {Mm, m ∈ M} Each model corresponds to a parametric (although possibly infinite-dimensional) distribution of the data Y: pm(y | θm, x) = p(y | θm, Mm, x) where θm corresponds to unknown parameters in the distribution for Y under Mm Objective: Obtain predictive distributions or summaries at inputs x∗ p(y∗ | y, x, x∗ )
  • 14. Model Uncertainty Entertain a collection of models M = {Mm, m ∈ M} Each model corresponds to a parametric (although possibly infinite-dimensional) distribution of the data Y: pm(y | θm, x) = p(y | θm, Mm, x) where θm corresponds to unknown parameters in the distribution for Y under Mm Objective: Obtain predictive distributions or summaries at inputs x∗ p(y∗ | y, x, x∗ ) WLOG drop dependence on inputs, p(y∗ | y)
  • 16. Bayesian Perspectives on Model Uncertainty M-Closed the true data generating model MT is one of Mm ∈ M but is unknown to researchers
  • 17. Bayesian Perspectives on Model Uncertainty M-Closed the true data generating model MT is one of Mm ∈ M but is unknown to researchers M-Complete the true model MT exists but is not included in the model list M. We still wish to use the models in M because of tractability of computations or communication of results, compared with the actual belief model
  • 18. Bayesian Perspectives on Model Uncertainty M-Closed the true data generating model MT is one of Mm ∈ M but is unknown to researchers M-Complete the true model MT exists but is not included in the model list M. We still wish to use the models in M because of tractability of computations or communication of results, compared with the actual belief model M-Open we know the true model MT is not in M, but we cannot specify the explicit form p(y∗ | y) because it is too difficult conceptually or computationally, we lack time to do so, or do not have the expertise, etc. Bernardo & Smith (1994), Clyde & Iversen (2013)
  • 19. Predictive Distributions under M-Closed A Bayesian would assign a prior probability, p(Mm), representing their belief that each model Mm is the true model.
  • 20. Predictive Distributions under M-Closed A Bayesian would assign a prior probability, p(Mm), representing their belief that each model Mm is the true model. Distributions p(θm | Mm) characterizing a priori uncertainty
  • 21. Predictive Distributions under M-Closed A Bayesian would assign a prior probability, p(Mm), representing their belief that each model Mm is the true model. Distributions p(θm | Mm) characterizing a priori uncertainty. Bayes Theorem: posterior probability of each model p(Mm | Y) = p(Y | Mm) p(Mm) / Σ_{m ∈ M} p(Y | Mm) p(Mm), for m ∈ M, where the marginal likelihood is p(Y | Mm) = ∫ p(Y | θm, Mm) p(θm | Mm) dθm
  • 22. Predictive Distributions under M-Closed A Bayesian would assign a prior probability, p(Mm), representing their belief that each model Mm is the true model. Distributions p(θm | Mm) characterizing a priori uncertainty. Bayes Theorem: posterior probability of each model p(Mm | Y) = p(Y | Mm) p(Mm) / Σ_{m ∈ M} p(Y | Mm) p(Mm), for m ∈ M, where the marginal likelihood is p(Y | Mm) = ∫ p(Y | θm, Mm) p(θm | Mm) dθm. Predictive distribution: p(y∗ | y) = Σ_{m ∈ M} p(y∗ | Mm, y) p(Mm | y) = Σ_{m ∈ M} [ ∫ p(y∗ | Mm, θm, y) p(θm | y, Mm) dθm ] p(Mm | y)
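The posterior model probabilities in slides 21–22 can be computed directly once the marginal likelihoods are in hand. A minimal sketch with hypothetical log marginal likelihoods for three candidate models, working on the log scale for numerical stability:

```python
import numpy as np

# Hypothetical log marginal likelihoods log p(Y | M_m) for three candidate
# models, e.g. obtained by integrating out theta_m under each model's prior.
log_marglik = np.array([-102.3, -100.1, -104.8])
prior = np.array([1/3, 1/3, 1/3])   # p(M_m): uniform over the model list

# Bayes' theorem on the log scale (subtract the max before exponentiating
# to avoid underflow, then normalize).
log_post = np.log(prior) + log_marglik
log_post -= log_post.max()
post = np.exp(log_post)
post /= post.sum()                  # p(M_m | Y), sums to one

print(post)
```

Here the middle model, with the largest marginal likelihood, receives most but not all of the posterior mass; the remaining weight is exactly what BMA propagates into the predictive distribution.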
  • 23. Estimation and Prediction Consider the decision problem of estimation/prediction under squared error loss u(Y∗, a) = −(Y∗ − a)², where a is a possible action (u is utility or negative loss) and Y∗ is an unknown.
  • 24. Estimation and Prediction Consider the decision problem of estimation/prediction under squared error loss u(Y∗, a) = −(Y∗ − a)², where a is a possible action (u is utility or negative loss) and Y∗ is an unknown. From a Bayesian perspective, the solution is to find the action that maximizes expected utility given the observed data Y: E_{Y∗|Y}[u(Y∗, a)] = −∫ (y∗ − a)² p(y∗ | y) dy∗, where the expectation is taken with respect to the predictive distribution of Y∗ given the observed data y.
  • 25. Bayesian Model Averaging Under the M-closed perspective, the optimal solution for prediction is Bayesian Model Averaging: a∗ = E[Y∗ | Y] = Σ_{m ∈ M} p(Mm | Y) Ŷ∗_Mm, where Ŷ∗_Mm is the predictive mean of Y∗ under model Mm
  • 26. Bayesian Model Averaging Under the M-closed perspective, the optimal solution for prediction is Bayesian Model Averaging: a∗ = E[Y∗ | Y] = Σ_{m ∈ M} p(Mm | Y) Ŷ∗_Mm, where Ŷ∗_Mm is the predictive mean of Y∗ under model Mm. Use the joint posterior distribution on θ | M and M to obtain prediction intervals
  • 27. Bayesian Model Averaging Under the M-closed perspective, the optimal solution for prediction is Bayesian Model Averaging: a∗ = E[Y∗ | Y] = Σ_{m ∈ M} p(Mm | Y) Ŷ∗_Mm, where Ŷ∗_Mm is the predictive mean of Y∗ under model Mm. Use the joint posterior distribution on θ | M and M to obtain prediction intervals. Full propagation of all “known” uncertainties
  • 28. Bayesian Model Averaging Under the M-closed perspective, the optimal solution for prediction is Bayesian Model Averaging: a∗ = E[Y∗ | Y] = Σ_{m ∈ M} p(Mm | Y) Ŷ∗_Mm, where Ŷ∗_Mm is the predictive mean of Y∗ under model Mm. Use the joint posterior distribution on θ | M and M to obtain prediction intervals. Full propagation of all “known” uncertainties. Extensive literature for regression and generalized linear models [Hoeting et al 1999, Clyde & George 2004, Bayarri et al 2012] with invariant priors/spike & slab, plus software; more complex models via RJ-MCMC, SMC, ABC
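A numerical sketch of the BMA summaries above, with hypothetical posterior model probabilities and per-model predictive means and variances. The BMA predictive mean is the weighted average of the model-specific means, and the BMA predictive variance decomposes into a within-model term plus a between-model term, which is how model uncertainty enters the prediction intervals:

```python
import numpy as np

# All numbers hypothetical, for illustration only.
post_prob  = np.array([0.70, 0.25, 0.05])  # p(M_m | Y)
yhat_model = np.array([3.1, 2.8, 4.0])     # predictive mean of Y* under each M_m
var_model  = np.array([0.5, 0.6, 0.4])     # predictive variance under each M_m

# BMA point prediction: posterior-probability-weighted average of model means.
yhat_bma = post_prob @ yhat_model

# BMA predictive variance: within-model variance plus between-model spread.
var_bma = post_prob @ (var_model + (yhat_model - yhat_bma) ** 2)

print(yhat_bma, var_bma)
```

Note that var_bma exceeds the averaged within-model variance whenever the models disagree: ignoring model uncertainty (conditioning on a single model) understates the predictive variance.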
• 34. Potential Problem with BMA
Two models in M:
M1: Y = X1 β1 + e    M2: Y = X2 β2 + e
True model: Y = X1 β1T + X2 β2T + e
BMA: Ŷ* = p(M1 | Y) X1 β̂1 + p(M2 | Y) X2 β̂2
- BMA model weights converge to 1 for the model that is "closest" to the true model in Kullback–Leibler divergence
- In the limit, BMA uses predictions only from that model
- BMA is not consistent if MT ∉ M
• 43. Potential Solutions
- Expand the list of models (prior specification on M):
  M1: Y = X1 β1 + e    M2: Y = X2 β2 + e    M3: Y = X1 β1 + X2 β2 + e
- Expand the list of models: frames or overcomplete dictionaries
- Kernel regression, boosting, BART, deep learning
- Flexible nonparametric or infinite-dimensional models
- Theoretical properties
- Computational challenges
- Choice of priors
- Other model ensembles?
• 49. Combining Models as a Decision Problem
Under the M-complete or M-open viewpoints, if MT is not in the list of models M, then p(Mm) = 0 for Mm ∈ M.
George Box: "All models are wrong, but some may be useful."
a(w, Y) = Σ_m w_m Ŷ*_m
Treat the weights {w_m, m ∈ M} as part of the action space (rather than as unknowns) and maximize posterior expected utility:
E_{Y*|y,MT}[u(Y*, a(w, y))] = ∫ u(y*, a(w, y)) p(y* | y, MT) dy*
For negative squared error:
−E_{Y*|y,MT}[(Y* − a(w, y))²] = − ∫ (y* − Σ_m w_m Ŷ*_m)² p(y* | y, MT) dy*
Focus on the M-open case.
• 56. M-Open Predictive Distribution
- No explicit form for p(y* | y, MT) under MT
- Partition the data: Y = (Yj, Y(−j))
- Yj is a proxy for Y* (the future observation(s))
- Y(−j) is a proxy for Y (the observed data)
- Randomly select J partitions:
  ∫ u(Y*, a(w, y)) p(y* | y, MT) dy* ≈ (1/J) Σ_{j=1}^{J} u(Yj, a(w, Y(−j)))
- Key et al. and Clyde & Iversen justify cross-validation as an approximation to expected posterior utility.
- Gutierrez-Peña & Walker: approximation via a (limiting) Dirichlet process model for the unknown distribution F under MT:
  ∫ u(y*, a*(w, y)) dFn(y*) → (1/n) Σ_{i=1}^{n} u(yi, a*(w, Y(−i)))
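The cross-validation approximation above can be sketched as a small helper that averages utility over leave-one-out splits, using the held-out point as the proxy for Y*. A minimal sketch; the function names and the toy "training-mean" model are illustrative assumptions, not part of the talk.

```python
import numpy as np

def loo_expected_utility(y, predict_fn, utility):
    """Approximate expected utility by averaging u(Y_j, a(w, Y_(-j)))
    over leave-one-out partitions of the observed data.

    predict_fn(train_idx, test_idx) forms the action from y[train_idx]
    and returns the prediction for the held-out point.
    """
    n = len(y)
    total = 0.0
    for j in range(n):
        train = [i for i in range(n) if i != j]
        a = predict_fn(train, [j])
        total += utility(y[j], a)
    return total / n

# Toy example: the "model" predicts the training mean; the utility is
# negative squared error, as in the slides.
y = np.array([1.0, 2.0, 3.0])
mean_model = lambda train, test: np.mean(y[train])
neg_sq = lambda y_j, a: -(y_j - a) ** 2
print(loo_expected_utility(y, mean_model, neg_sq))  # -> -1.5
```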
• 59. Optimization Problem under Approximation
Find weights
ŵ = arg max_w − (1/J) Σ_{j=1}^{J} (Yj − Σ_{m∈M} w_m Ŷ_{(−j),Mm})²
Constrained solution: find weights
ŵ = arg max_w − (1/J) Σ_{j=1}^{J} (Yj − Σ_{m∈M} w_m Ŷ_{(−j),Mm})²
subject to Σ_m w_m = 1 and w_m ≥ 0 for all m ∈ M.
Equivalent (Lagrangian) representation:
− (1/J) Σ_{j=1}^{J} (Yj − Σ_{m∈M} w_m Ŷ_{(−j),Mm})² − λ0 (Σ_m w_m − 1) + Σ_m λ_m w_m
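The constrained problem above is a quadratic program over the probability simplex. One of several ways to solve it (not necessarily the talk's method) is projected gradient descent, sketched here with illustrative toy data.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def stacking_weights(Y, P, n_iter=2000, lr=0.01):
    """Minimize sum_j (Y_j - sum_m w_m P_{jm})^2 over the simplex.

    Y: (n,) held-out targets; P: (n, M) cross-validated predictions.
    """
    M = P.shape[1]
    w = np.full(M, 1.0 / M)
    for _ in range(n_iter):
        grad = 2.0 * P.T @ (P @ w - Y) / len(Y)
        w = project_simplex(w - lr * grad)
    return w

# Toy data: model 1 predicts perfectly, model 2 is biased upward by 1.
rng = np.random.default_rng(0)
Y = rng.normal(size=50)
P = np.column_stack([Y, Y + 1.0])
print(stacking_weights(Y, P))  # weight concentrates on model 1
```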
• 66. Comments on Solutions
Let e = [e]_{jm} = Yj − Ŷ_{(−j),Mm} denote the n × M matrix of residuals for predicting Yj under model Mm.
- With the sum-to-one constraint alone, ŵ ∝ (eᵀe)⁻¹ 1
- If residuals from the models are uncorrelated, the weights are proportional to the inverse of the cross-validation MSE for model Mi, MSEi = Σ_j e²_{ji}
- With highly correlated predictions/residuals, weights may be negative and highly unstable
- The non-negativity (lasso-like) constraint stabilizes the weights and may drive weights to 0 for similar models
- Provides a Bayesian justification for classical stacking (Wolpert 1992, Breiman 1996)
- Super-learners! h2oEnsemble
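The sum-to-one-only solution ŵ ∝ (eᵀe)⁻¹1 is easy to verify numerically, along with the inverse-CV-MSE special case for (nearly) uncorrelated residuals. The simulated residuals below are illustrative, not from the talk.

```python
import numpy as np

def sum_to_one_weights(e):
    """Weights proportional to (e^T e)^{-1} 1, normalized to sum to 1.

    e: (n, M) matrix of cross-validated residuals (rows = held-out
    points, columns = models).
    """
    G = e.T @ e                               # model-by-model Gram matrix
    raw = np.linalg.solve(G, np.ones(G.shape[0]))
    return raw / raw.sum()

# Roughly uncorrelated residuals; model 2 has about 4x the MSE of model 1,
# so its weight should be about 4x smaller.
rng = np.random.default_rng(1)
e = rng.normal(size=(500, 2)) * np.array([1.0, 2.0])
w = sum_to_one_weights(e)
inv_mse = 1.0 / (e ** 2).sum(axis=0)
print(w, inv_mse / inv_mse.sum())  # approximately equal
```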
• 72. Ovarian Cancer Example
Predict short- vs. long-term survival given the primary tumor's molecular phenotype.
- Retrospective sample of survivors of advanced-stage serous ovarian cancer: n = 30 short-term (< 3 years), n = 24 long-term (> 7 years)
- Eleven early-stage (I/II) cases for external validation
- Affymetrix U133a expression microarray; 22,283 genes
- 6 clinical variables
• 77. Models for M-Open Model Averaging (MOMA)
Three classes of models:
- Clinical trees (5 models): prospective classification and regression tree models using only clinical variables such as age, post-treatment CA125 levels, etc.
- Expression trees (4 models): prospective classification and regression tree models using only expression data.
- Expression LDA (4 models): retrospective discriminant models built using expression data given survival.
Find the MOMA (M-Open Model Averaging) weights ŵ using all long-term and short-term survivors, under the sum-to-one and non-negativity constraints.
• 78. MOMA Weights
[Figure: MOMA weights by model class (Clinical Trees, Expression Trees, LDA) under three constraint settings: Sum=1; Non-Neg, Sum=1; and Non-Neg, Sum=1 (expanded).]
• 83. Validation Experiment
- 5-fold cross-validation; 5 splits of the data into two groups: training Y and validation Y*
- Use the training data to obtain model weights ŵ via LOO
- Construct M-Open Model Averaging (MOMA) estimates of the probability of long-term survival, p̂j = Σ_i ŵ_i Ŷ*_{Mi}(Y), for the validation samples
- Classify as long-term survivor if p̂j ≥ 1/2
- Compute classification accuracy over the 5 splits
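The classification step above, combine per-model probabilities with the MOMA weights, threshold at 1/2, then score accuracy, can be sketched as follows. All weights, probabilities, and labels here are made-up illustrative values, not results from the study.

```python
def moma_classify(weights, model_probs, threshold=0.5):
    """Weighted average of per-model probabilities, thresholded.

    model_probs: list of per-model probability lists (models x cases).
    """
    n_cases = len(model_probs[0])
    p_hat = [sum(w * probs[j] for w, probs in zip(weights, model_probs))
             for j in range(n_cases)]
    return [p >= threshold for p in p_hat]

def accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

w = [0.6, 0.4]                              # illustrative MOMA weights
probs = [[0.9, 0.2, 0.7], [0.8, 0.4, 0.3]]  # two models, three cases
truth = [True, False, True]                 # long-term survivor labels
pred = moma_classify(w, probs)
print(pred, accuracy(pred, truth))  # -> [True, False, True] 1.0
```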
• 84. MOMA with Constraints

            set1  set2  set3  set4  set5
clin1       0.00  0.07  0.00  0.00  0.00
clin2       0.00  0.00  0.00  0.00  0.00
clin3       0.00  0.00  0.00  0.00  0.00
clin4       0.00  0.11  0.00  0.00  0.00
clin5       0.30  0.17  0.07  0.41  0.00
tree1       0.00  0.00  0.00  0.00  0.77
tree2       0.00  0.00  0.00  0.00  0.21
tree3       0.23  0.44  0.21  0.00  0.01
tree4       0.00  0.00  0.00  0.00  0.01
lda100.P1   0.00  0.00  0.00  0.00  0.00
lda100.P2   0.22  0.00  0.30  0.00  0.00
lda200.P1   0.00  0.00  0.00  0.00  0.00
lda200.P2   0.26  0.21  0.41  0.58  0.00
Accuracy    0.82  0.73  0.55  0.73  0.60
• 88. Comments
- Can relax the sum-to-one constraint (which can overshrink)
- Justification for the constraints
- Other decision problems lead to other weights
- Uncertainty?
• 94. More General Utilities - Yao et al. (2018)
- Under negative squared error loss we only obtain optimal point predictions
- Stacking probabilistic forecasts P ∈ P: a(w, y) = Σ_m w_m p(y* | y, Mm)
- Proper scoring rules: S(Q, Q) ≥ S(P, Q) for P, Q ∈ P, where S(P, Q) ≡ ∫ S(P, ω) dQ(ω)
- Strictly proper: S(Q, Q) ≥ S(P, Q) with equality only when P = Q almost surely
- Negative quadratic loss is a proper, but not strictly proper, scoring rule
- Logarithmic score: S(P, y*) = log p(y*)
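Stacking predictive densities under the log score amounts to choosing simplex weights maximizing Σ_i log(Σ_m w_m p_m(y_i)), with p_m(y_i) the leave-one-out predictive density of model m at y_i. One simple way to compute the weights (an EM-style fixed-point update for mixture proportions, an illustrative choice, not necessarily the method used in the paper) is sketched below with made-up densities.

```python
import numpy as np

def log_score_stack(dens, n_iter=500):
    """Maximize sum_i log(sum_m w_m * dens[i, m]) over the simplex.

    dens: (n, M) matrix of LOO predictive densities p_m(y_i) > 0.
    Uses the EM update for mixture weights, which stays on the simplex.
    """
    n, M = dens.shape
    w = np.full(M, 1.0 / M)
    for _ in range(n_iter):
        mix = dens @ w                     # mixture density at each y_i
        resp = dens * w / mix[:, None]     # per-point responsibilities
        w = resp.mean(axis=0)              # rows of resp sum to 1
    return w

# Toy densities: model 1 assigns higher density to every observation,
# so all the weight should go to it.
dens = np.array([[0.40, 0.10],
                 [0.35, 0.05],
                 [0.30, 0.20]])
print(log_score_stack(dens))
```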
• 102. Probabilistic Forecasting
Ensemble BMA (Raftery and coauthors, 2005, ...) for weather forecasting:
arg max_{w,σ²} Σ_i log( Σ_{m=1}^{M} w_m p(y*_i | y, Mm, σ²) )
- p_m: Gaussian distributions centered at a_m + b_m Ŷ*_m
- Allows for bias correction and calibration of computer model output
- Common unknown variance σ² in each component
- Weights evolve with time
- Multivariate outcomes: West and coauthors, dynamic linear models (economic forecasting) with dynamic weights
- Gaussian process emulators for computer models and statistical models
• 112. Discussion
- Model averaging for uncertainty quantification under different perspectives
- Dependence on the choice of utility function:
  - squared error loss: point estimates
  - proper scoring rules: distributions (global)
  - selected quantiles (multiple objectives)
- Mixture models and mixtures of experts
- Partitions of the data for the approximation? LOO, k-fold, sequential
- Allow weights to depend on inputs
- Incorporation of model complexity/regularization in the utility (sum to one?)
- Optimization: quadratic programming, EM, variational, ABC