8. Two competing philosophies
[van de Meent et al., 2018] To build machines that can reason, random variables and probabilistic calculations are:
• Probabilistic ML: an engineering requirement [Tenenbaum et al., 2011, Ghahramani, 2015]
• Deep Learning: irrelevant [LeCun et al., 2015, Goodfellow et al., 2016]
10. Probabilistic programming languages (PPLs)
Just as programming beyond the simplest algorithms requires tools for abstraction and composition, complex probabilistic modeling requires new progress in model representation—probabilistic programming languages. [Goodman, 2013]
17. Abstractions over probabilistic computations
Figure 1: [Koller and Friedman, 2009]
d ~ Bernoulli
i ~ Normal
g ~ Categorical(fn(d, i))
s ~ Normal(fn(i))
l ~ Bernoulli(fn(g))
Generative model:
P(D, I, G, S, L) = P(D) P(I) P(G | D, I) P(S | I) P(L | G)
Question: Given a student’s recommendation letter and SAT score, what should I expect their intelligence to be?
Using a PPL: infer(i, {l=Good, s=800})
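The forward direction of this model is just a program. A minimal plain-Python sketch of the generative process above, where the concrete link functions standing in for the slide's fn(...) (difficulty threshold, grade cutoffs, SAT scaling) are hypothetical choices, not from the source:

```python
import random

def sample_student():
    # Hypothetical link functions standing in for the slide's fn(...)
    d = random.random() < 0.4                  # d ~ Bernoulli: course is difficult
    i = random.gauss(0, 1)                     # i ~ Normal: student intelligence
    # g ~ Categorical(fn(d, i)): grade worsens with difficulty, improves with intelligence
    score = i - (1.0 if d else 0.0)
    g = "A" if score > 0.5 else ("B" if score > -0.5 else "C")
    s = random.gauss(500 + 100 * i, 50)        # s ~ Normal(fn(i)): SAT score
    l = random.random() < (0.9 if g == "A" else 0.2)  # l ~ Bernoulli(fn(g)): strong letter
    return dict(d=d, i=i, g=g, s=s, l=l)

print(sample_student())
```

Running the program forwards samples a full world; the point of a PPL is that infer(...) then runs this same program backwards, conditioning on some variables to query others.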
22. Bayesian Inference Basics
Goal: Approximate the posterior P(X | Y)
Table 1: [van de Meent et al., 2018]
X                                 | Y
intelligence                      | letter and grade
scene description                 | image
simulation                        | simulator output
program source code               | program return value
policy prior and world simulator  | rewards
cognitive decision making process | observed behavior
24. Why only approximate?
P(X | Y) = P(Y | X) P(X) / P(Y) = P(Y | X) P(X) / ∫ P(Y | X) P(X) dX
The marginal likelihood P(Y) (i.e. the partition function) is a high-dimensional integral, tractable only for a small family of conjugate prior/likelihood pairs.
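When the latent space is tiny and discrete, the denominator is a sum rather than an integral and the posterior can be computed exactly. A toy sketch (the prior, likelihood, and observed value are illustrative choices, not from the source):

```python
import math

# X in {0, 1} with prior P(X=1) = 0.3, and Y | X ~ Normal(X, 1) observed at y = 0.8.
def normal_pdf(y, mean, std):
    return math.exp(-0.5 * ((y - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

prior = {0: 0.7, 1: 0.3}
y = 0.8
joint = {x: normal_pdf(y, x, 1.0) * p for x, p in prior.items()}  # P(Y|X) P(X)
marginal = sum(joint.values())                                    # P(Y): a sum replaces the integral
posterior = {x: j / marginal for x, j in joint.items()}           # P(X|Y) by Bayes' rule
print(posterior)
```

With continuous, high-dimensional X the `sum` becomes an intractable integral, which is exactly why approximation is needed.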
32. How to approximate?
Variational Inference
Let qφ be a tractable parametric family (e.g. Gaussian mean-field qφ(X) = ∏_{i=1}^d N(X_i | φ_{1,i}, φ_{2,i}))
arg min_φ KL(qφ(X) ‖ P(X | Y))
= arg max_φ E_{qφ}[log P(X | Y) − log qφ(X)]
= arg max_φ E_{qφ}[log P(X, Y) − log qφ(X)]
Monte Carlo
Sample X_i iid ∼ P(X | Y). Then
E[g(X) | Y] = ∫ g(X) · P(X | Y) dX ≈ (1/N) ∑_{i=1}^N g(X_i)
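The Monte Carlo side of the slide can be sketched in a few lines. Here we assume (hypothetically) that the posterior P(X | Y) happens to be N(1, 1), so exact posterior samples are available and E[X² | Y] = Var + mean² = 2 can be checked:

```python
import random

random.seed(0)
# Suppose P(X | Y) = N(1, 1) and g(X) = X^2, so E[g(X) | Y] = 1 + 1 = 2 exactly.
N = 200_000
samples = [random.gauss(1.0, 1.0) for _ in range(N)]
estimate = sum(x * x for x in samples) / N   # (1/N) * sum_i g(X_i)
print(estimate)  # close to 2
```

The catch, of course, is the premise: exact iid samples from P(X | Y) are rarely available, which motivates importance sampling and MCMC below.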
41. Sequential importance sampling (SIS)
E_{P(X|Y)}[g(X)] = E_{P(X|Y)}[(q(X)/q(X)) g(X)] = E_q[(P(X, Y) / (q(X) P(Y))) g(X)]
≈ (1/N) ∑_i (P(X_i, Y) / (q(X_i) P(Y))) g(X_i)
≈ ∑_i (P(X_i, Y)/q(X_i)) g(X_i) / ∑_i (P(X_i, Y)/q(X_i))
where the X_i ∼ q are drawn from the proposal distribution q
Rate: Var_q[(P(X, Y)/q(X)) g(X)]^{−1/2} [Yuan and Druzdzel, 2007]
46. SIS of execution traces
1. Execute the probabilistic program forwards
2. At each latent variable X_k (i.e. sample statement), sample q(X_k), assign the value, and multiply the node's importance weight into the trace
3. At each observed random variable (i.e. observe statement), multiply the likelihood P(Y_k | X_{1:k}, Y_{1:k−1}) into the trace
Problem: Myopic choices from sampling q(X) early in the trace may result in low importance weights (poor explanation of Y) later.
50. Constructing a proposal distribution
How to choose q?
Likelihood weighting [Russell and Norvig, 2002]: q(X) = P(X)
Direct sampling: q(X) = P(X | Y), optimal
Key Idea: Account for Y when constructing q. Can we exploit access to P(X, Y) to build a proposal q “close” to P(X | Y)?
51. Intuition for inference compilation
Figure: Generative model p(x, y) (left) and the posterior density p(x | y = 0.25) (right), comparing exact posterior samples x ∼ p(x | y = 0.25) against learned proposals q(x; φ(y = 0.25)) with K = 1 and K = 2 mixture components.
52. Trace-based inference compilation (IC)
• Construct a DNN with parameters φ mapping the observations Y (amortized inference, [Goodman, 2013]) and the execution prefix to a proposal distribution qφ(· | Y)
• Train qφ against forward samples from the probabilistic program p(x, y) (inference compilation)
Figure 3: [Le et al., 2017]
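The training idea, stripped to its essentials: draw (x, y) pairs from the joint, then fit a conditional model of x given y. In this sketch the "network" is just a linear-Gaussian map fit in closed form by least squares, a deliberately minimal stand-in for the DNN, on a conjugate model whose exact posterior mean is y/2:

```python
import random

random.seed(0)
# Forward-sample (x, y) from the joint: x ~ N(0, 1), y | x ~ N(x, 1).
pairs = []
for _ in range(100_000):
    x = random.gauss(0, 1)
    pairs.append((x, random.gauss(x, 1)))

# "Compile" an amortized proposal q(x | y) = N(a*y + b, s) by regressing x on y.
# For this model the exact posterior is N(y/2, 1/2), so a should approach 0.5.
n = len(pairs)
mx = sum(x for x, _ in pairs) / n
my = sum(y for _, y in pairs) / n
cov = sum((x - mx) * (y - my) for x, y in pairs) / n
vy = sum((y - my) ** 2 for _, y in pairs) / n
a = cov / vy
b = mx - a * my
print(a, b)  # a near 0.5, b near 0
```

The compilation happens offline against cheap forward samples; at inference time the fitted map gives a proposal conditioned on the actual observation, with no per-query training.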
56. Imperative vs Declarative PPLs
Declarative: graph-based; programs instantiate graphical models (i.e. worlds) (BUGS, BLOG, Stan, beanmachine)
Figure 6: [Blei et al., 2003]
Key Idea: the Markov blanket is available in a declarative PPL
62. MCMC sampling of graphical models
Metropolis-within-Gibbs / Lightweight MH [Wingate et al., 2011]:
• Initialize a minimal self-supporting world consistent with Y
• Repeat:
  • Pick a single random unobserved node X_i
  • Sample proposal q(X_i) to propose a new value
  • Accept with probability α and revert otherwise
Theorem ([Hastings, 1970])
With appropriately chosen α, the above algorithm yields a Markov chain with the posterior as its invariant distribution.
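The accept/revert loop above, specialized to a single continuous latent with a symmetric random-walk proposal (so q cancels in the MH ratio), can be sketched as follows; the toy model (x ~ N(0, 1), y = 1 observed with y | x ~ N(x, 1), exact posterior N(0.5, 0.5)) is an illustrative assumption:

```python
import random, math

random.seed(0)
def log_joint(x, y=1.0):
    return -0.5 * x * x - 0.5 * (y - x) ** 2   # log P(X) P(Y|X), up to constants

x = 0.0                                        # initial world consistent with Y
samples = []
for _ in range(200_000):
    proposal = x + random.gauss(0, 1.0)        # symmetric random-walk proposal q(X_i)
    log_alpha = log_joint(proposal) - log_joint(x)   # q cancels: alpha = min(1, p'/p)
    if math.log(random.random()) < log_alpha:  # accept with probability alpha
        x = proposal                           # ... otherwise revert (x unchanged)
    samples.append(x)

burned = samples[50_000:]                      # discard burn-in
print(sum(burned) / len(burned))               # close to the exact 0.5
```

In a full lightweight MH engine the same accept/revert step runs per node, with the rest of the world held fixed while one X_i is resampled.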
69. MH proposal distributions
Different q(·) =⇒ different MCMC algorithms
• Random walk MH: q(·) isotropic Gaussian
• Newtonian Monte Carlo: q(·) Gaussian with empirical Fisher information precision
• Hamiltonian Monte Carlo: q(·) integrates an iso-Hamiltonian system
• Lightweight Inference Compilation: q(· | MB(X_i)) a neural network function of the Markov blanket
Theorem ([Pearl, 1987])
Gibbs proposals q(X_i) = P(X_i | X_{¬i}) = P(X_i | MB(X_i)) have acceptance probability 1.
∴ MB(X_i) is a minimal sufficient input for constructing a proposal distribution
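Pearl's result can be checked numerically on a tiny example. On a binary chain X1 → X2 → X3 (with hypothetical, arbitrarily chosen CPTs), proposing X2 from its full conditional P(X2 | X1, X3), computed from the Markov blanket, makes the MH ratio exactly 1:

```python
# Binary chain X1 -> X2 -> X3 with arbitrary illustrative CPTs.
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_x3_given_x2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}

def joint(x1, x2, x3):
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]

def gibbs(x2, x1, x3):   # P(X2 = x2 | X1, X3): renormalize the joint over the blanket
    z = sum(joint(x1, v, x3) for v in (0, 1))
    return joint(x1, x2, x3) / z

x1, x3 = 1, 0            # the rest of the world, held fixed
ratios = []
for old, new in [(0, 1), (1, 0)]:
    # MH ratio: [p(new) * q(old <- new)] / [p(old) * q(new <- old)]
    ratio = (joint(x1, new, x3) * gibbs(old, x1, x3)) / (joint(x1, old, x3) * gibbs(new, x1, x3))
    ratios.append(ratio)
print(ratios)  # [1.0, 1.0]
```

The cancellation is algebraic: the proposal is proportional to the joint restricted to the blanket, so the normalizers cancel and α = 1 regardless of the CPT values.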
74. Recovering conjugate expressions in normal-normal
x ∼ N(0, 2), y | x ∼ N(x, 0.1)
Know (conjugacy): x | y ∼ N(0.999y, 0.0001)
Figure: Learning a conjugate model's posterior: the closed-form posterior mean and the mean of the LIC proposer, plotted against the observed value y.
75. GMM Mode Escape
Figure: Generative model p(x, y) and the posterior density p(x | y = 0.25), with density estimates from Adaptive HMC (Hoffman 2014), Adaptive RWMH (Garthwaite 2016), ground truth, Inference Compilation (this paper), NMC (Arora 2020), and NUTS (Stan defaults).
76. Robustness to nuisance random variables

def magnitude(obs):
    x = sample(Normal(0, 10))
    for _ in range(100):
        nuisance = sample(Normal(0, 10))
    y = sample(Normal(0, 10))
    observe(
        obs**2,
        likelihood=Normal(x**2 + y**2, 0.1))
    return x

class NuisanceModel:
    @random_variable
    def x(self):
        return dist.Normal(0, 10)
    @random_variable
    def nuisance(self, i):
        return dist.Normal(0, 10)
    @random_variable
    def y(self):
        return dist.Normal(0, 10)
    @random_variable
    def noisy_sq_length(self):
        return dist.Normal(
            self.x()**2 + self.y()**2,
            0.1)

                   | # params | compile time | ESS
LIC (this paper)   | 3,358    | 44 sec.      | 49.75
[Le et al., 2017]  | 21,952   | 472 sec.     | 10.99
81. Adaptive LIC
Problem: forward samples from P(X, Y) may not represent Y at inference
RWMH [Garthwaite et al., 2016] and HMC [Hoffman and Gelman, 2014] all have adaptive variants.
Idea: Perform MH with LIC to draw posterior samples x^(m) ∼ P(x | y = obs), then hill-climb the LIC artifacts on the inclusive KL between the conditional (rather than joint) posterior:
arg min_φ D_KL(p(x | y = obs) ‖ q(x | y = obs; φ)) ≈ arg max_φ ∑_{m=1}^N log q(x^(m) | y = obs, φ)
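The objective above says: minimizing the inclusive KL over φ reduces to maximizing the log-density of posterior samples under q. For a Gaussian q this is just moment matching, which this sketch illustrates with stand-in "posterior samples" drawn from an assumed N(2, 3) target:

```python
import random, math

random.seed(0)
# Stand-in posterior samples x^(m) ~ p(x | y = obs); here p is assumed to be N(2, 3).
samples = [random.gauss(2.0, 3.0) for _ in range(100_000)]

# arg max_phi sum_m log q(x^(m); mu, sigma) for Gaussian q has the closed-form
# solution mu = sample mean, sigma = sample standard deviation (MLE / moment matching).
mu = sum(samples) / len(samples)
sigma = math.sqrt(sum((x - mu) ** 2 for x in samples) / len(samples))
print(mu, sigma)  # close to (2, 3)
```

With a neural proposer in place of the Gaussian, the same objective is optimized by gradient ascent on the summed log q, but the principle (fit q to samples from the conditional posterior, not the joint) is identical.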
82. IAF density estimators
Problem: the GMM in LIC may provide poor approximations
Idea: Parameterize IAFs [Kingma et al., 2016] with LIC outputs
Figure 7: Neal's funnel (left), with density approximations by a 7-component isotropic GMM (middle) and a 7-layer IAF (right)
83. Heavy-tailed density estimators
Problem: GMMs and standard IAFs (Lipschitz functions of Gaussians) remain sub-Gaussian, while n-schools is heavy-tailed
Idea: IAFs with a heavy-tailed base distribution
Figure 8: IAF density estimation of a Cauchy(−2, 1) (left) and the K-S statistics when using a Normal (top right) vs. StudentT (bottom right) base distribution
85. References i
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.
Garthwaite, P. H., Fan, Y., and Sisson, S. A. (2016). Adaptive optimal scaling of Metropolis–Hastings algorithms using the Robbins–Monro process. Communications in Statistics - Theory and Methods, 45(17):5098–5111.
86. References ii
Ghahramani, Z. (2015). Probabilistic machine learning and artificial intelligence. Nature, 521(7553):452–459.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press, Cambridge.
87. References iii
Goodman, N. D. (2013). The principles and practice of probabilistic programming. ACM SIGPLAN Notices, 48(1):399–402.
Harvey, W., Munk, A., Baydin, A. G., Bergholm, A., and Wood, F. (2019). Attention for inference compilation. arXiv preprint arXiv:1910.11961.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109.
88. References iv
Hoffman, M. D. and Gelman, A. (2014). The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623.
Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. (2016). Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751.
89. References v
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
Le, T. A., Baydin, A. G., and Wood, F. (2017). Inference compilation and universal probabilistic programming. In Artificial Intelligence and Statistics, pages 1338–1348.
90. References vi
LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436–444.
Russell, S. and Norvig, P. (2002). Artificial Intelligence: A Modern Approach. Prentice Hall.
Pearl, J. (1987). Evidential reasoning using stochastic simulation of causal models. Artificial Intelligence, 32(2):245–257.
91. References vii
Tenenbaum, J. B., Kemp, C., Griffiths, T. L., and Goodman, N. D. (2011). How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022):1279–1285.
van de Meent, J.-W., Paige, B., Yang, H., and Wood, F. (2018). An introduction to probabilistic programming. arXiv preprint arXiv:1809.10756.
92. References viii
Wingate, D., Stuhlmüller, A., and Goodman, N. (2011).
Lightweight implementations of probabilistic
programming languages via transformational
compilation.
In Proceedings of the Fourteenth International
Conference on Artificial Intelligence and Statistics,
pages 770–778.
93. References ix
Yuan, C. and Druzdzel, M. J. (2007). Theoretical analysis and practical insights on importance sampling in Bayesian networks. International Journal of Approximate Reasoning, 46(2):320–333.