8. Two competing philosophies
[van de Meent et al., 2018] To build machines that can reason, random variables and probabilistic calculations are:
• Probabilistic ML: an engineering requirement [Tenenbaum et al., 2011, Ghahramani, 2015]
• Deep Learning: irrelevant [LeCun et al., 2015, Goodfellow et al., 2016]
10. Probabilistic programming languages (PPLs)
Just as programming beyond the simplest algorithms requires tools for abstraction and composition, complex probabilistic modeling requires new progress in model representation—probabilistic programming languages. [Goodman, 2013]
17. Abstractions over probabilistic computations
Figure 1: [Koller and Friedman, 2009]
d ~ Bernoulli
i ~ Normal
g ~ Categorical(fn(d, i))
s ~ Normal(fn(i))
l ~ Bernoulli(fn(g))
Generative model:
P(D, I, G, S, L) = P(D) P(I) P(G | D, I) P(S | I) P(L | G)
Question: Given a student’s recommendation letter and SAT score, what should I expect their intelligence to be?
Using a PPL: infer(i, {l=Good, s=800})
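The forward direction of this model is just a program. A minimal plain-Python sketch of the generative process above, where the concrete link functions standing in for the slide's fn(...) (difficulty threshold, grade cutoffs, SAT scaling) are hypothetical choices, not from the source:

```python
import random

def sample_student():
    # Hypothetical link functions standing in for the slide's fn(...)
    d = random.random() < 0.4                  # d ~ Bernoulli: course is difficult
    i = random.gauss(0, 1)                     # i ~ Normal: student intelligence
    # g ~ Categorical(fn(d, i)): grade worsens with difficulty, improves with intelligence
    score = i - (1.0 if d else 0.0)
    g = "A" if score > 0.5 else ("B" if score > -0.5 else "C")
    s = random.gauss(500 + 100 * i, 50)        # s ~ Normal(fn(i)): SAT score
    l = random.random() < (0.9 if g == "A" else 0.2)  # l ~ Bernoulli(fn(g)): strong letter
    return dict(d=d, i=i, g=g, s=s, l=l)

print(sample_student())
```

Running the program forwards samples a full world; the point of a PPL is that infer(...) then runs this same program backwards, conditioning on some variables to query others.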
22. Bayesian Inference Basics
Goal: Approximate the posterior P(X | Y)
Table 1: [van de Meent et al., 2018]
X                                 | Y
intelligence                      | letter and grade
scene description                 | image
simulation                        | simulator output
program source code               | program return value
policy prior and world simulator  | rewards
cognitive decision making process | observed behavior
24. Why only approximate?
P(X | Y) = P(Y | X) P(X) / P(Y) = P(Y | X) P(X) / ∫ P(Y | X) P(X) dX
The marginal likelihood P(Y) (i.e. the partition function) is a high-dimensional integral, tractable only for a small family of conjugate prior/likelihood pairs.
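When the latent space is tiny and discrete, the denominator is a sum rather than an integral and the posterior can be computed exactly. A toy sketch (the prior, likelihood, and observed value are illustrative choices, not from the source):

```python
import math

# X in {0, 1} with prior P(X=1) = 0.3, and Y | X ~ Normal(X, 1) observed at y = 0.8.
def normal_pdf(y, mean, std):
    return math.exp(-0.5 * ((y - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

prior = {0: 0.7, 1: 0.3}
y = 0.8
joint = {x: normal_pdf(y, x, 1.0) * p for x, p in prior.items()}  # P(Y|X) P(X)
marginal = sum(joint.values())                                    # P(Y): a sum replaces the integral
posterior = {x: j / marginal for x, j in joint.items()}           # P(X|Y) by Bayes' rule
print(posterior)
```

With continuous, high-dimensional X the `sum` becomes an intractable integral, which is exactly why approximation is needed.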
32. How to approximate?
Variational Inference
Let qφ be a tractable parametric family (e.g. Gaussian mean-field qφ(X) = ∏_{i=1}^d N(X_i | φ_{1,i}, φ_{2,i}))
arg min_φ KL(qφ(X) ‖ P(X | Y))
= arg max_φ E_{qφ}[log P(X | Y) − log qφ(X)]
= arg max_φ E_{qφ}[log P(X, Y) − log qφ(X)]
Monte Carlo
Sample X_i iid ∼ P(X | Y). Then
E[g(X) | Y] = ∫ g(X) · P(X | Y) dX ≈ (1/N) ∑_{i=1}^N g(X_i)
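The Monte Carlo side of the slide can be sketched in a few lines. Here we assume (hypothetically) that the posterior P(X | Y) happens to be N(1, 1), so exact posterior samples are available and E[X² | Y] = Var + mean² = 2 can be checked:

```python
import random

random.seed(0)
# Suppose P(X | Y) = N(1, 1) and g(X) = X^2, so E[g(X) | Y] = 1 + 1 = 2 exactly.
N = 200_000
samples = [random.gauss(1.0, 1.0) for _ in range(N)]
estimate = sum(x * x for x in samples) / N   # (1/N) * sum_i g(X_i)
print(estimate)  # close to 2
```

The catch, of course, is the premise: exact iid samples from P(X | Y) are rarely available, which motivates importance sampling and MCMC below.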
41. Sequential importance sampling (SIS)
E_{P(X|Y)}[g(X)] = E_{P(X|Y)}[(q(X)/q(X)) g(X)] = E_q[(P(X, Y) / (q(X) P(Y))) g(X)]
≈ (1/N) ∑_i (P(X_i, Y) / (q(X_i) P(Y))) g(X_i)
≈ ∑_i (P(X_i, Y)/q(X_i)) g(X_i) / ∑_i (P(X_i, Y)/q(X_i))
where the X_i ∼ q are drawn from the proposal distribution q
Rate: Var_q[(P(X, Y)/q(X)) g(X)]^{−1/2} [Yuan and Druzdzel, 2007]
46. SIS of execution traces
1. Execute the probabilistic program forwards
2. At each latent variable X_k (i.e. sample statement), sample q(X_k), assign the value, and multiply the node's importance weight into the trace
3. At each observed random variable (i.e. observe statement), multiply the likelihood P(Y_k | X_{1:k}, Y_{1:k−1}) into the trace
Problem: Myopic choices from sampling q(X) early in the trace may result in low importance weights (poor explanation of Y) later.
50. Constructing a proposal distribution
How to choose q?
Likelihood weighting [Russell and Norvig, 2002]: q(X) = P(X)
Direct sampling: q(X) = P(X | Y), optimal
Key Idea: Account for Y when constructing q. Can we exploit access to P(X, Y) to build a proposal q “close” to P(X | Y)?
51. Intuition for inference compilation
Figure: Generative model p(x, y) (left) and the posterior density p(x | y = 0.25) (right), comparing exact posterior samples x ∼ p(x | y = 0.25) against learned proposals q(x; φ(y = 0.25)) with K = 1 and K = 2 mixture components.
52. Trace-based inference compilation (IC)
• Construct a DNN with parameters φ mapping the observations Y (amortized inference, [Goodman, 2013]) and the execution prefix to a proposal distribution qφ(· | Y)
• Train qφ against forward samples from the probabilistic program p(x, y) (inference compilation)
Figure 3: [Le et al., 2017]
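The training idea, stripped to its essentials: draw (x, y) pairs from the joint, then fit a conditional model of x given y. In this sketch the "network" is just a linear-Gaussian map fit in closed form by least squares, a deliberately minimal stand-in for the DNN, on a conjugate model whose exact posterior mean is y/2:

```python
import random

random.seed(0)
# Forward-sample (x, y) from the joint: x ~ N(0, 1), y | x ~ N(x, 1).
pairs = []
for _ in range(100_000):
    x = random.gauss(0, 1)
    pairs.append((x, random.gauss(x, 1)))

# "Compile" an amortized proposal q(x | y) = N(a*y + b, s) by regressing x on y.
# For this model the exact posterior is N(y/2, 1/2), so a should approach 0.5.
n = len(pairs)
mx = sum(x for x, _ in pairs) / n
my = sum(y for _, y in pairs) / n
cov = sum((x - mx) * (y - my) for x, y in pairs) / n
vy = sum((y - my) ** 2 for _, y in pairs) / n
a = cov / vy
b = mx - a * my
print(a, b)  # a near 0.5, b near 0
```

The compilation happens offline against cheap forward samples; at inference time the fitted map gives a proposal conditioned on the actual observation, with no per-query training.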
56. Imperative vs Declarative PPLs
Declarative: graph-based; programs instantiate graphical models (i.e. worlds) (BUGS, BLOG, Stan, beanmachine)
Figure 6: [Blei et al., 2003]
Key Idea: the Markov blanket is available in a declarative PPL
62. MCMC sampling of graphical models
Metropolis-within-Gibbs / Lightweight MH [Wingate et al., 2011]:
• Initialize a minimal self-supporting world consistent with Y
• Repeat:
  • Pick a single random unobserved node X_i
  • Sample proposal q(X_i) to propose a new value
  • Accept with probability α and revert otherwise
Theorem ([Hastings, 1970])
With appropriately chosen α, the above algorithm yields a Markov chain with the posterior as its invariant distribution.
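The accept/revert loop above, specialized to a single continuous latent with a symmetric random-walk proposal (so q cancels in the MH ratio), can be sketched as follows; the toy model (x ~ N(0, 1), y = 1 observed with y | x ~ N(x, 1), exact posterior N(0.5, 0.5)) is an illustrative assumption:

```python
import random, math

random.seed(0)
def log_joint(x, y=1.0):
    return -0.5 * x * x - 0.5 * (y - x) ** 2   # log P(X) P(Y|X), up to constants

x = 0.0                                        # initial world consistent with Y
samples = []
for _ in range(200_000):
    proposal = x + random.gauss(0, 1.0)        # symmetric random-walk proposal q(X_i)
    log_alpha = log_joint(proposal) - log_joint(x)   # q cancels: alpha = min(1, p'/p)
    if math.log(random.random()) < log_alpha:  # accept with probability alpha
        x = proposal                           # ... otherwise revert (x unchanged)
    samples.append(x)

burned = samples[50_000:]                      # discard burn-in
print(sum(burned) / len(burned))               # close to the exact 0.5
```

In a full lightweight MH engine the same accept/revert step runs per node, with the rest of the world held fixed while one X_i is resampled.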
69. MH proposal distributions
Different q(·) =⇒ different MCMC algorithms
• Random walk MH: q(·) isotropic Gaussian
• Newtonian Monte Carlo: q(·) Gaussian with empirical Fisher information precision
• Hamiltonian Monte Carlo: q(·) integrates an iso-Hamiltonian system
• Lightweight Inference Compilation: q(· | MB(X_i)) a neural network function of the Markov blanket
Theorem ([Pearl, 1987])
Gibbs proposals q(X_i) = P(X_i | X_{¬i}) = P(X_i | MB(X_i)) have acceptance probability 1.
∴ MB(X_i) is a minimal sufficient input for constructing a proposal distribution
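Pearl's result can be checked numerically on a tiny example. On a binary chain X1 → X2 → X3 (with hypothetical, arbitrarily chosen CPTs), proposing X2 from its full conditional P(X2 | X1, X3), computed from the Markov blanket, makes the MH ratio exactly 1:

```python
# Binary chain X1 -> X2 -> X3 with arbitrary illustrative CPTs.
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_x3_given_x2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}

def joint(x1, x2, x3):
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]

def gibbs(x2, x1, x3):   # P(X2 = x2 | X1, X3): renormalize the joint over the blanket
    z = sum(joint(x1, v, x3) for v in (0, 1))
    return joint(x1, x2, x3) / z

x1, x3 = 1, 0            # the rest of the world, held fixed
ratios = []
for old, new in [(0, 1), (1, 0)]:
    # MH ratio: [p(new) * q(old <- new)] / [p(old) * q(new <- old)]
    ratio = (joint(x1, new, x3) * gibbs(old, x1, x3)) / (joint(x1, old, x3) * gibbs(new, x1, x3))
    ratios.append(ratio)
print(ratios)  # [1.0, 1.0]
```

The cancellation is algebraic: the proposal is proportional to the joint restricted to the blanket, so the normalizers cancel and α = 1 regardless of the CPT values.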
74. Recovering conjugate expressions in normal-normal
x ∼ N(0, 2), y | x ∼ N(x, 0.1)
Know (conjugacy): x | y ∼ N(0.999y, 0.0001)
Figure: Learning a conjugate model's posterior: the closed-form posterior mean and the mean of the LIC proposer, plotted against the observed value y.
75. GMM Mode Escape
Figure: Generative model p(x, y) and the posterior density p(x | y = 0.25), with density estimates from Adaptive HMC (Hoffman 2014), Adaptive RWMH (Garthwaite 2016), ground truth, Inference Compilation (this paper), NMC (Arora 2020), and NUTS (Stan defaults).
76. Robustness to nuisance random variables

def magnitude(obs):
    x = sample(Normal(0, 10))
    for _ in range(100):
        nuisance = sample(Normal(0, 10))
    y = sample(Normal(0, 10))
    observe(
        obs**2,
        likelihood=Normal(x**2 + y**2, 0.1))
    return x

class NuisanceModel:
    @random_variable
    def x(self):
        return dist.Normal(0, 10)
    @random_variable
    def nuisance(self, i):
        return dist.Normal(0, 10)
    @random_variable
    def y(self):
        return dist.Normal(0, 10)
    @random_variable
    def noisy_sq_length(self):
        return dist.Normal(
            self.x()**2 + self.y()**2,
            0.1)

                   | # params | compile time | ESS
LIC (this paper)   | 3,358    | 44 sec.      | 49.75
[Le et al., 2017]  | 21,952   | 472 sec.     | 10.99
81. Adaptive LIC
Problem: forward samples from P(X, Y) may not represent Y at inference
RWMH [Garthwaite et al., 2016] and HMC [Hoffman and Gelman, 2014] all have adaptive variants.
Idea: Perform MH with LIC to draw posterior samples x^(m) ∼ P(x | y = obs), then hill-climb the LIC artifacts on the inclusive KL between the conditional (rather than joint) posterior:
arg min_φ D_KL(p(x | y = obs) ‖ q(x | y = obs; φ)) ≈ arg max_φ ∑_{m=1}^N log q(x^(m) | y = obs, φ)
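The objective above says: minimizing the inclusive KL over φ reduces to maximizing the log-density of posterior samples under q. For a Gaussian q this is just moment matching, which this sketch illustrates with stand-in "posterior samples" drawn from an assumed N(2, 3) target:

```python
import random, math

random.seed(0)
# Stand-in posterior samples x^(m) ~ p(x | y = obs); here p is assumed to be N(2, 3).
samples = [random.gauss(2.0, 3.0) for _ in range(100_000)]

# arg max_phi sum_m log q(x^(m); mu, sigma) for Gaussian q has the closed-form
# solution mu = sample mean, sigma = sample standard deviation (MLE / moment matching).
mu = sum(samples) / len(samples)
sigma = math.sqrt(sum((x - mu) ** 2 for x in samples) / len(samples))
print(mu, sigma)  # close to (2, 3)
```

With a neural proposer in place of the Gaussian, the same objective is optimized by gradient ascent on the summed log q, but the principle (fit q to samples from the conditional posterior, not the joint) is identical.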
82. IAF density estimators
Problem: the GMM in LIC may provide poor approximations
Idea: Parameterize IAFs [Kingma et al., 2016] with LIC outputs
Figure 7: Neal's funnel (left), with density approximations by a 7-component isotropic GMM (middle) and a 7-layer IAF (right)
83. Heavy-tailed density estimators
Problem: GMMs and standard IAFs (Lipschitz functions of Gaussians) remain sub-Gaussian, while n-schools is heavy-tailed
Idea: IAFs with a heavy-tailed base distribution
Figure 8: IAF density estimation of a Cauchy(−2, 1) (left) and the K-S statistics when using a Normal (top right) vs. StudentT (bottom right) base distribution
85. References i
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.
Garthwaite, P. H., Fan, Y., and Sisson, S. A. (2016). Adaptive optimal scaling of Metropolis–Hastings algorithms using the Robbins–Monro process. Communications in Statistics - Theory and Methods, 45(17):5098–5111.
86. References ii
Ghahramani, Z. (2015). Probabilistic machine learning and artificial intelligence. Nature, 521(7553):452–459.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press, Cambridge.
87. References iii
Goodman, N. D. (2013). The principles and practice of probabilistic programming. ACM SIGPLAN Notices, 48(1):399–402.
Harvey, W., Munk, A., Baydin, A. G., Bergholm, A., and Wood, F. (2019). Attention for inference compilation. arXiv preprint arXiv:1910.11961.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109.
88. References iv
Hoffman, M. D. and Gelman, A. (2014). The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623.
Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. (2016). Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751.
89. References v
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
Le, T. A., Baydin, A. G., and Wood, F. (2017). Inference compilation and universal probabilistic programming. In Artificial Intelligence and Statistics, pages 1338–1348.
90. References vi
LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436–444.
Russell, S. and Norvig, P. (2002). Artificial Intelligence: A Modern Approach. Prentice Hall.
Pearl, J. (1987). Evidential reasoning using stochastic simulation of causal models. Artificial Intelligence, 32(2):245–257.
91. References vii
Tenenbaum, J. B., Kemp, C., Griffiths, T. L., and Goodman, N. D. (2011). How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022):1279–1285.
van de Meent, J.-W., Paige, B., Yang, H., and Wood, F. (2018). An introduction to probabilistic programming. arXiv preprint arXiv:1809.10756.
92. References viii
Wingate, D., Stuhlmüller, A., and Goodman, N. (2011).
Lightweight implementations of probabilistic
programming languages via transformational
compilation.
In Proceedings of the Fourteenth International
Conference on Artificial Intelligence and Statistics,
pages 770–778.
93. References ix
Yuan, C. and Druzdzel, M. J. (2007). Theoretical analysis and practical insights on importance sampling in Bayesian networks. International Journal of Approximate Reasoning, 46(2):320–333.