24. Automating Expectations
Monte Carlo sampling
∫_a^{a+1} f(θ) dθ ≈ (1/S) ∑_{s=1}^{S} f(θ^(s)),  where θ^(s) ∼ Uniform(a, a+1)
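A minimal sketch of this estimator, assuming a toy integrand f(θ) = θ² and a = 0 (so the exact integral is 1/3); the integrand, limits, and sample count are illustrative choices, not from the slide.

// Monte Carlo estimate of the integral of f(theta) = theta^2 over [a, a+1].
#include <iostream>
#include <random>

int main() {
  const double a = 0.0;   // lower limit (assumed for illustration)
  const int S = 100000;   // number of Monte Carlo samples
  std::mt19937 rng(0);
  std::uniform_real_distribution<double> unif(a, a + 1.0);

  double sum = 0.0;
  for (int s = 0; s < S; ++s) {
    double theta = unif(rng);  // theta^(s) ~ Uniform(a, a+1)
    sum += theta * theta;      // f(theta^(s))
  }
  std::cout << "MC estimate: " << sum / S << std::endl;  // close to 1/3
  return 0;
}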
25. Automating Expectations
Monte Carlo sampling
E_{q(θ;φ)}[log p(X, θ)] = ∫ log p(X, θ) q(θ;φ) dθ ≈ (1/S) ∑_{s=1}^{S} log p(X, θ^(s)),  where θ^(s) ∼ q(θ;φ)
Monte Carlo Statistical Methods, Robert and Casella, 1999
Monte Carlo and Quasi-Monte Carlo Sampling, Lemieux, 2009
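A minimal sketch of the expectation above, assuming a toy joint (one Gaussian observation with a Gaussian prior) and a Gaussian q(θ; φ); the model and the values of x, mu, and sigma are illustrative, not part of the slides.

// Monte Carlo estimate of E_q[log p(x, theta)] by sampling from q.
#include <cmath>
#include <iostream>
#include <random>

// log p(x, theta) = log Normal(x | theta, 1) + log Normal(theta | 0, 1),
// dropping additive constants (toy model, assumed for illustration).
double log_joint(double x, double theta) {
  return -0.5 * (x - theta) * (x - theta) - 0.5 * theta * theta;
}

int main() {
  const double x = 1.5, mu = 1.0, sigma = 0.5;  // assumed data and q(theta; phi) parameters
  const int S = 100000;
  std::mt19937 rng(0);
  std::normal_distribution<double> q(mu, sigma);  // theta^(s) ~ q(theta; phi)

  double sum = 0.0;
  for (int s = 0; s < S; ++s)
    sum += log_joint(x, q(rng));
  std::cout << "E_q[log p(x, theta)] ~ " << sum / S << std::endl;
  return 0;
}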
27. Automating Gradients
Symbolic or Automatic Differentiation
Let f(x1, x2) = log x1 + x1 x2 − sin x2. Compute ∂f(2, 5)/∂x1.
Table 2 (Baydin et al.): forward mode AD example, with y = f(x1, x2) = ln(x1) + x1 x2 − sin(x2) evaluated at (x1, x2) = (2, 5), setting ẋ1 = 1 to compute ∂y/∂x1. The original forward run on the left is augmented by the forward AD operations on the right, where each line supplements the original on its left.
Forward Evaluation Trace
v₋₁ = x1 = 2
v₀ = x2 = 5
v₁ = ln v₋₁ = ln 2
v₂ = v₋₁ × v₀ = 2 × 5
v₃ = sin v₀ = sin 5
v₄ = v₁ + v₂ = 0.693 + 10
v₅ = v₄ − v₃ = 10.693 + 0.959
y = v₅ = 11.652
Forward Derivative Trace
v̇₋₁ = ẋ1 = 1
v̇₀ = ẋ2 = 0
v̇₁ = v̇₋₁ / v₋₁ = 1/2
v̇₂ = v̇₋₁ × v₀ + v̇₀ × v₋₁ = 1 × 5 + 0 × 2
v̇₃ = v̇₀ × cos v₀ = 0 × cos 5
v̇₄ = v̇₁ + v̇₂ = 0.5 + 5
v̇₅ = v̇₄ − v̇₃ = 5.5 − 0
ẏ = v̇₅ = 5.5
We associate with each intermediate variable vᵢ a derivative v̇ᵢ = ∂vᵢ/∂x1. Applying the chain rule to each elementary operation in the forward evaluation trace, we generate the corresponding derivative trace, given on the right-hand side of Table 2. Evaluating the variables vᵢ one by one together with their corresponding v̇ᵢ values gives us the required derivative in the final variable: ∂y/∂x1 = v̇₅ = 5.5.
Automatic differentiation in machine learning: a survey, Baydin et al., 2015
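To make the forward trace concrete, here is a minimal dual-number sketch of forward-mode AD (a toy implementation, not the Stan math library used on the next slide): each value carries a primal and a tangent, and every elementary operation propagates both, reproducing the two traces in Table 2.

// Forward-mode AD with dual numbers: each Dual holds (v_i, v_dot_i).
#include <cmath>
#include <iostream>

struct Dual {
  double v;  // primal value v_i
  double d;  // tangent v_dot_i = d v_i / d x1
};

Dual operator+(Dual a, Dual b) { return {a.v + b.v, a.d + b.d}; }
Dual operator-(Dual a, Dual b) { return {a.v - b.v, a.d - b.d}; }
Dual operator*(Dual a, Dual b) { return {a.v * b.v, a.d * b.v + a.v * b.d}; }
Dual log(Dual a) { return {std::log(a.v), a.d / a.v}; }
Dual sin(Dual a) { return {std::sin(a.v), a.d * std::cos(a.v)}; }

int main() {
  Dual x1{2.0, 1.0};  // seed x1_dot = 1 to compute dy/dx1
  Dual x2{5.0, 0.0};
  Dual y = log(x1) + x1 * x2 - sin(x2);
  std::cout << "y = " << y.v << ", dy/dx1 = " << y.d << std::endl;
  // Prints y = 11.6521, dy/dx1 = 5.5, matching the table above.
  return 0;
}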
28.
#include <stan/math.hpp>

int main() {
  using namespace std;
  // Declare reverse-mode autodiff variables seeded with the input values.
  stan::math::var x1 = 2, x2 = 5;
  stan::math::var f;
  f = log(x1) + x1 * x2 - sin(x2);
  cout << "f(x1, x2) = " << f.val() << endl;
  // Propagate adjoints from f back to x1 and x2.
  f.grad();
  cout << "df/dx1 = " << x1.adj() << endl
       << "df/dx2 = " << x2.adj() << endl;
  return 0;
}
The Stan math library, Carpenter et al., 2015
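Compiled against the Stan math headers, this program should print f(x1, x2) ≈ 11.652 and df/dx1 = 5.5, matching the forward-mode trace on the previous slide, along with df/dx2 = x1 − cos x2 ≈ 1.72.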
30. Stochastic Optimization
Follow noisy unbiased gradients.
[Figure 8.8 from Murphy: illustration of the LMS algorithm. Left: black line = LMS trajectory of the weights (w0, w1) towards the least squares solution θ̂ = (1.45, 0.92) (red cross). Right: objective (RSS) vs. iteration; note that it does not decrease monotonically.]
Scale up by subsampling the data at each step.
Machine Learning: a Probabilistic Perspective, Murphy, 2012
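A minimal sketch of this idea, assuming a synthetic linear-regression data set: at each step the gradient is computed on one randomly subsampled point, so it is noisy but unbiased, and the iterates still drift toward the least squares solution, as in the LMS figure above. The data, step size, and iteration count are illustrative choices, not Murphy's example.

// Stochastic gradient descent on squared error, one subsampled point per step.
#include <iostream>
#include <random>
#include <vector>

int main() {
  // Synthetic data from y = 1.0 + 2.0 * x + noise (assumed for illustration).
  std::mt19937 rng(0);
  std::normal_distribution<double> noise(0.0, 0.1);
  std::vector<double> x(100), y(100);
  for (int i = 0; i < 100; ++i) {
    x[i] = i / 100.0;
    y[i] = 1.0 + 2.0 * x[i] + noise(rng);
  }

  double w0 = -0.5, w1 = 2.0;                      // initial weights
  std::uniform_int_distribution<int> pick(0, 99);  // subsample one data point
  const double eta = 0.1;                          // step size
  for (int k = 0; k < 10000; ++k) {
    int i = pick(rng);
    double err = w0 + w1 * x[i] - y[i];  // residual on the sampled point
    w0 -= eta * err;                     // noisy but unbiased gradient step
    w1 -= eta * err * x[i];
  }
  std::cout << "w0 = " << w0 << ", w1 = " << w1 << std::endl;  // near (1, 2)
  return 0;
}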
32. ADVI (Automatic Differentiation Variational Inference)
An easy-to-use, scalable, flexible algorithm
mc-stan.org
Stan is a probabilistic programming system.
1. Write the model in a simple language.
2. Provide data.
3. Run.
RStan, PyStan, Stan.jl, ...
34. Exploring Taxi Rides
Data: 1.7 million taxi rides
Write down a pPCA model. (∼minutes)
Use ADVI to infer subspace. (∼hours)
Project data into pPCA subspace. (∼minutes)
Write down a mixture model. (∼minutes)
Use ADVI to find patterns. (∼minutes)
Write down a supervised pPCA model. (∼minutes)
Repeat. (∼hours)
What would have taken us weeks → a single day.
35.
[Diagram: a statistical model and data feed into an automatic tool, which reveals hidden patterns instantly; revise the model and repeat.]
Monte Carlo Statistical Methods, Robert and Casella, 1999
Monte Carlo and Quasi-Monte Carlo Sampling, Lemieux, 2009
Automatic differentiation in machine learning: a survey, Baydin et al., 2015
The Stan math library, Carpenter et al., 2015
Machine Learning: a Probabilistic Perspective, Murphy, 2012
Automatic differentiation variational inference, Kucukelbir et al., 2016
proditus.com mc-stan.org Thank you!