6.
We evaluate the generative performance, measured in terms of test log-likelihood, when compared to equally or more complicated methods for creating flexible variational distributions such as the Variational Gaussian Processes (Tran et al., 2015), Normalizing Flows (Rezende & Mohamed, 2015) or Importance Weighted Autoencoders (Burda et al., 2015). We find that batch normalization and warm-up always increase the generative performance, suggesting that these methods are broadly useful. We also show that the probabilistic ladder network performs as good or better than strong VAEs, making it an interesting model for further studies. Secondly, we study the learned latent representations. We find that the methods proposed here are necessary for learning rich latent representations utilizing several layers of latent variables. A qualitative assessment of the latent representations further indicates that the multi-layered DGMs capture high-level structure in the datasets which is likely to be useful for semi-supervised learning.

In summary our contributions are:
• A new parametrization of the VAE inspired by the ladder network, performing as well or better than the current best models.
• A novel warm-up period in training, increasing both the generative performance across several different datasets and the number of active stochastic latent units.
• We show that batch normalization is essential for training VAEs with several layers of stochastic latent variables.
Figure 2. Flow of information in the inference and generative models of a) probabilistic ladder network and b) VAE. The probabilistic ladder network allows direct integration (+ in figure, see Eq. (21)) of bottom-up and top-down information in the inference model. In the VAE the top-down information is incorporated indirectly through the conditional priors in the generative model.

The generative model p_θ is specified as follows:

p_θ(x|z₁) = N(x | μ_θ(z₁), σ²_θ(z₁))  or  (1)
p_θ(x|z₁) = B(x | μ_θ(z₁))  (2)

for continuous-valued (Gaussian N) or binary-valued (Bernoulli B) data, respectively. The latent variables z are split into L layers z_i, i = 1 … L:

p_θ(z_i|z_{i+1}) = N(z_i | μ_{θ,i}(z_{i+1}), σ²_{θ,i}(z_{i+1}))  (3)
p_θ(z_L) = N(z_L | 0, I).  (4)

The hierarchical specification allows the lower layers of the latent variables to be highly correlated but still maintain the computational efficiency of fully factorized models.

Each layer in the inference model q_φ(z|x) is specified using a fully factorized Gaussian distribution:

q_φ(z₁|x) = N(z₁ | μ_{φ,1}(x), σ²_{φ,1}(x))  (5)
q_φ(z_i|z_{i−1}) = N(z_i | μ_{φ,i}(z_{i−1}), σ²_{φ,i}(z_{i−1}))  (6)
7. ¤ 𝒩(x|μ, σ²)
¤ ℬ(x|μ)
for i = 2 … L. Functions μ(·) and σ²(·) in both the generative and the inference models are implemented as:
d(y) = MLP(y)  (7)
μ(y) = Linear(d(y))  (8)
σ²(y) = Softplus(Linear(d(y))),  (9)

where MLP is a two-layered multilayer perceptron network, Linear is a single linear layer, and Softplus applies the log(1 + exp(·)) non-linearity to each component of its argument vector. In our notation, each MLP(·) or Linear(·) gives a new mapping with its own parameters, so the deterministic variable d is used to mark that the MLP part is shared between μ and σ² whereas the last Linear layer is not shared.
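The shared-MLP parameterization of Eqs. (7)-(9) can be sketched directly; a minimal NumPy version, where the tanh nonlinearity, layer sizes and random weights are illustrative placeholders rather than the paper's settings:

```python
import numpy as np

def softplus(a):
    # log(1 + exp(a)), computed stably
    return np.logaddexp(0.0, a)

def make_gaussian_head(in_dim, hidden_dim, out_dim, rng):
    """Eqs. (7)-(9): a shared two-layered MLP d(y), followed by two
    separate Linear heads producing mu and sigma^2."""
    W = {
        "mlp1": rng.standard_normal((in_dim, hidden_dim)) * 0.1,
        "mlp2": rng.standard_normal((hidden_dim, hidden_dim)) * 0.1,
        "mu": rng.standard_normal((hidden_dim, out_dim)) * 0.1,
        "var": rng.standard_normal((hidden_dim, out_dim)) * 0.1,
    }

    def head(y):
        d = np.tanh(y @ W["mlp1"])      # d(y) = MLP(y), Eq. (7); shared part
        d = np.tanh(d @ W["mlp2"])
        mu = d @ W["mu"]                # mu(y) = Linear(d(y)), Eq. (8)
        var = softplus(d @ W["var"])    # sigma^2(y) = Softplus(Linear(d(y))), Eq. (9)
        return mu, var

    return head

rng = np.random.default_rng(0)
head = make_gaussian_head(in_dim=784, hidden_dim=512, out_dim=64, rng=rng)
mu, var = head(rng.standard_normal((8, 784)))
assert mu.shape == var.shape == (8, 64)
assert np.all(var > 0)  # Softplus guarantees a positive variance
```

The Softplus output head is what makes the variance strictly positive without any clipping.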
Abstract
Variational autoencoders are a powerful frame-
work for unsupervised learning. However, pre-
vious work has been restricted to shallow mod-
els with one or two layers of fully factorized
stochastic latent variables, limiting the flexibil-
ity of the latent representation. We propose three
advances in training algorithms of variational autoencoders, for the first time allowing us to train deep models of up to five stochastic layers: (1)
using a structure similar to the Ladder network
as the inference model, (2) a warm-up period to
support stochastic units staying active in early
training, and (3) use of batch normalization. Us-
ing these improvements we show state-of-the-art
log-likelihood results for generative modeling on
several benchmark datasets.
1. Introduction
The recently introduced variational autoencoder (VAE)
(Kingma & Welling, 2013; Rezende et al., 2014) provides
a framework for deep generative models (DGM). DGMs
have later been shown to be a powerful framework for
semi-supervised learning (Kingma et al., 2014; Maaløe et al., 2016).
Figure 1. Inference (or encoder/recognition) and generative (or decoder) models: a) VAE inference model, b) probabilistic ladder inference model and c) generative model. Latent variables are sampled from the approximate posterior with mean and variances parameterized by deterministic variables, each conditioned on the layer above, giving a highly flexible latent distribution. Two model parameterizations are studied: the first extends the VAE to multiple layers of latent variables, the second is parameterized in such a way that it can be seen as a probabilistic variational variant of the ladder network.
arXiv:1602.02282v1 [stat.ML]
12. VAE
¤ Extensions of the VAE
¤ CVAE [Kingma+ 2014]
¤ Variational Fair Auto Encoder [Louizos+ 15]
¤ makes the representation insensitive to a sensitive variable X using an MMD penalty
The inference network q (z|x) (3) is used during training of the model using both the labelled and
unlabelled data sets. This approximate posterior is then used as a feature extractor for the labelled
data set, and the features used for training the classifier.
3.1.2 Generative Semi-supervised Model Objective
For this model, we have two cases to consider. In the first case, the label corresponding to a data
point is observed and the variational bound is a simple extension of equation (5):
log p_θ(x, y) ≥ E_{q_φ(z|x,y)} [log p_θ(x|y, z) + log p_θ(y) + log p(z) − log q_φ(z|x, y)] = −L(x, y),  (6)
For the case where the label is missing, it is treated as a latent variable over which we perform posterior inference and the resulting bound for handling data points with an unobserved label y is:

log p_θ(x) ≥ E_{q_φ(y,z|x)} [log p_θ(x|y, z) + log p_θ(y) + log p(z) − log q_φ(y, z|x)]
          = Σ_y q_φ(y|x)(−L(x, y)) + H(q_φ(y|x)) = −U(x).  (7)
The bound on the marginal likelihood for the entire dataset is now:

J = Σ_{(x,y)∼p̃_l} L(x, y) + Σ_{x∼p̃_u} U(x)  (8)
The distribution q_φ(y|x) (4) for the missing labels has the form of a discriminative classifier, and we can use this knowledge to construct the best classifier possible as our inference model. This distribution is also used at test time for predictions of any unseen data.

In the objective function (8), the label predictive distribution q_φ(y|x) contributes only to the second term relating to the unlabelled data, which is an undesirable property if we wish to use this distribution as a classifier. Ideally, all model and variational parameters should learn in all cases. To remedy this, a classification loss is added to the objective so that q_φ(y|x) also learns directly from the labelled data.
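The unlabelled bound of Eq. (7) is computable by enumerating the classes; a small NumPy sketch with made-up per-class bounds −L(x, y):

```python
import numpy as np

def unlabelled_bound(log_qy, neg_L):
    """The unlabelled-data bound of Eq. (7): the missing label is summed out,
    giving sum_y q(y|x) * (-L(x, y)) plus the entropy H(q(y|x))."""
    qy = np.exp(log_qy)
    entropy = -np.sum(qy * log_qy)        # H(q(y|x))
    return np.sum(qy * neg_L) + entropy

# toy example: 3 classes, made-up per-class lower bounds -L(x, y)
logits = np.array([1.0, 0.0, -1.0])
log_qy = logits - np.log(np.sum(np.exp(logits)))   # normalized log q(y|x)
neg_L = np.array([-120.0, -123.0, -125.0])
bound = unlabelled_bound(log_qy, neg_L)

# a convex combination of the -L(x, y) plus a nonnegative entropy term,
# so the bound lies between min(-L) and max(-L) + log(number of classes)
assert neg_L.min() <= bound <= neg_L.max() + np.log(3)
```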
13. Auxiliary Deep Generative Models
¤ Auxiliary Deep Generative Models [Maaløe+ 16]
¤ Auxiliary variables [Agakov+ 2004]
¤ state-of-the-art semi-supervised performance
Auxiliary Deep Generative Models

2. Auxiliary deep generative models
Kingma (2013); Rezende et al. (2014) have coupled the approach of variational inference with deep learning, giving rise to powerful probabilistic models constructed by an inference neural network q(z|x) and a generative network p(x|z). This approach can be perceived as a variational equivalent to the deep auto-encoder, in which q(z|x) acts as the encoder and p(x|z) the decoder. However, the difference is that these models ensure efficient inference over various continuous distributions in the latent space z and complex input datasets x, where the posterior distribution p(x|z) is intractable. Furthermore, the gradients of the variational upper bound are easily defined by backpropagation through the network(s). To keep the computational requirements low the variational distribution q(z|x) is usually chosen to be a diagonal Gaussian, limiting the expressive power of the inference model.

In this paper we propose a variational auxiliary variable approach (Agakov and Barber, 2004) to improve the variational distribution: The generative model is extended with variables a to p(x, z, a) such that the original model is invariant to marginalization over a: p(x, z, a) = p(a|x, z)p(x, z). In the variational distribution, on the other hand, a is used such that the marginal q(z|x) = ∫ q(z|a, x)p(a|x) da is a general non-Gaussian distribution. This hierarchical specification allows the latent variables to be correlated through a, while maintaining the computational efficiency of fully factorized models.
Figure 1. Probabilistic graphical model of the ADGM for semi-supervised learning. (a) Generative model P; (b) inference model Q. The incoming joint connections to each variable are deep neural networks with parameters θ and φ.
2.2. Auxiliary variables
We propose to extend the variational distribution with auxiliary variables a: q(a, z|x) = q(z|a, x)q(a|x) such that the marginal distribution q(z|x) can fit more complicated posteriors p(z|x). In order to have an unchanged generative model, p(x|z), it is required that the joint p(x, z, a) gives back the original p(x, z) under marginalization over a, thus p(x, z, a) = p(a|x, z)p(x, z). Auxiliary variables are used in the EM algorithm and Gibbs sampling and have previously been considered for variational learning by Agakov and Barber (2004).
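To see why marginalizing the auxiliary variable yields a non-Gaussian q(z|x), one can ancestral-sample a ∼ q(a|x) and then z ∼ q(z|a, x); a toy sketch where both conditionals are Gaussian (the tanh mean function and all numbers are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_q_z_given_x(n):
    """Ancestral sampling from the auxiliary-variable posterior:
    a ~ q(a|x), then z ~ q(z|a, x). Both conditionals are Gaussian,
    but marginalizing a yields a non-Gaussian q(z|x)."""
    a = rng.normal(0.0, 1.0, size=n)   # q(a|x) = N(0, 1)
    mu_z = np.tanh(3.0 * a)            # mean of q(z|a, x) depends nonlinearly on a
    z = rng.normal(mu_z, 0.1)          # q(z|a, x) = N(tanh(3a), 0.1^2)
    return z

z = sample_q_z_given_x(20000)
# tanh pushes mass toward -1 and +1, so the marginal is bimodal:
# most samples sit well away from 0, unlike a single Gaussian
assert np.mean(np.abs(z) > 0.5) > 0.6
```

Each conditional stays cheap to evaluate, yet the marginal over z is a continuous mixture, which is the point of the construction.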
15. Importance Weighted AE
¤ Importance Weighted Autoencoder [Burda+ 15]
¤ a tighter bound on log p(x) using k importance weighted samples
¤ Rényi divergence view [Li+ 16]

log p(x) ≥ E_{z⁽¹⁾,…,z⁽ᵏ⁾∼q_φ(z|x)} [ log (1/k) Σ_{i=1}^{k} p_θ(x, z⁽ⁱ⁾) / q_φ(z⁽ⁱ⁾|x) ] ≥ L(θ, φ; x)
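The k-sample importance weighted bound can be estimated with a stable log-mean-exp over log-weights; a toy NumPy sketch (the log-weights are synthetic, not from a trained model):

```python
import numpy as np

def iwae_bound(log_w):
    """log( (1/k) * sum_i w_i ) with w_i = p(x, z_i)/q(z_i|x), evaluated
    from log-weights along the last axis with a stable log-mean-exp."""
    m = np.max(log_w, axis=-1, keepdims=True)
    return np.squeeze(m, -1) + np.log(np.mean(np.exp(log_w - m), axis=-1))

rng = np.random.default_rng(0)
log_w = rng.normal(-100.0, 1.0, size=(5000,))  # toy importance log-weights

elbo = np.mean(log_w)                              # k = 1: the standard ELBO estimate
l5 = np.mean(iwae_bound(log_w.reshape(1000, 5)))   # k = 5
l50 = np.mean(iwae_bound(log_w.reshape(100, 50)))  # k = 50

# the bound tightens as k grows (Burda et al., 2015); here it even holds
# pointwise, because a log of an average dominates the average of logs
assert elbo <= l5 <= l50
```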
Recall from Section 2.1 that the family of Rényi divergences includes the KL divergence. Perhaps can variational free-energy approaches be generalised to the Rényi case? Consider approximating the true posterior p(θ|D) by minimizing Rényi's α-divergence for some selected α ≥ 0:

q(θ) = arg min_{q∈Q} D_α[q(θ)||p(θ|D)].

Now we verify the alternative optimization problem

q(θ) = arg max_{q∈Q} { log p(D) − D_α[q(θ)||p(θ|D)] }.

When α ≠ 1, the objective can be rewritten as

log p(D) − 1/(α−1) log ∫ q(θ)^α p(θ|D)^{1−α} dθ
= log p(D) − 1/(α−1) log E_q[ ( p(θ, D) / (q(θ)p(D)) )^{1−α} ]
= 1/(1−α) log E_q[ ( p(θ, D) / q(θ) )^{1−α} ] := L_α(q; D).

We name this new objective the variational Rényi bound (VR). Importantly the following theorem is a direct result of Proposition 1.

Theorem 1. The objective L_α(q; D) is continuous and non-increasing on α ∈ [0, 1] ∪ {α : |L_α| < +∞}. Especially for all 0 < α < 1,
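The VR bound L_α can be estimated from the same kind of importance log-weights; a toy sketch that also checks the monotonicity claimed in Theorem 1 (synthetic weights, not a real posterior):

```python
import numpy as np

def vr_bound(log_w, alpha):
    """Monte Carlo estimate of L_alpha = 1/(1-alpha) log E_q[(p/q)^(1-alpha)]
    from log-weights log_w = log p(theta, D) - log q(theta)."""
    if np.isclose(alpha, 1.0):
        return np.mean(log_w)          # alpha -> 1 recovers the usual ELBO
    s = (1.0 - alpha) * log_w
    m = np.max(s)
    return (m + np.log(np.mean(np.exp(s - m)))) / (1.0 - alpha)

rng = np.random.default_rng(2)
log_w = rng.normal(-50.0, 2.0, size=10000)

l0 = vr_bound(log_w, 0.0)
l_half = vr_bound(log_w, 0.5)
l1 = vr_bound(log_w, 1.0)

# Theorem 1: L_alpha is non-increasing in alpha; the estimate is a
# log power-mean of the weights, monotone in the exponent 1 - alpha
assert l1 <= l_half <= l0
```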
16. Normalizing flows
¤ Normalizing flows [Rezende+ 15]
¤ Variational Gaussian Process [Tran+ 15]
Variational Inference with Normalizing Flows
and involve matrix inverses that can be numerically unsta-
ble. We therefore require normalizing flows that allow for
low-cost computation of the determinant, or where the Ja-
cobian is not needed at all.
4.1. Invertible Linear-time Transformations
We consider a family of transformations of the form:
f(z) = z + u h(wᵀz + b),  (10)

where λ = {w ∈ ℝᴰ, u ∈ ℝᴰ, b ∈ ℝ} are free parameters and h(·) is a smooth element-wise non-linearity, with derivative h′(·). For this mapping we can compute the logdet-Jacobian term in O(D) time (using the matrix determinant lemma):

ψ(z) = h′(wᵀz + b) w  (11)
|det ∂f/∂z| = |det(I + u ψ(z)ᵀ)| = |1 + uᵀψ(z)|.  (12)

From (7) we conclude that the density q_K(z) obtained by transforming an arbitrary initial density q₀(z) through the sequence of maps f_k of the form (10) is implicitly given by:

z_K = f_K ∘ f_{K−1} ∘ … ∘ f₁(z)
ln q_K(z_K) = ln q₀(z) − Σ_{k=1}^{K} ln |1 + u_kᵀψ_k(z_k)|.  (13)
The flow defined by the transformation (13) modifies the
initial density q0 by applying a series of contractions and
expansions in the direction perpendicular to the hyperplane
wᵀz + b = 0, hence we refer to these maps as planar flows.
As an alternative, we can consider a family of transforma-
tions that modify an initial density q0 around a reference
point z₀. The transformation family is:

f(z) = z + β h(α, r)(z − z₀),  (14)

where r = ‖z − z₀‖ and h(α, r) = 1/(α + r).
Figure 1. Effect of normalizing flow on two distributions: planar and radial flows with K = 1, 2, 10 applied to a unit Gaussian and a uniform initial density q₀.
Inference network Generative model
Figure 2. Inference and generative models. Left: Inference net-
work maps the observations to the parameters of the flow; Right:
generative model which receives the posterior samples from the
inference network during training time. Round containers repre-
sent layers of stochastic variables whereas square containers rep-
resent deterministic layers.
4.2. Flow-Based Free Energy Bound
If we parameterize the approximate posterior distribution with a flow of length K, q_φ(z|x) := q_K(z_K), the free energy (3) can be written as an expectation over the initial distribution q₀(z):

F(x) = E_{q_φ(z|x)}[log q_φ(z|x) − log p(x, z)]
     = E_{q₀(z₀)}[ln q_K(z_K) − log p(x, z_K)]
     = E_{q₀(z₀)}[ln q₀(z₀)] − E_{q₀(z₀)}[log p(x, z_K)] − E_{q₀(z₀)}[ Σ_{k=1}^{K} ln |1 + u_kᵀψ_k(z_k)| ].
distribution:

q(z′) = q(z) |det ∂f⁻¹/∂z′| = q(z) |det ∂f/∂z|⁻¹,  (5)
where the last equality can be seen by applying the chain
rule (inverse function theorem) and is a property of Jaco-
bians of invertible functions. We can construct arbitrarily
complex densities by composing several simple maps and
successively applying (5). The density qK(z) obtained by
successively transforming a random variable z0 with distri-
bution q0 through a chain of K transformations fk is:
z_K = f_K ∘ … ∘ f₂ ∘ f₁(z₀)  (6)
ln q_K(z_K) = ln q₀(z₀) − Σ_{k=1}^{K} ln |det ∂f_k/∂z_{k−1}|,  (7)
where equation (6) will be used throughout the paper as a
shorthand for the composition f_K(f_{K−1}(… f₁(x))). The path traversed by the random variables z_k = f_k(z_{k−1}) with initial distribution q₀(z₀) is called the flow and the path formed by the successive distributions q_k is a normalizing flow. A property of such transformations, often referred to as the law of the unconscious statistician (LOTUS), is that expectations w.r.t. the transformed density q_K can be computed without explicitly knowing q_K. Any expectation E_{q_K}[h(z)] can be written as an expectation under q₀ as:

E_{q_K}[h(z)] = E_{q₀}[h(f_K ∘ f_{K−1} ∘ … ∘ f₁(z₀))],  (8)

which does not require computation of the logdet-Jacobian terms when h(z) does not depend on q_K.
We can understand the effect of invertible flows as a sequence of expansions or contractions on the initial density. For an expansion, the map z′ = f(z) pulls the points z away from a region in ℝᵈ, reducing the density in that region while increasing the density outside the region.
Under review as a conference paper at ICLR 2016
Figure 1: (a) Graphical model of the variational Gaussian process. The VGP generates samples of
latent variables z by evaluating random non-linear mappings of latent inputs ⇠, and then drawing
mean-field samples parameterized by the mapping. These latent variables aim to follow the posterior
distribution for a generative model (b), conditioned on data x.
20. Ladder network
¤ Ladder network [Valpola, 14][Rasmus+ 15]
¤ probabilistic ladder network
Figure 2. Flow of information in the inference and generative models of a) probabilistic ladder network and b) VAE. The probabilistic ladder network combines a deterministic bottom-up pathway with a stochastic top-down pathway, directly integrating (+ in figure, see Eq. (21)) the bottom-up "likelihood" and top-down "prior" information into the "posterior" in the inference model. In the VAE the top-down information is incorporated indirectly through the conditional priors in the generative model.
21. Top-down
¤ bottom-up inference
¤ top-down generation

E_{z∼q_φ(z|x)} log [ p(z_L) p(z_{L−1}|z_L) ⋯ p(x|z₁) / ( q(z₁|x) q(z₂|z₁) ⋯ q(z_L|z_{L−1}) ) ]

z₁ ∼ q(z₁|x),  z₂ ∼ q(z₂|z₁)
p(z₂),  p(z₁|z₂),  p(x|z₁)
22.
The variational principle provides a tractable lower bound on the log likelihood which can be used as a training criterion L:

log p(x) ≥ E_{q_φ(z|x)} [ log p_θ(x, z) / q_φ(z|x) ] = L(θ, φ; x)  (10)
         = −KL(q_φ(z|x)||p_θ(z)) + E_{q_φ(z|x)}[log p_θ(x|z)],  (11)

where KL is the Kullback-Leibler divergence. A tighter bound on the likelihood may be obtained at the expense of a K-fold increase of samples by using the importance weighted bound (Burda et al., 2015):

log p(x) ≥ E_{q_φ(z⁽¹⁾|x)} ⋯ E_{q_φ(z⁽ᴷ⁾|x)} [ log (1/K) Σ_{k=1}^{K} p_θ(x, z⁽ᵏ⁾) / q_φ(z⁽ᵏ⁾|x) ]  (12)

The generative and inference parameters, θ and φ, are jointly trained by optimizing Eq. (11) using stochastic gradient descent, where we use the reparametrization trick for backpropagation through the Gaussian latent variables (Kingma & Welling, 2013; Rezende et al., 2014). The expectations are estimated using Monte Carlo samples from the corresponding q distribution.
The probabilistic ladder network uses a structure similar to the ladder network (Valpola, 2014; Rasmus et al., 2015) as the inference model of a VAE, as shown in Figure 1. The generative model is the same as before.

The inference is constructed to first make a deterministic upward pass:

d₁ = MLP(x)  (13)
μ_{d,i} = Linear(d_i),  i = 1 … L  (14)
σ²_{d,i} = Softplus(Linear(d_i)),  i = 1 … L  (15)
d_i = MLP(μ_{d,i−1}),  i = 2 … L  (16)

followed by a stochastic downward pass:

q_φ(z_L|x) = N(μ_{d,L}, σ²_{d,L})  (17)
t_i = MLP(z_{i+1}),  i = 1 … L−1  (18)
μ_{t,i} = Linear(t_i)  (19)
σ²_{t,i} = Softplus(Linear(t_i))  (20)
q_θ(z_i|z_{i+1}, x) = N( (μ_{t,i} σ⁻²_{t,i} + μ_{d,i} σ⁻²_{d,i}) / (σ⁻²_{t,i} + σ⁻²_{d,i}),  1 / (σ⁻²_{t,i} + σ⁻²_{d,i}) ).  (21)
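The precision-weighted combination of Eq. (21) is a product of two Gaussian factors; a minimal NumPy sketch with made-up bottom-up (d) and top-down (t) statistics:

```python
import numpy as np

def precision_weighted(mu_t, var_t, mu_d, var_d):
    """Eq. (21): combine top-down (t) and bottom-up (d) Gaussians by
    precision weighting, i.e. a product of Gaussian experts."""
    prec_t, prec_d = 1.0 / var_t, 1.0 / var_d
    var = 1.0 / (prec_t + prec_d)
    mu = (mu_t * prec_t + mu_d * prec_d) * var
    return mu, var

mu_t, var_t = np.array([0.0]), np.array([1.0])
mu_d, var_d = np.array([2.0]), np.array([1.0])
mu, var = precision_weighted(mu_t, var_t, mu_d, var_d)

# equal precisions: the combined mean is the midpoint, variance is halved
assert np.allclose(mu, 1.0) and np.allclose(var, 0.5)

# a confident bottom-up term dominates the combination
mu2, _ = precision_weighted(mu_t, np.array([100.0]), mu_d, np.array([1e-4]))
assert abs(mu2[0] - 2.0) < 1e-3
```

This is the "+" integration point in Figure 2: whichever pathway is more certain (higher precision) contributes more to the posterior.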
PRML
24.
¤ [MacKay 01]

L(θ, φ; x) = −KL[q_φ(z|x) ∥ p_θ(z)] + E_{q_φ(z|x)}[log p_θ(x|z)]
phenomenon. The probabilistic ladder network provides
a framework with the wanted interaction, while keeping
complications manageable. A further extension could be
to make the inference in k steps over an iterative inference
procedure (Raiko et al., 2014).
2.2. Warm-up from deterministic to variational
autoencoder
The variational training criterion in Eq. (11) contains the
reconstruction term p✓(x|z) and the variational regular-
ization term. The variational regularization term causes
some of the latent units to become inactive during train-
ing (MacKay, 2001) because the approximate posterior for
unit k, q(zi,k| . . . ) is regularized towards its own prior
p(zi,k| . . . ), a phenomenon also recognized in the VAE set-
ting (Burda et al., 2015). This can be seen as a virtue
of automatic relevance determination, but also as a prob-
lem when many units are pruned away early in training
before they have learned a useful representation. We observed that such units remain inactive for the rest of the training, presumably trapped in a local minimum or saddle point at KL(q_{i,k}|p_{i,k}) ≈ 0, with the optimization algorithm unable to re-activate them.
25.
¤ → Warm-up
¤ Batch Normalization [Ioffe+ 15]
We propose to alleviate the problem by initializing train-
ing using the reconstruction error only (corresponding to
training a standard deterministic auto-encoder), and then
gradually introducing the variational regularization term:
L(θ, φ; x)_T = −β KL(q_φ(z|x)||p_θ(z)) + E_{q_φ(z|x)}[log p_θ(x|z)],  (22)

where β is increased linearly from 0 to 1 during the first N_t
epochs of training. We denote this scheme warm-up (abbreviated WU in tables and graphs) because the objective goes from having a delta-function solution (corresponding to zero temperature) and then moves towards the fully stochastic variational objective. A similar idea has previously been considered in Raiko et al. (2007, Section 6.2), however there it was used for Bayesian models trained with a coordinate descent algorithm.
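The warm-up of Eq. (22) amounts to scaling the KL term by a coefficient β ramped linearly over the first N_t epochs; a minimal sketch (the epoch numbers are illustrative):

```python
def warmup_beta(epoch, n_t):
    """Linear warm-up: beta rises from 0 to 1 over the first n_t epochs
    and stays at 1 afterwards (Eq. 22)."""
    return min(1.0, epoch / n_t)

def warmed_up_objective(recon_term, kl_term, epoch, n_t=200):
    """-beta * KL + E_q[log p(x|z)]; at beta = 0 this reduces to the
    reconstruction objective of a plain deterministic autoencoder."""
    return recon_term - warmup_beta(epoch, n_t) * kl_term

assert warmup_beta(0, 200) == 0.0       # pure reconstruction at the start
assert warmup_beta(100, 200) == 0.5
assert warmup_beta(1000, 200) == 1.0    # full variational objective
assert warmed_up_objective(-80.0, 20.0, 0) == -80.0
```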
The latent layer sizes were 64, 32, 16, 8 and 4, going from bottom to top, with all mappings parameterized using two-layered MLPs. Shallower models were created by removing latent variables from the top of the hierarchy, and we sometimes refer to the models by their depth, e.g. the four layer model. All models were trained using the Adam (Kingma & Ba, 2014) optimizer. The reported test log-likelihoods were approximated using Eq. (12) with 5000 importance weighted samples, as in Burda et al. (2015). The models were implemented using Theano (Bastien et al., 2012) and the Parmesan framework.

For MNIST we used a sigmoid output layer to predict the mean of a Bernoulli observation model and leaky rectifiers (max(x, 0.1x)) as nonlinearities in the MLPs. The models were trained for 2000 epochs on the complete training set with Nt = 200. Similar to Burda et al. (2015) we resample the binarized training images using a Bernoulli distribution.
27. MNIST
¤ VAE 2
¤ Batch normalization & warm-up
¤ probabilistic ladder network
How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks
Figure 3. MNIST test-set log-likelihood values for VAEs and the probabilistic ladder networks with different number of latent layers, batch normalization BN and warm-up WU.
Figure 4. MNIST train (full lines) and test performance during training for VAE and VAE+BN. The test set performance was estimated using 5000 importance weighted samples, providing a tighter bound than the training bound and explaining the better performance here.
28.
¤ Monte Carlo (MC) and importance weighted (IW) samples
¤ permutation invariant MNIST
¤ -82.90 [Burda+ 15]
¤ -81.90 [Tran+ 15]
Table 1. Fine-tuned test log-likelihood values for 5 layered VAE and probabilistic ladder networks trained on MNIST. ANN. LR: annealed learning rate, MC: Monte Carlo samples to approximate E_q(·)[·], IW: importance weighted samples.

FINETUNING      NONE     ANN. LR.   ANN. LR.   ANN. LR.   ANN. LR.
                         MC=1       MC=10      MC=1       MC=10
                         IW=1       IW=1       IW=10      IW=10
VAE             -82.14   -81.97     -81.84     -81.41     -81.30
PROB. LADDER    -81.87   -81.54     -81.46     -81.35     -81.20
training in deep neural networks by normalizing the outputs
from each layer. We show that batch normalization (abbre-
viated BN in tables and graphs), applied to all layers except
the output layers, is essential for learning deep hierarchies
of latent variables for L > 2.
29. ¤ KL > 0.01
¤ VAE
¤ Batch normalization (BN)
¤ Warm-up (WU)
¤ Probabilistic ladder network
Table 2. Number of active latent units in five layer VAE and probabilistic ladder networks trained on MNIST. A unit was defined as active if KL(q_{i,k}||p_{i,k}) > 0.01.

           VAE   VAE+BN   VAE+BN+WU   PROB. LADDER+BN+WU
LAYER 1    20    20       34          46
LAYER 2    1     9        18          22
LAYER 3    0     3        6           8
LAYER 4    0     3        2           3
LAYER 5    0     2        1           2
TOTAL      21    37       61          81
importance weighted samples to 10 to reduce the variance
in the approximation of the expectations in Eq. (10) and
improve the inference model, respectively.
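The active-unit criterion of Table 2 can be computed from the closed-form KL of diagonal Gaussians; this sketch assumes a standard normal prior for simplicity (in the deeper layers the prior is conditional) and uses made-up posterior statistics:

```python
import numpy as np

def kl_unit(mu, var):
    """KL(N(mu, var) || N(0, 1)) per latent unit, averaged over a batch."""
    kl = 0.5 * (mu**2 + var - np.log(var) - 1.0)
    return kl.mean(axis=0)   # average over the batch dimension

def count_active(mu, var, threshold=0.01):
    """Table 2's criterion: a unit counts as active if its KL exceeds 0.01."""
    return int(np.sum(kl_unit(mu, var) > threshold))

rng = np.random.default_rng(0)
n, d = 256, 8
mu = np.zeros((n, d))
var = np.ones((n, d))
mu[:, :3] = rng.normal(0.0, 2.0, size=(n, 3))   # 3 units carry information
assert count_active(mu, var) == 3               # the rest sit at KL = 0
```

Units whose posterior collapses to the prior contribute nothing to the representation, which is exactly what warm-up and batch normalization are meant to prevent.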
Models trained on the OMNIGLOT dataset, consisting of 28x28 binary images, were trained similarly to the above except that the number of training epochs was 1500.
Models trained on the NORB dataset5
, consisting of 32x32
grays-scale images with color-coding rescaled to [0, 1],
used a Gaussian observation model with the mean and variance parameterized by a linear and a softplus output layer, respectively. The settings were otherwise similar to the models above, except that the hyperbolic tangent was used as the nonlinearity, the learning rate was 0.002, Nt = 1000, and the number of training epochs was 4000.
Table 3. Test set log-likelihood values for models trained on the
OMNIGLOT and NORB datasets. The leftmost column shows the
dataset and the number of latent variables in each model.

               VAE       VAE+BN    VAE+BN+WU   PROB. LADDER+BN+WU
OMNIGLOT
64             -114.45   -108.79   -104.63
64-32          -112.60   -106.86   -102.03     -102.12
64-32-16       -112.13   -107.09   -101.60     -101.26
64-32-16-8     -112.49   -107.66   -101.68     -101.27
64-32-16-8-4   -112.10   -107.94   -101.86     -101.59
NORB
64             2630.8    3263.7    3481.5
64-32          2830.8    3140.1    3532.9      3522.7
64-32-16       2757.5    3247.3    3346.7      3458.7
64-32-16-8     2832.0    3302.3    3393.6      3499.4
64-32-16-8-4   3064.1    3258.7    3393.6      3430.3
tic layers. The performance of the vanilla VAE model did not improve with more than two layers of stochastic latent variables. Contrary to this, models trained with batch normalization and warm-up consistently improve with additional layers of stochastic latent variables. As expected, the gain shrinks with each additional layer, but we emphasize that the improvements are consistent even for the addition of the fourth and fifth layers.
To study this effect we calculated the KL-divergence between q(zi,k|zi-1,k) and p(zi|zi+1) for each stochastic latent variable k during training, as seen in Figure 5. This term is zero if the inference model is independent of the data, i.e. q(zi,k|zi-1,k) = q(zi,k), and hence collapsed onto the prior.
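For diagonal Gaussians this per-unit KL-divergence has a closed form, so collapsed units can be detected directly from the posterior and prior statistics. A minimal NumPy sketch of the active-unit criterion KL > 0.01 used in Table 2 (the helper names and the synthetic posterior statistics below are ours, not from a trained model):

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """Per-unit KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    return 0.5 * (np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def active_units(mu_q, var_q, mu_p, var_p, threshold=0.01):
    """Count units whose KL to the (conditional) prior exceeds the threshold,
    i.e. units whose posterior actually depends on the data."""
    kl = kl_diag_gauss(mu_q, var_q, mu_p, var_p).mean(axis=0)  # average over the batch
    return int((kl > threshold).sum())

# Synthetic statistics for one layer of 5 units: the first 3 carry information
# (posterior differs from the prior), the last 2 have collapsed onto N(0, 1).
rng = np.random.default_rng(0)
mu_q = np.concatenate([rng.normal(0, 1, size=(100, 3)), np.zeros((100, 2))], axis=1)
var_q = np.concatenate([np.full((100, 3), 0.3), np.ones((100, 2))], axis=1)
mu_p, var_p = np.zeros((100, 5)), np.ones((100, 5))
print(active_units(mu_q, var_q, mu_p, var_p))  # -> 3
```

Collapsed units have q(zi,k) equal to the prior, so their KL is exactly zero and they fall below any positive threshold.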
structured high level latent representations that are likely
useful for semi-supervised learning.
The hierarchical latent variable models used here allow highly flexible distributions of the lower layers conditioned on the layers above. We measure the divergence between these conditional distributions and the restrictive mean field approximation by calculating the KL-divergence between q(zi|zi-1) and a standard normal distribution for several models trained on MNIST, see Figure 6 a). As expected, the lower layers have distributions that deviate strongly from a standard Gaussian when conditioned on the layers above. Interestingly, the probabilistic ladder network seems to have more active intermediate layers than the VAE with batch normalization and warm-up. Again this might be explained by the deterministic upward pass easing the flow of information to the intermediate and upper layers. We further note that the KL-divergence is approximately zero above the second layer in the vanilla VAE model, confirming the inactivity of these layers. Figure 6 b) shows generative samples from the probabilistic ladder network created by injecting
References
Bouchard,
Yoshua. Th
arXiv prepr
Burda, Yuri, Grosse, Roger, and Salakhutdinov, Ruslan. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
Dayan, Peter, Hinton, Geoffrey E., Neal, Radford M., and Zemel, Richard S. The Helmholtz machine. Neural Computation, 7(5):889-904, 1995.
Dieleman, Sander, Sønderby, Søren Kaae, van den Oord, Aäron, et al. Lasagne: First release, August 2015. doi: 10.5281/zenodo.27878.
Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
Kingma, Di