4. Probabilistic modeling
• Probabilistically represent the latent structure behind observed data
• Example: Latent Dirichlet Allocation [D. Blei et al., JMLR, 2003]
In the LDA setting, we obtain the extended graphical model shown in Figure 7. We treat β as a k × V random matrix (one row for each mixture component), where each row is independently drawn from an exchangeable Dirichlet distribution.
β_k ∼ Dirichlet(η)
θ_m ∼ Dirichlet(α)
z_n ∼ Multinomial(θ_m)
w_n ∼ Multinomial(β_k) | z_n = k
– The structure behind a document (bag-of-words) is represented with multinomial and Dirichlet distributions (sketched in code below)
– The latent variable z indicates which topic each word w belongs to
• Learning the latent variables and parameters enables interesting visualizations
– p(θ|X) ∝ p(X|θ) p(θ)
Figure 7: Graphical model representation of the smoothed LDA model.
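For concreteness, here is a minimal sketch of the smoothed LDA generative process written out above (β_k ∼ Dirichlet(η), θ_m ∼ Dirichlet(α), z_n ∼ Multinomial(θ_m), w_n ∼ Multinomial(β_{z_n})). The corpus size, vocabulary size and hyperparameter values are illustrative assumptions, not values from the paper.

import numpy as np

def generate_lda_corpus(M=100, N=50, K=5, V=1000, alpha=0.1, eta=0.01, seed=0):
    rng = np.random.default_rng(seed)
    beta = rng.dirichlet(np.full(V, eta), size=K)            # K topic-word distributions, beta_k ~ Dirichlet(eta)
    docs = []
    for _ in range(M):
        theta = rng.dirichlet(np.full(K, alpha))             # per-document topic proportions, theta_m ~ Dirichlet(alpha)
        z = rng.choice(K, size=N, p=theta)                   # topic assignment for each word slot, z_n ~ Multinomial(theta_m)
        w = np.array([rng.choice(V, p=beta[k]) for k in z])  # word drawn from its assigned topic, w_n ~ Multinomial(beta_{z_n})
        docs.append((z, w))
    return beta, docs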
These two steps are repeated until the lower bound on the log likelihood converges.
In Appendix A.4, we show that the M-step update for the conditional multinomial parameter β
can be written out analytically:
β_{ij} ∝ Σ_{d=1}^{M} Σ_{n=1}^{N_d} φ*_{dni} w^j_{dn}.   (9)
We further show that the M-step update for Dirichlet parameter α can be implemented using an
efficient Newton-Raphson method in which the Hessian is inverted in linear time.
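As a concrete reading of update (9), the following numpy sketch accumulates the variational responsibilities φ into expected per-topic word counts and then normalizes each row. The data layout (one (N_d × k) φ matrix and one word-id vector per document) is an assumption for illustration, not the paper's code.

import numpy as np

def m_step_beta(docs_phi, docs_words, k, V):
    # docs_phi[d]: (N_d, k) array of variational posteriors phi_dn; docs_words[d]: (N_d,) word ids
    beta = np.zeros((k, V))
    for phi, words in zip(docs_phi, docs_words):
        for n, j in enumerate(words):
            beta[:, j] += phi[n]                  # add phi_{dni} to entry (i, j) for every topic i
    beta /= beta.sum(axis=1, keepdims=True)       # normalize so each topic is a distribution over words
    return beta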
The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropolitan
Opera Co., New York Philharmonic and Juilliard School. “Our board felt that we had a
real opportunity to make a mark on the future of the performing arts with these grants an act
every bit as important as our traditional areas of support in health, medical research, education
and the social services,” Hearst Foundation President Randolph A. Hearst said Monday in
announcing the grants. Lincoln Center’s share will be $200,000 for its new building, which
will house young artists and provide new public facilities. The Metropolitan Opera Co. and
New York Philharmonic will receive $400,000 each. The Juilliard School, where music and
the performing arts are taught, will get $250,000. The Hearst Foundation, a leading supporter
of the Lincoln Center Consolidated Corporate Fund, will make its usual annual $100,000
donation, too.
5.4 Smoothing
The large vocabulary size that is characteristic of many document corpora creates serious problems
of sparsity. A new document is very likely to contain words that did not appear in any of the
documents in a training corpus. Maximum likelihood estimates of the multinomial parameters
assign zero probability to such words, and thus zero probability to new documents. The standard
approach to coping with this problem is to “smooth” the multinomial parameters, assigning positive
probability to all vocabulary items whether or not they are observed in the training set (Jelinek,
1997). Laplace smoothing is commonly used; this essentially yields the mean of the posterior
distribution under a uniform Dirichlet prior on the multinomial parameters.
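A tiny numeric check of the statement above that Laplace smoothing yields the posterior mean under a uniform Dirichlet prior; the counts are made up for illustration.

import numpy as np

counts = np.array([5, 0, 3])                           # word counts; the second word never appeared
mle = counts / counts.sum()                            # maximum likelihood assigns it probability 0
laplace = (counts + 1) / (counts.sum() + len(counts))  # add one pseudo-count per vocabulary item
posterior_mean = (counts + 1) / (counts + 1).sum()     # mean of the Dirichlet(counts + 1) posterior
assert np.allclose(laplace, posterior_mean)
print(mle, laplace)                                    # the unseen word now gets probability 1/11 instead of 0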
Unfortunately, in the mixture model setting, simple Laplace smoothing is no longer justified as a
maximum a posteriori method (although it is often implemented in practice; cf. Nigam et al., 1999).
In fact, by placing a Dirichlet prior on the multinomial parameter we obtain an intractable posterior
in the mixture model setting, for much the same reason that one obtains an intractable posterior in
the basic LDA model. Our proposed solution to this problem is to simply apply variational inference
methods to the extended model that includes Dirichlet smoothing on the multinomial parameter.

Figure 8: An example article from the AP corpus. Each color codes a different factor from which the word is putatively generated.
5. Probabilistic modeling
• In research
– How well can we represent and predict observed data?
• The model itself is evaluated via marginal likelihood or perplexity
• In applications
– Use the latent variables to obtain useful insights
• Visualization and data mining via topics
– Solve complex problems
• Example: nonlinear regression of multiple trajectories while clustering them, with manually specified constraints on the clusters
– J. Ross et al., Nonparametric Mixture of Gaussian Processes with Constraints, JMLR 2013.
– Dimensionality reduction / feature extraction
– Use as a kernel, as with the Fisher kernel
• Basically unsupervised learning
– No need to create labeled training data
– If it can be solved fast, is it a good fit for large-scale data?
Figure 2. Example MRF illustrating disconnected sub-graphs. Each graph edge represents either a must-link or cannot-link constraint.
Figure 4. Illustration of an algorithm output for the unconstrained case. While the curves provide a reasonable explanation of the data, it may not be the solution of interest.
6. semantic mixture components from the co-existing image content
and text descriptions. In the data mining and information retrieval
community, there has been a long-standing focus on using
probabilistic topic models to study the correlation between image
and text descriptions. Specifically, the Correspondence LDA
(CorrLDA) model [1], which imposes correspondence between
text word and other semantic entities, provides a natural way to
learn latent semantic components (topics) from image features and
associate them with text descriptions. Many recent studies,
including sophisticated topic models that associate image features
with multiple types of semantic entities (such as protein entities
[8], ontology-based biomedical concepts [9]), still follow a similar
generative process to the prototype CorrLDA model. In the CorrLDA model, each image document
has its own distribution over semantic mixture components; this gives the model the flexibility to
adapt to different image contents. However, the CorrLDA model requires specifying the exact number
of mixture components, which is fixed for each image document and remains unchanged during
model estimation. In practice, in order to get an optimal number, researchers have to try out different
numbers of mixture components and make a choice by comparing the log-likelihood, perplexity and
other criteria that indicate how well the model fits the data. The Hierarchical Dirichlet Process
(HDP) model [5] is a nonparametric extension of Latent Dirichlet Allocation (LDA)-based topic
models; it enables modeling documents with countably infinite mixture components, and thus
provides the flexibility of modeling images whose actual number of semantic components is unknown.
2.2 Modeling User’s Perspective
Study of social tagging in web-based applications has gained
increased popularity in the data mining community. Specifically,
several probabilistic generative models have been proposed to
study users’ tagging patterns [10, 11]. In [11], a topic-perspective
(TP) model is proposed to infer how both users’ perspective and
the resource content relate to the generation of social annotations.
It improves the generative process of social annotations by
complement part of the holistic GIST features [3]. Our motivation
comes from the fact that the mechanism of human visual
perception allows for very rapid holistic image analysis to provide
a coarse context of the image scene (spatial layout model), yet it also
gives rise to a small set of candidate salient locations in a scene
(saliency model) that needs to be intensively studied [2]. In Fig. 1,
Probabilistic modeling
• Web services produce multi-attribute data
– Amazon review data: ratings, review text, product genre and author information are observed jointly
• Complex models keep appearing
– Having one dedicated model per service seems reasonable
– They contain large numbers of diverse latent variables
• Estimating the latent variables and parameters takes a long time
• Large-scale data, complex models
X. Chen et al., "Perspective hierarchical dirichlet process for user-tagged image modeling", CIKM 2011.
• Traditional, general-purpose inference methods
– Markov Chain Monte Carlo: sample from the posterior distribution
– Variational Bayes: approximate the posterior with a variational distribution
– As they stand, they cannot keep up with this scale
– This talk surveys remedies built mainly on the two methods above
• Spectral learning, splash belief propagation, sequential Monte Carlo, etc. are not covered

N_j^t is the number of tags in document j, while N_j^v and N_j^r represent the total number of extracted visual code-words and MSER regions in document j, respectively. In the model, the holistic representation of an image is replicated 10 times to enable the posterior sampling, so N_j^h denotes the h-th replication of the holistic image representation in document j. In both models, we assume fixed values for the Dirichlet process concentration parameters α0 and γ. We also assume symmetric priors αu, ξv, ξt, η and ζ for the Dirichlet distributions in the models. Detailed explanations of the notation used in the following discussion are summarized in Table 1.

Fig. 1 Graphical representation of the perspective HDP (pHDP) model for user-tagged images
7. Subsampling
• R. Bardenet, A. Doucet, and C. Holmes, "Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach", ICML 2014.
[Subsampling]
– In MCMC, computing the likelihood for the acceptance probability that decides whether to accept a proposed parameter is expensive (n data points)
– Subsample the data to reduce the number of likelihood terms
– How many samples are enough? How closely can we approximate?
– Keep sampling until a probabilistic bound is satisfied
– The error between the exact value and the subsampled value can be controlled (probabilistically)
Abstract
Markov chain Monte Carlo (MCMC) methods are often deemed far too computationally intensive to be of any practical use for large datasets. This paper describes a methodology that aims to scale up the Metropolis-Hastings (MH) algorithm in this context. We propose an approximate implementation of the accept/reject step of MH that only requires evaluating the likelihood of a fraction of the data, yet is guaranteed to coincide with the accept/reject step based on the full dataset with a probability superior to a user-specified tolerance level. This adaptive subsampling technique is an alternative to the recent approach developed in (Korattikara et al., 2014), and it allows us to establish rigorously that the resulting approximate MH algorithm samples from a perturbed version of the target distribution of interest, whose total variation distance to this very target is controlled explicitly. We explore the benefits and limitations of this scheme.
(2004, Chapter 7.3)). MH consists in building an ergodic Markov chain of invariant distribution π(θ). Given a proposal q(θ'|θ), the MH algorithm starts its chain at a user-defined θ_0, then at iteration k + 1 it proposes a candidate state θ' ∼ q(·|θ_k) and sets θ_{k+1} to θ' with probability

α(θ_k, θ') = 1 ∧ [π(θ')/π(θ_k)] [q(θ_k|θ')/q(θ'|θ_k)]
           = 1 ∧ [p(θ')/p(θ_k)] [q(θ_k|θ')/q(θ'|θ_k)] ∏_{i=1}^{n} p(x_i|θ')/p(x_i|θ_k),   (1)

while θ_{k+1} is otherwise set to θ_k. When the dataset is large (n ≫ 1), evaluating the likelihood ratio appearing in the MH acceptance ratio (1) is too costly an operation and rules out the applicability of such a method.
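For concreteness, here is a minimal sketch of this exact accept/reject step evaluated in the log domain; log_lik, log_prior and log_q (with the convention log_q(a, b) = log q(a | b)) are assumed user-supplied functions. The point is that the product over i = 1, …, n in (1) becomes an O(n) sum over the full dataset at every iteration.

import numpy as np

def mh_accept(x, log_lik, log_prior, log_q, theta, theta_prop, rng):
    # log of the acceptance ratio in (1): prior and proposal terms plus n log-likelihood ratios
    log_alpha = (log_prior(theta_prop) - log_prior(theta)
                 + log_q(theta, theta_prop) - log_q(theta_prop, theta)
                 + np.sum(log_lik(x, theta_prop) - log_lik(x, theta)))   # the O(n) part
    return np.log(rng.uniform()) < min(0.0, log_alpha)                   # accept w.p. 1 ∧ exp(log_alpha)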
The aim of this paper is to propose an approximate implementation of this “ideal” MH sampler, the maximal approximation error being pre-specified by the user. To achieve this, we first present the “ideal” MH sampler in a slightly non-standard way.
In practice, the accept/reject step of the MH step is implemented by sampling a uniform random variable u ∼ U(0,1) and accepting the candidate if and only if

u ≤ [π(θ')/π(θ_k)] [q(θ_k|θ')/q(θ'|θ_k)].

[…] the empirical Bernstein bound

c_t = σ̂_t √(2 log(3/δ_t)/t) + 6 C_{θ,θ'} log(3/δ_t)/t,   (9)
applies. While the bound of Audibert et al. (2009) originally covers the case where the x*_i are drawn with replacement, it was early remarked (Hoeffding, 1963) that Chernoff bounds, such as the empirical Bernstein bound, still hold when considering sampling without replacement. Finally, we will also consider the recent Bernstein bound of Bardenet & Maillard (2013, Theorem 3), designed specifically for the case of sampling without replacement.
2.2. Stopping rule construction
The concentration bounds given above are helpful as they can allow us to decide whether (3) holds or not. Indeed, on the event {|Λ*_t(θ, θ') − Λ_n(θ, θ')| ≤ c_t}, we can decide whether or not Λ_n(θ, θ') > ψ(u, θ, θ') if |Λ*_t(θ, θ') − ψ(u, θ, θ')| > c_t additionally holds. This is illustrated in Figure 2. Combined with the concentration inequality (6), we thus take the correct decision with probability at least 1 − δ_t if |Λ*_t(θ, θ') − ψ(u, θ, θ')| > c_t. In case |Λ*_t(θ, θ') − ψ(u, θ, θ')| ≤ c_t, we want to increase t until the condition |Λ*_t(θ, θ') − ψ(u, θ, θ')| > c_t is satisfied. Let δ ∈ (0, 1) be a user-specified parameter. We provide a construction which ensures that at the first random time T such that |Λ*_T(θ, θ') − ψ(u, θ, θ')| > c_T, the correct decision is taken with probability at least 1 − δ. This adaptive stopping rule adapted from (Mnih et al., 2008) is inspired by bandit algorithms, Hoeffding races (Maron & Moore, 1993) and procedures developed to scale up boosting algorithms to large datasets (Domingo & Watanabe, 2000). Formally, we set the stopping time

T = n ∧ inf{t ≥ 1 : |Λ*_t(θ, θ') − ψ(u, θ, θ')| > c_t},   (10)

where a ∧ b denotes the minimum of a and b. In other words, if the infimum in (10) is larger than n, then we stop as our sampling without replacement procedure ensures Λ*_n(θ, θ') = Λ_n(θ, θ'). Letting p > 1 and selecting δ_t = ((p − 1)/(p t^p)) δ, we have Σ_{t≥1} δ_t ≤ δ. Setting (c_t)_{t≥1} such that (6) holds, the event

E = ⋂_{t≥1} {|Λ*_t(θ, θ') − Λ_n(θ, θ')| ≤ c_t}   (11)
each new x*_i has been drawn, but rather draw several new subsamples x*_i between each check of Step 19. This is why we introduce the variable t_look in Steps 6, 16, and 17 of Figure 3. This variable simply counts the number of times the check in Step 19 was performed. Finally, as recommended in a related setting in (Mnih et al., 2008; Mnih, 2008), we augment the size of the subsample geometrically by a user-input factor γ > 1 in Step 18. Obviously this modification does not impact the fact that the correct decision is taken with probability at least 1 − δ.
MHSUBLHD(p(x|θ), p(θ), q(θ'|θ), θ_0, N_iter, X, (δ_t), C_{θ,θ'}, γ)
 1  for k ← 1 to N_iter
 2      θ ← θ_{k−1}
 3      θ' ∼ q(·|θ), u ∼ U(0,1)
 4      ψ(u, θ, θ') ← (1/n) log[ u p(θ) q(θ'|θ) / (p(θ') q(θ|θ')) ]
 5      t ← 0
 6      t_look ← 0
 7      Λ* ← 0
 8      X* ← ∅                                   . Keeping track of points already used
 9      b ← 1                                    . Initialize batchsize to 1
10      DONE ← FALSE
11      while DONE == FALSE do
12          x*_{t+1}, …, x*_b ∼ w/o repl. X \ X*     . Sample new batch without replacement
13          X* ← X* ∪ {x*_{t+1}, …, x*_b}
14          Λ* ← (1/b) ( t Λ* + Σ_{i=t+1}^{b} log[ p(x*_i|θ') / p(x*_i|θ) ] )
15          t ← b
16          c ← 2 C_{θ,θ'} √( (1 − f*_t) log(2/δ_{t_look}) / (2t) )
17          t_look ← t_look + 1
18          b ← n ∧ ⌈γ t⌉                         . Increase batchsize geometrically
19          if |Λ* − ψ(u, θ, θ')| ≥ c or b > n
20              DONE ← TRUE
21      if Λ* > ψ(u, θ, θ')
22          θ_k ← θ'                              . Accept
23      else θ_k ← θ                              . Reject
24  return (θ_k)
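Below is a compact Python sketch in the spirit of the pseudocode above, not the authors' implementation: the schedule δ_t = δ/2^t_look, the Hoeffding-style bound for c, and the use of a random permutation to emulate sampling without replacement are simplifying assumptions for illustration. C is the assumed bound C_{θ,θ'} on |log p(x|θ') − log p(x|θ)|, and log_q follows the same log_q(a, b) = log q(a | b) convention as the earlier sketch.

import numpy as np

def subsampled_mh_accept(x, log_lik, log_prior, log_q, theta, theta_prop, C, rng,
                         delta=0.01, gamma=1.5):
    n = len(x)
    u = rng.uniform()
    # Step 4: per-datum threshold psi(u, theta, theta') that the average log-likelihood ratio must beat
    psi = (np.log(u) + log_prior(theta) + log_q(theta_prop, theta)
           - log_prior(theta_prop) - log_q(theta, theta_prop)) / n
    perm = rng.permutation(n)                 # a random permutation prefix = sampling w/o replacement
    t, b, lam, t_look = 0, 1, 0.0, 0
    while True:
        batch = x[perm[t:b]]                  # the newly drawn points x*_{t+1}, ..., x*_b
        lam = (t * lam + np.sum(log_lik(batch, theta_prop) - log_lik(batch, theta))) / b
        t = b
        t_look += 1
        delta_t = delta / 2 ** t_look         # assumption: any schedule with sum_t delta_t <= delta
        c = 2 * C * np.sqrt((1 - (t - 1) / n) * np.log(2 / delta_t) / (2 * t))   # Step 16-style bound
        b = min(n, int(np.ceil(gamma * t)))   # grow the subsample geometrically (Step 18)
        if abs(lam - psi) >= c or t == n:     # stop when the decision is confident or all data are used
            break
    return lam > psi                          # accept iff the subsampled average log-ratio exceeds psi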
8. Modifying the model
• T. Nguyen and E. Bonilla, "Fast Allocation of Gaussian Process Experts", ICML 2014.
[Reducing computational cost]
– Nonlinear regression with a Gaussian Process costs O(N^3)
– Do the regression while clustering the data
• Each data point gets a latent assignment variable, and regression uses only the data sharing the same latent variable (see the sketch below)
– One GP corresponds to one cluster; since each GP regression uses only the data within its cluster, the computational cost drops
– Approximations are also introduced to lower the clustering cost
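As referenced in the bullets above, here is a minimal sketch of the "one GP per cluster" idea: points are crudely assigned to clusters and an independent GP regressor is fit on each cluster, so each solve costs O(N_k^3) instead of O(N^3). The RBF kernel, the noise level and the nearest-center assignment are illustrative assumptions, not the allocation procedure of Nguyen and Bonilla.

import numpy as np

def rbf(a, b, ell=1.0):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def fit_predict_gp_experts(x, y, x_test, n_clusters=4, noise=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    centers = rng.choice(x, size=n_clusters, replace=False)               # crude 1-D cluster centers
    assign = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
    test_assign = np.argmin(np.abs(x_test[:, None] - centers[None, :]), axis=1)
    pred = np.empty_like(x_test, dtype=float)
    for k in range(n_clusters):
        xk, yk = x[assign == k], y[assign == k]                            # only this cluster's N_k points
        K = rbf(xk, xk) + noise * np.eye(len(xk))                          # N_k x N_k instead of N x N
        alpha = np.linalg.solve(K, yk)                                     # GP posterior-mean weights
        pred[test_assign == k] = rbf(x_test[test_assign == k], xk) @ alpha
    return pred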
• S. Williamson, A. Dubey and E. Xing, "Parallel Markov Chain Monte Carlo for Nonparametric Mixture Models", ICML 2013.
[Parallelization]
– Dirichlet process mixture: a nonparametric Bayesian mixture model (the number of clusters is determined automatically)
– Proves that a DP mixture of DP mixtures is again a DP mixture
– Each component mixture can then be sampled with MCMC independently
– MCMC can be parallelized without breaking detailed balance
• Methods that ignore this condition and parallelize by brute force have also appeared
Figure 3. Predictive performance and training time of our method compared to FITC and local FITC with kmeans and random clustering. The standardized mean square error (SMSE) and negative log predictive density (NLPD) averaged across all test points are reported; smaller is better. Panels: (a) SMSE, (b) NLPD, (c) training time (hours) on kin40k, pumadyn and pole.
Chalupka et al. (2012) that is more efficient than k-means and tends to give more balanced cluster sizes. We denote our model with the random and RPC initialization as FGP-RANDOM and FGP-RPC method respectively.
We evaluate our model against six other competitive baselines. The first baseline is the local FITC model described in the previous section, with random or RPC assignments of the data points to clusters. Analogous to our model, we refer to them as FITC-RANDOM and FITC-RPC. The second baseline (GPSVI2000) is FITC with stochastic variational inference (Hensman et al., 2013) training using B = 2000 inducing points. Note that GPSVI has quadratic storage complexity O(B^2) which limits the total number of inducing points that can be used. Unlike our model and local FITC, the inducing locations cannot be learned and must be selected on some ad hoc basis. In addition to random selection, we also clustered the dataset into partitions using RPC and k-means and used the centroids as the inducing inputs. We obtained essentially identical results with k-means selection so its results are reported here. The third baseline (SOD2000) is the standard GP regression model where a subset of 2000 data points is randomly sampled for training and the rest is discarded. For all of these GP-based methods, we repeat the experiments 5 times with different
Table 1. Test performance of the models on the Million Song Dataset. MAE is the mean absolute error and SMSE and NLPD are as defined previously. All GP-based methods are reported with standard deviation over 5 runs. Our method (FGP-RANDOM and FGP-RPC) significantly outperforms all other baselines.

METHOD          SMSE            MAE             NLPD
FGP-RANDOM      0.715 ± 0.003   6.47 ± 0.02     3.59 ± 0.01
FGP-RPC         0.723 ± 0.003   6.48 ± 0.02     3.58 ± 0.01
FITC-RANDOM     0.761 ± 0.009   6.74 ± 0.07     3.63 ± 0.03
FITC-RPC        0.832 ± 0.027   7.11 ± 0.23     3.73 ± 0.07
GPSVI2000       0.724 ± 0.005   6.53 ± 0.04     3.64 ± 0.01
SOD2000         0.794 ± 0.011   6.94 ± 0.08     3.69 ± 0.01
LR              0.770           6.846           NA
CONSTANT        1.000           8.195           NA
NN1             1.683           9.900           NA
NN50            1.332           8.208           NA
even worse than prediction using the constant mean. Linear regression does only slightly better than two of the GP-based methods, namely FITC-RPC and SOD2000. Overall our model is significantly better than all of the competing methods. In particular, it is more accurate (e.g. in terms of MAE) than all but GPSVI2000 by at least 0.27 year per
[Figure: (a) Test set perplexity against run time for AVparallel; (b) test set perplexity against run time for various algorithms.]
9. Parallel / streaming
• T. Broderick, N. Boyd, A. Wibisono, A. Wilson and M. Jordan, "Streaming Variational Bayes", NIPS 2013.
– Proposes SDA-Bayes, which learns in a parallel, streaming, asynchronous fashion
– The parameters of the variational posterior are updated with Bayes' rule (see the sketch below)
– The baseline, Stochastic Variational Inference (SVI, Variational Bayes + stochastic gradient descent), is sensitive to how the minibatch size is chosen
– SDA-Bayes is robust to this choice
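As referenced above, here is a minimal sketch of the streaming Bayes-rule update behind SDA-Bayes, shown on a conjugate Beta-Bernoulli model where the additive update is exact; in the paper the per-batch posterior is a variational approximation (e.g. for LDA topic models) rather than an exact conjugate update, and the function names here are illustrative.

import numpy as np

def batch_increment(batch):
    # Summarize one minibatch as an additive update to the natural parameters (counts of ones and zeros)
    ones = int(np.sum(batch))
    return np.array([ones, len(batch) - ones], dtype=float)

def streaming_posterior(batches, prior=np.array([1.0, 1.0])):
    # Posterior <- prior + sum of per-batch increments; order and grouping of the batches do not matter
    post = prior.copy()
    for batch in batches:                      # could be processed asynchronously by many workers
        post += batch_increment(batch)
    return post                                # Beta(alpha, beta) parameters of the streamed posterior

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.3, size=10_000)
batches = np.array_split(data, 100)            # a stream of 100 minibatches
print(streaming_posterior(batches))            # approximately Beta(1 + #ones, 1 + #zeros)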
Wikipedia
                  32-SDA   1-SDA    SVI     SSU
Log pred prob     −7.30    −7.38    −7.39   −7.94
Time (hours)       2.07    25.37     6.56   10.11

Nature
                  32-SDA   1-SDA    SVI     SSU
Log pred prob     −7.07    −7.13    −7.14   −7.89
Time (hours)       0.31     7.00     1.73    1.99

Table 1: A comparison of (1) log predictive probability of held-out data and (2) running time of four algorithms: SDA-Bayes with 32 threads, SDA-Bayes with 1 thread, SVI, and SSU.

[Figure: sensitivity experiments. (a) SVI sensitivity to D on Wikipedia; (b) sensitivity to minibatch size on Wikipedia; (c) SVI sensitivity to parameters on Wikipedia; (d) SVI sensitivity to D on Nature; (e) sensitivity to minibatch size on Nature; (f) SVI sensitivity to parameters on Nature.]