Bayesian Learning for Large-Scale Data
6th DSIRNLP
@I_eric_Y
Self-introduction
• Interested in the structure behind observed data (modeling)
– Dynamic Bayesian Net, Gaussian Process, Latent Dirichlet Allocation, (Hierarchical) Dirichlet Process, Indian Buffet Process, Infinite Relational Model…
• Worked part-time at JX通信社 as a student
– Developed the automatic article-summarization feature of Vingow, an automatic article-collection app
Outline
• Probabilistic modeling
• Dealing with large-scale data
– Subsampling
– Modifying the model
– Parallelization / streaming
• Summary
※ The figures and tables in the slides are quoted from the cited papers.
Probabilistic modeling
• Probabilistically represent the latent structure behind observed data
• Example: Latent Dirichlet Allocation [D. Blei et al., JMLR, 2003]
– Generative model (smoothed LDA, Figure 7 of the paper; a small sampling sketch follows this slide):
    β_k ~ Dirichlet(η)
    θ_m ~ Dirichlet(α)
    z_n ~ Multinomial(θ_m)
    w_n ~ Multinomial(β_k) | z_n = k
– The structure behind documents (bag-of-words) is expressed with multinomial and Dirichlet distributions
– The latent variable z indicates which topic each word w belongs to
• Learning the latent variables and parameters enables interesting visualizations and the like
– p(θ|X) ∝ p(X|θ)p(θ)
[Embedded excerpts from Blei et al. (2003): Figure 7 (graphical model of the smoothed LDA model with nodes α, θ, z, w, β, η and plates M, N, k), the analytic M-step update for the multinomial parameter β and the Newton-Raphson update for the Dirichlet parameter α, Section 5.4 on Dirichlet smoothing of the multinomial parameters, and Figure 8 (an example AP-corpus article about the Hearst Foundation grants, color-coded by topic).]
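As a concrete illustration of the generative story above, the following is a minimal NumPy sketch that samples a toy corpus from the smoothed LDA model. The vocabulary size, topic count, document count and hyperparameter values are arbitrary choices for the example, not values taken from the paper.

import numpy as np

rng = np.random.default_rng(0)
V, K, M = 1000, 10, 5            # vocabulary size, topics, documents (arbitrary)
eta, alpha = 0.1, 0.5            # symmetric Dirichlet hyperparameters (arbitrary)
doc_lengths = rng.poisson(100, size=M)

# beta_k ~ Dirichlet(eta): per-topic word distributions
beta = rng.dirichlet(np.full(V, eta), size=K)      # shape (K, V)

corpus = []
for m in range(M):
    # theta_m ~ Dirichlet(alpha): per-document topic proportions
    theta_m = rng.dirichlet(np.full(K, alpha))
    words = []
    for _ in range(doc_lengths[m]):
        z_n = rng.choice(K, p=theta_m)             # z_n ~ Multinomial(theta_m)
        w_n = rng.choice(V, p=beta[z_n])           # w_n ~ Multinomial(beta_k) given z_n = k
        words.append(w_n)
    corpus.append(words)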
Probabilistic modeling
• In research
– How well can the observed data be represented and predicted?
– The model itself is evaluated via marginal likelihood or perplexity (see the short formula after this slide)
• In applications
– Use the latent variables to obtain useful insights
• Visualization and data mining via topics
– Solve complex problems
• Example: nonlinear regression of multiple trajectories while clustering them, with additional manual constraints on the clusters
– J. Ross et al., Nonparametric Mixture of Gaussian Processes with Constraints, JMLR 2013.
– Dimensionality reduction / feature extraction
• Use as a kernel, in the style of the Fisher kernel
• Basically unsupervised learning
– No need to create labeled data
– If it can be solved fast, perhaps it is well suited to large-scale data?
[Embedded excerpts from Ross et al. (2013): Figure 2 (an example MRF over latent assignments z1–z9 illustrating disconnected sub-graphs, where each edge is a must-link or cannot-link constraint) and Figure 4 (algorithm output in the unconstrained case: the curves provide a reasonable explanation of the data, but may not be the solution of interest).]
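For reference, perplexity is the exponential of minus the average per-word held-out log-likelihood; lower is better. A minimal sketch, assuming the per-document log-likelihoods log p(w_d) have already been computed by the model:

import numpy as np

def perplexity(doc_log_likelihoods, doc_lengths):
    # perplexity = exp( - sum_d log p(w_d) / sum_d N_d )
    return np.exp(-np.sum(doc_log_likelihoods) / np.sum(doc_lengths))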
Probabilistic modeling
• Web services produce multi-attribute data
– Amazon review data: ratings, review text, product genre, and reviewer information are observed simultaneously
• Complex models keep appearing
– It seems fine to have one model per service
– Large numbers of diverse latent variables
• Inference of the latent variables and parameters takes longer
• Large-scale data, complex models
– X. Chen et al., Perspective hierarchical Dirichlet process for user-tagged image modeling, CIKM 2011.
• Traditional, general-purpose inference methods
– Markov Chain Monte Carlo: sample from the posterior
– Variational Bayes: approximate the posterior with a variational posterior
– Used as-is, they cannot keep up with this scale
– This talk surveys coping strategies built mainly around these two
• Spectral learning, splash belief propagation, sequential Monte Carlo, etc. are not covered
[Embedded excerpts from Chen et al. (CIKM 2011): Section 2 background on the Correspondence LDA (CorrLDA) model and its Hierarchical Dirichlet Process (HDP) extension, which removes the need to fix the number of mixture components per image; Section 2.2 on modeling the user's perspective in social tagging; Fig. 1 (graphical representation of the perspective HDP (pHDP) model for user-tagged images); and the accompanying notation (N_j^t tags, N_j^v visual code-words, N_j^r MSER regions per document j; fixed DP concentration parameters and symmetric Dirichlet priors).]
Subsampling
• R. Bardenet, A. Doucet, and C. Holmes, "Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach", ICML 2014.
[Subsampling]
– In MCMC, the likelihood evaluation inside the acceptance-rate computation that decides whether to accept a new parameter is expensive (it scales with the number of data points n)
– Reduce that number by subsampling the data
– How much should be sampled? How well can the exact value be approximated?
– Keep sampling until a probabilistic bound is satisfied
– The error between the exact value and the subsampled value can be controlled (probabilistically); a rough sketch of this test follows the slide
[Embedded excerpts from Bardenet et al. (2014): the abstract; the MH acceptance probability
α(θ_k, θ') = 1 ∧ (p(θ') q(θ_k|θ') / (p(θ_k) q(θ'|θ_k))) ∏_{i=1}^n p(x_i|θ')/p(x_i|θ_k),
whose likelihood ratio is too costly to evaluate when n is large; the adaptive stopping rule built from empirical Bernstein concentration bounds for sampling without replacement, with stopping time T = n ∧ inf{t ≥ 1 : |Λ*_t(θ, θ') − ψ(u, θ, θ')| > c_t}, which takes the correct accept/reject decision with probability at least 1 − δ; and the MHSUBLHD pseudocode, which grows the subsample geometrically by a user-input factor γ > 1 until the bound allows a decision.]
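Below is a rough Python sketch of the adaptive-subsampling accept/reject test, not the exact MHSUBLHD procedure of the paper: the mean log-likelihood ratio is estimated from a subsample drawn without replacement and grown geometrically, and sampling stops once a concentration bound c_t separates the estimate from the acceptance threshold ψ(u, θ, θ'). The bound used here is a crude Hoeffding-style placeholder with an assumed per-point range C, and all function names and constants are illustrative assumptions.

import numpy as np

def approx_mh_accept(x, theta, theta_new, loglik, log_prior, log_q_ratio,
                     C, delta=0.05, gamma=1.5, rng=np.random.default_rng()):
    # x: array of n data points; loglik(x_subset, theta) returns per-point
    # log-likelihoods; log_q_ratio = log q(theta_new|theta) - log q(theta|theta_new);
    # C is an assumed bound on |loglik(x_i, theta_new) - loglik(x_i, theta)|.
    n = len(x)
    u = rng.uniform()
    # acceptance threshold psi(u, theta, theta'): accept iff the mean
    # log-likelihood ratio exceeds psi
    psi = (np.log(u) + log_prior(theta) - log_prior(theta_new) + log_q_ratio) / n
    perm = rng.permutation(n)                 # sampling without replacement
    t, lam = 0, 0.0
    b = 1
    while True:
        idx = perm[t:b]                       # newly drawn points
        lam = (t * lam + np.sum(loglik(x[idx], theta_new)
                                - loglik(x[idx], theta))) / b
        t = b
        # crude concentration bound; the paper uses empirical Bernstein bounds
        c_t = C * np.sqrt((1.0 - t / n) * np.log(3.0 / delta) / (2.0 * t))
        if abs(lam - psi) > c_t or t >= n:
            break
        b = min(n, int(np.ceil(gamma * t)))   # grow the subsample geometrically
    return lam > psi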
Modifying the model
• T. Nguyen and E. Bonilla, "Fast Allocation of Gaussian Process Experts", ICML 2014.
[Reducing computational cost]
– Nonlinear regression with a Gaussian Process costs O(N^3)
– Regress while clustering the data
• Each data point gets a latent variable, and a GP is fit only on the data points that share the same latent variable
– One cluster corresponds to one GP, and GP regression is run only on the data inside the cluster, so the computational cost drops (a small sketch of this idea follows the slide)
– Approximations are also introduced to lower the clustering cost
• S. Williamson, A. Dubey and E. Xing, "Parallel Markov Chain Monte Carlo for Nonparametric Mixture Models", ICML 2013.
[Parallelization]
– Dirichlet process mixture: a nonparametric Bayesian mixture model (the number of clusters is determined automatically)
– They prove that a DP mixture of DPs is again a DP mixture
– Each mixture can then be sampled with MCMC independently
– MCMC can be parallelized without breaking the detailed-balance condition
• Methods that parallelize forcibly, ignoring this condition, have also appeared
[Embedded excerpts from Nguyen & Bonilla (2014): Figure 3 (SMSE, NLPD and training time on kin40k, pumadyn and pole for their method vs. FITC and local FITC with k-means or random clustering; smaller is better) and Table 1 on the Million Song Dataset, where FGP-RANDOM (SMSE 0.715 ± 0.003) and FGP-RPC significantly outperform the FITC, GPSVI2000, SOD2000, linear-regression, constant-mean and nearest-neighbour baselines. Embedded excerpt from Williamson et al. (2013): test-set perplexity against run time for Gibbs sampling on 1 processor and the parallel sampler (AVparallel) on 2, 4 and 8 processors, and against other algorithms.]
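To make the cost argument concrete, here is a minimal sketch of the "one GP per cluster" idea, using scikit-learn's KMeans and GaussianProcessRegressor as stand-ins for the paper's learned allocation and expert models (so this is only the underlying intuition, not the method of Nguyen & Bonilla): exact GP regression on all N points is O(N^3), whereas fitting independent GPs inside K clusters costs roughly the sum of the cubed cluster sizes.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.gaussian_process import GaussianProcessRegressor

def local_gp_fit_predict(X, y, X_test, n_clusters=8):
    # Fit one GP per cluster, then predict each test point with the GP of
    # its nearest cluster centroid.
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
    experts = []
    for k in range(n_clusters):
        mask = km.labels_ == k
        gp = GaussianProcessRegressor().fit(X[mask], y[mask])   # O(n_k^3) each
        experts.append(gp)
    test_labels = km.predict(X_test)
    y_pred = np.empty(len(X_test))
    for k in range(n_clusters):
        mask = test_labels == k
        if mask.any():
            y_pred[mask] = experts[k].predict(X_test[mask])
    return y_pred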
Parallelization / streaming
• T. Broderick, N. Boyd, A. Wibisono, A. Wilson and M. Jordan, "Streaming Variational Bayes", NIPS 2013.
– Proposes SDA-Bayes, which learns in a parallel, streaming, asynchronous fashion
– The parameters of the variational posterior are updated by Bayes' rule (a toy conjugate example of this recursion follows the slide)
– The baseline, Stochastic Variational Inference (SVI; Variational Bayes + stochastic gradient descent), is sensitive to how the minibatch size is chosen
– SDA-Bayes is robust
[Embedded excerpts from Broderick et al. (2013): panels showing SVI's sensitivity to the dataset-size parameter D and to the minibatch size on Wikipedia and Nature, compared with SDA-Bayes, and Table 1 comparing log predictive probability of held-out data and running time of four algorithms (SDA-Bayes with 32 threads, SDA-Bayes with 1 thread, SVI, SSU):
  Wikipedia: 32-SDA −7.30 (2.07 h), 1-SDA −7.38 (25.37 h), SVI −7.39 (6.56 h), SSU −7.94 (10.11 h)
  Nature:    32-SDA −7.07 (0.31 h), 1-SDA −7.13 (7.00 h), SVI −7.14 (1.73 h), SSU −7.89 (1.99 h)]
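As a toy illustration of the "posterior of one minibatch becomes the prior for the next" recursion that SDA-Bayes builds on, here is a conjugate Dirichlet-multinomial example. In this conjugate case the streaming update is exact; SDA-Bayes applies the same recursion to approximate variational posteriors (e.g. for LDA). Because each minibatch only adds a count increment to the Dirichlet parameter, the increments can also be computed on separate workers and summed, which is the parallel/asynchronous part of the idea. All numbers below are arbitrary.

import numpy as np

def dirichlet_stream_update(alpha, minibatch_counts):
    # One streaming Bayes update: prior Dirichlet(alpha) -> posterior
    # Dirichlet(alpha + counts); the posterior is the prior for the next batch.
    return alpha + minibatch_counts

rng = np.random.default_rng(0)
V = 5
true_p = rng.dirichlet(np.ones(V))
alpha = np.ones(V)                       # prior
for _ in range(100):                     # sequential (streaming) pass
    batch = rng.multinomial(50, true_p)  # word counts in one minibatch
    alpha = dirichlet_stream_update(alpha, batch)

# The same result is obtained by summing per-worker increments:
#   alpha_final = prior + sum_b (alpha_b_posterior - prior), computed in parallel.
posterior_mean = alpha / alpha.sum()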
Summary
• Probabilistic modeling may be well suited to large-scale data
– No labeled data is required
– Flexible modeling lets you pursue a wide variety of goals
• However, inference is hard
– It is computationally heavy
– Some models cannot be parallelized, even in theory
– Is this why it has not spread much on distributed-processing frameworks?
• This talk grouped the coping strategies into three categories
– Subsampling the data
– Modifying the model
– Parallelization / streaming
• Not covered this time
– Model-specific inference; fast mixing: type-based MCMC…
– Issei Sato's slides, "Large-Scale Bayesian Learning in the Big Data Era: Focusing on Stochastic Gradient Langevin Dynamics"
http://www.slideshare.net/issei_sato/big-datastochastic-gradient-langevin-dynamics
