Mini-batch Variational Inference for
Time-Aware Topic Modeling
First Author1[0000−1111−2222−3333]
and Second Author2[1111−2222−3333−4444]
Abstract. This paper proposes a time-aware topic model and its mini-
batch variational inference for exploring chronological trends in docu-
ment contents. Our contribution is twofold. First, to extract topics in a
time-aware manner, our method uses two vector embeddings: the embed-
ding of latent topics and that of document timestamps. By combining
these two embeddings and applying the softmax function, we have as
many word probability distributions as document timestamps for each
topic. This modeling enables us to extract remarkable topical trends.
Second, to achieve memory efficiency, the variational inference is imple-
mented as a mini-batch maximization of the evidence lower bound. Our
inference requires no knowledge of the total number of training docu-
ments. This enables us to perform the optimization in a manner similar to
neural network training. In fact, our implementation used a deep learning
framework. The evaluation results show that we could improve test per-
plexity by using document timestamps. The results also show that our
test perplexity was comparable with that of collapsed Gibbs sampling,
which is less efficient in memory usage than our method. This makes our
proposal promising for time-aware topic extraction from large data sets.
Keywords: topic models · variational Bayes · time-aware analysis.
1 Introduction
This paper proposes a topic model using document timestamps, which is an ex-
tension of latent Dirichlet allocation (LDA) [3], and also provides an inference
for the proposed model. Our contribution is twofold. First, we use two vector
embeddings in our model: the vector embedding of latent topics and that of doc-
ument timestamps. Vector embedding enables us to easily incorporate covariates
like timestamps in topic modeling. Second, we provide a mini-batch based vari-
ational inference for the proposed model. An important merit of mini-batch
inference is memory efficiency. These two contributions, explained in the follow-
ing subsections, provide a viewpoint from which we can regard our time-aware
topic modeling as sharing common features with neural networks. We actually
implemented the proposed inference by using PyTorch.3
3
https://github.com/pytorch
1.1 Vector Embeddings of Topics and Timestamps
The first contribution concerns the modeling. Topic models are probabilistic
models of documents. The following two types of categorical distributions are
used in topic modeling: per-document categorical distributions over topics and
per-topic categorical distributions over words.
First, each document is modeled with a categorical distribution defined over
latent topics. We denote the number of topics by K. Let θd,k refer to the prob-
ability of the topic k in the document d. Intuitively, θd,k for each k = 1, . . . , K
tells the importance of each topic in the document d. Our method introduces
no modification with respect to this type of categorical distribution. Therefore,
the corresponding posterior parameters are estimated as described in [3].
Second, each latent topic is in turn represented as a categorical distribution
defined over vocabulary words. We denote the number of different words, i.e.,
the vocabulary size, by V and refer to the probability of the word v in the topic
k by φk,v. Intuitively, φk,v is the relevancy of the word v to the topic k. In
the proposed model, the φk,v’s for each topic k are obtained by applying the
softmax function to the V -dimensional embedding wk of the topic k plus bias,
i.e., $\phi_{k,v} = \exp(w_{k,v} + b_v) / \sum_{v'} \exp(w_{k,v'} + b_{v'})$, where $b_v$ is the bias for the word
v and is shared by all topics. Let W be the V × K matrix whose k-th column is
wk. Then φk = (φk,1, . . . , φk,V ) can be written as
$$\phi_k = \mathrm{Softmax}(W e_k + b) \tag{1}$$
where ek is the one-hot vector whose k-th element is 1 and all other elements
are 0. The formula in Eq. (1) is similar to that presented in [7]. However, we
here provide a viewpoint from which these per-topic word probabilities φk for
k = 1, . . . , K can be regarded as softmax outputs of a single-layer feedforward
network whose input is always one-hot. We adopt the same approach also for
document timestamps. Assume that there are T different timestamps and that
ut is the vector embedding of the timestamp t. We then obtain time-aware per-
topic word probabilities as
$$\phi_{k,t} = \mathrm{Softmax}(W e_k + U o_t + b) \tag{2}$$
where U is the V × T matrix whose t-th column is ut, and ot is the one-hot vector
whose t-th element is 1 and all other elements are 0. φk,t = (φk,t,1, . . . , φk,t,V )
in Eq. (2) represents the word probabilities in the topic k at the time point t (cf.
Fig. 1). In the evaluation experiments, we compared the word probability mod-
eling based on Eq. (2) to that based on Eq. (1) in terms of test perplexity. The
experiments are expected to show whether document timestamps can improve
the quality of extracted topics in terms of perplexity.
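As a concrete illustration of Eq. (2), here is a minimal PyTorch sketch, with toy dimensions and random weights rather than the values used in our experiments, that computes the word distribution of one topic at one time point and checks that the one-hot input simply selects the corresponding columns of W and U:

import torch

V, K, T = 1000, 50, 15                  # toy vocabulary size, number of topics, number of timestamps
W = 0.1 * torch.randn(V, K)             # topic embeddings (the k-th column is w_k)
U = 0.1 * torch.randn(V, T)             # timestamp embeddings (the t-th column is u_t)
b = torch.zeros(V)                      # word bias shared by all topics and timestamps
k, t = 3, 7
e_k = torch.nn.functional.one_hot(torch.tensor(k), K).float()   # one-hot topic input
o_t = torch.nn.functional.one_hot(torch.tensor(t), T).float()   # one-hot timestamp input
phi_kt = torch.softmax(W @ e_k + U @ o_t + b, dim=0)            # Eq. (2)
assert torch.allclose(phi_kt, torch.softmax(W[:, k] + U[:, t] + b, dim=0))
assert torch.isclose(phi_kt.sum(), torch.tensor(1.0))           # a distribution over the vocabulary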
1.2 Mini-Batch Variational Bayes
Fig. 1. Word cloud presentation of the same topic at three different time points. This
topic, seemingly related to HCI, was extracted from 827,141 paper titles in the DBLP
data set (cf. Section 3). K was set to 50. Words with larger probabilities are shown in
larger fonts, because the font size is determined by the value of φk,t,v. The left, center,
and right panels correspond to the time points 1998, 2005, and 2012, respectively.

Our second contribution concerns the inference for the proposed model. Mini-
batch based variational Bayesian inferences have already been proposed for topic
models [9][4]. However, these inferences require knowledge of the total num-
ber of training documents. Our second contribution is to provide a mini-batch
learning for the proposed model, where no knowledge of data size is required.
We perform the parameter estimation for our model in a similar manner to
that for neural networks. We iterate over mini-batches, i.e., small subsets of the
corpus, and conduct gradient descent where the loss function is the negative of
the evidence lower bound (ELBO). Of course, we need to choose an appropri-
ate optimization algorithm like Adam [10] and to adjust its learning rate based
on validation set evaluation. However, this is what we usually do when training
neural networks. The experimental results show that the test perplexity achieved
by the proposed inference was comparable to that achieved by collapsed Gibbs
sampling (CGS) for the vanilla LDA [8]. While CGS can give a good test perplex-
ity in many situations [1], it is far less efficient in memory use than mini-batch
learning. Since our inference uses mini-batch learning, it can be adopted even
when the number of training documents is prohibitively large for CGS.
The rest of the paper is organized as follows. Section 2 describes our proposal
in detail. Section 3 provides the settings and results of evaluation experiments.
Section 4 discusses some of the existing approaches related to ours. Section 5
concludes the paper by giving a future research direction.
2 Method
2.1 Variational Lower Bound
While there are many ways to perform an inference for topic models, we adopt
variational Bayesian inference [3][9][4], because it allows us to perform inference
as an optimization. Our topic model is a modification of the vanilla LDA [3].
In our model, a symmetric Dirichlet prior Dirichlet(α) is assigned to the per-
document categorical distributions over topics as in the vanilla LDA. In con-
trast, no prior is assigned to the per-topic categorical distributions over words
in our model, because the word probabilities are directly estimated in a time-
aware manner. The evidence lower bound (ELBO) to be maximized in variational
Bayesian inference can thus be written as follows [3]:
$$
\begin{aligned}
\sum_d \log p(x_d; \alpha, \Phi) \ge{} & \sum_{d,v,k} n_{d,v}\,\gamma_{d,v,k} \log \phi_{k,v} \\
& + \sum_{d,k} \Big( \alpha + \sum_v n_{d,v}\,\gamma_{d,v,k} - \lambda_{d,k} \Big)
  \Big\{ \Psi(\lambda_{d,k}) - \Psi\Big( \sum_{k'} \lambda_{d,k'} \Big) \Big\} \\
& - \sum_{d,v,k} n_{d,v}\,\gamma_{d,v,k} \log \gamma_{d,v,k}
  + \log \Gamma(K\alpha) - K \log \Gamma(\alpha)
\end{aligned} \tag{3}
$$
where xd is the bag-of-words representation of the document d. Φ = {φ1, . . . , φK}
are the parameter vectors of the per-topic categorical distributions over words,
i.e., per-topic word probabilities. nd,v is the frequency of the word v in d. γd,v,k
is the responsibility of the word v with respect to the topic k in d. λd,k is the
posterior parameter for the topic k in d. Ψ is the digamma function. We regard
the parameter α of the symmetric Dirichlet prior Dirichlet(α) as a free parame-
ter. The document-specific parameters λd,k and γd,v,k are estimated in the same
manner as in [3][9][4]. However, we estimate Φ differently, because the per-topic
word probabilities are modeled differently. The details are given below.
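For reference, the right-hand side of Eq. (3) can be evaluated for a single document along the following lines; this is a minimal sketch with toy values, using torch.digamma and torch.lgamma for Ψ and log Γ, not the implementation used in the experiments:

import torch

V, K, alpha = 6, 3, 0.1                                # toy vocabulary size, topic count, Dirichlet parameter
n_d = torch.tensor([2., 0., 1., 3., 0., 1.])           # word frequencies n_{d,v} of one document
gamma = torch.softmax(torch.randn(V, K), dim=1)        # responsibilities gamma_{d,v,k} (rows sum to one)
lam = alpha + (n_d.unsqueeze(1) * gamma).sum(0)        # lambda_{d,k} at its coordinate-ascent optimum
phi = torch.softmax(torch.randn(K, V), dim=1)          # per-topic word probabilities phi_{k,v}

dig = torch.digamma(lam) - torch.digamma(lam.sum())    # Psi(lambda_{d,k}) - Psi(sum_k lambda_{d,k})
elbo = (n_d.unsqueeze(1) * gamma * phi.t().log()).sum()                    # sum_{v,k} n gamma log phi
elbo += ((alpha + (n_d.unsqueeze(1) * gamma).sum(0) - lam) * dig).sum()    # zero when lambda is optimal
elbo -= (n_d.unsqueeze(1) * gamma * gamma.log()).sum()                     # minus sum_{v,k} n gamma log gamma
elbo += torch.lgamma(torch.tensor(K * alpha)) - K * torch.lgamma(torch.tensor(alpha))
print(float(elbo))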
2.2 Modeling of Time-Awareness
The vanilla LDA represents the per-topic word distributions simply as cat-
egorical distributions, or as categorical distributions to which a Dirichlet prior is assigned. In our
model, the parameter vector φk of the word categorical distribution for each
topic k is given by
$$\phi_{k,v} = \frac{\exp(w_{k,v} + b_v)}{\sum_{v'} \exp(w_{k,v'} + b_{v'})} \tag{4}$$
where the softmax function is used. wk can be viewed as a vector embedding
of the topic k as discussed in Section 1.1. While we use no prior, the bias term
bv in Eq. (4) plays a similar role to that of Dirichlet prior in smoothing [5][1],
because bv is shared by all topics.
Our proposal also considers the usage of covariates like timestamps, authors,
venues, etc. In this paper, we explore chronological trends in document con-
tents and thus use document timestamps in order to make the per-topic word
probabilities time-aware, though other covariates may also be used in a similar
manner. By using timestamps, the per-topic word probabilities are given by
$$\phi_{k,t,v} = \frac{\exp(w_{k,v} + u_{t,v} + b_v)}{\sum_{v'} \exp(w_{k,v'} + u_{t,v'} + b_{v'})} \tag{5}$$
where t represents document timestamps. U = (ut,v) is a weight matrix mod-
eling the dependency of word probabilities on timestamps. Each column ut of U
can be viewed as a vector embedding of the timestamp t. As already discussed in
Section 1.1, the φk,t,v’s can also be written as softmax outputs of a single-layer
network whose input is always one-hot:
$$\phi_{k,t} = \mathrm{Softmax}(W e_k + U o_t + b) \tag{6}$$
where ek and ot are one-hot input vectors corresponding to the latent topic k
and the timestamp t, respectively.
2.3 Implementation
In the manner described above, the parameter vector φk and its time-aware
version φk,t are regarded as outputs of the softmax function. The weights and
biases, i.e., W , U, and b, are estimated by maximizing the ELBO in Eq. (3).
We implemented the maximization as a mini-batch learning by using PyTorch.
Thanks to the broadcasting mechanism in PyTorch, we can write Eq. (6) as
phi = torch.nn.functional.softmax(
    W.unsqueeze(2)                    # (V, K) -> (V, K, 1)
    + U.unsqueeze(1)                  # (V, T) -> (V, 1, T)
    + b.unsqueeze(1).unsqueeze(2),    # (V,)   -> (V, 1, 1)
    dim=0)                            # softmax over the vocabulary; phi[v, k, t] equals φk,t,v
by collecting all φk,t,v’s in one place. In the evaluation experiments, we found
that when it was difficult to obtain a good validation perplexity, DropConnect
[15] worked very well. This can be implemented by modifying the above code
snippet as follows:
dropout = torch.nn.Dropout(p=dropout_prob)    # applied to weight entries, i.e., DropConnect [15]
phi = torch.nn.functional.softmax(
    dropout(W).unsqueeze(2)                   # randomly zeroed entries of W (rescaled by 1/(1-p))
    + dropout(U).unsqueeze(1)                 # randomly zeroed entries of U
    + b.unsqueeze(1).unsqueeze(2),
    dim=0)
The dropout probability dropout_prob was chosen from {0.2, 0.5} in our ex-
periment. No further fine-tuning of the probability was required. However, it was
essential to check whether DropConnect could improve test perplexity, because the
extent of the improvement sometimes reached a significant level.
2.4 Parameterization Efficiency
Here we give an additional discussion on modeling. As Eq. (5) shows, our method
combines the parameters indexed by the tuple (k, v) with those indexed by (t, v)
to make our topic extraction dependent on document timestamps. Therefore, the
memory space for these parameters is O(KV + TV ). However, it is also possible
to introduce the parameters indexed explicitly by the 3-tuple (k, t, v) as follows
in the same manner as in [7] and [13]:
$$\phi_{k,t,v} = \frac{\exp(w_{k,v} + u_{t,v} + r_{k,t,v} + b_v)}{\sum_{v'} \exp(w_{k,v'} + u_{t,v'} + r_{k,t,v'} + b_{v'})}. \tag{7}$$
A merit of this approach is that we can separate out the words not depend-
ing on topics but exclusively depending on timestamps (e.g. ‘2001’, ‘December’,
etc.) from the words depending on both topics and timestamps (e.g.
‘ATM’, ‘WDM’, etc.4
). The words of the former type may appear in documents
related to a wide range of topics but only appear at some limited set of time
points. Therefore, such words may have probabilities mainly determined by the
parameters indexed by (t, v), and the contribution of the parameters indexed by
(k, t, v) may be small. The words of the latter type may appear in documents
related to a narrow range of topics and at the same time only appear at some
limited set of time points. Such words may have probabilities determined both
by the parameters indexed by (t, v) and by those indexed by (k, t, v).
Our model has no parameters indexed by (k, t, v). Therefore, it may be diffi-
cult to make a clear distinction between the above two types of words. However,
an obvious drawback of the above approach is the memory space requirement of
O(KTV ). When we use a GPU, it can be prohibitive to allocate memory
for K × T × V parameters, each requiring its own gradient. Even without using this
many parameters, we could obtain interesting time-dependencies as presented in
Fig. 1 and in Fig. 3. Therefore, we did not introduce the parameters explicitly
indexed by (k, t, v).
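To make the memory argument concrete, consider the DBLP setting of Table 1 (V = 18,940, T = 15) with K = 100; the explicit (k, t, v) parameterization of Eq. (7) would require roughly thirteen times as many parameters as the factorized parameterization of Eq. (5):

V, K, T = 18_940, 100, 15        # DBLP vocabulary size, number of topics, number of timestamps (Table 1)
factorized = K * V + T * V       # W and U of Eq. (5); the bias b adds only V more parameters
explicit = K * T * V             # the r_{k,t,v} parameters of Eq. (7) alone
print(factorized, explicit, explicit / factorized)   # 2178100  28410000  ~13.0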
2.5 Details of Inference Algorithm
Algorithm 1 is the inference for the proposed model. The inference related to
the document-specific parameters is performed in the same manner as in [3][9][4].
Only the corpus-wide parameters, i.e., W , U, and b, are updated by backpropa-
gation. The inputs for the algorithm are: (1) the frequency nd,v of the word v in
the document d; (2) the parameter α of the symmetric Dirichlet prior distribu-
tion; and (3) the initial learning rate η of the optimization algorithm. As Algorithm 1
shows, we do not need to know the total number of training documents.
The elements of W and U were initialized by samples from the Gaussian
distribution with zero-mean and a variance of 0.01. The bias terms b were ini-
tialized to zero. For updating these parameters, we used Adam optimizer [10].
Let the initial learning rate of Adam be denoted by η. In our implementation,
the learning rate for the $m$-th mini-batch was set to $(1 + m/500)^{-0.7} \times \eta$. Further,
η was halved every 500 mini-batches until 2,000 mini-batches were seen.
4
These words are observed often in the titles of computer science papers related to
networking and communication but only appear across a limited range of years.
Algorithm 1 Mini-batch variational Bayes
1: Input: nd,v for each mini-batch, α, and η
2: Initialize W , U, and b
3: while True do
4: $\phi_{k,t} \leftarrow \mathrm{Softmax}(W e_k + U o_t + b)$
5: Get next mini-batch
6: for each document d in mini-batch do
7: t ← timestamp of document d
8: Initialize γd,v,k randomly
9: repeat
10: $\lambda_{d,k} \leftarrow \alpha + \sum_v n_{d,v}\,\gamma_{d,v,k}$
11: $\gamma_{d,v,k} \propto \exp(\Psi(\lambda_{d,k}))\,\phi_{k,t,v}$
12: until change in λd,k is negligible
13: end for
14: Make computational graph of the negative of ELBO
15: Backpropagate
16: Update W , U, and b by using η
17: Update η if necessary
18: end while
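The computational flow of Algorithm 1 can be summarized by the following condensed PyTorch sketch; next_minibatch() is a hypothetical loader, the dimensions are toy values, and a fixed number of inner sweeps replaces the convergence check, so this is an illustration rather than the exact code used in the experiments:

import torch

V, K, T, alpha, eta = 1000, 50, 15, 0.1, 1e-3
W = torch.nn.Parameter(0.1 * torch.randn(V, K))        # topic embeddings
U = torch.nn.Parameter(0.1 * torch.randn(V, T))        # timestamp embeddings
b = torch.nn.Parameter(torch.zeros(V))                 # shared word bias
optimizer = torch.optim.Adam([W, U, b], lr=eta)

def next_minibatch(size=200):
    # hypothetical loader: returns word frequencies n_{d,v} and a timestamp index per document
    return torch.randint(0, 3, (size, V)).float(), torch.randint(0, T, (size,))

for m in range(2000):
    n, t = next_minibatch()
    phi = torch.softmax(W.t().unsqueeze(1) + U.t().unsqueeze(0) + b, dim=2)   # (K, T, V), Eq. (6)
    phi_d = phi[:, t, :].permute(1, 2, 0)              # (D, V, K): phi_{k, t_d, v} for each document d

    with torch.no_grad():                              # lines 8-12: document-specific coordinate ascent
        gamma = torch.softmax(torch.randn(n.size(0), V, K), dim=2)            # random initialization
        for _ in range(20):                            # fixed number of sweeps instead of a convergence test
            lam = alpha + (n.unsqueeze(2) * gamma).sum(1)                     # lambda_{d,k}
            gamma = torch.exp(torch.digamma(lam)).unsqueeze(1) * phi_d        # proportional update
            gamma = gamma / gamma.sum(2, keepdim=True)                        # normalize over k

    # lines 14-16: the phi-dependent part of the negative ELBO is the loss for backpropagation
    loss = -(n.unsqueeze(2) * gamma * torch.log(phi_d)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    for g in optimizer.param_groups:                   # line 17: learning-rate schedule of Section 2.5
        g["lr"] = eta * (1 + m / 500) ** -0.7          # the halving of eta every 500 mini-batches is omitted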
3 Experiments
3.1 Data Sets
The specifications of the four document sets used in the evaluation experiments
are given in Table 1, where the total number of documents, the vocabulary size,
and the number of different document timestamps are denoted by D, V , and T,
respectively. A detailed explanation is given below.
The first data set ‘MAI’ is a set of newswire articles from the Mainichi, a
Japanese newspaper. A morphological analysis was performed by using MeCab.5
The dates of the articles range from November 1, 2007 to May 15, 2008. We split
the date range into 13 slices of nearly equal size and used the slices as document
timestamps. The second data set ‘TDT4’ is a set of English documents from
TDT4 Multilingual Text and Annotations.6
The dates of the documents range
from December 1, 2000 to January 31, 2001. We split the date range into 15 slices
of nearly equal size and used them as document timestamps. The third data set
‘DBLP’ is a set of paper titles obtained from DBLP web site.7
We regarded each
title as a single document. Therefore, this is a set of documents of quite short
length. The publication years ranging from 1998 to 2012 were used as document
timestamps. The last data set ‘STOV’ is a subset of the questions in the Stack-
Overflow data set available at Kaggle.8
We used as document timestamps the
months in which the questions were published, ranging from January 2014 to
December 2016. For all data sets, English words were lower-cased, and highly
frequent words and infrequent words were removed. The training:validation:test
ratio was 8:1:1 for all data sets. We used one random split, as in typical exper-
iments for neural networks. The mini-batch size was 200 for the MAI, TDT4, and
STOV data sets and 10,000 for the DBLP data set, because DBLP is a set of
short documents.
5
http://taku910.github.io/mecab/
6
https://catalog.ldc.upenn.edu/LDC2005T16
7
http://dblp.uni-trier.de/xml/
8
https://www.kaggle.com/stackoverflow/rquestions

Table 1. Specifications of data sets

data set          D        V    T
MAI          32,775   15,161   13
TDT4         96,246   15,153   15
DBLP      1,034,067   18,940   15
STOV         25,608   13,184   34
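The vocabulary construction described above (lower-casing and removal of highly frequent and infrequent words) can be sketched as follows; the cut-off values below are illustrative assumptions, since the exact thresholds are not reported here:

from collections import Counter

MIN_DF, MAX_DF_RATIO = 5, 0.2            # assumed thresholds; the actual cut-offs may differ

def build_vocabulary(docs):
    # docs: list of token lists, already lower-cased (and, for MAI, segmented by MeCab)
    df = Counter(tok for doc in docs for tok in set(doc))        # document frequency of each word
    max_df = MAX_DF_RATIO * len(docs)
    return {tok for tok, c in df.items() if MIN_DF <= c <= max_df}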
3.2 Evaluation Method
We compared the following three approaches for topic extraction: the mini-batch
inference for the topic model with document timestamps (cf. Eq. (5)), the mini-
batch inference for the topic model without document timestamps (cf. Eq. (4)),
and collapsed Gibbs sampling (CGS) [8] for the vanilla LDA. These compared
methods are referred to as ‘VB w/ t’, ‘VB’, and ‘CGS’, respectively, in Table 2,
which summarizes the evaluation results. With respect to K, i.e., the number of
latent topics, we tested the following two settings: K = 50 and K = 100.
The comparison is performed in terms of test perplexity [3]. However, each
approach has its own free parameters to be tuned. Therefore, we tuned the free
parameters based on the perplexity computed over validation set. For ‘VB w/ t’
and ‘VB’, the parameter α of the symmetric Dirichlet prior distribution and the
initial learning rate η of Adam were tuned based on validation perplexity. The
tuned values of η and α are given in Table 2. With respect to ‘CGS’, not only α
but also the parameter β of the symmetric Dirichlet prior distribution assigned
to the per-topic word categorical distributions were tuned based on validation
perplexity. The tuned values of α and β are given in Table 2.
Both the validation and test perplexities were computed by using the fold-in
procedure [1], where a randomly selected two-thirds of the word tokens were used
for obtaining topic posterior probabilities in each document. There are many
ways to select two-thirds of the word tokens. Therefore, when computing test
perplexity, i.e., at the final evaluation (but not when computing validation
perplexity), we repeated the computation of perplexity ten times, each time
over a different randomly chosen two-thirds of the word tokens. Also for CGS,
we performed an evaluation in the same manner. We then calculated the mean
and standard deviation of the ten resulting test perplexities, as given in Table 2.
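The fold-in evaluation can be sketched as follows; estimate_theta stands for the per-document posterior inference run on the observed two-thirds of the tokens, and phi holds the per-topic word probabilities, selected by the document's timestamp in the time-aware case. This is a paraphrase of the procedure, not the evaluation code used in the experiments:

import numpy as np

def foldin_perplexity(test_docs, phi, estimate_theta, rng, held_ratio=1/3):
    # test_docs: list of token-id arrays; phi: (K, V) word probabilities for the relevant timestamp
    # estimate_theta: callable returning a (K,) topic distribution inferred from the observed tokens
    log_lik, n_held = 0.0, 0
    for tokens in test_docs:
        tokens = np.asarray(tokens)
        held = rng.random(len(tokens)) < held_ratio    # hold out roughly one-third of the tokens
        theta = estimate_theta(tokens[~held])          # fold-in on the remaining two-thirds
        probs = theta @ phi[:, tokens[held]]           # p(w) = sum_k theta_k phi_{k,w}
        log_lik += np.log(probs).sum()
        n_held += int(held.sum())
    return np.exp(-log_lik / n_held)                   # perplexity over the held-out tokens

# usage: foldin_perplexity(docs, phi, estimate_theta, np.random.default_rng(0))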
3.3 Comparison Results
Table 2. Evaluation results in terms of test perplexity (η: learning rate of Adam;
α, β: Dirichlet hyperparameters; †: DropConnect was used, cf. Section 3.3)

data set   K    method    perplexity (mean±stdev)   free parameters
MAI        100  VB w/ t   1115.64±3.94              η = .07, α = .03
                VB        1138.99±3.11              η = .07, α = .03
                CGS       1130.93±3.04              α = .02, β = .05
           50   VB w/ t   1283.05±5.74              η = .08, α = .02
                VB        1315.10±5.94              η = .08, α = .02
                CGS       1336.75±5.56              α = .04, β = .15
TDT4       100  VB w/ t   1480.82±3.85              η = .06, α = .02
                VB        1481.13±3.79              η = .06, α = .04
                CGS       1425.64±3.08              α = .02, β = .10
           50   VB w/ t   1587.45±3.25              η = .05, α = .01
                VB        1589.99±3.10              η = .10, α = .02
                CGS       1586.48±3.05              α = .05, β = .05
DBLP       100  VB w/ t   1274.45±3.88              η = 6e-4, α = .09†
                VB        1325.30±5.25              η = 8e-4, α = .08†
                CGS       1341.07±4.84              α = .03, β = .04
           50   VB w/ t   1300.09±5.69              η = 5e-4, α = .12†
                VB        1335.71±5.72              η = 6e-4, α = .10†
                CGS       1384.36±6.21              α = .04, β = .05
STOV       100  VB w/ t   1083.10±4.82              η = .015, α = .20†
                VB        1125.78±5.62              η = .010, α = .30†
                CGS       1042.18±4.19              α = .05, β = .01
           50   VB w/ t   1255.72±3.59              η = .018, α = .30
                VB        1266.16±3.43              η = .020, α = .35
                CGS       1178.20±6.16              α = .07, β = .04

Table 2 provides the mean and standard deviation of the ten test perplexities ob-
tained in the manner described above for each data set. It can be observed that
the mini-batch variational Bayes for the topic model using document timestamps
(i.e., the model based on Eq. (5)) gave the best test perplexity among the three
compared methods when K = 50 or 100 for MAI and when K = 50 or 100 for
DBLP. Further, it can also be observed that even when we adopt the topic model
using no timestamps (i.e., the model based on Eq. (4)), the test perplexity is
better than that of CGS in three cases, i.e., when K = 50 for MAI and when
K = 50 or 100 for DBLP. When we consider the fact that CGS often gives a
good test perplexity [1], it can be said that even when we use no timestamps, the
proposed mini-batch variational Bayes is a promising alternative to CGS when
the size of the training data set is prohibitively large for CGS.
Another important observation is that DropConnect [15] led to a remarkable
improvement when K = 50 or 100 for DBLP and when K = 100 for STOV.
These cases are marked by dagger ‘†’ in Table 2. DropConnect worked both for
the modeling with timestamps and for that without timestamps. Fig. 2 shows
to what extent DropConnect could improve the quality of extracted topics in
terms of validation perplexity for DBLP data set when K = 50. The horizontal
axis gives the number of seen mini-batches. The vertical axis gives the validation
perplexity computed during the course of inference. As Fig. 2 shows, DropCon-
nect improved the validation perplexity by around 500. Based on this kind of
result, we adopted DropConnect also for the other two cases.

Fig. 2. Comparing the inference with DropConnect (solid line) to the inference without
it (dashed line) in terms of validation perplexity for DBLP data set when K = 50.
The horizontal axis gives the number of seen mini-batches. The vertical axis gives the
perplexity computed over the validation set. The learning rate η and the Dirichlet
hyperparameter α were tuned separately for each inference. The validation perplexity
given by the inference without DropConnect (dashed line) reached a stable value, which
was around 1,870, after 2,000 mini-batches were seen. The validation perplexity given
by the inference with DropConnect reached around 1,320 as presented in the above
chart. That is, the perplexity was improved by around 500.
3.4 Examples of Extracted Topical Trends
Fig. 1 and Fig. 3 present the latent topics extracted by the proposed model as
word clouds.9
These two topics were extracted from 827,141 paper titles con-
tained in the training set of the DBLP data set. In the three panels of each figure,
we provide the top 50 words according to their probabilities at three different time
points, 1998, 2005, and 2012, respectively. The words having larger probabilities
are depicted in larger fonts. For the topic given in Fig. 1, seemingly related to
HCI, the word ‘object’ is more relevant to this topic in 1998 than in 2005 and
2012. In contrast, the word ‘application’ is more relevant in 2005 than in 1998
and 2012, and the words ‘interaction’ and ‘augmented’ are more relevant in 2012
than in 1998 and 2005. For the topic given in Fig. 3, seemingly related to wireless
networks, the word ‘cellular’ is more relevant to this topic in 1998 and 2005 than
in 2012, the word ‘wireless’ is more relevant in 2005 and 2012 than in 1998, and
the word ‘power’ is more relevant in 2012 than in 1998 and 2005. In this manner,
we can observe remarkable trends by comparing word probability distributions
coming from different time points. Our approach enables us to conduct this type
of time-aware topic analysis.
9
https://github.com/amueller/word_cloud
Fig. 3. Word cloud presentation of the same topic at three different time points. This
topic, seemingly related to wireless networks, was extracted from 827,141 paper titles in
the DBLP data set. K was set to 50. The left, center, and right panels correspond to the
time points 1998, 2005, and 2012, respectively.
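Word clouds like those in Fig. 1 and Fig. 3 can be produced with the wordcloud package cited in footnote 9; in the following sketch, phi is assumed to be a (K, T, V) array of the estimated probabilities and vocab the list of vocabulary words, and the top 50 words of the topic k at the time point t are sized by φk,t,v:

import numpy as np
from wordcloud import WordCloud

def topic_cloud(phi, vocab, k, t, top_n=50, out_file="topic.png"):
    probs = phi[k, t]                                            # word probabilities phi_{k,t,v}
    top = np.argsort(probs)[::-1][:top_n]                        # indices of the most probable words
    freqs = {vocab[v]: float(probs[v]) for v in top}             # font size driven by the probability
    WordCloud(width=600, height=400, background_color="white") \
        .generate_from_frequencies(freqs).to_file(out_file)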
4 Related Work
While NVDM [11] is the first proposal of document modeling using neural net-
works, several other proposals combine topic modeling with neural networks in
a more interesting and interpretable manner. Mikolov and Zweig [12] devised an
RNN-based language model using LDA. The output of the inference for LDA is
used as additional context information for language modeling. However, this
is neither a new topic model nor a new inference for topic models. Srivastava
and Sutton [14] proposed a topic model, called ProdLDA, in order to apply au-
toencoding variational Bayes, which was implemented by using neural networks.
Dieng et al. [6] proposed a topic model, called TopicRNN, where RNN is used to
obtain per-document word probabilities. Both of these models use the softmax
function to carry out the word-level mixture over per-topic word probabilities
after marginalizing each zd,i, a hidden variable representing the topic assign-
ment of the i-th word token in the document d. However, since both models
marginalize the zd,i’s out, they deviate widely from the vanilla LDA [3]. Our
method introduces only a limited complication to the vanilla LDA and keeps
the coordinate ascent related to the document-specific parameters untouched to
make the inference simple and efficient.
5 Conclusion
This paper proposed a topic model using document timestamps as a covariate and
provided a mini-batch inference, whose implementation was realized by using a
deep learning framework. While learning for topic models is performed in an un-
supervised manner, we can regard it as an optimization by adopting the variational
inference approach. Our mini-batch variational inference updated per-topic word
probabilities in a similar manner to neural network parameters and required
no knowledge about the training data size. We also showed that DropConnect
worked in our inference. It is an important future research direction to consider
covariates other than timestamps and to introduce more flexibility in modeling
by referring to the recent work discussed in Section 4.
References
1. Asuncion, A., Welling, M., Smyth, P., and Teh, Y. W.: On smoothing and inference
for topic models. In Proc. the Twenty-Fifth Conference on Uncertainty in Artificial
Intelligence (UAI ’09), 27–34 (2009)
2. Blei, D. M., Carin, L., and Dunson, D.: Probabilistic Topic Models. IEEE Signal
Process. Mag. 27, no. 6, 55–65 (2010)
3. Blei, D. M., Ng, A. Y., and Jordan, M. I.: Latent dirichlet allocation. J. Mach.
Learn. Res. 3, 993–1022 (2003)
4. Broderick, T., Boyd, N., Wibisono, A., Wilson, A. C., and Jordan, M. I.: Streaming
Variational Bayes. In Proc. Advances in Neural Information Processing Systems
(NIPS), 1727–1735 (2013)
5. Chen, S. F. and Goodman, J.: An Empirical Study of Smoothing Techniques for
Language Modeling. In Proc. Annual Meeting on Association for Computational
Linguistics (ACL), 310–318 (1996)
6. Dieng, A. B., Wang, C., Gao, J., and Paisley, J.: TopicRNN: A recurrent neural
network with long-range semantic dependency. arXiv preprint, arXiv:1611.01702
(2016)
7. Eisenstein, J., Ahmed, A., and Xing, E. P.: Sparse Additive Generative Models of
Text. In Proc. International Conference on Machine Learning (ICML), 1041–1048
(2011)
8. Griffiths T. L., and Steyvers, M.: Finding Scientific Topics. Proc. Natl. Acad. Sci.
U.S.A. 101, Suppl. 1, 5228–5235 (2004)
9. Hoffman, M. D., Blei, D. M., and Bach, F.: Online learning for latent Dirichlet
allocation. In Proc. Advances in Neural Information Processing Systems (NIPS),
856–864 (2010)
10. Kingma, D. P. and Ba, J. L.: Adam: a Method for Stochastic Optimization. In
Proc. International Conference on Learning Representations (ICLR) (2015)
11. Miao, Y., Yu, L., and Blunsom, P.: Neural Variational Inference for Text Process-
ing. In Proc. International Conference on Machine Learning (ICML), 1727–1736
(2016)
12. Mikolov, T. and Zweig, G.: Context Dependent Recurrent Neural Network Lan-
guage Model. In Proc. IEEE Spoken Language Technology Workshop, 234–239
(2012)
13. Roberts, M. E., Stewart, B. M., Tingley, D., and Airoldi, E. M.: The Structural
Topic Model and Applied Social Science. In Proc. Advances in Neural Information
Processing Systems Workshop on Topic Models: Computation, Application, and
Evaluation (2013)
14. Srivastava, A. and Sutton, C.: Autoencoding Variational Inference For Topic Mod-
els. In Proc. International Conference on Learning Representations (ICLR) (2017)
15. Wan, L., Zeiler, M., Zhang, S., Cun, Y. L., and Fergus, R.: Regularization of
Neural Networks using DropConnect. In Proc. International Conference on Machine
Learning (ICML), 1058–1066 (2013)

the proposed model, the φk,v's for each topic k are obtained by applying the softmax function to the V-dimensional embedding wk of the topic k plus a bias, i.e., φk,v = exp(wk,v + bv) / Σv′ exp(wk,v′ + bv′), where bv is the bias for the word v and is shared by all topics. Let W be the V × K matrix whose k-th column is wk. Then φk = (φk,1, . . . , φk,V) can be written as

φk = Softmax(W ek + b)    (1)

where ek is the one-hot vector whose k-th element is 1 and all other elements are 0. The formula in Eq. (1) is similar to that presented in [7]. However, we here provide a viewpoint from which these per-topic word probabilities φk for k = 1, . . . , K can be regarded as softmax outputs of a single-layer feedforward network whose input is always one-hot.

We adopt the same approach for document timestamps. Assume that there are T different timestamps and that ut is the vector embedding of the timestamp t. We then obtain time-aware per-topic word probabilities as

φk,t = Softmax(W ek + U ot + b)    (2)

where U is the V × T matrix whose t-th column is ut, and ot is the one-hot vector whose t-th element is 1 and all other elements are 0. φk,t = (φk,t,1, . . . , φk,t,V) in Eq. (2) represents the word probabilities in the topic k at the time point t (cf. Fig. 1). In the evaluation experiments, we compared the word probability modeling based on Eq. (2) to that based on Eq. (1) in terms of test perplexity. The experiments are expected to show whether document timestamps can improve the quality of extracted topics in terms of perplexity.
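To make Eqs. (1) and (2) concrete, the following is a minimal PyTorch sketch. The sizes V, K, T and the random initialization here are placeholders of our own choosing, not the settings used in the paper; the point is only that picking the k-th column of W and the t-th column of U is the same as multiplying by the one-hot inputs ek and ot.

    import torch

    # Placeholder sizes (not from the paper): vocabulary, topics, timestamps.
    V, K, T = 1000, 50, 15
    W = torch.randn(V, K) * 0.1   # topic embeddings, one column per topic
    U = torch.randn(V, T) * 0.1   # timestamp embeddings, one column per timestamp
    b = torch.zeros(V)            # per-word bias shared by all topics

    k, t = 3, 7                   # an arbitrary topic and timestamp

    # Eq. (1): word distribution of topic k, ignoring timestamps.
    phi_k = torch.softmax(W[:, k] + b, dim=0)

    # Eq. (2): time-aware word distribution of topic k at time point t.
    phi_kt = torch.softmax(W[:, k] + U[:, t] + b, dim=0)

    assert torch.isclose(phi_kt.sum(), torch.tensor(1.0))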
Fig. 1. Word cloud presentation of the same topic at three different time points. This topic, seemingly related to HCI, was extracted from 827,141 paper titles in the DBLP data set (cf. Section 3). K was set to 50. The words having larger probabilities are shown in larger fonts, because the font size is determined by the value of φk,t,v. The left, center, and right panels correspond to the time points 1998, 2005, and 2012, respectively.

1.2 Mini-Batch Variational Bayes

Our second contribution concerns the inference for the proposed model. Mini-batch variational Bayesian inferences have already been proposed for topic models [9][4]. However, these inferences require knowledge of the total number of training documents. Our second contribution is to provide a mini-batch learning for the proposed model, where no knowledge of the data size is required.

We perform the parameter estimation for our model in a similar manner to that for neural networks. We iterate over mini-batches, i.e., small subsets of the corpus, and conduct gradient descent where the loss function is the negative of the evidence lower bound (ELBO). Of course, we need to choose an appropriate optimization algorithm like Adam [10] and to adjust its learning rate based on validation set evaluation. However, this is what we usually do when training neural networks. The experimental results show that the test perplexity achieved by the proposed inference was comparable to that achieved by collapsed Gibbs sampling (CGS) for the vanilla LDA [8]. While CGS can give a good test perplexity in many situations [1], it is far less efficient in memory use than mini-batch learning. Since our inference is mini-batch learning, it can be adopted even when the number of training documents is prohibitively large for CGS.

The rest of the paper is organized as follows. Section 2 describes our proposal in detail. Section 3 provides the settings and results of evaluation experiments. Section 4 discusses some of the existing approaches related to ours. Section 5 concludes the paper by giving a future research direction.

2 Method

2.1 Variational Lower Bound

While there are many ways to perform an inference for topic models, we adopt variational Bayesian inference [3][9][4], because it allows us to perform inference as an optimization. Our topic model is a modification of the vanilla LDA [3]. In our model, a symmetric Dirichlet prior Dirichlet(α) is assigned to the per-document categorical distributions over topics as in the vanilla LDA. In contrast, no prior is assigned to the per-topic categorical distributions over words
in our model, because the word probabilities are directly estimated in a time-aware manner. The evidence lower bound (ELBO) to be maximized in variational Bayesian inference can thus be written as follows [3]:

Σd log p(xd; α, Φ) ≥ Σd,v,k nd,v γd,v,k log φk,v
    + Σd,k (α + Σv nd,v γd,v,k − λd,k) (Ψ(λd,k) − Ψ(Σk′ λd,k′))
    − Σd,v,k nd,v γd,v,k log γd,v,k + log Γ(Kα) − K log Γ(α)    (3)

where xd is the bag-of-words representation of the document d. Φ = {φ1, . . . , φK} are the parameter vectors of the per-topic categorical distributions over words, i.e., the per-topic word probabilities. nd,v is the frequency of the word v in d. γd,v,k is the responsibility of the word v with respect to the topic k in d. λd,k is the posterior parameter for the topic k in d. Ψ is the digamma function. We regard the parameter α of the symmetric Dirichlet prior Dirichlet(α) as a free parameter. The document-specific parameters λd,k and γd,v,k are estimated in the same manner as in [3][9][4]. However, we estimate Φ differently, because the per-topic word probabilities are modeled differently. The details are given below.
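As a minimal sketch of how the terms of Eq. (3) can be evaluated over a mini-batch, the following function assumes a tensor layout of our own choosing (documents stacked along the first dimension) and, for brevity, ignores timestamps; with Eq. (5) one would instead pick the log-probability slice matching each document's timestamp. It is an illustration, not the authors' implementation.

    import torch

    def elbo_terms(n, gamma, lam, log_phi, alpha):
        """Value of the bound in Eq. (3) for one mini-batch.

        n:       (D, V) word counts n_{d,v}
        gamma:   (D, V, K) responsibilities gamma_{d,v,k}
        lam:     (D, K) Dirichlet posterior parameters lambda_{d,k}
        log_phi: (K, V) log word probabilities log phi_{k,v}
        alpha:   scalar Dirichlet hyperparameter
        """
        K = gamma.size(2)
        ngamma = n.unsqueeze(2) * gamma                      # n_{d,v} * gamma_{d,v,k}
        term1 = (ngamma * log_phi.t().unsqueeze(0)).sum()    # expected word log-likelihood
        dig = torch.digamma(lam) - torch.digamma(lam.sum(1, keepdim=True))
        term2 = ((alpha + ngamma.sum(1) - lam) * dig).sum()  # Dirichlet / topic terms
        term3 = -(ngamma * torch.log(gamma + 1e-12)).sum()   # entropy of q(z)
        term4 = (torch.lgamma(torch.tensor(K * alpha))
                 - K * torch.lgamma(torch.tensor(alpha)))    # constant in W, U, b
        return term1 + term2 + term3 + term4

During training, the negative of this quantity serves as the loss that is backpropagated to the corpus-wide parameters W, U, and b.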
2.2 Modeling of Time-Awareness

The vanilla LDA represents the per-topic word distributions simply as categorical distributions, or as categorical distributions to which a Dirichlet prior is assigned. In our model, the parameter vector φk of the word categorical distribution for each topic k is given by

φk,v = exp(wk,v + bv) / Σv′ exp(wk,v′ + bv′)    (4)

where the softmax function is used. wk can be viewed as a vector embedding of the topic k, as discussed in Section 1.1. While we use no prior, the bias term bv in Eq. (4) plays a similar role to that of a Dirichlet prior in smoothing [5][1], because bv is shared by all topics.

Our proposal also considers the usage of covariates like timestamps, authors, venues, etc. In this paper, we explore chronological trends in document contents and thus use document timestamps to make the per-topic word probabilities time-aware, though other covariates may be used in a similar manner. By using timestamps, the per-topic word probabilities are given by

φk,t,v = exp(wk,v + ut,v + bv) / Σv′ exp(wk,v′ + ut,v′ + bv′)    (5)

where t represents document timestamps. U = (ut,v) is a weight matrix modeling the dependency of word probabilities on timestamps. Each column ut of U can be viewed as a vector embedding of the timestamp t. As already discussed in Section 1.1, the φk,t,v's can also be written as softmax outputs of a single-layer network whose input is always one-hot:

φk,t = Softmax(W ek + U ot + b)    (6)

where ek and ot are one-hot input vectors corresponding to the latent topic k and the timestamp t, respectively.

2.3 Implementation

In the manner described above, the parameter vector φk and its time-aware version φk,t are regarded as outputs of the softmax function. The weights and biases, i.e., W, U, and b, are estimated by maximizing the ELBO in Eq. (3). We implemented the maximization as a mini-batch learning by using PyTorch. Thanks to the broadcasting mechanism in PyTorch, we can write Eq. (6) as

    phi = torch.nn.functional.softmax(
        W.unsqueeze(2) + U.unsqueeze(1)
        + b.unsqueeze(1).unsqueeze(2), dim=0)

by collecting all φk,t,v's in one place. In the evaluation experiments, we found that DropConnect [15] worked very well when it was difficult to obtain a good validation perplexity. This can be implemented by modifying the above code snippet as follows:

    dropout = torch.nn.Dropout(p=dropout_prob)
    phi = torch.nn.functional.softmax(
        dropout(W).unsqueeze(2) + dropout(U).unsqueeze(1)
        + b.unsqueeze(1).unsqueeze(2), dim=0)

The dropout probability dropout_prob was chosen from {0.2, 0.5} in our experiments. No further fine-tuning of the probability was required. However, it was mandatory to check whether DropConnect could improve test perplexity, because the extent of improvement sometimes reached a significant level.
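As a quick sanity check of the broadcast above, the following sketch (with placeholder sizes of our own choosing, not the paper's) confirms that the result is a (V, K, T) tensor in which each (k, t) slice is a probability distribution over the vocabulary:

    import torch

    V, K, T = 1000, 50, 15                       # placeholder sizes
    W = torch.randn(V, K, requires_grad=True)    # topic embeddings
    U = torch.randn(V, T, requires_grad=True)    # timestamp embeddings
    b = torch.zeros(V, requires_grad=True)       # shared word bias

    phi = torch.nn.functional.softmax(
        W.unsqueeze(2) + U.unsqueeze(1)
        + b.unsqueeze(1).unsqueeze(2), dim=0)

    print(phi.shape)                                           # torch.Size([1000, 50, 15])
    # phi[:, k, t] is the word distribution of topic k at time t; it sums to 1.
    print(torch.allclose(phi.sum(dim=0), torch.ones(K, T)))    # True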
2.4 Parameterization Efficiency

Here we give an additional discussion on modeling. As Eq. (5) shows, our method combines the parameters indexed by the tuple (k, v) with those indexed by (t, v) to make our topic extraction dependent on document timestamps. Therefore, the memory space for these parameters is O(KV + TV). However, it is also possible to introduce parameters indexed explicitly by the 3-tuple (k, t, v) as follows, in the same manner as in [7] and [13]:

φk,t,v = exp(wk,v + ut,v + rk,t,v + bv) / Σv′ exp(wk,v′ + ut,v′ + rk,t,v′ + bv′)    (7)

A merit of this approach is that we can separate out the words not depending on topics but exclusively depending on timestamps (e.g., '2001', 'December', etc.) from the words depending both on topics and on timestamps equally (e.g., 'ATM', 'WDM', etc.4). The words of the former type may appear in documents related to a wide range of topics but only appear at some limited set of time points. Therefore, such words may have probabilities mainly determined by the parameters indexed by (t, v), and the contribution of the parameters indexed by (k, t, v) may be small. The words of the latter type may appear in documents related to a narrow range of topics and, at the same time, only appear at some limited set of time points. Such words may have probabilities determined both by the parameters indexed by (t, v) and by those indexed by (k, t, v).

4 These words are observed often in the titles of computer science papers related to networking and communication but only appear across a limited range of years.

Our model has no parameters indexed by (k, t, v). Therefore, it may be difficult to make a clear distinction between the above two types of words. However, an obvious drawback of the above approach is the memory space requirement of O(KTV). When we use a GPU, it can be prohibitive to allocate memory for K × T × V parameters, each requiring its gradient. Even without using this many parameters, we could obtain interesting time-dependencies, as presented in Fig. 1 and in Fig. 3. Therefore, we did not introduce the parameters explicitly indexed by (k, t, v).

2.5 Details of Inference Algorithm

Algorithm 1 is the inference for the proposed model. The inference related to the document-specific parameters is performed in the same manner as in [3][9][4]. Only the corpus-wide parameters, i.e., W, U, and b, are updated by backpropagation. The inputs for the algorithm are: (1) the frequency nd,v of the word v in the document d; (2) the parameter α of the symmetric Dirichlet prior distribution; and (3) the initial learning rate η of the optimization algorithm. As Algorithm 1 shows, we do not need to know the total number of training documents.

The elements of W and U were initialized by samples from the Gaussian distribution with zero mean and a variance of 0.01. The bias terms b were initialized to zero. For updating these parameters, we used the Adam optimizer [10]. Let the initial learning rate of Adam be denoted by η. In our implementation, the learning rate for the m-th mini-batch was set to (1 + m/500)^−0.7 × η. Further, η was halved every 500 mini-batches until 2,000 mini-batches were seen.
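Read literally, this schedule combines a polynomial decay with a stepwise halving of the base rate. The sketch below gives one possible reading; the exact interaction of the two rules is not fully specified in the text, so the assumption that the halving occurs at m = 500, 1000, 1500, and 2000 and then stops is ours.

    def learning_rate(m, eta0):
        """Learning rate for the m-th mini-batch, as we read Section 2.5.

        eta0 is the initial Adam learning rate eta. We assume at most four
        halvings, applied at m = 500, 1000, 1500, and 2000.
        """
        halvings = min(m // 500, 4)
        eta = eta0 / (2 ** halvings)
        return (1 + m / 500) ** (-0.7) * eta

    # Example: with eta0 = 0.07 (one of the tuned values reported in Table 2),
    # the rate decays from 0.07 at m = 0 to roughly 0.0014 at m = 2000.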
Algorithm 1 Mini-batch variational Bayes

 1: Input: nd,v for each mini-batch, α, and η
 2: Initialize W, U, and b
 3: while True do
 4:   φk,t ← Softmax(W ek + U ot + b)
 5:   Get next mini-batch
 6:   for each document d in the mini-batch do
 7:     t ← timestamp of document d
 8:     Initialize γd,v,k randomly
 9:     repeat
10:       λd,k ← α + Σv nd,v γd,v,k
11:       γd,v,k ∝ exp(Ψ(λd,k)) × φk,t,v
12:     until the change in λd,k is negligible
13:   end for
14:   Make the computational graph of the negative of the ELBO
15:   Backpropagate
16:   Update W, U, and b by using η
17:   Update η if necessary
18: end while
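To make the per-document loop of Algorithm 1 concrete, here is a minimal sketch of one training step under simplifications of our own (a single document, dense tensors, a fixed number of inner iterations instead of a convergence check, and only the φ-dependent part of the ELBO as the loss). It is an illustration, not the authors' code.

    import torch

    V, K, T, alpha = 1000, 50, 15, 0.05          # placeholder sizes and prior
    W = torch.randn(V, K) * 0.1; W.requires_grad_()
    U = torch.randn(V, T) * 0.1; U.requires_grad_()
    b = torch.zeros(V, requires_grad=True)
    opt = torch.optim.Adam([W, U, b], lr=1e-2)

    n_d = torch.randint(0, 3, (V,)).float()      # toy word counts n_{d,v}
    t_d = 7                                      # the document's timestamp

    # Line 4: time-aware log word probabilities for timestamp t_d, shape (V, K).
    log_phi = torch.log_softmax(W + U[:, t_d:t_d + 1] + b.unsqueeze(1), dim=0)

    # Lines 8-12: coordinate ascent for gamma and lambda (fixed 20 iterations,
    # run without gradient tracking; only W, U, b are learned by backprop).
    with torch.no_grad():
        gamma = torch.full((V, K), 1.0 / K)
        for _ in range(20):
            lam = alpha + (n_d.unsqueeze(1) * gamma).sum(0)              # line 10
            gamma = torch.softmax(torch.digamma(lam) + log_phi, dim=1)   # line 11

    # Lines 14-16: backpropagate the negative of the word-likelihood term of
    # the ELBO, which is the only term depending on W, U, and b.
    loss = -(n_d.unsqueeze(1) * gamma * log_phi).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()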
3 Experiments

3.1 Data Sets

The specifications of the four document sets used in the evaluation experiments are given in Table 1, where the total number of documents, the vocabulary size, and the number of different document timestamps are denoted by D, V, and T, respectively. A detailed explanation is given below.

The first data set 'MAI' is a set of newswire articles from the Mainichi, a Japanese newspaper. A morphological analysis was performed by using MeCab.5 The dates of the articles range from November 1, 2007 to May 15, 2008. We split the date range into 13 slices of nearly equal size and used the slices as document timestamps. The second data set 'TDT4' is a set of English documents from TDT4 Multilingual Text and Annotations.6 The dates of the documents range from December 1, 2000 to January 31, 2001. We split the date range into 15 slices of nearly equal size and used them as document timestamps. The third data set 'DBLP' is a set of paper titles obtained from the DBLP web site.7 We regarded each title as a single document. Therefore, this is a set of documents of quite short length. The publication years ranging from 1998 to 2012 were used as document timestamps. The last data set 'STOV' is a subset of the questions in the StackOverflow data set available at Kaggle.8 We used as document timestamps the months in which the questions were published, ranging from January 2014 to December 2016. For all data sets, English words were lower-cased, and highly frequent words and infrequent words were removed.

5 http://taku910.github.io/mecab/
6 https://catalog.ldc.upenn.edu/LDC2005T16
7 http://dblp.uni-trier.de/xml/
8 https://www.kaggle.com/stackoverflow/rquestions

Table 1. Specifications of data sets

        D          V       T
MAI     32,775     15,161  13
TDT4    96,246     15,153  15
DBLP    1,034,067  18,940  15
STOV    25,608     13,184  34

The training:validation:test ratio was 8:1:1 for all data sets. We used one random split, as in the usual experiments for neural networks. The mini-batch size was 200 for the MAI, TDT4, and STOV data sets and 10,000 for the DBLP data set, because DBLP is a set of short documents.

3.2 Evaluation Method

We compared the following three approaches for topic extraction: the mini-batch inference for the topic model with document timestamps (cf. Eq. (5)), the mini-batch inference for the topic model without document timestamps (cf. Eq. (4)), and collapsed Gibbs sampling (CGS) [8] for the vanilla LDA. These compared methods are referred to as 'VB w/ t', 'VB', and 'CGS', respectively, in Table 2, which summarizes the evaluation results. With respect to K, i.e., the number of latent topics, we tested two settings: K = 50 and K = 100.

The comparison is performed in terms of test perplexity [3]. However, each approach has its own free parameters to be tuned. Therefore, we tuned the free parameters based on the perplexity computed over the validation set. For 'VB w/ t' and 'VB', the parameter α of the symmetric Dirichlet prior distribution and the initial learning rate η of Adam were tuned based on validation perplexity. The tuned values of η and α are given in Table 2. With respect to 'CGS', not only α but also the parameter β of the symmetric Dirichlet prior assigned to the per-topic word categorical distributions was tuned based on validation perplexity. The tuned values of α and β are given in Table 2.

Both the validation and test perplexities were computed by using the fold-in procedure [1], where a randomly selected two-thirds of the word tokens were used for obtaining topic posterior probabilities in each document. There are myriad ways to select two-thirds of the word tokens. Therefore, when computing test perplexity, i.e., when performing the final evaluation (and not when computing validation perplexity), we repeated the computation of perplexity ten times, each over a different set of randomly chosen two-thirds of the word tokens. Also for CGS, we performed the evaluation in the same manner. We then calculated the mean and standard deviation of the ten resulting test perplexities as given in Table 2.
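The following sketch shows how such a fold-in test perplexity can be computed, under our reading that the held-out remaining third of each document's tokens is scored with the topic proportions estimated from the observed two-thirds. The paper does not spell out this last step, and all names here (foldin_perplexity, theta_from_tokens) are our own, so treat this as an assumption-laden illustration rather than the authors' procedure.

    import math, random

    def foldin_perplexity(docs, phi, theta_from_tokens, seed=0):
        """docs: list of (timestamp, token id list); phi[t][k][v]: word probability;
        theta_from_tokens(tokens, t): per-topic proportions estimated from the
        observed two-thirds (e.g., by the gamma/lambda updates of Algorithm 1)."""
        rng = random.Random(seed)
        log_lik, n_eval = 0.0, 0
        for t, tokens in docs:
            tokens = tokens[:]
            rng.shuffle(tokens)
            cut = (2 * len(tokens)) // 3
            observed, held_out = tokens[:cut], tokens[cut:]
            theta = theta_from_tokens(observed, t)          # fold-in step
            for v in held_out:                              # score the rest
                log_lik += math.log(sum(theta[k] * phi[t][k][v]
                                        for k in range(len(theta))))
                n_eval += 1
        return math.exp(-log_lik / n_eval)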
3.3 Comparison Results

Table 2. Evaluation results in terms of test perplexity (free parameters: η is the learning rate; α and β are Dirichlet hyperparameters)

data set  K    method   perplexity (mean±stdev)  free parameters
MAI       100  VB w/ t  1115.64±3.94             η = .07, α = .03
               VB       1138.99±3.11             η = .07, α = .03
               CGS      1130.93±3.04             α = .02, β = .05
          50   VB w/ t  1283.05±5.74             η = .08, α = .02
               VB       1315.10±5.94             η = .08, α = .02
               CGS      1336.75±5.56             α = .04, β = .15
TDT4      100  VB w/ t  1480.82±3.85             η = .06, α = .02
               VB       1481.13±3.79             η = .06, α = .04
               CGS      1425.64±3.08             α = .02, β = .10
          50   VB w/ t  1587.45±3.25             η = .05, α = .01
               VB       1589.99±3.10             η = .10, α = .02
               CGS      1586.48±3.05             α = .05, β = .05
DBLP      100  VB w/ t  1274.45±3.88             η = 6e-4, α = .09†
               VB       1325.30±5.25             η = 8e-4, α = .08†
               CGS      1341.07±4.84             α = .03, β = .04
          50   VB w/ t  1300.09±5.69             η = 5e-4, α = .12†
               VB       1335.71±5.72             η = 6e-4, α = .10†
               CGS      1384.36±6.21             α = .04, β = .05
STOV      100  VB w/ t  1083.10±4.82             η = .015, α = .20†
               VB       1125.78±5.62             η = .010, α = .30†
               CGS      1042.18±4.19             α = .05, β = .01
          50   VB w/ t  1255.72±3.59             η = .018, α = .30
               VB       1266.16±3.43             η = .020, α = .35
               CGS      1178.20±6.16             α = .07, β = .04

Table 2 provides the mean and standard deviation of the ten test perplexities obtained in the manner described above for each data set. It can be observed that the mini-batch variational Bayes for the topic model using document timestamps (i.e., the model based on Eq. (5)) gave the best test perplexity among the three compared methods when K = 50 or 100 for MAI and when K = 50 or 100 for DBLP. Further, it can also be observed that even when we adopt the topic model using no timestamps (i.e., the model based on Eq. (4)), the test perplexity is better than that of CGS in three cases, i.e., when K = 50 for MAI and when K = 50 or 100 for DBLP. When we consider the fact that CGS often gives a good test perplexity [1], it can be said that even when we use no timestamps, the proposed mini-batch variational Bayes is a promising alternative to CGS when the size of the training data set is prohibitively large for CGS.

Another important observation is that DropConnect [15] led to a remarkable improvement when K = 50 or 100 for DBLP and when K = 100 for STOV. These cases are marked by a dagger '†' in Table 2. DropConnect worked both for the modeling with timestamps and for that without timestamps. Fig. 2 shows to what extent DropConnect could improve the quality of extracted topics in terms of validation perplexity for the DBLP data set when K = 50. The horizontal axis gives the number of seen mini-batches. The vertical axis gives the validation perplexity computed during the course of inference.
Fig. 2. Comparing the inference with DropConnect (solid line) to the inference without it (dashed line) in terms of validation perplexity for the DBLP data set when K = 50. The horizontal axis gives the number of seen mini-batches. The vertical axis gives the perplexity computed over the validation set. The learning rate η and the Dirichlet hyperparameter α were tuned separately for each inference. The validation perplexity given by the inference without DropConnect (dashed line) reached a stable value, around 1,870, after 2,000 mini-batches were seen. The validation perplexity given by the inference with DropConnect reached around 1,320. That is, the perplexity was improved by around 500.

As Fig. 2 shows, DropConnect improved the validation perplexity by around 500. Based on this kind of result, we adopted DropConnect also for the other two cases.

3.4 Examples of Extracted Topical Trends

Fig. 1 and Fig. 3 present the latent topics extracted by the proposed model as word clouds.9 These two topics were extracted from the 827,141 paper titles contained in the training set of the DBLP data set. In the three panels of each figure, we provide the top 50 words according to their probabilities at three different time points: 1998, 2005, and 2012. The words having larger probabilities are depicted in larger fonts. For the topic given in Fig. 1, seemingly related to HCI, the word 'object' is more relevant to this topic in 1998 than in 2005 and 2012. In contrast, the word 'application' is more relevant in 2005 than in 1998 and 2012, and the words 'interaction' and 'augmented' are more relevant in 2012 than in 1998 and 2005. For the topic given in Fig. 3, seemingly related to wireless networks, the word 'cellular' is more relevant to this topic in 1998 and 2005 than in 2012, the word 'wireless' is more relevant in 2005 and 2012 than in 1998, and the word 'power' is more relevant in 2012 than in 1998 and 2005. In this manner, we can observe remarkable trends by comparing word probability distributions coming from different time points. Our approach enables us to conduct this type of time-aware topic analysis.

9 https://github.com/amueller/word_cloud
Fig. 3. Word cloud presentation of the same topic at three different time points. This topic, seemingly related to wireless networks, was extracted from 827,141 paper titles in the DBLP data set. K was set to 50. The left, center, and right panels correspond to the time points 1998, 2005, and 2012, respectively.

4 Related work

While NVDM [11] is the first proposal of document modeling using neural networks, several other proposals combine topic modeling with neural networks in a more interesting and interpretable manner. Mikolov and Zweig [12] devised an RNN-based language model using LDA. The output of the inference for LDA is used as additional context information for language modeling. However, this is neither a new topic model nor a new inference for topic models. Srivastava and Sutton [14] proposed a topic model, called ProdLDA, in order to apply autoencoding variational Bayes, which was implemented by using neural networks. Dieng et al. [6] proposed a topic model, called TopicRNN, where an RNN is used to obtain per-document word probabilities. Both of these models use the softmax function to carry out the word-level mixture over per-topic word probabilities after marginalizing each zd,i, a hidden variable representing the topic assignment of the i-th word token in the document d. However, since both models marginalize the zd,i's out, they deviate widely from the vanilla LDA [3]. Our method introduces only a limited complication to the vanilla LDA and keeps the coordinate ascent related to the document-specific parameters untouched to make the inference simple and efficient.

5 Conclusion

This paper proposed a topic model using document timestamps as a covariate and provided a mini-batch inference, whose implementation was realized by using a deep learning framework. While learning for topic models is performed in an unsupervised manner, we can regard it as an optimization by adopting the variational inference approach. Our mini-batch variational inference updated the per-topic word probabilities in a similar manner to neural network parameters and required no knowledge about the training data size. We also showed that DropConnect worked in our inference. It is an important future research direction to consider covariates other than timestamps and to introduce more flexibility in modeling by referring to the recent work discussed in Section 4.
References

1. Asuncion, A., Welling, M., Smyth, P., and Teh, Y. W.: On smoothing and inference for topic models. In Proc. the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI), 27–34 (2009)
2. Blei, D. M., Carin, L., and Dunson, D.: Probabilistic topic models. IEEE Signal Process. Mag. 27, no. 6, 55–65 (2010)
3. Blei, D. M., Ng, A. Y., and Jordan, M. I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
4. Broderick, T., Boyd, N., Wibisono, A., Wilson, A. C., and Jordan, M. I.: Streaming variational Bayes. In Proc. Advances in Neural Information Processing Systems (NIPS), 1727–1735 (2013)
5. Chen, S. F. and Goodman, J.: An empirical study of smoothing techniques for language modeling. In Proc. Annual Meeting of the Association for Computational Linguistics (ACL), 310–318 (1996)
6. Dieng, A. B., Wang, C., Gao, J., and Paisley, J.: TopicRNN: A recurrent neural network with long-range semantic dependency. arXiv preprint arXiv:1611.01702 (2016)
7. Eisenstein, J., Ahmed, A., and Xing, E. P.: Sparse additive generative models of text. In Proc. International Conference on Machine Learning (ICML), 1041–1048 (2011)
8. Griffiths, T. L. and Steyvers, M.: Finding scientific topics. Proc. Natl. Acad. Sci. U.S.A. 101, Suppl. 1, 5228–5235 (2004)
9. Hoffman, M. D., Blei, D. M., and Bach, F.: Online learning for latent Dirichlet allocation. In Proc. Advances in Neural Information Processing Systems (NIPS), 856–864 (2010)
10. Kingma, D. P. and Ba, J. L.: Adam: A method for stochastic optimization. In Proc. International Conference on Learning Representations (ICLR) (2015)
11. Miao, Y., Yu, L., and Blunsom, P.: Neural variational inference for text processing. In Proc. International Conference on Machine Learning (ICML), 1727–1736 (2016)
12. Mikolov, T. and Zweig, G.: Context dependent recurrent neural network language model. In Proc. IEEE Spoken Language Technology Workshop, 234–239 (2012)
13. Roberts, M. E., Stewart, B. M., Tingley, D., and Airoldi, E. M.: The structural topic model and applied social science. In Proc. Advances in Neural Information Processing Systems Workshop on Topic Models: Computation, Application, and Evaluation (2013)
14. Srivastava, A. and Sutton, C.: Autoencoding variational inference for topic models. In Proc. International Conference on Learning Representations (ICLR) (2017)
15. Wan, L., Zeiler, M., Zhang, S., Cun, Y. L., and Fergus, R.: Regularization of neural networks using DropConnect. In Proc. International Conference on Machine Learning (ICML), 1058–1066 (2013)