This document introduces supervised topic models, which are extensions of latent Dirichlet allocation (LDA) that allow topic models to be fit explicitly for prediction tasks. Supervised LDA models documents and their associated response variables (like ratings or categories) jointly, with the goal of discovering topics predictive of the responses. The model assumes the response depends on the topic proportions of the document, allowing it to blend generative and discriminative modeling by conditioning the response on the words through the topic assignments.
1. Supervised and Relational Topic Models
David M. Blei
Department of Computer Science
Princeton University
October 5, 2009
Joint work with Jonathan Chang and Jon McAuliffe
2. Topic modeling
• Large electronic archives of document collections require new
statistical tools for analyzing text.
• Topic models have emerged as a powerful technique for
unsupervised analysis of large document collections.
• Topic models posit latent topics in text using hidden random
variables, and uncover that structure with posterior inference.
• Useful for tasks like browsing, search, information retrieval, etc.
3. Examples of topic modeling
Five example topics (top words of each):
• contractual, expectation, gain, promises, expectations, breach, enforcing, supra, note, perform
• employment, industrial, local, jobs, employees, relations, unfair, agreement, economic, case
• female, men, women, see, sexual, note, employer, discrimination, harassment, gender
• markets, earnings, investors, sec, research, structure, managers, firm, risk, large
• criminal, discretion, justice, civil, process, federal, see, officer, parole, inmates
4. Examples of topic modeling
[Figure: a map of computer science papers laid out by topic. Paper titles (e.g., "Quantum lower bounds by polynomials", "Dense quantum coding and quantum finite automata", "How bad is selfish routing?", "Authoritative sources in a hyperlinked environment", "On XML integrity constraints in the presence of DTDs", "A mechanical proof of the Church-Rosser theorem", "The maximum concurrent flow problem") appear near clusters of topic words such as quantum/automata, routing/networks/packets, learning/examples, constraints/database/query, logic/proof/formulas, and queuing/performance.]
5. Examples of topic modeling
Top words, by decade, of a topic that evolves over time:
1880: electric, machine, power, engine, steam, two, machines, iron, battery, wire
1890: electric, power, company, steam, electrical, machine, two, system, motor, engine
1900: apparatus, steam, power, engine, engineering, water, construction, engineer, room, feet
1910: air, water, engineering, apparatus, room, laboratory, engineer, made, gas, tube
1920: apparatus, tube, air, pressure, water, glass, gas, made, laboratory, mercury
1930: tube, apparatus, glass, air, mercury, laboratory, pressure, made, gas, small
1940: air, tube, apparatus, glass, laboratory, rubber, pressure, small, mercury, gas
1950: tube, apparatus, glass, air, chamber, instrument, small, laboratory, pressure, rubber
1960: tube, system, temperature, air, heat, chamber, power, high, instrument, control
1970: air, heat, power, system, temperature, chamber, high, flow, tube, design
1980: high, power, design, heat, system, systems, devices, instruments, control, large
1990: materials, high, power, current, applications, technology, devices, design, device, heat
2000: devices, device, materials, current, gate, high, light, silicon, material, technology
6. Examples of topic modeling
[Figure: topics discovered from a corpus of scientific articles, shown as clusters of related words, e.g., {neurons, brain, stimulus, motor, memory, visual, synapses}, {p53, cell cycle, cyclin, kinase, phosphorylation}, {dna, sequence, genome, genes, cdna}, {patients, disease, treatment, drugs, clinical}, {magnetic, superconductivity, spin, pressure}, {stars, galaxies, universe, astronomers}, {earthquake, fault, volcanic, climate, co2}.]
7. Supervised topic models
• These applications of topic modeling work in the same way.
• Fit a model using a likelihood criterion. Then, hope that the
resulting model is useful for the task at hand.
• Supervised topic models and relational topic models fit
topics explicitly to perform prediction.
• Useful for building topic models that can
• Predict the rating of a review
• Predict the category of an image
• Predict the links emitted from a document
9. Probabilistic modeling
1 Treat data as observations that arise from a generative
probabilistic process that includes hidden variables
• For documents, the hidden variables reflect the thematic
structure of the collection.
2 Infer the hidden structure using posterior inference
• What are the topics that describe this collection?
3 Situate new data into the estimated model.
• How does this query or new document fit into the estimated
topic structure?
11. Generative model
[Figure: example topics as distributions over words (e.g., gene 0.04, dna 0.02, genetic 0.01, ...; life 0.02, evolve 0.01, organism 0.01, ...; brain 0.04, neuron 0.02, nerve 0.01, ...; data 0.02, number 0.02, computer 0.01, ...), together with a document, its per-word topic assignments, and its topic proportions.]
• Each document is a random mixture of corpus-wide topics
• Each word is drawn from one of those topics
12. The posterior distribution
• In reality, we only observe the documents
• Our goal is to infer the underlying topic structure
13. Latent Dirichlet allocation
Graphical model: α → θd → Zd,n → Wd,n ← βk ← η, with plates over words (N), documents (D), and topics (K).
• α: Dirichlet parameter
• θd: per-document topic proportions
• Zd,n: per-word topic assignment
• Wd,n: observed word
• βk: topics
• η: topic hyperparameter
Each piece of the structure is a random variable.
14. Latent Dirichlet allocation
βk ∼ Dir(η),  k = 1, ..., K
θd ∼ Dir(α),  d = 1, ..., D
Zd,n | θd ∼ Mult(1, θd),  d = 1, ..., D,  n = 1, ..., N
Wd,n | θd, zd,n, β1:K ∼ Mult(1, βzd,n),  d = 1, ..., D,  n = 1, ..., N
15. Latent Dirichlet allocation
1 Draw each topic βk ∼ Dir(η), for k ∈ {1, . . . , K }.
2 For each document:
1 Draw topic proportions θd ∼ Dir(α).
2 For each word:
1 Draw Zd,n ∼ Mult(θd ).
2 Draw Wd,n ∼ Mult(βzd,n ).
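A minimal NumPy sketch of this generative process follows; the corpus size, vocabulary size, number of topics, and hyperparameter values are illustrative assumptions, not values from the talk.

import numpy as np

def generate_lda_corpus(D=100, N=50, K=10, V=1000, alpha=0.5, eta=0.1, seed=0):
    """Sample a synthetic corpus from the LDA generative process (illustrative)."""
    rng = np.random.default_rng(seed)

    # 1. Draw each topic beta_k ~ Dir(eta) over the vocabulary.
    beta = rng.dirichlet(np.full(V, eta), size=K)        # K x V

    corpus = []
    for _ in range(D):
        # 2.1 Draw per-document topic proportions theta_d ~ Dir(alpha).
        theta = rng.dirichlet(np.full(K, alpha))
        # 2.2 For each word: draw z_{d,n} ~ Mult(theta), then w_{d,n} ~ Mult(beta_{z_{d,n}}).
        z = rng.choice(K, size=N, p=theta)
        w = np.array([rng.choice(V, p=beta[k]) for k in z])
        corpus.append((theta, z, w))
    return beta, corpus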
16. Latent Dirichlet allocation
• From a collection of documents, infer
• Per-word topic assignment zd,n
• Per-document topic proportions θd
• Per-corpus topic distributions βk
• Use posterior expectations to perform the task at hand, e.g.,
information retrieval, document similarity, etc.
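As one concrete use of those posterior expectations, the inferred topic proportions of two documents can be compared directly. A small sketch, assuming θ̂ vectors estimated by posterior inference; the choice of Hellinger distance is an illustrative assumption, not prescribed by LDA.

import numpy as np

def topic_similarity(theta_a, theta_b):
    """Similarity of two documents via their inferred topic proportions.

    Uses one minus the Hellinger distance between the two distributions.
    """
    h = np.sqrt(0.5 * np.sum((np.sqrt(theta_a) - np.sqrt(theta_b)) ** 2))
    return 1.0 - h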
17. Latent Dirichlet allocation
• Computing the posterior is intractable:
p(θ, z1:N | w1:N, α, β1:K)
  = [ p(θ | α) ∏_{n=1}^N p(zn | θ) p(wn | zn, β1:K) ]
    / [ ∫ p(θ | α) ∏_{n=1}^N Σ_{zn=1}^K p(zn | θ) p(wn | zn, β1:K) dθ ]
• Several approximation techniques have been developed.
18. Latent Dirichlet allocation
• Mean field variational methods (Blei et al., 2001, 2003)
• Expectation propagation (Minka and Lafferty, 2002)
• Collapsed Gibbs sampling (Griffiths and Steyvers, 2002)
• Collapsed variational inference (Teh et al., 2006)
20. Example topics
“Genetics” “Evolution” “Disease” “Computers”
human evolution disease computer
genome evolutionary host models
dna species bacteria information
genetic organisms diseases data
genes life resistance computers
sequence origin bacterial system
gene biology new network
molecular groups strains systems
sequencing phylogenetic control model
map living infectious parallel
information diversity malaria methods
genetics group parasite networks
mapping new parasites software
project two united new
sequences common tuberculosis simulations
22. LDA summary
• LDA is a powerful model for
• Visualizing the hidden thematic structure in large corpora
• Situating new data within that structure
• LDA is a mixed membership model (Erosheva, 2004) that builds
on the work of Deerwester et al. (1990) and Hofmann (1999).
• For document collections and other grouped data, this might be
more appropriate than a simple finite mixture.
• The same model was independently invented for population
genetics analysis (Pritchard et al., 2000).
23. LDA summary
• Modular : It can be embedded in more complicated models.
• General: The data generating distribution can be changed.
• Variational inference is fast; it lets us analyze large data sets.
• See Blei et al., 2003 for details and a quantitative comparison.
See my web-site for code and other papers.
• Jonathan Chang’s excellent R package “lda” contains Gibbs
sampling code for this model and many others.
24. Supervised topic models
• But LDA is an unsupervised model. How can we build a topic
model that is good at the task we care about?
• Many data are paired with response variables.
• User reviews paired with a number of stars
• Web pages paired with a number of “diggs”
• Documents paired with links to other documents
• Images paired with a category
• Supervised topic models are topic models of documents and
responses, fit to find topics predictive of the response.
25. Supervised LDA
(Graphical model: α → θd → Zd,n → Wd,n ← βk, with the per-document response Yd depending on Zd,1:N and the parameters η, σ²; plates N, D, K.)
1 Draw topic proportions θ | α ∼ Dir(α).
2 For each word
• Draw topic assignment zn | θ ∼ Mult(θ).
• Draw word wn | zn , β1:K ∼ Mult(βzn ).
3 Draw response variable y | z1:N, η, σ² ∼ N(η⊤z̄, σ²), where z̄ = (1/N) Σ_{n=1}^N zn.
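A small sketch of this generative process for one document, extending the LDA sketch above; here eta denotes the K-vector of regression coefficients, and parameter values are again illustrative.

import numpy as np

def generate_slda_document(beta, alpha, eta, sigma2, N=50, rng=None):
    """Sample one document and its real-valued response from sLDA (illustrative)."""
    rng = rng or np.random.default_rng()
    K, V = beta.shape

    theta = rng.dirichlet(np.full(K, alpha))              # topic proportions
    z = rng.choice(K, size=N, p=theta)                    # per-word assignments
    w = np.array([rng.choice(V, p=beta[k]) for k in z])   # words

    # z_bar = (1/N) sum_n z_n, where z_n is the one-hot assignment vector.
    z_bar = np.bincount(z, minlength=K) / N
    # Response y | z_{1:N} ~ N(eta^T z_bar, sigma^2).
    y = rng.normal(eta @ z_bar, np.sqrt(sigma2))
    return w, y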
26. Supervised LDA
• The response variable y is drawn after the document because it
depends on z1:N , an assumption of partial exchangeability.
• Consequently, y is conditioned on the words through their topic assignments.
• In a sense, this blends generative and discriminative modeling.
27. Supervised LDA
• Given a set of document-response pairs, fit the model
parameters by maximum likelihood.
• Given a new document, compute a prediction of its response.
• Both of these activities hinge on variational inference.
28. Variational inference (in general)
• Variational methods are a deterministic alternative to MCMC.
• Let x1:N be observations and z1:M be latent variables
• Our goal is to compute the posterior distribution
p(z1:M | x1:N) = p(z1:M, x1:N) / ∫ p(z1:M, x1:N) dz1:M
• For many interesting distributions, the marginal likelihood of the
observations is difficult to efficiently compute
29. Variational inference
• Use Jensen’s inequality to bound the log prob of the
observations:
log p(x1:N) = log ∫ p(z1:M, x1:N) dz1:M
            = log ∫ qν(z1:M) [ p(z1:M, x1:N) / qν(z1:M) ] dz1:M
            ≥ Eqν[ log p(z1:M, x1:N) ] − Eqν[ log qν(z1:M) ]
• We have introduced a distribution of the latent variables with free
variational parameters ν.
• We optimize those parameters to tighten this bound.
• This is the same as finding the member of the family qν that is
closest in KL divergence to p(z1:M | x1:N ).
30. Mean-field variational inference
• Factorization of qν determines complexity of optimization
• In mean field variational inference qν is fully factored
qν(z1:M) = ∏_{m=1}^M qνm(zm).
• The latent variables are independent.
• Each is governed by its own variational parameter νm .
• In the true posterior they can exhibit dependence
(often, this is what makes exact inference difficult).
31. MFVI and conditional exponential families
• Suppose the distribution of each latent variable conditional on all
other variables is in the exponential family:
p(zm | z−m, x) = hm(zm) exp{ gm(z−m, x)⊤ zm − am(gm(z−m, x)) }
• Assume qν is fully factorized, and each factor is in the same
exponential family as the corresponding conditional:
qνm(zm) = hm(zm) exp{ νm⊤ zm − am(νm) }
32. MFVI and conditional exponential families
• Variational inference is the following coordinate ascent algorithm
νm = Eqν [gm (Z−m , x)]
• Notice the relationship to Gibbs sampling.
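To make the coordinate-ascent update concrete, here is a toy example that is not from the talk: mean-field inference for a bivariate Gaussian, where each complete conditional is Gaussian, so each factor's mean is set to the conditional mean evaluated at the other factor's current expectation.

import numpy as np

# Target p(z1, z2) = N(mu, Lam^{-1}) with precision matrix Lam (toy values).
mu = np.array([1.0, -1.0])
Lam = np.array([[2.0, 0.8],
                [0.8, 1.5]])

m = np.zeros(2)   # variational means of the factors q(z1) q(z2)
for _ in range(50):
    # Update q(z1): conditional mean of z1 with z2 replaced by E_q[z2].
    m[0] = mu[0] - (Lam[0, 1] / Lam[0, 0]) * (m[1] - mu[1])
    # Update q(z2) symmetrically.
    m[1] = mu[1] - (Lam[1, 0] / Lam[1, 1]) * (m[0] - mu[0])

print(m)   # converges to the true mean (1.0, -1.0) in this example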
33. Variational inference
• Alternative to MCMC; replace sampling with optimization.
• Deterministic approximation to posterior distribution.
• Uses established optimization methods
(block coordinate ascent; Newton-Raphson; interior-point).
• Faster, more scalable than MCMC for large problems.
• Biased, whereas MCMC is not.
• Emerging as a useful framework for fully Bayesian and empirical
Bayesian inference problems. Many open issues!
• Good papers: Beal’s Ph.D. thesis, Wainwright and Jordan (2009)
34. Variational inference in sLDA
• In sLDA the variational bound is
E[log p(θ | α)] + Σ_{n=1}^N E[log p(Zn | θ)]
  + Σ_{n=1}^N E[log p(wn | Zn, β1:K)] + E[log p(y | Z1:N, η, σ²)] + H(q)
• As in Blei, Ng, and Jordan (2003), we use the fully-factorized
variational distribution
q(θ, z1:N | γ, φ1:N) = q(θ | γ) ∏_{n=1}^N q(zn | φn).
35. Variational inference in sLDA
• The distinguishing term is
E[log p(y | Z1:N, η, σ²)]
  = −(1/2) log(2πσ²) − ( y² − 2y η⊤E[Z̄] + η⊤E[Z̄Z̄⊤]η ) / (2σ²)
• The first expectation is E[Z̄] = φ̄ := (1/N) Σ_{n=1}^N φn.
• The second expectation is E[Z̄Z̄⊤] = (1/N²) ( Σ_{n=1}^N Σ_{m≠n} φn φm⊤ + Σ_{n=1}^N diag{φn} ).
• Linear in φn , which leads to an easy coordinate ascent algorithm.
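As a sketch, the two expectations can be computed from the variational parameters like this, assuming φ1:N is stored as an N × K array; this is an illustration under that assumption, not the authors' code.

import numpy as np

def zbar_moments(phi):
    """E[Z_bar] and E[Z_bar Z_bar^T] under q(z_1:N) = prod_n Mult(phi_n)."""
    N, K = phi.shape
    phi_bar = phi.mean(axis=0)                           # E[Z_bar]
    # sum_n sum_{m != n} phi_n phi_m^T = (sum_n phi_n)(sum_n phi_n)^T - sum_n phi_n phi_n^T
    s = phi.sum(axis=0)
    cross = np.outer(s, s) - phi.T @ phi
    second = (cross + np.diag(s)) / N**2                 # E[Z_bar Z_bar^T]
    return phi_bar, second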
36. Maximum likelihood estimation
• The M-step is an MLE under expected sufficient statistics.
• Define
• y = y1:D is the response vector.
• A is the D × K matrix whose rows are Z̄d⊤.
• The MLE of the coefficients solves the expected normal equations
  E[A⊤A] η = E[A]⊤ y  ⇒  η̂new ← (E[A⊤A])⁻¹ E[A]⊤ y
• The MLE of the variance is
  σ̂²new ← (1/D) { y⊤y − y⊤ E[A] (E[A⊤A])⁻¹ E[A]⊤ y }
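A sketch of these updates, assuming the expected sufficient statistics E[A] and E[A⊤A] have already been accumulated from the per-document variational parameters; the names and array layout are assumptions for illustration.

import numpy as np

def slda_mstep(y, EA, EAtA):
    """M-step for eta and sigma^2 from expected sufficient statistics.

    y    : (D,)   response vector
    EA   : (D, K) E[A]; row d is E[Z_bar_d] = phi_bar_d
    EAtA : (K, K) E[A^T A] = sum_d E[Z_bar_d Z_bar_d^T]
    """
    D = y.shape[0]
    EAty = EA.T @ y
    eta_new = np.linalg.solve(EAtA, EAty)                    # (E[A^T A])^{-1} E[A]^T y
    sigma2_new = (y @ y - y @ EA @ eta_new) / D
    return eta_new, sigma2_new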
37. Prediction
• We have fit SLDA parameters to a corpus, using variational EM.
• We have a new document w1:N with unknown response value.
• First, run variational inference in the unsupervised LDA model, to
obtain γ and φ1:N for the new document.
(LDA ⇔ integrating unobserved Y out of SLDA.)
• Predict y using the sLDA expected value:
  E[Y | w1:N, α, β1:K, η, σ²] ≈ η⊤ Eq[Z̄] = η⊤ φ̄.
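In code, the prediction step is just an inner product; continuing the sketches above, phi is the N × K array of variational parameters from unsupervised LDA inference on the new document and eta is the fitted coefficient vector.

phi_bar = phi.mean(axis=0)    # E_q[Z_bar] for the new document
y_hat = eta @ phi_bar         # E[Y | w_1:N] ~= eta^T phi_bar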
38. Example: Movie reviews
[Figure: the ten topics laid out along their estimated coefficients, roughly from −30 to +20. The most negative topics contain words like "least, problem, unfortunately, supposed, worse, flat, dull" and "bad, guys, watchable, its, not, one, movie"; the most positive contain "both, motion, simple, perfect, fascinating, power, complex" and "however, cinematography, screenplay, performances, pictures, effective, picture"; function-word topics such as "have, like, you, was, just, some, out" sit near zero.]
• 10-topic sLDA model on movie reviews (Pang and Lee, 2005).
• Response: Number of stars associated with each review
• Each component of coefficient vector η is associated with a topic.
40. Held out likelihood
[Figure: per-word held-out log likelihood (y-axis, roughly −6.42 to −6.37) as a function of the number of topics (x-axis, 5 to 50); sLDA is shown in red.]
41. Diverse response types with GLMs
• Want to work with response variables that don’t live in the reals.
• binary / multiclass classification
• count data
• waiting time
• Model the response with a generalized linear model
  p(y | ζ, δ) = h(y, δ) exp{ (ζy − A(ζ)) / δ },
  where ζ = η⊤z̄.
• Complicates inference, but allows for flexible modeling.
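For instance, one might swap the Gaussian for another exponential-family response while keeping ζ = η⊤z̄ as the natural parameter. A hedged sketch: the family names and link functions below are standard GLM choices used for illustration, not specific recommendations from the talk.

import numpy as np

def draw_glm_response(z_bar, eta, family="gaussian", sigma2=1.0, rng=None):
    """Draw y from a GLM with natural parameter zeta = eta^T z_bar (illustrative)."""
    rng = rng or np.random.default_rng()
    zeta = eta @ z_bar
    if family == "gaussian":     # identity link, dispersion sigma^2 (the sLDA case)
        return rng.normal(zeta, np.sqrt(sigma2))
    if family == "poisson":      # log link: mean = exp(zeta), for count data
        return rng.poisson(np.exp(zeta))
    if family == "bernoulli":    # logit link, for binary classification
        return rng.binomial(1, 1.0 / (1.0 + np.exp(-zeta)))
    raise ValueError(f"unknown family: {family}")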
42. Example: Multi-class classification
[Figure: image classification with multi-class sLDA on the LabelMe and UIUC-Sport datasets. Example images are shown with correct or incorrect class predictions (e.g., highway, inside city, street, coast, forest, tall building) and predicted annotations such as "car, sign, road" and "buildings, car, sidewalk". Average accuracy over all classes, based on 5 random train/test subsets, is plotted against the number of topics; multi-class sLDA and multi-class sLDA with annotations (the models proposed in the paper) are compared with Fei-Fei and Perona (2005) and Bosch et al. (2006). A large number of topics increases the possibility of overfitting, but also provides more latent features for building the classifier. In the case of multi-class sLDA with annotations, the same trained model can be used for image annotation.]
(sLDA for image classification, with Chong Wang)
43. Supervised topic models
• SLDA enables model-based regression where the predictor
“variable” is a text document.
• It can easily be used wherever LDA is used in an unsupervised
fashion (e.g., images, genes, music).
• SLDA is a supervised dimension-reduction technique, whereas
LDA performs unsupervised dimension reduction.
• LDA + regression compared to sLDA is like principal components
regression compared to partial least squares.
• Paper: Blei and McAuliffe, NIPS 2007.
44. Relational topic models
[Figure: a citation network of machine-learning papers. Nodes are documents with their abstracts (e.g., "Irrelevant features and the subset selection problem", "Utilizing prior concepts for learning", "The inductive learning problem", "Learning with many irrelevant features", "Evaluation and selection of biases in machine learning", "An evolutionary approach to learning in robots", "Improving tactical plans with genetic algorithms", "Using a genetic algorithm to learn strategies for collision avoidance and local navigation") and edges are citations between them.]
• Many data sets contain connected observations.
• For example:
• Citation networks of documents
• Hyperlinked networks of web-pages.
• Friend-connected social network profiles