1. Natural Language
Summarization of Text and
Videos using Topic Models
Pradipto Das
PhD Dissertation Defense
CSE Department, SUNY at Buffalo
Rohini K. Srihari, Professor and Committee Chair, CSE Dept., SUNY Buffalo
Sargur N. Srihari, Distinguished Professor, CSE Dept., SUNY Buffalo
Aidong Zhang, Professor and Chair, CSE Dept., SUNY Buffalo
Download this presentation from http://bit.ly/pdasthesispptx or http://bit.ly/pdasthesispptxpdf
Primary committee members
2. The Road Ahead (modulo presenter)
Introduction to LDA
Learning to Summarize using Coherence [NIPS Wkshp 2009]
Discovering Voter Preferences using Mixtures of Topic Models [AND Wkshp 2009]
Simultaneous Joint and Conditional Modeling of Documents Tagged from Two Perspectives [CIKM 2011]
A Thousand Frames in just a Few Words: Lingual Descriptions of Videos through Latent Topic Models and Sparse Object Stitching [CVPR 2013]
Translating Related Words to Videos and Back through Latent Topics [WSDM 2013]
Using Tag-Topic Models and Rhetorical Structure Trees to Generate Bulleted List Summaries [journal submission]
3. • Stay hungry
• Stay foolish
Steve Jobs: Stanford Commencement Speech, 2005
The answers are coming within the next 60-75 minutes... so...
there is great food, green tea and coffee at the back!
But if you stay hungry I will happily grab the leftovers!
4. Contributions of this thesis
Can we find topics from a corpus without human intervention? Can we use these topics to annotate documents and use annotations to organize, summarize and search text? Well, yes, LDA does that for us! That is so 2003!
Well, can LDA model documents tagged from at least two different viewpoints or perspectives? No!
Can we do that after reading this thesis? Yes we can!
Can we generate bulleted lists from multiple documents after reading this thesis? Yes we can!
Can we go further and translate videos into text and vice versa after reading this thesis? Yes we can!
Bottomline: We can explore our data, extrapolate from our data and use context to guide decisions about new information
6. • Unsupervised topic exploration using LDA
– Full text of the first 50 patents from uspto.gov retrieved with the search keyword "rocket" & full text of 50 scientific papers from the American Journal of Aerospace Engineering
– Vocabulary size: 10,102 words; total word count: 219,568

Theme 1 Theme 2 Theme 3 Theme 4 Theme 5
insulation fuel launch rocket system
composition matter mission assembly fuel
fiber A-B space nozzle engine
system engineer system surface combustion
sensor tower vehicle portion propulsion
fire magnetic earth ring pump
water electron orbit motor oxidize

Theme labels (in the order shown): topic from patent documents; topic from journal papers; topic from patent documents; topic from journal papers; topic from journal papers
Explore and extrapolate from context
7. Power of LDA: Language independence
[Table: topics over words (top half) and topics over a controlled vocabulary (bottom half), each column paired with its English translation; the original non-English topic words and per-word translation pairs were lost in extraction. Recoverable translations:
• Topic 1: Tsunami, earthquake, Chile, Pichilemu, gone, warning, news, city
• Topic 2: flight, Air, France, Brazil, A, 447, disappear, ocean, France
• Topic 3: China, Olympic, Beijing, Gore, function, stadium, games
• Topic 1 (controlled vocabulary): Tsunami, earthquake, city, local, UTC, Mayor
• Topic 2 (controlled vocabulary): Brazil, A, disappeared, search, flight, aircraft, ocean, ship, air, space
• Topic 3 (controlled vocabulary): China, Olympic, Gore, gold, Beijing, National]
8. How does LDA look at documents?
A boring view of Wikipedia
9. What about other perspectives?
Words forming other Wiki articles
Article specific content words
Words forming section titles
An exciting view of Wikipedia
10. Insulation, composition, fiber, system, sensor, fire, water
Fuel, matter, A-B, engineer, tower, magnetic, electron
Rocket, assembly, nozzle, surface, portion, ring, motor
Launch, mission, space, system, vehicle, earth, orbit
We are identifying the landscape from within the landscape – similar to finding the map of a maze from within the maze!
Explore and extrapolate from context
12. Success of LDA
• Fitting themes to an UNSEEN patent document on insulating a
rocket motor using basalt fibers, nanoclay compositions etc.
Theme 1 Theme 2 Theme 3 Theme 4 Theme 5
insulation fuel launch rocket system
composition matter mission assembly fuel
fiber A-B space nozzle engine
system engineer system surface combustion
sensor tower vehicle portion propulsion
fire magnetic earth ring pump
water electron orbit motor oxidize
“What is claimed is:
1. An insulation composition comprising: a polymer comprising at least one
of a nitrile butadiene rubber and polybenzimidazole fibers; basalt fibers
having a diameter that is at least 5 .mu.m
2. (lots more) …”
Topic from
patent
documents
Topic from
journal
papers
Topic from
patent
documents
Topic from
journal
papers
Topic from
journal
papers
14. Model Complexities (modulo presenter)
K-Means, GMM, Hierarchical Clustering
LDA: VB, LDA: Gibbs, Dynamic LDA, Hierarchical LDA, Markov LDA, Syntactic LDA, Suffix Tree LDA
MMLDA, Corr-LDA, TagLDA, Corr-METag2LDA, Corr-MMGLDA
Hair Loss
15. Why do we want to explore?
Master Yoda, how do I find wisdom
from so many things happening
around us?
Go to the center of the data and
find your wisdom you will
16. parkour perform traceur area flip footage jump park
urban run outdoor outdoors kid group pedestrian
playground
lobster burger dress celery Christmas wrap roll mix
tarragon steam season scratch stick live water lemon
garlic
floor parkour wall jump handrail locker contestant
school run interview block slide indoor perform build
tab duck
make dog sandwich man outdoors guy bench black
sit park white disgustingly toe cough feed rub
contest parody
Can you find your wisdom?
Corr-MMGLDA
17. Corr-MMGLDA
parkour perform traceur area flip footage jump park
urban run outdoor outdoors kid group pedestrian
lobster burger dress celery Christmas wrap roll mix
tarragon steam season scratch stick live water lemon
floor parkour wall jump handrail locker contestant
school run interview block slide indoor perform build
tab duck
make dog sandwich man outdoors guy bench black
sit park white disgustingly toe cough feed rub
contest parody
tutorial: man explains how to make lobster rolls from scratch
One guy is making sandwich outdoors
montage of guys free running up
a tree and through the woods
interview with parkour contestants
Kid does parkour around the park
Footage of group of performing parkour outdoors
A family holds a strange burger assembly
and wrapping contest at Christmas
Actual ground-truth synopses overlaid
Man performs parkour in various locations
Are these what you were thinking?
18. [Figure: points numbered 1–14]
• No ground truth label assignments are known
The Classical Partitioning Problem
19. [Figure: points numbered 1–14]
• Then, select the one with the lowest loss; for example the one
shown – blue = +1, red = -1
• But we don’t really have a good way to measure loss here!
Distance from or closeness
to a central point
The Classical Partitioning Problem
20. [Figure: points numbered 1–14]
• Then, select the one with the lowest loss; for example the one
shown – blue = +1, red = -1
• But we don’t really have a good way to measure loss here!
Distance from or closeness
to a central point
Let's sample one more point
21. The Ground Truth – Two “Topics”
The seven
virtues
The seven
vices
Assume, now, that we have some vocabulary V of English words
X is a set of positions and each element of X is labeled with an
element from V
22. If X is a multi-set of words (set of positions), then it has an inherent structure in it, e.g.:
• We no longer see: [the unpartitioned multi-set]
• We are used to: [words partitioned into documents]
Additional Partitioning: Documents
The seven
virtues
The seven
vices
23. Success behind LDA
Allocate as few topics as possible to a document
Allocate as few words as possible to each topic
Balancing Act: "I am Nik WalLenDA"
This checkerboard pattern has a significance – in general it is NP-Hard to figure out the correct pattern from limited samples, even for 2 topics
The topic ALLOCATION is controlled by the parameter of a DIRICHLET distribution governing a LATENT proportion of topics over each document
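The effect of that Dirichlet parameter can be seen directly by sampling; this is an illustrative NumPy sketch with made-up values, not a fitted model:

```python
# Sketch: a small Dirichlet concentration parameter yields sparse topic
# proportions (few topics per document); a large one spreads mass evenly.
import numpy as np

rng = np.random.default_rng(0)
K = 10  # number of topics

sparse = rng.dirichlet(np.full(K, 0.1), size=1000)   # alpha << 1
dense = rng.dirichlet(np.full(K, 10.0), size=1000)   # alpha >> 1

# Average fraction of mass on each sample's single largest topic:
print("alpha=0.1 : top-topic mass ~", round(sparse.max(axis=1).mean(), 2))
print("alpha=10.0: top-topic mass ~", round(dense.max(axis=1).mean(), 2))
```

With alpha well below 1 almost all of a document's proportion mass sits on one or two topics, which is the "allocate as few topics as possible" behaviour described above.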
24. Current Timeline Consequent Timeline
Event Categories: Accidents/Natural Disasters; Attacks (Criminal/Terrorist); Health &
Safety; Endangered Resources; Investigations (Criminal/Legal/Other)
Previously, long long time ago
25. Centers of an utterance – Entities serving to link that
utterance to other utterances in the current discourse
segment
Sparse Coherence Flows
[Barbara J. Grosz, Scott Weinstein, and Aravind K. Joshi. Centering: A framework for modeling the local coherence of discourse. Computational Linguistics, volume 21, pages 203–225, 1995]
a. Bob opened a new dealership last week. [Cf=Bob,
dealership; Cp=Bob; Cb=undefined]
b. John took a look at the Fords in his lot. [Cf=John, Fords;
Cp=John; Cb=Bob] {Retain}
c. He ended up buying one.
i. [Cf=John; Cp=John; Cb=John] {Smooth-Shift} OR
ii. [Cf=Bob; Cp=Bob; Cb=Bob] {Continue}
Previously, long long time ago
Center approximation = the (word, [Grammatical/Semantic] role) pair (GSR), e.g. (Bob, Subject), (John, Subject), (dealership, Noun)
Algorithmically
By inspection
For n+1 = 3 and case ii
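The transition typology from the cited Grosz, Weinstein & Joshi framework can be written down directly; `center_transition` is a hypothetical helper name for this sketch:

```python
# Sketch of the centering transition table: classify the move between
# utterances from the previous Cb, the current Cb, and the current Cp.
def center_transition(cb_prev, cb_cur, cp_cur):
    # An undefined previous Cb is conventionally treated as matching.
    same_cb = cb_prev is None or cb_prev == cb_cur
    if same_cb:
        return "Continue" if cb_cur == cp_cur else "Retain"
    return "Smooth-Shift" if cb_cur == cp_cur else "Rough-Shift"

# The dealership example from the slide:
print(center_transition(None, "Bob", "John"))    # (b) after (a): Retain
print(center_transition("Bob", "John", "John"))  # (c-i): Smooth-Shift
print(center_transition("Bob", "Bob", "Bob"))    # (c-ii): Continue
```

The three calls reproduce the {Retain}, {Smooth-Shift} and {Continue} labels annotated on utterances (b) and (c) above.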
26. Global (document/section level) focus
Problems with Centering Theory
a. The house appeared to have been burgled. [Cf=house ]
b. The door was ajar. [ Cb=house; Cf=door, house; Cp=door]
c. The furniture was in disarray. [ Cb=house; Cf=furniture,
house; Cp=furniture] {?}
Previously, long long time ago
For n+1 = 3
Utterances like these are the majority in most free text
documents [redundancy reduction]
In general, co-reference resolution is very HARD
27. An example summary sentence from folder D0906B-A of the TAC 2009-A timeline:
• “A fourth day of thrashing thunderstorms began to take a heavier toll on southern
California on Sunday with at least three deaths blamed on the rain, as flooding and
mudslides forced road closures and emergency crews carried out harrowing rescue
operations.”
The next two contextual sentences in the document of the previous sentence are:
• “In Elysian Park, just north of downtown, a 42-year-old homeless man was killed
and another injured when a mudslide swept away their makeshift encampment.”
• “Another man was killed on Pacific Coast Highway in Malibu when his sport utility
vehicle skidded into a mud patch and plunged into the Pacific Ocean.”
If the query is, “Describe the effects and responses to the heavy rainfall and mudslides
in Southern California,” observe the focus of attention on mudslides as subject in
the first two sentences in the table below:
Sentence-GSR grid for a sample summary document slice
Summarization using Coherence
Incorporating coherence this way does not necessarily
lead to the final summary being coherent
Coherence is best obtained in a post processing step
using the Traveling Salesman Problem
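The TSP-style post-processing step can be sketched with a greedy nearest-neighbour heuristic over a pairwise "incoherence" (distance) matrix; the matrix below is made up, and the thesis's actual solver may differ:

```python
# Sketch: coherence-driven sentence ordering as a Traveling Salesman
# Problem, solved greedily. dist[i][j] is a hypothetical incoherence
# cost of placing sentence j right after sentence i.
def order_sentences(dist):
    n = len(dist)
    tour, remaining = [0], set(range(1, n))
    while remaining:
        last = tour[-1]
        nxt = min(remaining, key=lambda j: dist[last][j])  # cheapest next hop
        tour.append(nxt)
        remaining.remove(nxt)
    return tour

# Toy distances: sentence 0 flows best into 2, then 1.
dist = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.4],
    [0.1, 0.4, 0.0],
]
print(order_sentences(dist))  # [0, 2, 1]
```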
28. measure project lady
tape indoor sew
marker pleat
highwaist zigzag
scissor card mark
teach cut fold stitch
pin woman skirt
machine fabric inside
scissors make leather
kilt man beltloop
sew woman fabric
make machine show
baby traditional loom
blouse outdoors
blanket quick
rectangle hood knit
indoor stitch scissors
pin cut iron studio
montage measure kid
penguin dad stuff
thread
One lady is doing sewing project indoors
Woman demonstrating different stitches using a
serger/sewing machine
dad sewing up stuffed penguin for kids Woman makes a bordered hem skirt
A pair of hands do a sewing project using a sewing machine
ground-truth synopses overlaid
But what we really want is this
29. ground-truth synopses overlaid
clock mechanism
repair computer tube
wash machine lapse
click desk mouse time
front wd40 pliers
reattach knob make
level video water
control person clip
part wire inside
indoor whirlpool man
gear machine guy
repair sew fan test
make replace grease
vintage motor box
indoor man tutorial
fuse bypass brush
wrench repairman
lubricate workshop
bottom remove screw
unscrew screwdriver
video wire
How to repair the water level control mechanism on a
Whirlpool washing machine
a man is repairing a whirlpool washer
how to remove blockage from
a washing machine pump
Woman demonstrates replacing a door hinge
on a dishwasher
A guy shows how to make
repairs on a microwave
How to fix a broken agitator on a Whirlpool
washing machine
A guy working on a vintage box
fan
And this
32. Roadmap
Introduction
to LDA
Discovering Voter Preferences Using
Mixtures of Topic Models [AND’09 Oral]
Learning to Summarize
Using Coherence [NIPS
09 Poster]
Core NLP
including summarization,
information extraction,
unsupervised grammar
induction, dependency parsing,
rhetorical parsing, sentiment
and polarity analysis…
Non-parametric Bayes
Applied Statistics
Exit 2
Exit 1
Uncharted territory –
proceed at your own risk
33. Why
When
Who
Where
TagLDA: More Observed Constraints
Domain knowledge
Topic
distribution
over words
Annotation/
Tag
distribution
over words
Is there a model which
can take additional clues
and attempt to correct
the misclassifications?
34. Why
When
Who
Where
Domain knowledge
Incorporating Prior Knowledge
Topic
distribution
over words
but
conditioned
over tags
Number of
parameters
= (K+T)V
TagLDA
switches to
this view for
partial
normalization
of some
weights
- x5 and x10 are annotated with the orange label and x5 co-occurs with x9 in both documents d1 and d2
- It is thus likely that x5, x9 and x10 belong to the same class, since both d1 and d2 should contain as few topics as possible
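The partially normalized log-linear word distribution that gives TagLDA its (K+T)V parameter count can be sketched as follows; the weight matrices here are random placeholders, not learned values:

```python
# Sketch of a TagLDA-style word distribution: topic weights and tag
# weights combine log-linearly and are normalized over the vocabulary,
# so only (K + T) * V parameters are needed rather than K * T * V.
import numpy as np

rng = np.random.default_rng(0)
K, T, V = 3, 2, 5                 # topics, tags, vocabulary size
beta = rng.normal(size=(K, V))    # topic-word weights (placeholder)
pi = rng.normal(size=(T, V))      # tag-word weights (placeholder)

def p_word_given_topic_tag(k, t):
    logits = beta[k] + pi[t]
    e = np.exp(logits - logits.max())   # stable softmax
    return e / e.sum()                  # normalize over the vocabulary

p = p_word_given_topic_tag(0, 1)
print(p.round(3))
```

Note how the same tag weights are shared across every topic, which is exactly the parameter-sharing trick behind the (K+T)V count stated above.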
40. News Article
What if the documents
are plain text files?
Understanding the Two Perspectives
41. It is believed US investigators have asked
for, but have been so far refused access to,
evidence accumulated by German
prosecutors probing allegations that former
GM director, Mr. Lopez, stole industrial
secrets from the US group and took them
with him when he joined VW last year.
This investigation was launched by US
President Bill Clinton and is in principle a far
more simple or at least more single-minded
pursuit than that of Ms. Holland.
Dorothea Holland, until four months ago
was the only prosecuting lawyer on the
German case.
News Article
Imagine browsing over many reports on an event
Understanding the Two Perspectives
42. It is believed US investigators have asked for, but have been so far refused access to, evidence accumulated by German prosecutors probing allegations that former GM director, Mr. Lopez, stole industrial secrets from the US group and took them with him when he joined VW last year.
This investigation was launched by US President Bill Clinton and is in principle a far more simple or at least more single-minded pursuit than that of Ms. Holland.
Dorothea Holland, until four months ago was the only prosecuting lawyer on the German case.
News Article
The “document level”
perspective
What words can we remember after a first browse?
German, US,
investigations,
GM, Dorothea
Holland, Lopez,
prosecute
Understanding the Two Perspectives
43. Important Verbs
and Dependents
Named Entities
What helped us remember?
ORGANIZATION
It is believed US investigators have asked
for, but have been so far refused access to,
evidence accumulated by German
prosecutors probing allegations that former
GM director, Mr. Lopez, stole industrial
secrets from the US group and took them
with him when he joined VW last year.
This investigation was launched by US
President Bill Clinton and is in principle a far
more simple or at least more single-minded
pursuit than that of Ms. Holland.
Dorothea Holland, until four months ago
was the only prosecuting lawyer on the
German case.
News Article
LOCATION
MISC
PERSON
WHAT
HAPPENED?
The “word level”
perspective
The “document level”
perspective
German, US,
investigations,
GM, Dorothea
Holland, Lopez,
prosecute
Understanding the Two Perspectives
44. Summarization power of the perspectives
It is believed US investigators have asked
for, but have been so far refused access to,
evidence accumulated by German
prosecutors probing allegations that former
GM director, Mr. Lopez, stole industrial
secrets from the US group and took them
with him when he joined VW last year.
This investigation was launched by US
President Bill Clinton and is in principle a far
more simple or at least more single-minded
pursuit than that of Ms. Holland
Dorothea Holland, until four months ago
was the only prosecuting lawyer on the
German case.
German, US,
investigations,
GM, Dorothea
Holland, Lopez,
prosecute
Sentence Boundaries
What if we turn the document off?
Begin Middle End
45. A young man climbs an artificial rock wall indoors
Adjective modifier
(What kind of wall?)
Direct Object
Direct
Subject
Adverb modifier
(climbing where?)
Major Topic: Rock climbing
Sub-topics: artificial rock wall, indoor rock climbing gym
And as if that wasn’t enough!
46. Categories: Weather hazards to aircraft | Accidents involving fog | Snow or ice weather phenomena | Fog | Psychrometrics (labeled by human editors)
Beginning Middle End
A Wikipedia Article on "fog"
47. Take the first category label – “weather hazards to aircraft”
“aircraft” doesn’t occur in the document body!
“hazard” only appears in a section title read as “Visibility
hazards”
“Weather” appears only 6 out of 15 times in the main body
However, the images suggest that fog is related to concepts like
fog over the Golden Gate bridge, fog in streets, poor visibility
and quality of air
Wiki categories: Abstract or specific?
Categories (labeled by human editors): Weather hazards to aircraft | Accidents involving fog | Snow or ice weather phenomena | Fog | Psychrometrics
Categories (labeled by a Tag2LDA model from title and image captions): fog, San Francisco, visible, high, temperature, streets, Bay, lake, California, bridge, air
50. Topic ALLOCATION is controlled by the parameter of a DIRICHLET distribution governing a LATENT proportion of topics over each document
Bi-Perspective Topic Model – METag2LDA
"I am Nik WalLenDA" – and this balancing act got a whole lot tougher
61. Mean Field Optimization
Very similar to finding the basic feasible solution
(BFS) in linear programming
• Start with pivot at the origin (only slack variables
as solution)
• Cycle the pivot through the extreme points i.e.
replace slacks in BFS until solution is found
62. Mean Field Optimization
However, mean field optimization space is
inherently non-convex over the set of tractable
distributions due to the delta functions which match
the extreme points of the convex hull of sufficient
statistics of the original discrete distributions
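That non-convexity can be seen even on a toy model; this two-variable sketch is purely illustrative and is not the thesis's actual optimization problem:

```python
# Sketch: mean-field coordinate ascent on a toy model
# p(s1, s2) ∝ exp(J * s1 * s2) with s ∈ {-1, +1}. The factorized
# updates m_i <- tanh(J * m_j) (m = mean of s under q) have multiple
# fixed points for large J, so different initializations reach
# different local optima: the objective is non-convex.
import math

def mean_field(J, m1, m2, iters=200):
    for _ in range(iters):
        m1 = math.tanh(J * m2)
        m2 = math.tanh(J * m1)
    return m1, m2

print(mean_field(2.0, 0.1, 0.1))    # converges near (+0.96, +0.96)
print(mean_field(2.0, -0.1, -0.1))  # converges near (-0.96, -0.96)
```

The two runs land at symmetric fixed points depending only on the sign of the initialization, mirroring the extreme-point-matching behaviour described above.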
67. Topics conditioned on different section identifiers
(WL tag categories)
Topic Marginals
Topics
over
image
captions
Correspondence
of DL tag words
with content
words
Topic Labeling
Faceted Bi-Perspective Document Organization
All of the inference machinery *is needed*
to generate exploratory outputs like this!
68. • METag2LDA: A topic generating all DL tags in a document does not necessarily mean that the same topic generates all words in the document
• Corr-METag2LDA: A topic generating *all* DL tags in a document does mean that the same topic generates all words in the document – a considerably stronger assumption
Topic concentration parameter
Document-specific topic proportions
Document content words
Document Level (DL) tags
Word Level (WL) tags
Indicator variables
Topic parameters
Tag parameters
Corr-METag2LDA
METag2LDA
The Family of Tag2LDA Models
69. Experiments
Wikipedia articles with images and captions manually
collected along {food, animal, countries, sport, war,
transportation, nature, weapon, universe and ethnic
groups} concepts
Annotations/tags used:
DL Tags – image caption words and the article titles
WL Annotations – Positions of sections binned into 5
bins
Objective: to generate category labels for test documents
Evaluation
– ELBO: to see performance among various TagLDA models
– WordNet based similarity evaluation between actual category
labels and proxies for them from caption words
70. Held-out ELBO
Selected Wikipedia Articles
WL annotations – Section positions in the document
DL tags – image caption words and article titles
TagLDA perplexity is comparable to MM(METag2)LDA
The (image caption words + article titles) and the content words
are independently discriminative enough
Corr-MM(METag2)LDA performs best since almost all image caption
words and the article title for a Wikipedia document are about a
specific topic
[Chart: held-out ELBO (millions) at K = 20, 50, 100, 200 for MMLDA, TagLDA, corrLDA, METag2LDA, corrMETag2LDA]
71. [Chart: held-out ELBO (millions) at K = 40, 60, 80, 100 for MMLDA, METag2LDA, corrLDA, corrMETag2LDA, TagLDA]
Held-out ELBO
DUC05 Newswire Dataset (Recent Experiments with TagLDA Included)
WL annotations – Named Entities
DL tags – abstract coherence tuples like (subject, object), e.g. "Mary(Subject) taught the class. Everybody liked Mary(Object)." [Ignoring coref resolution]
Abstract markers like (“subj” “obj”) acting as DL perspective are not document
discriminative or even topical markers
Rather they indicate a semantic perspective of coherence which is intricately linked
to words
Ignoring the DL perspective completely leads to a better fit by TagLDA due to variations in word distributions only
[Chart: held-out ELBO (millions) at K = 40, 60, 80, 100 for MMLDA, METag2LDA, corrLDA, corrMETag2LDA]
72. Are Categories more abstract or specific?
Inverse Hop distance in WordNet ontology
Top 5 words from the caption vocabulary are chosen
Max Weighted Average = 5, Max Best = 1
METag2LDA almost always wins by narrow margins
METag2LDA reweights the vocabulary of caption words and article titles that are about a
topic and hence may miss specializations relevant to document within the top (5) ones
In WordNet ontology, specializations lead to more hop distance
Ontology-based scoring helps explain connections of caption words to ground truths, e.g. skateboard, skate, glide, snowboard
[Chart: WordNet-based average and best inverse hop distances at K = 20, 50, 100, 200 for METag2LDA and corrMETag2LDA]
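The inverse-hop-distance scoring can be sketched on a tiny hand-made hypernym tree standing in for the real WordNet ontology; all node names below are hypothetical:

```python
# Sketch of WordNet-style scoring: similarity as the inverse of the hop
# distance between two nodes in a small made-up hypernym tree.
parent = {                      # hypothetical mini-ontology
    "skateboard": "board_sport",
    "snowboard": "board_sport",
    "board_sport": "sport",
    "skate": "sport",
    "glide": "motion",
    "sport": "activity",
    "motion": "activity",
}

def ancestors(w):
    path = [w]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def inverse_hop_distance(a, b):
    pa, pb = ancestors(a), ancestors(b)
    common = next(x for x in pa if x in pb)     # lowest common ancestor
    hops = pa.index(common) + pb.index(common)
    return 1.0 if hops == 0 else 1.0 / hops

print(inverse_hop_distance("skateboard", "snowboard"))  # 2 hops -> 0.5
print(inverse_hop_distance("skateboard", "glide"))      # 5 hops -> 0.2
```

More-specialized pairs sit deeper in the tree and therefore score lower, which is the "specializations lead to more hop distance" effect noted above.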
73. • Applications
– Document classification using reduced dimensions
– Find faceted topics automatically through word level tags
– Learn correspondences between perspectives
– Label topics through document level multimedia
– Create recommendations based on perspectives
– Video analysis: word prediction given video features
– Tying “multilingual comparable corpora” through topics
– Multi-document summarization using coherence
– E-Textbook aided discussion forum mining:
• Explore topics through the lens of students and teachers
• Label topics from posts through concepts in the e-textbook
Model Usefulness and Applications
74. Roadmap
Introduction
to LDA
Discovering Voter Preferences Using
Mixtures of Topic Models [AND’09 Oral]
Learning to Summarize
Using Coherence [NIPS
09 Poster]
Core NLP including
summarization, information
extraction, unsupervised
grammar
induction, dependency
parsing, rhetorical
parsing, sentiment and
polarity analysis…
Non-parametric Bayes
Computer Vision and Applications
– Core Technologies
Applied Statistics
Supervised
Learning, Structured
Prediction
Simultaneous Joint and
Conditional Modeling of
Documents Tagged from Two
Perspectives [CIKM 2011 Oral]
76. Previously
Words
forming
other Wiki
articles
Article specific content words
Caption corresponding to the
embedded multimedia
[P. Das, R. K. Srihari and Y. Fu. "Simultaneous Joint and Conditional Modeling of Documents Tagged from Two Perspectives," CIKM, Glasgow, Scotland, 2011]
77. Afterwards
Words
forming
other Wiki
articles
Article specific content words
Caption corresponding to the
embedded multimedia
[P. Das, R. K. Srihari and J. J. Corso. "Translating Related Words to Videos and Back through Latent Topics," WSDM, Rome, Italy, 2013]
78. Expensive frame-wise manual annotation efforts by drawing bounding boxes
Difficulties: camera shakes, camera motion, zooming
Careful consideration needed as to which objects/concepts to annotate
Focus on object/concept detection – noisy for videos in-the-wild
Does not answer which objects/concepts are important for summary generation
Man with
microphone
Climbing
person
Annotations for training object/concept models
Trained Models
Information Extraction from Videos
79. Learning latent translation spaces a.k.a. topics
Human Synopsis: "A young man is climbing an artificial rock wall indoors"
Mixed membership of latent topics
Some topics capture observations that co-occur commonly
Other topics allow for discrimination
Different topics can be responsible for different modalities
No annotations needed – only need clip-level summary
Translating across modalities
MMGLDA model
80. Translating across modalities
Using learnt translation spaces for prediction
Text Translation:
p(w_v | w_d^O, w_d^H) ∝ ( Σ_{o=1..O} Σ_{i=1..K} r_{d,o,i}^O p(w_v | i) ) × ( Σ_{h=1..H} Σ_{i=1..K} r_{d,h,i}^H p(w_v | i) )
Topics are marginalized out to permute the vocabulary for predictions
The lower the correlation among topics, the better the permutation
Sensitive to priors for real-valued data
MMGLDA model
81. Translating across modalities
Use learnt translation spaces for prediction
Text Translation:
p(w_v | w_d^O, w_d^H) ∝ ( Σ_{o=1..O} Σ_{i=1..K} r_{d,o,i}^O p(w_v | i) ) × ( Σ_{h=1..H} Σ_{i=1..K} r_{d,h,i}^H p(w_v | i) )
Topics are marginalized out to permute the vocabulary for predictions
The lower the correlation among topics, the better the permutation
Sensitive to priors for real-valued data
r_{d,o,i}^O: responsibility of topic i over real-valued observations
r_{d,h,i}^H: responsibility of topic i over discrete video features
p(w_v | i): probability of learnt topic i explaining words in the text vocabulary
MMGLDA model
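The marginalize-topics-then-rank-the-vocabulary step can be sketched as follows; the responsibilities, topic-word probabilities, and vocabulary here are all made-up illustrative values, not fitted MMGLDA posteriors:

```python
# Sketch of cross-modal text prediction: each observed video feature
# contributes its topic responsibilities, and each topic votes for
# vocabulary words via p(word | topic). All numbers are illustrative.
import numpy as np

vocab = ["climb", "wall", "lobster", "roll"]   # hypothetical vocabulary
p_w_given_topic = np.array([
    [0.45, 0.45, 0.05, 0.05],   # topic 0: climbing-flavoured
    [0.05, 0.05, 0.45, 0.45],   # topic 1: cooking-flavoured
])

# Hypothetical topic responsibilities for two observed video features:
resp = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
])

scores = resp.sum(axis=0) @ p_w_given_topic    # marginalize topics out
ranking = [vocab[i] for i in np.argsort(scores)[::-1]]
print(ranking)   # the climbing words outrank the cooking words here
```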
82. • We first formulated the MMGLDA model just
two rooms left of where I am standing now!
An aside
83. 1. There is a guy climbing on a rock-climbing wall.
Multiple Human Summaries: (Max 10 words i.e. imposing a length constraint)
2. A man is bouldering at an indoor rock climbing gym.
3. Someone doing indoor rock climbing.
4. A person is practicing indoor rock climbing.
5. A man is doing artificial rock climbing.
To understand: do we speak all that we see?
84. 1. There is a guy climbing on a rock-climbing wall.
Multiple Human Summaries: (Max 10 words for imposing a length constraint)
Hand holding
climbing
surface
How many
rocks?
The sketch in
the board
Wrist-watch
What’s there
in the back?
Color of the
floor/wall
Dress of the
climber
Not so
important!
2. A man is bouldering at an indoor rock climbing gym.
Empty slots
3. Someone doing indoor rock climbing.
4. A person is practicing indoor rock climbing.
5. A man is doing artificial rock climbing.
Summaries point toward information needs!
Center of Attentions: Central Objects and Actions
86. Evaluation: Held out ELBOs
In a purely multinomial MMLDA model, failures of independent
events contribute highly negative terms to the log likelihoods
NOT a measure of keyword summary generation power
Test ELBOs on events 1-5 in the
Dev-T set
Prediction ELBOs on events
1-5 in the Dev-T set
87. Skateboarding
Feeding
animals
Landing fishes
Wedding
ceremony
Woodworking
project
Multimedia
Topic Model
– permute
event specific
vocabularies
Bag of words
multi-document
summaries
Sub-events e.g. skateboarding, snowboarding, surfing
Multiple sets of
documents (sets of
frames in videos)
Natural language
multi-document
summaries
Multiple sentences (group of
segments in frames)
A c-SVM classifier from the libSVM package is used with default settings for multiclass (15 classes) classification
55% test accuracy easily achievable
(completely off-the-shelf)
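That off-the-shelf classification step can be sketched with scikit-learn's SVC (which wraps libSVM); the clustered random features below are hypothetical placeholders for the real video features:

```python
# Sketch: default-settings c-SVM multiclass classification (SVC wraps
# libSVM). Features are random placeholders: each of the 15 "event"
# classes clusters around its own mean.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_classes, n_per_class, dim = 15, 20, 10

X = np.concatenate([rng.normal(loc=c, scale=0.3, size=(n_per_class, dim))
                    for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), n_per_class)

clf = SVC()          # c-SVM with default settings, one-vs-one multiclass
clf.fit(X, y)
print("train accuracy:", clf.score(X, y))
```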
Evaluate using ROUGE-1
HEXTAC 2009: 100-word human references vs. 100-word manually extracted summaries
Average Recall: 0.37916 (95%-confidence interval 0.37187 - 0.38661)
Average Precision: 0.39142 (95%-confidence interval 0.38342 - 0.39923)
Event Classification and Summarization
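The ROUGE-1 numbers above are unigram recall and precision; a simplified single-reference sketch (real ROUGE handles multiple references, stemming and stopword options):

```python
# Sketch of ROUGE-1: clipped unigram overlap between a candidate
# summary and a reference summary.
from collections import Counter

def rouge_1(candidate, reference):
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum(min(c[w], r[w]) for w in c)   # clipped counts
    recall = overlap / sum(r.values())
    precision = overlap / sum(c.values())
    return recall, precision

rec, prec = rouge_1("man climbs rock wall indoors",
                    "a young man climbs an artificial rock wall indoors")
print(round(rec, 3), round(prec, 3))  # 5/9 ≈ 0.556, 5/5 = 1.0
```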
88. Skateboarding
Feeding
animals
Landing fishes
Wedding
ceremony
Woodworking
project
Multimedia
Topic Model
– permute
event specific
vocabularies
Bag of words
multi-document
summaries
Sub-events e.g. skateboarding, snowboarding, surfing
Multiple sets of
documents (sets of
frames in videos)
Natural language
multi-document
summaries
Multiple sentences (group of
segments in frames)
A c-SVM classifier from the libSVM package is used with default settings for multiclass (15 classes) classification
55% test accuracy easily achievable
(completely off-the-shelf)
Event Classification and Summarization
Evaluate using ROUGE-1
HEXTAC 2009: 100-word human references vs. 100-word manually extracted summaries
Average Recall: 0.37916 (95%-confidence interval 0.37187 - 0.38661)
Average Precision: 0.39142 (95%-confidence interval 0.38342 - 0.39923)
If we can achieve 10% of this
for 10 word summaries, we
are doing pretty good!
Caveat – Text multi-document
summarization task is much
more complex
89. MMLDA can show poor ELBO – a bit misleading
Performs quite well on predicting summary-worthy keywords
Sum-normalizing the real-valued data to lie in [0,1]^P distorts reality for Corr-MMGLDA w.r.t. quantitative evaluation
Summary worthiness of predicted keywords is not good but topics are good
MMGLDA produces better topics and higher ELBO
Summary worthiness of keywords almost same as MMLDA for lower n
Evaluation: ROUGE-1 Performance
90. • Simply predicting more and more keywords (or creating sentences out of them) does not improve the relevancy of the generated summaries
• Instead, selecting sentences from the training set in an intuitive way almost doubles the relevancy of the lingual descriptions
Improving ROUGE-1/2 performance
91. YouCook, iAnalyze
ROUGE scores for the “YouCook” dataset [Corso et al.]:

                    Das et al. WSDM 2013    Das et al. CVPR 2013
Precision 2-gram    0.006                   5.14
Precision 1-gram    15.47                   25.76
Recall 2-gram       0.006                   6.49
Recall 1-gram       19.02                   32.87
92. Roadmap
Introduction to LDA
Discovering Voter Preferences Using Mixtures of Topic Models [AND’09 Oral]
Learning to Summarize Using Coherence [NIPS 09 Poster]
Non-parametric Bayes
Computer Vision and Applications – Core Technologies
Translating Related Words to Videos and Back through Latent Topics [WSDM 2013 Oral]
Applied Statistics
Supervised Learning, Structured Prediction
Simultaneous Joint and Conditional Modeling of Documents Tagged from Two Perspectives [CIKM 2011 Oral]
Core NLP including summarization, information extraction, unsupervised grammar induction, dependency parsing, rhetorical parsing, sentiment and polarity analysis…
Using Tag-Topic Models and Rhetorical Structure Trees to Generate Bulleted List Summaries [to be submitted to TOIS]
Linear, Quadratic and Conic Programming Variants
A Thousand Frames in just a Few Words: Lingual Descriptions of Videos through Latent Topic Models and Sparse Object Stitching [CVPR 2013 Spotlight]
93. Just one last thing…
• We want to analyze documents not only for topic discovery but also for turning these
94. Just one last thing…
• into this
A previous study on sleep deprivation that less sleep resulted in impaired glucose metabolism.
Women who slept less than or equal to 5 hours a night were twice as likely to suffer from hypertension than women. [*]
Children ages 3 to 5 years get 11-13 hours of sleep per night.
Chronic sleep deprivation can do more it can also stress your heart.
Sleeping less than eight hours at night, frequent nightmares and difficulty initiating sleep were significantly associated with drinking.
A single night of sleep deprivation can limit the consolidation of memory the next day.
Women’s health is much more at risk. [*]
[*] means that the sentences belong to the same document
95. Just one last thing…
• using these
[Diagram: summarization pipeline] Document sets or “Docsets” (Accidents and Natural Disasters; Attacks; Health and Safety; Endangered Resources; Investigations and Trials) feed a Global Tag-Topic Model trained using documents, alongside per-Docset Local Models over documents and sentences. Sentences from the Docsets are fitted to the learnt model, and each candidate summary sentence for a Docset is weighted from the local and global models.
96. Just one last thing…
• and these
[Diagrams: example Rhetorical Structure Theory (RST) trees, with Nucleus/Satellite leaf spans and relations such as Attribution, Cause, Elaboration, Joint, Explanation and Contrast, over sentences like:]
• “The National Sleep Foundation reported in 2006 that only 20 percent of adolescents get the recommended nine hours of sleep; distractions such as computers or video games in kids’ bedrooms may lessen sleep quality.”
• “Sleep-deprived teens crash just about anywhere because they’re nocturnal and need more than eight hours of sleep per day.”
• “Generations have praised the wisdom of getting up early in the morning, but a Japanese study says early-risers are actually at a higher risk of developing heart problems.”
• “Fortunately for sleepy women, a Penn State College of Medicine study found that they’re much better than men at enduring sleep deprivation, possibly because of ‘profound demands of infant and child care’ placed on them for most of mankind’s history.”
99. • We want to analyze documents not only for topic discovery but also for turning these
• into this
• using these
• and these
• with scores like these
• and these
The final song: Recap
100. The ending…
Interviewer: Do you agree with President Obama’s approach towards Libya?
Presidential Candidate: [Libya??] I just wanted to make sure we're talking about the same thing before I say, 'Yes, I agreed' or 'No I didn't agree.' I do not agree with the way he handled it for the following reason -- nope, that's a different one. I got all this stuff twirling around in my head
• So that we can always have the right information at our fingertips
101. Summary
• Topic models can now talk to structured prediction models
• Efficient text summarization/translation of domain-specific videos is now possible
• With multi-document summarization systems which exploit meaning in text, we are getting closer to our ultimate dream:
– Construct an artificial assistant who can
• Summarize a task using contextual exploratory analysis tools as well as deep NLP and
• Make decisions for us!
102. Future Directions
• Core Algorithms
– Non-parametric Tag2LDA family models
– Address sparsity in tags and scaling of real-valued variables in mixed-domain topic models
– Efficient inference with more structure among hidden variables
• Applications
– Type in text and get an object detector [borrowed from VPML]
– Intention analysis of videographers in social networks and the evolution of intentions over time
– Large-scale visualization using rhetorics and topic analysis
– Large-scale multimedia multi-document summarization
We are shaping a problem space. Each node is a problem and each peak represents a possible solution to that problem. Each problem has associated with it several smaller problems which need to be solved along the way, giving rise to the mountainous terrain. We actually do not see this landscape beforehand and shape it as we move forward. A PhD candidate has to go from one peak to another to get a view of the entire landscape, from where the candidate can put the landscape created by other luminaries in the field in perspective. So this is my long journey, and I did not want to get stuck on one peak only and explore low-lying hills (similar to writing one paper and then merely extending it).
To create a landscape, we need tools to make the roads and clear away obstacles. But once done, it allows other researchers and practitioners to make use of the road infrastructure to build communities and businesses, if the peaks are interesting enough to attract visitors, and, of course, to go from one place to another with ease.
So, let’s get started… the stories of my journey will need some time to be told… and…
The answers are coming in the next one and half hours
A very recent talk by David Blei, who is considered to be the father of topic modeling research, also listed the importance of the problem we tackled as one of the open problems
What do topic models do? For sure, they can identify signature words from a corpus of documents in a data-driven way. Also, you can figure out which of these topics belong to which classes of documents if you have that information. And people really wanted this for a long time!
These models are language-agnostic (multilingual capability). Imagine automatically producing a larger font on some important words in an HTML document – easily done, not just from the words alone but also justified through their coherence properties.
From just counts to richness
Each node is labeled with a word, and each hill brings related nodes together, with the closeness reflected in the lengths of the edges between them. Imagine all such points lying on a flat piece of paper on a uniform 2D grid of equal-length edges. Our job is to re-arrange the nodes, connect them according to their closeness, and create the triangulations so that we can discover the landscape shown here. And we have to do this without the model ever having any idea of the 3D landscape. This brings us to an important question…
+ Success of LDA
+ Almost 660 citations/year!
+ Really widely extended and applied in different contexts
But the success of LDA has really been in its generalization performance when fitting unseen documents to the trained topic space – much better generalization performance than PLSA or LSA. LDA can find a basis for distributions over topics, unlike SVD, which assumes one topic per document or computes a span over the topic vectors. Models improved and they became more and more complex…
+ Comparison of model complexities
+ Y-axis = HL axis, X-axis = model complexity
HL = Hair Loss axis. All of these models address the common problem of looking at central tendencies of data.
Why do we want to explore? We want to explore because we seek wisdom from everything that is happening around us. But where to start? Well, as Yoda points out, we can start at the centers of the data.
Your: Each one of us has our own model of wisdom that gets shaped through our personal exploration of the world around us. Each one of us assumes that there is some hypothesis which gives rise to the data around us.
Centers of data: The big data problem – lots of data around us, but which ones are meaningful? We need statistics from the data that meaningfully encode multiple views, i.e. modalities. Sufficient statistics (i.e. the function of a sample that encodes all information about the sample) usually represent the centers of the data.
+ Let’s start at the central tendencies…
+ We want to go beyond words to full clips to visualize topics!
We have devices which continuously capture data, and we seek wisdom from such large amounts of data. Wisdom is really about looking at a set of representative examples (centers). Wisdom encodes variance in information compactly and completely, and this improves decision making.
What do the centers look like? These are actual outputs from one of our models: the ground-truth synopses overlaid on the multimedia topics obtained from training data.
Assume each data point has an associated binary labelBut we have no training data which is representative of the classes----------------------------------------------
With labels, we can optimize a loss function (similar to interpolation and extrapolation). But we do not have labels, and so we need to make assumptions about some function of the data only which summarizes all observations and how the observations vary from that summary, i.e. find the location and scale estimates as best as possible. Let's choose the algorithm to be K-means, which yields a simple hypothesis set: assign x to \arg\min_{i \in \{0,1\}} d(x, \mu_i).
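That nearest-center decision rule can be sketched as a minimal K-means; the 1-D points below are invented toy data forming two obvious clusters, not the slide's example.

```python
# Minimal K-means sketch (K=2) of the nearest-center rule in the notes:
# each x is assigned to argmin_i d(x, mu_i), then centers are re-estimated.
# The 1-D points here are invented toy data.
import random

def kmeans(points, k=2, iters=20, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)      # initialize from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:
            i = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            clusters[i].append(x)
        # Re-estimate each center as its cluster mean (keep old if empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

pts = [0.1, 0.2, 0.3, 5.0, 5.2, 5.4]
centers = sorted(kmeans(pts))
print(centers)
```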
Let's sample one more point from the ground-truth blue class and see if K-means made the correct decisions based on limited samples but without any ground-truth knowledge.
Clearly there are two misclassifications
We can have additional observed constraints on X: e.g. they are structured into books as a collection of sections which can focus on an idea in a coherent fashion. These structures give rise to co-occurrence, which has been exploited before in IR for thesaurus construction. The better the structure, the better the read – look at the Egyptian man – that man's hair grew white just by scrolling through the scrolls. Rodin's thinker replies to Dr. Corso's Twitter comment on our CVPR paper with "#pow is in #doing".
There is an inherent partitioning of the linear position space of all words. This partitioning is the result of some sort of authorship (LDA with many authors = author-topic model).
The success behind LDA is really about a balancing act. It is not easy to balance perfectly: x_9 and x_10 can be misclassified, since LDA may want to allocate as few topics to d_2 as possible and so chooses the red topic. Well, at least now we know why NikWalLenDA can rope-walk so easily.
Summarization problem (see TAC competitions from NIST)
+ Earlier research on discourse analysis was mainly used for co-reference resolution
+ Has some really intriguing ideas!
For a sequence of utterances to be a discourse, it must exhibit coherence. If we denote U_n and U_{n+1} to be adjacent utterances, the backward-looking center of U_n, denoted C_b(U_n), represents the entity currently being focused on in the discourse after U_n is interpreted. The forward-looking centers of U_n, denoted C_f(U_n), form an ordered list containing the entities mentioned in U_n, all of which can serve as C_b(U_{n+1}). In general, however, C_b(U_{n+1}) is the most highly ranked element of C_f(U_n) mentioned in U_{n+1}. The C_b of the first utterance in a discourse is undefined. Brennan et al. use the following ordering: Subject > Existential predicate nominal > Object > Indirect object or oblique > Demarcated adverbial PP
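The Brennan et al. ordering can be sketched as a simple rank table over grammatical roles; the entity mentions and roles below are a hypothetical parser output invented for illustration.

```python
# Sketch of ranking the forward-looking centers Cf(U_n) by the
# Brennan et al. grammatical-role ordering quoted in the notes.
# The (entity, role) mentions are hypothetical parser output.
RANK = {"subject": 0, "existential": 1, "object": 2,
        "indirect": 3, "adverbial": 4}

def rank_cf(mentions):
    """mentions: list of (entity, grammatical_role) pairs from one utterance."""
    return [e for e, role in sorted(mentions, key=lambda m: RANK[m[1]])]

# U_n: "John gave Mary a book in the garden." (invented example)
cf = rank_cf([("a book", "object"), ("John", "subject"),
              ("the garden", "adverbial"), ("Mary", "indirect")])
cp = cf[0]   # the preferred center Cp(U_n) is the highest-ranked element
print(cf, cp)
```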
+ Inducing a coherence flow comes through a lot of good writing practice
+ Imputing a paragraph with salient concepts comes first to the minds of most authors, and they tend to focus on the topic, which here is {house, door, furniture, burglary}
+ Incorporating coherence this way does not necessarily lead to the final summary being coherent
+ Coherence is best handled as a post-processing step using the Traveling Salesman Problem [Conroy et al.]
+ There are lots of open questions on just multi-document summarization itself…
But what I really wanted is…
to “see” what topics mean?
+ Interpreting topics can still be tedious
+ Most LDA models ignore metadata even if they are useful
These are actual outputs. This is a tough event to match words with frames: the event is “Working on a sewing project”.
This is again another tough event to match words with frames. The event is “Repairing an appliance”
+ Describing a domain specific video with annotated keywords! This can be useful in robotic vision!
+ allowing robots and video recording devices to communicate at a human level
Moving on – PART II
+ At this point I was not sure where I should be moving – I had only a very vague idea!
+ And you actually don’t know if there *are* other peaks!
+ As Yoda pointed out… “Clouded your future is!”
+ So now let’s visit the document space again
+ We look at another model – TagLDA – which can incorporate a certain kind of domain knowledge into LDA
+ Document-partitioned words have associated annotations -> this gives rise to two different distributions over words, and each distribution affects the other
+ A word is observed under the effects of both these distributions
What does this representation buy us?
+ The goal is to assign x_9 and x_10 to their correct cluster with the use of domain knowledge
+ x_5 and x_10 are annotated with the orange label, and x_5 co-occurs with x_9 both in d_1 and d_2
+ It is thus likely that x_5, x_9 and x_10 belong to the same class, since both documents d_1 and d_2 should contain as few topics as possible
+ Fitting a model amounts to forming a hypothesis which can best explain a set of observations
+ TagLDA implicitly expands the hypothesis space of topics to search for the best explanation needed to describe the observations, with the help of the annotations from domain knowledge
What if we assume that there is an additional perspective over d_i w.r.t. x' – is this an unnatural assumption?
Well, not at all!
Word-level tags: hyperlinked text in the body; document-level tags: categories
Word-level tags: question/answer; document-level tags: actual tags for the forum post
Word-level tags: title, image description; document-level tags: tags given by users
Is the bi-perspective nature of documents ubiquitous?
We don’t have annotations, but let’s see how they can be built up! It seems this is a document on the investigation of industrial espionage.
Words to the right are relevant to the topic of the document set – mostly by frequency
Natural language processing based content annotation. Since documents are mostly about some events, certain words strike us – NEs mentioned frequently and across sentences, and dependencies between subjects and objects of the important verbs from the document set.
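The "NEs mentioned frequently and across sentences" idea can be sketched with a crude capitalization-plus-frequency heuristic; a real pipeline would use an NER tagger and a dependency parser. The sentences below are invented for illustration.

```python
# Crude sketch of the annotation heuristic in the notes: treat capitalized
# tokens that recur across sentences as named-entity candidates. A real
# pipeline would use an NER tagger and a dependency parser instead.
# The example sentences are invented.
from collections import Counter

def candidate_entities(sentences, min_sents=2):
    # One set of capitalized tokens per sentence (so counts = #sentences).
    per_sent = [set(w.strip(".,") for w in s.split() if w[:1].isupper())
                for s in sentences]
    counts = Counter(w for sent in per_sent for w in sent)
    return sorted(w for w, c in counts.items() if c >= min_sents)

sents = ["Volkswagen denied the charges.",
         "Lopez joined Volkswagen in 1993.",
         "Opel accused Lopez of taking documents."]
print(candidate_entities(sents))
```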
The word- and document-level tagged words alone are sufficient to summarize the document as bags of words. So are we done with the summarization problem?
And now we want things like these! If you are in doubt, ask any member of Dr. Corso’s VPML Lab. But:
+ High-level descriptions are complex
+ Spoken language is complicated, with high degrees of paraphrasing
The translator does consciously what the author did instinctively.
How can bi-perspective topic models be exploited? The experiments really started off by looking at the image captions and category labels.
This slide is self explanatory
Some people call it a mere combination. But I say it is e-Harmony!
We now cover a particular METag^2LDA model.
\pi: tag (i.e. word annotation) distributions over words
\beta: topic distributions over words
\mu and \sigma are fixed regularizers, i.e. some fixed priors that help in proper scaling of the parameters during optimization
The joint probability distribution belongs to the exponential family, following the maximum entropy principle. In the original model, the hidden variables and parameters are coupled, leading to an exponential state space to search for the right posterior over the hidden variables.
Delete all observations and edges which lead to the passages of the Bayes Ball being obstructed---------------------------------
This results in decoupling of the variables over which the posterior needs to be computed. The more the decoupling, the more tractable the inference.
+ We use fixed regularizers here
+ Introducing exponential family priors for \pi and \beta would need more complicated inference machinery
+ There are several other approximation techniques to compute posteriors and hence marginals; Mean Field is a deterministic local optimization technique, but celebrities have endorsed it
Even Adrian Monk likes Mean Field factorization!
And now let me introduce our friend – the mixture of Gaussians for real-valued data.
+ Keep x_1 and x_2 fixed and try to explain the two samples through different location parameters of the Gaussians via the log likelihood
+ The two surfaces are the error surfaces of the mixture model likelihood for x_1 and x_2 individually
+ For discrete data, the mean parameters of the generating distribution are not discrete
Mixture of two Gaussians model.
+ Keep the two true location parameters fixed and try to explain samples generated at different distances from the two means through the log likelihood
+ There is a relation between the parameters of the distribution over the data (usually unknown) and the sufficient statistics as a function of the data only
Which leads us to…
Mean parameters = expected sufficient statistics. Field = energy arising out of interactions with neighboring nodes (in mathematics, a field is nothing but a space). The \mu_e (the red dots) are the extreme points of the polytope, each a function of the sufficient statistics \Upsilon(z,x) for fixed x. When we optimize over this space, we select one of these red dots and, corresponding to it, there is an optimal mean parameter \mu^{\star}.
Suppose we have the complete data as (Z,X), with Z = hidden variables and X = observations. M(G) is the mean parameter space corresponding to expected sufficient statistics of the hidden variables in the original graphical model G. For discrete distributions, M(G) is a convex polytope due to the intersection of finitely many linear inequalities, i.e. half-spaces. For each fixed x and p_{\theta} there is a \mu, and as p is varied holding x fixed, the set M is formed. \mu provides an alternative parameterization of the exponential family distribution, and any mean parameter in interior(M(G)) yields a lower bound to A(\theta). "Any mean parameter" can mean mean parameters of distributions whose moments can be easily computed, e.g. factored distributions, and those assumptions lead to a non-convex domain over which optimization is performed. A cartoon constraint is shown in the upper right corner. Z|x ~ Mult(\theta); \Upsilon(z) is the sufficient statistics for z.
Log partition functions play an important role in the mapping of \mu to \theta and vice versa. M_F(G) is a subset of M(G) having only the extreme points in common, dependent on the factorization F over Z, which allows discovery of this backward mapping in finite time. The easiest implementation of the mean field principle is to consider no direct dependencies between the distributions of the hidden variables.
Classic estimator-finding problem: maximize the log likelihood, whose objective includes the empirical mean and the log partition function. Classic theorem: maximize over \mu given a set of observations x to get as close to \theta as possible. \mu is dependent on the sufficient statistics associated with the variables whose likelihood we need to maximize. A(\theta) is the log partition function expressed in terms of the dual A*(\mu). The dual, A*(\mu), is maximized at the negative entropy of the distribution over mean parameters when the latter belong to interior(M(G)). The relation between the derivatives of the log partition functions of the primal and the dual is shown in the lower left corner.
Mean field approximation to the joint p(z_1, z_2, z_3): a product of independent Bernoulli distributions p(z_i). In this case, the mean field distributions are exactly in the same exponential family as the true distributions. Write down each q in exponential form with the log partition function (as a function of canonical parameters in this case). Solve for A*(\mu) using the maximum over the dual formulation, yielding \theta(\mu) and A*(\mu). Solving for A(\theta) using A*(\mu) yields \mu(\theta) = exp(\theta)/(1+exp(\theta)).
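A quick numeric check of this Bernoulli example: with A(θ) = log(1 + e^θ), the mean parameter μ(θ) is the derivative of the log partition function, the backward mapping is θ(μ) = log(μ/(1−μ)), and the dual A*(μ) equals the negative entropy. θ = 0.7 is an arbitrary test value.

```python
# Numeric check of the Bernoulli mean-field example: in exponential form
# p(z) = exp(theta*z - A(theta)) with A(theta) = log(1 + e^theta),
# mu(theta) = dA/dtheta = e^theta / (1 + e^theta), and the dual
# A*(mu) = theta(mu)*mu - A(theta(mu)) equals the negative entropy.
import math

def A(theta):
    return math.log(1.0 + math.exp(theta))

def mu(theta):
    return math.exp(theta) / (1.0 + math.exp(theta))

theta = 0.7                                   # arbitrary test point
eps = 1e-6
numeric_grad = (A(theta + eps) - A(theta - eps)) / (2 * eps)  # ~ mu(theta)

m = mu(theta)
theta_back = math.log(m / (1.0 - m))          # backward mapping mu -> theta
A_star = theta_back * m - A(theta_back)       # dual at the optimum
neg_entropy = m * math.log(m) + (1.0 - m) * math.log(1.0 - m)
print(numeric_grad, m, A_star, neg_entropy)
```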
The goal is to find \mu from the sufficient statistics of Z. In practical problems, there are exponentially many extreme points over all realizations of the sufficient statistics. This shows a cartoon illustration of the solution for mean parameters using linear programming.
Unfortunately, the set of \mu s under the factorization assumption is a strict subset of M(G). This subset itself would be convex if it did not have to match its extreme points to those of the enclosing set. The region over which optimization for \mu needs to happen under the tractable distribution assumption is thus non-convex – this means that we won't get a globally optimal solution.
Let us now look at the relation of this formulation to the mean field formulation of METag^2LDA:
\Theta^T \mu = \Theta^T \int \sum_z \Upsilon(\theta, y, z) \, q(\theta, y, z), e.g. the term \sum_{k=1}^K \phi_{m,k} I_k[z_m] \log \beta_k
-A*(\mu) = +H(q) = -\sum q \log q
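Putting the two pieces above together, the mean field lower bound can be written out as a sketch in the notes' notation (q is the factorized variational distribution):

```latex
\mathcal{L}(q) \;=\; \Theta^{\top}\mu \;-\; A^{*}(\mu)
\;=\; \mathbb{E}_{q(\theta,y,z)}\!\bigl[\log p(x,\theta,y,z \mid \Theta)\bigr] \;+\; H(q)
\;\le\; \log p(x \mid \Theta),
\qquad H(q) = -\,\mathbb{E}_q[\log q].
```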
The big red box is the ELBO. Top: E-step inner loop (update variational distributions for every document). Bottom: M-step parameter updates based on mean parameters of the associated document-dependent sufficient statistics. For \beta and \pi, we only do MAP estimation here, corresponding to fixed priors which act as regularizers.
All of this inference machinery *is needed* to generate exploratory outputs like this!
Non-Correspondence topic models vs. Correspondence topic models
+ Within the family of (Corr)MM(E)(Tag2)LDAs modeling joint observations, Corr-METag2LDA performs best
+ We need to be careful about what kind of document-level tags we are considering: do those tags really collaborate in refining the topical perspective?
Cons:
- Collocations need to be addressed
- Chains don’t involve causality, e.g. (fogs & accidents, [hop length = 12])
So what’s next?
I never looked seriously at this paper “Modeling annotated data” until very late (around 2010)
From this to
This (actually the other way around)
Upper row – training (camera motion and shakes are a real problem for maintaining the bounding boxes). Lower row – trained models.
+ Role of alpha – alpha provides a topic for every observation. Alpha is a K-vector.
+ Here each component of alpha is different, which helps assign different proportions of observations differently (e.g. one topic can focus solely on “stop words”, another on “commonly occurring words”, and other ones on the different topics, etc.)
+ This helps identify a set of “basis” topic distributions, while SVD computes a span over topic distributions
Translation formula (marginalization over topics):
- If there are two topics, i.e. K=2, then (e.g. for the 2nd term) 0.5*0.5 + 0.5*0.5 = 0.5 < 0*0.0001 + 0.9*0.9
- Values of the inferred \phi’s are very important for the real-valued data – well-separated Gaussians are better, but that does not always happen
- This raises an issue where the real-valued data may need to be preprocessed to increase the chances of separation
The sum over K is the marginalization over z in p(w,z)
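A toy instance of this marginalization: a word's score given a test video sums over topics, score(w) = Σ_k φ_k β_{k,w}. The topic proportions φ and topic-word distributions β below are invented numbers, not model output.

```python
# Toy sketch of the translation formula: score each word by marginalizing
# over topics, score(w) = sum_k phi_k * beta[k][w]. The phi and beta
# values are invented, not inferred model output.
def translate(phi, beta, vocab):
    return {w: sum(p * beta[k][w] for k, p in enumerate(phi)) for w in vocab}

vocab = ["skateboard", "fish", "cake"]
beta = [{"skateboard": 0.7, "fish": 0.2, "cake": 0.1},   # topic 0
        {"skateboard": 0.1, "fish": 0.8, "cake": 0.1}]   # topic 1
phi = [0.9, 0.1]   # the test video is mostly about topic 0

scores = translate(phi, beta, vocab)
best = max(scores, key=scores.get)
print(scores, best)
```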
Again all of these are needed to translate videos into text and vice versa
This is the core problem of video summarization
Psycholinguistics would be needed to confirm this, but that's not a concern at this point. In our dataset we have only one ground-truth summary – the base case for ROUGE evaluation.
There are no individual summaries for shots within the clip – only one high-level summary. Problems with shot-wise nearest-neighbor matching arise precisely for this reason.
The dataset that we use for the video summarization task is released as part of NIST's 2011 TRECVID Multimedia Event Detection (MED) evaluation set. The dataset consists of a collection of Internet multimedia content posted to various Internet video hosting sites. The training set is organized into 15 event categories, some of which are: 1) Attempting a board trick, 2) Feeding an animal, 3) Landing a fish, 4) Wedding ceremony, 5) Working on a woodworking project, etc.
We use the videos and their textual metadata in all 15 events as training data. There are 2062 clips with summaries in the training set, with an almost equal distribution amongst the events. The test set which we use is called the Transparent Development (Dev-T) collection. The Dev-T collection includes positive instances of the first 5 training events and near-positive instances for the last 10 events – a total of 630 videos labeled with event category information (and associated human synopses which are to be compared against for summarization performance). Each summary is a short and very high-level description of the entire video and ranges from 2 to 40 words, but is 10 words on average (with stopwords). We remove standard English stopwords and retain only the word morphologies (not required) from the synopses as our training vocabularies. The proportion of videos belonging to events 6 through 15 in the Dev-T set is much lower than the proportion for the other events, since those clips are considered to be “related” instances which cover only part of the event category specifications. The performances of our topic models are evaluated on those kinds of clips as well. The numbers of videos in events 6 through 15 in the Dev-T set are {4, 9, 5, 7, 8, 3, 3, 3, 10, 8}, while there are around 120 videos per event for the first 5 events.
All other videos in the Dev-T set neither have any event category label nor are identified as positive, negative or related videos and we do not consider these videos in our experiments.
Test ELBOs on events 1-5 in the Dev-T set – measuring held-out log likelihoods on both videos and associated human summaries. Prediction ELBOs on events 1-5 in the Dev-T set – measuring held-out log likelihoods on just videos, in the absence of the text. A lower inverse covariance contributes high positive values to the log likelihood, and the Gaussian entropy can be high too due to overlapping tails.
The HEXTAC scores can change from dataset to dataset, but they max out at around 40-45% for 100-word summaries.
If we can achieve 10% of this for 10-word summaries, we are doing pretty good! Caveat – the text multi-document summarization task is much more complex than this simpler task (w.r.t. summarization).
Purely multinomial topic models showing lower ELBOs can perform quite well in BoW summarization. MMLDA assigns likelihoods based on the success and failure of independent events, and failures contribute highly negative terms to the log likelihoods, but this does not indicate the model's summarization performance, where low-probability terms are pruned out. Gaussian components allow different but related topics to model GIST features almost equally (strong overlap in the tails of the bell-shaped curves – Gaussians) and show poor permutation of predicted words due to the violation of the soft probabilistic constraint of correspondence (this also leads to higher entropy). The scaling of variables in these kinds of mixed-domain topic models needs to be looked at more closely.
To improve relevancy of the lingual descriptions generated for the domain specific test videos, we present… for the first time ever…
iAnalyze for your videos…
A computer science graduate should never have to cope with information twirling around his head! We need high quality tools to address this problem.
I took the late Amar Gopal Bose’s advice in preparing these slides: I took some time out to prepare them, leaving everything else behind. As Dr. Bose would say, “creativity never comes under emotional stress or tension. The real creativity comes when the mind finally relaxes and it is quiet, and then you can focus.” Watch here [http://www.ndtv.com/video/player/news/remembering-amar-bose/282935?pfrom=home-topstories]. And yes, most of these slides were prepared with Bose iOE2 headphones over my ears.