1. Natural Language
Summarization of Text and
Videos using Topic Models
Pradipto Das
PhD Dissertation Defense
CSE Department, SUNY at Buffalo
Rohini K. Srihari, Professor and Committee Chair, CSE Dept., SUNY Buffalo
Sargur N. Srihari, Distinguished Professor, CSE Dept., SUNY Buffalo
Aidong Zhang, Professor and Chair, CSE Dept., SUNY Buffalo
Download this presentation from http://bit.ly/pdasthesispptx or http://bit.ly/pdasthesispptxpdf
Primary committee members
2. The Road Ahead (modulo presenter)
Introduction to LDA
Learning to Summarize using Coherence [NIPS Wkshp 2009]
Discovering Voter Preferences using Mixtures of Topic Models [AND Wkshp 2009]
Simultaneous Joint and Conditional Modeling of Documents Tagged from Two Perspectives [CIKM 2011]
A Thousand Frames in just a Few Words: Lingual Descriptions of Videos through Latent Topic Models and Sparse Object Stitching [CVPR 2013]
Translating Related Words to Videos and Back through Latent Topics [WSDM 2013]
Using Tag-Topic Models and Rhetorical Structure Trees to Generate Bulleted List Summaries [journal submission]
3. • Stay hungry
• Stay foolish
Steve Jobs: Stanford Commencement Speech, 2005
The answers are coming within the next 60-75 minutes... so...
there is great food, green tea and coffee at the back!
But if you stay hungry I will happily grab the leftovers!
4. Contributions of this thesis
Can we find topics from a corpus without human intervention? Can we use these topics to annotate documents and use annotations to organize, summarize and search text? Well, yes, LDA does that for us! That is so 2003!
Well, can LDA model documents tagged from at least two different viewpoints or perspectives? No!
Can we do that after reading this thesis? Yes we can!
Can we generate bulleted lists from multiple documents after reading this thesis? Yes we can!
Can we go further and translate videos into text and vice versa after reading this thesis? Yes we can!
Bottomline: We can explore our data, extrapolate from our data and use context to guide decisions about new information
6. • Unsupervised topic exploration using LDA
– Full text of the first 50 patents from uspto.gov retrieved with the search keyword "rocket" & full text of 50 scientific papers from the American Journal of Aerospace Engineering
– Vocabulary size: 10,102 words; total word count: 219,568

Theme 1 Theme 2 Theme 3 Theme 4 Theme 5
insulation fuel launch rocket system
composition matter mission assembly fuel
fiber A-B space nozzle engine
system engineer system surface combustion
sensor tower vehicle portion propulsion
fire magnetic earth ring pump
water electron orbit motor oxidize

Theme labels (in the order shown): topic from patent documents; topic from journal papers; topic from patent documents; topic from journal papers; topic from journal papers
Explore and extrapolate from context
7. Power of LDA: Language independence
[Table: topics over words (top half) and topics over a controlled vocabulary (bottom half), each column paired with its English translation; the original non-English topic words and per-word translation pairs were lost in extraction. Recoverable translations:
• Topic 1: Tsunami, earthquake, Chile, Pichilemu, gone, warning, news, city
• Topic 2: flight, Air, France, Brazil, A, 447, disappear, ocean, France
• Topic 3: China, Olympic, Beijing, Gore, function, stadium, games
• Topic 1 (controlled vocabulary): Tsunami, earthquake, city, local, UTC, Mayor
• Topic 2 (controlled vocabulary): Brazil, A, disappeared, search, flight, aircraft, ocean, ship, air, space
• Topic 3 (controlled vocabulary): China, Olympic, Gore, gold, Beijing, National]
8. How does LDA look at documents?
A boring view of Wikipedia
9. What about other perspectives?
Words forming other Wiki articles
Article specific content words
Words forming section titles
An exciting view of Wikipedia
10. Insulation, composition, fiber, system, sensor, fire, water
Fuel, matter, A-B, engineer, tower, magnetic, electron
Rocket, assembly, nozzle, surface, portion, ring, motor
Launch, mission, space, system, vehicle, earth, orbit
We are identifying the landscape from within the landscape – similar to finding the map of a maze from within the maze!
Explore and extrapolate from context
12. Success of LDA
• Fitting themes to an UNSEEN patent document on insulating a
rocket motor using basalt fibers, nanoclay compositions etc.
Theme 1 Theme 2 Theme 3 Theme 4 Theme 5
insulation fuel launch rocket system
composition matter mission assembly fuel
fiber A-B space nozzle engine
system engineer system surface combustion
sensor tower vehicle portion propulsion
fire magnetic earth ring pump
water electron orbit motor oxidize
“What is claimed is:
1. An insulation composition comprising: a polymer comprising at least one
of a nitrile butadiene rubber and polybenzimidazole fibers; basalt fibers
having a diameter that is at least 5 .mu.m
2. (lots more) …”
Topic from
patent
documents
Topic from
journal
papers
Topic from
patent
documents
Topic from
journal
papers
Topic from
journal
papers
14. Model Complexities (modulo presenter)
K-Means, GMM, Hierarchical Clustering
LDA: VB, LDA: Gibbs, Dynamic LDA, Hierarchical LDA, Markov LDA, Syntactic LDA, Suffix Tree LDA
MMLDA, Corr-LDA, TagLDA, Corr-METag2LDA, Corr-MMGLDA
Hair Loss
15. Why do we want to explore?
Master Yoda, how do I find wisdom
from so many things happening
around us?
Go to the center of the data and
find your wisdom you will
16. parkour perform traceur area flip footage jump park
urban run outdoor outdoors kid group pedestrian
playground
lobster burger dress celery Christmas wrap roll mix
tarragon steam season scratch stick live water lemon
garlic
floor parkour wall jump handrail locker contestant
school run interview block slide indoor perform build
tab duck
make dog sandwich man outdoors guy bench black
sit park white disgustingly toe cough feed rub
contest parody
Can you find your wisdom?
Corr-MMGLDA
17. Corr-MMGLDA
parkour perform traceur area flip footage jump park
urban run outdoor outdoors kid group pedestrian
lobster burger dress celery Christmas wrap roll mix
tarragon steam season scratch stick live water lemon
floor parkour wall jump handrail locker contestant
school run interview block slide indoor perform build
tab duck
make dog sandwich man outdoors guy bench black
sit park white disgustingly toe cough feed rub
contest parody
tutorial: man explains how to make lobster rolls from scratch
One guy is making sandwich outdoors
montage of guys free running up
a tree and through the woods
interview with parkour contestants
Kid does parkour around the park
Footage of group of performing parkour outdoors
A family holds a strange burger assembly
and wrapping contest at Christmas
Actual ground-truth synopses overlaid
Man performs parkour in various locations
Are these what you were thinking?
18. [Figure: points numbered 1–14]
• No ground truth label assignments are known
The Classical Partitioning Problem
19. [Figure: points numbered 1–14]
• Then, select the one with the lowest loss; for example the one
shown – blue = +1, red = -1
• But we don’t really have a good way to measure loss here!
Distance from or closeness
to a central point
The Classical Partitioning Problem
20. [Figure: points numbered 1–14]
• Then, select the one with the lowest loss; for example the one
shown – blue = +1, red = -1
• But we don’t really have a good way to measure loss here!
Distance from or closeness
to a central point
Let's sample one more point
21. The Ground Truth – Two “Topics”
The seven
virtues
The seven
vices
Assume, now, that we have some vocabulary V of English words
X is a set of positions and each element of X is labeled with an
element from V
22. If X is a multi-set of words (set of positions), then it has an inherent structure in it, e.g.:
• We no longer see: [the unpartitioned multi-set]
• We are used to: [words partitioned into documents]
Additional Partitioning: Documents
The seven
virtues
The seven
vices
23. Success behind LDA
Allocate as few topics as possible to a document
Allocate as few words as possible to each topic
Balancing Act: "I am Nik WalLenDA"
This checkerboard pattern has a significance – in general it is NP-Hard to figure out the correct pattern from limited samples, even for 2 topics
The topic ALLOCATION is controlled by the parameter of a DIRICHLET distribution governing a LATENT proportion of topics over each document
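The effect of that Dirichlet parameter can be seen directly by sampling; this is an illustrative NumPy sketch with made-up values, not a fitted model:

```python
# Sketch: a small Dirichlet concentration parameter yields sparse topic
# proportions (few topics per document); a large one spreads mass evenly.
import numpy as np

rng = np.random.default_rng(0)
K = 10  # number of topics

sparse = rng.dirichlet(np.full(K, 0.1), size=1000)   # alpha << 1
dense = rng.dirichlet(np.full(K, 10.0), size=1000)   # alpha >> 1

# Average fraction of mass on each sample's single largest topic:
print("alpha=0.1 : top-topic mass ~", round(sparse.max(axis=1).mean(), 2))
print("alpha=10.0: top-topic mass ~", round(dense.max(axis=1).mean(), 2))
```

With alpha well below 1 almost all of a document's proportion mass sits on one or two topics, which is the "allocate as few topics as possible" behaviour described above.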
24. Current Timeline Consequent Timeline
Event Categories: Accidents/Natural Disasters; Attacks (Criminal/Terrorist); Health &
Safety; Endangered Resources; Investigations (Criminal/Legal/Other)
Previously, long long time ago
25. Centers of an utterance – Entities serving to link that
utterance to other utterances in the current discourse
segment
Sparse Coherence Flows
[Barbara J. Grosz, Scott Weinstein, and Aravind K. Joshi. Centering: A framework for modeling the local coherence of discourse. Computational Linguistics, volume 21, pages 203–225, 1995]
a. Bob opened a new dealership last week. [Cf=Bob,
dealership; Cp=Bob; Cb=undefined]
b. John took a look at the Fords in his lot. [Cf=John, Fords;
Cp=John; Cb=Bob] {Retain}
c. He ended up buying one.
i. [Cf=John; Cp=John; Cb=John] {Smooth-Shift} OR
ii. [Cf=Bob; Cp=Bob; Cb=Bob] {Continue}
Previously, long long time ago
Center approximation = the (word, [Grammatical/Semantic] role) pair (GSR), e.g. (Bob, Subject), (John, Subject), (dealership, Noun)
Algorithmically
By inspection
For n+1 = 3 and case ii
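The transition typology from the cited Grosz, Weinstein & Joshi framework can be written down directly; `center_transition` is a hypothetical helper name for this sketch:

```python
# Sketch of the centering transition table: classify the move between
# utterances from the previous Cb, the current Cb, and the current Cp.
def center_transition(cb_prev, cb_cur, cp_cur):
    # An undefined previous Cb is conventionally treated as matching.
    same_cb = cb_prev is None or cb_prev == cb_cur
    if same_cb:
        return "Continue" if cb_cur == cp_cur else "Retain"
    return "Smooth-Shift" if cb_cur == cp_cur else "Rough-Shift"

# The dealership example from the slide:
print(center_transition(None, "Bob", "John"))    # (b) after (a): Retain
print(center_transition("Bob", "John", "John"))  # (c-i): Smooth-Shift
print(center_transition("Bob", "Bob", "Bob"))    # (c-ii): Continue
```

The three calls reproduce the {Retain}, {Smooth-Shift} and {Continue} labels annotated on utterances (b) and (c) above.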
26. Global (document/section level) focus
Problems with Centering Theory
a. The house appeared to have been burgled. [Cf=house ]
b. The door was ajar. [ Cb=house; Cf=door, house; Cp=door]
c. The furniture was in disarray. [ Cb=house; Cf=furniture,
house; Cp=furniture] {?}
Previously, long long time ago
For n+1 = 3
Utterances like these are the majority in most free text
documents [redundancy reduction]
In general, co-reference resolution is very HARD
27. An example summary sentence from folder D0906B-A of the TAC 2009-A timeline:
• “A fourth day of thrashing thunderstorms began to take a heavier toll on southern
California on Sunday with at least three deaths blamed on the rain, as flooding and
mudslides forced road closures and emergency crews carried out harrowing rescue
operations.”
The next two contextual sentences in the document of the previous sentence are:
• “In Elysian Park, just north of downtown, a 42-year-old homeless man was killed
and another injured when a mudslide swept away their makeshift encampment.”
• “Another man was killed on Pacific Coast Highway in Malibu when his sport utility
vehicle skidded into a mud patch and plunged into the Pacific Ocean.”
If the query is, “Describe the effects and responses to the heavy rainfall and mudslides
in Southern California,” observe the focus of attention on mudslides as subject in
the first two sentences in the table below:
Sentence-GSR grid for a sample summary document slice
Summarization using Coherence
Incorporating coherence this way does not necessarily
lead to the final summary being coherent
Coherence is best obtained in a post processing step
using the Traveling Salesman Problem
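The TSP-style post-processing step can be sketched with a greedy nearest-neighbour heuristic over a pairwise "incoherence" (distance) matrix; the matrix below is made up, and the thesis's actual solver may differ:

```python
# Sketch: coherence-driven sentence ordering as a Traveling Salesman
# Problem, solved greedily. dist[i][j] is a hypothetical incoherence
# cost of placing sentence j right after sentence i.
def order_sentences(dist):
    n = len(dist)
    tour, remaining = [0], set(range(1, n))
    while remaining:
        last = tour[-1]
        nxt = min(remaining, key=lambda j: dist[last][j])  # cheapest next hop
        tour.append(nxt)
        remaining.remove(nxt)
    return tour

# Toy distances: sentence 0 flows best into 2, then 1.
dist = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.4],
    [0.1, 0.4, 0.0],
]
print(order_sentences(dist))  # [0, 2, 1]
```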
28. measure project lady
tape indoor sew
marker pleat
highwaist zigzag
scissor card mark
teach cut fold stitch
pin woman skirt
machine fabric inside
scissors make leather
kilt man beltloop
sew woman fabric
make machine show
baby traditional loom
blouse outdoors
blanket quick
rectangle hood knit
indoor stitch scissors
pin cut iron studio
montage measure kid
penguin dad stuff
thread
One lady is doing sewing project indoors
Woman demonstrating different stitches using a
serger/sewing machine
dad sewing up stuffed penguin for kids Woman makes a bordered hem skirt
A pair of hands do a sewing project using a sewing machine
ground-truth synopses overlaid
But what we really want is this
29. ground-truth synopses overlaid
clock mechanism
repair computer tube
wash machine lapse
click desk mouse time
front wd40 pliers
reattach knob make
level video water
control person clip
part wire inside
indoor whirlpool man
gear machine guy
repair sew fan test
make replace grease
vintage motor box
indoor man tutorial
fuse bypass brush
wrench repairman
lubricate workshop
bottom remove screw
unscrew screwdriver
video wire
How to repair the water level control mechanism on a
Whirlpool washing machine
a man is repairing a whirlpool washer
how to remove blockage from
a washing machine pump
Woman demonstrates replacing a door hinge
on a dishwasher
A guy shows how to make
repairs on a microwave
How to fix a broken agitator on a Whirlpool
washing machine
A guy working on a vintage box
fan
And this
32. Roadmap
Introduction
to LDA
Discovering Voter Preferences Using
Mixtures of Topic Models [AND’09 Oral]
Learning to Summarize
Using Coherence [NIPS
09 Poster]
Core NLP
including summarization,
information extraction,
unsupervised grammar
induction, dependency parsing,
rhetorical parsing, sentiment
and polarity analysis…
Non-parametric Bayes
Applied Statistics
Exit 2
Exit 1
Uncharted territory –
proceed at your own risk
33. Why
When
Who
Where
TagLDA: More Observed Constraints
Domain knowledge
Topic
distribution
over words
Annotation/
Tag
distribution
over words
Is there a model which
can take additional clues
and attempt to correct
the misclassifications?
34. Why
When
Who
Where
Domain knowledge
Incorporating Prior Knowledge
Topic
distribution
over words
but
conditioned
over tags
Number of
parameters
= (K+T)V
TagLDA
switches to
this view for
partial
normalization
of some
weights
- x5 and x10 are annotated with the orange label and x5 co-occurs with x9 in both documents d1 and d2
- It is thus likely that x5, x9 and x10 belong to the same class, since both d1 and d2 should contain as few topics as possible
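The partially normalized log-linear word distribution that gives TagLDA its (K+T)V parameter count can be sketched as follows; the weight matrices here are random placeholders, not learned values:

```python
# Sketch of a TagLDA-style word distribution: topic weights and tag
# weights combine log-linearly and are normalized over the vocabulary,
# so only (K + T) * V parameters are needed rather than K * T * V.
import numpy as np

rng = np.random.default_rng(0)
K, T, V = 3, 2, 5                 # topics, tags, vocabulary size
beta = rng.normal(size=(K, V))    # topic-word weights (placeholder)
pi = rng.normal(size=(T, V))      # tag-word weights (placeholder)

def p_word_given_topic_tag(k, t):
    logits = beta[k] + pi[t]
    e = np.exp(logits - logits.max())   # stable softmax
    return e / e.sum()                  # normalize over the vocabulary

p = p_word_given_topic_tag(0, 1)
print(p.round(3))
```

Note how the same tag weights are shared across every topic, which is exactly the parameter-sharing trick behind the (K+T)V count stated above.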
40. News Article
What if the documents
are plain text files?
Understanding the Two Perspectives
41. It is believed US investigators have asked
for, but have been so far refused access to,
evidence accumulated by German
prosecutors probing allegations that former
GM director, Mr. Lopez, stole industrial
secrets from the US group and took them
with him when he joined VW last year.
This investigation was launched by US
President Bill Clinton and is in principle a far
more simple or at least more single-minded
pursuit than that of Ms. Holland.
Dorothea Holland, until four months ago
was the only prosecuting lawyer on the
German case.
News Article
Imagine browsing over many reports on an event
Understanding the Two Perspectives
42. It is believed US investigators have asked for, but have been so far refused access to, evidence accumulated by German prosecutors probing allegations that former GM director, Mr. Lopez, stole industrial secrets from the US group and took them with him when he joined VW last year.
This investigation was launched by US President Bill Clinton and is in principle a far more simple or at least more single-minded pursuit than that of Ms. Holland.
Dorothea Holland, until four months ago was the only prosecuting lawyer on the German case.
News Article
The “document level”
perspective
What words can we remember after a first browse?
German, US,
investigations,
GM, Dorothea
Holland, Lopez,
prosecute
Understanding the Two Perspectives
43. Important Verbs
and Dependents
Named Entities
What helped us remember?
ORGANIZATION
It is believed US investigators have asked
for, but have been so far refused access to,
evidence accumulated by German
prosecutors probing allegations that former
GM director, Mr. Lopez, stole industrial
secrets from the US group and took them
with him when he joined VW last year.
This investigation was launched by US
President Bill Clinton and is in principle a far
more simple or at least more single-minded
pursuit than that of Ms. Holland.
Dorothea Holland, until four months ago
was the only prosecuting lawyer on the
German case.
News Article
LOCATION
MISC
PERSON
WHAT
HAPPENED?
The “word level”
perspective
The “document level”
perspective
German, US,
investigations,
GM, Dorothea
Holland, Lopez,
prosecute
Understanding the Two Perspectives
44. Summarization power of the perspectives
It is believed US investigators have asked
for, but have been so far refused access to,
evidence accumulated by German
prosecutors probing allegations that former
GM director, Mr. Lopez, stole industrial
secrets from the US group and took them
with him when he joined VW last year.
This investigation was launched by US
President Bill Clinton and is in principle a far
more simple or at least more single-minded
pursuit than that of Ms. Holland
Dorothea Holland, until four months ago
was the only prosecuting lawyer on the
German case.
German, US,
investigations,
GM, Dorothea
Holland, Lopez,
prosecute
Sentence Boundaries
What if we turn the document off?
Begin Middle End
45. A young man climbs an artificial rock wall indoors
Adjective modifier
(What kind of wall?)
Direct Object
Direct
Subject
Adverb modifier
(climbing where?)
Major Topic: Rock climbing
Sub-topics: artificial rock wall, indoor rock climbing gym
And as if that wasn’t enough!
46. Categories: Weather hazards to aircraft | Accidents involving fog | Snow or ice weather phenomena | Fog | Psychrometrics (labeled by human editors)
Beginning Middle End
A Wikipedia Article on "fog"
47. Take the first category label – “weather hazards to aircraft”
“aircraft” doesn’t occur in the document body!
“hazard” only appears in a section title read as “Visibility
hazards”
“Weather” appears only 6 out of 15 times in the main body
However, the images suggest that fog is related to concepts like
fog over the Golden Gate bridge, fog in streets, poor visibility
and quality of air
Wiki categories: Abstract or specific?
Categories (labeled by human editors): Weather hazards to aircraft | Accidents involving fog | Snow or ice weather phenomena | Fog | Psychrometrics
Categories (labeled by a Tag2LDA model from title and image captions): fog, San Francisco, visible, high, temperature, streets, Bay, lake, California, bridge, air
50. Topic ALLOCATION is controlled by the parameter of a DIRICHLET distribution governing a LATENT proportion of topics over each document
Bi-Perspective Topic Model – METag2LDA
"I am Nik WalLenDA" – and this balancing act got a whole lot tougher
61. Mean Field Optimization
Very similar to finding the basic feasible solution
(BFS) in linear programming
• Start with pivot at the origin (only slack variables
as solution)
• Cycle the pivot through the extreme points i.e.
replace slacks in BFS until solution is found
62. Mean Field Optimization
However, mean field optimization space is
inherently non-convex over the set of tractable
distributions due to the delta functions which match
the extreme points of the convex hull of sufficient
statistics of the original discrete distributions
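That non-convexity can be seen even on a toy model; this two-variable sketch is purely illustrative and is not the thesis's actual optimization problem:

```python
# Sketch: mean-field coordinate ascent on a toy model
# p(s1, s2) ∝ exp(J * s1 * s2) with s ∈ {-1, +1}. The factorized
# updates m_i <- tanh(J * m_j) (m = mean of s under q) have multiple
# fixed points for large J, so different initializations reach
# different local optima: the objective is non-convex.
import math

def mean_field(J, m1, m2, iters=200):
    for _ in range(iters):
        m1 = math.tanh(J * m2)
        m2 = math.tanh(J * m1)
    return m1, m2

print(mean_field(2.0, 0.1, 0.1))    # converges near (+0.96, +0.96)
print(mean_field(2.0, -0.1, -0.1))  # converges near (-0.96, -0.96)
```

The two runs land at symmetric fixed points depending only on the sign of the initialization, mirroring the extreme-point-matching behaviour described above.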
67. Topics conditioned on different section identifiers
(WL tag categories)
Topic Marginals
Topics
over
image
captions
Correspondence
of DL tag words
with content
words
Topic Labeling
Faceted Bi-Perspective Document Organization
All of the inference machinery *is needed*
to generate exploratory outputs like this!
68. • METag2LDA: A topic generating all DL tags in a document does not necessarily mean that the same topic generates all words in the document
• Corr-METag2LDA: A topic generating *all* DL tags in a document does mean that the same topic generates all words in the document – a considerably stronger assumption
Topic concentration parameter
Document-specific topic proportions
Document content words
Document Level (DL) tags
Word Level (WL) tags
Indicator variables
Topic parameters
Tag parameters
Corr-METag2LDA
METag2LDA
The Family of Tag2LDA Models
69. Experiments
Wikipedia articles with images and captions manually
collected along {food, animal, countries, sport, war,
transportation, nature, weapon, universe and ethnic
groups} concepts
Annotations/tags used:
DL Tags – image caption words and the article titles
WL Annotations – Positions of sections binned into 5
bins
Objective: to generate category labels for test documents
Evaluation
– ELBO: to see performance among various TagLDA models
– WordNet based similarity evaluation between actual category
labels and proxies for them from caption words
70. Held-out ELBO
Selected Wikipedia Articles
WL annotations – Section positions in the document
DL tags – image caption words and article titles
TagLDA perplexity is comparable to MM(METag2)LDA
The (image caption words + article titles) and the content words
are independently discriminative enough
Corr-MM(METag2)LDA performs best since almost all image caption
words and the article title for a Wikipedia document are about a
specific topic
[Chart: held-out ELBO (millions) at K = 20, 50, 100, 200 for MMLDA, TagLDA, corrLDA, METag2LDA, corrMETag2LDA]
71. [Chart: held-out ELBO (millions) at K = 40, 60, 80, 100 for MMLDA, METag2LDA, corrLDA, corrMETag2LDA, TagLDA]
Held-out ELBO
DUC05 Newswire Dataset (Recent Experiments with TagLDA Included)
WL annotations – Named Entities
DL tags – abstract coherence tuples like (subject, object), e.g. "Mary(Subject) taught the class. Everybody liked Mary(Object)." [Ignoring coref resolution]
Abstract markers like (“subj” “obj”) acting as DL perspective are not document
discriminative or even topical markers
Rather they indicate a semantic perspective of coherence which is intricately linked
to words
Ignoring the DL perspective completely leads to a better fit by TagLDA due to variations in word distributions only
[Chart: held-out ELBO (millions) at K = 40, 60, 80, 100 for MMLDA, METag2LDA, corrLDA, corrMETag2LDA]
72. Are Categories more abstract or specific?
Inverse Hop distance in WordNet ontology
Top 5 words from the caption vocabulary are chosen
Max Weighted Average = 5, Max Best = 1
METag2LDA almost always wins by narrow margins
METag2LDA reweights the vocabulary of caption words and article titles that are about a
topic and hence may miss specializations relevant to document within the top (5) ones
In WordNet ontology, specializations lead to more hop distance
Ontology-based scoring helps explain connections of caption words to ground truths, e.g. skateboard, skate, glide, snowboard
[Chart: WordNet-based average and best inverse hop distances at K = 20, 50, 100, 200 for METag2LDA and corrMETag2LDA]
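The inverse-hop-distance scoring can be sketched on a tiny hand-made hypernym tree standing in for the real WordNet ontology; all node names below are hypothetical:

```python
# Sketch of WordNet-style scoring: similarity as the inverse of the hop
# distance between two nodes in a small made-up hypernym tree.
parent = {                      # hypothetical mini-ontology
    "skateboard": "board_sport",
    "snowboard": "board_sport",
    "board_sport": "sport",
    "skate": "sport",
    "glide": "motion",
    "sport": "activity",
    "motion": "activity",
}

def ancestors(w):
    path = [w]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def inverse_hop_distance(a, b):
    pa, pb = ancestors(a), ancestors(b)
    common = next(x for x in pa if x in pb)     # lowest common ancestor
    hops = pa.index(common) + pb.index(common)
    return 1.0 if hops == 0 else 1.0 / hops

print(inverse_hop_distance("skateboard", "snowboard"))  # 2 hops -> 0.5
print(inverse_hop_distance("skateboard", "glide"))      # 5 hops -> 0.2
```

More-specialized pairs sit deeper in the tree and therefore score lower, which is the "specializations lead to more hop distance" effect noted above.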
73. • Applications
– Document classification using reduced dimensions
– Find faceted topics automatically through word level tags
– Learn correspondences between perspectives
– Label topics through document level multimedia
– Create recommendations based on perspectives
– Video analysis: word prediction given video features
– Tying “multilingual comparable corpora” through topics
– Multi-document summarization using coherence
– E-Textbook aided discussion forum mining:
• Explore topics through the lens of students and teachers
• Label topics from posts through concepts in the e-textbook
Model Usefulness and Applications
74. Roadmap
Introduction
to LDA
Discovering Voter Preferences Using
Mixtures of Topic Models [AND’09 Oral]
Learning to Summarize
Using Coherence [NIPS
09 Poster]
Core NLP including
summarization, information
extraction, unsupervised
grammar
induction, dependency
parsing, rhetorical
parsing, sentiment and
polarity analysis…
Non-parametric Bayes
Computer Vision and Applications
– Core Technologies
Applied Statistics
Supervised
Learning, Structured
Prediction
Simultaneous Joint and
Conditional Modeling of
Documents Tagged from Two
Perspectives [CIKM 2011 Oral]
76. Previously
Words
forming
other Wiki
articles
Article specific content words
Caption corresponding to the
embedded multimedia
[P. Das, R. K. Srihari and Y. Fu. "Simultaneous Joint and Conditional Modeling of Documents Tagged from Two Perspectives," CIKM, Glasgow, Scotland, 2011]
77. Afterwards
Words
forming
other Wiki
articles
Article specific content words
Caption corresponding to the
embedded multimedia
[P. Das, R. K. Srihari and J. J. Corso. "Translating Related Words to Videos and Back through Latent Topics," WSDM, Rome, Italy, 2013]
78. Expensive frame-wise manual annotation efforts by drawing bounding boxes
Difficulties: camera shakes, camera motion, zooming
Careful consideration needed as to which objects/concepts to annotate
Focus on object/concept detection – noisy for videos in-the-wild
Does not answer which objects/concepts are important for summary generation
Man with
microphone
Climbing
person
Annotations for training object/concept models
Trained Models
Information Extraction from Videos
79. Learning latent translation spaces a.k.a. topics
Human Synopsis: "A young man is climbing an artificial rock wall indoors"
Mixed membership of latent topics
Some topics capture observations that co-occur commonly
Other topics allow for discrimination
Different topics can be responsible for different modalities
No annotations needed – only need clip-level summary
Translating across modalities
MMGLDA model
80. Translating across modalities
Using learnt translation spaces for prediction
Text Translation:
p(w_v | w_d^O, w_d^H) ∝ ( Σ_{o=1..O} Σ_{i=1..K} r_{d,o,i}^O p(w_v | i) ) × ( Σ_{h=1..H} Σ_{i=1..K} r_{d,h,i}^H p(w_v | i) )
Topics are marginalized out to permute the vocabulary for predictions
The lower the correlation among topics, the better the permutation
Sensitive to priors for real-valued data
MMGLDA model
81. Translating across modalities
Use learnt translation spaces for prediction
Text Translation:
p(w_v | w_d^O, w_d^H) ∝ ( Σ_{o=1..O} Σ_{i=1..K} r_{d,o,i}^O p(w_v | i) ) × ( Σ_{h=1..H} Σ_{i=1..K} r_{d,h,i}^H p(w_v | i) )
Topics are marginalized out to permute the vocabulary for predictions
The lower the correlation among topics, the better the permutation
Sensitive to priors for real-valued data
r_{d,o,i}^O: responsibility of topic i over real-valued observations
r_{d,h,i}^H: responsibility of topic i over discrete video features
p(w_v | i): probability of learnt topic i explaining words in the text vocabulary
MMGLDA model
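The marginalize-topics-then-rank-the-vocabulary step can be sketched as follows; the responsibilities, topic-word probabilities, and vocabulary here are all made-up illustrative values, not fitted MMGLDA posteriors:

```python
# Sketch of cross-modal text prediction: each observed video feature
# contributes its topic responsibilities, and each topic votes for
# vocabulary words via p(word | topic). All numbers are illustrative.
import numpy as np

vocab = ["climb", "wall", "lobster", "roll"]   # hypothetical vocabulary
p_w_given_topic = np.array([
    [0.45, 0.45, 0.05, 0.05],   # topic 0: climbing-flavoured
    [0.05, 0.05, 0.45, 0.45],   # topic 1: cooking-flavoured
])

# Hypothetical topic responsibilities for two observed video features:
resp = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
])

scores = resp.sum(axis=0) @ p_w_given_topic    # marginalize topics out
ranking = [vocab[i] for i in np.argsort(scores)[::-1]]
print(ranking)   # the climbing words outrank the cooking words here
```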
82. • We first formulated the MMGLDA model just
two rooms left of where I am standing now!
An aside
83. 1. There is a guy climbing on a rock-climbing wall.
Multiple Human Summaries: (Max 10 words i.e. imposing a length constraint)
2. A man is bouldering at an indoor rock climbing gym.
3. Someone doing indoor rock climbing.
4. A person is practicing indoor rock climbing.
5. A man is doing artificial rock climbing.
To understand: do we speak all that we see?
84. 1. There is a guy climbing on a rock-climbing wall.
Multiple Human Summaries: (Max 10 words for imposing a length constraint)
Hand holding
climbing
surface
How many
rocks?
The sketch in
the board
Wrist-watch
What’s there
in the back?
Color of the
floor/wall
Dress of the
climber
Not so
important!
2. A man is bouldering at an indoor rock climbing gym.
Empty slots
3. Someone doing indoor rock climbing.
4. A person is practicing indoor rock climbing.
5. A man is doing artificial rock climbing.
Summaries point toward information needs!
Center of Attentions: Central Objects and Actions
86. Evaluation: Held out ELBOs
In a purely multinomial MMLDA model, failures of independent
events contribute highly negative terms to the log likelihoods
NOT a measure of keyword summary generation power
Test ELBOs on events 1-5 in the
Dev-T set
Prediction ELBOs on events
1-5 in the Dev-T set
87. Skateboarding
Feeding
animals
Landing fishes
Wedding
ceremony
Woodworking
project
Multimedia
Topic Model
– permute
event specific
vocabularies
Bag of words
multi-document
summaries
Sub-events e.g. skateboarding, snowboarding, surfing
Multiple sets of
documents (sets of
frames in videos)
Natural language
multi-document
summaries
Multiple sentences (group of
segments in frames)
A c-SVM classifier from the libSVM package is used with default settings for multiclass (15 classes) classification
55% test accuracy easily achievable
(completely off-the-shelf)
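That off-the-shelf classification step can be sketched with scikit-learn's SVC (which wraps libSVM); the clustered random features below are hypothetical placeholders for the real video features:

```python
# Sketch: default-settings c-SVM multiclass classification (SVC wraps
# libSVM). Features are random placeholders: each of the 15 "event"
# classes clusters around its own mean.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_classes, n_per_class, dim = 15, 20, 10

X = np.concatenate([rng.normal(loc=c, scale=0.3, size=(n_per_class, dim))
                    for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), n_per_class)

clf = SVC()          # c-SVM with default settings, one-vs-one multiclass
clf.fit(X, y)
print("train accuracy:", clf.score(X, y))
```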
Evaluate using ROUGE-1
HEXTAC 2009: 100-word human references vs. 100-word manually extracted summaries
Average Recall: 0.37916 (95%-confidence interval 0.37187 - 0.38661)
Average Precision: 0.39142 (95%-confidence interval 0.38342 - 0.39923)
Event Classification and Summarization
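The ROUGE-1 numbers above are unigram recall and precision; a simplified single-reference sketch (real ROUGE handles multiple references, stemming and stopword options):

```python
# Sketch of ROUGE-1: clipped unigram overlap between a candidate
# summary and a reference summary.
from collections import Counter

def rouge_1(candidate, reference):
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum(min(c[w], r[w]) for w in c)   # clipped counts
    recall = overlap / sum(r.values())
    precision = overlap / sum(c.values())
    return recall, precision

rec, prec = rouge_1("man climbs rock wall indoors",
                    "a young man climbs an artificial rock wall indoors")
print(round(rec, 3), round(prec, 3))  # 5/9 ≈ 0.556, 5/5 = 1.0
```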
88. Skateboarding
Feeding
animals
Landing fishes
Wedding
ceremony
Woodworking
project
Multimedia
Topic Model
– permute
event specific
vocabularies
Bag of words
multi-document
summaries
Sub-events e.g. skateboarding, snowboarding, surfing
Multiple sets of
documents (sets of
frames in videos)
Natural language
multi-document
summaries
Multiple sentences (group of
segments in frames)
A c-SVM classifier from the libSVM package is used with default settings for multiclass (15 classes) classification
55% test accuracy easily achievable
(completely off-the-shelf)
Event Classification and Summarization
Evaluate using ROUGE-1
HEXTAC 2009: 100-word human references vs. 100-word manually extracted summaries
Average Recall: 0.37916 (95%-confidence interval 0.37187 - 0.38661)
Average Precision: 0.39142 (95%-confidence interval 0.38342 - 0.39923)
If we can achieve 10% of this
for 10 word summaries, we
are doing pretty good!
Caveat – Text multi-document
summarization task is much
more complex
89. MMLDA can show poor ELBO – a bit misleading
Performs quite well on predicting summary-worthy keywords
Sum-normalizing the real-valued data to lie in [0,1]^P distorts reality for Corr-MMGLDA w.r.t. quantitative evaluation
Summary worthiness of predicted keywords is not good but topics are good
MMGLDA produces better topics and higher ELBO
Summary worthiness of keywords almost same as MMLDA for lower n
Evaluation: ROUGE-1 Performance
90. • Simply predicting more and more keywords (or creating sentences out of them) does not improve the relevancy of the generated summaries
• Instead, selecting sentences from the training set in an intuitive way almost doubles the relevancy of the lingual descriptions
Improving ROUGE-1/2 performance
91. YouCook, iAnalyze
ROUGE scores for the “YouCook” dataset [Corso et al.]:

                    Das et al. WSDM 2013    Das et al. CVPR 2013
Precision 2-gram    0.006                   5.14
Precision 1-gram    15.47                   25.76
Recall 2-gram       0.006                   6.49
Recall 1-gram       19.02                   32.87
92. Roadmap
Introduction to LDA
Discovering Voter Preferences Using Mixtures of Topic Models [AND’09 Oral]
Learning to Summarize Using Coherence [NIPS 09 Poster]
Non-parametric Bayes
Computer Vision and Applications – Core Technologies
Translating Related Words to Videos and Back through Latent Topics [WSDM 2013 Oral]
Applied Statistics
Supervised Learning, Structured Prediction
Simultaneous Joint and Conditional Modeling of Documents Tagged from Two Perspectives [CIKM 2011 Oral]
Core NLP including summarization, information extraction, unsupervised grammar induction, dependency parsing, rhetorical parsing, sentiment and polarity analysis…
Using Tag-Topic Models and Rhetorical Structure Trees to Generate Bulleted List Summaries [to be submitted to TOIS]
Linear, Quadratic and Conic Programming Variants
A Thousand Frames in just a Few Words: Lingual Descriptions of Videos through Latent Topic Models and Sparse Object Stitching [CVPR 2013 Spotlight]
93. Just one last thing…
• We want to analyze documents not only for topic discovery but also for turning these
94. Just one last thing…
• into this
A previous study on sleep deprivation that less sleep resulted in impaired glucose metabolism.
Women who slept less than or equal to 5 hours a night were twice as likely to suffer from hypertension than women. [*]
Children ages 3 to 5 years get 11-13 hours of sleep per night.
Chronic sleep deprivation can do more it can also stress your heart.
Sleeping less than eight hours at night, frequent nightmares and difficulty initiating sleep were significantly associated with drinking.
A single night of sleep deprivation can limit the consolidation of memory the next day.
Women’s health is much more at risk. [*]
[*] means that the sentences belong to the same document
95. Just one last thing…
• using these
[Diagram: summarization pipeline] Document sets or “Docsets” (Accidents and Natural Disasters; Attacks; Health and Safety; Endangered Resources; Investigations and Trials) feed a Global Tag-Topic Model trained using documents, alongside per-Docset Local Models over documents and sentences. Sentences from the Docsets are fitted to the learnt model, and each candidate summary sentence for a Docset is weighted from the local and global models.
96. Just one last thing…
• and these
[Diagrams: example Rhetorical Structure Theory (RST) trees, with Nucleus/Satellite leaf spans and relations such as Attribution, Cause, Elaboration, Joint, Explanation and Contrast, over sentences like:]
• “The National Sleep Foundation reported in 2006 that only 20 percent of adolescents get the recommended nine hours of sleep; distractions such as computers or video games in kids’ bedrooms may lessen sleep quality.”
• “Sleep-deprived teens crash just about anywhere because they’re nocturnal and need more than eight hours of sleep per day.”
• “Generations have praised the wisdom of getting up early in the morning, but a Japanese study says early-risers are actually at a higher risk of developing heart problems.”
• “Fortunately for sleepy women, a Penn State College of Medicine study found that they’re much better than men at enduring sleep deprivation, possibly because of ‘profound demands of infant and child care’ placed on them for most of mankind’s history.”
99. • We want to analyze documents not only for topic discovery but also for turning these
• into this
• using these
• and these
• with scores like these
• and these
The final song: Recap
100. The ending…
Interviewer: Do you agree with President Obama’s approach towards Libya?
Presidential Candidate: [Libya??] I just wanted to make sure we're talking about the same thing before I say, 'Yes, I agreed' or 'No I didn't agree.' I do not agree with the way he handled it for the following reason -- nope, that's a different one. I got all this stuff twirling around in my head
• So that we can always have the right information at our fingertips
101. Summary
• Topic models can now talk to structured prediction models
• Efficient text summarization/translation of domain-specific videos is now possible
• With multi-document summarization systems which exploit meaning in text, we are getting closer to our ultimate dream:
– Construct an artificial assistant who can
• Summarize a task using contextual exploratory analysis tools as well as deep NLP and
• Make decisions for us!
102. Future Directions
• Core Algorithms
– Non-parametric Tag2LDA family models
– Address sparsity in tags and scaling of real-valued variables in mixed-domain topic models
– Efficient inference with more structure among hidden variables
• Applications
– Type in text and get an object detector [borrowed from VPML]
– Intention analysis of videographers in social networks and the evolution of intentions over time
– Large-scale visualization using rhetorics and topic analysis
– Large-scale multimedia multi-document summarization
We are shaping a problem space. Each node is a problem and each peak represents a possible solution to that problem. Each problem has associated with it several smaller problems which need to be solved along the way, giving rise to the mountainous terrain. We actually do not see this landscape beforehand and shape it as we move forward. A PhD candidate has to go from one peak to another to get a view of the entire landscape, from where the candidate can put the landscape created by other luminaries in the field in perspective. So this is my long journey, and I did not want to get stuck on one peak only and explore low-lying hills (similar to writing one paper and then merely extending it).
To create a landscape, we need tools to make the roads and clear away obstacles. But once done, it allows other researchers and practitioners to make use of the road infrastructure to build communities and businesses, if the peaks are interesting enough to attract visitors, and, of course, to go from one place to another with ease.
So, let’s get started… the stories of my journey will need some time to be told… and…
The answers are coming in the next one and half hours
A very recent talk by David Blei, who is considered to be the father of topic modeling research, also listed the importance of the problem we tackled as one of the open problems
What do topic models do? For sure, they can identify signature words from a corpus of documents in a data-driven way. Also, you can figure out which of these topics belong to which classes of documents if you have that information. And people really wanted this for a long time!
These models are language-agnostic (multilingual capability). Imagine automatically producing a larger font on some important words in an HTML document – easily done, not just from the words alone but also justified through their coherence properties.
From just counts to richness
Each node is labeled with a word, and each hill brings related nodes together, with the closeness reflected in the lengths of the edges between them. Imagine all such points lying on a flat piece of paper on a uniform 2D grid of equal-length edges. Our job is to re-arrange the nodes, connect them according to their closeness, and create the triangulations so that we can discover the landscape shown here. And we have to do this without the model ever having any idea of the 3D landscape. This brings us to an important question…
+ Success of LDA
+ Almost 660 citations/year!
+ Really widely extended and applied in different contexts
But the success of LDA has really been in its generalization performance when fitting unseen documents to the trained topic space – much better generalization performance than PLSA or LSA. LDA can find a basis for distributions over topics, unlike SVD, which assumes one topic per document or computes a span over the topic vectors. Models improved and they became more and more complex…
+ Comparison of model complexities
+ Y-axis = HL axis, X-axis = model complexity
HL = Hair Loss axis. All of these models address the common problem of looking at central tendencies of data.
Why do we want to explore? We want to explore because we seek wisdom from everything that is happening around us. But where to start? Well, as Yoda points out, we can start at the centers of the data.
Your: Each one of us has our own model of wisdom that gets shaped through our personal exploration of the world around us. Each one of us assumes that there is some hypothesis which gives rise to the data around us.
Centers of data: The big data problem – lots of data around us, but which ones are meaningful? We need statistics from the data that meaningfully encode multiple views, i.e. modalities. Sufficient statistics (i.e. the function of a sample that encodes all information about the sample) usually represent the centers of the data.
+ Let’s start at the central tendencies…
+ We want to go beyond words to full clips to visualize topics!
We have devices which continuously capture data, and we seek wisdom from such large amounts of data. Wisdom is really about looking at a set of representative examples (centers). Wisdom encodes variance in information compactly and completely, and this improves decision making.
What do the centers look like? These are actual outputs from one of our models: the ground-truth synopses overlaid on the multimedia topics obtained from training data.
Assume each data point has an associated binary labelBut we have no training data which is representative of the classes----------------------------------------------
With labels, we can optimize a loss function (similar to interpolation and extrapolation). But we do not have labels, and so we need to make assumptions about some function of the data only which summarizes all observations and how the observations vary from that summary, i.e. find the location and scale estimates as best as possible. Let's choose the algorithm to be K-means, which yields a simple hypothesis set: assign x to \arg\min_{i \in \{0,1\}} d(x, \mu_i).
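That nearest-center decision rule can be sketched as a minimal K-means; the 1-D points below are invented toy data forming two obvious clusters, not the slide's example.

```python
# Minimal K-means sketch (K=2) of the nearest-center rule in the notes:
# each x is assigned to argmin_i d(x, mu_i), then centers are re-estimated.
# The 1-D points here are invented toy data.
import random

def kmeans(points, k=2, iters=20, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)      # initialize from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:
            i = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            clusters[i].append(x)
        # Re-estimate each center as its cluster mean (keep old if empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

pts = [0.1, 0.2, 0.3, 5.0, 5.2, 5.4]
centers = sorted(kmeans(pts))
print(centers)
```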
Let's sample one more point from the ground-truth blue class and see if K-means made the correct decisions based on limited samples but without any ground-truth knowledge.
Clearly there are two misclassifications
We can have additional observed constraints on X: e.g. they are structured into books as a collection of sections which can focus on an idea in a coherent fashion. These structures give rise to co-occurrence, which has been exploited before in IR for thesaurus construction. The better the structure, the better the read – look at the Egyptian man – that man's hair grew white just by scrolling through the scrolls. Rodin's thinker replies to Dr. Corso's Twitter comment on our CVPR paper with "#pow is in #doing".
There is an inherent partitioning of the linear position space of all words. This partitioning is the result of some sort of authorship (LDA with many authors = author-topic model).
The success behind LDA is really about a balancing act. It is not easy to balance perfectly: x_9 and x_10 can be misclassified, since LDA may want to allocate as few topics to d_2 as possible and so chooses the red topic. Well, at least now we know why NikWalLenDA can rope-walk so easily.
Summarization problem (see TAC competitions from NIST)
+ Earlier research on discourse analysis was mainly used for co-reference resolution
+ Has some really intriguing ideas!
For a sequence of utterances to be a discourse, it must exhibit coherence. If we denote U_n and U_{n+1} to be adjacent utterances, the backward-looking center of U_n, denoted C_b(U_n), represents the entity currently being focused on in the discourse after U_n is interpreted. The forward-looking centers of U_n, denoted C_f(U_n), form an ordered list containing the entities mentioned in U_n, all of which can serve as C_b(U_{n+1}). In general, however, C_b(U_{n+1}) is the most highly ranked element of C_f(U_n) mentioned in U_{n+1}. The C_b of the first utterance in a discourse is undefined. Brennan et al. use the following ordering: Subject > Existential predicate nominal > Object > Indirect object or oblique > Demarcated adverbial PP
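The Brennan et al. ordering can be sketched as a simple rank table over grammatical roles; the entity mentions and roles below are a hypothetical parser output invented for illustration.

```python
# Sketch of ranking the forward-looking centers Cf(U_n) by the
# Brennan et al. grammatical-role ordering quoted in the notes.
# The (entity, role) mentions are hypothetical parser output.
RANK = {"subject": 0, "existential": 1, "object": 2,
        "indirect": 3, "adverbial": 4}

def rank_cf(mentions):
    """mentions: list of (entity, grammatical_role) pairs from one utterance."""
    return [e for e, role in sorted(mentions, key=lambda m: RANK[m[1]])]

# U_n: "John gave Mary a book in the garden." (invented example)
cf = rank_cf([("a book", "object"), ("John", "subject"),
              ("the garden", "adverbial"), ("Mary", "indirect")])
cp = cf[0]   # the preferred center Cp(U_n) is the highest-ranked element
print(cf, cp)
```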
+ Inducing a coherence flow comes through a lot of good writing practice
+ Imputing a paragraph with salient concepts comes first to the minds of most authors, and they tend to focus on the topic, which here is {house, door, furniture, burglary}
+ Incorporating coherence this way does not necessarily lead to the final summary being coherent
+ Coherence is best handled as a post-processing step using the Traveling Salesman Problem [Conroy et al.]
+ There are lots of open questions on just multi-document summarization itself…
But what I really wanted is…
to “see” what topics mean?
+ Interpreting topics can still be tedious
+ Most LDA models ignore metadata even if they are useful
These are actual outputs. This is a tough event to match words with frames: the event is “Working on a sewing project”.
This is again another tough event to match words with frames. The event is “Repairing an appliance”
+ Describing a domain specific video with annotated keywords! This can be useful in robotic vision!
+ allowing robots and video recording devices to communicate at a human level
Moving on – PART II
+ At this point I was not sure where I should be moving – I had only a very vague idea!
+ And you actually don’t know if there *are* other peaks!
+ As Yoda pointed out… “Clouded your future is!”
+ So now let’s visit the document space again
+ We look at another model – TagLDA – which can incorporate a certain kind of domain knowledge into LDA
+ Document-partitioned words have associated annotations -> this gives rise to two different distributions over words, and each distribution affects the other
+ A word is observed under the effects of both these distributions
What does this representation buy us?
+ The goal is to assign x_9 and x_10 to their correct cluster with the use of domain knowledge
+ x_5 and x_10 are annotated with the orange label, and x_5 co-occurs with x_9 both in d_1 and d_2
+ It is thus likely that x_5, x_9 and x_10 belong to the same class, since both documents d_1 and d_2 should contain as few topics as possible
+ Fitting a model amounts to forming a hypothesis which can best explain a set of observations
+ TagLDA implicitly expands the hypothesis space of topics to search for the best explanation needed to describe the observations, with the help of the annotations from domain knowledge
What if we assume that there is an additional perspective over d_i w.r.t. x' – is this an unnatural assumption?
Well, not at all!
Word-level tags: hyperlinked text in the body; document-level tags: categories
Word-level tags: question/answer; document-level tags: actual tags for the forum post
Word-level tags: title, image description; document-level tags: tags given by users
Is the bi-perspective nature of documents ubiquitous?
We don’t have annotations, but let’s see how they can be built up! It seems this is a document on the investigation of industrial espionage.
Words to the right are relevant to the topic of the document set – mostly by frequency
Natural language processing based content annotation. Since documents are mostly about some events, certain words strike us – NEs mentioned frequently and across sentences, and dependencies between subjects and objects of the important verbs from the document set.
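The "NEs mentioned frequently and across sentences" idea can be sketched with a crude capitalization-plus-frequency heuristic; a real pipeline would use an NER tagger and a dependency parser. The sentences below are invented for illustration.

```python
# Crude sketch of the annotation heuristic in the notes: treat capitalized
# tokens that recur across sentences as named-entity candidates. A real
# pipeline would use an NER tagger and a dependency parser instead.
# The example sentences are invented.
from collections import Counter

def candidate_entities(sentences, min_sents=2):
    # One set of capitalized tokens per sentence (so counts = #sentences).
    per_sent = [set(w.strip(".,") for w in s.split() if w[:1].isupper())
                for s in sentences]
    counts = Counter(w for sent in per_sent for w in sent)
    return sorted(w for w, c in counts.items() if c >= min_sents)

sents = ["Volkswagen denied the charges.",
         "Lopez joined Volkswagen in 1993.",
         "Opel accused Lopez of taking documents."]
print(candidate_entities(sents))
```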
The word- and document-level tagged words alone are sufficient to summarize the document as bags of words. So are we done with the summarization problem?
And now we want things like these! If you are in doubt, ask any member of Dr. Corso’s VPML Lab. But:
+ High-level descriptions are complex
+ Spoken language is complicated, with high degrees of paraphrasing
The translator does consciously what the author did instinctively.
How can bi-perspective topic models be exploited? The experiments really started off by looking at the image captions and category labels.
This slide is self explanatory
Some people call it a mere combination. But I say it is e-Harmony!
We now cover a particular METag^2LDA model.
\pi: tag (i.e. word annotation) distributions over words
\beta: topic distributions over words
\mu and \sigma are fixed regularizers, i.e. some fixed priors that help in proper scaling of the parameters during optimization
The joint probability distribution belongs to the exponential family, following the maximum entropy principle. In the original model, the hidden variables and parameters are coupled, leading to an exponential state space to search for the right posterior over the hidden variables.
Delete all observations and edges which lead to the passages of the Bayes Ball being obstructed---------------------------------
This results in decoupling of the variables over which the posterior needs to be computed. The more the decoupling, the more tractable the inference.
+ We use fixed regularizers here
+ Introducing exponential family priors for \pi and \beta would need more complicated inference machinery
+ There are several other approximation techniques to compute posteriors and hence marginals; Mean Field is a deterministic local optimization technique, but celebrities have endorsed it
Even Adrian Monk likes Mean Field factorization!
And now let me introduce our friend – the mixture of Gaussians for real-valued data.
+ Keep x_1 and x_2 fixed and try to explain the two samples through different location parameters of the Gaussians via the log likelihood
+ The two surfaces are the error surfaces of the mixture model likelihood for x_1 and x_2 individually
+ For discrete data, the mean parameters of the generating distribution are not discrete
Mixture of two Gaussians model.
+ Keep the two true location parameters fixed and try to explain samples generated at different distances from the two means through the log likelihood
+ There is a relation between the parameters of the distribution over the data (usually unknown) and the sufficient statistics as a function of the data only
Which leads us to…
Mean parameters = expected sufficient statistics. Field = energy arising out of interactions with neighboring nodes (in mathematics, a field is nothing but a space). The \mu_e (the red dots) are the extreme points of the polytope, each a function of the sufficient statistics \Upsilon(z,x) for fixed x. When we optimize over this space, we select one of these red dots and, corresponding to it, there is an optimal mean parameter \mu^{\star}.
Suppose we have the complete data as (Z,X), with Z = hidden variables and X = observations. M(G) is the mean parameter space corresponding to expected sufficient statistics of the hidden variables in the original graphical model G. For discrete distributions, M(G) is a convex polytope due to the intersection of finitely many linear inequalities, i.e. half-spaces. For each fixed x and p_{\theta} there is a \mu, and as p is varied holding x fixed, the set M is formed. \mu provides an alternative parameterization of the exponential family distribution, and any mean parameter in interior(M(G)) yields a lower bound to A(\theta). "Any mean parameter" can mean mean parameters of distributions whose moments can be easily computed, e.g. factored distributions, and those assumptions lead to a non-convex domain over which optimization is performed. A cartoon constraint is shown in the upper right corner. Z|x ~ Mult(\theta); \Upsilon(z) is the sufficient statistics for z.
Log partition functions play an important role in the mapping of \mu to \theta and vice versa. M_F(G) is a subset of M(G) having only the extreme points in common, dependent on the factorization F over Z, which allows discovery of this backward mapping in finite time. The easiest implementation of the mean field principle is to consider no direct dependencies between the distributions of the hidden variables.
Classic estimator-finding problem: maximize the log likelihood, whose objective includes the empirical mean and the log partition function. Classic theorem: maximize over \mu given a set of observations x to get as close to \theta as possible. \mu is dependent on the sufficient statistics associated with the variables whose likelihood we need to maximize. A(\theta) is the log partition function expressed in terms of the dual A*(\mu). The dual, A*(\mu), is maximized at the negative entropy of the distribution over mean parameters when the latter belong to interior(M(G)). The relation between the derivatives of the log partition functions of the primal and the dual is shown in the lower left corner.
Mean field approximation to the joint p(z_1, z_2, z_3): a product of independent Bernoulli distributions p(z_i). In this case, the mean field distributions are exactly in the same exponential family as the true distributions. Write down each q in exponential form with the log partition function (as a function of canonical parameters in this case). Solve for A*(\mu) using the maximum over the dual formulation, yielding \theta(\mu) and A*(\mu). Solving for A(\theta) using A*(\mu) yields \mu(\theta) = exp(\theta)/(1+exp(\theta)).
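A quick numeric check of this Bernoulli example: with A(θ) = log(1 + e^θ), the mean parameter μ(θ) is the derivative of the log partition function, the backward mapping is θ(μ) = log(μ/(1−μ)), and the dual A*(μ) equals the negative entropy. θ = 0.7 is an arbitrary test value.

```python
# Numeric check of the Bernoulli mean-field example: in exponential form
# p(z) = exp(theta*z - A(theta)) with A(theta) = log(1 + e^theta),
# mu(theta) = dA/dtheta = e^theta / (1 + e^theta), and the dual
# A*(mu) = theta(mu)*mu - A(theta(mu)) equals the negative entropy.
import math

def A(theta):
    return math.log(1.0 + math.exp(theta))

def mu(theta):
    return math.exp(theta) / (1.0 + math.exp(theta))

theta = 0.7                                   # arbitrary test point
eps = 1e-6
numeric_grad = (A(theta + eps) - A(theta - eps)) / (2 * eps)  # ~ mu(theta)

m = mu(theta)
theta_back = math.log(m / (1.0 - m))          # backward mapping mu -> theta
A_star = theta_back * m - A(theta_back)       # dual at the optimum
neg_entropy = m * math.log(m) + (1.0 - m) * math.log(1.0 - m)
print(numeric_grad, m, A_star, neg_entropy)
```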
The goal is to find \mu from the sufficient statistics of Z. In practical problems, there are exponentially many extreme points over all realizations of the sufficient statistics. This shows a cartoon illustration of the solution for mean parameters using linear programming.
Unfortunately, the set of \mu s under the factorization assumption is a strict subset of M(G). This subset itself would be convex if it did not have to match its extreme points to those of the enclosing set. The region over which optimization for \mu needs to happen under the tractable distribution assumption is thus non-convex – this means that we won't get a globally optimal solution.
Let us now look at the relation of this formulation to the mean field formulation of METag^2LDA:
\Theta^T \mu = \Theta^T \int \sum_z \Upsilon(\theta, y, z) \, q(\theta, y, z), e.g. the term \sum_{k=1}^K \phi_{m,k} I_k[z_m] \log \beta_k
-A*(\mu) = +H(q) = -\sum q \log q
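Putting the two pieces above together, the mean field lower bound can be written out as a sketch in the notes' notation (q is the factorized variational distribution):

```latex
\mathcal{L}(q) \;=\; \Theta^{\top}\mu \;-\; A^{*}(\mu)
\;=\; \mathbb{E}_{q(\theta,y,z)}\!\bigl[\log p(x,\theta,y,z \mid \Theta)\bigr] \;+\; H(q)
\;\le\; \log p(x \mid \Theta),
\qquad H(q) = -\,\mathbb{E}_q[\log q].
```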
The big red box is the ELBO. Top: E-step inner loop (update variational distributions for every document). Bottom: M-step parameter updates based on mean parameters of the associated document-dependent sufficient statistics. For \beta and \pi, we only do MAP estimation here, corresponding to fixed priors which act as regularizers.
All of this inference machinery *is needed* to generate exploratory outputs like this!
Non-Correspondence topic models vs. Correspondence topic models
+ Within the family of (Corr)MM(E)(Tag2)LDAs modeling joint observations, Corr-METag2LDA performs best
+ We need to be careful about what kind of document-level tags we are considering: do those tags really collaborate in refining the topical perspective?
Cons:
- Collocations need to be addressed
- Chains don’t involve causality, e.g. (fogs & accidents, [hop length = 12])
So what’s next?
I never looked seriously at this paper “Modeling annotated data” until very late (around 2010)
From this to
This (actually the other way around)
Upper row – training (camera motion and shakes are a real problem for maintaining the bounding boxes). Lower row – trained models.
+ Role of alpha – alpha provides a topic for every observation. Alpha is a K-vector.
+ Here each component of alpha is different, which helps assign different proportions of observations differently (e.g. one topic can focus solely on “stop words”, another on “commonly occurring words”, and other ones on the different topics, etc.)
+ This helps identify a set of “basis” topic distributions, while SVD computes a span over topic distributions
Translation formula (marginalization over topics):
- If there are two topics, i.e. K=2, then (e.g. for the 2nd term) 0.5*0.5 + 0.5*0.5 = 0.5 < 0*0.0001 + 0.9*0.9
- Values of the inferred \phi’s are very important for the real-valued data – well-separated Gaussians are better, but that does not always happen
- This raises an issue where the real-valued data may need to be preprocessed to increase the chances of separation
The sum over K is the marginalization over z in p(w,z)
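A toy instance of this marginalization: a word's score given a test video sums over topics, score(w) = Σ_k φ_k β_{k,w}. The topic proportions φ and topic-word distributions β below are invented numbers, not model output.

```python
# Toy sketch of the translation formula: score each word by marginalizing
# over topics, score(w) = sum_k phi_k * beta[k][w]. The phi and beta
# values are invented, not inferred model output.
def translate(phi, beta, vocab):
    return {w: sum(p * beta[k][w] for k, p in enumerate(phi)) for w in vocab}

vocab = ["skateboard", "fish", "cake"]
beta = [{"skateboard": 0.7, "fish": 0.2, "cake": 0.1},   # topic 0
        {"skateboard": 0.1, "fish": 0.8, "cake": 0.1}]   # topic 1
phi = [0.9, 0.1]   # the test video is mostly about topic 0

scores = translate(phi, beta, vocab)
best = max(scores, key=scores.get)
print(scores, best)
```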
Again all of these are needed to translate videos into text and vice versa
This is the core problem of video summarization
Psycholinguistics would be needed to confirm this, but that's not a concern at this point. In our dataset we have only one ground-truth summary – the base case for ROUGE evaluation.
There are no individual summaries for shots within the clip – only one high-level summary. Problems with shot-wise nearest-neighbor matching arise precisely for this reason.
The dataset that we use for the video summarization task is released as part of NIST's 2011 TRECVID Multimedia Event Detection (MED) evaluation set. The dataset consists of a collection of Internet multimedia content posted to various Internet video hosting sites. The training set is organized into 15 event categories, some of which are: 1) Attempting a board trick, 2) Feeding an animal, 3) Landing a fish, 4) Wedding ceremony, 5) Working on a woodworking project, etc.
We use the videos and their textual metadata in all 15 events as training data. There are 2062 clips with summaries in the training set, with an almost equal distribution amongst the events. The test set which we use is called the Transparent Development (Dev-T) collection. The Dev-T collection includes positive instances of the first 5 training events and near-positive instances for the last 10 events – a total of 630 videos labeled with event category information (and associated human synopses which are to be compared against for summarization performance). Each summary is a short and very high-level description of the entire video and ranges from 2 to 40 words, but is 10 words on average (with stopwords). We remove standard English stopwords and retain only the word morphologies (not required) from the synopses as our training vocabularies. The proportion of videos belonging to events 6 through 15 in the Dev-T set is much lower than the proportion for the other events, since those clips are considered to be “related” instances which cover only part of the event category specifications. The performances of our topic models are evaluated on those kinds of clips as well. The numbers of videos in events 6 through 15 in the Dev-T set are {4, 9, 5, 7, 8, 3, 3, 3, 10, 8}, while there are around 120 videos per event for the first 5 events.
All other videos in the Dev-T set neither have any event category label nor are identified as positive, negative or related videos and we do not consider these videos in our experiments.
Test ELBOs on events 1-5 in the Dev-T set – measuring held-out log likelihoods on both videos and associated human summaries. Prediction ELBOs on events 1-5 in the Dev-T set – measuring held-out log likelihoods on just videos, in the absence of the text. A lower inverse covariance contributes high positive values to the log likelihood, and the Gaussian entropy can be high too due to overlapping tails.
The HEXTAC scores can change from dataset to dataset, but they max out at around 40-45% for 100-word summaries.
If we can achieve 10% of this for 10-word summaries, we are doing pretty good! Caveat – the text multi-document summarization task is much more complex than this simpler task (w.r.t. summarization).
Purely multinomial topic models showing lower ELBOs can perform quite well in BoW summarization. MMLDA assigns likelihoods based on the success and failure of independent events, and failures contribute highly negative terms to the log likelihoods, but this does not indicate the model's summarization performance, where low-probability terms are pruned out. Gaussian components allow different but related topics to model GIST features almost equally (strong overlap in the tails of the bell-shaped curves – Gaussians) and show poor permutation of predicted words due to the violation of the soft probabilistic constraint of correspondence (this also leads to higher entropy). The scaling of variables in these kinds of mixed-domain topic models needs to be looked at more closely.
To improve relevancy of the lingual descriptions generated for the domain specific test videos, we present… for the first time ever…
iAnalyze for your videos…
A computer science graduate should never have to cope with information twirling around his head! We need high quality tools to address this problem.
I took the late Amar Gopal Bose’s advice in preparing these slides: I took some time out to prepare them, leaving everything else behind. As Dr. Bose would say, “creativity never comes under emotional stress or tension. The real creativity comes when the mind finally relaxes and it is quiet, and then you can focus.” Watch here [http://www.ndtv.com/video/player/news/remembering-amar-bose/282935?pfrom=home-topstories]. And yes, most of these slides were prepared with Bose iOE2 headphones over my ears.