Translating Related Words to Videos and Back through Latent Topics

Pradipto Das, Rohini K. Srihari and Jason J. Corso
                 SUNY Buffalo
           WSDM 2013, Rome, Italy
WiSDoM is beyond words
"Master Yoda, how do I find wisdom from so many things happening around us?"
"Go to the center of the data and find your wisdom you will."
What do the centers look like?

parkour perform traceur area flip footage jump park urban run outdoor outdoors kid group pedestrian playground

lobster burger dress celery Christmas wrap roll mix tarragon steam season scratch stick live water lemon garlic

floor parkour wall jump handrail locker contestant school run interview block slide indoor perform build tab duck
   Interviews indoors can be tough!

make dog sandwich man outdoors guy bench black sit park white disgustingly toe cough feed rub contest parody
   Be careful what people do with their sandwiches!
The actual ground-truth synopses overlaid

parkour perform traceur area flip footage jump park urban run outdoor outdoors kid group pedestrian playground
   – Kid does parkour around the park
   – Man performs parkour in various ...
   – Footage of group of performing parkour outdoors
   – montage of guys free running up a tree and through the ...

lobster burger dress celery Christmas wrap roll mix tarragon steam season scratch stick live water lemon garlic
   – A family holds a strange burger assembly and wrapping contest at Christmas
   – tutorial: man explains how to make lobster rolls from scratch

floor parkour wall jump handrail locker contestant school run interview block slide indoor perform build tab duck
   – interview with parkour contestants
   Interviews indoors can be tough!

make dog sandwich man outdoors guy bench black sit park white disgustingly toe cough feed rub contest parody
   – One guy is making sandwich outdoors
   Be careful what people do with their sandwiches!
Back to conventional wisdom: Translation
S. Nishimoto, A. T. Vu, T. Naselaris, Y. Benjamini, B. Yu and J. Gallant, "Reconstructing Visual Experiences from Brain Activity Evoked by Natural Movies," Current Biology, Vol. 21(19), 2011

• There is some model that captures the correspondence of the blood flow patterns in the brain to the world being observed
• Given a slightly different pattern we are able to translate them to concepts present in our vocabulary to a lingual description
• Three basic assumptions of Machine Learning are satisfied:
    1) There is a pattern 2) We do not know the target function 3) There is data to learn from

[Figure: Training (LDA), Regression, Testing]

F. Pereira, G. Detre and M. Botvinick, "Generating text from functional brain images," Frontiers in Human Neuroscience, Vol. 5(72), 2011
Giving back to the community:
• Driverless cars are already helping the visually impaired to drive around
• It will be good to enable visually impaired drivers to hear the scenery in front
Do we speak all that we see?

Multiple Human Summaries: (Max 10 words i.e. imposing a length constraint)
1. There is a guy climbing on a rock-climbing wall.
2. A man is bouldering at an indoor rock climbing gym.
3. Someone doing indoor rock climbing.
4. A person is practicing indoor rock climbing.
5. A man is doing artificial rock climbing.

Centers of attention (topics)
Not so ...
   – Hand holding
   – How many ...
   – The sketch in the board
   – What's there in the back?
   – Dress of the ...
   – Empty slots
   – Color of the ...

Summaries point toward information needs!
From patterns to topics to sentences

A young man climbs an artificial rock wall indoors
   – Direct Object
   – Adjective modifier (What kind of wall?)
   – Adverb modifier (climbing where?)

• Spoken language is complex: structured according to various grammars and dependent on active topics
• Different paraphrases describe the same visual input

Major Topic: Rock climbing
Sub-topics: artificial rock wall, indoor rock climbing gym
Object detection models

Annotations for training object/concept models
• Expensive frame-wise manual annotation efforts by drawing bounding boxes
   – Difficulties: camera shakes, camera motion, zooming
• Careful consideration to which objects/concepts to annotate?
• Focus on object/concept detection: noisy for videos in-the-wild
• Does not answer which objects/concepts are important for summary generation?

[Figure: Trained Models, with detections labeled "Man with microphone" and "Climbing person"]
Translating across modalities
Learning latent translation spaces a.k.a. topics

• Mixed membership of latent topics
• Some topics capture observations that co-occur commonly
• Other topics allow for ...
• Different topics can be responsible for different modalities

No annotations needed: only need clip level summary
Human Synopsis: A young man is climbing an artificial rock wall indoors
Translating across modalities
Using learnt translation spaces for prediction

• Topics are marginalized out to permute vocabulary for ...
• The lower the correlation among topics, the better the ...
• Sensitive to priors for real valued data

Text Translation

\[
p(w_v \mid \mathbf{w}_O, \mathbf{w}_H) \approx \sum_{o=1}^{O} \sum_{i=1}^{K} \phi^{(O)}_{d,o,i} \, p(w_v \mid \beta_i) + \sum_{h=1}^{H} \sum_{i=1}^{K} \phi^{(H)}_{d,h,i} \, p(w_v \mid \beta_i)
\]

• \phi^{(O)}_{d,o,i}: responsibility of topic i over real valued observations
• \phi^{(H)}_{d,h,i}: responsibility of topic i over discrete video features
• p(w_v \mid \beta_i): probability of learnt topic i explaining words in the text vocabulary
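The prediction step on this slide scores every vocabulary word by summing, over all observations and all topics, the topic's responsibility for the observation times the topic's probability of the word. A minimal sketch of that scoring follows, assuming the responsibilities and topic-word probabilities have already been learnt. The names `phi_O`, `phi_H`, `beta` and `predict_keywords`, and the toy numbers, are illustrative assumptions and not taken from the talk.

```python
def predict_keywords(phi_O, phi_H, beta, vocab, top_n=10):
    """Rank vocabulary words for one video.

    phi_O: list of per-observation topic responsibilities (real valued features)
    phi_H: list of per-observation topic responsibilities (discrete video features)
    beta:  beta[i][v] = probability of word v under topic i
    All inputs are assumed to come from an already-trained model.
    """
    num_topics = len(beta)

    # Summing responsibilities over observations gives one weight per topic,
    # which is the marginalization over o and h in the scoring rule.
    topic_weight = [0.0] * num_topics
    for responsibilities in list(phi_O) + list(phi_H):
        for i, r in enumerate(responsibilities):
            topic_weight[i] += r

    # Score of word v = sum_i topic_weight[i] * p(w_v | topic i).
    scores = [
        sum(topic_weight[i] * beta[i][v] for i in range(num_topics))
        for v in range(len(vocab))
    ]
    ranked = sorted(range(len(vocab)), key=lambda v: scores[v], reverse=True)
    return [(vocab[v], scores[v]) for v in ranked[:top_n]]


# Toy example with two topics and four words (hypothetical values).
vocab = ["climb", "wall", "fish", "bread"]
beta = [
    [0.50, 0.40, 0.05, 0.05],  # a "rock climbing"-like topic
    [0.05, 0.05, 0.50, 0.40],  # a "feeding fish"-like topic
]
phi_O = [[0.9, 0.1], [0.8, 0.2]]
phi_H = [[0.7, 0.3]]
top_words = predict_keywords(phi_O, phi_H, beta, vocab, top_n=2)
```

Because the topics are summed out, the output is only a ranking of vocabulary words; in the toy example the climbing-related words come first because the climbing-like topic holds most of the responsibility.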
Wisdom of the young padawans
• OB (Object Bank)
   – High level semantic representation of images from low level features
   [L.-J. Li, H. Su, E. P. Xing, and L. Fei-Fei. Object bank: A high-level image representation for scene classification and semantic feature sparsification. In NIPS, 2010]

• HOG3D (Histogram of oriented gradients in 3D)
   – Effective action recognition features for videos
   [A. Klaser, M. Marszalek, and C. Schmid. A spatio-temporal descriptor based on 3d-gradients. In BMVC, 2008]

• Color Histogram:
   – 512 RGB color bins
   – histograms are computed on densely sampled frames
   – large deviations in the extremities of the color spectrum are discarded
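As a rough illustration of the color feature, the sketch below builds a 512-bin RGB histogram for one frame. It assumes the 512 bins come from quantizing each channel into 8 levels (8 x 8 x 8 = 512), which the slide does not state. The function name `rgb_histogram_512` is hypothetical, and the discarding of extreme values mentioned on the slide is not implemented here.

```python
def rgb_histogram_512(pixels, bins_per_channel=8):
    """Normalized RGB histogram of one frame.

    pixels: iterable of (r, g, b) tuples with integer values in 0..255.
    Returns a list of bins_per_channel**3 relative frequencies.
    """
    step = 256 // bins_per_channel  # width of one bin per channel (32 for 8 bins)
    hist = [0] * (bins_per_channel ** 3)
    count = 0
    for r, g, b in pixels:
        # Combine the three quantized channel indices into one bin index.
        idx = ((r // step) * bins_per_channel ** 2
               + (g // step) * bins_per_channel
               + (b // step))
        hist[idx] += 1
        count += 1
    if count == 0:
        return [0.0] * len(hist)
    return [c / count for c in hist]


# Toy frame with four pixels (hypothetical values).
frame = [(0, 0, 0), (255, 255, 255), (255, 255, 255), (40, 0, 0)]
hist = rgb_histogram_512(frame)
```

For densely sampled frames, one such histogram per frame could be averaged or stacked; which of these the authors did is not stated on the slide.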
Wisdom of the young padawans
Scenes from images belonging to different topics and sub-topics

Town hall meeting
   – The video is about a man answering to a question from the podium by using a microphone
   – Two camera men film a cop taking a camera from a woman sitting in a group

Rock climbing
   – A young man climbs an artificial rock wall indoors
   – A man climbs a boulder outdoors with a friend spotting
Wisdom of the young padawans
Global GIST energy [A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vision, 42(3):145-175, 2001]
• Eight perceptual dimensions capture most of the 3D structures of real-world scenes
   – naturalness, openness, perspective or expansion, size or roughness, ruggedness, mean depth, symmetry and complexity

GIST in general terms:
• An energy space that pervades the arrangements of objects
• Does not really care about the specificity of the objects
• Helps us summarize an image even after it has disappeared from our sight
Yoda’s wisdom
It will be nice to have the Force as a ...

For my ally is the Force,
Its energy surrounds us and binds us.
Luminous beings are we,
not this crude matter.
 NIST's 2011 TRECVID Multimedia Event Detection (MED) Events and Dev-T

 Training set is organized into 15 event categories, some of which are: 1)
  Attempting a board trick 2) Feeding an animal 3) Landing a fish 4) Wedding
  ceremony 5) Woodworking project 6) Birthday party 7) Changing a vehicle
  tire 8) Flash mob gathering 9) Getting a vehicle unstuck 10) Grooming an
  animal 11) Making a sandwich 12) Parade 13) Parkour 14) Repairing an
  appliance 15) Working on a sewing project

 Each video has its own high level summary – varies from 2 to 40 words but
  on average 10 words

 2062 clips in the training set and 530 clips for the first 5 events in the Dev-T

 Dev-T summaries are only used as reference summaries for evaluation with
  up to 10 predicted keywords
The summarization perspective
[Figure labels: events such as Skateboarding, Wedding ceremony, Feeding animals and Landing fishes; sub-events e.g. skateboarding, snowboarding, surfing; multiple sets of documents (sets of frames in videos); multiple sentences (group of segments in frames); Topic Model: permute event specific vocabularies; bag of keywords summaries; natural language summaries]
The summarization perspective
Why event specific vocabularies?

Model                       Actual Synopsis         Predicted Words (top 10)
One school of thought       man feeds fish bread    fish jump bread fishing skateboard pole machine car dog cat
Another school of thought   man feeds fish bread    bread shampoo sit condiment place fill plate jump pole fishing

• Intuitively multiple objects and actions are shared across events and many different words get semantically associated
• Prediction quality degenerates rapidly!
[P. Das, R. K. Srihari and Y. Fu. "Simultaneous Joint and Conditional Modeling of Documents Tagged from Two Perspectives," CIKM, Glasgow, Scotland, 2011]
[Figure labels: ... forming other Wiki ...; article specific content words; words corresponding to the embedded multimedia]

[P. Das, R. K. Srihari and J. J. Corso. "Translating Related Words to Videos and Back through Latent Topics," WSDM, Rome, Italy, 2013]
[Figure labels: ... forming other Wiki ...; article specific content words; words corresponding to the embedded multimedia]
The family of multimedia topic models
• Corr-MMGLDA: If a single topic generates a scene, the same topic generates all text in the document. This is a considerable strong point, but a drawback for summary generation if this is not the case
• MMGLDA: More diffuse translation of both visual and textual patterns through the latent translation spaces
   – Intuitively it aids frequency based summarization

Key is to use an asymmetric Dirichlet prior

[Figure: graphical models of MMGLDA and Corr-MMGLDA. Labels: document specific topic proportions; indicator variables; synopses words; GIST features; visual "words"; topic parameters for explaining latent structure within observation ensembles]
Topic modeling performance

• Test ELBOs on events 1-5 in the Dev-T set
   – Measuring held-out log likelihoods on both videos and associated human summaries
• Prediction ELBOs on events 1-5 in the Dev-T set
   – Measuring held-out log likelihoods on just videos in absence of the text
• In a purely multinomial MMLDA model, failures of independent events contribute highly negative terms to the log likelihoods
   – Clearly NOT a measure of keyword summary generation power
• For the MMGLDA family of models, Gaussian components can partially remove the independence through covariance modeling
   – This allows only the responsible topic-Gaussians to contribute to the likelihood
Translating Related Words to Videos

[Figure panels: Corr-MMGLDA, MMGLDA]

                          1         2         3         4         5         6         7         8         9         10
Corr-MMGLDA-α             0.445936  0.451391  0.462443  0.397392  0.374922  0.573839  0.425912  0.375423  0.38186   0.189047
MMGLDA-α                  0.414354  0.422954  0.427442  0.359592  0.353317  0.552872  0.39681   0.349695  0.345466  0.163971
Corr-MMGLDA: log(α/|Λ|)   12.6479   61.7312   50.0512   58.7659   60.1194   104.628   28.2949   31.3856   18.9223   8.164
MMGLDA: log(α/|Λ|)        12.498    61.4666   49.8858   58.643    59.9248   104.623   28.2264   31.2219   18.6953   8.1025
• Corr-MMGLDA is able to capture more variance relative to ...
• ... for Corr-MMGLDA is also slightly higher than that for MMGLDA
• This can allow related but topically unique concepts to appear upfront
Related Words to Videos – Difficult Examples
                                  measure project lady
                                  tape indoor sew
                                  marker pleat
                                  highwaist zigzag
                                  scissor card mark
                                  teach cut fold stitch
                                  pin woman skirt
                                  machine fabric inside
                                  scissors make leather
                                  kilt man beltloop
                                  sew woman fabric
                                  make machine show
                                  baby traditional loom
                                  blouse outdoors
                                  blanket quick
                                  rectangle hood knit
                                  indoor stitch scissors
                                  pin cut iron studio
                                  montage measure kid
                                  penguin dad stuff
Related Words to Videos – Difficult Examples
                                  clock mechanism
                                  repair computer tube
                                  wash machine lapse
                                  click desk mouse time
                                  front wd40 pliers
                                  reattach knob make
                                  level video water
                                  control person clip
                                  part wire inside
                                  indoor whirlpool man
                                  gear machine guy
                                  repair sew fan test
                                  make replace grease
                                  vintage motor box
                                  indoor man tutorial
                                  fuse bypass brush
                                  wrench repairman
                                  lubricate workshop
                                  bottom remove screw
                                  unscrew screwdriver
                                  video wire
A few words is worth a thousand frames!

               From MMGLDA
Event classification and summarization
[Figure labels: events such as Skateboarding, Wedding ceremony, Feeding animals and Landing fishes; sub-events e.g. skateboarding, snowboarding, surfing; multiple sets of documents (sets of frames in videos); multiple sentences (group of segments in frames); Topic Model: permute event specific vocabularies; bag of words multi-document summaries; natural language summaries]

• A c-SVM classifier from the libSVM package is used with default settings for multiclass (15 classes) classification
• 55% test accuracy easily achievable
• Evaluate using ROUGE-1
• HEXTAC 2009: 100-word human references vs. 100-word manually extracted summary
   – Average Recall: 0.37916 (95%-confidence interval 0.37187 - 0.38661)
   – Average Precision: 0.39142 (95%-confidence interval 0.38342 - 0.39923)
- Usually changes from dataset to dataset but max around 40-45% for 100 word system summaries
- If we can achieve 10% of this for 10 word summaries, we are doing pretty good!
- Caveat: the text multi-document summarization task is much more complex than this simpler task
Future Directions:
- Typically lots of features help in classification but do we need all of them for better summary generation?
- Does better event classification performance always mean better summarization performance?
ROUGE-1 performance
• MMLDA can show poor ELBO – a bit ...
• Performs quite well on predicting summary worthy keywords

• MMGLDA produces better topics and higher ELBO
• Summary worthiness of keywords almost same as MMLDA for lower n

• Sum-normalizing the real valued data to lie in [0,1]^P distorts reality for Corr-MMGLDA w.r.t. quantitative evaluation

• Summary worthiness of keywords is not good but topics are good
• Different but related topics can model GIST features almost equally (strong overlap in the tail of the Gaussians)
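For readers unfamiliar with the metric, ROUGE-1 compares the unigrams of a predicted summary with those of a reference summary. The sketch below is a simplified version: it does no stemming or stop-word handling, and it averages over references, which may differ from how the official ROUGE toolkit aggregates multiple references. The function name `rouge_1` and the example keywords are illustrative assumptions.

```python
from collections import Counter


def rouge_1(candidate, references):
    """Return (recall, precision) of unigram overlap, averaged over references.

    candidate:  list of predicted keywords
    references: non-empty list of reference summaries, each a list of words
    """
    if not references:
        raise ValueError("at least one reference summary is required")
    cand_counts = Counter(candidate)
    recalls, precisions = [], []
    for ref in references:
        ref_counts = Counter(ref)
        # Multiset intersection clips each word's match count to the
        # smaller of its counts in the candidate and in the reference.
        overlap = sum((cand_counts & ref_counts).values())
        recalls.append(overlap / max(len(ref), 1))
        precisions.append(overlap / max(len(candidate), 1))
    n = len(references)
    return sum(recalls) / n, sum(precisions) / n


# Ten predicted keywords against one 10-word human summary (toy example).
predicted = "man climb rock wall indoor gym fish bread jump pole".split()
reference = "a man is bouldering at an indoor rock climbing gym".split()
recall, precision = rouge_1(predicted, [reference])
```

Here four words match (man, rock, indoor, gym), so both recall and precision are 0.4. Without stemming, "climb" does not match "climbing", which is one reason short keyword summaries tend to score low on this metric.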
Future Directions
• Need better initialization of priors governing parameters for real valued data
  [N. Nasios and A. G. Bors. Variational learning for Gaussian mixture models. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 36(4):849-862, 2006]
Model usefulness and applications
• Applications
  – Label topics through document level multimedia
  – Movie recommendations through semantically
    related frames
  – Video analysis: word prediction given video features
  – Adword creation through semantics of multimedia
    (Using transcripts only can be noisy)
  – Semantic compression of videos
  – Allowing the visually impaired to hear the world
    through text
Long list of acknowledgements
• Scott McCloskey (Honeywell ACS Labs)
• Sangmin Oh, Amitha Perera (Kitware Inc.)
• Kevin Cannons, Arash Vahdat, Greg Mori (SFU)
For helping us with feature extractions, event classification evaluations and
many fruitful discussions throughout this project

      • Jack Gallant (UC Berkeley)
      • Francisco Pereira (Siemens Corporate Research)
      For allowing us to reuse some of their illustrations in this presentation

• Lucy Vanderwende (Microsoft Research)
• Enrique Alfonseca (Google Research)
For helpful discussions during TAC 2011 on the importance of the
summarization problem outside of the competitions on newswire collections
Long list of acknowledgements
This work was supported by the Intelligence Advanced Research
Projects Activity (IARPA) via Department of Interior National Business
Center contract number D11PC20069. The U.S. Government is
authorized to reproduce and distribute reprints for Governmental
purposes notwithstanding any copyright annotation thereon.
Disclaimer: The views and conclusions contained herein are those of
the authors and should not be interpreted as necessarily representing
the official policies or endorsements, either expressed or implied, of
IARPA, DOI/NBC, or the U.S. Government.

We also thank the anonymous reviewers for their comments

Translating Related Words to Videos and Back through Latent Topics

  • 1. Translating Related Words to Videos and Back through Latent Topics Pradipto Das, Rohini K. Srihari and Jason J. Corso SUNY Buffalo WSDM 2013, Rome, Italy
  • 2. WiSDoM is beyond words Master Yoda, how do I find wisdom Go to the center of the data and from so many things happening find your wisdom you will around us?
  • 3. WiSDoM is beyond words Master Yoda, how do I find wisdom Go to the center of the data and from so many things happening find your wisdom you will around us?
  • 4. How do the centers look like? parkour perform traceur area flip footage jump park urban run lobster burger dress celery Christmas wrap roll mix tarragon outdoor outdoors kid group pedestrian playground steam season scratch stick live water lemon garlic floor parkour wall jump handrail locker contestant school run make dog sandwich man outdoors guy bench black sit park interview block slide indoor perform build tab duck white disgustingly toe cough feed rub contest parody Be careful on what people do with Interviews indoors can be tough! their sandwiches!
  • 5. The actual ground-truth synopses overlaid Man performs Kid does parkour A family holds a strange burger assembly parkour in various around the park and wrapping contest at Christmas locations Footage of group of performing parkour outdoors tutorial: man explains how to parkour perform traceur area flip footageguys free urban run montage of jump park running lobster burger dressmake lobster rolls from scratch celery Christmas wrap roll mix tarragon up a tree and through the outdoor outdoors kid group pedestrian playground steam season scratch stick live water lemon garlic woods interview with parkour contestants One guy is making floor parkour wall jump handrail locker contestant school run sandwich outdoors make dog sandwich man outdoors guy bench black sit park interview block slide indoor perform build tab duck white disgustingly toe cough feed rub contest parody Be careful on what people do with Interviews indoors can be tough! their sandwiches!
  • 6. Back to conventional wisdom: Translation S. Nishimoto, A. T. Vu, T. Naselaris, Y. Benjamini, B. Yu and J. Gallant, “Reconstructing Visual Experiences from Brain Activity Evoked by Natural Movies,” Current biology Vol. 21(19), 2011  There is some model that captures the correspondence of the blood flow patterns in the brain to the world being observed  Given a slightly different pattern we are able to translate them to concepts present in our vocabulary to a lingual description  Three basic assumptions of Machine Learning are satisfied: 1) There is pattern 2) We do not know the target function 3) There is data to learn from Training Testing Topic Model (LDA) Regression F. Pereira, G. Detre and M. Botvinick, "Generating text from functional brain images," In Frontiers in Human Neuroscience, Vol. 5(72), 2011
  • 7. Back to conventional wisdom: Translation S. Nishimoto, A. T. Vu, T. Naselaris, Y. Benjamini, B. Yu and J. Gallant, “Reconstructing Visual Experiences from Brain Activity Evoked by Natural Movies,” Current biology Vol. 21(19), 2011 Giving back to the community:  Driverless blood flow patterns in the  There is some model that captures the correspondence of thecars are already helpingthe brain to the world being observed visually impaired to drive around  It will them to to enable visually  Given a slightly different pattern we are able to translate be good concepts present in our vocabulary to a lingual description impaired drivers to hear the scenery in front  Three basic assumptions of Machine Learning are satisfied: 1) There is pattern 2) We do not know the target function 3) There is data to learn from Training Testing Topic Model (LDA) Regression F. Pereira, G. Detre and M. Botvinick, "Generating text from functional brain images," In Frontiers in Human Neuroscience, Vol. 5(72), 2011
  • 8. Do we speak all that we see? Multiple Human Summaries: (Max 10 words i.e. imposing a length constraint) 1. There is a guy climbing on a rock-climbing wall. 4. A person is practicing indoor rock climbing. 2. A man is bouldering at an indoor rock climbing gym. 5. A man is doing artificial rock climbing. 3. Someone doing indoor rock climbing.
  • 9. Centers of attention (topics) Not so important! Hand holding climbing surface How many rocks? The sketch in the board Wrist-watch What’s there in the back? Dress of the climber Empty slots Color of the floor Multiple Human Summaries: (Max 10 words i.e. imposing a length constraint) 1. There is a guy climbing on a rock-climbing wall. 4. A person is practicing indoor rock climbing. 2. A man is bouldering at an indoor rock climbing gym. 5. A man is doing artificial rock climbing. 3. Someone doing indoor rock climbing. Summaries point toward information needs!
  • 10. From patterns to topics to sentences Adverb modifier (climbing where?) Direct Subject Direct Object A young man climbs an artificial rock wall indoors  Spoken Language is complex – Adjective modifier structured according to various (What kind of wall?) grammars and dependent on active topics  Different paraphrases describe the same visual input Major Topic: Rock climbing Sub-topics: artificial rock wall, indoor rock climbing gym
  • 11. Object detection models Annotations for training object/concept models  Expensive frame-wise manual annotation efforts by drawing bounding boxes  Difficulties: camera shakes, camera motion, zooming  Careful consideration to which objects/concepts to annotate?  Focus on object/concept detection – Man with Climbing noisy for videos in-the-wild microphone person  Does not answer which objects/concepts are important for Trained Models summary generation?
  • 12. Translating across modalities Learning latent translation spaces a.k.a topics  Mixed membership of latent topics  Some topics capture observations that co- occur commonly  Other topics allow for discrimination  Different topics can be responsible for different modalities No annotations Human Synopsis needed – only A young man is need clip level climbing an artificial summary rock wall indoors
  • 13. Translating across modalities Using learnt translation spaces for prediction  Topics are marginalized out to permute vocabulary for predictions  The lower the correlation among topics, the better the permutation  Sensitive to priors for real valued data Text Translation ? p( wv | wO , wH )  O K H K  o 1 i 1 (O ) d , o ,i p( wv | i ) d( H ,)i p( wv | i ) ,h h 1 i 1
  • 14. Translating across modalities Use learnt translation spaces for prediction  Topics are marginalized out to permute vocabulary for predictions  The lower the correlation among topics, the better the permutation  Sensitive to priors for Responsibility of Responsibility of real valued data topic i over real topic i over discrete valued observations video features Text Translation Probability of learnt ? p( wv | wO , wH )  topic i explaining O K H K words in the text  o 1 i 1 (O ) d , o ,i p( wv | i ) d( H ,)i p( wv | i ) ,h h 1 i 1 vocabulary
  • 15. Wisdom of the young padawans OB (Object Bank)  High level semantic representation of images from low level features [L-J. Li, H. Su, E. P. Xing, and L. Fei-fei. Object bank: A high-level image representation for scene classification and semantic feature sparsification. In NIPS, 2010] HOG3D (Histogram of oriented gradients in 3D)  Effective action recognition features for videos [A. Klaser, M. Marszalek, and C. Schmid. A spatio- temporal descriptor based on 3d-gradients. In BMVC, 2008] Color Histogram:  512 RGB color bins  histograms are computed on densely sampled frames  large deviations in the extremities of the color spectrum are discarded
  • 16. Wisdom of the young padawans The video is about a man answering Two camera men film a cop to a question from the podium by taking a camera from a woman using a microphone sitting in a group Town hall meeting Topics Scenes from images belonging to different topics and sub-topics Rock climbing An young man climbs an A man climbs a boulder artificial rock wall indoors outdoors with a friend spotting Sub-Topics
  • 17. Wisdom of the young padawans Global GIST energy [A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vision, 42(3):145{175, 2001.]  eight perceptual dimensions capture most of the 3D structures of real-world scenes  naturalness, openness, perspective or expansion, size or roughness, ruggedness, mean depth, symmetry and complexity GIST in general terms:  An energy space that pervades the arrangements of objects  Does not really care about the specificity of the objects  Helps us summarize an image even after it has disappeared from our sight
  • 18. Yoda’s wisdom The video is about a man answering Two camera men film a cop to a question from the podium by taking a camera from a woman It will bemicrophone using a nice sitting in a group to have the Force as a “feature”! For my ally is the Force, Its energy surrounds us and binds us. An young man climbs an we, Luminous beings are A man climbs a boulder artificial rock wall indoors not this crude matter. outdoors with a friend spotting
  • 19. Datasets  NIST's 2011 TRECVID Multimedia Event Detection (MED) Events and Dev-T datasets  Training set is organized into 15 event categories, some of which are: 1) Attempting a board trick 2) Feeding an animal 3) Landing a fish 4) Wedding ceremony 5) Woodworking project 6) Birthday party 7) Changing a vehicle tire 8) Flash mob gathering 9) Getting a vehicle unstuck 10) Grooming an animal 11) Making a sandwich 12) Parade 13) Parkour 14) Repairing an appliance 15) Working on a sewing project  Each video has its own high level summary – varies from 2 to 40 words but on average 10 words  2062 clips in the training set and 530 clips for the first 5 events in the Dev-T set  Dev-T summaries are only used as reference summaries for evaluation with up to 10 predicted keywords
  • 20. The summarization perspective Sub-events e.g. Multiple sets of Multiple sentences (group of skateboarding, snowboarding, sur fing documents (sets of segments in frames) frames in videos) Multimedia Topic Model Skateboarding – permute Wedding event specific ceremony vocabularies Feeding animals Bag of keywords multi-document summaries Woodworking project Landing fishes Natural language multi-document summaries
  • 21. The summarization perspective Sub-events e.g. Why event snowboarding,vocabularies? sets of skateboarding, specific sur Multiple Multiple sentences (group of fing documents (sets of segments in frames) frames in videos) Multimedia Skateboarding Topic Model – permute Model Actual Synopsis Wedding Predicted Words (top 10) event specific ceremony vocabularies One school man feeds fish fish jump bread fishing skateboard of thought bread pole machine car dog cat Another Feeding feeds fish man bread shampoo sit condiment place school of animals bread fill plate jump pole fishing Bag of keywords thought multi-document  Intuitively multiple objects and actions are shared and many summaries different words across eventsWoodworking semantically get associated project  Prediction quality degenerates rapidly! Landing fishes Natural language multi-document summaries
  • 22. Previously [P. Das, R. K. Srihari and Y. Fu. “Simultaneous Joint and Conditional Modeling of Documents Tagged from Two Perspectives,” CIKM, Glasgow Scotland, 2011] Words forming Article specific content words other Wiki articles Words corresponding to the embedded multimedia
  • 23. Afterwards [P. Das, R. K. Srihari and J. J. Corso. “Translating Related Words to Videos and Back through Latent Topics,” WSDM, Rome, Italy, 2013] Words forming Article specific content words other Wiki articles Words corresponding to the embedded multimedia
  • 24. The family of multimedia topic models • Corr-MMGLDA: If a single topic generates a scene – the same topic generates all text in the document – a considerable strongpoint but a drawback for summary generation if this is not the case • MMGLDA: More diffuse translation of both visual and textual patterns through the latent translation spaces – Intuitively it aids frequency based summarization MMGLDA Corr- Key is to use an asymmetric Dirichlet prior MMGLDA Document specific topic proportions Indicator variables Synopses words GIST features Visual “words” Topic Parameters for explaining latent structure within observation ensembles
  • 25. Topic modeling performance  Test ELBOs on events 1-5 in  Prediction ELBOs on events the Dev-T set 1-5 in the Dev-T set  Measuring held-out log  Measuring held-out log likelihoods on both videos and likelihoods on just videos in associated human summaries absence of the text  In a purely multinomial MMLDA model, failures of independent events contribute highly negative terms to the log likelihoods  Clearly NOT a measure of keyword summary generation power  For the MMGLDA family of models, Gaussian components can partially remove the independence through covariance modeling  This allows only the responsible topic-Gaussians to contribute to the likelihood
  • 26. Translating Related Words to Videos Corr-MMGLDA MMGLDA 1 2 3 4 5 6 7 8 9 10 Corr-MMGLDA-α 0.445936 0.451391 0.462443 0.397392 0.374922 0.573839 0.425912 0.375423 0.38186 0.189047 MMGLDA-α 0.414354 0.422954 0.427442 0.359592 0.353317 0.552872 0.39681 0.349695 0.345466 0.163971 Corr-MMGLDA: log (α/|Λ|) 12.6479 61.7312 50.0512 58.7659 60.1194 104.628 28.2949 31.3856 18.9223 8.164 MMGLDA: log (α/|Λ|) 12.498 61.4666 49.8858 58.643 59.9248 104.623 28.2264 31.2219 18.6953 8.1025
  • 27. Translating Related Words to Videos  Corr-MMGLDA is able to capture more variance relative to MMGLDA Corr-MMGLDA   for CorrMMGLDA is also slightly MMGLDA higher than that for MMGLDA  This can allow related but topically unique concepts to appear upfront 1 2 3 4 5 6 7 8 9 10 Corr-MMGLDA-α 0.445936 0.451391 0.462443 0.397392 0.374922 0.573839 0.425912 0.375423 0.38186 0.189047 MMGLDA-α 0.414354 0.422954 0.427442 0.359592 0.353317 0.552872 0.39681 0.349695 0.345466 0.163971 Corr-MMGLDA: log (α/|Λ|) 12.6479 61.7312 50.0512 58.7659 60.1194 104.628 28.2949 31.3856 18.9223 8.164 MMGLDA: log (α/|Λ|) 12.498 61.4666 49.8858 58.643 59.9248 104.623 28.2264 31.2219 18.6953 8.1025
  • 28. Related Words to Videos – Difficult Examples: measure project lady tape indoor sew marker pleat highwaist zigzag scissor card mark teach cut fold stitch pin woman skirt machine fabric inside scissors make leather kilt man beltloop sew woman fabric make machine show baby traditional loom blouse outdoors blanket quick rectangle hood knit indoor stitch scissors pin cut iron studio montage measure kid penguin dad stuff thread
  • 29. Related Words to Videos – Difficult Examples: clock mechanism repair computer tube wash machine lapse click desk mouse time front wd40 pliers reattach knob make level video water control person clip part wire inside indoor whirlpool man gear machine guy repair sew fan test make replace grease vintage motor box indoor man tutorial fuse bypass brush wrench repairman lubricate workshop bottom remove screw unscrew screwdriver video wire
  • 30. A few words is worth a thousand frames! From MMGLDA
  • 31. A few words is worth a thousand frames! From MMGLDA
  • 32. Event classification and summarization. Diagram: sub-events (e.g. skateboarding, snowboarding, surfing); multiple sets of documents (sets of frames in videos); multiple sentences (group of segments in frames); Multimedia Topic Model, permute event specific vocabularies; events shown: Skateboarding, Wedding ceremony, Feeding animals, Woodworking project, Landing fishes; bag of words multi-document summaries; natural language multi-document summaries. A C-SVM classifier from the libSVM package is used with default settings for multiclass (15 classes) classification; 55% test accuracy easily achievable. Evaluate using ROUGE-1. HEXTAC 2009: 100-word human references vs. 100-word manually extracted summary. Average Recall: 0.37916 (95%-confidence interval 0.37187 - 0.38661). Average Precision: 0.39142 (95%-confidence interval 0.38342 - 0.39923).
  • 33. Event classification and summarization: ROUGE-1 scores usually change from dataset to dataset but max out around 40-45% for 100-word system summaries. If we can achieve 10% of this for 10-word summaries, we are doing pretty well! Caveat: the text multi-document summarization task is much more complex than this simpler task.
  • 34. Event classification and summarization, Future Directions: Typically lots of features help in classification, but do we need all of them for better summary generation? Does better event classification performance always mean better summarization performance?
  • 35. ROUGE-1 performance: MMLDA can show poor ELBO (a bit misleading) but performs quite well on predicting summary worthy keywords. MMGLDA produces better topics and higher ELBO; summary worthiness of keywords is almost the same as MMLDA for lower n. Sum-normalizing the real valued data to lie in [0,1]^P distorts reality for Corr-MMGLDA w.r.t. quantitative evaluation; summary worthiness of keywords is not good but topics are good. Different but related topics can model GIST features almost equally (strong overlap in the tail of the Gaussians).
  • 36. ROUGE-1 performance, Future Directions: Need better initialization of priors governing parameters for real valued data [N. Nasios and A.G. Bors. Variational learning for Gaussian mixture models. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 36(4):849-862, 2006].
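The ROUGE-1 recall and precision figures quoted on the slides above compare unigrams in a system summary against a human reference. A minimal sketch of that idea follows, assuming plain whitespace tokenization and no stemming or stopword removal (the official ROUGE toolkit offers such options; the function name `rouge_1` is hypothetical). The reference synopsis and the predicted keywords are taken from the opening parkour example of this deck.

```python
from collections import Counter

def rouge_1(reference, candidate):
    """Simplified ROUGE-1: clipped unigram overlap between two texts."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared unigram.
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    return recall, precision

reference = "kid does parkour around the park"
predicted = "parkour perform traceur area flip footage jump park urban run"
recall, precision = rouge_1(reference, predicted)
print(f"ROUGE-1 recall={recall:.3f} precision={precision:.3f}")
```

With a bag-of-words keyword summary, word order is irrelevant to this score, which is why ROUGE-1 is a natural fit for evaluating predicted keyword lists.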
  • 37. Model usefulness and applications. Applications: label topics through document level multimedia; movie recommendations through semantically related frames; video analysis: word prediction given video features; Adword creation through semantics of multimedia (using transcripts only can be noisy); semantic compression of videos; allowing the visually impaired to hear the world through text.
  • 38. Long list of acknowledgements: Scott McCloskey (Honeywell ACS Labs); Sangmin Oh, Amitha Perera (Kitware Inc.); Kevin Cannons, Arash Vahdat, Greg Mori (SFU), for helping us with feature extractions, event classification evaluations and many fruitful discussions throughout this project. Jack Gallant (UC Berkeley); Francisco Pereira (Siemens Corporate Research), for allowing us to reuse some of their illustrations in this presentation. Lucy Vanderwende (Microsoft Research); Enrique Alfonseca (Google Research), for helpful discussions during TAC 2011 on the importance of the summarization problem outside of the competitions on newswire collections.
  • 39. Long list of acknowledgements: This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20069. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/NBC, or the U.S. Government. We also thank the anonymous reviewers for their comments.

Editor's notes

  1. Big data problem: lots of data around us, but which ones are meaningful? We need statistics from the data that meaningfully encode multiple views, i.e. modalities. Sufficient statistics (i.e. the function of a sample that encodes all information about the sample) usually represent the centers of the data.
  2. Big data problem: lots of data around us, but which ones are meaningful? We need statistics from the data that meaningfully encode multiple views, i.e. modalities. Sufficient statistics (i.e. the function of a sample that encodes all information about the sample) usually represent the centers of the data.
  3. Centers are the topics, which correspond to some best description of data that are similar in some way. True centers are never known; each one of us has an algorithm for finding centers: our own topic model.
  4. The actual ground-truth synopses overlaid over the training topics
  5. BOLD (Blood Oxygen Level Dependent) and fMRI patterns. Images used with permission from Jack Gallant and Francisco Pereira (by the way, both of them are now applying topic models to map brain patterns to movies or text).
  6. A genuine philanthropic use case
  7. The importance of relating multi-document summarization to video summarization: every frame is a document.
  8. Psycholinguistics are needed to confirm, but that’s not a concern at this point. In our dataset we have only one ground truth summary: the base case for ROUGE evaluation.
  9. Ground truth annotation. Complex high level descriptions. Spoken language is complicated; we map it to a minimal set of features (next).
  10. Upper row: training (camera motion and shakes are a real problem for maintaining the bounding boxes). Lower row: trained models.
  11. Role of alpha: alpha provides a topic for every observation. Alpha is a K-vector. Here each component of alpha is different, which helps assign different proportions of observations to different topics (e.g. one topic can focus solely on “stop-words”, another one on “commonly occurring words” and other ones on the different topics etc.).
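To make the role of an asymmetric alpha concrete, here is a minimal sketch assuming alpha is the parameter of a Dirichlet prior over per-document topic proportions, as in LDA-style models. The alpha values and the helper name `sample_topic_proportions` are hypothetical and chosen only for illustration; they are not the values learned in this work.

```python
import random

def sample_topic_proportions(alpha, rng):
    """Draw topic proportions from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# Hypothetical asymmetric alpha for K = 3 topics.
alpha = [2.0, 1.0, 0.5]

# The prior mean of each topic proportion is alpha_k / sum(alpha).
expected = [a / sum(alpha) for a in alpha]

rng = random.Random(0)
theta = sample_topic_proportions(alpha, rng)
print("expected proportions:", expected)
print("one sampled draw:    ", theta)
```

A larger component of alpha gives its topic a larger share of observations on average, which is how one topic can end up absorbing very common words while others stay specific.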
  12. Translation formula (marginalization over topics). If there are two topics, i.e. K=2, then (e.g. for the 2nd term) 0.5*0.5 + 0.5*0.5 = 0.5 < 0*0.0001 + 0.9*0.9. Values of the inferred \phi's are very important for the real valued data: separated Gaussians are better, but this does not always happen. This raises an issue where the real valued data may need to be preprocessed to increase the chances of separation.
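The arithmetic in the note above can be checked with a small sketch of marginalization over topics: a score is the sum over topics of a per-topic weight for the word times a per-topic weight for the observation. The function name `translation_score` and the two-vector form are assumptions for illustration, not the exact formula from the paper; the numbers are the ones quoted in the note and are not normalized distributions.

```python
def translation_score(word_given_topic, topic_given_obs):
    """Marginalize over topics: sum_k weight(word | k) * weight(k | observation)."""
    return sum(w * t for w, t in zip(word_given_topic, topic_given_obs))

# Two topics (K = 2), flat assignments: neither topic is favored.
flat = translation_score([0.5, 0.5], [0.5, 0.5])

# Two topics, well separated: almost all mass sits on the second topic.
separated = translation_score([0.0, 0.9], [0.0001, 0.9])

print(flat, separated)
```

The separated case scores higher than the flat one, which is why well separated Gaussians (sharp inferred \phi's) matter for the real valued data.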
  13. Object Bank (Computed on keyframes), HOG3D and Color histograms – features through the lens of computer vision
  14. Important references: The principal components of the spectrogram of real-world scenes. The spectrogram is sampled at 4 × 4 spatial locations for a better visualization. Each subimage corresponds to the local energy spectrum at the corresponding spatial location. Global GIST patterns should be different for topics and sub-topics. Another relevant piece of information for image representation concerns the spatial relationships between the main structures in the image. Spatial distribution of spectral information can be described by means of the windowed Fourier transform (WFT).
  15. Red arrow means “lack of the corresponding GIST property” and green means ok. The principal components of the spectrogram of real-world scenes. The spectrogram is sampled at 4 × 4 spatial locations for a better visualization. Each subimage corresponds to the local energy spectrum at the corresponding spatial location. Global GIST patterns are different for topics and sub-topics. Another relevant piece of information for image representation concerns the spatial relationships between the main structures in the image. Spatial distribution of spectral information can be described by means of the windowed Fourier transform (WFT).
  16. The dataset that we use for the video summarization task is released as part of NIST's 2011 TRECVID Multimedia Event Detection (MED) evaluation set. The dataset consists of a collection of Internet multimedia content posted to various Internet video hosting sites. The training set is organized into 15 event categories, some of which are: 1) Attempting a board trick, 2) Feeding an animal, 3) Landing a fish, 4) Wedding ceremony, 5) Working on a woodworking project, etc. We use the videos and their textual metadata in all the 15 events as training data. There are 2062 clips with summaries in the training set, with almost equal distribution amongst the events. The test set which we use is called the Transparent Development (Dev-T) collection. The Dev-T collection includes positive instances of the first 5 training events and near positive instances for the last 10 events: a total of 630 videos labeled with event category information (and associated human synopses which are to be compared against for summarization performance). Each summary is a short and very high level description of the entire video and ranges from 2 to 40 words, but is on average 10 words (with stopwords). We remove standard English stopwords and retain only the word morphologies (not required) from the synopses as our training vocabularies. The proportion of videos belonging to events 6 through 15 in the Dev-T set is much lower than the proportion for the other events, since those clips are considered to be “related” instances which cover only part of the event category specifications. The performances of our topic models are evaluated on those kinds of clips as well. The numbers of videos in events 6 through 15 in the Dev-T set are {4,9,5,7,8,3,3,3,10,8}, while there are around 120 videos per event for the first 5 events.
All other videos in the Dev-T set neither have any event category label nor are identified as positive, negative or related videos and we do not consider these videos in our experiments.
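As a quick sanity check on the Dev-T counts quoted in note 16, the sketch below assumes the 630 labeled videos consist only of the positive clips for events 1-5 plus the related clips for events 6-15 (an assumption; the note does not state this breakdown explicitly). Variable names are hypothetical.

```python
# Related-clip counts for events 6 through 15, as listed in the note.
related_counts = [4, 9, 5, 7, 8, 3, 3, 3, 10, 8]
total_labeled = 630  # labeled Dev-T videos, as stated in the note

related_total = sum(related_counts)
# Assumption: everything else belongs to the first five events.
positive_total = total_labeled - related_total
per_event_average = positive_total / 5

print(related_total, positive_total, per_event_average)
```

Under that assumption the first five events average a little under the "around 120" videos per event mentioned in the note, and the related clips make up less than a tenth of the labeled set.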
  17. There are no individual summaries for shots within the clip, only one high level summary. Problems with shot-wise nearest neighbor matching precisely for this reason?
  18. Why event specific vocabularies
  19. Modeling correspondence of caption words to the main text content which can be annotated in various ways
  20. “Dear Wikipedia readers: We are the small non-profit that runs the #5 website in the world. We have only 150 staff but serve 450 million users”. Finding the reason why it might be so? (Both the main and embedded content reflect coherent topics, e.g. if there appears an irrelevant advertisement, the topic will drift and Wikipedia will lose its appeal.)
  21. Corr-MMGLDA seems to be capturing more variance relative to MMGLDA. \alpha for Corr-MMGLDA is thus slightly higher than that for MMGLDA. Topic parameters over words are seeded through documents during initialization and hence are the same for both models here.
  22. This is a tough event to match words with frames. The event is “Working on a sewing project”. Top row: frames coming from only one video. We do not put a constraint that we can select only 5 frames per video, although this can be easily done. The shown video’s actual synopsis is “One lady is doing sewing project indoors.” Bottom row: better variance. Note how it captures dad sewing kid’s penguin with a needle and thread. First row: “Woman demonstrating different stitches using a serger/sewing machine”. Second row: “dad sewing up stuffed penguin for kids”. Third row: “Woman makes a bordered hem skirt.”; last one: “A pair of hands do a sewing project using a sewing machine.” Other features might help; action, objects, GIST and color may not be enough.
  23. This is again another tough event to match words with frames. The event is “Repairing an appliance”. Top row: frames coming from only one video. Bad example. The shown video’s actual synopsis is “How to repair the water level control mechanism on a Whirlpool washing machine.” Bottom row: better variance. Row 1, Cols 1-3: “a man is repairing a whirlpool washer”; Row 1, Col 4: “how to remove blockage from a washing machine pump”; Row 2, Cols 1-3: “Woman demonstrates replacing a door hinge on a dishwasher”; Row 2, Col 4: “A guy shows how to make repairs on a microwave”; Row 3, Cols 1-3: “How to fix a broken agitator on a Whirlpool washing machine”; Row 3, Col 4: “A guy working on a vintage box fan”. Other features might help; action, objects, GIST and color are not enough.
  24. Usually changes from dataset to dataset but max around 40-45% for 100-word summaries. If we can achieve 10% of this for 10-word summaries, we are doing pretty well!
  25. Caveat – The text multi-document summarization task is much more complex than this simpler task (w.r.t. summarization)
  26. Caveat – The multi-document summarization task is much more complex than this simpler task (w.r.t. summarization)
  27. Purely multinomial topic models showing lower ELBOs can perform quite well in BoW summarization. MMLDA assigns likelihoods based on success and failure of independent events, and failures contribute highly negative terms to the log likelihoods, but this does not indicate the model's summarization performance, where low probability terms are pruned out. Gaussian components can partially remove the independence through covariance modeling, but this can also allow different but related topics to model GIST features almost equally (strong overlap in the tails of the bell shaped curves, i.e. Gaussians) and show poor permutation of predicted words due to the violation of the soft probabilistic constraint of correspondence.
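The point that a poor held-out log likelihood need not imply poor keyword summaries can be illustrated with a toy example. The sketch below uses a made-up predictive distribution over five words (the probabilities are hypothetical, not model output): one unlikely held-out word dominates the log likelihood, yet the pruned top-n keyword list still recovers the other held-out words.

```python
import math

# Hypothetical predictive word distribution for one video (sums to 1).
predicted = {
    "parkour": 0.40,
    "jump": 0.30,
    "park": 0.20,
    "run": 0.099999,
    "celery": 0.000001,
}
held_out = ["parkour", "jump", "celery"]

# Held-out log likelihood: the single "failure" (celery) adds about -13.8.
log_lik = sum(math.log(predicted[w]) for w in held_out)

# Keyword summary: keep only the n most probable words, pruning the tail.
top_n = sorted(predicted, key=predicted.get, reverse=True)[:3]
recovered = [w for w in held_out if w in top_n]

print(f"held-out log likelihood: {log_lik:.2f}")
print("top-3 keywords:", top_n, "recovered:", recovered)
```

Because the low probability term is pruned before the summary is scored, it hurts the likelihood but leaves the keyword overlap untouched.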
  28. There has been some work done for initialization of priors for a Gaussian Mixture Model (GMM) setting but no work has been done on the effects of such initializations for topic models involving Gaussians and Multinomials
  29. Never had the chance to acknowledge them all in the paper