Multimedia Information Retrieval and
User Behavior in Social Media
Eleonora Ciceri, ciceri@elet.polimi.it

Date 22/10/2012
Outline

✤   Multimedia Information Retrieval on large data sets
    ✤ The “giants” of photo uploads

    ✤ Image search

    ✤ Descriptors

    ✤ Bag of Visual Words




✤   Analyzing User Motivations in Video Blogging
    ✤ What is a video blog?

    ✤ Non-verbal communication


    ✤ Automatic processing pipeline

    ✤ Cues extraction & Results

    ✤ Cues vs. Social Attention
Multimedia Information Retrieval
on large data sets
The “giants” of photo uploads

✤   Flickr uploads: (source: http://www.flickr.com/)
    ✤ 1.54 million photos per day on average

    ✤ 51 million users

    ✤ 6 billion images



✤   Facebook uploads: (source: http://thenextweb.com/)
    ✤ 250 million photos per day on average

    ✤ 845 million users in February 2012

    ✤ 90+ billion photos in August 2011



✤   “Flickr hits 6 billion total photos, Facebook does that every two months”
Image search

✤   Query by example: look for a particular object / scene / location in a
    collection of images
Image search

✤   Copy detection




✤   Annotation / Classification / Detection




          [Three example images annotated: “dog” — “dog”? — “dog” + “child”]
Descriptors

✤   How can we look for similar images?

    ✤   Compute a descriptor: mathematical representation

    ✤   Find similar descriptors

✤   Problem: occlusions, changes in rotations-scale-lighting
Descriptors

✤   How can we look for similar images?

    ✤   Compute a descriptor: mathematical representation

    ✤   Find similar descriptors

✤   Solution: invariant descriptors (to scale / rotation...)
Global descriptors

✤   Global descriptors: one descriptor per image (highly scalable)

✤   Color histogram: representation of the distribution of colors

    ✤   Pros: high invariance to many transformations

    ✤   Cons: high invariance to TOO many transformations (limited
        discriminative power)
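The color-histogram idea above can be sketched in a few lines. This is a minimal illustration, not a production descriptor: images are assumed to be flat lists of (r, g, b) pixels, and the 4-bins-per-channel quantization is an illustrative choice.

```python
# Sketch: a global color-histogram descriptor with coarse RGB quantization,
# compared via histogram intersection (1.0 = identical color distributions).

def color_histogram(pixels, bins_per_channel=4):
    """Return a normalized histogram over quantized RGB colors."""
    step = 256 // bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        idx = ((r // step) * bins_per_channel ** 2
               + (g // step) * bins_per_channel
               + (b // step))
        hist[idx] += 1
    n = len(pixels)
    return [h / n for h in hist]

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: sum of per-bin minima."""
    return sum(min(a, b) for a, b in zip(h1, h2))

red_image  = [(250, 10, 10)] * 100
red_darker = [(200, 20, 20)] * 100   # different shades, same coarse bins
blue_image = [(10, 10, 250)] * 100

h_red, h_red2, h_blue = map(color_histogram, (red_image, red_darker, blue_image))
print(histogram_intersection(h_red, h_red2))  # high: similar colors
print(histogram_intersection(h_red, h_blue))  # low: different colors
```

Note how the two red images match perfectly despite different shades: that is exactly the "invariance to TOO many transformations" problem — many visually different images collapse onto the same histogram.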
Local descriptors

✤   Local descriptors: find regions of interest that will be exploited for
    image comparison

✤   SIFT: Scale Invariant Feature Transform

    ✤   Extract key-points (maxima and minima in the Difference of
        Gaussian image)

    ✤   Assign orientation to key-points (result: rotation invariance)

    ✤   Generate the feature vector for each key-point
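The key-point extraction step can be illustrated on a 1-D signal: blur at two scales, subtract to get a Difference of Gaussians, and keep local extrema. This is only a toy sketch of the first SIFT step — real SIFT operates on 2-D scale-space pyramids; the sigma values and the test signal are illustrative.

```python
import math

def smooth(signal, sigma):
    """Convolve with a truncated Gaussian kernel (radius = 3*sigma)."""
    radius = int(3 * sigma)
    kernel = [math.exp(-(k * k) / (2 * sigma * sigma))
              for k in range(-radius, radius + 1)]
    norm = sum(kernel)
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k in range(-radius, radius + 1):
            j = min(max(i + k, 0), n - 1)  # clamp at the borders
            acc += kernel[k + radius] * signal[j]
        out.append(acc / norm)
    return out

def dog_extrema(signal, sigma1=1.0, sigma2=2.0):
    """Indices where the Difference of Gaussians has a local max or min."""
    dog = [a - b for a, b in zip(smooth(signal, sigma1), smooth(signal, sigma2))]
    return [i for i in range(1, len(dog) - 1)
            if (dog[i] - dog[i - 1]) * (dog[i] - dog[i + 1]) > 0]

signal = [0.0] * 6 + [10.0] + [0.0] * 6  # one sharp blob at index 6
print(dog_extrema(signal))               # includes the blob position 6
```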
Direct matching
                                                                         query
✤   Assumption:                                                          image

    ✤   m=1000 descriptors for one image
    ✤   Each descriptor has d=128 dimensions
    ✤   N>1000000 images in the data set

✤   Search: a query is submitted; results are retrieved

    ✤   Each descriptor of the query image is tested against each
        descriptor of the images in the data set

    ✤   Complexity: m²·N·d elementary operations; Required space: ???
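Plugging the slide's numbers in makes the point concrete. The 4 bytes per float is an assumption (single-precision descriptors):

```python
# Back-of-the-envelope cost of direct descriptor matching with the
# slide's numbers: m = 1000 descriptors/image, d = 128 dimensions,
# N = 1,000,000 images in the data set.

m, d, N = 1000, 128, 1_000_000

operations = m * m * N * d          # m^2 * N * d elementary operations
storage_floats = N * m * d          # one d-dim descriptor per key-point
storage_gb = storage_floats * 4 / 1024 ** 3   # assuming 4-byte floats

print(f"{operations:.2e} operations")         # ~1.28e14
print(f"{storage_gb:.0f} GB of descriptors")  # hundreds of GB
```

Both numbers are far beyond interactive search on raw descriptors, which is what motivates the Bag of Visual Words representation on the next slide.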
Bag of Visual Words

✤   Objective: “put the images into words” (visual words)
✤   What is a visual word? “A small part of the image that carries some
    kind of information related to the features” [Wikipedia]

✤   Analogy Text-Image:

    ✤   Visual word: small patch of the image

    ✤   Visual term: cluster of patches that give the same information

    ✤   Bag of visual words: collection of visual words that together
        convey the overall meaning of the image
Bag of Visual Words

✤   How to build a visual dictionary?

    ✤   Local descriptors are clustered

    ✤   A local descriptor is assigned to its nearest neighbor:

        q(x) = arg min_{w ∈ ω} ‖x − µ_w‖²

        where µ_w is the mean of cluster w, and ω is the visual dictionary
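The assignment step can be sketched directly from the formula: each descriptor x is mapped to the visual word whose cluster mean µ_w is nearest. The 2-D toy dictionary below is illustrative (real descriptors would be 128-dimensional SIFT vectors):

```python
# Sketch of visual-word assignment q(x) = argmin_w ||x - mu_w||^2.

def quantize(x, dictionary):
    """Return the index of the visual word whose mean is closest to x."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(dictionary)), key=lambda w: sq_dist(x, dictionary[w]))

# mu_w for three visual words (cluster means found by prior clustering)
dictionary = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]

print(quantize((0.5, 1.0), dictionary))   # -> 0 (nearest to (0, 0))
print(quantize((9.0, 1.0), dictionary))   # -> 2 (nearest to (10, 0))
```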
Why Visual Words?

✤   Pros:

    ✤   Much more compact representation

    ✤   We can leverage text retrieval techniques in image retrieval
        systems


    [Pipeline: descriptors in R^d → visual-word vectors → find similar vectors → results]

                         f(t, d)                    |D|
    tfidf(t, d, D) = -------------------  ·  log -------------------
                     max{f(w, d) : w ∈ d}        |{d ∈ D : t ∈ d}|
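The tf-idf weighting above carries over directly to "visual documents": each image becomes a bag of visual-word ids. A minimal sketch, with an illustrative toy corpus:

```python
import math

# tf-idf as on the slide: term frequency normalized by the most frequent
# term in the document, times the (log) inverse document frequency.

def tfidf(t, doc, corpus):
    tf = doc.count(t) / max(doc.count(w) for w in doc)
    df = sum(1 for d in corpus if t in d)
    return tf * math.log(len(corpus) / df)

corpus = [
    ["w1", "w1", "w2"],   # image 1, as a bag of visual words
    ["w2", "w3"],         # image 2
    ["w3", "w3", "w3"],   # image 3
]

print(tfidf("w1", corpus[0], corpus))  # w1 occurs in one image: high weight
print(tfidf("w2", corpus[0], corpus))  # w2 occurs in two images: lower weight
```

Words that appear in many images (like common textures) get down-weighted, exactly as stop words do in text retrieval.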
Analyzing User Motivations in
Video Blogging
What is a video blog?

✤   Video blog (vlog): conversational video in which people (usually a
    single person) talk facing the camera, addressing the audience in a
    Skype-style fashion

    ✤   Examples: video testimonial (companies pay for testing products),
        video advice (e.g., how to get dressed for a party), discussions
Why vlogs are used


    [Word cloud around COMMUNITY: corporate communication, marketing,
    e-learning, entertainment, life documentary, daily interaction,
    high participation, discussion, critique, comments, ratings]
Why vlogs are studied

✤   Why are vlogs relevant?

    ✤   Automatic analysis of personal websites, blogs and social networks
        is limited to text (in order to understand users’ motivations)

        ✤   Vlogs are a new type of social media (40% of the most viewed
            videos on YouTube): how can we analyze them automatically?

    ✤   Study a real-life communication scenario

        ✤   Human judgements are based on first impressions: can we
            predict them?
Real communication vs. vlogs

✤   Real communication              ✤   Vlog

✤   Synchronous                     ✤   Asynchronous

✤   Two (or more) people interact   ✤   Monologue

                                    ✤   Metadata
Non-verbal communication

✤   The process of communication through sending and receiving
    wordless/visual cues between people

                            Body             Speech

                           Gestures        Voice quality
                             Touch             Rate
                        Body language          Pitch
                            Posture          Volume
                       Facial expression     Rhythm
                         Eye contact        Intonation

✤   Why? To express aspects of identity (age, occupation, culture,
    personality)
An example: dominance


✤   Power: capacity or right to control others

✤   Dominance: way of exerting power
    involving the motive to control others

    ✤   Behavior: talk louder, talk longer, speak
        first, interrupt more, add gestures,
        receive more visual attention
Automatic processing pipeline
✤   Pipeline (from Biel & Gatica-Perez, “VlogSense: Conversational
    Behavior and Social Attention in YouTube”):

    ✤   Shot boundary detection (based on color histogram differences
        between consecutive frames)

    ✤   Face detection on each shot (Viola-Jones algorithm)

    ✤   Shot selection: discard shots without faces and short shots
        (vlogger not talking)

    ✤   Audio and visual cue extraction (for each conversational shot)

    ✤   Aggregate shot-level cues into video-level cues
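Step 1 of the pipeline can be sketched as thresholding the distance between color histograms of consecutive frames. For simplicity this sketch uses an L1 distance on precomputed per-frame histograms (the paper uses the Bhattacharyya distance); the threshold and histograms are illustrative.

```python
# Sketch of histogram-based shot boundary detection.

def l1_distance(h1, h2):
    return sum(abs(a - b) for a, b in zip(h1, h2))

def shot_boundaries(histograms, threshold=0.5):
    """Indices i where a new shot starts between frame i-1 and frame i."""
    return [i for i in range(1, len(histograms))
            if l1_distance(histograms[i - 1], histograms[i]) > threshold]

# Toy video: three frames of one shot, then a hard cut to a different shot
histograms = [
    [0.90, 0.10, 0.00],
    [0.85, 0.15, 0.00],
    [0.90, 0.10, 0.00],
    [0.10, 0.20, 0.70],   # cut here: color distribution changes sharply
    [0.10, 0.25, 0.65],
]
print(shot_boundaries(histograms))  # -> [3]
```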
Visual cue extraction




✤   Weighted Motion Energy Images (wMEI):

        wMEI = Σ_{f ∈ V} D(f)

    where D(f) is the binary image containing the moving pixels in frame f

    ✤   A wMEI value indicates the visual activity of a pixel
        (motion accumulated through the video)

    ✤   Brighter pixels: regions with higher motion, reflecting the
        conversational nature of vlogs

    [Figure 2: wMEI images for two vlogs]
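The wMEI accumulation can be sketched as summing per-frame binary motion masks and normalizing so brighter cells mark pixels that moved more often. The 2x2 toy frames and the motion threshold below are illustrative, not from the paper.

```python
# Sketch of a weighted Motion Energy Image on toy gray-level frames.

def motion_mask(prev, curr, threshold=10):
    """D(f): 1 where a pixel changed noticeably between two frames."""
    return [[1 if abs(c - p) > threshold else 0 for p, c in zip(pr, cr)]
            for pr, cr in zip(prev, curr)]

def wmei(frames):
    """Sum of binary motion masks over the video, normalized to [0, 1]."""
    h, w = len(frames[0]), len(frames[0][0])
    acc = [[0] * w for _ in range(h)]
    for prev, curr in zip(frames, frames[1:]):
        mask = motion_mask(prev, curr)
        for i in range(h):
            for j in range(w):
                acc[i][j] += mask[i][j]
    n = len(frames) - 1
    return [[v / n for v in row] for row in acc]

# Top-left pixel changes in every frame; the rest stay still
frames = [[[0, 50], [50, 50]],
          [[100, 50], [50, 50]],
          [[0, 50], [50, 50]]]
print(wmei(frames))  # -> [[1.0, 0.0], [0.0, 0.0]]
```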
Visual cue extraction
✤   It is difficult to estimate the actual direction of the eyes

    ✤   If the face is in a frontal position, the vlogger is most likely
        looking at the camera

    ✤   We are therefore interested in frontal face detection

✤   Implementation note (from the VlogSense paper): shot discontinuities
    are found by thresholding the Bhattacharyya distance between RGB color
    histograms of consecutive frames; faces are detected with the boosted
    classifiers and Haar-like features of the Viola-Jones algorithm
    (OpenCV 2.0 implementation, faces as small as 20x20 pixels);
    conversational shots are selected by a linear combination of the face
    detection rate and the shot duration relative to the video duration
Visual cue extraction

✤   Looking time: looking activity (how much the vlogger looks at the
    camera)

✤   Looking segment length: persistence of the vlogger’s gaze

✤   Looking turns: looking activity

✤   Proximity to camera: choice of addressing the camera from close-ups

✤   Vertical framing: how much the vlogger shows the upper body

✤   Vertical head motion
Visual cue extraction

✤   Looking time:            Σ_{L ∈ V} t_L / t_V

✤   Looking segment length:  Σ_{L ∈ V} t_L / N_L
                             (N_L: number of looking segments)

✤   Looking turns:           N_L / t_V

✤   Proximity to camera:     Σ_{f ∈ V} A_face(f) / (N_f · A(f))
                             (face area over frame area)

✤   Vertical framing:        Σ_{f ∈ V} ‖c_face(f) − c(f)‖ / (N_f · f_h)
                             (face-center offset from the frame center,
                             normalized by frame height)

✤   Vertical head motion:    σ(‖c_face(f) − c(f)‖) / µ(‖c_face(f) − c(f)‖)
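The segment-based looking cues can be computed directly from a list of (start, end) looking segments and the video duration t_V. The segment values below are illustrative.

```python
# Sketch: looking time, avg looking segment length, and looking turns.

def looking_cues(segments, video_duration):
    total_looking = sum(end - start for start, end in segments)
    n = len(segments)
    return {
        "looking_time": total_looking / video_duration,   # sum t_L / t_V
        "segment_length": total_looking / n,              # sum t_L / N_L
        "looking_turns": n / video_duration,              # N_L / t_V
    }

segments = [(0.0, 8.0), (10.0, 14.0), (16.0, 20.0)]  # seconds
cues = looking_cues(segments, video_duration=20.0)
print(cues)  # looks at the camera 80% of the time, 3 looking turns
```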
Audio cue extraction
✤   Speaking time: speaking activity (how much the vlogger talks)

✤   Speech segment avg length: fluency (duration and number of silent
    pauses)

✤   Speaking turns: fluency

✤   Voicing rate: fluency (# of phonemes, i.e. how fast the vlogger
    speaks)

✤   Speaking energy: emotional stability (how well the vlogger controls
    loudness)

✤   Pitch variation: emotional state (how well the vlogger controls tone)
Audio cue extraction

✤   Speaking time:             Σ_{S ∈ V} t_S / t_V
                               (speech segment durations over video
                               duration)

✤   Speech segment avg length: Σ_{S ∈ V} t_S / N_S
                               (N_S: number of speech segments)

✤   Speaking turns:            N_S / t_V

✤   Voicing rate:              # of phonemes / Σ_{S ∈ V} t_S

✤   Speaking energy:           σ(S_energy) / µ(S_energy)

✤   Pitch variation:           σ(pitch) / µ(pitch)
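The audio cues follow the same pattern as the visual ones: ratios over (start, end) speech segments, plus coefficient-of-variation cues (σ/µ) for energy and pitch. The segment times and pitch track below are illustrative.

```python
import statistics

# Sketch: speaking cues from speech segments, plus a sigma/mu variation cue.

def speaking_cues(segments, video_duration):
    total_speech = sum(end - start for start, end in segments)
    n = len(segments)
    return {
        "speaking_time": total_speech / video_duration,   # sum t_S / t_V
        "avg_segment_length": total_speech / n,           # sum t_S / N_S
        "speaking_turns": n / video_duration,             # N_S / t_V
    }

def variation(values):
    """Coefficient of variation sigma/mu, used for energy and pitch cues."""
    return statistics.pstdev(values) / statistics.mean(values)

segments = [(0.0, 5.0), (6.0, 9.0), (10.0, 18.0)]  # seconds
print(speaking_cues(segments, video_duration=20.0))

pitch_track = [110.0, 120.0, 130.0, 120.0]  # per-frame pitch estimates (Hz)
print(round(variation(pitch_track), 3))
```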
Combining audio and visual cues
✤   Combine “looking at the camera” with “speaking”: four modalities
    (looking & speaking, looking only, speaking only, neither)

✤   These measures are used to determine dominance in dyadic
    conversations

    ✤   Looking-while-speaking: dominant people

✤   Tuning note (from the VlogSense paper): shot boundary detection and
    conversational shot selection were tuned on a development set of 100
    vlogs (168 annotated hard shot cuts); the best shot-boundary
    performance (EER = 15%) was obtained with a global threshold γ = 0.5
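The 2x2 combination of looking and speaking can be sketched by classifying each time step into one of the four modalities and reporting the time fraction of each. The per-step boolean tracks below are illustrative.

```python
# Sketch: fractions of time spent in each looking/speaking modality.

def multimodal_fractions(looking, speaking):
    """looking/speaking: parallel boolean sequences sampled per time step."""
    counts = {"look+speak": 0, "look-only": 0, "speak-only": 0, "neither": 0}
    for l, s in zip(looking, speaking):
        if l and s:
            counts["look+speak"] += 1
        elif l:
            counts["look-only"] += 1
        elif s:
            counts["speak-only"] += 1
        else:
            counts["neither"] += 1
    n = len(looking)
    return {k: v / n for k, v in counts.items()}

looking  = [1, 1, 1, 0, 0, 1, 1, 1]
speaking = [1, 1, 0, 0, 1, 1, 1, 0]
fractions = multimodal_fractions(looking, speaking)
print(fractions)  # looking-while-speaking dominates in this toy track
```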
Analysis: video editing elements

✤   Editing elements, manually coded, that support the core conversational
    part

    ✤   Snippets (opening, ending, intermediate), background music,
        objects brought toward the camera



     opening                               object toward     intermediate                     object toward   ending
     snippet                                the camera          snippet                        the camera     snippet



               Results
               ✤   Snippets: 45% of vlogs (16% - 20% with opening/endings, 32% with intermediate snippets)
                   ✤     Videos without snippets are monologues
                   ✤     Snippets tend to be a small fraction of the content of the video (~10%)
               ✤   Audio: 25% using soundtrack on snippets, 12% using music on the entire video
               ✤   Objects: 26% of vloggers bring the object toward the camera
Analysis: non-verbal behavior

✤   Vloggers are mainly talking: 85% of vloggers talk for more than half
    of the time

✤   Speaking segments tend to be short (hesitations and low fluency)
    [Histograms across vlogs: time speaking (ratio), avg length of speech
    segments (s), number of turns (Hz)]
Analysis: non-verbal behavior
✤   50% of vloggers look at the camera over 90% of the time, at a
    “standard” distance to the camera (not too close, not too far),
    showing the upper body

    [Histograms across vlogs: looking, proximity, and framing cues]
                                                                                                                                                                                                                                                    0
                                                                                                                                                                                                                                                         0.0    0.2 0.4 0.6 0.8 1.0
                                                                                                                                                                                                                                                                                                          0
                                                                                                                                                                                                                                                                                                               0.00 0.02 0.04 0.06 0.08 0.10
                                                                                                           Time speaking (ratio)                                               Avg Leng. of Speech seg. (s)                                                    Number of turns (Hz)                                   Voicing rate (Hz)


                                                                            VlogSense: Conversational Behavior and Social Attention in YouTube
                                                                                                                                        30                                                                                                                             •       1:13                       25
                                                                                                  25




                                                                                                                                                                                                                                                                                       Percent of Total
                                                                               Percent of Total




                                                                                                                                                       Percent of Total




                                                                                                                                                                                                                                 Percent of Total
                                                                                                                                                                          10                                                                                                                              20
                                                                                                  20
                                                                                                                                                                                                                                                    20                                                    15
                                                                                                  15
                   15                                                                             10                                            20                        5                                                                                                                               10
                                                                       25                                                                                                                                                  20                       10
Percent of Total




                                                    Percent of Total




                                                                                                                             Percent of Total




                                                                                                                                                                                                        Percent of Total
                                                                                                  5                                                                                                                                                                                                        5
                                                                       20                                                                       15
                   10                                                                                                                                                                                                      15
                                                                                                  0                                                                       0                                                                          0                                                     0
                                                                       15
                                                                                                         0.2 0.4 0.6 0.8 1.0    10                                             0.0 0.1 0.2 0.3 0.4 0.510                                                  !2     !1       0       1                              0       5    10       15
                    5                                                  10                                  Time looking (ratio)                                                  Proximity to camera (ratio)                                                Vertical framing (ratio)                                    LS/LNS
                                                                        5                                                        5                                                                           5
                                                                              Fig. 8: Selected nonverbal cue distributions for conversational shots in YouTube vlogs: four audio cues, three visual cues, and
                    0                                                   0     one multimodal.             0                                 0
✤                   Vloggers look at the camera when they speak more frequently than
                        0.0 0.2 0.4 0.6 0.8 1.0
                            Time speaking (ratio)
                                                                            0    2    4    6    8 10
                                                                            Avg Leng. of Speech seg. (s)
                                                                                                                                                     0.0              0.2 0.4 0.6 0.8 1.0
                                                                                                                                                                     Number of turns (Hz)
                                                                                                                                                                                                                                0.00 0.02 0.04 0.06 0.08 0.10
                                                                                                                                                                                                                                       Voicing rate (Hz)

                    when they are silent the fact that most of the vlogs are composed of few conversational shots (see Section 6.1).
                   25              result from                                                                                                  30                                                                         25
                                                                                                                                                                                                        Percent of Total
Percent of Total




                                                    Percent of Total




                                                                                                                             Percent of Total




                                   These distributions unveil information that may be useful to understand some basic characteristics
                                                                       10                                                                                                                                                  20
                    ✤ Behavior of dominant people For example, the speaking time distribution, biased towards high
                   20
                                                                                                                                                20
                   15              of nonverbal behavior in vlogging.                                                                                                                                                      15

                   10                     speaking times (median = 0.65, mean = 0.67, sd10 0.15), shows that 85% of the conversational shots
                                                                    10 5                           =
    5                                     contain speech more than half of the time, which suggests that vloggers who were perceived as mainly
                                                                                                   5
    0                               0     talking during the annotation process (Section 4) are indeed speaking for a significant proportion of
                                                                     0                             0
         0.2 0.4 0.6 0.8 1.0          0.0 the time. Speaking segments !1
                                           0.1 0.2 0.3 0.4 0.5         !2      tend 0 be1 short (median = 1.98s, mean = 2.36s, sd = 1.36s), which is
                                                                                       to             0      5     10    15
           Time looking (ratio)         Proximity to camera (ratio)      Vertical framing (ratio)           LS/LNS
                                          common in spontaneous speech, typically characterized by higher numbers of hesitations and lower
Fig. 8: Selected nonverbal cue distributions for conversational shots in YouTube vlogs: four audio cues, three visualper second (median = mean = 0.33,
                                          fluency [Levelt 1989]. The median number of speaking turns cues, and
one multimodal.
Social attention

✤   Social attention on YouTube is measured by considering the number
    of views received by a video

✤   Popularity: borrowed from the Latin popularis in 1490, it originally
    meant “common”

✤   The view count reflects the number of times that the item has been
    accessed (resembling the way audiences are measured in traditional
    mainstream media)

✤   It resembles other measures of popularity, BUT: not all the people
    who access the video like it!
Social attention

✤   Audio cues: vloggers talking longer, faster and using fewer pauses
    receive more views from the audience
Social attention

✤   Visual cues:
    ✤ The time looking at the camera and the average duration of looking
      turns are positively correlated with attention
    ✤ Vloggers that are too close to the camera are penalized: the
      audience cannot perceive body language cues
Future work (...or not?)

✤   Background analysis: does the background tell something about the
    speaker?
Bibliography

✤   Joan-Isaac Biel, Daniel Gatica-Perez, ‘VlogSense: Conversational
    Behavior and Social Attention in YouTube’, ACM Transactions on
    Multimedia Computing, Communications and Applications, 2010

✤   Joan-Isaac Biel, Oya Aran, Daniel Gatica-Perez, ‘You Are Known by
    How You Vlog: Personality Impressions and Nonverbal Behavior in
    YouTube’, AAAI, 2011

✤   Joan-Isaac Biel, Daniel Gatica-Perez, ‘Voices of Vlogging’, AAAI, 2010

✤   Joan-Isaac Biel, Daniel Gatica-Perez, ‘Vlogcast Yourself: Nonverbal
    Behavior and Attention in Social Media’, ICMI-MLMI, 2010

✤   Joan-Isaac Biel, Daniel Gatica-Perez, ‘The Good, the Bad and the Angry:
    Analyzing Crowdsourced Impressions of Vloggers’, AAAI, 2012

✤   Hervé Jégou, ‘Very Large Scale Image/Video Search’, SSMS’12, Santorini

✤   Utkarsh, ‘SIFT: Scale Invariant Feature Transform’,
    http://www.aishack.in/2010/05/sift-scale-invariant-feature-transform/

✤   Wikipedia, ‘Bag of Words’ and ‘Visual Word’

✤   Wikipedia, ‘tf-idf’

✤   Wikipedia, ‘k-means clustering’

✤   Rong Yan, ‘Data mining and machine learning for large-scale social media’,
    SSMS’12, Santorini

Multimedia Information Retrieval and User Behavior

  • 9. Global descriptors ✤ Global descriptors: one descriptor per image (highly scalable) ✤ Color histogram: representation of the distribution of colors ✤ Pros: high invariance to many transformations ✤ Cons: high invariance to TOO many transformations (limited discriminative power)
  • 10. Local descriptors ✤ Local descriptors: find regions of interest that will be exploited for image comparison ✤ SIFT: Scale Invariant Feature Transform ✤ Extract key-points (maxima and minima in the Difference of Gaussian image) ✤ Assign orientation to key-points (result: rotation invariance) ✤ Generate the feature vector for each key-point
  • 11. Direct matching ✤ Assumptions: ✤ m=1000 descriptors for one image ✤ Each descriptor has d=128 dimensions ✤ N>1,000,000 images in the data set ✤ Search: a query is submitted; results are retrieved ✤ Each descriptor of the query image is tested against each descriptor of the image data set ✤ Complexity: m²Nd elementary operations; Required space: ???
  • 12. Bag of Visual Words ✤ Objective: “put the images into words” (visual words) ✤ What is a visual word? “A small part of the image that carries some kind of information related to the features” [Wikipedia] ✤ Analogy Text-Image: ✤ Visual word: small patch of the image ✤ Visual term: cluster of patches that give the same information ✤ Bag of visual words: collection of words that give information about the meaning of the image as a whole
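The text-image analogy can be made concrete: once each local descriptor has been quantized to a visual-word id, the bag of visual words is just a histogram of ids, exactly like a bag-of-words vector for a text document. A minimal sketch (the word ids are invented for illustration):

```python
from collections import Counter

# Assume each local descriptor of an image has already been quantized
# to a visual-word id (invented ids, for illustration only).
quantized = ["sky", "sky", "grass", "sky", "dog"]

# The bag-of-visual-words representation is the histogram of those ids.
bovw = Counter(quantized)

print(bovw["sky"], bovw["grass"], bovw["dog"])  # 3 1 1
```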
• 13. Bag of Visual Words ✤ How to build a visual dictionary? ✤ Local descriptors are clustered (e.g., with k-means); the clusters form the visual dictionary Ω ✤ A local descriptor x is assigned to its nearest cluster: q(x) = argmin_{w ∈ Ω} ‖x − µ_w‖², where µ_w is the mean of cluster w
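The nearest-neighbor assignment q(x) is a one-liner; a toy dictionary stands in here for centroids that would actually be learned by k-means:

```python
import numpy as np

def quantize(x, centroids):
    """Assign a local descriptor x to its visual word:
    q(x) = argmin_w ||x - mu_w||^2, where mu_w is the mean of
    cluster w in the visual dictionary."""
    return int(np.argmin(((centroids - x) ** 2).sum(axis=1)))

# Toy dictionary of 3 visual words in a 2-d descriptor space
mu = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
assert quantize(np.array([0.4, 0.1]), mu) == 0
assert quantize(np.array([9.0, 1.0]), mu) == 1
```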
• 14. Why Visual Words? ✤ Pros: ✤ Much more compact representation ✤ We can take advantage of text retrieval techniques and apply them to image retrieval systems: find similar vectors in R^d, e.g., with tf-idf weighting: tfidf(t, d, D) = f(t, d) / max{f(w, d) : w ∈ d} · log(|D| / |{d ∈ D : t ∈ d}|)
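The tf-idf weighting on the slide (term frequency normalized by the most frequent term, times inverse document frequency) in plain Python, with visual words playing the role of terms:

```python
import math
from collections import Counter

def tf_idf(term, doc, docs):
    """tf = f(t, d) / max{f(w, d) : w in d};
    idf = log(|D| / |{d in D : t in d}|)."""
    counts = Counter(doc)
    tf = counts[term] / max(counts.values())
    df = sum(1 for d in docs if term in d)   # document frequency
    return tf * math.log(len(docs) / df)

docs = [["red", "dog", "dog"], ["red", "car"], ["blue", "sky"]]
# "dog" appears in 1 of 3 documents and is the most frequent word
# in doc 0, so tf = 1 and idf = log(3)
assert abs(tf_idf("dog", docs[0], docs) - math.log(3)) < 1e-9
# "red" appears in 2 of 3 documents, so it is weighted less
assert tf_idf("red", docs[0], docs) < tf_idf("dog", docs[0], docs)
```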
  • 15. Analyzing User Motivations in Video Blogging
  • 16. What is a video blog? ✤ Video blog (vlog): conversational videos where people (usually a single person) discuss facing the camera and addressing the audience in a Skype-style fashion ✤ Examples: video testimonial (companies pay for testing products), video advice (e.g., how to get dressed for a party), discussions
• 17. Why vlogs are used ✤ Corporate communication, life documentary, e-learning, marketing ✤ Community: high participation, daily interaction, comments, ratings, discussion, entertainment, critique
• 18. Why vlogs are studied ✤ Why are vlogs relevant? ✤ Automatic analysis of personal websites, blogs and social networks is limited to text (in order to understand users’ motivations) ✤ Vlogs are a new type of social media (40% of the most viewed videos on YouTube): how to do automatic analysis? ✤ Study a real-life communication scenario ✤ Human judgments are based on first impressions: can we predict them?
• 19. Real communication vs. vlogs ✤ Real communication: synchronous; two (or more) people interact ✤ Vlog: asynchronous; monologue; metadata
• 20. Non-verbal communication ✤ The process of communicating by sending and receiving wordless/visual cues between people ✤ Body: gestures, touch, body language, posture, facial expression, eye contact ✤ Speech: voice quality, rate, pitch, volume, rhythm, intonation ✤ Why? To express aspects of identity (age, occupation, culture, personality)
  • 22. An example: dominance ✤ Power: capacity or right to control others ✤ Dominance: way of exerting power involving the motive to control others ✤ Behavior: talk louder, talk longer, speak first, interrupt more, add gestures, receive more visual attention
• 23. Automatic processing pipeline ✤ Shot boundary detection (based on color histogram differences) ✤ Face detection for each shot (Viola-Jones algorithm) ✤ Shot selection: discard shots without faces and short shots (not talking) ✤ Audio and visual cue extraction ✤ Aggregate shot-level cues (at video level)
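The first pipeline step can be sketched as thresholding a histogram distance between consecutive frames. The Bhattacharyya-distance form and the threshold value below are illustrative assumptions, not the tuned values from the paper:

```python
import numpy as np

def shot_boundaries(frames, bins=8, gamma=0.5):
    """Detect shot cuts by thresholding a distance between color
    histograms of consecutive frames (grayscale here for brevity)."""
    def hist(frame):
        h, _ = np.histogram(frame, bins=bins, range=(0, 256))
        h = h.astype(float)
        return h / h.sum()

    hists = [hist(f) for f in frames]
    cuts = []
    for i in range(1, len(hists)):
        # Bhattacharyya coefficient -> distance in [0, 1]
        bc = np.sum(np.sqrt(hists[i - 1] * hists[i]))
        dist = np.sqrt(max(0.0, 1.0 - bc))
        if dist > gamma:
            cuts.append(i)
    return cuts

# Ten dark frames followed by ten bright frames -> one cut at frame 10
frames = ([np.full((8, 8), 10, np.uint8)] * 10
          + [np.full((8, 8), 200, np.uint8)] * 10)
assert shot_boundaries(frames) == [10]
```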
• 24. Visual cue extraction ✤ Weighted Motion Energy Images (wMEI): wMEI = Σ_{f ∈ V} D_f, where D_f is the binary image containing the moving pixels in frame f ✤ It indicates the visual activity of a pixel (accumulated motion through the video) ✤ Brighter pixels: regions with higher motion ✤ (Figure: wMEI images for two vlogs)
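A minimal wMEI sketch: D_f is approximated here by thresholded frame differencing (the threshold is an illustrative assumption), and the accumulated image is normalized so the brightest pixel is 1:

```python
import numpy as np

def weighted_mei(frames, thresh=20):
    """Accumulate binary motion maps over the video: a pixel that
    changes often between consecutive frames ends up brighter."""
    frames = [f.astype(np.int16) for f in frames]   # avoid uint8 wrap
    wmei = np.zeros(frames[0].shape, dtype=float)
    for prev, cur in zip(frames, frames[1:]):
        wmei += (np.abs(cur - prev) > thresh)       # D_f
    return wmei / max(wmei.max(), 1.0)              # normalize to [0, 1]

# Pixel (1, 1) blinks in every frame -> it is the brightest wMEI pixel
f0 = np.zeros((4, 4), np.uint8)
f1 = f0.copy()
f1[1, 1] = 255
wmei = weighted_mei([f0, f1, f0, f1])
assert wmei[1, 1] == 1.0 and wmei[0, 0] == 0.0
```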
• 25. Visual cue extraction ✤ (Fig. 5: Nonverbal cues are extracted based on speech/non-speech, looking/non-looking segmentations, and multimodal segmentations) ✤ It is difficult to estimate the actual direction of the eyes ✤ If the face is in frontal position, I’m most likely looking at the camera ✤ We are therefore interested in frontal face detection
• 26. Visual cue extraction ✤ Looking time: looking activity (how much the vlogger looks at the camera) ✤ Looking segment length: persistence of the vlogger’s gaze ✤ Looking turns: looking activity ✤ Proximity to camera: choice of addressing the camera from close-ups ✤ Vertical framing: how much the vlogger shows the upper body ✤ Vertical head motion
• 27. Visual cue extraction ✤ Looking time: Σ_{L ∈ V} t_L / t_V (t_L: duration of a looking segment, t_V: video duration) ✤ Looking segment length: Σ_{L ∈ V} t_L / N_L (N_L: number of looking segments) ✤ Looking turns: N_L / t_V ✤ Proximity to camera: Σ_{f ∈ V} A_face(f) / (N_f · A(f)) (A_face(f): face area in frame f, A(f): frame area, N_f: number of frames containing a face) ✤ Vertical framing: Σ_{f ∈ V} (c_face(f) − c(f)) / (N_f · f_h) (c_face(f): face center, c(f): frame center, f_h: frame height) ✤ Vertical head motion: σ(c_face(f) − c(f))
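Two of the frame-based cues above can be computed directly from face-detector output. This is a sketch, not the paper's implementation; the (x, y, w, h) box format is a hypothetical choice:

```python
def framing_cues(face_boxes, frame_w, frame_h):
    """Compute proximity to camera and vertical framing from a list of
    face boxes (x, y, w, h), one per frame with a detection.

    Proximity: face area relative to frame area, averaged over frames.
    Vertical framing: vertical offset of the face center from the
    frame center, normalized by frame height and averaged.
    """
    n = len(face_boxes)
    frame_area = frame_w * frame_h
    proximity = sum(w * h for _, _, w, h in face_boxes) / (n * frame_area)
    v_offsets = [(y + h / 2) - frame_h / 2 for _, y, _, h in face_boxes]
    vertical_framing = sum(v_offsets) / (n * frame_h)
    return proximity, vertical_framing

# One face covering a quarter of a 100x100 frame, in the top half
prox, vf = framing_cues([(25, 0, 50, 50)], 100, 100)
assert prox == 0.25
assert vf == -0.25     # negative: face above the frame center
```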
• 28. Audio cue extraction ✤ Speaking time: speaking activity (how much the vlogger speaks) ✤ Voicing rate: fluency (# of phonemes; how fast the vlogger talks) ✤ Speech segment avg length: fluency ✤ Speaking turns: fluency (duration and number of silent pauses) ✤ Speaking energy: emotional stability (how well the vlogger controls loudness) ✤ Pitch variation: emotional state (how well the vlogger controls tone)
• 29. Audio cue extraction ✤ Speaking time: Σ_{S ∈ V} t_S / t_V (t_S: speech segment duration, t_V: video duration) ✤ Voicing rate: number of phonemes per unit of speaking time ✤ Speech segment avg length: Σ_{S ∈ V} t_S / N_S (N_S: number of speech segments) ✤ Speaking turns: N_S / t_V ✤ Speaking energy: σ(S_energy) / µ(S_energy) ✤ Pitch variation: σ(pitch) / µ(pitch)
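The segmentation-based cues above reduce to simple ratios once a speech/non-speech segmentation is given. The (start, end) segment format is a hypothetical choice; the segmentation itself would come from a speech detector, which is out of scope here:

```python
def speaking_cues(segments, video_len):
    """Compute the slide's segmentation-based audio cues from a list
    of (start, end) speaking segments, in seconds."""
    durs = [end - start for start, end in segments]
    total = sum(durs)
    return {
        "speaking_time": total / video_len,       # ratio of time speaking
        "avg_segment_length": total / len(durs),  # fluency
        "speaking_turns": len(durs) / video_len,  # turns per second
    }

cues = speaking_cues([(0, 2), (3, 5), (6, 12)], video_len=20)
assert cues["speaking_time"] == 0.5               # speaks half the time
assert abs(cues["avg_segment_length"] - 10 / 3) < 1e-9
assert cues["speaking_turns"] == 3 / 20
```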
• 30. Combining audio and visual cues ✤ Combine “looking at the camera” with “speaking”: four modalities ✤ These measures are used to determine dominance in dyadic conversations ✤ Looking-while-speaking: dominant people
• 31. Analysis: video editing elements ✤ Elements manually coded as a support to the core conversational part ✤ Snippets (opening, ending, intermediate), background music, objects brought toward the camera ✤ Results: ✤ Snippets: 45% of vlogs (16%–20% with openings/endings, 32% with intermediate snippets) ✤ Videos without snippets are monologues ✤ Snippets tend to be a small fraction of the content of the video (~10%) ✤ Audio: 25% use a soundtrack on snippets, 12% use music on the entire video ✤ Objects: 26% of vloggers bring an object toward the camera
• 32. Analysis: non-verbal behavior ✤ Vloggers are mainly talking: 85% of people talk for more than half of the time ✤ Speaking segments tend to be short (hesitations and low fluency) ✤ (Histograms: time speaking ratio, average length of speech segments, number of turns)
• 33. Analysis: non-verbal behavior ✤ 50% of vloggers look at the camera over 90% of the time, at a “standard” distance from the camera (not too close, not too far), showing the upper body ✤ Vloggers look at the camera when they speak more frequently than when they are silent ✤ Behavior of dominant people ✤ (Fig. 8: Selected nonverbal cue distributions for conversational shots in YouTube vlogs: four audio cues, three visual cues, and one multimodal)
• 34. Social attention ✤ Social attention on YouTube is measured by considering the number of views received by a video ✤ Popularity: borrowed from the Latin popularis in 1490, originally meant “common” ✤ This measure reflects the number of times that the item has been accessed (resembling other measures of popularity in traditional mainstream media), BUT: not all the people that access the video like it!
  • 35. Social attention ✤ Audio cues: vloggers talking longer, faster and using fewer pauses receive more views from the audience
  • 36. Social attention ✤ Visual cues: ✤ The time looking at the camera and the average duration of looking turns are positively correlated with attention ✤ Vloggers that are too close to the camera are penalized: the audience cannot perceive body language cues
• 37. Future work (...or not?) ✤ Background analysis: does the background tell us something about the speaker?
  • 39. Bibliography ✤ Joan-Isaac Biel, Daniel Gatica-Perez, ‘VLogSense: Conversational Behavior and Social Attention in YouTube’, ACM Transactions on Multimedia Computing, Communications and Applications, 2010 ✤ Joan-Isaac Biel, Oya Aran, Daniel Gatica-Perez, ‘You Are Known by How You Vlog: Personality Impressions and Nonverbal Behavior in YouTube’, AAAI, 2011 ✤ Joan-Isaac Biel, Daniel Gatica-Perez, ‘Voices of Vlogging’, AAAI, 2010 ✤ Joan-Isaac Biel, Daniel Gatica-Perez, ‘Vlogcast Yourself: Nonverbal Behavior and Attention in Social Media’, ICMI-MLMI, 2010
• 40. Bibliography ✤ Joan-Isaac Biel, Daniel Gatica-Perez, ‘The Good, the Bad and the Angry: Analyzing Crowdsourced Impressions of Vloggers’, AAAI, 2012 ✤ Hervé Jégou, ‘Very Large Scale Image/Video Search’, SSMS’12, Santorini ✤ Utkarsh, ‘SIFT: Scale Invariant Feature Transform’, http://www.aishack.in/2010/05/sift-scale-invariant-feature-transform/ ✤ Wikipedia, ‘Bag of Words’ and ‘Visual Word’ ✤ Wikipedia, ‘tf-idf’ ✤ Wikipedia, ‘k-means clustering’
  • 41. Bibliography ✤ Rong Yan, ‘Data mining and machine learning for large-scale social media’, SSMS’12, Santorini