2. Outline
✤ Multimedia Information Retrieval on large data sets
✤ The “giants” of photo uploads
✤ Image search
✤ Descriptors
✤ Bag of Visual Words
✤ Analyzing User Motivations in Video Blogging
✤ What is a video blog?
✤ Non-verbal communication
✤ Automatic processing pipeline
✤ Cue extraction & results
✤ Cues vs. Social Attention
4. The “giants” of photo uploads
✤ Flickr uploads (source: http://www.flickr.com/):
✤ 1.54 million photos per day on average
✤ 51 million users
✤ 6 billion images
✤ Facebook uploads (source: http://thenextweb.com/):
✤ 250 million photos per day on average
✤ 845 million users in February 2012
✤ 90+ billion photos in August 2011
✤ “Flickr hits 6 billion total photos, Facebook does that every two months”
5. Image search
✤ Query by example: look for a particular object / scene / location in a
collection of images
7. Descriptors
✤ How can we look for similar images?
✤ Compute a descriptor: a mathematical representation of the image
✤ Find similar descriptors
✤ Problem: occlusions; changes in rotation, scale, and lighting
8. Descriptors
✤ Solution: invariant descriptors (to scale, rotation, ...)
9. Global descriptors
✤ Global descriptors: one descriptor per image (highly scalable)
✤ Color histogram: representation of the distribution of colors
✤ Pros: high invariance to many transformations
✤ Cons: high invariance to TOO many transformations (limited
discriminative power)
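To make the idea concrete, here is a minimal sketch of a color-histogram global descriptor with OpenCV; the HSV color space and the 8×8×8 binning are illustrative choices, not ones prescribed by the talk.

```python
import cv2

def color_histogram(image_bgr, bins=(8, 8, 8)):
    """Global descriptor: one normalized color histogram per image."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins),
                        [0, 180, 0, 256, 0, 256])  # hue is in [0, 180) in OpenCV
    return cv2.normalize(hist, hist).flatten()     # one 512-d vector per image

# Similarity between two images = similarity between two histograms, e.g.:
# sim = cv2.compareHist(h1, h2, cv2.HISTCMP_CORREL)
```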
10. Local descriptors
✤ Local descriptors: find regions of interest that will be exploited for
image comparison
✤ SIFT: Scale Invariant Feature Transform
✤ Extract key-points (maxima and minima in the Difference of
Gaussian image)
✤ Assign orientation to key-points (result: rotation invariance)
✤ Generate the feature vector for each key-point
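A minimal sketch of these three SIFT steps with OpenCV (assuming a build where SIFT is exposed as cv2.SIFT_create; older builds keep it under cv2.xfeatures2d):

```python
import cv2

img = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()

# keypoints carry location, scale and orientation (DoG extrema + orientation
# assignment); descriptors is an m x 128 array, one feature vector per key-point
keypoints, descriptors = sift.detectAndCompute(img, None)
print(len(keypoints), descriptors.shape)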
11. Direct matching
✤ Assumptions:
✤ m = 1000 descriptors per image
✤ Each descriptor has d = 128 dimensions
✤ N > 1,000,000 images in the data set
✤ Search: a query is submitted; results are retrieved
✤ Each descriptor of the query image is tested against each descriptor of every image in the data set
✤ Complexity: m²·N·d elementary operations; required space: ??? (see the sketch below)
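A back-of-the-envelope sketch of why direct matching does not scale, using the slide's numbers (float32 storage is my assumption):

```python
import numpy as np

m, d, N = 1000, 128, 1_000_000
print(f"operations: {m * m * N * d:.2e}")        # m^2 * N * d ~ 1.28e17
print(f"storage: {N * m * d * 4 / 1e9:.0f} GB")  # N * m * d float32 values ~ 512 GB

def match(Q, X):
    """Brute-force matching of query descriptors Q (m x d) against one
    database image X (m x d): m^2 * d operations per database image."""
    d2 = ((Q[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)  # nearest database descriptor per query descriptor
```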
12. Bag of Visual Words
✤ Objective: “put the images into words” (visual words)
✤ What is a visual word? “A small part of the image that carries some
kind of information related to the features” [Wikipedia]
✤ Analogy Text-Image:
✤ Visual word: small patch of the image
✤ Visual term: cluster of patches that give the same information
✤ Bag of visual words: collection of visual words that together give information about the overall meaning of the image
13. Bag of Visual Words
✤ How to build a visual dictionary?
✤ Local descriptors are clustered; the set of clusters Ω forms the visual dictionary
✤ A local descriptor x is assigned to its nearest cluster:

q(x) = arg min_{w ∈ Ω} ‖x − µ_w‖²

where µ_w is the mean of cluster w
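A sketch of dictionary building and quantization with scikit-learn's k-means (the dictionary size k and the random descriptors are placeholders; real systems cluster millions of SIFT descriptors into thousands of visual words):

```python
import numpy as np
from sklearn.cluster import KMeans

descriptors = np.random.rand(10000, 128).astype(np.float32)  # stand-in for SIFT output

k = 100
kmeans = KMeans(n_clusters=k, n_init=10).fit(descriptors)    # centroids = visual words

def quantize(x):
    # q(x) = arg min_{w in dictionary} ||x - mu_w||^2
    return np.argmin(((kmeans.cluster_centers_ - x) ** 2).sum(axis=1))

def bag_of_visual_words(image_descriptors):
    # image representation: histogram of visual-word occurrences
    return np.bincount(kmeans.predict(image_descriptors), minlength=k)
```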
14. Why Visual Words?
✤ Pros:
✤ Much more compact representation
✤ We can take advantage of text retrieval techniques and apply them to image retrieval systems: each image becomes a weight vector, and retrieval reduces to finding similar vectors
✤ Term weighting (tf-idf):

tf-idf(t, d, D) = [ f(t, d) / max{ f(w, d) : w ∈ d } ] · log( |D| / |{ d ∈ D : t ∈ d }| )
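A direct transcription of the slide's tf-idf formula (documents as lists of terms; for images, the "terms" are visual words):

```python
import math

def tf(t, d):
    # term frequency, normalized by the most frequent term in the document
    return d.count(t) / max(d.count(w) for w in set(d))

def idf(t, D):
    # inverse document frequency: rare terms weigh more
    return math.log(len(D) / sum(1 for d in D if t in d))

def tfidf(t, d, D):
    return tf(t, d) * idf(t, D)

docs = [["car", "road", "car"], ["tree", "road"], ["tree", "sky"]]
print(tfidf("car", docs[0], docs))  # 1.0 * log(3/1) ~ 1.10
```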
16. What is a video blog?
✤ Video blog (vlog): conversational videos where people (usually a
single person) discuss facing the camera and addressing the audience
in a Skype-style fashion
✤ Examples: video testimonials (companies pay users to test products), video advice (e.g., how to dress for a party), discussions
17. Why vlogs are used
✤ Corporate communication, marketing, e-learning
✤ Life documentary, entertainment, discussion, critique
✤ Community: high participation, daily interaction, comments, ratings
18. Why vlogs are studied
✤ Why are vlogs relevant?
✤ Automatic analysis of personal websites, blogs and social networks
is limited to text (in order to understand users’ motivations)
✤ Vlogs are a new type of social media (40% of the most viewed videos on YouTube): how can they be analyzed automatically?
✤ They capture a real-life communication scenario
✤ Human judgements are based on first impressions: can we predict them?
19. Real communication vs. vlogs
✤ Real communication: synchronous; two (or more) people interact
✤ Vlog: asynchronous; a monologue; feedback arrives only as metadata (comments, ratings)
20. Non-verbal communication
✤ The process of communication through sending and receiving
wordless/visual cues between people
✤ Body: gestures, touch, body language, posture, facial expression, eye contact
✤ Speech: voice quality, rate, pitch, volume, rhythm, intonation
✤ Why? To express aspects of identity (age, occupation, culture,
personality)
22. An example: dominance
✤ Power: capacity or right to control others
✤ Dominance: way of exerting power
involving the motive to control others
✤ Behavior: talk louder, talk longer, speak
first, interrupt more, add gestures,
receive more visual attention
23. Automatic processing pipeline
✤ Shot boundary detection (based on color histogram differences)
✤ Face detection (Viola-Jones algorithm), for each shot
✤ Conversational shot selection: discard shots without faces and short shots (not talking)
✤ Audio and visual cue extraction on the selected shots
✤ Shot-level cues are aggregated into video-level cues
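A sketch of the shot boundary step, following the paper's description (Bhattacharyya distance between color histograms of consecutive frames; the paper reports tuning a global threshold γ = 0.5 on a development set, but this value may not transfer to other footage):

```python
import cv2

def shot_boundaries(video_path, threshold=0.5):
    cap = cv2.VideoCapture(video_path)
    cuts, prev_hist, i = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, hist)
        if prev_hist is not None and \
           cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > threshold:
            cuts.append(i)  # frame i starts a new shot
        prev_hist, i = hist, i + 1
    cap.release()
    return cuts
```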
24. Visual cue extraction
✤ Weighted Motion Energy Image (wMEI): wMEI = Σ_{f ∈ V} D_f, where D_f is the binary image containing the moving pixels in frame f
✤ Each pixel of the wMEI indicates its visual activity (accumulated motion through the video)
✤ Brighter pixels correspond to regions with higher motion
[Figure 2: wMEI images for two vlogs]
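A sketch of wMEI accumulation, using simple frame differencing for D_f (the differencing threshold is my assumption; the slide does not specify how the binary motion masks are obtained):

```python
import cv2
import numpy as np

def weighted_mei(video_path, motion_threshold=25):
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    wmei = np.zeros(prev.shape, dtype=np.float64)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # D_f: binary image of the moving pixels in frame f
        wmei += cv2.absdiff(gray, prev) > motion_threshold
        prev = gray
    cap.release()
    return wmei / wmei.max()  # brighter pixels = more accumulated motion
```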
25. Visual cue extraction
✤ It is difficult to estimate the actual direction of the eyes
✤ Simplification: if the face is in a frontal position, the vlogger is most likely looking at the camera
✤ We are therefore interested in frontal face detection, a reasonable choice given the inherent nature of conversational vlogging
26. Visual cue extraction
✤ Looking time: looking activity (how much the vlogger looks at the camera)
✤ Looking segment length: persistence of the vlogger's gaze
✤ Looking turns: looking activity
✤ Proximity to camera: the choice of addressing the camera from close-ups
✤ Vertical framing: how much the vlogger shows the upper body
✤ Vertical head motion
27. Visual cue extraction
✤ Looking time: Σ_{L ∈ V} t_L / t_V, where t_L is the duration of looking segment L and t_V the video duration
✤ Looking segment length: Σ_{L ∈ V} t_L / N_L, where N_L is the number of looking segments
✤ Looking turns: N_L / t_V
✤ Proximity to camera: Σ_{f ∈ V} A_face(f) / (N_f · A(f)), where A_face(f) is the face area in frame f, A(f) the frame area, and N_f the number of frames containing a face
✤ Vertical framing: Σ_{f ∈ V} ‖c_face(f) − c(f)‖ / (N_f · f_h), the face-center offset from the frame center c(f), normalized by the frame height f_h
✤ Vertical head motion: σ(‖c_face(f) − c(f)‖) / µ(‖c_face(f) − c(f)‖)
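The three time-based looking cues, sketched from a per-frame boolean array of frontal-face detections (a simplification: the paper computes them on shot-level segmentations):

```python
import numpy as np

def looking_cues(face_detected, fps, t_V):
    """face_detected: per-frame booleans; t_V: video duration in seconds."""
    looking = np.asarray(face_detected, dtype=bool)
    # split the frame sequence into constant-value segments
    starts = np.r_[0, np.flatnonzero(np.diff(looking.astype(np.int8))) + 1]
    durations = np.diff(np.r_[starts, looking.size]) / fps
    t_L = durations[looking[starts]]  # keep only the looking segments
    return {
        "looking_time": t_L.sum() / t_V,        # sum t_L / t_V
        "looking_segment_length": t_L.mean(),   # sum t_L / N_L
        "looking_turns": len(t_L) / t_V,        # N_L / t_V
    }
```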
28. Audio cue extraction
✤ Speaking time: speaking activity (how much the vlogger talks)
✤ Speech segment average length: fluency
✤ Speaking turns: fluency (duration and number of silent pauses)
✤ Voicing rate: fluency (number of phonemes per second; how fast the vlogger speaks)
✤ Speaking energy: emotional stability (how well the vlogger controls loudness)
✤ Pitch variation: emotional state (how well the vlogger controls tone)
29. Audio cue extraction
✤ Speaking time: Σ_{S ∈ V} t_S / t_V, where t_S is the duration of speech segment S and t_V the video duration
✤ Speech segment avg length: Σ_{S ∈ V} t_S / N_S, where N_S is the number of speech segments
✤ Speaking turns: N_S / t_V
✤ Voicing rate: number of voiced units (≈ phonemes) per second of speech, i.e. normalized by Σ_{S ∈ V} t_S
✤ Speaking energy: σ(S_energy) / µ(S_energy)
✤ Pitch variation: σ(pitch) / µ(pitch)
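The same formulas, transcribed for a speech/non-speech segmentation (segments as (start, end) times; the σ/µ ratios apply to per-frame energy and pitch tracks):

```python
import numpy as np

def speaking_cues(speech_segments, t_V):
    """speech_segments: list of (start, end) times in seconds; t_V: video duration."""
    t_S = np.array([end - start for start, end in speech_segments])
    return {
        "speaking_time": t_S.sum() / t_V,         # sum t_S / t_V
        "speech_segment_avg_length": t_S.mean(),  # sum t_S / N_S
        "speaking_turns": len(t_S) / t_V,         # N_S / t_V
    }

def variation(track):
    # sigma(x) / mu(x), used for both speaking energy and pitch variation
    track = np.asarray(track, dtype=float)
    return track.std() / track.mean()
```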
30. Combining audio and visual cues
✤ Combine "looking at the camera" with "speaking": four modalities (looking/not-looking × speaking/not-speaking)
✤ These measures are used to determine dominance in dyadic conversations
✤ Looking-while-speaking is characteristic of dominant people
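A sketch of the multimodal combination, given per-frame speaking and looking booleans aligned on the same timeline (my framing; the paper combines the two segmentations):

```python
import numpy as np

def multimodal_ratios(speaking, looking):
    speaking = np.asarray(speaking, dtype=bool)
    looking = np.asarray(looking, dtype=bool)
    n = speaking.size
    return {  # the four modalities from combining the two binary cues
        "looking_while_speaking": (looking & speaking).sum() / n,
        "looking_while_not_speaking": (looking & ~speaking).sum() / n,
        "not_looking_while_speaking": (~looking & speaking).sum() / n,
        "not_looking_while_not_speaking": (~looking & ~speaking).sum() / n,
    }
```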
31. Analysis: video editing elements
✤ Elements manually coded as a support to the core conversational part
✤ Snippets (opening, ending, intermediate), background music, objects brought toward the camera
[Timeline: opening snippet → object toward the camera → intermediate snippet → object toward the camera → ending snippet]
✤ Results:
✤ Snippets appear in 45% of vlogs (16%–20% with opening/ending snippets, 32% with intermediate snippets)
✤ Videos without snippets are pure monologues
✤ Snippets tend to be a small fraction of the video content (~10%)
✤ Audio: 25% use a soundtrack on snippets, 12% use music over the entire video
✤ Objects: 26% of vloggers bring an object toward the camera
32. Analysis: non-verbal behavior
✤ Vloggers are mainly talking: 85% speak for more than half of the time
✤ Speaking segments tend to be short (hesitations and low fluency, typical of spontaneous speech)
[Fig. 8 (excerpt): distributions of time speaking (ratio), average length of speech segments (s), and number of speaking turns (Hz)]
33. Analysis: non-verbal behavior
✤ 50% of vloggers look at the camera over 90% of the time, at a "standard" distance to the camera (not too close, not too far), showing the upper body
✤ Vloggers look at the camera more frequently when they speak than when they are silent
✤ This is the behavior of dominant people
[Fig. 8: Selected nonverbal cue distributions for conversational shots in YouTube vlogs: four audio cues, three visual cues, and one multimodal]
34. Social attention
✤ Social attention on YouTube is measured by the number of views received by a video
✤ "Popularity": borrowed from the Latin popularis in 1490, it originally meant "common"
✤ The view count reflects the number of times the item has been accessed, resembling the way audiences are measured in traditional mainstream media
✤ BUT: not all the people who access the video like it!
35. Social attention
✤ Audio cues: vloggers who talk longer, faster, and with fewer pauses receive more views from the audience
36. Social attention
✤ Visual cues:
✤ The time looking at the camera and the average duration of looking turns
are positively correlated with attention
✤ Vloggers who are too close to the camera are penalized: the audience cannot perceive body-language cues
37. Future work (...or not?)
✤ Background analysis: does the background tell us something about the speaker?
39. Bibliography
✤ Joan-Isaac Biel, Daniel Gatica-Perez, ‘VlogSense: Conversational Behavior and Social Attention in YouTube’, ACM Transactions on Multimedia Computing, Communications and Applications, 2011
✤ Joan-Isaac Biel, Oya Aran, Daniel Gatica-Perez, ‘You Are Known by
How You Vlog: Personality Impressions and Nonverbal Behavior in
YouTube’, AAAI, 2011
✤ Joan-Isaac Biel, Daniel Gatica-Perez, ‘Voices of Vlogging’, AAAI, 2010
✤ Joan-Isaac Biel, Daniel Gatica-Perez, ‘Vlogcast Yourself: Nonverbal
Behavior and Attention in Social Media’, ICMI-MLMI, 2010
40. Bibliography
✤ Joan-Isaac Biel, Daniel Gatica-Perez, ‘The Good, the Bad and the Angry:
Analyzing Crowdsourced Impressions of Vloggers’, AAAI, 2012
✤ Hervé Jégou, ‘Very Large Scale Image/Video Search’, SSMS’12, Santorini
✤ Utkarsh, ‘SIFT: Scale Invariant Feature Transform’, http://
www.aishack.in/2010/05/sift-scale-invariant-feature-transform/
✤ Wikipedia, ‘Bag of Words’ and ‘Visual Word’
✤ Wikipedia, ‘tf-idf’
✤ Wikipedia, ‘k-means clustering’
41. Bibliography
✤ Rong Yan, ‘Data mining and machine learning for large-scale social media’,
SSMS’12, Santorini