Detecting Violent Content in Hollywood Movies by Mid-level Audio Representations
1. Competence Center Information Retrieval & Machine Learning
11th International Workshop on Content-Based Multimedia Indexing (CBMI), Veszprem, Hungary, 2013
Esra Acar, Frank Hopfgartner, Sahin Albayrak
2. Outline
17 June 2013, CBMI 2013
► Motivation
► The Violence Detection Method
Audio Representation of Videos
Learning Violence Detection Model
► Performance Evaluation
► Conclusions & Future Work
3. Motivation
► Goal: detect the most violent scenes in Hollywood movies.
► Use case: parents select or reject movies by previewing the parts that include the most violent moments.
► We investigate the discriminative power of mid-level audio features:
Bag-of-Audio-Words (BoAW) representations based on Mel-Frequency Cepstral Coefficients (MFCCs)
Two different BoAW construction methods:
a vector quantization-based (VQ-based) method, and
a sparse coding-based (SC-based) method
4. The Violence Detection Method
► The definition of violence: "physical violence or accident resulting in human injury or pain", as defined in the MediaEval Violent Scenes Detection (VSD) task.
► Two main components of the method:
the representation of video shots, and
the learning of a violence model.
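The slides shown here do not detail the learning step; as a hypothetical sketch, a binary classifier (a linear SVM from scikit-learn is an assumed choice, not necessarily the paper's) could be trained on per-shot audio descriptors, with its decision scores used to rank shots from most to least violent:

```python
# Hypothetical sketch of the model-learning step: train a binary
# classifier on per-shot descriptors, then rank unseen shots by the
# classifier's decision score. Random data stands in for real BoAW
# descriptors; the linear SVM is an illustrative assumption.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 64))      # BoAW descriptor per training shot
y_train = rng.integers(0, 2, size=200)    # 1 = violent, 0 = non-violent

clf = LinearSVC(max_iter=10000, random_state=0).fit(X_train, y_train)

X_test = rng.normal(size=(10, 64))        # descriptors of 10 test shots
scores = clf.decision_function(X_test)    # higher = more violent
ranking = np.argsort(-scores)             # most-violent-first shot order
print(ranking.shape)                      # (10,)
```

The ranking, rather than the hard labels, is what feeds the precision-at-k style metrics used later in the talk.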
5. Audio Representation of Videos (1)
► Mel-Frequency Cepstral Coefficients (MFCCs)
are commonly used in speech recognition and music information retrieval (e.g., genre classification),
relate closely to human auditory perception, and
work well for the detection of excitement/non-excitement (i.e., as indicators of the excitement level of video segments).
► An MFCC-based audio representation is employed to describe the audio content of Hollywood movies.
► Using mid-level representations may help model video segments one step closer to human perception. Examples are:
bags of features, and
the upper units of convolutional networks or deep belief networks.
6. Audio Representation of Videos (2)
► We use mid-level audio features based on MFCCs (i.e., the BoAW approach).
► The BoAW approach with two different coding schemes:
Vector quantization (by k-means clustering):
dividing the feature vectors into groups, where each group is represented by its centroid (e.g., via the k-means clustering algorithm).
Sparse coding (by the LARS algorithm):
representing a feature vector as a linear combination of an over-complete set of basis vectors.
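The two coding schemes above can be sketched as follows. The codebook size, sparsity level, max-pooling choice, and the random stand-in MFCC frames are illustrative assumptions, not the paper's exact settings:

```python
# Sketch of the two BoAW coding schemes over per-frame MFCC vectors.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import sparse_encode

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 13))   # stand-in MFCC frames (training shots)
shot = rng.normal(size=(40, 13))     # MFCC frames of one video shot
k = 8                                # codebook / dictionary size (assumed)

# --- VQ-based: k-means codebook, shot = histogram of nearest audio words
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(train)
hist = np.bincount(km.predict(shot), minlength=k).astype(float)
vq_boaw = hist / hist.sum()          # L1-normalized shot descriptor

# --- SC-based: LARS codes against a dictionary (random here for brevity),
#     max-pooled over the shot's frames into one k-dim descriptor
dictionary = rng.normal(size=(k, 13))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)
codes = sparse_encode(shot, dictionary, algorithm='lars', n_nonzero_coefs=3)
sc_boaw = np.abs(codes).max(axis=0)

print(vq_boaw.shape, sc_boaw.shape)  # (8,) (8,)
```

In practice the dictionary would itself be learned from training MFCCs (e.g., via dictionary learning) rather than drawn at random, and the over-complete case corresponds to k larger than the MFCC dimensionality.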
10. Performance Evaluation
► Dataset:
32,708 video shots from 18 Hollywood movies of different genres (ranging from extremely violent movies to movies without violence).
Training set: 26,138 video shots from 15 movies.
Test set: 6,570 video shots from 3 movies.
► Ground truth:
generated by 7 human assessors; violent movie segments are annotated at the frame level.
Each video shot is labeled as violent or non-violent.
[Table: The characteristics of the training and test datasets]
11. Evaluation Metrics
► For this use case, the ranking of violent shots is more important than the raw classification decisions.
► Metrics beyond plain precision and recall are therefore required to compare performance.
► Average precision at 20 and at 100 are used (the official metrics of the MediaEval VSD task).
► R-precision, which can be seen as an alternative to precision at k, is also reported.
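A minimal sketch of these ranking metrics, assuming shots are already sorted by descending violence score and the labels mark the truly violent shots in that order (this is one common definition of average precision at k; the MediaEval task may normalize slightly differently):

```python
# Ranking metrics over a scored shot list, most-violent-first.
import numpy as np

def average_precision_at_k(labels, k):
    """Mean of precision@i over the ranks i <= k that hold a violent shot."""
    labels = np.asarray(labels[:k], dtype=float)
    hits = labels.cumsum()
    precisions = hits / np.arange(1, len(labels) + 1)
    return float((precisions * labels).sum() / max(labels.sum(), 1))

def r_precision(labels, r):
    """Precision at rank R, where R is the total number of violent shots."""
    return float(np.mean(labels[:r]))

ranked = [1, 0, 1, 1, 0, 0, 1, 0]          # toy ranking, 4 violent shots
print(average_precision_at_k(ranked, 5))   # (1/1 + 2/3 + 3/4)/3 ≈ 0.806
print(r_precision(ranked, 4))              # 3/4 = 0.75
```

Because both metrics only look at the top of the ranked list, they reward a method that surfaces violent shots early, which matches the previewing use case.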
12. Results & Discussions (1)
[Table: Average precision at 100 for the baseline and our methods]
[Table: Average precision at 20 & 100 and R-precision for the VQ- and SC-based methods]
13. Results & Discussions (2)
[Table: Average precision at 20 & 100 and R-precision on Independence Day]
[Table: Average precision at 20 & 100 and R-precision on Dead Poets Society]
[Table: Average precision at 20 & 100 and R-precision on Fight Club]
14. Results & Discussions (3)
Team | Features | Modality | AP@100*
ARF | Color, texture, audio and concepts | audio-visual | 0.651
Shanghai-Hong Kong | Trajectory-based features, SIFT, STIP, MFCCs | audio-visual | 0.624
TEC | Color, motion, acoustic features | audio-visual | 0.618
TUM | Acoustic energy and spectral, color, texture, optical flow | audio-visual | 0.484
SC-based (ours) | BoAW with sparse coding | audio | 0.444
VQ-based (ours) | BoAW with vector quantization | audio | 0.387
LIG-MIRM | Color, texture, bag of SIFT and MFCCs | audio-visual | 0.314
NII | Visual concepts learned from color and texture | visual | 0.308
DYNI-LSIS | Multi-scale local binary pattern | visual | 0.125
* Average precision at 100 (the official evaluation metric of the MediaEval VSD task)
17. Conclusions
► An approach for detecting violent content in movies at the video-shot level is presented.
► Mid-level audio features based on the BoAW approach with two different coding schemes are employed.
► Promising results are obtained:
the SC-based BoAW outperforms all uni-modal submissions to the MediaEval VSD task except one vision-based method.
► One significant point is that the average precision of the proposed method varies considerably across movies of different violence levels.
18. Future Work
► Construct more sophisticated mid-level representations for video content analysis.
► Augment the feature set with visual features (both low-level and mid-level) to further improve classification.
► Extend the approach to user-generated videos:
unlike Hollywood movies, these videos are not professionally edited (e.g., to enhance dramatic scenes).