Detecting Violent Content in Hollywood Movies by Mid-level Audio Representations
1. Competence Center Information Retrieval & Machine Learning
11th International Workshop on Content-Based Multimedia Indexing (CBMI), Veszprem, Hungary, 2013
Esra Acar, Frank Hopfgartner, Sahin Albayrak
2. Outline
17 June 2013, CBMI 2013
► Motivation
► The Violence Detection Method
Audio Representation of Videos
Learning Violence Detection Model
► Performance Evaluation
► Conclusions & Future Work
3. Motivation
► Goal: detect the most violent scenes in Hollywood movies.
► Use case: parents select or reject movies by previewing the parts that include the most violent moments.
► We investigate the discriminative power of mid-level audio features:
Bag-of-Audio-Words (BoAW) representations based on Mel-Frequency Cepstral Coefficients (MFCCs)
Two different BoAW construction methods:
a vector quantization-based (VQ-based) method, and
a sparse coding-based (SC-based) method
4. The Violence Detection Method
► The definition of violence: "physical violence or accident resulting in human injury or pain", as defined in the MediaEval Violent Scenes Detection (VSD) task.
► Two main components of the method:
the representation of video shots, and
the learning of a violence model.
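The slides shown here do not detail the learning step; as a hypothetical sketch, a binary classifier (a linear SVM from scikit-learn is an assumed choice, not necessarily the paper's) could be trained on per-shot audio descriptors, with its decision scores used to rank shots from most to least violent:

```python
# Hypothetical sketch of the model-learning step: train a binary
# classifier on per-shot descriptors, then rank unseen shots by the
# classifier's decision score. Random data stands in for real BoAW
# descriptors; the linear SVM is an illustrative assumption.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 64))      # BoAW descriptor per training shot
y_train = rng.integers(0, 2, size=200)    # 1 = violent, 0 = non-violent

clf = LinearSVC(max_iter=10000, random_state=0).fit(X_train, y_train)

X_test = rng.normal(size=(10, 64))        # descriptors of 10 test shots
scores = clf.decision_function(X_test)    # higher = more violent
ranking = np.argsort(-scores)             # most-violent-first shot order
print(ranking.shape)                      # (10,)
```

The ranking, rather than the hard labels, is what feeds the precision-at-k style metrics used later in the talk.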
5. Audio Representation of Videos (1)
► Mel-Frequency Cepstral Coefficients (MFCCs)
are commonly used in speech recognition and music information retrieval (e.g., genre classification),
relate closely to human auditory perception, and
work well for the detection of excitement/non-excitement (i.e., as indicators of the excitement level of video segments).
► An MFCC-based audio representation is employed to describe the audio content of Hollywood movies.
► Using mid-level representations may help model video segments one step closer to human perception. Examples are:
bags of features, and
the upper units of convolutional networks or deep belief networks.
6. Audio Representation of Videos (2)
► We use mid-level audio features based on MFCCs (i.e., the BoAW approach).
► The BoAW approach with two different coding schemes:
Vector quantization (by k-means clustering):
dividing the feature vectors into groups, where each group is represented by its centroid (e.g., via the k-means clustering algorithm).
Sparse coding (by the LARS algorithm):
representing a feature vector as a linear combination of an over-complete set of basis vectors.
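The two coding schemes above can be sketched as follows. The codebook size, sparsity level, max-pooling choice, and the random stand-in MFCC frames are illustrative assumptions, not the paper's exact settings:

```python
# Sketch of the two BoAW coding schemes over per-frame MFCC vectors.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import sparse_encode

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 13))   # stand-in MFCC frames (training shots)
shot = rng.normal(size=(40, 13))     # MFCC frames of one video shot
k = 8                                # codebook / dictionary size (assumed)

# --- VQ-based: k-means codebook, shot = histogram of nearest audio words
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(train)
hist = np.bincount(km.predict(shot), minlength=k).astype(float)
vq_boaw = hist / hist.sum()          # L1-normalized shot descriptor

# --- SC-based: LARS codes against a dictionary (random here for brevity),
#     max-pooled over the shot's frames into one k-dim descriptor
dictionary = rng.normal(size=(k, 13))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)
codes = sparse_encode(shot, dictionary, algorithm='lars', n_nonzero_coefs=3)
sc_boaw = np.abs(codes).max(axis=0)

print(vq_boaw.shape, sc_boaw.shape)  # (8,) (8,)
```

In practice the dictionary would itself be learned from training MFCCs (e.g., via dictionary learning) rather than drawn at random, and the over-complete case corresponds to k larger than the MFCC dimensionality.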
10. Performance Evaluation
► Dataset:
32,708 video shots from 18 Hollywood movies of different genres (ranging from extremely violent movies to movies without violence).
Training set: 26,138 video shots from 15 movies.
Test set: 6,570 video shots from 3 movies.
► Ground truth:
generated by 7 human assessors; violent movie segments are annotated at the frame level.
Each video shot is labeled as violent or non-violent.
[Table: The characteristics of the training and test datasets]
11. Evaluation Metrics
► For this use case, the ranking of violent shots is more important than the raw classification decisions.
► Metrics beyond plain precision and recall are therefore required to compare performance.
► Average precision at 20 and at 100 are used (the official metrics of the MediaEval VSD task).
► R-precision, which can be seen as an alternative to precision at k, is also reported.
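A minimal sketch of these ranking metrics, assuming shots are already sorted by descending violence score and the labels mark the truly violent shots in that order (this is one common definition of average precision at k; the MediaEval task may normalize slightly differently):

```python
# Ranking metrics over a scored shot list, most-violent-first.
import numpy as np

def average_precision_at_k(labels, k):
    """Mean of precision@i over the ranks i <= k that hold a violent shot."""
    labels = np.asarray(labels[:k], dtype=float)
    hits = labels.cumsum()
    precisions = hits / np.arange(1, len(labels) + 1)
    return float((precisions * labels).sum() / max(labels.sum(), 1))

def r_precision(labels, r):
    """Precision at rank R, where R is the total number of violent shots."""
    return float(np.mean(labels[:r]))

ranked = [1, 0, 1, 1, 0, 0, 1, 0]          # toy ranking, 4 violent shots
print(average_precision_at_k(ranked, 5))   # (1/1 + 2/3 + 3/4)/3 ≈ 0.806
print(r_precision(ranked, 4))              # 3/4 = 0.75
```

Because both metrics only look at the top of the ranked list, they reward a method that surfaces violent shots early, which matches the previewing use case.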
12. Results & Discussions (1)
[Table: Average precision at 100 for the baseline and our methods]
[Table: Average precision at 20 & 100 and R-precision for the VQ- and SC-based methods]
13. Results & Discussions (2)
[Table: Average precision at 20 & 100 and R-precision on Independence Day]
[Table: Average precision at 20 & 100 and R-precision on Dead Poets Society]
[Table: Average precision at 20 & 100 and R-precision on Fight Club]
14. Results & Discussions (3)
Team | Features | Modality | AP@100*
ARF | Color, texture, audio and concepts | audio-visual | 0.651
Shanghai-Hong Kong | Trajectory-based features, SIFT, STIP, MFCCs | audio-visual | 0.624
TEC | Color, motion, acoustic features | audio-visual | 0.618
TUM | Acoustic energy and spectral, color, texture, optical flow | audio-visual | 0.484
SC-based (ours) | BoAW with sparse coding | audio | 0.444
VQ-based (ours) | BoAW with vector quantization | audio | 0.387
LIG-MIRM | Color, texture, bag of SIFT and MFCCs | audio-visual | 0.314
NII | Visual concepts learned from color and texture | visual | 0.308
DYNI-LSIS | Multi-scale local binary pattern | visual | 0.125
* Average precision at 100 (the official evaluation metric of the MediaEval VSD task)
17. Conclusions
► An approach for detecting violent content in movies at the video-shot level is presented.
► Mid-level audio features based on the BoAW approach with two different coding schemes are employed.
► Promising results are obtained:
the SC-based BoAW outperforms all uni-modal submissions to the MediaEval VSD task except one vision-based method.
► One significant point is that the average precision of the proposed method varies considerably across movies of different violence levels.
18. Future Work
► Construct more sophisticated mid-level representations for video content analysis.
► Augment the feature set with visual features (both low-level and mid-level) to further improve classification.
► Extend the approach to user-generated videos:
unlike Hollywood movies, these videos are not professionally edited (e.g., to enhance dramatic scenes).