Video Copy Detection Using Visual and Semantic Fingerprinting

ELIS – Multimedia Lab

Video Copy Detection using
Visual and Semantic Fingerprinting

Presentations at MIT
November 17, 2011

Wesley De Neve

Multimedia Lab
Dept. of Electronics & Information Systems
Faculty of Engineering & Architecture
Ghent University – IBBT
Ghent, Belgium

Image and Video Systems Lab
Dept. of Electrical Engineering
College of Information Science & Technology
KAIST
Daejeon, South Korea


Context of the Research Effort

• Worked in South Korea during the past four years
– at ICU and KAIST in Daejeon

– main focus on advising graduate students
• keeping track of the state-of-the-art
• identifying and solving research questions
• helping communicate research results

– main research topics
• data-driven image annotation and tag relevance learning
• face recognition using online social network context
• video surveillance and privacy protection
• video copy detection


Outline

• Introduction
• Video copy detection
– using visual features
– using semantic features
• Experimental results
• Conclusions



Introduction (1/3)

• Increasing consumption of online video content
– thanks to easy-to-use multimedia devices and online services
– thanks to cheap storage and bandwidth
– thanks to an increasing number of people going online

• Increasing availability of online video content
– digitization of professional video archives
– user-generated video content



Introduction (2/3)

• Some statistics
– professional video content
• BBC Motion Gallery (as of January 2009)
o contains over 2.5 million hours of video content
o dating back 60 years

– user-generated video content
• YouTube (as of May 2011)
o 48 hours of new video content are uploaded each minute



Introduction (3/3)

• Problem: digital video overload
– our ability to automatically manage video clips does not keep up with
our ability to create and store video clips
– this makes it increasingly difficult to, e.g., find video clips of interest

• Part of the solution: techniques for video copy detection
– help in managing vast libraries of video clips
• reduction of visual redundancy in video search results
• detection of copyright infringement
• metadata propagation along visual links
• media usage monitoring
• search by video query



Duplicates versus Near-Duplicates

• Duplicate video clips
– exact video copies
– can be easily detected using hashing

• Near-duplicate video clips (NDVCs)
– transformed video clips
– detection is challenging

[Figure: an original video clip and three transformed versions: black & white, cropping, and flipping]


Applications: Reduction of Visual Redundancy (1/2)

[Figure: two examples of visual redundancy]



Applications: Detection of Copyright Infringement (2/2)

• Missed by YouTube’s ContentID

• Transformations used
o scaling
o recompression



System for Video Copy Detection: Conceptual Design

[Figure: a query video clip is matched against a collection of reference video clips; video matching is realized by means of video signatures; a copy is found when the query video clip ≈ an original video clip]



Video Signatures

• Aim at uniquely characterizing a video clip

• Commonly consist of visual features
– e.g., color, texture, shape, and motion

• Are low-dimensional representations
– in order to facilitate more efficient matching

[Figure: dimensionality reduction from 921600-D (1280x720) to 128-D (128 bins)]
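
As a rough illustration of the dimensionality reduction above, the following sketch reduces a 1280x720 grayscale frame to a normalized 128-bin intensity histogram. The choice of an intensity histogram, the 8-bit grayscale input, and the name `histogram_signature` are assumptions made for illustration; the slides do not state which visual feature fills the 128 bins.

```python
def histogram_signature(pixels, bins=128, max_value=255):
    """Reduce a flat list of intensities to a normalized histogram."""
    counts = [0] * bins
    for p in pixels:
        # Map an intensity in 0..max_value onto a bin in 0..bins-1.
        index = min(p * bins // (max_value + 1), bins - 1)
        counts[index] += 1
    total = len(pixels)
    return [c / total for c in counts] if total else [0.0] * bins


# A synthetic 1280x720 frame (921600 intensities) becomes a 128-D vector.
frame = [(x + y) % 256 for y in range(720) for x in range(1280)]
signature = histogram_signature(frame)
```

Two such signatures can be compared with a simple vector distance, which is far cheaper than comparing the frames pixel by pixel.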



Room for Improvement
• Observations
– no single type of visual feature has thus far emerged that is robust
against all possible transformations
– transformations tend to preserve semantic features

[Figure: semantic features (textual descriptions) of a video frame: helmet, face, wall, clothes]

• Research question
– can semantic features be used in addition to visual features?



Extraction of Semantic Features (1/2)

• Question
– how to extract semantic features?

[Figure: a video frame, a question mark, and the semantic features ‘helmet’, ‘face’, ‘wall’, and ‘clothes’]

• Our answer
– by means of binary concept classifiers



Extraction of Semantic Features (2/2)

• Example of a binary classifier for ‘apple’

[Figure: an ‘apple’ classifier labels one input image as ‘apple’ and another as ‘not apple’]

• Concept classifiers
– pieces of logic that know what, e.g., an “apple” image looks like
– more formally: pieces of logic that know the statistical distribution of
the visual features of, e.g., representative “apple” images
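
To make the notion of "knowing the statistical distribution of the visual features" concrete, here is a minimal sketch of a binary concept classifier that fits one diagonal Gaussian to the positive examples and one to the negative examples, and then picks the more likely class. This is an illustrative stand-in, not the classifier used in the experiments (the slides report SVMs); the class name, the equal class priors, and the two-dimensional toy features are all assumptions.

```python
import math


class GaussianConceptClassifier:
    """Binary classifier with one diagonal Gaussian per class."""

    def fit(self, positives, negatives):
        self.models = {True: self._estimate(positives),
                       False: self._estimate(negatives)}
        return self

    @staticmethod
    def _estimate(vectors):
        n, dims = len(vectors), len(vectors[0])
        means = [sum(v[d] for v in vectors) / n for d in range(dims)]
        # A small floor keeps every variance strictly positive.
        variances = [max(sum((v[d] - means[d]) ** 2 for v in vectors) / n, 1e-6)
                     for d in range(dims)]
        return means, variances

    @staticmethod
    def _log_likelihood(x, model):
        means, variances = model
        return sum(-0.5 * math.log(2 * math.pi * var) - (xi - mu) ** 2 / (2 * var)
                   for xi, mu, var in zip(x, means, variances))

    def predict(self, x):
        """Return True when x is more likely under the concept model."""
        return (self._log_likelihood(x, self.models[True])
                > self._log_likelihood(x, self.models[False]))


# Hypothetical 2-D features (say, redness and roundness) scaled to [0, 1].
apple = [[0.9, 0.8], [0.8, 0.9], [0.85, 0.75]]
not_apple = [[0.2, 0.3], [0.1, 0.2], [0.3, 0.1]]
classifier = GaussianConceptClassifier().fit(apple, not_apple)
```

Calling `classifier.predict([0.88, 0.82])` then answers the question "does this image show an apple?" with a single boolean, which is the kind of output a binary concept classifier provides.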



Challenges of Concept Classification (1/2)

• Limited effectiveness

– false negatives (due to intra-concept variability)
[Figure: the ‘apple’ classifier labels an apple image as ‘not apple’]

– false positives (due to inter-concept similarity)
[Figure: the ‘apple’ classifier labels a non-apple image as ‘apple’]



Challenges of Concept Classification (2/2)

• Limited semantic coverage
– only a limited number of concept classifiers can be supported
• due to the high cost of training
• experts need to collect training images for each concept classifier

[Figure: training images for ‘apple’, ‘orange’, and ‘strawberry’]


Classifier-Based Semantic Feature Extraction
for Video Copy Detection
• The challenges of concept classification also affect a semantic approach
to video copy detection

• How to deal with this?
– limited effectiveness of concept classifiers
• use of semantic features that can be easily and reliably detected
o e.g., ‘people’

– limited semantic coverage of concept classifiers
• use of semantic features that are general in nature
o e.g., ‘people’ versus ‘Barack Obama’
• use of the temporal variation of the semantic features
o extraction of semantic features at the level of video shots


Reference and Query Video Clips

• 311 reference video clips with a total duration of 170 h
– 101 video clips from MUSCLE-VCD-2007
• total duration: 80 h
– 210 video clips from TRECVID 2008
• total duration: 90 h

• 500 query video clips
– the result of five transformations applied to 100 video clips randomly
selected from the 311 reference video clips



Transformations Applied

[Figure: an original frame and five transformed versions: blur, pattern insertion, caption insertion, change in brightness, and crop]



Semantic Features

• Use of Support Vector Machines (SVM)
– binary classifiers with state-of-the-art effectiveness

• 32 semantic concepts used
– mean average precision (MAP): 0.51
– ‘gravel’, ‘park’, ‘pavement’, ‘road’, ‘rock’, ‘sand’, ‘sidewalk’, ‘face’,
‘people’, ‘indoor’, ‘field’, ‘peak’, ‘wood’, ‘night’, ‘street’, ‘flowers’,
‘leaves’, ‘trees’, ‘cloudy’, ‘sunny’, ‘sunset’, ‘brick’, ‘arch’, ‘buildings’,
‘wall’, ‘windows’, ‘beach’, ‘high-wave’, ‘low-wave’, ‘still water’,
‘mirrored water’, and ‘snow’



Video Matching

[Figure: the query video clip slides along the reference video clip; each position i of the sliding window yields a distance di (d1 to d5 shown)]

• di: linearly weighted combination of the Manhattan distances between the
– visual features of the query video clip and the part of the reference
video clip in the sliding window
– semantic features of the query video clip and the part of the reference
video clip in the sliding window
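
The sliding-window matching described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: each shot is represented as a dictionary with a `"visual"` and a `"semantic"` vector, the two Manhattan distances are summed over the shots in the window, and a single hypothetical `weight` in [0, 1] mixes them as `weight * visual + (1 - weight) * semantic`.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two equally long vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def window_distance(query, window, weight):
    """Weighted sum of the visual and semantic distances over all shots."""
    visual = sum(manhattan(q["visual"], r["visual"])
                 for q, r in zip(query, window))
    semantic = sum(manhattan(q["semantic"], r["semantic"])
                   for q, r in zip(query, window))
    return weight * visual + (1.0 - weight) * semantic


def best_match(query, reference, weight=0.5):
    """Slide the query over the reference; return (offset, distance)."""
    if len(query) > len(reference):
        raise ValueError("query is longer than the reference")
    distances = [window_distance(query, reference[i:i + len(query)], weight)
                 for i in range(len(reference) - len(query) + 1)]
    offset = min(range(len(distances)), key=distances.__getitem__)
    return offset, distances[offset]


# Toy data: the query is a slightly altered copy of reference shots 1 and 2.
reference = [
    {"visual": [0.1, 0.9], "semantic": [0, 1]},
    {"visual": [0.5, 0.5], "semantic": [1, 1]},
    {"visual": [0.8, 0.2], "semantic": [1, 0]},
    {"visual": [0.3, 0.7], "semantic": [0, 0]},
]
query = [
    {"visual": [0.52, 0.48], "semantic": [1, 1]},
    {"visual": [0.78, 0.22], "semantic": [1, 0]},
]
offset, distance = best_match(query, reference)
```

In a complete detector, the smallest distance would still have to be compared against a decision threshold before a copy is reported; that threshold is not given in the slides.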


Normalized Detection Cost Ratio (NDCR)

• Definition
NDCR = P_miss + β · R_FA
where
P_miss = N_FN / N_queries (missed detection probability)
R_FA = N_FP / (T_query · T_refdata) (false alarm rate, per hour)

• We set β to a value of 2 (“balanced profile”)
– see “CBCD Evaluation Plan TRECVID 2010 v3”
– assigns a higher cost to raising false alarms

• A value of zero indicates perfect detection performance
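
The definition above translates directly into a few lines of code. In the sketch below, the function and parameter names are illustrative, and the example counts (25 missed queries, 17 false alarms, 10 hours of query video) are hypothetical values chosen only to show the arithmetic; the 500 queries and 170 hours of reference video are taken from the experimental setup.

```python
def ndcr(n_false_negatives, n_queries, n_false_positives,
         t_query_hours, t_refdata_hours, beta=2.0):
    """Normalized Detection Cost Ratio: miss probability plus weighted false alarm rate."""
    p_miss = n_false_negatives / n_queries
    r_fa = n_false_positives / (t_query_hours * t_refdata_hours)
    return p_miss + beta * r_fa


# Hypothetical outcome: 0.05 + 2 * (17 / 1700) = 0.07
example = ndcr(n_false_negatives=25, n_queries=500, n_false_positives=17,
               t_query_hours=10, t_refdata_hours=170)

# Perfect detection: no misses and no false alarms give a cost of zero.
perfect = ndcr(0, 500, 0, 10, 170)
```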


Semantic Concept Models Used

• AP (Average Precision)
– precision: #true positives / (#true positives + #false positives)
– averaged over 100 query video clips
• MAP (Mean AP) of the 32 semantic concept models: 0.52
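
For reference, the quantities above can be computed as in the following sketch. It assumes the common non-interpolated form of average precision (the mean of the precision values at the ranks where a relevant item is retrieved); the slides do not state which AP variant was used, and the function names are illustrative.

```python
def precision(true_positives, false_positives):
    """Fraction of the detections that are correct."""
    detected = true_positives + false_positives
    return true_positives / detected if detected else 0.0


def average_precision(ranked_relevance):
    """Mean of precision@k over the ranks k that hold a relevant item."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """Average AP over several ranked lists (for example, one per concept)."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Relevant items at ranks 1 and 3: AP = (1/1 + 2/3) / 2
ap = average_precision([True, False, True, False])
```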


Effectiveness of Bimodal Fusion

• Bimodal fusion of visual and semantic features outperforms the
separate use of either type of features for all transformations


Comparison of Effectiveness of Video Copy Detection

• In general, bimodal fusion of visual and semantic features
outperforms the other techniques for video copy detection


Robustness Against Variation in Semantic Coverage

• The more concepts are used, the more effective NDVC detection becomes


Influence of Effectiveness of Semantic Concept Detection

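[Figure: effectiveness of NDVC detection as a function of the MAP of the concept detectors; annotated values: 0.925 and 0.711]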

• The effectiveness of NDVC detection starts to stabilize once
the MAP of the concept detectors is higher than 0.3


Time Complexity of Creating Video Signatures

• Measurements expressed in seconds
– include the time to perform
o shot segmentation
o keyframe selection
o feature extraction


Time Complexity of Matching

• Measurements expressed in seconds
– include the time to
o compute the temporal entropy for the proposed method
o perform matching using a sliding window approach



Storage Complexity

• Measurements expressed in Mbytes
– storing the 32 semantic features requires 4 bytes per shot
– storing the MPEG-7 visual features requires about 0.4 kbytes per shot
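
The 4-byte figure above is consistent with storing one bit per concept (32 concepts × 1 bit = 32 bits = 4 bytes). The sketch below shows such a packing; the one-bit-per-concept encoding is an assumption inferred from those numbers, and the function names and the bit order are illustrative.

```python
def pack_concepts(flags):
    """Pack 32 boolean concept decisions into 4 bytes, one bit per concept."""
    if len(flags) != 32:
        raise ValueError("expected 32 concept decisions")
    value = 0
    for position, flag in enumerate(flags):
        if flag:
            value |= 1 << position
    return value.to_bytes(4, "little")


def unpack_concepts(packed):
    """Recover the 32 boolean concept decisions from 4 packed bytes."""
    value = int.from_bytes(packed, "little")
    return [bool((value >> position) & 1) for position in range(32)]


# Hypothetical decisions for one shot: every third concept is detected.
flags = [i % 3 == 0 for i in range(32)]
packed = pack_concepts(flags)
```

Under this encoding the semantic part of a shot signature is roughly one hundredth the size of the 0.4 kbytes needed for the MPEG-7 visual features.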



Conclusions (1/3)

• Discussed the novel idea of using semantic features for the
purpose of video copy detection
– given the observation that no single type of visual feature exists that
is robust against all possible transformations
– given the observation that transformations tend to preserve
semantic information
– (given the observation that the semantic features extracted can be
reused for annotation purposes)

– fusion of visual and semantic features outperforms
• the separate use of either type of features
• temporal ordinal measurement, PCA-SIFT, and BoVW


Conclusions (2/3)
• Current and future extensions

– use of the temporal variation of concept confidence values
• studied by the National University of Singapore

– classifier-free semantic feature extraction
• takes advantage of collective knowledge available on Flickr
o unrestricted semantic concept vocabulary (higher coverage)
• accepted for publication in IEEE Trans. on CSVT

– improved semantic distance measurement

– indexing of semantic features


Conclusions (3/3)

• Publications of interest

– “Bimodal fusion of low-level visual features and high-level semantic
features for near-duplicate video clip detection”
o published in Signal Processing – Image Communication

– “Near-Duplicate Video Clip Detection Using Model-Free Semantic
Concept Detection and Adaptive Semantic Distance Measurement”
o published in IEEE Trans. on Circuits and Systems for Video Technology
