Ioannis Pitas, Professor, Aristotle University of Thessaloniki, Department of Informatics (IEEE Fellow), Semantic 3DTV Content Analysis and Description
1. Human-centered 2D/3D multiview video analysis and description
Nikos Nikolaidis, Ioannis Pitas*
AIIA Lab, Aristotle University of Thessaloniki, Greece
Greek SP Jam, AUTH, 17/5/2012
2. Anthropocentric video content analysis tasks
Face detection and tracking
Body or body parts (head/hand/torso) detection
Eye/mouth detection
3D face model reconstruction
Face clustering/recognition
Facial expression analysis
Activity/gesture recognition
Semantic video content description/annotation
3. Anthropocentric video content description
Applications in video content search (e.g. in YouTube)
Applications in film and games postproduction:
Semantic annotation of video
Video indexing and retrieval
Matting
3D reconstruction
Anthropocentric (human-centered) approach:
humans are the most important video entities.
6. Face detection and tracking
[Example: tracked face ROIs in the 1st, 6th, 11th and 16th frames of a shot.]
Problem statement:
To detect the human faces that appear in each video
frame and localize their Region-Of-Interest (ROI).
To track the detected faces over the video frames.
7. Face detection and tracking
Tracking associates each detected face in the current
video frame with one in the next video frame.
Therefore, we can describe the face (ROI) trajectory
in a shot in (x,y) coordinates.
Actor instance definition: face region of interest
(ROI) plus other info
Actor appearance definition: face trajectory plus
other info
10. Face detection and tracking
[Block diagram: face detection initializes a feature-based face
tracking module (select features, track features); if occlusion is
detected, occlusion handling is applied before producing the result.]
11. Face detection and tracking
Tracking failure may occur, e.g., when a face disappears.
In such cases, face re-detection is employed.
If a re-detected face coincides with a face that is already being
tracked, the new detection is kept and the existing track is
discarded from any further processing.
Periodic face re-detection can be applied to account for new faces
entering the camera's field of view (typically every 5 video frames).
Forward and backward tracking can be combined when the entire
video is available.
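As a rough illustration of the detect/track/re-detect loop above, here is a minimal Python sketch using OpenCV's Haar cascade detector and KLT (Lucas-Kanade) feature tracking; the single-face handling, the feature counts and the failure threshold are illustrative assumptions, not the exact design of the presented system.

```python
import cv2
import numpy as np

# Haar cascade shipped with OpenCV (an assumption: the presented
# system may use a different face detector).
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(gray):
    """Return face ROIs as (x, y, w, h) rectangles."""
    return detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

def track_video(path, redetect_every=5):
    cap = cv2.VideoCapture(path)
    prev_gray, points, frame_idx = None, None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Periodic re-detection to pick up new faces (every 5 frames).
        if points is None or frame_idx % redetect_every == 0:
            faces = detect_faces(gray)
            if len(faces):
                x, y, w, h = faces[0]          # track the first face only
                mask = np.zeros_like(gray)
                mask[y:y + h, x:x + w] = 255
                # "Select features" step: good features inside the face ROI.
                points = cv2.goodFeaturesToTrack(
                    gray, maxCorners=50, qualityLevel=0.01,
                    minDistance=5, mask=mask)
        elif prev_gray is not None:
            # "Track features" step: pyramidal Lucas-Kanade optical flow.
            points, status, _ = cv2.calcOpticalFlowPyrLK(
                prev_gray, gray, points, None)
            points = points[status.flatten() == 1].reshape(-1, 1, 2)
            if len(points) < 10:               # tracking failure -> re-detect
                points = None
        prev_gray, frame_idx = gray, frame_idx + 1
        if points is not None:
            x, y, w, h = cv2.boundingRect(points.astype(np.float32))
            print(frame_idx, (x, y, w, h))     # face ROI trajectory
    cap.release()
```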
17. Multiview human detection
Problem statement:
Use information from multiple cameras to detect
bodies or body parts, e.g. head.
[Example views from Camera 4 and Camera 6.]
Applications:
Human detection/localization in postproduction.
Matting / segmentation initialization.
18. Multiview human detection
The retained voxels are projected to all views.
For every view we reject ROIs that have small overlap
with the regions resulting from the projection.
[Example projections in Camera 2 and Camera 4.]
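A minimal sketch of the ROI rejection test described above, assuming the retained voxels have already been projected into a binary mask for the given view; the 30% overlap threshold is an illustrative assumption.

```python
import numpy as np

def filter_rois(rois, projection_mask, min_overlap=0.3):
    """Keep only ROIs that sufficiently overlap the projected-voxel mask.

    rois: list of (x, y, w, h) detections in one view.
    projection_mask: HxW boolean array, True where retained voxels project.
    """
    kept = []
    for x, y, w, h in rois:
        roi = projection_mask[y:y + h, x:x + w]
        overlap = roi.sum() / float(w * h)     # fraction of ROI covered
        if overlap >= min_overlap:
            kept.append((x, y, w, h))
    return kept
```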
19. Eye detection
Problem statement:
To detect eye regions in
an image.
Input: A facial ROI produced by face detection
or tracking
Output: eye ROIs / eye center location.
Applications:
face analysis, e.g. expression recognition, face recognition,
etc.
gaze tracking.
20. Eye detection
Eye detection based on edges and distance maps from
the eye and the surrounding area.
Robustness to variations in illumination conditions.
The Canny edge detector is used.
For each pixel, a vector pointing to the closest edge
pixel is calculated. Its magnitude (length) and the slope
(angle) are used.
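The following sketch computes the described per-pixel vectors to the closest Canny edge pixel via SciPy's Euclidean distance transform; the Canny thresholds are illustrative assumptions.

```python
import cv2
import numpy as np
from scipy.ndimage import distance_transform_edt

def edge_vector_field(face_roi_gray):
    """For each pixel, the vector to the closest edge pixel:
    returns the (magnitude, angle) maps used as eye-detection features."""
    edges = cv2.Canny(face_roi_gray, 50, 150)   # illustrative thresholds
    # Distance transform of the non-edge pixels; also return, per pixel,
    # the coordinates of the nearest edge pixel.
    dist, (iy, ix) = distance_transform_edt(edges == 0, return_indices=True)
    ys, xs = np.indices(edges.shape)
    vy, vx = iy - ys, ix - xs                   # vector to closest edge
    magnitude = np.hypot(vx, vy)                # equals dist
    angle = np.arctan2(vy, vx)                  # slope of the vector
    return magnitude, angle
```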
24. 3D face reconstruction from uncalibrated video
Problem statement:
Input: facial images or facial video frames, taken from different view
angles, provided that the face neither changes expression nor speaks.
Output: 3D face model (saved as a VRML file) and its calibration in
relation to each camera. Facial pose estimation.
Applications:
3D face reconstruction, facial pose estimation, face recognition,
face verification.
25. 3D face reconstruction from uncalibrated video
Method overview
Manual selection of characteristic feature points on the
input facial images.
Use of an uncalibrated 3-D reconstruction algorithm.
Incorporation of the CANDIDE generic face model.
Deformation of the generic face model based on the 3-D
reconstructed feature points.
Re-projection of the face model grid onto the images and
manual refinement.
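The slides do not name the uncalibrated 3-D reconstruction algorithm; as one standard possibility, the sketch below applies Tomasi-Kanade-style affine factorization to the matched feature points (an assumption for illustration, not necessarily the algorithm actually used).

```python
import numpy as np

def affine_factorization(points_2d):
    """Affine 3-D reconstruction from matched points (Tomasi-Kanade).

    points_2d: array of shape (n_views, n_points, 2) holding the manually
    selected feature points, matched across the input images.
    Returns camera matrices and 3-D points up to an affine ambiguity.
    """
    n_views, n_points, _ = points_2d.shape
    # Center the measurements in each view (removes translation).
    centered = points_2d - points_2d.mean(axis=1, keepdims=True)
    # Stack into the 2F x P measurement matrix (x-rows, then y-rows per view).
    W = centered.transpose(0, 2, 1).reshape(2 * n_views, n_points)
    # Rank-3 factorization via SVD: W = M S.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    M = U[:, :3] * np.sqrt(s[:3])           # camera (motion) matrices
    S = np.sqrt(s[:3])[:, None] * Vt[:3]    # 3 x P shape matrix
    return M, S.T                           # 3-D points as P x 3
```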
26. 3D face reconstruction from uncalibrated video
Input: three images with a number of matched
characteristic feature points.
27. 3D face reconstruction from uncalibrated video
The CANDIDE face model has 104
nodes and 184 triangles.
Its nodes correspond to
characteristic points of the human
face, e.g. nose tip, outline of the
eyes, outline of the mouth etc.
29. 3D face reconstruction from uncalibrated video
[Figure: selected feature points, CANDIDE grid reprojection, and the
resulting 3D face reconstruction.]
30. 3D face reconstruction from uncalibrated video
Performance evaluation:
Reprojection error
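Assuming the usual definition, the reprojection error averages, over all views and feature points, the image distance between each measured point and the reprojection of its reconstructed 3-D position:

```latex
E = \frac{1}{FN} \sum_{j=1}^{F} \sum_{i=1}^{N}
    \left\| \mathbf{x}_{ij} - \hat{\mathbf{x}}_{ij} \right\|_2,
\qquad
\hat{\mathbf{x}}_{ij} = \operatorname{proj}\left(\mathbf{P}_j \mathbf{X}_i\right),
```

where the X_i are the reconstructed 3-D feature points, the P_j the estimated camera matrices, F the number of views and N the number of points.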
31. Face clustering
Problem statement:
To cluster a set of facial ROIs
Input: a set of face image ROIs
Output: several face clusters, each containing faces of
only one person.
Applications
Cluster the actor images, even if they belong to
different shots.
Cluster varying views of the same actor.
32. Face clustering
A four step clustering algorithm is used:
face tracking;
calculation of mutual information between the tracked
facial ROIs to form a similarity matrix (graph);
application of face trajectory heuristics, to improve
clustering performance;
similarity graph clustering.
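A rough sketch of steps 2 and 4 of this pipeline: pairwise mutual information between normalized facial ROIs forms the similarity matrix, and spectral clustering stands in for the (unspecified) graph clustering step; the histogram binning is an assumption.

```python
import numpy as np
from sklearn.metrics import mutual_info_score
from sklearn.cluster import SpectralClustering

def mutual_information(a, b, bins=32):
    """Mutual information between the gray levels of two equal-size ROIs."""
    hist_2d, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    return mutual_info_score(None, None, contingency=hist_2d)

def cluster_faces(rois, n_clusters):
    """rois: list of grayscale face ROIs resized to a common size."""
    n = len(rois)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            S[i, j] = S[j, i] = mutual_information(rois[i], rois[j])
    # Spectral clustering of the similarity graph (a stand-in for the
    # graph clustering step of the presented method).
    labels = SpectralClustering(
        n_clusters=n_clusters, affinity="precomputed").fit_predict(S)
    return labels
```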
33. Face clustering
Performance evaluation:
85.4% correct clustering rate, 6.6% clustering errors.
An additional error of 7.9% is introduced by the face detector.
When the face detector errors are manually removed:
93% correct clustering rate, 7% clustering errors;
the results for cluster 2 are improved.
34. Face recognition
Problem statement:
To identify a person from his/her face.
Input: a facial ROI
Output: the face ID (e.g. Hugh Grant, Sandra Bullock)
Applications:
Finding the movie cast
Retrieving movies of an actor
Surveillance applications
35. Face recognition
Two general approaches:
Subspace methods
Elastic graph matching methods.
36. Face recognition
Subspace methods
The original high-dimensional image space is projected
onto a low-dimensional one.
Face recognition according to a simple distance measure in
the low dimensional space.
Subspace methods: Eigenfaces (PCA), Fisherfaces (LDA),
ICA, NMF.
Main limitation of subspace methods: they require perfect
face alignment (registration).
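A minimal eigenfaces-style sketch of the subspace approach: PCA projection of aligned face images, followed by nearest-neighbor matching in the low-dimensional space (the 50-component subspace and the Euclidean distance are illustrative choices).

```python
import numpy as np
from sklearn.decomposition import PCA

def train_eigenfaces(train_images, train_ids, n_components=50):
    """train_images: (n_samples, h*w) flattened, aligned face images."""
    pca = PCA(n_components=n_components).fit(train_images)
    return pca, pca.transform(train_images), np.asarray(train_ids)

def recognize(pca, gallery, gallery_ids, probe_image):
    """Nearest-neighbor matching in the PCA subspace."""
    q = pca.transform(probe_image.reshape(1, -1))
    distances = np.linalg.norm(gallery - q, axis=1)
    return gallery_ids[np.argmin(distances)]
```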
37. Face recognition
Elastic graph matching (EGM) methods
Elastic graph matching is a simplified implementation of the
Dynamic Link Architecture (DLA).
DLA represents an object by a rectangular elastic grid.
A Gabor wavelet bank response is measured at each grid node.
Multiscale dilation-erosion at each grid node can be used,
leading to Morphological EGM (MEGM).
38. Face recognition
Output of normalized multi-scale dilation-erosion for
nine scales.
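A sketch of such a multiscale dilation-erosion feature: grayscale dilations at positive scales and erosions at negative scales (nine scales in total for max_scale = 4), crudely normalized per pixel; the structuring element and the normalization are assumptions, not the exact MEGM definitions.

```python
import cv2
import numpy as np

def multiscale_dilation_erosion(gray, max_scale=4):
    """Responses for scales -max_scale..+max_scale (9 scales for
    max_scale=4): erosion at negative scales, dilation at positive ones.
    Sampled at grid nodes, this yields an MEGM-style node feature."""
    gray = gray.astype(np.float32)
    responses = []
    for s in range(-max_scale, max_scale + 1):
        if s == 0:
            responses.append(gray.copy())
        else:
            k = 2 * abs(s) + 1
            se = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (k, k))
            op = cv2.dilate if s > 0 else cv2.erode
            responses.append(op(gray, se))
    feats = np.stack(responses, axis=-1)
    # Crude per-pixel normalization for illumination robustness.
    norms = np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8
    return feats / norms

def node_features(feature_maps, grid_nodes):
    """Sample the multiscale responses at (row, col) grid node positions."""
    return np.array([feature_maps[r, c] for r, c in grid_nodes])
```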
39. Face recognition
Performance evaluation:
Performance metric: Equal Error Rate (EER).
The M2VTS and XM2VTS facial image databases were used for
training/testing.
MEGM EER: 2%
Subspace techniques EER: 5-10%
Kernel subspace techniques EER: 2%
Elastic graph matching techniques have the best overall
performance.
40. Facial expression analysis
Problem statement:
To identify a facial expression
in a facial image or 3D point cloud.
Input: a face ROI or a point cloud
Output: the facial expression label
(e.g. neutral, anger, disgust, sadness, happiness, surprise, fear).
Applications
Affective video content description
41. Facial expression analysis
Two approaches for expression analysis on images /
videos:
Feature based ones
They use texture or geometrical information as features for
expression information extraction.
Template based ones
They use 3D or 2D head and facial models as templates for facial
expression recognition.
42. Facial expression analysis
Feature based methods
The features used can be:
geometric positions of a set of fiducial points on a
face;
multi-scale and multi-orientation Gabor wavelet
coefficients at the fiducial points;
optical flow information;
transform (e.g. DCT, ICA, NMF) coefficients;
image pixels reduced in dimension by using PCA
or LDA.
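As an illustration of the Gabor features in the list above, the sketch below samples the responses of a multi-scale, multi-orientation Gabor filter bank at fiducial points; all filter-bank parameters are illustrative assumptions.

```python
import cv2
import numpy as np

def gabor_bank(scales=(4, 8, 16), orientations=8):
    """Multi-scale, multi-orientation Gabor kernels (parameters assumed)."""
    kernels = []
    for lam in scales:                          # wavelengths in pixels
        for k in range(orientations):
            theta = np.pi * k / orientations
            kernels.append(cv2.getGaborKernel(
                ksize=(31, 31), sigma=lam / 2.0, theta=theta,
                lambd=lam, gamma=0.5, psi=0))
    return kernels

def gabor_features(gray, fiducial_points, kernels):
    """Filter-bank responses sampled at (row, col) fiducial points."""
    gray = gray.astype(np.float32)
    responses = [cv2.filter2D(gray, cv2.CV_32F, k) for k in kernels]
    return np.array([[r[y, x] for r in responses]
                     for y, x in fiducial_points])
```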
45. 3D facial expression recognition
Experiments on the BU-3DFE database
100 subjects.
6 facial expressions × 4 different intensities + neutral, per subject.
83 landmark points (used in our method).
Recognition rate: 93.6%
46. Action recognition
Problem statement:
To identify the action (elementary activity) of a person
Input: a single-view or multi-view video or a sequence of 3D
human body models (or point clouds).
Output: An activity label (walk, run, jump,…) for each frame or
for the entire sequence.
[Example actions: run, walk, jump p., jump f., bend.]
Applications:
Semantic video content description, indexing, retrieval
47. Action recognition
Action recognition on video data
Input feature vectors: binary masks resulting from coarse body
segmentation on each frame.
[Example actions: run, walk, jump p., jump f., bend.]
48. Action recognition
Perform fuzzy c-means (FCM) clustering on the feature
vectors of all frames of training action videos
49. Action recognition
Find the cluster centers, “dynemes”: key human
body poses
Characterize each action video by its distance
from the dynemes.
Apply Linear Discriminant Analysis (LDA) to
further decrease dimensionality.
50. Action recognition
Characterize each test video by its distances from the
dynemes
Perform LDA
Perform classification, e.g. by SVM
The method can operate on videos containing multiple
actions performed sequentially.
96% correct action recognition on a database containing 10
actions (walking, running, jumping, waving, etc.).
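A compact sketch of the dyneme pipeline from the last three slides: fuzzy c-means over per-frame posture vectors, video representation by distances to the dyneme centers, then LDA and an SVM classifier; the tiny FCM implementation and all parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

def fuzzy_cmeans(X, n_clusters, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means: returns the cluster centers ("dynemes")."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), n_clusters))
    U /= U.sum(axis=1, keepdims=True)            # fuzzy memberships
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None] - centers[None], axis=2) + 1e-9
        U = 1.0 / (d ** (2 / (m - 1)))           # standard FCM update
        U /= U.sum(axis=1, keepdims=True)
    return centers

def video_descriptor(frames, dynemes):
    """Represent a video by its mean distance to each dyneme."""
    d = np.linalg.norm(frames[:, None] - dynemes[None], axis=2)
    return d.mean(axis=0)

def train(train_videos, labels, n_dynemes=20):
    """train_videos: list of (n_frames, d) arrays of posture vectors
    (e.g. flattened binary body masks)."""
    all_frames = np.vstack(train_videos)
    dynemes = fuzzy_cmeans(all_frames, n_dynemes)
    X = np.array([video_descriptor(v, dynemes) for v in train_videos])
    lda = LinearDiscriminantAnalysis().fit(X, labels)
    clf = SVC().fit(lda.transform(X), labels)
    return dynemes, lda, clf

def predict(video, dynemes, lda, clf):
    x = video_descriptor(video, dynemes)[None]
    return clf.predict(lda.transform(x))[0]
```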
51. 3D action recognition
Action recognition on 3D data
Extension of the video-based activity recognition
algorithm
Input to the algorithm: binary voxel-based representation
of frames
52. Visual speech detection
Problem statement:
To detect the video frames in which a person speaks,
using only visual information.
Input: A mouth ROI produced by mouth detection
Output: yes/no visual speech indication.
The mouth is detected and localized by a similar technique
to eye detection.
53. Visual speech detection
Applications
Detection of important video frames (when someone
speaks)
Lip reading applications
Speech recognition applications
Speech intent detection and speaker determination in:
human-computer interaction applications;
video telephony and video conferencing systems.
Dialogue detection system for movies and TV programs
54. Visual speech detection
A statistical approach for visual speech detection,
based on mouth region luminance, will be presented.
The method employs face and mouth region detectors
and applies signal detection algorithms to determine
lip activity.
57. Visual speech detection
[Plot: the number of low grayscale-intensity pixels of the mouth
region over a video sequence; the rectangle encompasses the frames
where a person is speaking.]
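A toy version of the dark-pixel statistic plotted above: count the low-intensity pixels of the mouth ROI per frame, then flag speech where the count fluctuates strongly within a sliding window; the thresholds and window length are assumptions, simpler than the statistical detector actually presented.

```python
import numpy as np

def dark_pixel_counts(mouth_rois, intensity_thresh=60):
    """Per-frame count of low grayscale-intensity pixels in the mouth ROI
    (the mouth opening appears dark, so the count tracks lip activity)."""
    return np.array([(roi < intensity_thresh).sum() for roi in mouth_rois])

def speech_frames(counts, window=15, std_thresh=20.0):
    """Flag frames where the dark-pixel count varies strongly within a
    sliding window: a simple stand-in for the statistical detector."""
    half = window // 2
    flags = np.zeros(len(counts), dtype=bool)
    for t in range(len(counts)):
        seg = counts[max(0, t - half):t + half + 1]
        flags[t] = seg.std() > std_thresh
    return flags
```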
58. Anthropocentric video content description structure based on MPEG-7
Problem statement:
To organize and store the extracted semantic information for
video content description.
Input: video analysis results
Output: MPEG-7 compatible XML file.
Applications
Video indexing and retrieval
Video metadata analysis
59. Anthropocentric video content description
Anthropos-7 metadata description profile
Humans and their status/actions are the most important
metadata items.
Anthropos-7 presents an anthropocentric perspective for movie
annotation.
New video content descriptors (on top of MPEG-7 entities) are
introduced, in order to better store intermediate- and high-level
video metadata related to humans.
Human ids, actions, status descriptions are supported.
Information like:
“Actor H. Grant appears in this shot and he is smiling”
can be described in the proposed Anthropos-7 profile.
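To make the profile concrete, the sketch below emits a hypothetical Anthropos-7-style XML fragment for the example sentence above; the tag names are guesses modeled on the class names in the following slides, not the actual Anthropos-7 schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names modeled on the Anthropos-7 classes listed in
# the next slides; the real schema may differ.
shot = ET.Element("Shot", {"id": "shot_042"})
actor = ET.SubElement(shot, "Actor", {"id": "actor_01"})
ET.SubElement(actor, "Name").text = "H. Grant"
appearance = ET.SubElement(shot, "ActorAppearance", {"actorRef": "actor_01"})
instance = ET.SubElement(appearance, "ActorInstance", {"frame": "127"})
ET.SubElement(instance, "Expression").text = "happiness"  # "he is smiling"

print(ET.tostring(shot, encoding="unicode"))
```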
60. Anthropocentric video content description
Main Anthropos-7 Characteristics:
Anthropos-7 Descriptors gather semantic (intermediate level)
video metadata (tags).
Tags are selected to suit research areas like face detection,
object tracking, motion detection, facial expression extraction
etc.
High level semantic video metadata (events, e.g. dialogues) are
supported.
61. Anthropocentric video content description
Descriptor (D) & Description Scheme (DS) names:
Movie Class
Version Class
Scene Class
Shot Class
Take Class
Frame Class
Sound Class
Actor Class
Actor Appearance Class
Object Appearance Class
Actor Instance Class
Object Instance Class
High Semantic Class
Camera Class
Camera Use Class
Lens Class
62. Anthropocentric video content description
[Class diagram: Movie Class, Version Class, Scene Class, Shot Class,
Take Class, Frame Class, Sound Class, Actor Class, Object Appearance
Class, Actor Appearance Class, Object Instance Class, Actor Instance
Class, High Semantic Class, Camera Class, Camera Use Class, Lens
Class.]
63. Anthropocentric video content description
[Class diagram with the Shot Class highlighted.]
Shot Class attributes: ID, Shot S/N, Shot Description, Take Ref,
Start Time, End Time, Duration, Camera Use, Key Frame,
Number Of Frames, Color Information, Object Appearances.
64. Anthropocentric video content description
[Class diagram with the Actor Class highlighted.]
Actor Class attributes: ID, Name, Sex, Nationality, Role Name.
65. Anthropocentric video content description
[Class diagram with the Actor Appearance Class highlighted.]
Actor Appearance Class attributes: ID, Time In List, Time Out List,
Color Information, Motion, Actor Instances.
66. Anthropocentric video content description
Examples of Actor Instances in one Actor Appearance:
[Face ROIs in the 1st, 6th, 11th and 16th frames.]
67. Anthropocentric video content description
[Class diagram with the Actor Instance Class highlighted.]
Actor Instance Class attributes: ID, ROIDescriptionModel, Status,
Position In Frame, Expressions, Activity.
ROIDescriptionModel attributes: ID, Geometry (Feature Points,
Bounding Box, Convex Hull, Grid), Color.
68. Anthropos-7 Video Annotation software
A set of video annotation software tools has been created for
anthropocentric video content description, based on the
MPEG-7 description structure presented above.
These software tools perform (semi-)automatic video content
annotation.
The video metadata results (XML file) can be edited manually by a
software tool called Anthropos-7 editor.
69. Multi-view Anthropos-7
The goal is to link the frame-based information
instantiated by the ActorInstanceType DS at each
video channel (e.g. the L+R channels in stereo videos):
in other words, to link the instances of the
ActorInstanceType DS belonging to different channels,
based on disparity or color similarity.
71. Multi-view Anthropos-7
The proposed structure organizes the data in 3 levels
Level 1
Refers to the Correspondences tag
Level 2
Refers to the CorrespondingInstancesType DS
Level 3
Refers to the CorrespondingInstanceType DS
74. Conclusions
Anthropocentric movie description is very important,
because in most cases movies are based on human
characters, their actions and status.
We presented a number of anthropocentric video
analysis and description techniques that have
applications in film/games postproduction, as well as
in semantic video search.
Similar descriptions can be devised for stereo and
multiview video.
75. 3DTVS project
European project: 3DTV Content Search (3DTVS)
Start: 1/11/2011
Duration: 3 years
www.3dtvs-project.eu
Stay tuned!
76. Thank you very much for your attention!
Further info:
www.aiia.csd.auth.gr
pitas@aiia.csd.auth.gr
Tel: +30-2310-996304