Ioannis Pitas, Professor, Aristotle University of Thessaloniki, Department of Informatics (IEEE Fellow), Semantic 3DTV Content Analysis and Description
1. Human-centered 2D/3D multiview video analysis and description
Nikos Nikolaidis, Ioannis Pitas*
AIIA Lab, Aristotle University of Thessaloniki, Greece
Greek SP Jam, AUTH, 17/5/2012
2. Anthropocentric video content analysis tasks
Face detection and tracking
Body or body parts (head/hand/torso) detection
Eye/mouth detection
3D face model reconstruction
Face clustering/recognition
Facial expression analysis
Activity/gesture recognition
Semantic video content description/annotation
3. Anthropocentric video content description
Applications in video content search (e.g. in YouTube)
Applications in film and games postproduction:
Semantic annotation of video
Video indexing and retrieval
Matting
3D reconstruction
Anthropocentric (human-centered) approach:
humans are the most important video entities.
6. Face detection and tracking
[Example: tracked face ROIs in the 1st, 6th, 11th and 16th frames of a shot.]
Problem statement:
To detect the human faces that appear in each video
frame and localize their Region-Of-Interest (ROI).
To track the detected faces over the video frames.
7. Face detection and tracking
Tracking associates each detected face in the current
video frame with one in the next video frame.
Therefore, we can describe the face (ROI) trajectory
in a shot in (x,y) coordinates.
Actor instance definition: face region of interest
(ROI) plus other info
Actor appearance definition: face trajectory plus
other info
10. Face detection and tracking
[Block diagram: face detection initializes a feature-based face
tracking module (select features, track features); if occlusion is
detected, occlusion handling is applied before producing the result.]
11. Face detection and tracking
Tracking failure may occur, e.g., when a face disappears.
In such cases, face re-detection is employed.
If a re-detected face coincides with a face that is already being
tracked, the new detection is kept and the existing track is
discarded from any further processing.
Periodic face re-detection can be applied to account for new faces
entering the camera's field of view (typically every 5 video frames).
Forward and backward tracking can be combined when the entire
video is available.
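As a rough illustration of the detect/track/re-detect loop above, here is a minimal Python sketch using OpenCV's Haar cascade detector and KLT (Lucas-Kanade) feature tracking; the single-face handling, the feature counts and the failure threshold are illustrative assumptions, not the exact design of the presented system.

```python
import cv2
import numpy as np

# Haar cascade shipped with OpenCV (an assumption: the presented
# system may use a different face detector).
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(gray):
    """Return face ROIs as (x, y, w, h) rectangles."""
    return detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

def track_video(path, redetect_every=5):
    cap = cv2.VideoCapture(path)
    prev_gray, points, frame_idx = None, None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Periodic re-detection to pick up new faces (every 5 frames).
        if points is None or frame_idx % redetect_every == 0:
            faces = detect_faces(gray)
            if len(faces):
                x, y, w, h = faces[0]          # track the first face only
                mask = np.zeros_like(gray)
                mask[y:y + h, x:x + w] = 255
                # "Select features" step: good features inside the face ROI.
                points = cv2.goodFeaturesToTrack(
                    gray, maxCorners=50, qualityLevel=0.01,
                    minDistance=5, mask=mask)
        elif prev_gray is not None:
            # "Track features" step: pyramidal Lucas-Kanade optical flow.
            points, status, _ = cv2.calcOpticalFlowPyrLK(
                prev_gray, gray, points, None)
            points = points[status.flatten() == 1].reshape(-1, 1, 2)
            if len(points) < 10:               # tracking failure -> re-detect
                points = None
        prev_gray, frame_idx = gray, frame_idx + 1
        if points is not None:
            x, y, w, h = cv2.boundingRect(points.astype(np.float32))
            print(frame_idx, (x, y, w, h))     # face ROI trajectory
    cap.release()
```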
17. Multiview human detection
Problem statement:
Use information from multiple cameras to detect
bodies or body parts, e.g. head.
[Example views from Camera 4 and Camera 6.]
Applications:
Human detection/localization in postproduction.
Matting / segmentation initialization.
18. Multiview human detection
The retained voxels are projected to all views.
For every view we reject ROIs that have small overlap
with the regions resulting from the projection.
[Example projections in Camera 2 and Camera 4.]
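A minimal sketch of the ROI rejection test described above, assuming the retained voxels have already been projected into a binary mask for the given view; the 30% overlap threshold is an illustrative assumption.

```python
import numpy as np

def filter_rois(rois, projection_mask, min_overlap=0.3):
    """Keep only ROIs that sufficiently overlap the projected-voxel mask.

    rois: list of (x, y, w, h) detections in one view.
    projection_mask: HxW boolean array, True where retained voxels project.
    """
    kept = []
    for x, y, w, h in rois:
        roi = projection_mask[y:y + h, x:x + w]
        overlap = roi.sum() / float(w * h)     # fraction of ROI covered
        if overlap >= min_overlap:
            kept.append((x, y, w, h))
    return kept
```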
19. Eye detection
Problem statement:
To detect eye regions in
an image.
Input: A facial ROI produced by face detection
or tracking
Output: eye ROIs / eye center location.
Applications:
face analysis, e.g. expression recognition, face recognition,
etc.
gaze tracking.
20. Eye detection
Eye detection based on edges and distance maps from
the eye and the surrounding area.
Robustness to variations in illumination conditions.
The Canny edge detector is used.
For each pixel, a vector pointing to the closest edge
pixel is calculated. Its magnitude (length) and the slope
(angle) are used.
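The following sketch computes the described per-pixel vectors to the closest Canny edge pixel via SciPy's Euclidean distance transform; the Canny thresholds are illustrative assumptions.

```python
import cv2
import numpy as np
from scipy.ndimage import distance_transform_edt

def edge_vector_field(face_roi_gray):
    """For each pixel, the vector to the closest edge pixel:
    returns the (magnitude, angle) maps used as eye-detection features."""
    edges = cv2.Canny(face_roi_gray, 50, 150)   # illustrative thresholds
    # Distance transform of the non-edge pixels; also return, per pixel,
    # the coordinates of the nearest edge pixel.
    dist, (iy, ix) = distance_transform_edt(edges == 0, return_indices=True)
    ys, xs = np.indices(edges.shape)
    vy, vx = iy - ys, ix - xs                   # vector to closest edge
    magnitude = np.hypot(vx, vy)                # equals dist
    angle = np.arctan2(vy, vx)                  # slope of the vector
    return magnitude, angle
```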
24. 3D face reconstruction from uncalibrated video
Problem statement:
Input: facial images or facial video frames, taken from different view
angles, provided that the face neither changes expression nor speaks.
Output: 3D face model (saved as a VRML file) and its calibration in
relation to each camera. Facial pose estimation.
Applications:
3D face reconstruction, facial pose estimation, face recognition,
face verification.
25. 3D face reconstruction from uncalibrated video
Method overview
Manual selection of characteristic feature points on the
input facial images.
Use of an uncalibrated 3-D reconstruction algorithm.
Incorporation of the CANDIDE generic face model.
Deformation of the generic face model based on the 3-D
reconstructed feature points.
Re-projection of the face model grid onto the images and
manual refinement.
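The slides do not name the uncalibrated 3-D reconstruction algorithm; as one standard possibility, the sketch below applies Tomasi-Kanade-style affine factorization to the matched feature points (an assumption for illustration, not necessarily the algorithm actually used).

```python
import numpy as np

def affine_factorization(points_2d):
    """Affine 3-D reconstruction from matched points (Tomasi-Kanade).

    points_2d: array of shape (n_views, n_points, 2) holding the manually
    selected feature points, matched across the input images.
    Returns camera matrices and 3-D points up to an affine ambiguity.
    """
    n_views, n_points, _ = points_2d.shape
    # Center the measurements in each view (removes translation).
    centered = points_2d - points_2d.mean(axis=1, keepdims=True)
    # Stack into the 2F x P measurement matrix (x-rows, then y-rows per view).
    W = centered.transpose(0, 2, 1).reshape(2 * n_views, n_points)
    # Rank-3 factorization via SVD: W = M S.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    M = U[:, :3] * np.sqrt(s[:3])           # camera (motion) matrices
    S = np.sqrt(s[:3])[:, None] * Vt[:3]    # 3 x P shape matrix
    return M, S.T                           # 3-D points as P x 3
```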
26. 3D face reconstruction from uncalibrated video
Input: three images with a number of matched
characteristic feature points.
27. 3D face reconstruction from uncalibrated video
The CANDIDE face model has 104
nodes and 184 triangles.
Its nodes correspond to
characteristic points of the human
face, e.g. nose tip, outline of the
eyes, outline of the mouth etc.
29. 3D face reconstruction from uncalibrated video
[Figure: selected feature points, CANDIDE grid reprojection, and the
resulting 3D face reconstruction.]
30. 3D face reconstruction from uncalibrated video
Performance evaluation:
Reprojection error
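Assuming the usual definition, the reprojection error averages, over all views and feature points, the image distance between each measured point and the reprojection of its reconstructed 3-D position:

```latex
E = \frac{1}{FN} \sum_{j=1}^{F} \sum_{i=1}^{N}
    \left\| \mathbf{x}_{ij} - \hat{\mathbf{x}}_{ij} \right\|_2,
\qquad
\hat{\mathbf{x}}_{ij} = \operatorname{proj}\left(\mathbf{P}_j \mathbf{X}_i\right),
```

where the X_i are the reconstructed 3-D feature points, the P_j the estimated camera matrices, F the number of views and N the number of points.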
31. Face clustering
Problem statement:
To cluster a set of facial ROIs
Input: a set of face image ROIs
Output: several face clusters, each containing faces of
only one person.
Applications
Cluster the actor images, even if they belong to
different shots.
Cluster varying views of the same actor.
32. Face clustering
A four step clustering algorithm is used:
face tracking;
calculation of mutual information between the tracked
facial ROIs to form a similarity matrix (graph);
application of face trajectory heuristics, to improve
clustering performance;
similarity graph clustering.
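A rough sketch of steps 2 and 4 of this pipeline: pairwise mutual information between normalized facial ROIs forms the similarity matrix, and spectral clustering stands in for the (unspecified) graph clustering step; the histogram binning is an assumption.

```python
import numpy as np
from sklearn.metrics import mutual_info_score
from sklearn.cluster import SpectralClustering

def mutual_information(a, b, bins=32):
    """Mutual information between the gray levels of two equal-size ROIs."""
    hist_2d, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    return mutual_info_score(None, None, contingency=hist_2d)

def cluster_faces(rois, n_clusters):
    """rois: list of grayscale face ROIs resized to a common size."""
    n = len(rois)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            S[i, j] = S[j, i] = mutual_information(rois[i], rois[j])
    # Spectral clustering of the similarity graph (a stand-in for the
    # graph clustering step of the presented method).
    labels = SpectralClustering(
        n_clusters=n_clusters, affinity="precomputed").fit_predict(S)
    return labels
```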
33. Face clustering
Performance evaluation:
85.4% correct clustering rate, 6.6% clustering errors.
An additional error of 7.9% is introduced by the face detector.
When the face detector errors are manually removed:
93% correct clustering rate, 7% clustering errors;
the results for cluster 2 are improved.
34. Face recognition
Problem statement:
To identify a person from his/her face.
Input: a facial ROI
Output: the face ID (e.g. Hugh Grant, Sandra Bullock)
Applications:
Finding the movie cast
Retrieving movies of an actor
Surveillance applications
35. Face recognition
Two general approaches:
Subspace methods
Elastic graph matching methods.
36. Face recognition
Subspace methods
The original high-dimensional image space is projected
onto a low-dimensional one.
Face recognition according to a simple distance measure in
the low dimensional space.
Subspace methods: Eigenfaces (PCA), Fisherfaces (LDA),
ICA, NMF.
Main limitation of subspace methods: they require perfect
face alignment (registration).
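A minimal eigenfaces-style sketch of the subspace approach: PCA projection of aligned face images, followed by nearest-neighbor matching in the low-dimensional space (the 50-component subspace and the Euclidean distance are illustrative choices).

```python
import numpy as np
from sklearn.decomposition import PCA

def train_eigenfaces(train_images, train_ids, n_components=50):
    """train_images: (n_samples, h*w) flattened, aligned face images."""
    pca = PCA(n_components=n_components).fit(train_images)
    return pca, pca.transform(train_images), np.asarray(train_ids)

def recognize(pca, gallery, gallery_ids, probe_image):
    """Nearest-neighbor matching in the PCA subspace."""
    q = pca.transform(probe_image.reshape(1, -1))
    distances = np.linalg.norm(gallery - q, axis=1)
    return gallery_ids[np.argmin(distances)]
```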
37. Face recognition
Elastic graph matching (EGM) methods
Elastic graph matching is a simplified implementation of the
Dynamic Link Architecture (DLA).
DLA represents an object by a rectangular elastic grid.
A Gabor wavelet bank response is measured at each grid node.
Multiscale dilation-erosion at each grid node can be used,
leading to Morphological EGM (MEGM).
38. Face recognition
Output of normalized multi-scale dilation-erosion for
nine scales.
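A sketch of such a multiscale dilation-erosion feature: grayscale dilations at positive scales and erosions at negative scales (nine scales in total for max_scale = 4), crudely normalized per pixel; the structuring element and the normalization are assumptions, not the exact MEGM definitions.

```python
import cv2
import numpy as np

def multiscale_dilation_erosion(gray, max_scale=4):
    """Responses for scales -max_scale..+max_scale (9 scales for
    max_scale=4): erosion at negative scales, dilation at positive ones.
    Sampled at grid nodes, this yields an MEGM-style node feature."""
    gray = gray.astype(np.float32)
    responses = []
    for s in range(-max_scale, max_scale + 1):
        if s == 0:
            responses.append(gray.copy())
        else:
            k = 2 * abs(s) + 1
            se = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (k, k))
            op = cv2.dilate if s > 0 else cv2.erode
            responses.append(op(gray, se))
    feats = np.stack(responses, axis=-1)
    # Crude per-pixel normalization for illumination robustness.
    norms = np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8
    return feats / norms

def node_features(feature_maps, grid_nodes):
    """Sample the multiscale responses at (row, col) grid node positions."""
    return np.array([feature_maps[r, c] for r, c in grid_nodes])
```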
39. Face recognition
Performance evaluation:
Performance metric: Equal Error Rate (EER).
The M2VTS and XM2VTS facial image databases were used for
training/testing.
MEGM EER: 2%
Subspace techniques EER: 5-10%
Kernel subspace techniques EER: 2%
Elastic graph matching techniques have the best overall
performance.
40. Facial expression analysis
Problem statement:
To identify a facial expression
in a facial image or 3D point cloud.
Input: a face ROI or a point cloud
Output: the facial expression label
(e.g. neutral, anger, disgust, sadness, happiness, surprise, fear).
Applications
Affective video content description
41. Facial expression analysis
Two approaches for expression analysis on images /
videos:
Feature based ones
They use texture or geometrical information as features for
expression information extraction.
Template based ones
They use 3D or 2D head and facial models as templates for facial
expression recognition.
42. Facial expression analysis
Feature based methods
The features used can be:
geometric positions of a set of fiducial points on a
face;
multi-scale and multi-orientation Gabor wavelet
coefficients at the fiducial points;
optical flow information;
transform (e.g. DCT, ICA, NMF) coefficients;
image pixels reduced in dimension by using PCA
or LDA.
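As an illustration of the Gabor features in the list above, the sketch below samples the responses of a multi-scale, multi-orientation Gabor filter bank at fiducial points; all filter-bank parameters are illustrative assumptions.

```python
import cv2
import numpy as np

def gabor_bank(scales=(4, 8, 16), orientations=8):
    """Multi-scale, multi-orientation Gabor kernels (parameters assumed)."""
    kernels = []
    for lam in scales:                          # wavelengths in pixels
        for k in range(orientations):
            theta = np.pi * k / orientations
            kernels.append(cv2.getGaborKernel(
                ksize=(31, 31), sigma=lam / 2.0, theta=theta,
                lambd=lam, gamma=0.5, psi=0))
    return kernels

def gabor_features(gray, fiducial_points, kernels):
    """Filter-bank responses sampled at (row, col) fiducial points."""
    gray = gray.astype(np.float32)
    responses = [cv2.filter2D(gray, cv2.CV_32F, k) for k in kernels]
    return np.array([[r[y, x] for r in responses]
                     for y, x in fiducial_points])
```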
45. 3D facial expression recognition
Experiments on the BU-3DFE database
100 subjects.
6 facial expressions × 4 different intensities + neutral, per subject.
83 landmark points (used in our method).
Recognition rate: 93.6%
46. Action recognition
Problem statement:
To identify the action (elementary activity) of a person
Input: a single-view or multi-view video or a sequence of 3D
human body models (or point clouds).
Output: An activity label (walk, run, jump,…) for each frame or
for the entire sequence.
[Example actions: run, walk, jump p., jump f., bend.]
Applications:
Semantic video content description, indexing, retrieval
47. Action recognition
Action recognition on video data
Input feature vectors: binary masks resulting from coarse body
segmentation on each frame.
[Example actions: run, walk, jump p., jump f., bend.]
48. Action recognition
Perform fuzzy c-means (FCM) clustering on the feature
vectors of all frames of training action videos
49. Action recognition
Find the cluster centers, “dynemes”: key human
body poses
Characterize each action video by its distance
from the dynemes.
Apply Linear Discriminant Analysis (LDA) to
further decrease dimensionality.
50. Action recognition
Characterize each test video by its distances from the
dynemes
Perform LDA
Perform classification, e.g. by SVM
The method can operate on videos containing multiple
actions performed sequentially.
96% correct action recognition on a database containing 10
actions (walking, running, jumping, waving, etc.).
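A compact sketch of the dyneme pipeline from the last three slides: fuzzy c-means over per-frame posture vectors, video representation by distances to the dyneme centers, then LDA and an SVM classifier; the tiny FCM implementation and all parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

def fuzzy_cmeans(X, n_clusters, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means: returns the cluster centers ("dynemes")."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), n_clusters))
    U /= U.sum(axis=1, keepdims=True)            # fuzzy memberships
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None] - centers[None], axis=2) + 1e-9
        U = 1.0 / (d ** (2 / (m - 1)))           # standard FCM update
        U /= U.sum(axis=1, keepdims=True)
    return centers

def video_descriptor(frames, dynemes):
    """Represent a video by its mean distance to each dyneme."""
    d = np.linalg.norm(frames[:, None] - dynemes[None], axis=2)
    return d.mean(axis=0)

def train(train_videos, labels, n_dynemes=20):
    """train_videos: list of (n_frames, d) arrays of posture vectors
    (e.g. flattened binary body masks)."""
    all_frames = np.vstack(train_videos)
    dynemes = fuzzy_cmeans(all_frames, n_dynemes)
    X = np.array([video_descriptor(v, dynemes) for v in train_videos])
    lda = LinearDiscriminantAnalysis().fit(X, labels)
    clf = SVC().fit(lda.transform(X), labels)
    return dynemes, lda, clf

def predict(video, dynemes, lda, clf):
    x = video_descriptor(video, dynemes)[None]
    return clf.predict(lda.transform(x))[0]
```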
51. 3D action recognition
Action recognition on 3D data
Extension of the video-based activity recognition
algorithm
Input to the algorithm: binary voxel-based representation
of frames
52. Visual speech detection
Problem statement:
To detect the video frames in which a person speaks,
using only visual information.
Input: A mouth ROI produced by mouth detection
Output: yes/no visual speech indication.
The mouth is detected and localized by a similar technique
to eye detection.
53. Visual speech detection
Applications
Detection of important video frames (when someone
speaks)
Lip reading applications
Speech recognition applications
Speech intent detection and speaker determination in:
human-computer interaction applications;
video telephony and video conferencing systems.
Dialogue detection system for movies and TV programs
54. Visual speech detection
A statistical approach for visual speech detection,
based on mouth region luminance, will be presented.
The method employs face and mouth region detectors
and applies signal detection algorithms to determine
lip activity.
57. Visual speech detection
[Plot: the number of low grayscale-intensity pixels of the mouth
region over a video sequence; the rectangle encompasses the frames
where a person is speaking.]
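A toy version of the dark-pixel statistic plotted above: count the low-intensity pixels of the mouth ROI per frame, then flag speech where the count fluctuates strongly within a sliding window; the thresholds and window length are assumptions, simpler than the statistical detector actually presented.

```python
import numpy as np

def dark_pixel_counts(mouth_rois, intensity_thresh=60):
    """Per-frame count of low grayscale-intensity pixels in the mouth ROI
    (the mouth opening appears dark, so the count tracks lip activity)."""
    return np.array([(roi < intensity_thresh).sum() for roi in mouth_rois])

def speech_frames(counts, window=15, std_thresh=20.0):
    """Flag frames where the dark-pixel count varies strongly within a
    sliding window: a simple stand-in for the statistical detector."""
    half = window // 2
    flags = np.zeros(len(counts), dtype=bool)
    for t in range(len(counts)):
        seg = counts[max(0, t - half):t + half + 1]
        flags[t] = seg.std() > std_thresh
    return flags
```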
58. Anthropocentric video content description structure based on MPEG-7
Problem statement:
To organize and store the extracted semantic information for
video content description.
Input: video analysis results
Output: MPEG-7 compatible XML file.
Applications
Video indexing and retrieval
Video metadata analysis
59. Anthropocentric video content description
Anthropos-7 metadata description profile
Humans and their status/actions are the most important
metadata items.
Anthropos-7 presents an anthropocentric perspective for movie
annotation.
New video content descriptors (on top of MPEG-7 entities) are
introduced, in order to better store intermediate- and high-level
video metadata related to humans.
Human ids, actions, status descriptions are supported.
Information like:
“Actor H. Grant appears in this shot and he is smiling”
can be described in the proposed Anthropos-7 profile.
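To make the profile concrete, the sketch below emits a hypothetical Anthropos-7-style XML fragment for the example sentence above; the tag names are guesses modeled on the class names in the following slides, not the actual Anthropos-7 schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names modeled on the Anthropos-7 classes listed in
# the next slides; the real schema may differ.
shot = ET.Element("Shot", {"id": "shot_042"})
actor = ET.SubElement(shot, "Actor", {"id": "actor_01"})
ET.SubElement(actor, "Name").text = "H. Grant"
appearance = ET.SubElement(shot, "ActorAppearance", {"actorRef": "actor_01"})
instance = ET.SubElement(appearance, "ActorInstance", {"frame": "127"})
ET.SubElement(instance, "Expression").text = "happiness"  # "he is smiling"

print(ET.tostring(shot, encoding="unicode"))
```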
60. Anthropocentric video content description
Main Anthropos-7 Characteristics:
Anthropos-7 Descriptors gather semantic (intermediate level)
video metadata (tags).
Tags are selected to suit research areas like face detection,
object tracking, motion detection, facial expression extraction
etc.
High level semantic video metadata (events, e.g. dialogues) are
supported.
61. Anthropocentric video content description
Descriptor (D) & Description Scheme (DS) names:
Movie Class
Version Class
Scene Class
Shot Class
Take Class
Frame Class
Sound Class
Actor Class
Actor Appearance Class
Object Appearance Class
Actor Instance Class
Object Instance Class
High Semantic Class
Camera Class
Camera Use Class
Lens Class
62. Anthropocentric video content description
[Class diagram: Movie Class, Version Class, Scene Class, Shot Class,
Take Class, Frame Class, Sound Class, Actor Class, Object Appearance
Class, Actor Appearance Class, Object Instance Class, Actor Instance
Class, High Semantic Class, Camera Class, Camera Use Class, Lens
Class.]
63. Anthropocentric video content description
[Class diagram with the Shot Class highlighted.]
Shot Class attributes: ID, Shot S/N, Shot Description, Take Ref,
Start Time, End Time, Duration, Camera Use, Key Frame,
Number Of Frames, Color Information, Object Appearances.
64. Anthropocentric video content description
[Class diagram with the Actor Class highlighted.]
Actor Class attributes: ID, Name, Sex, Nationality, Role Name.
65. Anthropocentric video content description
[Class diagram with the Actor Appearance Class highlighted.]
Actor Appearance Class attributes: ID, Time In List, Time Out List,
Color Information, Motion, Actor Instances.
66. Anthropocentric video content description
Examples of Actor Instances in one Actor Appearance:
[Face ROIs in the 1st, 6th, 11th and 16th frames.]
67. Anthropocentric video content description
[Class diagram with the Actor Instance Class highlighted.]
Actor Instance Class attributes: ID, ROIDescriptionModel, Status,
Position In Frame, Expressions, Activity.
ROIDescriptionModel attributes: ID, Geometry (Feature Points,
Bounding Box, Convex Hull, Grid), Color.
68. Anthropos-7 Video Annotation software
A set of video annotation software tools has been created for
anthropocentric video content description, based on the
MPEG-7 description structure presented above.
These software tools perform (semi-)automatic video content
annotation.
The video metadata results (XML file) can be edited manually by a
software tool called Anthropos-7 editor.
69. Multi-view Anthropos-7
The goal is to link the frame-based information
instantiated by the ActorInstanceType DS at each
video channel (e.g. the L+R channels in stereo videos):
in other words, to link the instances of the
ActorInstanceType DS belonging to different channels,
based on disparity or color similarity.
71. Multi-view Anthropos-7
The proposed structure organizes the data in 3 levels
Level 1
Refers to the Correspondences tag
Level 2
Refers to the CorrespondingInstancesType DS
Level 3
Refers to the CorrespondingInstanceType DS
74. Conclusions
Anthropocentric movie description is very important,
because in most cases movies are based on human
characters, their actions and status.
We presented a number of anthropocentric video
analysis and description techniques that have
applications in film/games postproduction, as well as
in semantic video search.
Similar descriptions can be devised for stereo and
multiview video.
75. 3DTVS project
European project: 3DTV Content Search (3DTVS)
Start: 1/11/2011
Duration: 3 years
www.3dtvs-project.eu
Stay tuned!
76. Thank you very much for your attention!
Further info:
www.aiia.csd.auth.gr
pitas@aiia.csd.auth.gr
Tel: +30-2310-996304