Human Action Recognition in Videos Employing 2DPCA on 2DHOOF and Radon Transform
1. Presented in Partial Fulfillment of the Requirements
of the Degree of Master of Science in the School
of Communication and Information Technology
Fadwa Fawzy Fouad
Supervisor: Dr. Moataz M. Abdelwahab
4. Importance & Applications
Human action/activity recognition is one of the most promising applications of
computer vision. Interest in this topic is motivated by the promise of many
applications, including:
• character animation for games and movies
• advanced intelligent user interfaces
• biomechanical analysis of actions for sports and medicine
• automatic surveillance
5. Action vs. Activity
Action: single person, short time duration, simple motion pattern.
Activity: complex sequence of actions, single/multiple person(s), long time
duration.
6. Challenges and
characteristics of the domain
The difficulty of the recognition process is associated with multiple variation
sources
Inter- and intra-class variations
Environmental Variations and Capturing conditions
Temporal variations
7. • Intra-class variations (variations within a single
class)
The variations in the performance of a certain action due to anthropometric
differences between individuals. For example, running movements can
differ in speed and stride length.
• Inter-class variations (variations between different
classes)
Overlap between different action classes due to the similarity in how the
actions are performed.
8. • Environmental variations
Disturbances originating from the actor's surroundings, including dynamic or
cluttered environments, illumination variation, and body occlusion.
• Capturing conditions
Depend on the method used to capture the scene, whether single/multiple
static/dynamic camera(s) systems.
• Temporal variations
Include the changes in the performance rate from one person to another,
as well as changes in the recording rate (frames/sec).
11. The main structure of the
action recognition system
The structure of the action recognition system is typically hierarchical:
Capture the input video → Human detection & segmentation →
Extraction of the action descriptors → Action classification
12. Capture the input video
For a single camera, the scene is captured from only one viewpoint, so it can't
provide enough information about the performed action in case of a poor
viewpoint. Besides, it can't handle the occlusion problem.
[Figure: four sample videos (Videos 1-4) captured from different viewpoints.]
13. Multi-camera systems can capture the same scene from different poses, so they
provide sufficient information that can alleviate the occlusion problem.
[Figure: the same scene captured by Cameras 0-3.]
14. The new Kinect depth camera technology can be utilized to capture the
performed actions. The device has an RGB camera, a depth sensor, and a
multi-array microphone.
It provides full-body 3D motion capture, facial recognition, and voice
recognition capabilities. Furthermore, the depth information can be used for
segmentation.
[Figure: RGB and depth information captured by the Kinect depth camera.]
15. Human detection &
segmentation
It's the first step of the full process of human sequence evaluation.
Techniques can be divided into:
• Background Subtraction techniques
• Motion Based techniques
• Appearance Based techniques
• Depth Based Segmentation
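As a toy illustration of the first family above, background subtraction can be sketched with a per-pixel temporal median model (a generic sketch, not the thesis implementation; the threshold value is arbitrary):

```python
import numpy as np

def segment_foreground(frames, thresh=25):
    """Median-model background subtraction: the static background is the
    per-pixel median over time; thresholding the absolute difference
    gives a binary foreground mask per frame."""
    frames = np.asarray(frames, dtype=np.float32)   # (T, H, W) grayscale
    background = np.median(frames, axis=0)
    return (np.abs(frames - background) > thresh).astype(np.uint8)

# toy example: a bright 2x2 "actor" moving over a dark static background
T, H, W = 10, 8, 8
frames = np.zeros((T, H, W), dtype=np.float32)
for t in range(T):
    frames[t, 3:5, t % 6: t % 6 + 2] = 200.0
masks = segment_foreground(frames)
```

Motion-, appearance-, and depth-based techniques replace the median model with, respectively, flow, a detector, or a depth threshold, but the masking step stays the same.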
16. Extraction of the
action
descriptors
Input videos consist of massive amounts of information in the form of
spatiotemporal pixel intensity variations, but most of this information is not
directly relevant to the task of understanding and identifying the activity
occurring in the video.
In this work we used Non-Parametric approaches in which a set of features
are extracted per video frame, then these features are accumulated and
matched to stored templates.
Example:
Motion Energy Image
&
Motion History Image
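These two templates can be sketched in a few lines of plain numpy (assuming binary silhouettes and a decay of one gray level per frame, which is one common convention, not necessarily the exact one used here):

```python
import numpy as np

def mei_mhi(silhouettes, tau=None):
    """Motion Energy Image (where motion occurred) and Motion History
    Image (when it occurred: newer motion is brighter), accumulated
    from frame-to-frame silhouette differences."""
    sil = np.asarray(silhouettes, dtype=np.uint8)   # (T, H, W), values {0,1}
    T = sil.shape[0]
    tau = tau if tau is not None else T             # history length
    mhi = np.zeros(sil.shape[1:], dtype=np.float32)
    for t in range(1, T):
        moved = sil[t] != sil[t - 1]                # pixels that changed now
        mhi = np.maximum(mhi - 1, 0)                # decay older motion
        mhi[moved] = tau                            # stamp newest motion
    mei = (mhi > 0).astype(np.uint8)                # union of recent motion
    return mei, mhi

# toy example: a single pixel moving right across three frames
sil = np.zeros((3, 5, 5), dtype=np.uint8)
sil[0, 2, 0] = 1
sil[1, 2, 1] = 1
sil[2, 2, 2] = 1
mei, mhi = mei_mhi(sil)
```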
17. Action
classification
When the extracted features are available for an input video, human action
recognition becomes a classification problem.
Dimensionality reduction is a common step before the actual classification and is
discussed first.
Dimensionality reduction
Image representations are often high-dimensional. This makes the matching task
computationally more expensive. Also, the representation might contain noisy
features. This problem triggered the idea of obtaining a more compact, robust
feature representation by reducing the space of the image representation into a
lower-dimensional space.
Example: One/Two-Dimensional Principal Component Analysis (1DPCA/2DPCA)
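A minimal sketch of 2DPCA: unlike 1DPCA, the images are never vectorised; the image covariance matrix is built directly and each image is projected row-wise onto the `d` dominant eigenvectors (function names are mine, for illustration only):

```python
import numpy as np

def fit_2dpca(images, d):
    """2DPCA: build the (W x W) image covariance matrix from the training
    images directly, then keep the d eigenvectors with the largest
    eigenvalues as the dominant projection vectors."""
    A = np.asarray(images, dtype=np.float64)        # (N, H, W)
    mean = A.mean(axis=0)
    centered = A - mean
    # G = (1/N) * sum_n (A_n - mean)^T (A_n - mean), summed over rows
    G = np.einsum('nhw,nhv->wv', centered, centered) / len(A)
    eigvals, eigvecs = np.linalg.eigh(G)            # ascending eigenvalues
    X = eigvecs[:, ::-1][:, :d]                     # d dominant columns
    return mean, X

def project_2dpca(image, X):
    """Row-wise projection: an (H, W) image becomes an (H, d) feature matrix."""
    return np.asarray(image, dtype=np.float64) @ X

rng = np.random.default_rng(0)
imgs = rng.normal(size=(20, 8, 6))                  # 20 toy 8x6 "images"
mean, X = fit_2dpca(imgs, 2)
feats = project_2dpca(imgs[0], X)
```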
18. Nearest neighbor classification
k-Nearest neighbor (k-NN) classifiers use the distance between the features of an
observed sequence and those in a training set. The most common label among the
k closest training sequences is chosen as the classification.
NN classification can be either performed at the frame level, or for the whole
video sequences. In the latter case, issues with different frame lengths need to be
resolved.
In our work we used 1-NN with Euclidean distance to classify the tested
actions.
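The 1-NN rule is tiny; a sketch over per-video feature matrices, using the Euclidean (Frobenius) distance as stated above:

```python
import numpy as np

def classify_1nn(test_feat, train_feats, train_labels):
    """1-nearest-neighbour: return the label of the training feature
    matrix with the smallest Euclidean distance to the test features."""
    dists = [np.linalg.norm(np.asarray(test_feat) - np.asarray(f))
             for f in train_feats]
    return train_labels[int(np.argmin(dists))]

# toy example: two stored videos, one test video closer to the second
train_feats = [np.zeros((2, 2)), np.full((2, 2), 5.0)]
train_labels = ['walk', 'run']
pred = classify_1nn(np.array([[4.0, 4.0], [5.0, 6.0]]),
                    train_feats, train_labels)
```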
20. 2DHOOF/2DPCA
Contour Based
Optical Flow Algorithm
• Dense vs. Sparse OF
• Alignment issues with OF
• The calculation of the 2D Histogram of Optical Flow (2DHOOF)
• Overall System Description
• Experimental Results
21. Dense vs. Sparse OF
In practice, dense OF is not the best choice for obtaining the OF. Besides its
high computational complexity, it is not accurate for homogeneous moving objects
(the aperture problem).
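For illustration, sparse flow at contour points can be sketched with simple block matching (a stand-in for the pyramidal Lucas-Kanade trackers typically used for sparse OF; `patch` and `search` sizes are arbitrary choices of mine):

```python
import numpy as np

def sparse_flow(prev, nxt, points, patch=2, search=3):
    """Sparse optical flow by block matching: for each contour point,
    find the displacement (within +/-search pixels) whose patch in the
    next frame best matches (minimum SSD) the patch around the point
    in the previous frame."""
    prev = np.asarray(prev, dtype=np.float32)
    nxt = np.asarray(nxt, dtype=np.float32)
    flows = []
    for (y, x) in points:
        ref = prev[y - patch:y + patch + 1, x - patch:x + patch + 1]
        best, best_err = (0, 0), np.inf
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                cand = nxt[y + dy - patch:y + dy + patch + 1,
                           x + dx - patch:x + dx + patch + 1]
                err = np.sum((ref - cand) ** 2)
                if err < best_err:
                    best_err, best = err, (dy, dx)
        flows.append(best)
    return np.array(flows)          # one (dy, dx) vector per contour point

# toy example: the second frame is the first shifted down 1, right 2
rng = np.random.default_rng(1)
prev = rng.normal(size=(16, 16))
nxt = np.roll(prev, (1, 2), axis=(0, 1))
flows = sparse_flow(prev, nxt, [(8, 8)])
```

Tracking only the contour points keeps the cost proportional to the contour length rather than the frame area, which is the advantage claimed above.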
22. Alignment issues with OF
We had two choices for the order of actor alignment:
• Align the actor, then calculate the OF
• Calculate the OF, then align it
30. Overall System Description
Training Mode:
Segmentation & Contour Extraction → Sparse OF → 2DHOOF → 2DPCA
(extract the dominant vectors) → Store extracted features
Testing Mode:
Segmentation & Contour Extraction → Sparse OF → 2DHOOF → Projection on
the dominant vectors → Classification and Voting Scheme
32. Segmentation & Contour Extraction (Method 1)
• Geodesic segmentation, where
xi: stroke pixels (black)
x: other pixels (white)
I: image intensity
Pipeline: Input Video Frame → Face Detection → Initial Stroke → Blob
Extraction → Final Contour
33. Segmentation & Contour Extraction (Method 2)
• Contour extraction from the magnitude of the dense OF
An edge pixel satisfies a specific criterion based on its (3 x 3) neighbor pixels.
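A sketch of that 3x3 criterion, using the 3-to-6 neighbour count described in the notes; requiring the centre pixel itself to be moving is my reading of the criterion, not stated explicitly:

```python
import numpy as np

def contour_from_flow_magnitude(mag, lo=3, hi=6):
    """Mark a moving pixel as a contour (edge) pixel when the count of
    non-zero-magnitude pixels in its 3x3 neighbourhood (including
    itself) falls in [lo, hi]. Interior pixels of the moving blob have
    ~9 non-zero neighbours; background pixels are not moving at all."""
    nz = (np.asarray(mag) > 0).astype(np.int32)
    H, W = nz.shape
    edges = np.zeros_like(nz)
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            s = nz[y - 1:y + 2, x - 1:x + 2].sum()
            if nz[y, x] and lo <= s <= hi:
                edges[y, x] = 1
    return edges

# toy example: a 4x4 moving blob -> only its 12 boundary pixels survive
mag = np.zeros((10, 10))
mag[3:7, 3:7] = 1.0
edges = contour_from_flow_magnitude(mag)
```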
42. Experimental Results
Two experiments were conducted to evaluate the performance of the proposed
algorithm.
• For the first experiment, the Weizmann dataset was used to measure the
performance of the low-resolution single-camera operation.
• For the second experiment, the IXMAS multi-view dataset was used to evaluate
the performance of the parallel camera structure.
The two experiments were conducted using the Leave-One-Actor-Out (LOAO)
technique to be consistent with the most recent algorithms.
Both datasets provide RGB frames and the actors' silhouettes.
43. Weizmann dataset
The Weizmann dataset consists of 90 low-resolution video sequences showing 9
different actors, each performing 10 natural actions such as walk, run, jump
forward, gallop sideways, bend, wave with one hand (wave1), wave with two
hands (wave2), jump in place (Pjump), jump-jack, and skip.
[Figure: sample frames of the bend, run, jump, jump-jack, and gallop actions.]
44. The confusion matrix for this experiment shows that the average recognition
accuracy is 97.78%, and eight actions were recognized with 100% accuracy.
2DHOOF / 2DPCA
45. On the other hand, using 1DHOOF with 1DPCA decreases the accuracy to
63.34% because of the large confusion between actions (as discussed before).
1DHOOF / 1DPCA
46. Comparison with the most recent algorithms:
• Recognition Accuracy

| Method | Accuracy |
| Previous Contribution | 98.89% |
| Our Algorithm | 97.79% |
| Shah et al. | 95.57% |
| Yang et al. | 92.8% |
| Yuan et al. | 92.22% |

• Average Testing Time

| Method | Average Runtime |
| Our Algorithm | 66.11 msec |
| Previous Contribution | 113.00 msec |
| Shah et al. | 18.65 sec |
| Blank et al. | 30 sec |
48. IXMAS Dataset
The proposed parallel structure algorithm was applied on the IXMAS multi-view
dataset. Each camera is considered an independent system, then a voting
scheme is carried out between the four cameras to obtain the final decision.
The dataset consists of 5 cameras capturing the scene and 12 actors, each
performing 13 natural actions 3 times, in which the actors are free to change
their orientation for each scenario.
The actions: check watch, cross arms, scratch head, sit down, get up, turn
around, walk, wave, punch, kick, and pick up and throw.
[Diagram: Camera 0-3 → Our Algorithm (one instance per camera) → Voting
Scheme → Final Decision]
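The per-camera voting above can be sketched as a simple majority vote; breaking ties in favour of the first camera is my assumption, since the slides do not specify a tie-break rule:

```python
from collections import Counter

def vote(decisions):
    """Majority voting across the per-camera action decisions.
    Ties are broken by the earliest camera holding a top count
    (an assumption -- the tie-break rule is not spelled out here)."""
    counts = Counter(decisions)
    top = max(counts.values())
    for d in decisions:                 # first camera wins a tie
        if counts[d] == top:
            return d

# toy example: three cameras say "kick", one says "punch"... etc.
final = vote(['kick', 'kick', 'punch', 'wave'])
```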
49. Example on the IXMAS multi-camera dataset. Action: Pick up and Throw.
[Figure: the same action as seen by Cameras 0-3.]
50. The confusion matrix for the IXMAS dataset shows that the average accuracy is
87.12%, where SH = Scratch head, CW = Check watch, CA = Cross arms, SD = Sit
down, GU = Get up, TA = Turn around, PU = Pick up.
51. Comparison with the best reported accuracies shows that we achieved the
highest accuracy, with an enhancement of 3%.

| Method | Actors # | Cam(0) % | Cam(1) % | Cam(2) % | Cam(3) % | Overall Vote % |
| Proposed Algorithm | 12 | 97.29 | 79.04 | 72.47 | 78.53 | 87.12 |
| Previous Contribution | 12 | 78.9 | 78.61 | 80.93 | 77.38 | 84.59 |
| Weinland et al. | 10 | 65.04 | 70.00 | 54.30 | 66.00 | 81.30 |
| Srivastava et al. | 10 | N/A | N/A | N/A | N/A | 81.40 |
| Shah et al. | 12 | 72.00 | 53.00 | 68.00 | 63.00 | 78.00 |

Bold indicates the best performance; N/A = not available in published reports.
53. Published Paper
F. Fawzy, M. Abdelwahab, and W. Mikhael, "2DHOOF-2DPCA Contour Based
Optical Flow Algorithm for Human Activity Recognition," IEEE International
Midwest Symposium on Circuits and Systems (MWSCAS 2013), Ohio, USA.
56. Radon Transform
The RT computes projections of an image matrix along specified directions. A
projection of a two-dimensional function f(x,y) is a set of line integrals along
parallel paths, or beams.
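These parallel-beam line integrals can be sketched discretely as follows: each pixel's intensity is accumulated into the detector bin at offset r = x·cos(theta) + y·sin(theta), with coordinates centred on the image (a nearest-neighbour sketch of mine, not the exact implementation used in the thesis):

```python
import numpy as np

def radon_transform(image, angles_deg):
    """Discrete Radon transform sketch: for each angle, project every
    pixel onto the detector axis and accumulate its intensity into the
    nearest integer bin. The detector length is the image diagonal
    (plus one bin of padding)."""
    img = np.asarray(image, dtype=np.float64)
    H, W = img.shape
    ys, xs = np.mgrid[0:H, 0:W]
    xs = xs - (W - 1) / 2.0                 # centre the coordinates
    ys = ys - (H - 1) / 2.0
    n_bins = int(np.ceil(np.hypot(H, W))) + 1
    sino = np.zeros((n_bins, len(angles_deg)))
    for j, a in enumerate(angles_deg):
        t = np.deg2rad(a)
        r = xs * np.cos(t) + ys * np.sin(t)
        bins = np.round(r + n_bins / 2.0).astype(int)
        np.add.at(sino[:, j], bins.ravel(), img.ravel())
    return sino

# toy example: a centred square; its 0- and 90-degree projections match
img = np.zeros((8, 8))
img[2:6, 2:6] = 1.0
sino = radon_transform(img, [0, 45, 90])
```

Every pixel lands in exactly one bin per angle, so each projection conserves the total image mass.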
58. Overall system description
The proposed system is designed and tested for gesture recognition and can be
extended to regular action recognition.
We have two modes for this algorithm
• Training Mode
• Testing Mode
Both have a pre-processing step before feature extraction.
60. Pre-processing Step:
1) Input videos
The One-Shot Learning ChaLearn Gesture Dataset was used for this experiment.
In this dataset, a single user facing a fixed Kinect™ camera and interacting
with a computer by performing gestures was captured.
Videos are represented by RGB and depth images.
Each actor has from 8 to 15 different gestures (the vocabulary) for training,
and 47 input videos, each containing from 1 to 5 gesture(s), for testing.
We applied our algorithm on a subset of this dataset consisting of 37 different
actors.
61. The dataset can be divided into two main groups: standing actors and sitting
actors. In this experiment we used a subset of the standing actor group, in
which actors use their whole body to perform the gesture and make significant
motion to be captured by the MEI and MHI.
[Figure: examples of standing and sitting actors.]
62. Also, we used only the depth videos as input. Depth information makes the
segmentation task easier than using RGB or grayscale videos, especially when
the actor's clothes have the same color as the background, or the background
is textured.
66. In some cases the resultant blob has some extra objects with it. This noise
results from objects that were at the same depth as the actor.
[Figure: three example noise cases (Cases 1-3).]
67. In this situation we perform a noise elimination step.
[Figure: the three cases after noise elimination.]
75. Basically, the difference between the RT of the whole body and the RT of the
body parts is the white portion in the center, representing the projection of
the actor's body.
78. Video Chopping
As we have mentioned, the testing videos may contain from 1 to 5
different gestures per video. In this case we need to separate these
gestures into one gesture per video before testing our system.
We do that in two main steps:
1. Calculate the plot that represents the moving area per frame.
2. Apply the local-minima criteria to this plot.
80. 2. Apply the local-minima criteria
We are searching for a frame i that satisfies the following conditions:
a) The number of frames before i is greater than or equal to the frame
threshold.
b) The amount of decrease in the area at i is greater than 50% of the peak
value.
c) The areas at i-1 and i+1 are greater than the area at i, to ensure that i
is a local minimum between two peaks.
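The three conditions can be sketched as a small helper over the per-frame area plot; resetting the running peak after each cut (so every gesture gets its own peak) is my assumption:

```python
def chop_points(area, frame_threshold, peak_frac=0.5):
    """Find cut frames between gestures: frame i is a cut when
    (a) at least frame_threshold frames separate it from the previous
        cut,
    (b) the area has dropped below peak_frac of the running peak, and
    (c) area[i-1] and area[i+1] both exceed area[i], i.e. i is a local
        minimum between two peaks."""
    cuts, last_cut, peak = [], 0, 0.0
    for i in range(1, len(area) - 1):
        peak = max(peak, area[i])
        if (i - last_cut >= frame_threshold
                and area[i] < peak_frac * peak
                and area[i - 1] > area[i] < area[i + 1]):
            cuts.append(i)
            last_cut, peak = i, 0.0     # start fresh for the next gesture
    return cuts

# toy area plot with two deep valleys between three gesture "humps"
cuts = chop_points([1, 4, 9, 10, 8, 2, 7, 10, 9, 3, 8, 10],
                   frame_threshold=3)
```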
83. Experimental Results
We conducted four One-Shot Learning (OSL) experiments:
• Experiments I and II: Radon Transform as the action descriptor, classified
with 2DPCA and with direct correlation.
• Experiments III and IV: MEI/MHI as the action descriptor, classified with
2DPCA and with direct correlation.
84. Recognition accuracy (%) of the four experiments:

| Experiment | Whole Body MEI | Whole Body MHI | Body Parts MEI | Body Parts MHI |
| I (RT) | 71 | 69 | 82 | 81.5 |
| II (RT) | 70 | 70 | 81.7 | 81.6 |
| III (MEI/MHI) | 70 | 68 | 82 | 81.7 |
| IV (MEI/MHI) | 71.24 | 68.7 | 83.33 | 82.9 |

In all experiments, the Body Parts features perform better than the Whole Body
features.

Comparison between using the RT and using the MEI/MHI directly (without RT) as
the 2DPCA features:

| Features | % Maintained Energy | Storage Requirements |
| RT | 99% | 72 MBytes |
| MEI/MHI | 88% | 102 MBytes |
Speaker notes
First, the introduction. It covers 3 main points: the importance and applications of this field, the difference between action and activity, and finally the challenges and characteristics of the domain.
The differences between Action and Activity are that …
Intra-class variations are variations within a single class, because action performance can differ from one actor to another. Inter-class variations are variations between two or more different classes, due to the similarity in action performance. For better recognition results we need less intra-class variation and more inter-class variation.
Environmental variations are disturbances originating from the actor's surroundings. Capturing conditions depend on the method used to capture the scene, including the usage of single/multiple moving or static cameras. Temporal variations include the changes in performance rate from one actor to another, and changes in the recording rate.
The structure of the action recognition system is typically hierarchical. It starts by capturing the input video and extracting the actor's body from it, followed by feature extraction and finally action classification.
As shown here, the first 3 videos are captured from a good viewpoint, so we can gain enough information about the actions. But the 4th video is captured from a poor viewpoint, in which the actor's body is hiding the action details.
Human detection is the task of finding the presence and the position of human beings in images/videos. We briefly describe a few popular human segmentation techniques:
MEI: represents the locations where the motion has occurred in the image sequence. MHI: represents the history of this motion by different gray levels (newer motion is brighter).
As shown here, this jumping actor has non-textured clothes, so the dense OF will have some inaccurate results, because the body pixels in the current frame cannot determine their new locations in the next frame; only the edge points can accurately describe the actor's motion. So we used the sparse OF of the actor's contour, because it is less computationally expensive compared to the dense OF and can accurately describe the motion without the need for excessive processing.
For results consistency, we used an alignment step before feature extraction, and we found that the order of this step affects the results with a significant difference.
Actions like running can be represented by their jumping and transition effects
If we align the actor and then calculate the OF, these effects will vanish, and only the legs' motion is captured; any other motion is due to the poor alignment. On the other hand, if we calculate the OF and then align it, the transition and jumping effects will be captured in the calculated OF. We can see this conclusion from the OF of the head pixels. So we chose to calculate the OF and then align it.
After obtaining the OF for each two successive frames of the input video, we used it to calculate the new features of the n-layer 2DHOOF. The calculated OF of size W x H was divided into blocks, each of size m x m. For each block, a 1DHOOF with n bins representing the different ranges of angles was obtained. Then each bin from the 1DHOOF contributes to the corresponding layer of the 2DHOOF at the location corresponding to the block location. So the size of the final 2DHOOF is W/m x H/m x n.
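The block/bin bookkeeping described in this note can be sketched as follows (weighting each bin by the flow magnitude is my assumption; unweighted counts would work the same way):

```python
import numpy as np

def hoof_2d(flow_x, flow_y, m, n_bins):
    """Multi-layer 2D Histogram of Optical Flow: split the flow field
    into m x m blocks, histogram the flow angles of each block into
    n_bins angle ranges (magnitude-weighted -- an assumption), and
    write bin k of every block into layer k at the block's location.
    Output shape: (H//m, W//m, n_bins)."""
    H, W = flow_x.shape
    ang = np.mod(np.arctan2(flow_y, flow_x), 2 * np.pi)   # angles in [0, 2pi)
    mag = np.hypot(flow_x, flow_y)
    out = np.zeros((H // m, W // m, n_bins))
    for by in range(H // m):
        for bx in range(W // m):
            sl = np.s_[by * m:(by + 1) * m, bx * m:(bx + 1) * m]
            hist, _ = np.histogram(ang[sl], bins=n_bins,
                                   range=(0, 2 * np.pi), weights=mag[sl])
            out[by, bx] = hist       # one value per layer at this block
    return out

# toy example: uniform rightward flow -> everything lands in angle bin 0
flow_x = np.ones((8, 8))
flow_y = np.zeros((8, 8))
h = hoof_2d(flow_x, flow_y, m=4, n_bins=4)
```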
For example, if we divide the calculated OF into blocks each of size (W/2 x H/2), the final 2DHOOF layers have a size of (2 x 2).
After calculating the 2DHOOF for each 2 successive frames of the input video, these histograms are layer-wise accumulated and normalized to obtain the total 2DHOOF for the whole video. These features are independent of the actor's scale and tolerant to contour imperfections. Furthermore, they are independent of the start of the action, as the multi-layer 2DHOOFs per frame are finally accumulated and normalized regardless of their temporal order.
The main advantage of the 2DHOOF is that it maintains the spatial relation between the moving parts, compared to the 1DHOOF, which captures only the dominant motion wherever it occurs.
As shown here, the bend and wave actions have the same motion directions. The main difference is the spatial location of this motion. Since the 1DHOOF doesn't maintain the spatial locations of the motion, it cannot be used to discriminate between these actions, as they use the same range of angles.
Our system is divided into two modes: a training mode, in which the training features are extracted and stored, and a testing mode, in which the dominant features are obtained for the tested video and then compared to the stored training features to get the final decision.
The first step in the training mode is actor segmentation and contour extraction.
We tried two different methods for contour extraction. The first method is geodesic segmentation. The idea of this method is to draw an initial stroke on the actor's body and try to expand it to cover all other pixels that are near and have low intensity variation compared to the stroke pixels. These two conditions are met by measuring the geodesic distance between the initial stroke pixels and the other pixels. We used face detection to draw this initial stroke automatically.
The accuracy of this method is highly dependent on the initial stroke.
The second method uses the magnitude of the dense OF. An edge pixel satisfies a specific criterion based on its (3 x 3) neighbor pixels. As shown here, the black dot represents the edge pixel, and the ones represent the neighbor pixels that have non-zero magnitude, so the criterion can be simply described by the summation of these ones. We found that an edge point has a summation value from 3 to 6.
As shown here, for each pixel we applied this edge criterion to extract the edge pixels.
The main steps of this method are: calculate the magnitude of the dense OF, then find the edge pixels using the edge criterion, and finally apply a simple threshold to remove the noise.
The second step in the training mode is extracting the dominant features. After calculating the OF and the 2DHOOF, we used 2DPCA to extract the dominant features.
For each range of angles in the training 2DHOOFs, we calculate the mean and then the covariance matrix, and then obtain the dominant vectors that correspond to the maximum eigenvalues. The histograms are then projected onto the dominant vectors to extract the final features.
These features are stored to be used in the testing mode.
The 2DHOOF of the tested video is projected on the dominant vectors to obtain the final features.
These features are then matched against the stored features using the 1-NN classifier with Euclidean distance. The final decision is based on the minimum distance.
We compared our algorithm with the most recent algorithms in terms of recognition accuracy and average testing time. The achieved accuracy is comparable with the highest reported accuracy, obtained in our previous contribution. This excellent accuracy was achieved in spite of the imperfect and noisy contours, which makes this method independent of how perfect the extracted contours are. Also, our algorithm has the best testing time, which promotes it for real-time applications.
This accuracy was achieved in spite of the presence of shadows and imperfections in the extracted contours.
As shown here, we chose an initial T and start the segmentation algorithm. After a number of iterations we can segment the actor.
We have 3 cases. Case 1: the noise and the actor are not connected. Case 2: the noise and the actor are connected, but can be separated using simple morphological operations. Case 3: the noise and the actor are connected, but can't be separated.
By calculating the area of each object and keeping only the object with the maximum area.
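A numpy-only sketch of this keep-the-largest-object step (BFS flood fill over 4-connected components; the thesis may use morphological tools or a library labeling routine instead):

```python
import numpy as np
from collections import deque

def keep_largest_object(mask):
    """Noise elimination: label the 4-connected components of a binary
    mask with a BFS flood fill, then keep only the component with the
    largest area."""
    mask = np.asarray(mask, dtype=bool)
    labels = np.zeros(mask.shape, dtype=np.int32)
    sizes, current = {}, 0
    for y, x in zip(*np.nonzero(mask)):
        if labels[y, x]:
            continue                        # already labeled
        current += 1
        q, count = deque([(y, x)]), 0
        labels[y, x] = current
        while q:
            cy, cx = q.popleft()
            count += 1
            for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                           (cy, cx - 1), (cy, cx + 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = current
                    q.append((ny, nx))
        sizes[current] = count
    if not sizes:
        return mask.astype(np.uint8)
    biggest = max(sizes, key=sizes.get)
    return (labels == biggest).astype(np.uint8)

# toy example: a 4x4 actor blob plus a single noise pixel at equal depth
mask = np.zeros((10, 10), dtype=np.uint8)
mask[2:6, 2:6] = 1      # the actor
mask[8, 8] = 1          # depth noise
clean = keep_largest_object(mask)
```

This handles Case 1 directly; Cases 2 and 3 would additionally need the morphological separation mentioned above.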
The segmented actor can be aligned using the 0° and 90° projections from the RT. For vertical alignment we used the 90° projection information. We specified the projection rectangle on the y-axis and aligned its center (red line) to the y-center (purple line) of the frame, as shown in the figure. As the gestures don't include whole-body motion (i.e., walk, run, ...), we can use the RT of only the first frame and shift all the video frames by the same distance.
For horizontal alignment we used the 0° projection information. As shown in the figure, the maximum projection value on the x-axis represents the center line of the actor's body. The distance between the maximum projection (red line) and the x-center of the frame (purple line) is the amount needed to align the actor's body at the x-center.
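The two alignment steps in these notes can be sketched as follows: for a binary silhouette, the 0° and 90° Radon projections reduce to column and row sums, the horizontal shift moves the projection peak to the x-center, and the vertical shift moves the midpoint of the non-zero projection span to the y-center (a sketch of the described procedure; the thesis implementation may handle the span differently):

```python
import numpy as np

def center_actor(mask):
    """Align the actor with the frame centre using the 0-degree
    (column-sum) and 90-degree (row-sum) projections of the binary
    silhouette. Returns the shifted mask and the (dy, dx) shift."""
    H, W = mask.shape
    col_proj = mask.sum(axis=0)                 # 0-degree projection
    row_proj = mask.sum(axis=1)                 # 90-degree projection
    dx = (W - 1) // 2 - int(np.argmax(col_proj))        # peak -> x-centre
    rows = np.nonzero(row_proj)[0]
    dy = (H - 1) // 2 - int((rows[0] + rows[-1]) // 2)  # span mid -> y-centre
    aligned = np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
    return aligned, (dy, dx)

# toy example: a 2x2 actor in the top-left corner of a 10x10 frame
mask = np.zeros((10, 10), dtype=np.uint8)
mask[1:3, 1:3] = 1
aligned, (dy, dx) = center_actor(mask)
```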
We have two types of MEI/MHI. The first is the whole-body MEI/MHI, and the second includes only the moving body parts. For gestures that include hand motion in front of the body area, the MEI/MHI of the whole body fails to capture this motion and hides it behind the actor's body; that's why we calculate the body-parts version. This makes the MEI/MHI of the moving parts more reliable and accurate than the MEI/MHI of the whole body.
The RT of the MEI/MHI is the projection of the image information onto a range of angles from 0 to 180 degrees. The resultant RT has a height of DL and a width of 180.
After obtaining RT we applied 2DPCA to extract the final features and store them.
Testing mode is very similar to the training mode except for the video chopping step.
1. Take the first frame as a starting-position reference for each new gesture. 2. Perform frame differencing between the first frame and each frame in the video to get the moving parts. 3. Calculate the area of the moving body parts by summing the number of white pixels per frame, and then plot it. From the plot we can see that the area decreases when the actor is about to finish the gesture and returns to the starting position to start a new one.
After obtaining the area plot, we apply the local-minima criteria to it: 1) to prevent cutting the video in the middle of a gesture; 2) to ensure that the actor is returning to the starting position; 3) …
We applied this method on videos containing from 2 to 5 gestures.
In some cases the actor doesn't return to the starting position between gestures; that's why the algorithm merges two successive gestures into one. When this happens, we discard this video from the training.
We did 4 OSL experiments. The first 2 experiments use the RT as the action descriptor, and the second 2 experiments use the MEI/MHI as the action descriptor. For each case we used 2 different methods for classification: 2DPCA with 1-NN, and direct correlation.
As shown in this table, there is almost no difference in accuracy between using the RT of the MEI/MHI and using the MEI/MHI directly, but the RT is better than the MEI/MHI in the amount of storage needed for the calculated features. The RT reduces the storage requirements by 30% compared to the storage needed for the MEI/MHI features, as shown in the 2nd table. In addition, the RT maintains about 99% of the features' energy, compared to the MEI/MHI, which maintains 88%. In all experiments the accuracy of the body parts is much better than the whole body, because of the motion occlusion that makes different gestures appear as if they are similar.
For example, these two actions have close whole-body MEI/MHIs, although they are totally different.