Human Action Recognition in Videos Employing 2DPCA on 2DHOOF and Radon Transform
1. Presented in Partial Fulfillment of the Requirements
of the Degree of Master of Science in the School
of Communication and Information Technology
Fadwa Fawzy Fouad
Supervisor: Dr. Moataz M. Abdelwahab
4. Importance & Applications
Human action/activity recognition is one of the most promising applications of
computer vision. Interest in this topic is motivated by the promise of many
applications, including:
• character animation for games and movies
• advanced intelligent user interfaces
• biomechanical analysis of actions for sports and medicine
• automatic surveillance
5. Action vs. Activity
Action: single person, short time duration, simple motion pattern.
Activity: complex sequence of actions, single/multiple person(s), long time
duration.
6. Challenges and
characteristics of the domain
The difficulty of the recognition process is associated with multiple variation
sources
Inter- and intra-class variations
Environmental Variations and Capturing conditions
Temporal variations
7. • Intra-class variations (variations within a single
class)
The variations in the performance of a certain action due to anthropometric
differences between individuals. For example, running movements can
differ in speed and stride length.
• Inter-class variations (variations between different
classes)
Overlap between different action classes due to the similarity in how the
actions are performed.
8. • Environmental variations
Disturbances originating from the actor's surroundings, including dynamic or
cluttered environments, illumination variation, and body occlusion.
• Capturing conditions
Depend on the method used to capture the scene, whether single/multiple
static/dynamic camera(s) systems.
• Temporal variations
Include the changes in the performance rate from one person to another,
as well as changes in the recording rate (frames/sec).
11. The main structure of the
action recognition system
The structure of the action recognition system is typically hierarchical:
Capture the input video → Human detection & segmentation →
Extraction of the action descriptors → Action classification
12. Capture the input video
For a single camera, the scene is captured from only one viewpoint, so it can't
provide enough information about the performed action in case of a poor
viewpoint. Besides, it can't handle the occlusion problem.
[Figure: four sample videos (Videos 1-4) captured from different viewpoints.]
13. Multi-camera systems can capture the same scene from different poses, so they
provide sufficient information that can alleviate the occlusion problem.
[Figure: the same scene captured by Cameras 0-3.]
14. The new Kinect depth camera technology can be utilized to capture the
performed actions. The device has an RGB camera, a depth sensor, and a
multi-array microphone.
It provides full-body 3D motion capture, facial recognition, and voice
recognition capabilities. Furthermore, the depth information can be used for
segmentation.
[Figure: RGB and depth information captured by the Kinect depth camera.]
15. Human detection &
segmentation
It's the first step of the full process of human sequence evaluation.
Techniques can be divided into:
• Background Subtraction techniques
• Motion Based techniques
• Appearance Based techniques
• Depth Based Segmentation
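As a toy illustration of the first family above, background subtraction can be sketched with a per-pixel temporal median model (a generic sketch, not the thesis implementation; the threshold value is arbitrary):

```python
import numpy as np

def segment_foreground(frames, thresh=25):
    """Median-model background subtraction: the static background is the
    per-pixel median over time; thresholding the absolute difference
    gives a binary foreground mask per frame."""
    frames = np.asarray(frames, dtype=np.float32)   # (T, H, W) grayscale
    background = np.median(frames, axis=0)
    return (np.abs(frames - background) > thresh).astype(np.uint8)

# toy example: a bright 2x2 "actor" moving over a dark static background
T, H, W = 10, 8, 8
frames = np.zeros((T, H, W), dtype=np.float32)
for t in range(T):
    frames[t, 3:5, t % 6: t % 6 + 2] = 200.0
masks = segment_foreground(frames)
```

Motion-, appearance-, and depth-based techniques replace the median model with, respectively, flow, a detector, or a depth threshold, but the masking step stays the same.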
16. Extraction of the
action
descriptors
Input videos consist of massive amounts of information in the form of
spatiotemporal pixel intensity variations, but most of this information is not
directly relevant to the task of understanding and identifying the activity
occurring in the video.
In this work we used Non-Parametric approaches in which a set of features
are extracted per video frame, then these features are accumulated and
matched to stored templates.
Example:
Motion Energy Image
&
Motion History Image
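These two templates can be sketched in a few lines of plain numpy (assuming binary silhouettes and a decay of one gray level per frame, which is one common convention, not necessarily the exact one used here):

```python
import numpy as np

def mei_mhi(silhouettes, tau=None):
    """Motion Energy Image (where motion occurred) and Motion History
    Image (when it occurred: newer motion is brighter), accumulated
    from frame-to-frame silhouette differences."""
    sil = np.asarray(silhouettes, dtype=np.uint8)   # (T, H, W), values {0,1}
    T = sil.shape[0]
    tau = tau if tau is not None else T             # history length
    mhi = np.zeros(sil.shape[1:], dtype=np.float32)
    for t in range(1, T):
        moved = sil[t] != sil[t - 1]                # pixels that changed now
        mhi = np.maximum(mhi - 1, 0)                # decay older motion
        mhi[moved] = tau                            # stamp newest motion
    mei = (mhi > 0).astype(np.uint8)                # union of recent motion
    return mei, mhi

# toy example: a single pixel moving right across three frames
sil = np.zeros((3, 5, 5), dtype=np.uint8)
sil[0, 2, 0] = 1
sil[1, 2, 1] = 1
sil[2, 2, 2] = 1
mei, mhi = mei_mhi(sil)
```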
17. Action
classification
When the extracted features are available for an input video, human action
recognition becomes a classification problem.
Dimensionality reduction is a common step before the actual classification and is
discussed first.
Dimensionality reduction
Image representations are often high-dimensional. This makes the matching task
computationally more expensive. Also, the representation might contain noisy
features. This problem triggered the idea of obtaining a more compact, robust
feature representation by reducing the space of the image representation into a
lower-dimensional space.
Example: One/Two-Dimensional Principal Component Analysis (1DPCA/2DPCA)
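A minimal sketch of 2DPCA: unlike 1DPCA, the images are never vectorised; the image covariance matrix is built directly and each image is projected row-wise onto the `d` dominant eigenvectors (function names are mine, for illustration only):

```python
import numpy as np

def fit_2dpca(images, d):
    """2DPCA: build the (W x W) image covariance matrix from the training
    images directly, then keep the d eigenvectors with the largest
    eigenvalues as the dominant projection vectors."""
    A = np.asarray(images, dtype=np.float64)        # (N, H, W)
    mean = A.mean(axis=0)
    centered = A - mean
    # G = (1/N) * sum_n (A_n - mean)^T (A_n - mean), summed over rows
    G = np.einsum('nhw,nhv->wv', centered, centered) / len(A)
    eigvals, eigvecs = np.linalg.eigh(G)            # ascending eigenvalues
    X = eigvecs[:, ::-1][:, :d]                     # d dominant columns
    return mean, X

def project_2dpca(image, X):
    """Row-wise projection: an (H, W) image becomes an (H, d) feature matrix."""
    return np.asarray(image, dtype=np.float64) @ X

rng = np.random.default_rng(0)
imgs = rng.normal(size=(20, 8, 6))                  # 20 toy 8x6 "images"
mean, X = fit_2dpca(imgs, 2)
feats = project_2dpca(imgs[0], X)
```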
18. Nearest neighbor classification
k-Nearest neighbor (k-NN) classifiers use the distance between the features of an
observed sequence and those in a training set. The most common label among the
k closest training sequences is chosen as the classification.
NN classification can be either performed at the frame level, or for the whole
video sequences. In the latter case, issues with different frame lengths need to be
resolved.
In our work we used 1-NN with Euclidean distance to classify the tested
actions.
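The 1-NN rule is tiny; a sketch over per-video feature matrices, using the Euclidean (Frobenius) distance as stated above:

```python
import numpy as np

def classify_1nn(test_feat, train_feats, train_labels):
    """1-nearest-neighbour: return the label of the training feature
    matrix with the smallest Euclidean distance to the test features."""
    dists = [np.linalg.norm(np.asarray(test_feat) - np.asarray(f))
             for f in train_feats]
    return train_labels[int(np.argmin(dists))]

# toy example: two stored videos, one test video closer to the second
train_feats = [np.zeros((2, 2)), np.full((2, 2), 5.0)]
train_labels = ['walk', 'run']
pred = classify_1nn(np.array([[4.0, 4.0], [5.0, 6.0]]),
                    train_feats, train_labels)
```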
20. 2DHOOF/2DPCA
Contour Based
Optical Flow Algorithm
• Dense vs. Sparse OF
• Alignment issues with OF
• The calculation of the 2D Histogram of Optical Flow (2DHOOF)
• Overall System Description
• Experimental Results
21. Dense vs. Sparse OF
In practice, dense OF is not the best choice for obtaining the OF. Besides its
high computational complexity, it is not accurate for homogeneous moving objects
(the aperture problem).
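For illustration, sparse flow at contour points can be sketched with simple block matching (a stand-in for the pyramidal Lucas-Kanade trackers typically used for sparse OF; `patch` and `search` sizes are arbitrary choices of mine):

```python
import numpy as np

def sparse_flow(prev, nxt, points, patch=2, search=3):
    """Sparse optical flow by block matching: for each contour point,
    find the displacement (within +/-search pixels) whose patch in the
    next frame best matches (minimum SSD) the patch around the point
    in the previous frame."""
    prev = np.asarray(prev, dtype=np.float32)
    nxt = np.asarray(nxt, dtype=np.float32)
    flows = []
    for (y, x) in points:
        ref = prev[y - patch:y + patch + 1, x - patch:x + patch + 1]
        best, best_err = (0, 0), np.inf
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                cand = nxt[y + dy - patch:y + dy + patch + 1,
                           x + dx - patch:x + dx + patch + 1]
                err = np.sum((ref - cand) ** 2)
                if err < best_err:
                    best_err, best = err, (dy, dx)
        flows.append(best)
    return np.array(flows)          # one (dy, dx) vector per contour point

# toy example: the second frame is the first shifted down 1, right 2
rng = np.random.default_rng(1)
prev = rng.normal(size=(16, 16))
nxt = np.roll(prev, (1, 2), axis=(0, 1))
flows = sparse_flow(prev, nxt, [(8, 8)])
```

Tracking only the contour points keeps the cost proportional to the contour length rather than the frame area, which is the advantage claimed above.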
22. Alignment issues with OF
We had two choices for the order of actor alignment:
• Align the actor, then calculate the OF
• Calculate the OF, then align it
30. Overall System Description
Training Mode:
Segmentation & Contour Extraction → Sparse OF → 2DHOOF → 2DPCA
(extract the dominant vectors) → Store extracted features
Testing Mode:
Segmentation & Contour Extraction → Sparse OF → 2DHOOF → Projection on
the dominant vectors → Classification and Voting Scheme
32. Segmentation & Contour Extraction (Method 1)
• Geodesic segmentation, where
xi: stroke pixels (black)
x: other pixels (white)
I: image intensity
Pipeline: Input Video Frame → Face Detection → Initial Stroke → Blob
Extraction → Final Contour
33. Segmentation & Contour Extraction (Method 2)
• Contour extraction from the magnitude of the dense OF
An edge pixel satisfies a specific criterion based on its (3 x 3) neighbor pixels.
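A sketch of that 3x3 criterion, using the 3-to-6 neighbour count described in the notes; requiring the centre pixel itself to be moving is my reading of the criterion, not stated explicitly:

```python
import numpy as np

def contour_from_flow_magnitude(mag, lo=3, hi=6):
    """Mark a moving pixel as a contour (edge) pixel when the count of
    non-zero-magnitude pixels in its 3x3 neighbourhood (including
    itself) falls in [lo, hi]. Interior pixels of the moving blob have
    ~9 non-zero neighbours; background pixels are not moving at all."""
    nz = (np.asarray(mag) > 0).astype(np.int32)
    H, W = nz.shape
    edges = np.zeros_like(nz)
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            s = nz[y - 1:y + 2, x - 1:x + 2].sum()
            if nz[y, x] and lo <= s <= hi:
                edges[y, x] = 1
    return edges

# toy example: a 4x4 moving blob -> only its 12 boundary pixels survive
mag = np.zeros((10, 10))
mag[3:7, 3:7] = 1.0
edges = contour_from_flow_magnitude(mag)
```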
42. Experimental Results
Two experiments were conducted to evaluate the performance of the proposed
algorithm.
• For the first experiment, the Weizmann dataset was used to measure the
performance of the low-resolution single-camera operation.
• For the second experiment, the IXMAS multi-view dataset was used to evaluate
the performance of the parallel camera structure.
The two experiments were conducted using the Leave-One-Actor-Out (LOAO)
technique to be consistent with the most recent algorithms.
Both datasets provide RGB frames and the actors' silhouettes.
43. Weizmann dataset
The Weizmann dataset consists of 90 low-resolution video sequences showing 9
different actors, each performing 10 natural actions such as walk, run, jump
forward, gallop sideways, bend, wave with one hand (wave1), wave with two
hands (wave2), jump in place (Pjump), jump-jack, and skip.
[Figure: sample frames of the bend, run, jump, jump-jack, and gallop actions.]
44. The confusion matrix for this experiment shows that the average recognition
accuracy is 97.78%, and eight actions were recognized with 100% accuracy.
2DHOOF / 2DPCA
45. On the other hand, using 1DHOOF with 1DPCA decreases the accuracy to
63.34% because of the large confusion between actions (as discussed before).
1DHOOF / 1DPCA
46. Comparison with the most recent algorithms:
• Recognition Accuracy

| Method | Accuracy |
| Previous Contribution | 98.89% |
| Our Algorithm | 97.79% |
| Shah et al. | 95.57% |
| Yang et al. | 92.8% |
| Yuan et al. | 92.22% |

• Average Testing Time

| Method | Average Runtime |
| Our Algorithm | 66.11 msec |
| Previous Contribution | 113.00 msec |
| Shah et al. | 18.65 sec |
| Blank et al. | 30 sec |
48. IXMAS Dataset
The proposed parallel structure algorithm was applied on the IXMAS multi-view
dataset. Each camera is considered an independent system, then a voting
scheme is carried out between the four cameras to obtain the final decision.
The dataset consists of 5 cameras capturing the scene and 12 actors, each
performing 13 natural actions 3 times, in which the actors are free to change
their orientation for each scenario.
The actions: check watch, cross arms, scratch head, sit down, get up, turn
around, walk, wave, punch, kick, and pick up and throw.
[Diagram: Camera 0-3 → Our Algorithm (one instance per camera) → Voting
Scheme → Final Decision]
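The per-camera voting above can be sketched as a simple majority vote; breaking ties in favour of the first camera is my assumption, since the slides do not specify a tie-break rule:

```python
from collections import Counter

def vote(decisions):
    """Majority voting across the per-camera action decisions.
    Ties are broken by the earliest camera holding a top count
    (an assumption -- the tie-break rule is not spelled out here)."""
    counts = Counter(decisions)
    top = max(counts.values())
    for d in decisions:                 # first camera wins a tie
        if counts[d] == top:
            return d

# toy example: three cameras say "kick", one says "punch"... etc.
final = vote(['kick', 'kick', 'punch', 'wave'])
```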
49. Example on the IXMAS multi-camera dataset. Action: Pick up and Throw.
[Figure: the same action as seen by Cameras 0-3.]
50. The confusion matrix for the IXMAS dataset shows that the average accuracy is
87.12%, where SH = Scratch head, CW = Check watch, CA = Cross arms, SD = Sit
down, GU = Get up, TA = Turn around, PU = Pick up.
51. Comparison with the best reported accuracies shows that we achieved the
highest accuracy, with an enhancement of 3%.

| Method | Actors # | Cam(0) % | Cam(1) % | Cam(2) % | Cam(3) % | Overall Vote % |
| Proposed Algorithm | 12 | 97.29 | 79.04 | 72.47 | 78.53 | 87.12 |
| Previous Contribution | 12 | 78.9 | 78.61 | 80.93 | 77.38 | 84.59 |
| Weinland et al. | 10 | 65.04 | 70.00 | 54.30 | 66.00 | 81.30 |
| Srivastava et al. | 10 | N/A | N/A | N/A | N/A | 81.40 |
| Shah et al. | 12 | 72.00 | 53.00 | 68.00 | 63.00 | 78.00 |

Bold indicates the best performance; N/A = not available in published reports.
53. Published Paper
F. Fawzy, M. Abdelwahab, and W. Mikhael, "2DHOOF-2DPCA Contour Based
Optical Flow Algorithm for Human Activity Recognition," IEEE International
Midwest Symposium on Circuits and Systems (MWSCAS 2013), Ohio, USA.
56. Radon Transform
The RT computes projections of an image matrix along specified directions. A
projection of a two-dimensional function f(x,y) is a set of line integrals along
parallel paths, or beams.
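These parallel-beam line integrals can be sketched discretely as follows: each pixel's intensity is accumulated into the detector bin at offset r = x·cos(theta) + y·sin(theta), with coordinates centred on the image (a nearest-neighbour sketch of mine, not the exact implementation used in the thesis):

```python
import numpy as np

def radon_transform(image, angles_deg):
    """Discrete Radon transform sketch: for each angle, project every
    pixel onto the detector axis and accumulate its intensity into the
    nearest integer bin. The detector length is the image diagonal
    (plus one bin of padding)."""
    img = np.asarray(image, dtype=np.float64)
    H, W = img.shape
    ys, xs = np.mgrid[0:H, 0:W]
    xs = xs - (W - 1) / 2.0                 # centre the coordinates
    ys = ys - (H - 1) / 2.0
    n_bins = int(np.ceil(np.hypot(H, W))) + 1
    sino = np.zeros((n_bins, len(angles_deg)))
    for j, a in enumerate(angles_deg):
        t = np.deg2rad(a)
        r = xs * np.cos(t) + ys * np.sin(t)
        bins = np.round(r + n_bins / 2.0).astype(int)
        np.add.at(sino[:, j], bins.ravel(), img.ravel())
    return sino

# toy example: a centred square; its 0- and 90-degree projections match
img = np.zeros((8, 8))
img[2:6, 2:6] = 1.0
sino = radon_transform(img, [0, 45, 90])
```

Every pixel lands in exactly one bin per angle, so each projection conserves the total image mass.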
58. Overall system description
The proposed system is designed and tested for gesture recognition and can be
extended to regular action recognition.
We have two modes for this algorithm
• Training Mode
• Testing Mode
Both have a pre-processing step before feature extraction.
60. Pre-processing Step:
1) Input videos
The One-Shot Learning ChaLearn Gesture Dataset was used for this experiment.
In this dataset, a single user facing a fixed Kinect™ camera and interacting
with a computer by performing gestures was captured.
Videos are represented by RGB and depth images.
Each actor has from 8 to 15 different gestures (the vocabulary) for training,
and 47 input videos, each containing from 1 to 5 gesture(s), for testing.
We applied our algorithm on a subset of this dataset consisting of 37 different
actors.
61. The dataset can be divided into two main groups: standing actors and sitting
actors. In this experiment we used a subset of the standing actor group, in
which actors use their whole body to perform the gesture and make significant
motion to be captured by the MEI and MHI.
[Figure: examples of standing and sitting actors.]
62. Also, we used only the depth videos as input. Depth information makes the
segmentation task easier than using RGB or grayscale videos, especially when
the actor's clothes have the same color as the background, or the background
is textured.
66. In some cases the resultant blob has some extra objects with it. This noise
results from objects that were at the same depth as the actor.
[Figure: three example noise cases (Cases 1-3).]
67. In this situation we perform a noise elimination step.
[Figure: the three cases after noise elimination.]
75. Basically, the difference between the RT of the whole body and the RT of the
body parts is the white portion in the center, representing the projection of
the actor's body.
78. Video Chopping
As we have mentioned, the testing videos may contain from 1 to 5
different gestures per video. In this case we need to separate these
gestures into one gesture per video before testing our system.
We do that in two main steps:
1. Calculate the plot that represents the moving area per frame.
2. Apply the local-minima criteria to this plot.
80. 2. Apply the local-minima criteria
We are searching for a frame i that satisfies the following conditions:
a) The number of frames before i is greater than or equal to the frame
threshold.
b) The amount of decrease in the area at i is greater than 50% of the peak
value.
c) The areas at i-1 and i+1 are greater than the area at i, to ensure that i
is a local minimum between two peaks.
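The three conditions can be sketched as a small helper over the per-frame area plot; resetting the running peak after each cut (so every gesture gets its own peak) is my assumption:

```python
def chop_points(area, frame_threshold, peak_frac=0.5):
    """Find cut frames between gestures: frame i is a cut when
    (a) at least frame_threshold frames separate it from the previous
        cut,
    (b) the area has dropped below peak_frac of the running peak, and
    (c) area[i-1] and area[i+1] both exceed area[i], i.e. i is a local
        minimum between two peaks."""
    cuts, last_cut, peak = [], 0, 0.0
    for i in range(1, len(area) - 1):
        peak = max(peak, area[i])
        if (i - last_cut >= frame_threshold
                and area[i] < peak_frac * peak
                and area[i - 1] > area[i] < area[i + 1]):
            cuts.append(i)
            last_cut, peak = i, 0.0     # start fresh for the next gesture
    return cuts

# toy area plot with two deep valleys between three gesture "humps"
cuts = chop_points([1, 4, 9, 10, 8, 2, 7, 10, 9, 3, 8, 10],
                   frame_threshold=3)
```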
83. Experimental Results
We conducted four One-Shot Learning (OSL) experiments:
• Experiments I and II: Radon Transform as the action descriptor, classified
with 2DPCA and with direct correlation.
• Experiments III and IV: MEI/MHI as the action descriptor, classified with
2DPCA and with direct correlation.
84. Recognition accuracy (%) of the four experiments:

| Experiment | Whole Body MEI | Whole Body MHI | Body Parts MEI | Body Parts MHI |
| I (RT) | 71 | 69 | 82 | 81.5 |
| II (RT) | 70 | 70 | 81.7 | 81.6 |
| III (MEI/MHI) | 70 | 68 | 82 | 81.7 |
| IV (MEI/MHI) | 71.24 | 68.7 | 83.33 | 82.9 |

In all experiments, the Body Parts features perform better than the Whole Body
features.

Comparison between using the RT and using the MEI/MHI directly (without RT) as
the 2DPCA features:

| Features | % Maintained Energy | Storage Requirements |
| RT | 99% | 72 MBytes |
| MEI/MHI | 88% | 102 MBytes |
Speaker notes
First, the introduction. It covers 3 main points: the importance and applications of this field, the difference between action and activity, and finally the challenges and characteristics of the domain.
The differences between Action and Activity are that …
Intra-class variations are variations within a single class, because action performance can differ from one actor to another. Inter-class variations are variations between two or more different classes, due to the similarity in action performance. For better recognition results we need less intra-class variation and more inter-class variation.
Environmental variations are disturbances originating from the actor's surroundings. Capturing conditions depend on the method used to capture the scene, including the usage of single/multiple moving or static cameras. Temporal variations include the changes in performance rate from one actor to another, and changes in the recording rate.
The structure of the action recognition system is typically hierarchical. It starts by capturing the input video and extracting the actor's body from it, followed by feature extraction and finally action classification.
As shown here, the first 3 videos are captured from a good viewpoint, so we can gain enough information about the actions. But the 4th video is captured from a poor viewpoint, in which the actor's body is hiding the action details.
Human detection is the task of finding the presence and the position of human beings in images/videos. We briefly describe a few popular human segmentation techniques:
MEI: represents the locations where the motion has occurred in the image sequence. MHI: represents the history of this motion by different gray levels (newer motion is brighter).
As shown here, this jumping actor has non-textured clothes, so the dense OF will have some inaccurate results, because the body pixels in the current frame cannot determine their new locations in the next frame; only the edge points can accurately describe the actor's motion. So we used the sparse OF of the actor's contour, because it is less computationally expensive compared to the dense OF and can accurately describe the motion without the need for excessive processing.
For results consistency, we used an alignment step before feature extraction, and we found that the order of this step affects the results with a significant difference.
Actions like running can be represented by their jumping and transition effects
If we align the actor and then calculate the OF, these effects will vanish, and only the legs' motion is captured; any other motion is due to the poor alignment. On the other hand, if we calculate the OF and then align it, the transition and jumping effects will be captured in the calculated OF. We can see this conclusion from the OF of the head pixels. So we chose to calculate the OF and then align it.
After obtaining the OF for each two successive frames of the input video, we used it to calculate the new features of the n-layer 2DHOOF. The calculated OF of size W x H was divided into blocks, each of size m x m. For each block, a 1DHOOF with n bins representing the different ranges of angles was obtained. Then each bin from the 1DHOOF contributes to the corresponding layer of the 2DHOOF at the location corresponding to the block location. So the size of the final 2DHOOF is W/m x H/m x n.
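The block/bin bookkeeping described in this note can be sketched as follows (weighting each bin by the flow magnitude is my assumption; unweighted counts would work the same way):

```python
import numpy as np

def hoof_2d(flow_x, flow_y, m, n_bins):
    """Multi-layer 2D Histogram of Optical Flow: split the flow field
    into m x m blocks, histogram the flow angles of each block into
    n_bins angle ranges (magnitude-weighted -- an assumption), and
    write bin k of every block into layer k at the block's location.
    Output shape: (H//m, W//m, n_bins)."""
    H, W = flow_x.shape
    ang = np.mod(np.arctan2(flow_y, flow_x), 2 * np.pi)   # angles in [0, 2pi)
    mag = np.hypot(flow_x, flow_y)
    out = np.zeros((H // m, W // m, n_bins))
    for by in range(H // m):
        for bx in range(W // m):
            sl = np.s_[by * m:(by + 1) * m, bx * m:(bx + 1) * m]
            hist, _ = np.histogram(ang[sl], bins=n_bins,
                                   range=(0, 2 * np.pi), weights=mag[sl])
            out[by, bx] = hist       # one value per layer at this block
    return out

# toy example: uniform rightward flow -> everything lands in angle bin 0
flow_x = np.ones((8, 8))
flow_y = np.zeros((8, 8))
h = hoof_2d(flow_x, flow_y, m=4, n_bins=4)
```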
For example, if we divide the calculated OF into blocks each of size (W/2 x H/2), the final 2DHOOF layers have a size of (2 x 2).
After calculating the 2DHOOF for each 2 successive frames of the input video, these histograms are layer-wise accumulated and normalized to obtain the total 2DHOOF for the whole video. These features are independent of the actor's scale and tolerant to contour imperfections. Furthermore, they are independent of the start of the action, as the multi-layer 2DHOOFs per frame are finally accumulated and normalized regardless of their temporal order.
The main advantage of the 2DHOOF is that it maintains the spatial relation between the moving parts, compared to the 1DHOOF, which captures only the dominant motion wherever it occurs.
As shown here, the bend and wave actions have the same motion directions. The main difference is the spatial location of this motion. Since the 1DHOOF doesn't maintain the spatial locations of the motion, it cannot be used to discriminate between these actions, as they use the same range of angles.
Our system is divided into two modes: a training mode, in which the training features are extracted and stored, and a testing mode, in which the dominant features are obtained for the tested video and then compared to the stored training features to get the final decision.
The first step in the training mode is actor segmentation and contour extraction.
We tried two different methods for contour extraction. The first method is geodesic segmentation. The idea of this method is to draw an initial stroke on the actor's body and try to expand it to cover all other pixels that are near and have low intensity variation compared to the stroke pixels. These two conditions are met by measuring the geodesic distance between the initial stroke pixels and the other pixels. We used face detection to draw this initial stroke automatically.
The accuracy of this method is highly dependent on the initial stroke.
The second method uses the magnitude of the dense OF. An edge pixel satisfies a specific criterion based on its (3 x 3) neighbor pixels. As shown here, the black dot represents the edge pixel, and the ones represent the neighbor pixels that have non-zero magnitude, so the criterion can be simply described by the summation of these ones. We found that an edge point has a summation value from 3 to 6.
As shown here, for each pixel we applied this edge criterion to extract the edge pixels.
The main steps of this method are: calculate the magnitude of the dense OF, then find the edge pixels using the edge criterion, and finally apply a simple threshold to remove the noise.
The second step in the training mode is extracting the dominant features. After calculating the OF and the 2DHOOF, we used 2DPCA to extract the dominant features.
For each range of angles in the training 2DHOOFs, we calculate the mean and then the covariance matrix, and then obtain the dominant vectors that correspond to the maximum eigenvalues. The histograms are then projected onto the dominant vectors to extract the final features.
These features are stored to be used in the testing mode.
The 2DHOOF of the tested video is projected on the dominant vectors to obtain the final features.
These features are then matched against the stored features using the 1-NN classifier with Euclidean distance. The final decision is based on the minimum distance.
We compared our algorithm with the most recent algorithms in terms of recognition accuracy and average testing time. The achieved accuracy is comparable with the highest reported accuracy, obtained in our previous contribution. This excellent accuracy was achieved in spite of the imperfect and noisy contours, which makes this method independent of how perfect the extracted contours are. Also, our algorithm has the best testing time, which promotes it for real-time applications.
This accuracy was achieved in spite of the presence of shadows and imperfections in the extracted contours.
As shown here, we chose an initial T and start the segmentation algorithm. After a number of iterations we can segment the actor.
We have 3 cases. Case 1: the noise and the actor are not connected. Case 2: the noise and the actor are connected, but can be separated using simple morphological operations. Case 3: the noise and the actor are connected, but can't be separated.
By calculating the area of each object and keeping only the object with the maximum area.
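A numpy-only sketch of this keep-the-largest-object step (BFS flood fill over 4-connected components; the thesis may use morphological tools or a library labeling routine instead):

```python
import numpy as np
from collections import deque

def keep_largest_object(mask):
    """Noise elimination: label the 4-connected components of a binary
    mask with a BFS flood fill, then keep only the component with the
    largest area."""
    mask = np.asarray(mask, dtype=bool)
    labels = np.zeros(mask.shape, dtype=np.int32)
    sizes, current = {}, 0
    for y, x in zip(*np.nonzero(mask)):
        if labels[y, x]:
            continue                        # already labeled
        current += 1
        q, count = deque([(y, x)]), 0
        labels[y, x] = current
        while q:
            cy, cx = q.popleft()
            count += 1
            for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                           (cy, cx - 1), (cy, cx + 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = current
                    q.append((ny, nx))
        sizes[current] = count
    if not sizes:
        return mask.astype(np.uint8)
    biggest = max(sizes, key=sizes.get)
    return (labels == biggest).astype(np.uint8)

# toy example: a 4x4 actor blob plus a single noise pixel at equal depth
mask = np.zeros((10, 10), dtype=np.uint8)
mask[2:6, 2:6] = 1      # the actor
mask[8, 8] = 1          # depth noise
clean = keep_largest_object(mask)
```

This handles Case 1 directly; Cases 2 and 3 would additionally need the morphological separation mentioned above.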
The segmented actor can be aligned using the 0° and 90° projections from the RT. For vertical alignment we used the 90° projection information. We specified the projection rectangle on the y-axis and aligned its center (red line) to the y-center (purple line) of the frame, as shown in the figure. As the gestures don't include whole-body motion (i.e., walk, run, ...), we can use the RT of only the first frame and shift all the video frames by the same distance.
For horizontal alignment we used the 0° projection information. As shown in the figure, the maximum projection value on the x-axis represents the center line of the actor's body. The distance between the maximum projection (red line) and the x-center of the frame (purple line) is the amount needed to align the actor's body at the x-center.
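The two alignment steps in these notes can be sketched as follows: for a binary silhouette, the 0° and 90° Radon projections reduce to column and row sums, the horizontal shift moves the projection peak to the x-center, and the vertical shift moves the midpoint of the non-zero projection span to the y-center (a sketch of the described procedure; the thesis implementation may handle the span differently):

```python
import numpy as np

def center_actor(mask):
    """Align the actor with the frame centre using the 0-degree
    (column-sum) and 90-degree (row-sum) projections of the binary
    silhouette. Returns the shifted mask and the (dy, dx) shift."""
    H, W = mask.shape
    col_proj = mask.sum(axis=0)                 # 0-degree projection
    row_proj = mask.sum(axis=1)                 # 90-degree projection
    dx = (W - 1) // 2 - int(np.argmax(col_proj))        # peak -> x-centre
    rows = np.nonzero(row_proj)[0]
    dy = (H - 1) // 2 - int((rows[0] + rows[-1]) // 2)  # span mid -> y-centre
    aligned = np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
    return aligned, (dy, dx)

# toy example: a 2x2 actor in the top-left corner of a 10x10 frame
mask = np.zeros((10, 10), dtype=np.uint8)
mask[1:3, 1:3] = 1
aligned, (dy, dx) = center_actor(mask)
```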
We have two types of MEI/MHI. The first is the whole-body MEI/MHI, and the second includes only the moving body parts. For gestures that include hand motion in front of the body area, the MEI/MHI of the whole body fails to capture this motion and hides it behind the actor's body; that's why we calculate the body-parts version. This makes the MEI/MHI of the moving parts more reliable and accurate than the MEI/MHI of the whole body.
The RT of the MEI/MHI is the projection of the image information onto a range of angles from 0 to 180 degrees. The resultant RT has a height of DL and a width of 180.
After obtaining RT we applied 2DPCA to extract the final features and store them.
Testing mode is very similar to the training mode except for the video chopping step.
1. Take the first frame as a starting-position reference for each new gesture. 2. Perform frame differencing between the first frame and each frame in the video to get the moving parts. 3. Calculate the area of the moving body parts by summing the number of white pixels per frame, and then plot it. From the plot we can see that the area decreases when the actor is about to finish the gesture and returns to the starting position to start a new one.
After obtaining the area plot, we apply the local-minima criteria to it: 1) to prevent cutting the video in the middle of a gesture; 2) to ensure that the actor is returning to the starting position; 3) …
We applied this method on videos containing from 2 to 5 gestures.
In some cases the actor doesn't return to the starting position between gestures; that's why the algorithm merges two successive gestures into one. When this happens, we discard this video from the training.
We did 4 OSL experiments. The first 2 experiments use the RT as the action descriptor, and the second 2 experiments use the MEI/MHI as the action descriptor. For each case we used 2 different methods for classification: 2DPCA with 1-NN, and direct correlation.
As shown in this table, there is almost no difference in accuracy between using the RT of the MEI/MHI and using the MEI/MHI directly, but the RT is better than the MEI/MHI in the amount of storage needed for the calculated features. The RT reduces the storage requirements by 30% compared to the storage needed for the MEI/MHI features, as shown in the 2nd table. In addition, the RT maintains about 99% of the features' energy, compared to the MEI/MHI, which maintains 88%. In all experiments the accuracy of the body parts is much better than the whole body, because of the motion occlusion that makes different gestures appear as if they are similar.
For example, these two actions have close whole-body MEI/MHIs, although they are totally different.