"Using Deep Learning for Video Event Detection on a Compute Budget," a Presentation from PathPartner Technology

© 2019 PathPartner Technology
Using Deep Learning for Video Event Detection on a Compute Budget
Praveen Nayak
PathPartner Technology
May 2019

Outline
• Introduction to Video Event Detection
• Learning Representations from Video
• From Video Representation to Event Detection
• Decoupling the “When” and “What”
• Results, case study on UCF101-Thumos2015 challenge
• Conclusion

Introduction to Video Event Detection

Video data as viewed by ML
• Video is a 3D signal
• Spatial coordinates x, y (limited by W×H)
• Temporal coordinate t (limited by T)
• If we fix t, we obtain an image/frame
• We can understand a video as a sequence of images
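A minimal sketch of the "video as a 3D signal" view, assuming a grayscale video stored as nested Python lists indexed as video[t][y][x]. The sizes and the helper names frame_at and pixel_over_time are hypothetical and chosen only for illustration.

```python
# A tiny synthetic "video": T frames, each H rows by W columns of grayscale pixels.
T, H, W = 4, 3, 5

# video[t][y][x] holds the pixel at time t, row y, column x.
video = [[[t * 100 + y * 10 + x for x in range(W)] for y in range(H)]
         for t in range(T)]


def frame_at(video, t):
    """Fixing t selects a single H x W image (one frame)."""
    return video[t]


def pixel_over_time(video, x, y):
    """Fixing (x, y) instead gives a 1D temporal signal of length T."""
    return [frame[y][x] for frame in video]


frame = frame_at(video, 2)
print(len(frame), len(frame[0]))         # rows and columns of one frame
print(pixel_over_time(video, x=1, y=0))  # one pixel followed across all frames
```

Iterating over t and calling frame_at is the "sequence of images" reading of the same data.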

Event Detection
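• Retrieve the start (t_start) and end (t_end) points of an “event” from a temporally “untrimmed” video
• Evaluation metric: mAP and recall at a given temporal IoU (tIoU)
$\mathrm{tIoU} = \frac{n_I}{n_U}$, where n_I and n_U are the lengths of the temporal intersection and union of the predicted and ground-truth segments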
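The tIoU definition can be sketched in a few lines. This assumes segments are given as (t_start, t_end) pairs in any consistent unit (frames or seconds). The function names and the 0.4 default threshold are illustrative choices, the latter borrowed from the tIoU value used in the results tables.

```python
def temporal_iou(pred, gt):
    """tIoU of two segments, each given as (t_start, t_end) with t_end >= t_start."""
    # n_I: length of the temporal overlap (zero if the segments are disjoint).
    n_i = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # n_U: total length covered by either segment.
    n_u = (pred[1] - pred[0]) + (gt[1] - gt[0]) - n_i
    return n_i / n_u if n_u > 0 else 0.0


def is_correct_detection(pred, gt, tiou_threshold=0.4):
    """A prediction counts as a hit when its tIoU reaches the chosen threshold."""
    return temporal_iou(pred, gt) >= tiou_threshold


print(temporal_iou((10, 30), (20, 40)))          # overlap 10, union 30
print(is_correct_detection((10, 30), (20, 40)))  # 1/3 is below the 0.4 threshold
```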

Learning Representations from Video

Spatiotemporal fusion networks
Image: Kim et al., Weighing classes and streams: toward better methods for two-stream convolutional networks

Convolutions for spatiotemporal data
• C3D model: all convolutions are 3D, with fewer parameters than 2D convolutions applied over multiple frames
• 3D convolutions give joint appearance and motion features at every layer
Figure labels: 3D convolutions; C3D feature vector
Image: D. Tran et al., Learning Spatiotemporal Features with 3D Convolutional Networks
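One way to read the parameter claim is to compare a single 3D convolution layer against a 2D convolution layer that sees all frames of the clip stacked along the channel axis. That reading is an assumption, not something the slide spells out. The layer sizes below (3 input channels, 64 filters, 3x3x3 kernels, 16-frame clips) are illustrative values, and the helper names are hypothetical.

```python
def conv3d_params(c_in, c_out, k=3):
    """Weights plus biases of one k x k x k 3D convolution layer."""
    return k * k * k * c_in * c_out + c_out


def conv2d_stacked_params(c_in, c_out, frames, k=3):
    """Weights plus biases of one k x k 2D convolution applied to `frames`
    frames stacked along the channel axis (assumed baseline)."""
    return k * k * (frames * c_in) * c_out + c_out


c_in, c_out, frames = 3, 64, 16
print(conv3d_params(c_in, c_out))                  # kernel shared along time
print(conv2d_stacked_params(c_in, c_out, frames))  # separate weights per frame
```

Under this assumption the 3D layer is smaller whenever the temporal kernel size is below the clip length, because the same kernel is reused at every temporal position.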

State-of-the-art video descriptors
Image: J. Carreira et al., Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

Action Classification Datasets
Dataset        Action classes   # Clips   Temporal trimming
HMDB-51 [1]    51               ~7k       Yes
UCF-101 [2]    101              ~13k      Yes
Kinetics [3]   400              ~160k     No
• The classification metric is mAP, similar to the image-based classification metric, extended to the temporal domain
• The classifier makes one decision per clip
Figure panels: UCF-101, HMDB-51, Kinetics
[1] HMDB-51, Brown University, http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/
[2] UCF-101, University of Central Florida, https://www.crcv.ucf.edu/data/UCF101.php
[3] Kinetics dataset, DeepMind, https://deepmind.com/research/open-source/open-source-datasets/kinetics/

State-of-the-art video descriptors
Image and Table: J. Carreira et al., Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

From Video Representation to Event
Detection

From RCNN to Segment-CNN
Image: Z. Shou, Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs
Computationally very intensive!

TAL-NET: Parallels from Faster R-CNN
• Problem split: class-sensitive proposal generation, followed by inference
• End-to-end trainable
Figure panels: Faster R-CNN, TAL-NET
Image: Y. Chao et al., Rethinking the Faster R-CNN Architecture for Temporal Action Localization

Computational cost of Detection
Model          GMAC/inference   #params (million)   GMAC/video
VGG            15               138                 46.5k
C3D-SCNN [1]   79               80                  237k
C3D-LSTM [2]   24               86                  72k
TAL-NET [3]    29               98                  87k
SSAD [4]       61               356                 183k
Table: GMACs per video of event detectors
• Template size:
1. C3D, TAL-NET: 171x128x16
2. VGG: 224x224x3
• GMAC/inference: may be for frame-level inference (VGG) or clip-level inference (C3D)
• GMAC/video: assumes an average video length of 3000 frames
Figure: Accuracy vs GMACs/video of event detectors
[1] Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs, http://dvmmweb.cs.columbia.edu/files/dvmm_scnn_paper.pdf
[2] Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks, https://imatge-upc.github.io/activitynet-2016-cvprw/
[3] Rethinking the Faster R-CNN Architecture for Temporal Action Localization, https://arxiv.org/pdf/1804.07667.pdf
[4] Single Shot action detection, https://arxiv.org/abs/1710.06236
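The GMAC/video column appears to follow from the per-inference cost and the 3000-frame assumption, with one inference per frame position. That reading is inferred from the numbers rather than stated on the slide, so the sketch below treats it as an assumption. The figures are copied from the table, and the function name and the stride parameter are hypothetical.

```python
FRAMES_PER_VIDEO = 3000  # average video length assumed on the slide

# (GMAC per inference, reported GMAC per video), copied from the table.
MODELS = {
    "VGG": (15, 46_500),
    "C3D-SCNN": (79, 237_000),
    "C3D-LSTM": (24, 72_000),
    "TAL-NET": (29, 87_000),
    "SSAD": (61, 183_000),
}


def gmac_per_video(gmac_per_inference, frames=FRAMES_PER_VIDEO, stride=1):
    """Cost of running one inference every `stride` frames over the whole video."""
    return gmac_per_inference * (frames // stride)


for name, (per_inference, reported) in MODELS.items():
    estimate = gmac_per_video(per_inference)
    print(f"{name:10s} estimated {estimate:>7d}  reported {reported:>7d}")
```

Four of the five rows reproduce exactly. The VGG row comes out at 45k against the reported 46.5k, which would be consistent with a per-inference figure nearer 15.5 GMAC that was rounded in the table.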

Decoupling the “When” and “What”

Proposed event detection system
• Break the problem down into two parts:
1. Class-agnostic segment proposal (low complexity)
2. Video segment inference (can be high complexity, depending on the nature of the video)
• The inference model may be long-term or short-term temporal, and is called on demand
• Characteristics of a good segment proposal:
• Caters to arbitrary event lengths
• Discriminates event from background, irrespective of the event
• Needs to run for every frame, so it must be low in complexity

Class agnostic segment proposal
• Formulate problem as unsupervised “anomaly detection”
• Train a model to learn anomalies against background
• At deployment, predict a binary label, i.e., anomaly has occurred or not
Figure: the anomaly detector is applied to the video clip at successive times t and produces a Yes/No result at each step

Training an anomaly detector
Video Autoencoder Framework
Figure: input frames Y_{t-1}, Y_t, Y_{t+1}; convolutional encoder E (2D-CNN, space) → LSTM encoder memory (Conv-LSTM, time) → LSTM decoder memory (Conv-LSTM, time) → convolutional decoder D (2D-CNN, space) → predicted frame Y'_{t+1}
• Sparse, low-dimensional encoding
• “Learn to Represent Background”
Error: $e_t = \lVert Y'_{t+1} - Y_{t+1} \rVert_2$
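The error term on this slide compares the predicted frame Y'_{t+1} with the actual frame Y_{t+1}. The sketch below reads the trailing "2" in the formula as an L2 norm, which is an assumption, and represents frames as plain 2D lists of pixel values. The function name is hypothetical.

```python
import math


def reconstruction_error(predicted, actual):
    """L2 norm of the pixel-wise difference between two equally sized frames."""
    return math.sqrt(sum((p - a) ** 2
                         for pred_row, actual_row in zip(predicted, actual)
                         for p, a in zip(pred_row, actual_row)))


background = [[0, 0], [0, 0]]
with_event = [[3, 0], [0, 4]]

print(reconstruction_error(background, background))  # perfect prediction
print(reconstruction_error(background, with_event))  # prediction misses the event
```

A model trained only on background should predict background well, so a large error is a hint that something other than background has entered the frame.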

Deployment of anomaly detector
• When an event occurs, reconstruction is poor
• The anomaly decision is based on a “regularity score”
• r_min and r_max are derived on a validation set
• 0.8 GMACs/frame, template: 171x128
Figure: the video autoencoder framework (convolutional encoder, LSTM encoder and decoder memories, convolutional decoder), with the error mapped to a regularity score
Regularity score: $R_t = 1 - \frac{r_t - r_{min}}{r_{max}}$
R_t ≈ 1: higher likelihood of background
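A minimal sketch of the regularity score, assuming r_t is the per-frame reconstruction error and that r_min and r_max are simply the smallest and largest errors seen on a validation set. The slide only says they are derived on validation, so that calibration rule and both function names are assumptions.

```python
def calibrate(validation_errors):
    """Assumed calibration: take the extreme errors observed on validation data."""
    return min(validation_errors), max(validation_errors)


def regularity_scores(errors, r_min, r_max):
    """R_t = 1 - (r_t - r_min) / r_max for every per-frame error r_t."""
    return [1.0 - (r - r_min) / r_max for r in errors]


r_min, r_max = calibrate([2.0, 4.0, 8.0])
print(regularity_scores([2.0, 8.0], r_min, r_max))
```

The lowest-error frame scores 1 (background-like) and the highest-error frame scores r_min / r_max, which is the lower end of the score range.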

Event detection pipeline
Figure: regularity score plotted over time t; C3D is run on each proposed segment, and each of the three segments shown is labeled CleanAndJerk
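The pipeline can be sketched as two steps: threshold the per-frame regularity score to get segment proposals, then call the heavier classifier only on those segments. This is a sketch under stated assumptions, not the presenter's implementation: a frame is proposed when its score falls below the threshold R, consecutive proposed frames are merged into one segment, and stub_classifier is a hypothetical stand-in for a model such as C3D.

```python
def propose_segments(scores, threshold):
    """Group consecutive frames with score < threshold into (start, end) segments.
    `end` is exclusive."""
    segments, start = [], None
    for t, score in enumerate(scores):
        if score < threshold and start is None:
            start = t                      # a low-regularity run begins
        elif score >= threshold and start is not None:
            segments.append((start, t))    # the run ended at frame t - 1
            start = None
    if start is not None:                  # run continues to the last frame
        segments.append((start, len(scores)))
    return segments


def detect_events(scores, threshold, classify):
    """Run the on-demand classifier once per proposed segment."""
    return [(s, e, classify(s, e)) for s, e in propose_segments(scores, threshold)]


def stub_classifier(start, end):
    """Hypothetical placeholder for the segment classifier."""
    return "CleanAndJerk"


scores = [0.9, 0.95, 0.4, 0.3, 0.35, 0.9, 0.92, 0.2, 0.25, 0.9]
print(detect_events(scores, 0.5, stub_classifier))
```

The classifier runs twice here instead of once per frame, which is where the compute saving comes from.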

Choosing threshold for characterizing anomalies
• Choosing the threshold R is critical for accuracy and complexity
• Large values of R ≈ 1 → more false positives, better recall
• Small values of R ≈ r_min/r_max → fewer false positives, lower recall (events missed)
Figure: predicted (Pred) vs ground-truth (GT) segments at three thresholds R, with a false detection and a missed event marked
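The two ends of the threshold range quoted above follow from the regularity score definition. The short derivation below assumes every error r_t lies between the validation-derived bounds r_min and r_max, which may not hold exactly on unseen video.

```latex
\begin{align*}
R_t &= 1 - \frac{r_t - r_{min}}{r_{max}}, \qquad r_{min} \le r_t \le r_{max} \\
r_t = r_{min} &\;\Rightarrow\; R_t = 1 \\
r_t = r_{max} &\;\Rightarrow\; R_t = 1 - \frac{r_{max} - r_{min}}{r_{max}} = \frac{r_{min}}{r_{max}}
\end{align*}
```

So a useful threshold R lies between r_min / r_max and 1. The closer it is to 1, the more frames fall below it and are proposed as events.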

Results – F1 score vs complexity on Thumos ’15
• Thumos ’15 challenge, conducted as a CVPR ’15 workshop
• Subset of UCF101: only 20 human action classes have event labels in untrimmed videos
• Evaluation metric: average recall and mAP at a given tIoU

Effect of Threshold R on Accuracy/Compute
Threshold (R)   #frame proposals   mAP (0.4 tIoU)   Recall (0.4 tIoU)   GMAC/video (avg)
0.3             10978              0.43             0.19                15.5k
0.4             16505              0.33             0.21                22.2k
0.5             18048              0.31             0.39                24.1k
0.6             23486              0.17             0.40                30.5k
0.7             30107              0.11             0.41                38.5k

Comparison with Event Proposal methods
Model         #Event proposals (avg)   Recall (0.2 tIoU)   GMAC/video (avg)
TAL-NET       200                      0.51                87k
TORNADO [1]   30                       0.63                46.8k
Ours          84                       0.421               13.9k
TORNADO is a better event proposal method, at the cost of additional compute
[1] TORNADO: A Spatio-Temporal Convolutional Regression Network for Video Action Proposal, http://www.ntu.edu.sg/home/shijian.lu/Publicationss
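One way to put the comparison table on a single axis is to divide recall by compute. Recall per 1000 GMAC is a hypothetical summary figure introduced here for illustration and is not a metric used in the presentation. The recall and GMAC/video values are copied from the table.

```python
# Recall (0.2 tIoU) and average GMAC/video, copied from the comparison table.
PROPOSAL_METHODS = {
    "TAL-NET": {"recall": 0.51, "gmac_per_video": 87_000},
    "TORNADO": {"recall": 0.63, "gmac_per_video": 46_800},
    "Ours": {"recall": 0.421, "gmac_per_video": 13_900},
}


def recall_per_kgmac(entry):
    """Recall obtained per 1000 GMAC spent on a video (illustrative metric)."""
    return entry["recall"] / (entry["gmac_per_video"] / 1000)


ranked = sorted(PROPOSAL_METHODS,
                key=lambda name: recall_per_kgmac(PROPOSAL_METHODS[name]),
                reverse=True)

for name in ranked:
    print(f"{name:8s} {recall_per_kgmac(PROPOSAL_METHODS[name]):.4f}")
```

This ratio says nothing about whether the absolute recall is high enough for a given application. It only shows how much recall each method buys per unit of compute.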

Analysis of effect of Segment proposals
No proposals, inference at single temporal scale
Missed detections for a class are likely to be false positives for another!

Analysis of effect of Segment proposals
With proposals, inference at single temporal scale
Reduced confusion between events → reduced false positives, but a decrease in recall for some events. Also, an increase in confusion against background.

Computational Cost of Detection
Model                         GMAC/inference (max)   #params (million)   GMAC/video
VGG                           15                     138                 46.5k
C3D-SCNN [1]                  79                     80                  237k
C3D-LSTM [2]                  24                     86                  72k
TAL-NET [3]                   29                     98                  87k
SSAD [4]                      61                     356                 183k
TORNADO [5]                   32                     90.5                46.8k
Ours (ConvLSTM-AD + C3D)      24                     81                  13.9k
Table: GMACs per video of event detectors
Figure: Accuracy vs GMACs/video of event detectors (callout: Segment Proposals)
[1] Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs, http://dvmmweb.cs.columbia.edu/files/dvmm_scnn_paper.pdf
[2] Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks, https://imatge-upc.github.io/activitynet-2016-cvprw/
[3] Rethinking the Faster R-CNN Architecture for Temporal Action Localization, https://arxiv.org/pdf/1804.07667.pdf
[4] Single Shot action detection, https://arxiv.org/abs/1710.06236
[5] TORNADO: A Spatio-Temporal Convolutional Regression Network for Video Action Proposal, http://www.ntu.edu.sg/home/shijian.lu/Publicationss
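A rough cost model for the decoupled system: the proposal stage runs on every frame, and the classifier runs only on proposed segments. The 0.8 GMAC/frame, 24 GMAC/inference, and 3000-frame figures are quoted in the slides. The number of classifier calls per video is a hypothetical input, and this sketch does not claim to reproduce the reported 13.9k GMAC/video.

```python
def decoupled_gmac_per_video(frames, proposal_gmac_per_frame,
                             classifier_gmac, classifier_calls):
    """Always-on proposal cost plus on-demand classifier cost."""
    always_on = frames * proposal_gmac_per_frame
    on_demand = classifier_calls * classifier_gmac
    return always_on + on_demand


def always_on_gmac_per_video(frames, classifier_gmac):
    """Baseline: the classifier itself runs at every frame position."""
    return frames * classifier_gmac


frames = 3000           # average video length assumed in the slides
hypothetical_calls = 100  # assumed number of classifier invocations per video

print(decoupled_gmac_per_video(frames, 0.8, 24, hypothetical_calls))
print(always_on_gmac_per_video(frames, 24))
```

The decoupled total grows with the number of proposals, which is why the proposal threshold R acts as the accuracy-complexity control.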

Summary
• Event detection in video with joint spatiotemporal features is computationally expensive
• Decoupling the models that make inferences in the spatial and long-term temporal modalities can effectively reduce the overall GMACs/video
• On systems with multiple compute units, decoupling logically separates the algorithm into a part that must run at a high rate and a part that is called only on demand, which enables heterogeneous compute
• The accuracy-complexity trade-off is controlled by the segment proposals

Thank You

Resources
Technical Papers
[1] Learning Spatiotemporal Features with 3D Convolutional Networks
[2] Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks, https://imatge-upc.github.io/activitynet-2016-cvprw/
[3] Rethinking the Faster R-CNN Architecture for Temporal Action Localization
Embedded Vision Summit
“Using Deep Learning for Video Event Detection on a Compute Budget”