Moving object recognition (MOR) is the simultaneous localisation and classification of moving objects in videos. Discriminating moving objects from static objects and background is an essential task for many computer vision applications. MOR has widespread applications in intelligent visual surveillance, intrusion detection, anomaly detection and monitoring, industrial site monitoring, detection-based tracking, autonomous vehicles, etc. In this session, Murari will talk about deep learning algorithms that identify both the locations and the corresponding categories of moving objects with a convolutional network. The challenges in developing such algorithms will be discussed, along with the implementation details of these models on both conventional and UAV videos.
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
Deep Learning Fundamentals and Research Project on the IBM POWER9 System from NUS
1. Deep Moving Object Recognition: Research Project on IBM POWER9
Research Team: Lav Kush Kumar, Santosh Kumar Vipparthi
Vision Intelligence Lab, Malaviya National Institute of Technology Jaipur, India
Murari Mandal
Postdoctoral Researcher, NUS Singapore
2. Agenda
• Moving Object Recognition (MOR) in Regular View
▪ MotionRec: A Unified Deep Framework for Moving Object
Recognition, WACV-2020 [M. Mandal, L. K. Kumar, M. S. Saran,
S. K. Vipparthi]
• MOR in Aerial View
▪ MOR-UAV: A Benchmark Dataset and Baselines for Moving Object
Recognition in UAV Videos, ACM Multimedia-2020 [M. Mandal,
L.K. Kumar, S. K. Vipparthi]
3. Introduction
• Moving Object Recognition (MOR)?
• Simultaneous localization and classification of moving
objects in videos.
• Fundamental task for many computer vision and video
processing applications.
4. Intelligent Visual Surveillance for Intrusion Detection, Traffic Monitoring, Maritime Surveillance
[Image montage; source: www.google.com]
6. Challenges: Regular View MOR
• MOR in Different Weather Conditions
• Background Changes and Camera Jitters
• Illumination Changes
• Variable Foreground Motion Speed
• Shadow, Camouflage and Occlusion
• Speed
8. Challenges: Aerial View MOR
• Intra- and Inter-class Variations
• Insufficient Annotated Data
• Real-time Challenges
• Locating Motion Clues
• Variable Object Density
• Small and Large Object Shapes
• Sporadic Camera Motion
• Changes in the Aerial View
12. MotionRec: MOR in Regular View
• Current Systems:
▪ Object Detection
▪ Moving Object Detection
• Proposed System:
▪ A novel deep learning framework to perform online
moving object recognition (MOR) in streaming videos.
▪ The first attempt at simultaneous localization and
classification of moving objects in a video, i.e. MOR, in
a single-stage deep learning framework.
14. Preliminary Concepts
• ResNet:
▪ Deep feature extraction via a deep stack of layers with
“identity shortcut connections”.
• Anchors:
▪ Predefined bounding boxes at different scales and aspect
ratios (see the sketch below).
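A minimal sketch of anchor generation, assuming three scales and three aspect ratios per feature-map location; the exact values used in MotionRec may differ:

```python
import numpy as np

def generate_anchors(base_size=32, scales=(1.0, 1.26, 1.59),
                     ratios=(0.5, 1.0, 2.0)):
    """Return (len(scales) * len(ratios), 4) boxes centred at the origin,
    as (x1, y1, x2, y2) coordinates."""
    anchors = []
    for scale in scales:
        for ratio in ratios:
            # Keep the anchor area fixed while varying height/width = ratio.
            w = base_size * scale / np.sqrt(ratio)
            h = base_size * scale * np.sqrt(ratio)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)  # shifted over every feature-map cell at runtime

print(generate_anchors().shape)  # (9, 4)
```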
15. Preliminary Concepts
• Feature Pyramid Network (FPN)
▪ Builds a rich multi-scale feature pyramid from a single-
resolution input image.
▪ The bottom-up pathway computes feature maps at different
scales.
▪ The top-down pathway and lateral connections construct
higher-resolution layers from a semantically rich layer (see the sketch below).
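A minimal Keras sketch of one top-down step with a lateral connection; the 256-channel depth and layer arrangement follow the original FPN paper and are assumptions here, not values from the slides:

```python
from tensorflow.keras import layers

def fpn_top_down_step(coarser, lateral_in, channels=256):
    """Merge a coarser pyramid level with a finer backbone feature map."""
    # 1x1 conv projects the backbone map to the common channel depth.
    lateral = layers.Conv2D(channels, 1, padding="same")(lateral_in)
    # Upsample the semantically rich coarse map to the finer resolution.
    upsampled = layers.UpSampling2D(size=2)(coarser)
    merged = layers.Add()([upsampled, lateral])
    # 3x3 conv reduces the aliasing introduced by upsampling.
    return layers.Conv2D(channels, 3, padding="same")(merged)
```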
16. Preliminary Concepts
• Intersection over Union (IoU):
▪ Area of overlap (intersection)
divided by area of the combined region (union).
▪ An IoU greater than a threshold
indicates the presence of the object
in the anchor box.
• Non-Maximum Suppression (NMS)
▪ Keep the highest-scoring candidate
box for an object as the prediction and
suppress overlapping lower-scoring candidates (see the sketch below).
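A minimal sketch of both operations on (x1, y1, x2, y2) boxes; the 0.5 IoU threshold is a common default, not a value from the slides:

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes: intersection area / union area."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop heavy overlaps."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```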
20. Network Configurations
• MotionRec takes two tensors of shape 608x608xT (past
temporal history) and 608x608x3 (current frame) as input
and returns the spatial coordinates with class labels for
moving object instances.
• While training MotionRec, we use the ResNet50 backbone
pretrained over the ImageNet dataset.
21. Network Configurations
• For regression and classification, smooth L1 and focal loss
functions are used, respectively (sketched below).
• The training loss is the sum of the two losses above.
The loss gradients are backpropagated through the TDR blocks
as well.
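A minimal TensorFlow sketch of the two losses; the sigma, alpha and gamma defaults are the standard RetinaNet values and are assumptions here:

```python
import tensorflow as tf

def smooth_l1(y_true, y_pred, sigma=3.0):
    """Smooth L1 regression loss on box offsets (quadratic near zero)."""
    diff = tf.abs(y_true - y_pred)
    cutoff = 1.0 / sigma ** 2
    return tf.where(diff < cutoff,
                    0.5 * sigma ** 2 * tf.square(diff),
                    diff - 0.5 * cutoff)

def focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0):
    """Focal loss: down-weights easy examples in dense classification."""
    p_t = tf.where(tf.equal(y_true, 1.0), y_pred, 1.0 - y_pred)
    alpha_t = tf.where(tf.equal(y_true, 1.0), alpha, 1.0 - alpha)
    return -alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t + 1e-7)
```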
22. Implementation Details
• Model Training:
• MotionRec forms a single-stage fully convolutional network,
which ensures online operability and fast speed.
• The entire framework is implemented in Keras with the
TensorFlow backend.
• Training is performed with batch size = 1 on a Titan V GPU
in the IBM POWER9 system.
23. Implementation Details
• We use the Adam optimizer with the initial learning rate set to
1x10^-5.
• All models are trained for approximately 500k iterations.
• We only use horizontal image flipping for data
augmentation (see the sketch below).
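A minimal numpy sketch of this augmentation, assuming HxWx3 images and (x1, y1, x2, y2) pixel-coordinate boxes:

```python
import numpy as np

def flip_horizontal(image, boxes):
    """Mirror the image left-right and remap the box x-coordinates."""
    flipped = image[:, ::-1, :]
    w = image.shape[1]
    new_boxes = boxes.copy()
    new_boxes[:, 0] = w - boxes[:, 2]  # new x1 from old x2
    new_boxes[:, 2] = w - boxes[:, 0]  # new x2 from old x1
    return flipped, new_boxes
```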
24. Implementation Details
• Inference:
• Similar to training, inference involves simply giving the current
frame and the recent T temporal history frames as input to the
network.
• Only a few past frames (T = 10/20/30) are required, enabling
online moving object recognition (see the input-assembly sketch below).
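A minimal sketch of how the two input tensors could be assembled from a frame buffer; representing the temporal history as grayscale maps is an assumption for illustration:

```python
import numpy as np

def build_inputs(frames, T=10):
    """frames: list of 608x608x3 uint8 frames, most recent last."""
    current = frames[-1].astype(np.float32)                 # 608x608x3
    history = [f.mean(axis=2) for f in frames[-T - 1:-1]]   # T past frames
    history = np.stack(history, axis=2).astype(np.float32)  # 608x608xT
    return history, current
```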
25. Dataset Description
• Due to the lack of available benchmark datasets with labelled
bounding boxes for MOR, we created a new set of ground
truths by annotating 42,614 objects (14,814 cars and 27,800
persons) in 24,923 video frames from CDnet 2014.
• We selected 16 video sequences having 21,717 frames and
38,827 objects (13,442 cars and 25,385 persons) for training.
• For testing, 3 video sequences with 3,206 frames and 3,787
objects (1,372 cars and 2,415 persons) were chosen.
26. Dataset Description
• We created axis-aligned bounding box annotations for
moving object instances in all the frames.
• We define the baseline train and test divisions for qualitative
and quantitative evaluation.
33. Expected Features for UAV Applications
• Resource-Efficient Model:
• Memory – The model must have a small memory footprint.
• Compute – The model must operate even with minimal
computational support.
• Accuracy – The model must offer reasonably accurate
results.
• Real-time – The model must offer scope for real-time
inference.
34. MOR in Aerial View?
• Variable sizes of the vehicles (small, medium and large).
• High/low density of vehicles and complex background in the
camera's field of view.
• Moreover, aerial scenes in urban setups usually comprise a
variety of object types, leading to excessive inter-class
object similarities.
• No existing dataset for MOR analysis in aerial videos.
35. MOR-UAV: MOR in Aerial View
• Our Contribution:
▪ We introduce MOR-UAV, a large-scale video dataset for
moving object recognition (MOR) in aerial videos.
▪ A novel deep learning framework to perform online
MOR in streaming videos.
▪ Simultaneous localization and classification of
moving objects, i.e. MOR, in a single-stage deep learning
framework.
36. MOR-UAV Dataset
• Dataset Details
▪ 30 videos
▪ 89,783 moving object instances
▪ 10,948 frames
▪ Avg. bounding box (BB) height = 29.01, Avg. BB width = 17.64
▪ Min. BB height = 6, Min BB width = 6
▪ Max. BB height = 181, Max. BB width = 106
▪ Avg. video sequence length = 364.93, Min. video sequence length =
64, Max. video sequence length = 1,146
• Dataset Attributes
▪ Variable object density
▪ Small and large object shapes
▪ Sporadic camera motion
▪ Changes in the aerial view
37. MOR-UAV Dataset
The bounding-box (BB) height–width scatter plot of all object instances in
MOR-UAV, along with the complete dataset description.
41. MOR-UAVNet Framework
• Schematic illustration of the proposed MOR-UAVNet
framework for MOR in UAV videos.
• The motion saliency is estimated through cascaded optical
flow computation at multiple stages in the temporal history
frames.
• In this figure, the optical flows between the current frame and the
last (OF-1), third-last (OF-3) and fifth-last (OF-5) frames are
computed, respectively.
• We then assimilate the salient motion features with the
current frame. These assimilated features are forwarded
through the ResNet backbone to extract spatial- and temporal-
dimension-aware features.
42. MOR-UAVNet Framework
• Moreover, the base features from the current frame are also
extracted to reinforce the semantic context of the object
instances.
• These two feature maps are concatenated at matching scales
to produce a feature map for motion encoding.
• Afterward, multi-level feature pyramids are generated. The
dense bounding box and category scores are generated at
each level of the pyramid.
• We use 5 pyramid levels in our experiments.
44. Network Configuration
• We resize all the video frames in the MOR-UAV dataset to
608×608×3 for a uniform setting in training and evaluation.
• We compute the dense optical flow [1]; the following
values of T are used in our experiments:
• T = 3 (C_OF = 1-3-5), T = 2 (C_OF = 1-3), T = 2 (C_OF = 1-5),
T = 1 (C_OF = 1). A sketch of the flow computation follows below.
[1] Gunnar Farnebäck. 2003. Two-frame motion estimation based on polynomial expansion. In
Scandinavian Conference on Image Analysis. Springer, 363–370.
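A minimal OpenCV sketch of the cited Farnebäck flow and of a cascaded C_OF = 1-3-5 computation; the parameter values follow common OpenCV examples rather than the papers' settings, and reducing flow to a magnitude map is an assumption for illustration:

```python
import cv2
import numpy as np

def dense_flow(prev_bgr, curr_bgr):
    """Per-pixel (dx, dy) Farneback flow, reduced to a motion-magnitude map."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return np.linalg.norm(flow, axis=2)

def cascaded_flow(frames, offsets=(1, 3, 5)):
    """Saliency maps OF-1, OF-3, OF-5 relative to the current (last) frame,
    matching the C_OF = 1-3-5 setting above."""
    curr = frames[-1]
    return [dense_flow(frames[-1 - k], curr) for k in offsets]
```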
45. Model Training
• The one-stage MOR-UAVNet network is trained end-to-end
with multiple input layers.
• The complete framework is implemented in Keras with the
TensorFlow backend.
• Training is performed with batch size = 1 on a Titan V GPU
in the IBM POWER9 system.
46. Model Training
• The network is optimized with the Adam optimizer and an initial
learning rate of 10^-5. All models are trained for
approximately 250-300k iterations.
• For regression and classification, L1 and focal loss
functions are used, respectively.
47. Model Inference
• Similar to training, inference involves simply giving the
current frame and cascaded optical flow maps computed
from past history frames as input to the network.
• Only a few optical flow maps (T = 1/2/3) are required,
enabling online moving object recognition for real-time
analysis.
54. Discussion
• Our dataset caters to real-world demands, with diverse samples
collected under numerous unconstrained conditions.
• We believe this benchmark dataset can support promising
research directions in UAV-based vehicular technology.
• Research directions for exploration:
▪ Real-time challenges
▪ Locating motion clues
56. • Acknowledgement
▪ IBM POWER9
• Contact us for any queries:
▪ http://visionintelligence.github.io/
▪ https://github.com/murari023
▪ Email: murarimandal.cv@gmail.com
• Source Code
▪ https://github.com/lav-kush/MotionRec
Thank You!