MIT's experience on OpenPOWER/POWER 9 platform
1. Massachusetts Institute of Technology
Ji Lin
Tiny Inference and Scalable Training
for Efficient Video Recognition
04/09/2020
2. Background
• Videos are growing explosively: over 10^5 hours of video are uploaded to YouTube
per day
• Efficient video processing is essential for both cloud and edge deployment (e.g., hospitals)
3. A Challenge for Modern Deep Learning
[Chart: growth in data and compute demand vs. Moore’s Law]
• We are solving more complicated AI problems with larger datasets,
which requires more computation.
• However, Moore’s Law is slowing down; the amount of computation
per unit cost is no longer increasing at its historic rate.
4. Overview
• Efficient spatial-temporal modeling is important for video understanding
• 2D CNN is more efficient, but it cannot handle temporal modeling
• 3D CNN can perform joint spatial-temporal feature learning, but it is
computationally expensive
• We aim to achieve 3D CNN performance at 2D complexity
5. Temporal Shift Module (TSM)
• Bi-directional TSM shifts part of the channels along the temporal
dimension to facilitate information exchange among neighboring frames
• Uni-directional TSM shifts channels from past to future for online video
understanding.
• It can be inserted into off-the-shelf 2D CNN to enable temporal modeling at
the cost of zero FLOPs and zero parameters
* Lin et al., TSM: Temporal Shift Module for Efficient Video Understanding, ICCV’19
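The shift operation described above can be sketched in plain NumPy (a minimal illustration of the idea, not the authors’ released implementation; the 1/8-per-direction fraction follows a common setting in the paper):

```python
import numpy as np

def temporal_shift(x, shift_div=8):
    """Bi-directional temporal shift (a sketch of the TSM idea).

    x: activations of shape (N, T, C, H, W).
    A fraction 1/shift_div of the channels is shifted toward the past and
    another 1/shift_div toward the future; 1/8 per direction is a common
    setting. The shift itself adds zero FLOPs and zero parameters.
    """
    n, t, c, h, w = x.shape
    fold = c // shift_div
    out = np.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                  # shift backward: frame t sees frame t+1
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]  # shift forward: frame t sees frame t-1
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # remaining channels unchanged
    return out
```

For the uni-directional (online) variant, one would keep only the past-to-future shift and drop the backward one, so no future frames are needed.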
9. Improving over 2D Baseline
• TSM improves over the 2D baseline (TSN) at no additional computation cost
10. Cost vs. Accuracy
• It consumes 3× less computation than the ECO family and 6× less
computation than the Non-local I3D family, while achieving better
accuracy on the Something-Something dataset
11. Latency Comparison
Batch size = 1. Measured on an NVIDIA Tesla P100; each row represents a video.
I3D: latency 164.3 ms/video, Something-V1 accuracy 41.6%
TSM: latency 17.4 ms/video, Something-V1 accuracy 43.4%
Speed-up: 9×
12. Throughput Comparison
Batch size = 16. Measured on an NVIDIA Tesla P100; each square represents a video.
I3D: throughput 6.1 videos/s, Something-V1 accuracy 41.6%
TSM: throughput 77.4 videos/s, Something-V1 accuracy 43.4%
12.7× higher throughput
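Latency (batch size 1) and throughput (batch size 16) measure different things, which is why the two slides report different ratios. A generic timing sketch (not the authors’ benchmark code; `model_fn` and `make_batch` are hypothetical placeholders for a model forward pass and a batch constructor):

```python
import time

def benchmark(model_fn, make_batch, batch_size, iters=20, warmup=3):
    """Rough latency/throughput measurement for a callable model."""
    batch = make_batch(batch_size)
    for _ in range(warmup):                # warm-up runs, excluded from timing
        model_fn(batch)
    start = time.perf_counter()
    for _ in range(iters):
        model_fn(batch)
    elapsed = time.perf_counter() - start
    latency_ms = 1000.0 * elapsed / iters       # time per batch
    throughput = batch_size * iters / elapsed   # samples (videos) per second
    return latency_ms, throughput
```

At batch size 1 the first number approximates per-video latency; at batch size 16 the second approximates sustained throughput.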
17. Scaling Up: Large-Scale Distributed Training with the Summit Supercomputer
Summit supercomputer node:
• CPU: 2 × 16-core IBM POWER9 (connected via dual NVLink bricks, 25 GB/s each side)
• GPU: 6 × NVIDIA Tesla V100
• RAM: 512 GB DDR4 memory
• Data storage: HDD
• Interconnect: dual-rail EDR InfiniBand network, 23 GB/s
Acknowledgment: IBM and Oak Ridge National Lab
* Lin et al., Training Kinetics in 15 Minutes: Large-Scale Distributed Training on Videos, arXiv:1910.00932
18. Scaling Up: Large-Scale Distributed Training with the Summit Supercomputer
Scalable Hardware + Scalable Model Design
TSM is hardware friendly for distributed training:
• Arithmetic efficiency: fewer FLOPs compared to 3D models
• Data I/O efficiency: fewer frames (32 → 8), no downsampling
• Networking efficiency: fewer parameters
19. ● We are able to speed up training by over 200×, from about 2 days to 14 minutes.
● Model setup: 8-frame ResNet-50 TSM for video recognition
● Dataset: Kinetics (240k training videos) × 100 epochs
Setup                          Training Time   Accuracy   Peak GPU Performance   Speed-up (theoretical / actual)
1 Summit node (6 GPUs)         49 h 50 min     74.1%      46.5 TFLOP/s           1× (baseline)
128 Summit nodes (768 GPUs)    28 min          74.1%      5,989 TFLOP/s          128× / 106×
256 Summit nodes (1,536 GPUs)  14 min          74.0%      11,978 TFLOP/s         256× / 211×
[Chart: training time (h), 1 Summit node vs. 128 Summit nodes: 106× speed-up]
Scaling Up: Large-Scale Distributed Training with the Summit Supercomputer
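The scaling efficiency behind these speed-ups can be checked with back-of-envelope arithmetic from the table’s numbers (a sketch; small differences from the reported 106×/211× figures come from rounding of the reported times):

```python
# Scaling efficiency implied by the slide's numbers: actual speed-up over
# the 1-node baseline, divided by the ideal linear speed-up.
baseline_min = 49 * 60 + 50          # 1 Summit node: 49 h 50 min
runs = {128: 28, 256: 14}            # nodes -> training time in minutes
for nodes, minutes in runs.items():
    speedup = baseline_min / minutes
    efficiency = speedup / nodes     # fraction of ideal linear scaling
    print(f"{nodes} nodes: {speedup:.1f}x speed-up, "
          f"{100 * efficiency:.1f}% scaling efficiency")
```

Both configurations land above 80% of ideal linear scaling, consistent with the scalability figure later in the deck.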
20. ● The accuracy of the TSM model does not degrade when we scale the mini-batch
size up to 12k.
[Chart: accuracy vs. batch size; 211× speed-up]
22. Scalability
[Figure: training and validation curves for baseline training and large-batch distributed
training. Accuracy does not degrade for batch sizes of 6k and 12k, but degrades for larger batches.]
[Figure: throughput (images/second) and scalability of distributed synchronous SGD training.]
Considering the massive number of GPUs, the system achieves good scalability (>80%);
most of the communication overhead is hidden by computation.
23. Scalability vs. Model
● The TSM model achieves 1.6× and 2.9× higher training throughput than
previous I3D models
32. Thank you!
Papers
1. Lin et al., TSM: Temporal Shift Module for Efficient Video Understanding, ICCV’19
2. Lin et al., Training Kinetics in 15 Minutes: Large-Scale Distributed Training on
Videos, arXiv:1910.00932
Website: tsm-hanlab.mit.edu
Code released, including a gesture recognition demo.