Massachusetts Institute of Technology
Ji Lin
Tiny Inference and Scalable Training
for Efficient Video Recognition
04/09/2020
Background
• Videos are growing explosively: 10^5 hours of video are uploaded to YouTube
per day
• Efficient video processing is essential for both cloud and edge (e.g., hospital)
A Challenge for Modern Deep Learning
[Figure: Moore’s Law vs. Data]
• We are solving more complicated AI problems with larger datasets,
which requires more computation.
• However, Moore’s Law is slowing down; the amount of computation
per unit cost is no longer increasing at its historic rate.
Overview
• Efficient spatial-temporal modeling is important for video understanding
• 2D CNN is more efficient, but it cannot handle temporal modeling
• 3D CNN can perform joint spatial-temporal feature learning, but it is
computationally expensive
• We aim to achieve 3D CNN performance at 2D complexity
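To make the 2D-versus-3D cost gap concrete, here is a minimal sketch that counts multiply-accumulate operations (MACs) for one convolution layer applied to a clip. The layer sizes (8 frames, 56×56 feature map, 64 channels, 3×3 spatial kernel, temporal kernel of 3) are hypothetical values chosen for illustration, and stride 1 with same-size padding is assumed.

```python
def conv2d_macs(frames, height, width, c_in, c_out, k):
    """MACs for a k x k 2D convolution applied to each frame independently."""
    return frames * height * width * c_in * c_out * k * k


def conv3d_macs(frames, height, width, c_in, c_out, k, k_t):
    """MACs for a k_t x k x k 3D convolution over the whole clip."""
    return frames * height * width * c_in * c_out * k_t * k * k


# Hypothetical layer configuration, chosen only for illustration.
cfg = dict(frames=8, height=56, width=56, c_in=64, c_out=64, k=3)

macs_2d = conv2d_macs(**cfg)
macs_3d = conv3d_macs(**cfg, k_t=3)
ratio = macs_3d / macs_2d

print(f"2D conv: {macs_2d / 1e9:.2f} GMACs")
print(f"3D conv: {macs_3d / 1e9:.2f} GMACs")
print(f"3D / 2D cost ratio: {ratio:.1f}x")
```

Under these assumptions the 3D layer costs k_t times as much as the 2D layer, which is the gap the slide refers to.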
Temporal Shift Module (TSM)
• Bi-directional TSM shifts part of the channels along the temporal
dimension to facilitate information exchange among neighboring frames
• Uni-directional TSM shifts channels from past to future for online video
understanding.
• It can be inserted into off-the-shelf 2D CNN to enable temporal modeling at
the cost of zero FLOPs and zero parameters
* Lin et al., TSM: Temporal Shift Module for Efficient Video Understanding, ICCV’19
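The uni-directional (online) variant can be illustrated with a toy sketch in plain Python. This is not the authors' implementation: the class name `OnlineShift`, the `fold` parameter, and the representation of each frame as a flat list of per-channel values are assumptions made to keep the example small. The idea shown is that a few channels from the previous frame are cached and substituted into the current frame, so information only flows from past to future.

```python
class OnlineShift:
    """Toy uni-directional shift: channels flow from the previous frame to the current one."""

    def __init__(self, fold):
        self.fold = fold    # number of channels replaced per frame (hypothetical parameter)
        self.cache = None   # channels saved from the previous frame

    def __call__(self, frame):
        # `frame` is a flat list of per-channel values for ONE time step.
        fold = self.fold
        # The very first frame has no past, so zeros are used instead.
        past = self.cache if self.cache is not None else [0.0] * fold
        # Save this frame's first `fold` channels for the next time step.
        self.cache = frame[:fold]
        # Replace those channels with the ones cached from the past frame.
        return past + frame[fold:]


shift = OnlineShift(fold=1)
frames = [[1.0, 10.0, 100.0], [2.0, 20.0, 200.0], [3.0, 30.0, 300.0]]
outputs = [shift(f) for f in frames]
print(outputs)
# [[0.0, 10.0, 100.0], [1.0, 20.0, 200.0], [2.0, 30.0, 300.0]]
```

Because only the cached channels are kept between steps, each new frame can be processed as it arrives without waiting for future frames.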
TSM Video Model
• Offline TSM video models
• Online TSM video models
A Simple Implementation of TSM
import torch

def temporal_shift(x, fold_div):
    # shape of x: [N, T, C, H, W]
    c = x.size(2)
    fold = c // fold_div  # number of channels shifted in each direction
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]  # shift left
    out[:, 1:, fold: 2 * fold] = x[:, :-1, fold: 2 * fold]  # shift right
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]  # not shifted
    return out
* Naive implementation: it allocates a full-size output copy, which increases training memory consumption
Datasets
• Less temporal related: UCF101, HMDB51, Kinetics
• Temporal related: Something-Something (V1&V2), Jester
Improving over 2D Baseline
• TSM can improve over the 2D baseline (TSN) at no additional computation
Cost vs. Accuracy
• TSM consumes 3× less computation than the ECO family and 6× less
computation than the Non-local I3D family while achieving better
performance on the Something-Something dataset
Latency Comparison
Batch size=1. Measured on NVIDIA Tesla P100.
Each row represents a video.
I3D:
Latency: 164.3 ms/Video Something-V1 Acc.: 41.6%
TSM:
Latency: 17.4 ms/Video Something-V1 Acc.: 43.4%
Speed-up: 9x
Throughput Comparison
Batch size=16. Measured on NVIDIA Tesla P100.
Each square represents a video.
I3D:
Throughput: 6.1 video/s
Something-V1 Acc.: 41.6%
TSM:
Throughput: 77.4 video/s
Something-V1 Acc.: 43.4%
12.7x larger throughput
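The two headline ratios can be re-derived from the figures quoted on the latency and throughput slides. The short sketch below only repeats that arithmetic; the variable names are illustrative.

```python
# Numbers quoted on the slides (NVIDIA Tesla P100, Something-V1 models).
i3d_latency_ms, tsm_latency_ms = 164.3, 17.4   # batch size 1
i3d_throughput, tsm_throughput = 6.1, 77.4     # videos/s, batch size 16

latency_speedup = i3d_latency_ms / tsm_latency_ms
throughput_gain = tsm_throughput / i3d_throughput

print(f"Latency speed-up: {latency_speedup:.1f}x")
print(f"Throughput gain:  {throughput_gain:.1f}x")
```

The latency ratio comes out slightly above the rounded 9x shown on the slide, and the throughput ratio matches the quoted 12.7x.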
Online Video Recognition
Improving the Robustness of Online Video Detection
Scaling Down: Low-Latency Low-Power
Deployment
LED Bulb Level!
Scaling Up: Large-Scale Distributed Training with Summit
Super Computer
SUMMIT Super Computer:
• CPU: 2 x 16 Core IBM POWER9 (connected via dual NVLINK bricks, 25 GB/s each side)
• GPU: 6 x NVIDIA Tesla V100
• RAM: 512 GB DDR4 memory
• Data Storage: HDD
• Connection: Dual-rail EDR InfiniBand network of 23 GB/s
Acknowledgment: IBM and Oak Ridge National Lab
* Lin et al., Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos, arXiv 1811.08383
Scaling Up: Large-Scale Distributed Training with Summit
Super Computer
Scalable Hardware + Scalable Model Design
TSM is hardware friendly for distributed training:
• Arithmetic efficiency: fewer FLOPs compared to 3D models
• Data I/O efficiency: fewer frames (32->8), no downsampling
• Networking efficiency: fewer parameters
● We are able to speed up the training by 200x, from 2 days to 14 minutes.
● Model setup: 8-frame ResNet-50 TSM for video recognition
● Dataset: Kinetics (240k training videos) x 100 epochs
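As a rough sanity check on what this setup implies, the sketch below estimates the sustained data rate needed to finish in 14 minutes. It assumes one 8-frame clip is sampled per video per epoch and that the full 14 minutes is spent on training passes (no validation or start-up time); both are simplifying assumptions, not statements from the slides.

```python
train_videos = 240_000     # Kinetics training videos (from the slide)
epochs = 100               # from the slide
frames_per_video = 8       # 8-frame TSM model; one clip per video per epoch is assumed
train_minutes = 14         # assumed to be spent entirely on training passes

videos_processed = train_videos * epochs
seconds = train_minutes * 60

videos_per_second = videos_processed / seconds
frames_per_second = videos_per_second * frames_per_video

print(f"Videos processed: {videos_processed:,}")
print(f"Required rate: {videos_per_second:,.0f} videos/s")
print(f"Required rate: {frames_per_second:,.0f} frames/s")
```

Under these assumptions the system would need to sustain on the order of a few hundred thousand frames per second.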
Setup                           Training Time   Accuracy   Peak GPU Performance   Speed-up
1 SUMMIT Node (6 GPUs)          49h 50min       74.1%      46.5 TFLOP/s           (baseline)
128 SUMMIT Nodes (768 GPUs)     28min           74.1%      5,989 TFLOP/s          Theoretical: 128x, Actual: 106x
256 SUMMIT Nodes (1536 GPUs)    14min           74.0%      11,978 TFLOP/s         Theoretical: 256x, Actual: 211x
[Figure: training time in hours, 1 SUMMIT node vs. 128 SUMMIT nodes; 106x speed-up]
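The theoretical and actual speed-ups reported above translate directly into a scaling efficiency. The sketch below divides the reported actual speed-up by the node count, and also recomputes the speed-up from the rounded training times for comparison; the variable names are illustrative.

```python
baseline_minutes = 49 * 60 + 50   # 1 SUMMIT node: 49h 50min

# Reported results for the multi-node runs (from the slides).
runs = {
    128: {"minutes": 28, "actual_speedup": 106},
    256: {"minutes": 14, "actual_speedup": 211},
}

efficiency = {}
speedup_from_times = {}
for nodes, run in runs.items():
    # Efficiency = actual speed-up / ideal (linear) speed-up.
    efficiency[nodes] = run["actual_speedup"] / nodes
    # Speed-up recomputed from the rounded wall-clock times.
    speedup_from_times[nodes] = baseline_minutes / run["minutes"]
    print(f"{nodes} nodes: efficiency {efficiency[nodes]:.1%}, "
          f"speed-up from times {speedup_from_times[nodes]:.1f}x")
```

Both runs land a little above 80% efficiency. The speed-ups recomputed from the rounded times differ slightly from the reported 106x and 211x, presumably because the slide rounds the training times to whole minutes.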
Scaling Up: Large-Scale Distributed Training with SUMMIT
Super Computer
● The performance of the TSM model does not degrade when we scale up the mini-batch
size to 12k.
[Figure panels: Accuracy vs. Batch Size; Training Curves; Scalability (throughput in images/second)]
Training and validation curves for baseline training and large-batch distributed training. The performance does not degrade for batch sizes 6k and 12k, while it degrades for a …
The throughput and scalability of distributed synchronous SGD training. Considering the massive
number of GPUs, the system achieves good scalability (>80%). Most of the communication
overhead is hidden by computation.
Scalability vs. Model
● The TSM model achieves 1.6x and 2.9x higher training throughput compared to
previous I3D models
TSM Dissection: Spatial-Temporal Localization
• Each channel learns different semantics
• Channel 5: Move something away
• Channel 162: Wiping
• Channel 446: Push to left
Demo: Hand Gesture Recognition with TSM
70 FPS on $99 Jetson Nano
Demo: Google Map Navigation with Gesture
Demo Video on Something-Something
Acknowledgement
Song Han (MIT)
Chuang Gan (MIT-IBM Watson AI Lab)
John Cohn (IBM)
Thank you!
Papers
1. Lin et al., TSM: Temporal Shift Module for Efficient Video Understanding, ICCV’19
2. Lin et al., Training Kinetics in 15 Minutes: Large-scale Distributed Training on
Videos, arXiv 1811.08383
Website: tsm-hanlab.mit.edu
Code Released! Including gesture recognition demo.

Human Factors of XR: Using Human Factors to Design XR Systems
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 

MIT's experience on OpenPOWER/POWER 9 platform

  • 1. Tiny Inference and Scalable Training for Efficient Video Recognition. Ji Lin, Massachusetts Institute of Technology. 04/09/2020
  • 2. Background
      • Videos are growing explosively: 10^5 hours of video are uploaded to YouTube per day
      • Efficient video processing is essential for both Cloud and Edge (e.g., hospital)
  • 3. A Challenge for Modern Deep Learning
      [Figure labels: Moore’s Law, Data]
      • We are solving more complicated AI problems with larger datasets, which requires more computation.
      • However, Moore’s Law is slowing down; the amount of computation per unit cost is no longer increasing at its historic rate.
  • 4. Overview
      • Efficient spatial-temporal modeling is important for video understanding
      • 2D CNN is more efficient, but it cannot handle temporal modeling
      • 3D CNN can perform joint spatial-temporal feature learning, but it is computationally expensive
      • We aim to achieve 3D CNN performance at 2D complexity
  • 5. Temporal Shift Module (TSM)
      • Bi-directional TSM shifts part of the channels along the temporal dimension to facilitate information exchange among neighboring frames
      • Uni-directional TSM shifts channels from past to future for online video understanding
      • It can be inserted into an off-the-shelf 2D CNN to enable temporal modeling at the cost of zero FLOPs and zero parameters
      * Lin et al., TSM: Temporal Shift Module for Efficient Video Understanding, ICCV’19
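The uni-directional shift described on slide 5 can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' released code: the function names, the `fold_div=8` default, and the explicit per-frame cache are assumptions made here. It shows the shift applied to a whole clip and, equivalently, one frame at a time for streaming input.

```python
import torch


def uni_shift_clip(x: torch.Tensor, fold_div: int = 8) -> torch.Tensor:
    """Past-to-future shift on a whole clip of shape [N, T, C, H, W]."""
    fold = x.size(2) // fold_div
    out = x.clone()                        # unshifted channels are kept as-is
    out[:, 1:, :fold] = x[:, :-1, :fold]   # frame t receives channels of frame t - 1
    out[:, 0, :fold] = 0                   # the first frame has no past: zero-fill
    return out


def uni_shift_stream(frame: torch.Tensor, cache, fold_div: int = 8):
    """Same shift for one incoming frame of shape [N, C, H, W].

    `cache` holds the first `fold` channels of the previous frame
    (or None for the first frame). Returns (shifted_frame, new_cache).
    """
    fold = frame.size(1) // fold_div
    new_cache = frame[:, :fold].clone()    # saved for the next frame
    out = frame.clone()
    if cache is None:
        out[:, :fold] = 0
    else:
        out[:, :fold] = cache
    return out, new_cache


# Streaming frame by frame gives the same result as shifting the full clip.
x = torch.randn(2, 5, 16, 3, 3)
cache, frames = None, []
for t in range(x.size(1)):
    shifted, cache = uni_shift_stream(x[:, t], cache)
    frames.append(shifted)
stream = torch.stack(frames, dim=1)
clip = uni_shift_clip(x)
```

Because no channel moves from the future to the past, the output for frame t depends only on frames up to t, which is what makes this variant usable online.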
  • 6. TSM Video Model
      • Offline TSM video models
      • Online TSM video models
  • 7. A Simple Implementation of TSM
        import torch

        def temporal_shift(x, fold_div=8):  # function name and default are illustrative
            # shape of x: [N, T, C, H, W]
            c = x.size(2)
            out = torch.zeros_like(x)
            fold = c // fold_div
            out[:, :-1, :fold] = x[:, 1:, :fold]  # shift left
            out[:, 1:, fold: 2 * fold] = x[:, :-1, fold: 2 * fold]  # shift right
            out[:, :, 2 * fold:] = x[:, :, 2 * fold:]  # not shift
            return out
      * Naive implementation: it allocates a full-size copy of the activations (torch.zeros_like), which increases training memory consumption
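To illustrate the "zero parameters" claim, the shift from slide 7 can be placed in front of an ordinary 2D layer. The sketch below is an assumption-laden example: the class name `ShiftedLayer`, its arguments, and the choice of a single `nn.Conv2d` as the wrapped layer are hypothetical and are not taken from the released code.

```python
import torch
import torch.nn as nn


class ShiftedLayer(nn.Module):
    """Hypothetical wrapper: bi-directional temporal shift, then a 2D layer."""

    def __init__(self, net: nn.Module, n_frames: int, fold_div: int = 8):
        super().__init__()
        self.net = net            # any off-the-shelf 2D layer or block
        self.n_frames = n_frames  # T, frames sampled per video
        self.fold_div = fold_div  # 1/fold_div of the channels shift each way

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # A 2D CNN sees frames as a flat batch of shape [N * T, C, H, W].
        nt, c, h, w = x.size()
        x = x.reshape(nt // self.n_frames, self.n_frames, c, h, w)
        fold = c // self.fold_div
        out = torch.zeros_like(x)
        out[:, :-1, :fold] = x[:, 1:, :fold]                  # from the next frame
        out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]  # from the previous frame
        out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # unshifted channels
        return self.net(out.reshape(nt, c, h, w))


conv = nn.Conv2d(16, 16, kernel_size=3, padding=1)
layer = ShiftedLayer(conv, n_frames=8)

conv_params = sum(p.numel() for p in conv.parameters())
layer_params = sum(p.numel() for p in layer.parameters())
y = layer(torch.randn(2 * 8, 16, 14, 14))  # 2 videos, 8 frames each
print(conv_params, layer_params, tuple(y.shape))
```

The wrapper registers no parameters of its own, so the two parameter counts are equal, and the shift itself is only indexing and copying, with no multiply-adds.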
  • 8. Datasets
      • Less temporal related: UCF101, HMDB51, Kinetics
      • Temporal related: Something-Something (V1&V2), Jester
  • 9. Improving over 2D Baseline
      • TSM can improve over the 2D baseline (TSN) at no extra computation
  • 10. Cost vs. Accuracy
      • It consumes 3× less computation than the ECO family and 6× less computation than the Non-local I3D family while achieving better performance on the Something-Something dataset
  • 11. Latency Comparison (batch size = 1, measured on NVIDIA Tesla P100; each row represents a video)
      • I3D: latency 164.3 ms/video, Something-V1 accuracy 41.6%
      • TSM: latency 17.4 ms/video, Something-V1 accuracy 43.4%
      • Speed-up: 9x
  • 12. Throughput Comparison (batch size = 16, measured on NVIDIA Tesla P100; each square represents a video)
      • I3D: throughput 6.1 video/s, Something-V1 accuracy 41.6%
      • TSM: throughput 77.4 video/s, Something-V1 accuracy 43.4%
      • 12.7x larger throughput
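The speed-up figures on slides 11 and 12 follow directly from the measured numbers. A short check, using only the values quoted on the slides (the variable names are chosen here for illustration):

```python
# Measurements quoted on slides 11 and 12 (NVIDIA Tesla P100).
i3d_latency_ms, tsm_latency_ms = 164.3, 17.4   # batch size 1
i3d_throughput, tsm_throughput = 6.1, 77.4     # videos/s, batch size 16

latency_speedup = i3d_latency_ms / tsm_latency_ms
throughput_gain = tsm_throughput / i3d_throughput

print(f"latency speed-up: {latency_speedup:.1f}x")   # slide reports 9x
print(f"throughput gain:  {throughput_gain:.1f}x")   # slide reports 12.7x
```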
  • 14. Improving the Robustness of Online Video Detection
  • 15. Improving the Robustness of Online Video Detection
  • 16. Scaling Down: Low-Latency Low-Power Deployment
      • LED bulb level!
  • 17. Scaling Up: Large-Scale Distributed Training with Summit Super Computer
      SUMMIT Super Computer:
      • CPU: 2 x 16 Core IBM POWER9 (connected via dual NVLINK bricks, 25GB/s each side)
      • GPU: 6 x NVIDIA Tesla V100
      • RAM: 512 GB DDR4 memory
      • Data Storage: HDD
      • Connection: Dual-rail EDR InfiniBand network of 23 GB/s
      Acknowledgment: IBM and Oak Ridge National Lab
      * Lin et al., Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos, arXiv 1811.08383
  • 18. Scaling Up: Large-Scale Distributed Training with Summit Super Computer
      Scalable Hardware + Scalable Model Design
      TSM is hardware friendly for distributed training:
      • Arithmetic efficiency: fewer FLOPs compared to 3D models
      • Data I/O efficiency: fewer frames (32 -> 8), no downsampling
      • Networking efficiency: fewer parameters
  • 19. Scaling Up: Large-Scale Distributed Training with SUMMIT Super Computer
      ● We are able to speed up the training by 200x, from 2 days to 14 minutes.
      ● Model setup: 8-frame ResNet-50 TSM for video recognition
      ● Dataset: Kinetics (240k training videos) x 100 epochs
      Results (training time, accuracy, peak GPU performance, speed-up):
      ● 1 SUMMIT node (6 GPUs): 49h 50min, 74.1%, 46.5 TFLOP/s, baseline
      ● 128 SUMMIT nodes (768 GPUs): 28min, 74.1%, 5,989 TFLOP/s, theoretical 128x, actual 106x
      ● 256 SUMMIT nodes (1536 GPUs): 14min, 74.0%, 11,978 TFLOP/s, theoretical 256x, actual 211x
      [Bar chart: training time in hours, 1 SUMMIT node vs. 128 SUMMIT nodes (106x)]
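The scaling efficiency implied by the results on slide 19 is the actual speed-up divided by the theoretical (linear) one. The snippet below only re-derives ratios from the reported figures; the variable names are illustrative. Speed-ups recomputed from the rounded training times come out slightly higher than the reported 106x and 211x, presumably because the slide rounds the times to whole minutes.

```python
# Reported on slide 19: node count -> actual speed-up over 1 node.
actual_speedup = {128: 106, 256: 211}

# Efficiency = actual speed-up / ideal linear speed-up (= node count).
efficiency = {nodes: s / nodes for nodes, s in actual_speedup.items()}

# Speed-up recomputed from the rounded training times on the slide.
baseline_min = 49 * 60 + 50           # 49h 50min on 1 node
train_min = {128: 28, 256: 14}
from_times = {nodes: baseline_min / m for nodes, m in train_min.items()}

for nodes in actual_speedup:
    print(f"{nodes} nodes: efficiency {efficiency[nodes]:.1%}, "
          f"speed-up from rounded times {from_times[nodes]:.0f}x")
```

Both efficiencies are above 80%, which matches the scalability figure quoted on the later scalability slide.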
  • 20. Accuracy vs. Batch Size
      ● The performance of the TSM model does not degrade when we scale up the mini-batch size to 12k.
  • 22. Scalability
      [Figure: Training and validation curves for baseline training and large-batch distributed training.] The performance does not degrade for batch sizes of 6k and 12k, while it degrades for a …
      [Figure, axis in images/second: The throughput and scalability of distributed synchronous SGD training.] Considering the massive number of GPUs, the system achieves good scalability (>80%). Most of the communication overhead is hidden by computation.
  • 23. Scalability vs. Model
      ● The TSM model achieves 1.6x and 2.9x higher training throughput compared to previous I3D models
  • 24. TSM Dissection: Spatial-Temporal Localization
      • Each channel learns different semantics
      • Channel 5: Move something away
  • 25. TSM Dissection: Spatial-Temporal Localization
      • Each channel learns different semantics
      • Channel 162: Wiping
  • 26. TSM Dissection: Spatial-Temporal Localization
      • Each channel learns different semantics
      • Channel 446: Push to left
  • 28. Demo: Hand Gesture Recognition with TSM
      • 70 FPS on $99 Jetson Nano
  • 29. Demo: Google Map Navigation with Gesture
  • 30. Demo Video on Something-Something
  • 32. Thank you!
      Papers
      1. Lin et al., TSM: Temporal Shift Module for Efficient Video Understanding, ICCV’19
      2. Lin et al., Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos, arXiv 1811.08383
      Website: tsm-hanlab.mit.edu
      Code released, including the gesture recognition demo.