"Deploying Deep Learning Models on Embedded Processors for Autonomous Systems with MATLAB," a Presentation from MathWorks
1. © 2019 MathWorks, Inc.
Deploying Deep Learning Models on
Embedded Processors for
Autonomous Systems with MATLAB
Bill Chou, Sandeep Hiremath
MathWorks
May 2019
4. © 2019 MathWorks, Inc.
Deep Learning for Perception in Autonomous Systems
Perception: Deep learning, Sensor fusion
Planning: Path planning
Control: Sensor models & model predictive control
6. © 2019 MathWorks, Inc.
Outline
Ground Truth Labeling
Network Design and Training
C/C++ and CUDA Code Generation
Hardware Targeting (CPUs and GPUs)
Key Takeaways
Platform Productivity
Framework Interoperability
Key Takeaways
Optimized C/C++ and CUDA
Hardware Targeting
Processor-in-loop (PIL) Testing
7. © 2019 MathWorks, Inc.
Example Used in Today’s Talk
Perception in Autonomous Application
Input; Lane Detection, Coordinate Transform; Object Detection, Bounding Box Processing; Output
11. © 2019 MathWorks, Inc.
Perception in Autonomous Application
Input; Lane Detection, Coordinate Transform; Object Detection, Bounding Box Processing; Output
Deep Learning Models
12. © 2019 MathWorks, Inc.
Importing Pre-trained Models
>> net = alexnet
OR
Import pre-trained networks (AlexNet, ResNet-50)
Modify network layers
Re-train network with training data
Detector object
13. © 2019 MathWorks, Inc.
Interactive Network Design
14. © 2019 MathWorks, Inc.
Accelerated Training
Import pre-trained networks (AlexNet, ResNet-50)
Modify network layers
Re-train network with training data
Evaluate trained network
Training hardware: Single CPU; Single CPU with Single GPU; Single CPU with Multiple GPUs; Cloud GPUs
15. © 2019 MathWorks, Inc.
Network Evaluation
21. © 2019 MathWorks, Inc.
Multi-Platform Deep Learning Deployment
Targets: Workstation, Data Center, NVIDIA Jetson, NVIDIA DRIVE, Raspberry Pi
22. © 2019 MathWorks, Inc.
Multi-Platform Deep Learning Deployment
GPU Coder: NVIDIA GPUs
MATLAB Coder: Intel CPUs, ARM Cortex-A CPUs
23. © 2019 MathWorks, Inc.
Generate Code from Non-Deep Learning Parts
Perception in Autonomous Application
Input; Lane Detection, Coordinate Transform; Object Detection, Bounding Box Processing; Output
Generate Optimized CUDA/C++ Code
24. © 2019 MathWorks, Inc.
2200+ Functions for C/C++, 380+ Functions for CUDA
Communications Toolbox
DSP System Toolbox
Image Processing Toolbox
Computer Vision Toolbox
Signal Processing Toolbox
Sensor Fusion and Tracking Toolbox
Wavelet Toolbox
WLAN Toolbox
Phased Array System Toolbox
Statistics and Machine Learning Toolbox
Core Math
Fixed-Point Designer
Automated Driving Toolbox
Robotics System Toolbox
5G Toolbox
25. © 2019 MathWorks, Inc.
Mapped to Optimization Libraries
GPU Coder: NVIDIA GPUs
MATLAB Coder: Intel CPUs, ARM Cortex-A CPUs
Libraries: cuBLAS, cuFFT, cuSolver, Thrust, TensorRT, cuDNN, MKL-DNN, FFTW, BLAS, ARM Compute Library, OpenCV
26. © 2019 MathWorks, Inc.
GPUs: Automatically Extract Parallelism from MATLAB
1. Scalarized MATLAB (“for-all” loops)
2. Vectorized MATLAB (math operators and library functions)
3. Composite functions in MATLAB (maps to cuBLAS, cuFFT, cuSolver, cuDNN, TensorRT)
Infer CUDA kernels from MATLAB loops
Library replacement
27. © 2019 MathWorks, Inc.
GPU Coder Compiler Transforms & Optimizations
MATLAB
Front End
Control-Flow Graph Intermediate Representation
Traditional Compiler Optimizations
Loop Optimizations: Scalarization, Loop Perfectization, Loop Interchange, Loop Fusion, Scalar Replacement
CUDA Kernel Lowering
Library Function Mapping
Parallel Loop Creation
CUDA Kernel Creation
cudaMemcpy Minimization
Shared Memory Mapping
CUDA Code Emission
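Two of the loop optimizations named on this slide, loop fusion and scalar replacement, can be shown on a toy example. The sketch below is plain Python and is only an illustration of the general technique. It is not GPU Coder's implementation, and the function names `unfused` and `fused` are made up for this example.

```python
def unfused(a, b):
    """Two separate loops: the intermediate list `scaled` is fully materialized."""
    scaled = [0.0] * len(a)
    for i in range(len(a)):
        scaled[i] = 2.0 * a[i]
    out = [0.0] * len(a)
    for i in range(len(a)):
        out[i] = scaled[i] + b[i]
    return out


def fused(a, b):
    """One loop after fusion: the intermediate value is kept in a scalar."""
    out = [0.0] * len(a)
    for i in range(len(a)):
        t = 2.0 * a[i]        # scalar replacement: `t` stands in for scaled[i]
        out[i] = t + b[i]
    return out


a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]
# Both versions compute the same result; the fused one makes a single pass.
assert unfused(a, b) == fused(a, b) == [12.0, 24.0, 36.0]
```

Each iteration of the fused loop is independent of the others, which is the property a compiler needs before it can map a loop to parallel threads.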
28. © 2019 MathWorks, Inc.
Generate Code from Deep Learning Networks
Perception in Autonomous Application
Input; Lane Detection, Coordinate Transform; Object Detection, Bounding Box Processing; Output
Generate Optimized Inference Code
Deep Learning Network Optimizations: Network Re-architecture, Layer Fusion, Memory Optimization
29. © 2019 MathWorks, Inc.
Deep Learning Network Optimizations
Original Network: Conv, Batch Norm, ReLU, Add, Conv, ReLU, Max Pool, Max Pool
Layer Fusion (Optimized Computation): Fused Conv, Fused Conv BatchNormAdd, Max Pool, Max Pool
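One common form of layer fusion is folding an inference-time batch normalization into the weights and bias of the preceding convolution. The sketch below shows the arithmetic on a single-filter 1-D convolution in plain Python. It is an assumption about what a fused Conv and BatchNorm layer computes, not a description of GPU Coder internals, and it does not cover the Add or ReLU parts of the fused layers. All function names and values are made up for the example.

```python
import math


def conv1d(x, w, b):
    """Valid 1-D cross-correlation with a single filter `w` and scalar bias `b`."""
    k = len(w)
    return [sum(w[j] * x[i + j] for j in range(k)) + b
            for i in range(len(x) - k + 1)]


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization with fixed statistics."""
    scale = gamma / math.sqrt(var + eps)
    return [scale * (v - mean) + beta for v in y]


def fold_batch_norm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Return conv weights and bias that already include the batch norm."""
    scale = gamma / math.sqrt(var + eps)
    return [scale * wj for wj in w], scale * (b - mean) + beta


x = [0.5, -1.0, 2.0, 3.5, -0.25]
w, b = [0.2, -0.4, 0.1], 0.3
gamma, beta, mean, var = 1.5, -0.2, 0.1, 0.8

reference = batch_norm(conv1d(x, w, b), gamma, beta, mean, var)   # two layers
w_f, b_f = fold_batch_norm(w, b, gamma, beta, mean, var)
fused = conv1d(x, w_f, b_f)                                       # one layer

assert all(math.isclose(r, f, rel_tol=1e-9, abs_tol=1e-9)
           for r, f in zip(reference, fused))
```

The folded layer does the same work as the convolution alone, so the batch normalization costs nothing extra at inference time.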
Buffer Minimization (Optimized Memory): Fused Conv, Fused Conv BatchNormAdd, Max Pool, Max Pool
Buffers A, B, C, D, E; two buffers are crossed out and replaced by "Reuse Buffer A" and "Reuse Buffer B"
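Buffer minimization can be sketched as a liveness-based assignment: a layer's output buffer returns to a free pool once its last consumer has run, and later layers take buffers from that pool before allocating new ones. The Python sketch below uses a hypothetical five-layer graph, not the network drawn on the slide. The function name `assign_buffers` and the greedy lowest-id policy are assumptions made for illustration.

```python
import heapq


def assign_buffers(layers):
    """Assign an output buffer id to each layer, reusing released buffers.

    `layers` is a list of (name, [names of earlier layers it reads]) in
    execution order. Returns {layer name: buffer id}.
    """
    # Record the last step at which each layer's output is read.
    last_use = {}
    for step, (_, inputs) in enumerate(layers):
        for src in inputs:
            last_use[src] = step

    free = []            # min-heap of buffer ids that may be reused
    assignment = {}
    next_id = 0
    for step, (name, inputs) in enumerate(layers):
        # Pick the output buffer first: the inputs are still live while
        # the layer runs, so their buffers must not be handed out yet.
        if free:
            buf = heapq.heappop(free)
        else:
            buf = next_id
            next_id += 1
        assignment[name] = buf
        # Release every input whose last reader is this layer.
        for src in set(inputs):
            if last_use[src] == step:
                heapq.heappush(free, assignment[src])
    return assignment


# Hypothetical graph with one skip connection feeding an add layer.
graph = [
    ("conv_a", []),
    ("conv_b", ["conv_a"]),
    ("add",    ["conv_a", "conv_b"]),
    ("pool_1", ["add"]),
    ("pool_2", ["pool_1"]),
]
names = {i: letter for i, letter in enumerate("ABCDE")}
plan = {layer: names[buf] for layer, buf in assign_buffers(graph).items()}
assert plan == {"conv_a": "A", "conv_b": "B", "add": "C",
                "pool_1": "A", "pool_2": "B"}
```

In this example five layer outputs fit in three buffers, with buffers A and B each used twice.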
30. © 2019 MathWorks, Inc.
Supported Pretrained Networks
SegNet, ResNet-50, VGG-19, Inception-v3, SqueezeNet, VGG-16, AlexNet, GoogLeNet, ResNet-101
31. © 2019 MathWorks, Inc.
Optimized Deep Learning Libraries & Runtimes
GPU Coder: NVIDIA GPUs (cuDNN, TensorRT)
MATLAB Coder: Intel CPUs (MKL-DNN), ARM Cortex-A CPUs (ARM Compute Library)
32. © 2019 MathWorks, Inc.
Semantic Segmentation
Defective Product Detection
Blood Smear Segmentation
35. © 2019 MathWorks, Inc.
Single Image Inference on Titan V using cuDNN
Compared: PyTorch (1.0.0), MXNet (1.4.0), GPU Coder (R2019a), TensorFlow (1.13.0)
Test setup: Intel® Xeon® CPU 3.6 GHz; NVIDIA libraries: CUDA 10, cuDNN 7; Frameworks: TensorFlow 1.13.0, MXNet 1.4.0, PyTorch 1.0.0
36. © 2019 MathWorks, Inc.
TensorRT Accelerates Inference on Titan V
Single Image Inference with ResNet-50 (Titan V)
Configurations: cuDNN, TensorRT (FP32), TensorRT (INT8)
Compared: GPU Coder, TensorFlow
37. © 2019 MathWorks, Inc.
Single Image Inference on CPU
CPU, Single Image Inference (Linux)
Compared: MATLAB, TensorFlow, MXNet, MATLAB Coder, PyTorch
Test setup: Intel® Xeon® CPU 3.6 GHz; Frameworks: TensorFlow 1.6.0, MXNet 1.2.1, PyTorch 0.3.1
39. © 2019 MathWorks, Inc.
Access Target Peripherals from MATLAB
Host Machine
Peripheral Data
Targets: Jetson AGX Xavier, DRIVE AGX, Raspberry Pi
40. © 2019 MathWorks, Inc.
Deploy Application to Target Boards
Host Machine
Generated CUDA Code: Jetson AGX Xavier, DRIVE AGX
Generated C/C++ Code: Raspberry Pi
41. © 2019 MathWorks, Inc.
Deploy Application to Jetson AGX Xavier
Host Machine: Deploy Generated CUDA Code
Jetson AGX Xavier: Video Feed, Target Display
43. © 2019 MathWorks, Inc.
Processor-in-the-Loop (PIL) Testing on Hardware Boards
Host Machine: Deploy Generated CUDA Code, Send Inputs & Compare Results
Data Exchange
Jetson AGX Xavier
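The "Send Inputs & Compare Results" step of a PIL test amounts to running the same inputs through a reference model on the host and through the generated code on the board, then checking that the outputs agree within a tolerance. The Python sketch below shows only that host-side comparison. The board is replaced by a hypothetical `target_stub` that rounds to single precision, and the function names and tolerances are assumptions rather than part of any MathWorks API.

```python
import math
import struct


def to_float32(x):
    """Round a Python float to single precision, as a float32 target would."""
    return struct.unpack("f", struct.pack("f", x))[0]


def reference_model(inputs):
    """Host-side reference computation in double precision."""
    return [0.5 * v + 1.0 for v in inputs]


def target_stub(inputs):
    """Hypothetical stand-in for results returned by the board.

    In a real PIL run these values would arrive over the data exchange
    link; here single-precision rounding imitates the target.
    """
    return [to_float32(0.5 * to_float32(v) + 1.0) for v in inputs]


def compare_results(expected, actual, rel_tol=1e-5, abs_tol=1e-6):
    """Return the indices where the two result lists disagree."""
    if len(expected) != len(actual):
        raise ValueError("result lengths differ")
    return [i for i, (e, a) in enumerate(zip(expected, actual))
            if not math.isclose(e, a, rel_tol=rel_tol, abs_tol=abs_tol)]


test_inputs = [0.1, 2.5, -3.75, 1000.2]
mismatches = compare_results(reference_model(test_inputs),
                             target_stub(test_inputs))
assert mismatches == []   # single-precision rounding stays within tolerance
```

A tolerance is needed because the target typically computes in single precision or with fused operations, so bit-exact agreement with the host is not expected.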
44. © 2019 MathWorks, Inc.
Musashi Seimitsu Industry Co., Ltd.
Detect Abnormalities in Automotive Parts
Automated visual inspection of 1.3 million bevel gears per month
MATLAB use in project:
• Preprocessing of captured images
• Image annotation for training
• Deep learning based analysis
• Various transfer learning methods (combinations of CNN models, classifiers)
• Estimation of defect area using Class Activation Map (CAM)
• Abnormality/defect classification
• Deployment to NVIDIA Jetson using GPU Coder
45. © 2019 MathWorks, Inc.
Summary
Ground Truth Labeling
Network Design and Training
C/C++ and CUDA Code Generation
Hardware Targeting (CPUs and GPUs)
Key Takeaways
Platform Productivity
Framework Interoperability
Key Takeaways
Optimized C/C++ and CUDA
Hardware Targeting
Processor-in-loop (PIL) Testing