IMPLEMENTATION AND EVALUATION OF DEEP NEURAL NETWORKS
(DNN) ON MAINSTREAM HETEROGENEOUS SYSTEMS
JUNLI GU, MAOHUA ZHU, ZHITAO ZHOU, FENG ZHANG
ZHEN LIN, QIANFENG ZHANG, MAURICIO BRETERNITZ
AMD (RESEARCH)
JUNE 26, 2014
JUNLI.GU@AMD.COM
BACKGROUND
 What is a Deep Neural Network (DNN)?
‒ 3~8 hidden layers, millions to billions of parameters
‒ DNN + Big Data is the leading recent direction in machine learning
 Rich Varieties of DNN Structures
‒ MLP (Multi-Layer Perceptron) / Autoencoder
‒ CNN (Convolutional Neural Network)
‒ DBN (Deep belief network)/RBM (Restricted Boltzmann Machine)
 DNN Applications
‒ Speech Recognition
‒ Image Classification/recognition/retrieval
‒ Document retrieval, handwriting recognition
‒ OCR…
 Industry Use of DNN
‒ Google, Yahoo, Baidu, Alibaba, Tencent, iFlytek, Microsoft, banking and finance
[Figure: a fully connected network with an input layer, three hidden layers (hidden1-hidden3) and an output layer; circles are neurons and edges are weighted connections]
MOTIVATION
DNN challenges hardware:
Computation Heavy, Memory Heavy and Parallel Execution
Fortunately, DNN has rich data/model parallelism
==> GPUs' massive hardware parallelism
==> Heterogeneous Platforms:
Clusters of CPU+GPU, or an APU server?
Note: APU is a processor with both CPU and GPU on the same die.
CPU+GPU CLUSTER
 Existing Platforms
‒ CPU cluster (scale out)
‒ CPU + GPU clusters (scale up + scale out)
 Bottlenecks
‒ GPU device memory size limitation for DNN data/model
‒ Every 250M parameters require 1GB of memory (see the quick check below)
‒ Communication overheads are bottleneck
‒ Intra-node (between CPU and GPU) and inter-node
‒ GPUs are big and power hungry, leading to low density
• Google Brain's 1,000-processor system
• Andrew Y. Ng et al. (Stanford Univ.), "Deep learning with COTS HPC systems", International Conference on Machine Learning, 2013
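A quick sanity check of the memory-per-parameter bullet above (a rough sketch, assuming 4-byte single-precision weights; gradients, optimizer state and activations would add to this):

```python
# Rough check of "every 250M parameters require 1GB of memory",
# assuming 4-byte single-precision weights only (training state would add more).
params = 250e6
bytes_per_param = 4
print(f"{params * bytes_per_param / 2**30:.2f} GiB")   # ~0.93 GiB
```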
[Figure: a CPU+GPU cluster; each node contains CPUs with four GPUs attached over PCIe, and nodes are connected through InfiniBand]
APU AND APU SERVER
 APU
‒ In 2009, AMD launched the first chip integrated with both CPU and GPU
‒ Programming through OpenCL
 Architectural Advantages
‒ Unified memory address space: the GPU and CPU share a very large memory
‒ Very efficient data sharing: no data copy
‒ Fully coherent memory
‒ Sharing through pointers
 APU Server
‒ High density, low power data server
‒ Customized fast FABRIC
‒ Advanced research on an internal prototype
[Figure: APU with CPU and GPU on one die sharing memory (HSA features); APU server prototype with an 8x8x8 = 512-node fabric. Credit: AMD SeaMicro]
SOME QUICK TAKEAWAYS
CPU+GPU cluster gets 2x speedup with 6x more power
2.4 APUs can achieve the same performance with 2.5x less
power.
APUs can be integrated as high density, power efficient data
centers to reduce complexity and cost.
OUTLINE
 Background and Motivation
 DNN Algorithm Architectures
‒ MLP (Multi-Layer Perceptron )
‒ Autoencoder
 Evaluation on Multiple Platforms
 Bottleneck Analysis
 Conclusions and Next Plan
DNN ALGORITHM ARCHITECTURE 1– MLP
 MLP (Multi-Layer Perceptron )
‒ Speech recognition
‒ Layers of matrix multiply + non-linear functions
 Compute Patterns
‒ Layers of matrix multiplication
‒ Reflects the compute-intensive core of most DNNs
‒ CPU prepares data, GPU computes
MLP Structure
[Figure: layer sizes 1100 - 2048 - 2048 - 2048 - 2048 - 2048 - 2048 - 2048 - 9304; adjacent layers are fully connected. Parameter space: 44 million (layer sizes roughly 1k-2k-2k-2k-2k-2k-2k-2k-2k-9k)]
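A small sketch of where the parameter count comes from; the layer sizes below are one reading of the figure (1100, seven hidden layers of 2048, 9304), which lands in the same ballpark as the slide's 44 million figure:

```python
# Count the fully connected weights between adjacent layers (biases ignored).
def num_weights(sizes):
    return sum(m * n for m, n in zip(sizes[:-1], sizes[1:]))

sizes = [1100] + [2048] * 7 + [9304]                  # one reading of the figure above
print(f"{num_weights(sizes) / 1e6:.1f} M weights")    # ~46.5 M, vs. the slide's ~44 M
```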
Forward/Backward Propagation
[Figure: forward propagation runs from the input $x$ through hidden activations $(z_1, a_1), (z_2, a_2), (z_3, a_3)$ with weights $w_1, w_2, w_3$; back propagation carries the error back through the same layers]

Forward propagation:
$z_1 = x w_1 + b_1, \quad a_1 = f(z_1)$
$z_2 = a_1 w_2 + b_2, \quad a_2 = f(z_2)$
$z_3 = a_2 w_3 + b_3, \quad a_3 = f(z_3)$
$e = \tfrac{1}{2} \lVert y - a_3 \rVert^2$

Back propagation (error $e$, with $\odot$ denoting element-wise multiplication):
$\delta_3 = -(y - a_3) \odot f'(z_3), \quad \frac{\partial e}{\partial w_3} = a_2^{T} \delta_3$
$\delta_2 = w_3^{T} \delta_3 \odot f'(z_2), \quad \frac{\partial e}{\partial w_2} = a_1^{T} \delta_2$
$\delta_1 = w_2^{T} \delta_2 \odot f'(z_1), \quad \frac{\partial e}{\partial w_1} = x^{T} \delta_1$
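For reference, a minimal NumPy sketch of one forward/backward pass following the equations above (batched row-vector convention; the sigmoid activation and the toy layer sizes are assumptions, and this is only an illustration, not the OpenCL/BLAS implementation evaluated in this deck):

```python
import numpy as np

def f(z):                      # activation; the slide does not name it, sigmoid assumed
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):
    s = f(z)
    return s * (1.0 - s)

def mlp_step(x, y, w, b):
    """One forward/backward pass mirroring the slide's equations.
    x: (batch, n_in), y: (batch, n_out); w[l]: (n_{l-1}, n_l), b[l]: (1, n_l)."""
    a, z = [x], []
    for wl, bl in zip(w, b):                   # z_l = a_{l-1} w_l + b_l, a_l = f(z_l)
        z.append(a[-1] @ wl + bl)
        a.append(f(z[-1]))
    e = 0.5 * np.sum((y - a[-1]) ** 2)         # e = 1/2 ||y - a_L||^2
    delta = -(y - a[-1]) * f_prime(z[-1])      # delta_L = -(y - a_L) .* f'(z_L)
    grads = [None] * len(w)
    for l in range(len(w) - 1, -1, -1):
        grads[l] = a[l].T @ delta              # de/dw_l = a_{l-1}^T delta_l
        if l > 0:                              # delta_{l-1} = delta_l w_l^T .* f'(z_{l-1})
            delta = (delta @ w[l].T) * f_prime(z[l - 1])
    return e, grads

# Toy sizes; the deck's speech model uses 1100 - 2048 x ... - 9304 with mini-batch 1024.
rng = np.random.default_rng(0)
sizes = [256, 512, 512, 1024]
w = [0.01 * rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
b = [np.zeros((1, n)) for n in sizes[1:]]
x = rng.standard_normal((64, sizes[0]))
y = rng.standard_normal((64, sizes[-1]))
e, grads = mlp_step(x, y, w, b)
```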
 Autoencoder + L-BFGS Training
‒ Used for pre-training (Hinton et al., 2006)
‒ Semantic retrieval (Krizhevsky et al., 2011)
‒ L-BFGS has good scalability (Le et al., 2011)
DNN ALGORITHM ARCHITECTURE 2–AUTOENCODER
 Compute Patterns
‒ A mix of CPU compute with GPU compute
‒ Frequent CPU-GPU interactions and data transfers
‒ A good fit to leverage APU advantages
[Figure 1 - Autoencoder Structure: the input layer (3072) is encoded through a 6144-unit layer into a 1024-unit code using weights W1, W2, then reconstructed back through 6144 to 3072 using the tied weights W2^T, W1^T, and the reconstruction is used for cost computing. Parameter space: 25 million (layer sizes 3k-6k-1k-6k-3k).
Figure 2 - L-BFGS Training Algorithm: the GPU runs forward and back propagation to get the cost and gradients; the CPU runs L-BFGS to compute a new direction and tries new step lengths until the line-search condition is met]
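A hedged NumPy/SciPy sketch of this CPU/GPU split (illustrative only, not the deck's OpenCL code): cost_and_grad plays the role of the GPU forward/backward pass with tied decoder weights W2^T, W1^T (which is why 3072*6144 + 6144*1024 ≈ 25M parameters), while SciPy's fmin_l_bfgs_b stands in for the CPU-side L-BFGS direction and line search; the sigmoid activation and scaled-down sizes are assumptions:

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

# Scaled-down layer sizes; the slide uses 3072-6144-1024 with tied decoder weights,
# giving 3072*6144 + 6144*1024 ~= 25M parameters.
n_in, n_h, n_code = 384, 768, 128
rng = np.random.default_rng(0)
X = rng.standard_normal((256, n_in))      # stand-in mini-batch (slide: CIFAR-10, batch 2048)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unpack(theta):
    w1 = theta[: n_in * n_h].reshape(n_in, n_h)
    w2 = theta[n_in * n_h:].reshape(n_h, n_code)
    return w1, w2

def cost_and_grad(theta):
    """Reconstruction cost and gradient; in the deck this part runs on the GPU."""
    w1, w2 = unpack(theta)
    h  = sigmoid(X @ w1)                       # encode
    c  = sigmoid(h @ w2)                       # code layer
    h2 = sigmoid(c @ w2.T)                     # decode with tied weight W2^T
    xr = h2 @ w1.T                             # reconstruction with tied weight W1^T (linear)
    r  = xr - X
    cost = 0.5 * np.sum(r ** 2)
    d_h2 = (r @ w1) * h2 * (1 - h2)            # back-prop through the decoder
    d_c  = (d_h2 @ w2) * c * (1 - c)
    d_h  = (d_c @ w2.T) * h * (1 - h)
    g_w1 = X.T @ d_h + r.T @ h2                # encoder term + tied decoder term
    g_w2 = h.T @ d_c + d_h2.T @ c
    return cost, np.concatenate([g_w1.ravel(), g_w2.ravel()])

theta0 = 0.01 * rng.standard_normal(n_in * n_h + n_h * n_code)
# fmin_l_bfgs_b plays the CPU's role: new directions and line-search step lengths.
theta, final_cost, info = fmin_l_bfgs_b(cost_and_grad, theta0, maxiter=20)
```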
OUTLINE
 Background and Motivation
 DNN Algorithm Architectures
 Evaluation on Multiple Platforms
‒Implementation on APUs and GPUs
‒Performance/power/perf_per_watt comparison
 Bottleneck Analysis
 Conclusions and Next Plan
EVALUATION METHODOLOGY AND PLATFORMS
 Implementations based on commercial BLAS libraries
‒ Mainstream X86 CPUs: C++ & math library
‒ AMD APUs & GPUs: OpenCL & CLAMDBLAS
‒ Mainstream GPU: CUDA C & CUBLAS (for competitive purposes)
 Platforms
Device Category | Device Name | Throughput (GFLOPS) | Price (RMB) | TDP (Watt) | CPU version | AMD OCL version | CUDA version | Note
CPU | Mainstream x86 | 848 | 2240 | 84 | √ | √ | | Realtime power traces
APU series | AMD APU A10-7850K | 856 | 1299 | 95 | | √ | | Realtime power traces
APU series | Mainstream x86 SOC | 848 | 2240 | 84 | | √ | | Realtime power traces
Customer-end GPU | AMD HD7970 | 3788.8 | 2000 | 250 | | √ | | TDP used
Customer-end GPU | Mainstream GPU | 3977 | 3799 | 250 | | √ | √ | TDP used
EVALUATION METHODOLOGY AND PLATFORMS-CONT.
 Evaluation results indicate per-unit training speed
‒CNN not tested, as that work is still under development
‒MLP and Autoencoder: initial results tested
‒DNN model parameters and mini-batch size align with Internet industry practice
‒Single-node results presented
‒Further optimizations are ongoing
MLP MODEL (VOICE RECOGNITION)
• Kaveri 95W vs. Mainstream x86: 1.8x speedup
• Kaveri 95W vs. Mainstream x86 SOC's: 3.7x speedup
Mini-batch size: 1024
CPU prepares data, GPU computes
Note: clAmdBlas offers an architecture-aware optimization tool called clAmdBlasTune. Make sure to run the tuner the first time you run on a new processor.
PERFORMANCE/POWER/PERF_PER_WATT
 APU achieves the highest Perf./watt
E.g. 1.2x compared to the GPU
 GPU achieves 5x perf. with 7x power
 CPU gets 60% perf. with 1.9x power
[Chart: speed and power normalized to the APU]
Platform | Perf. per Watt ratio | Performance ratio | Power ratio
A10-7850K | 1 | 1 | 1
Mainstream x86 | 0.3 | 0.6 | 1.9
Mainstream x86 SOC's | 0.22 | 0.3 | 1.3
AMD HD7970 | 0.7 | 4.9 | 7.3
Mainstream GPU | 0.8 | 6.2 | 7.3
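The perf-per-watt bars follow directly from dividing the performance ratio by the power ratio; a quick check using the values read off the chart:

```python
# Quick check that the perf-per-watt column is performance ratio / power ratio,
# all normalized to the A10-7850K APU (values read off the slide's chart).
platforms = ["A10-7850K", "Mainstream x86", "Mainstream x86 SOC's",
             "AMD HD7970", "Mainstream GPU"]
perf  = [1.0, 0.6, 0.3, 4.9, 6.2]   # performance ratio
power = [1.0, 1.9, 1.3, 7.3, 7.3]   # power ratio
for name, s, w in zip(platforms, perf, power):
    print(f"{name:22s} perf/watt = {s / w:.2f}")
# prints ~1.00, 0.32, 0.23, 0.67, 0.85 -- roughly the chart's 1, 0.3, 0.22, 0.7, 0.8
```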
AUTOENCODER (IMAGE AND DOCUMENT RETRIEVAL)
• Algorithm is a mix of CPU and GPU compute
• APU vs. Mainstream x86: 8% slowdown
• APU vs. Mainstream x86 SOC's: 3.8x speedup
 The larger the batch size, the bigger the advantage the APU presents.
Data: CIFAR-10, mini-batch size: 2048
CPU: L-BFGS; GPU: autoencoder forward and backward propagation
PERFORMANCE/POWER/PERF_PER_WATT
 APU achieves the highest Perf./watt
E.g. 2x compared to the dGPU
 GPU achieves 2x perf. with 5x power
 CPU gets 90% perf. with 1.4x power
[Chart: speed and power normalized to the APU]
Platform | Perf. per Watt ratio | Performance ratio | Power ratio
A10-7850K | 1 | 1 | 1
Mainstream x86 | 0.65 | 0.9 | 1.4
Mainstream x86 SOC's | 0.3 | 0.3 | 0.9
AMD HD7970 | 0.46 | 2.2 | 4.8
Mainstream GPU | 0.5 | 2.4 | 4.8
REAL CASE TRAINING
 MNIST Training through MLP Model
‒Handwritten digits, 60,000 images
‒Mini-batch size 1024, 200 epochs
‒Accuracy 97% with random weights
‒Accuracy 98% with pre-trained weights
 | APU A10-7850 | GPU HD7970 | GPU vs. APU
Training: Time | 362 s | 192 s | 1.9x speedup
Training: Average Power | 47 W | 250 W | 5.3x power
Training: Energy | 17 kJ | 40 kJ | 2.4x energy
Predicting: Time | 8.1 s | 3.5 s | 2.3x speedup
Predicting: Average Power | 37 W | 250 W | 6.8x power
Predicting: Energy | 300 J | 875 J | 2.9x energy
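The "GPU vs. APU" column follows from the absolute numbers in the table; a quick check:

```python
# Derive the "GPU vs. APU" ratios from the table's absolute numbers
# (time in seconds, average power in watts, energy in joules, all taken from the table).
rows = {
    "training":   {"apu": (362, 47, 17e3), "gpu": (192, 250, 40e3)},
    "predicting": {"apu": (8.1, 37, 300),  "gpu": (3.5, 250, 875)},
}
for phase, d in rows.items():
    t_a, p_a, e_a = d["apu"]
    t_g, p_g, e_g = d["gpu"]
    print(f"{phase}: {t_a / t_g:.1f}x speedup, "
          f"{p_g / p_a:.1f}x power, {e_g / e_a:.1f}x energy")
# training:   1.9x speedup, 5.3x power, 2.4x energy
# predicting: 2.3x speedup, 6.8x power, 2.9x energy
```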
OUTLINE
 Background and Motivation
 DNN Algorithm Architectures
 Evaluation on Multiple Platforms
 Bottleneck Analysis
 Conclusions and Next Plan
DNN PERFORMANCE BOTTLENECKS
 DNN is usually converted to matrix multiplication, which consumes the major part of the time.
‒ People use the BLAS libraries provided on commercial processors.
 The weight matrix is transposed during back propagation.
‒ It is flipped between row-wise and column-wise access between fprop and bprop.
 Data transfers between CPU and GPU can consume most of the time, especially for large images.
‒ Task assignment: the CPU prepares the data, the GPU computes
‒ The APU can remove these overheads through the zero-copy technique
FURTHER ANALYSIS-WEIGHT MATRIX TRANSPOSE
 Weight matrices are transposed during back propagation (on BP's critical path)
‒ $z = W^{T} \sigma$
 What is the most efficient way to transpose on different platforms?
‒ sgemm, sgemm_T, GPU_Tran + sgemm, CPU_Tran + sgemm
 Note: leveraging the CPU to transpose the matrix gives the worst performance, because the CPU takes about an order of magnitude longer to transpose while the GPU waits idle
Micro-benchmark: transpose a 2k x 2k matrix A and multiply $A^{T} * B$

Platforms | AMD GPU: FX8320 + HD7970 | FX8320 + Mainstream GPU | AMD APU A10-7850K
sgemm | 8.62 ms | 6.09 ms | 53.26 ms
sgemm_T | 17.69 ms | 6.31 ms | 83.3 ms
GPU Tran + sgemm | 9.56 ms | 6.34 ms | 55.46 ms
CPU Tran + sgemm | 55.88 ms | 67.46 ms | 86.8 ms
(the slide marks the preferred option on each platform with a √)
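A host-side NumPy/SciPy analogue of this micro-benchmark (illustrative only; it calls the CPU BLAS through scipy.linalg.blas rather than clAmdBlas/cuBLAS, so absolute times will not match the table):

```python
import time
import numpy as np
from scipy.linalg import blas

n = 2048
rng = np.random.default_rng(0)
# Fortran (column-major) order so the BLAS calls take the arrays without extra copies.
A = np.asfortranarray(rng.standard_normal((n, n)), dtype=np.float32)
B = np.asfortranarray(rng.standard_normal((n, n)), dtype=np.float32)

def bench(label, fn, reps=5):
    fn()                                    # warm-up
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    print(f"{label:20s} {(time.perf_counter() - t0) / reps * 1e3:.2f} ms")

bench("sgemm (A*B)",     lambda: blas.sgemm(1.0, A, B))
bench("sgemm_T (A^T*B)", lambda: blas.sgemm(1.0, A, B, trans_a=True))
bench("Tran + sgemm",    lambda: blas.sgemm(1.0, np.asfortranarray(A.T), B))
```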
FURTHER ANALYSIS-DATA TRANSFER OVERHEADS
 Data transfer overheads between CPU and GPU have been pointed out (A. et al., 2013) as the bottleneck of DNN acceleration.
 First, we use the autoencoder to quantify the data transfer overheads.
 Data transfer time increases linearly with data size. It is very difficult to train real-world-size images without removing this bottleneck.
[Chart: data transfer time as a percentage of total time, on CPU + Mainstream GPU, for one forward prop. and backward prop.]
Mini-batch size | Input size 3072 | Input size 5120 | Input size 7168
256 | 15% | 24% | 33%
512 | 18% | 25% | 34%
1024 | 18% | 27% | 38%
2048 | 21% | 33% | 40%
Up to 40% of the time is spent moving data for 48x48 RGB images.
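For scale, a rough sketch of how much input data must cross PCIe per mini-batch at the chart's sizes (single-precision floats assumed; the weights and gradients exchanged for L-BFGS come on top of this):

```python
# Rough size of the input mini-batch copied to the GPU each iteration
# (4-byte floats assumed; e.g. 48*48*3 = 6912 floats per image, ~the 7168 column).
input_sizes = [3072, 5120, 7168]        # floats per sample
batch_sizes = [256, 512, 1024, 2048]
for n in input_sizes:
    for b in batch_sizes:
        mib = n * b * 4 / 2**20
        print(f"input {n:5d} x batch {b:4d}: {mib:7.1f} MiB per transfer")
```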
DATA TRANSFER OVERHEADS
 How to avoid data copy through the zero-copy technique on APUs?
‒ APU: Zero-copy improves performance by 10%
‒ GPUs: Zero-copy degrades performance by 3.5x for AMD HD7970 and 8.7x for Mainstream GPU.
Zero-copy technique:
APUs: CPU and GPU share the
same piece of memory, efficient
GPUs: GPU accesses host memory
through PCIe, slow
Experiment design:
The CPU initializes 2k x 2k matrices (A, B); the GPU performs C = A*B
Matrix multiplication performance comparison between copy and zero-copy
[Chart: execution time (ms), kernel time plus data transfer time]
Platform | Copy | Zero-copy
Kaveri | 45 ms | 41 ms
HD7970 | 19 ms | 67 ms
Mainstream GPU | 23 ms | 199 ms
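A minimal PyOpenCL sketch of the two allocation strategies (an assumption for illustration; the deck's experiment uses C/OpenCL with a GEMM, here replaced by a trivial scaling kernel): the copy path stages the host array into a device buffer, while the zero-copy path allocates with ALLOC_HOST_PTR and maps the buffer so the CPU initializes memory that the GPU then reads in place:

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

n = 2048
a = np.random.rand(n, n).astype(np.float32)

prog = cl.Program(ctx, """
__kernel void scale(__global float *x, const float alpha) {
    int i = get_global_id(0);
    x[i] = alpha * x[i];   /* stand-in for the real GEMM kernel */
}
""").build()

# Copy path: the host array is copied into a device buffer, and the result copied back.
buf_copy = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a)
prog.scale(queue, (a.size,), None, buf_copy, np.float32(2.0))
out = np.empty_like(a)
cl.enqueue_copy(queue, out, buf_copy)
queue.finish()

# Zero-copy path: ALLOC_HOST_PTR buffer; the CPU fills it through a mapping.
buf_zc = cl.Buffer(ctx, mf.READ_WRITE | mf.ALLOC_HOST_PTR, size=a.nbytes)
mapped, _ = cl.enqueue_map_buffer(queue, buf_zc, cl.map_flags.WRITE, 0, a.shape, a.dtype)
mapped[...] = a          # CPU initializes the shared buffer in place
del mapped               # dropping the last reference unmaps the buffer
prog.scale(queue, (a.size,), None, buf_zc, np.float32(2.0))
queue.finish()
# On an APU the GPU touches the same physical memory; on a discrete GPU the same
# calls force accesses over PCIe, which is why zero-copy is slower there.
```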
CONCLUSIONS-APU SERVER ADVANTAGES
BASED ON AUTOENCODER RESULTS
AMD APU Server
 2.4 APUs can achieve similar performance with ~2.5x less power
 2.5x higher performance given the same power budget
TCO (total cost of ownership)  The APU server achieves the same performance with ~1.8x fewer dollars
Architectural Advantages
 APU servers remove the GPU's device memory limitation and the data transfer bottleneck, which fits Big Data inputs better
Cluster of CPU + GPU
 2.4x speedup
 6x more power
NEXT PLAN-AMD SOLUTIONS
 H/W solutions: Parallel implementation on systems and system level evaluation
‒ CPU + GPUs cluster
‒ APU server
 S/W solutions: OpenCL Implementation of DNN specific kernels
‒ OpenCL implementations and optimizations, applicable to general heterogeneous platforms
 Set up real-world application scenarios with external companies' involvement and apply AMD solutions to industry
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap
changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software
changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD
reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of
such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY
INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE
LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION
CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices,
Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.
BACK UP SLIDES
SYSTEM OVERVIEW (backup slide from "Heterogeneous System Coherence", MICRO-46, December 11, 2013)
[Figure: an APU with a CPU cluster and a GPU cluster connected through a directory to DRAM; a direct-access bus (used for graphics) bypasses the directory; invalidation traffic flows between the directory and the clusters; GPU compute accesses must stay coherent; arrow thickness indicates bandwidth]
SYSTEM OVERVIEW (cont.)
[Figure: the GPU cluster contains many CUs, each with its own L1 cache, behind a shared GPU L2 cache; the L2 link needs very high bandwidth because the L2 has a high miss rate. Each CU contains an I-fetch/decode unit, a register file, a grid of execution units, local scratchpad memory, and a coalescer to the L1]
SEAMICRO
Speaker notes

  1. DNN has become the leading direction in machine learning over the past two or three years. Starting in April 2013, AMD Research collaborated with the development team and with experts in systems, architecture and OpenCL. This is the first time we have talked about the DNN project in public. The goal of the AMD DNN project is to build DNN systems on AMD APUs and GPUs that can be applied in industry to address the hardware challenges. We implement and accelerate core DNN algorithms in OpenCL. Today I am sharing some initial results and our insights on systems.
  2. Since we have an audience from the systems community, let me first introduce DNN a bit.
  3. Exploring which heterogeneous system offers the best efficiency motivates our project. That is, when we map DNN onto hardware, is it good enough to have the CPU and GPU connected through the motherboard, or is it better to have the CPU and GPU more closely integrated, say on the same chip? What is the major difference between the two systems?
  4. Communication overheads hurt both performance and scalability; when building servers and data centers, more limiting factors show up, for example the physical space a cluster takes up and its power consumption. Compared to a small CPU, a GPU is a big monster that drains hundreds of watts, so the cluster ends up with low density. Keep these bottlenecks in mind as we move to APUs.
  5. The APU enables close collaboration between the CPU and GPU in finishing a task together; the nice thing about the APU is that it has the same size and power consumption as a CPU. This research evaluates the system efficiency in performance, power and performance per watt between CPU+GPU clusters and APU servers, and provides insights on how to build the APU server as a future product.
  6. These are the major takeaways from what we found out through our experiments. 5 min
  7. Now let's introduce the two DNN kernels we implement.
  8. MLP refers to multi-layer perceptron; it is a classical neural network model.
  9. 8 min
  10. In this section I am going to introduce the evaluation results on GPUs and APUs and provide a quantitative comparison of performance, power and performance-per-watt ratios.
  11. The CPU, CUDA, and OpenCL versions are all developed by AMD for peer-to-peer benchmark comparison; due to resource limitations we cannot guarantee the code is fully optimized, and this is also our next direction. "Mainstream x86 SOC" refers to the GPU on that SOC. The current testing is on single platforms only. OpenCL can run on all platforms, but for competitive purposes we use the C and CUDA versions for our competitor's CPU and GPUs.
  12. Before I show the results, let me clarify that these are initial results without our full optimizations, just for comparison.
  13. We suspect clAmdBlas performance is lower than on the mainstream GPU, causing the lower dGPU result. The APU is about 5x to 6x slower than the GPU; as we mentioned before, the GPU is roughly 10x bigger and consumes more power. 9~10 minutes
  14. In order to provide a systematic comparison, we list a more comprehensive comparison on this slide. The x-axis shows the different platforms; the y-axis on the left shows the performance normalized to the APU.
  15. 11-12 minutes
  16. Larger batch size means heavier matrix multiplication workload.
  17. The previous slides show the per-unit training speed. Now let's take a look at a real case. We show energy here because green computing is also a critical metric these days. From this chart, we can see the APU can be used to build power-efficient servers. 13 minutes
  18. Next I am going to go through the bottleneck analysis very quickly and share some of our OpenCL implementation experiences.
  19. The autoencoder training process involves frequent transfers of large amounts of weights and gradients between the CPU and GPU.
  20. The zero-copy technique refers to allocating a memory object on the host (CPU) but allowing the GPU to access it directly without a copy. APUs leverage the zero-copy mechanism naturally because the CPU and integrated GPU actually share host memory. 18 min
  21. The APU server can achieve the same performance with approximately 1.8x fewer dollars, assuming the same cost for memory, motherboard and interconnects. Architectural advantages: APUs have a very large unified address space and remove the GPU's device memory limitation and data transfer bottleneck, which suits Big Data inputs better.
  22. In order to stay coherent, all GPU coherent requests must go through the directory. Very high bandwidth; arrow thickness indicates bandwidth.
  23. CU = streaming multiprocessor. Talk more about the I-fetch, register file and scratchpad.