High-Performance GPU Programming for Deep Learning
7 April 2016
Scott Gray
Nervana Systems
MAKING MACHINES SMARTER.™
Proprietary and confidential. Do not distribute.
High-Performance GPU kernels for deep learning
• Fast matrix multiply for small minibatches
• Direct convolution leveraging GEMM advances
• Even faster convolution with Winograd
GEMM: Basics
C = AB
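As a minimal sketch (not the kernel from the talk), C = AB can be accumulated as a sum of outer products, one for each index k of the shared dimension. The name `gemm_outer_product` and the nested-list layout are illustrative assumptions.

```python
def gemm_outer_product(a, b):
    """Compute C = AB by accumulating one outer product per shared index k.

    a is m x k_dim and b is k_dim x n, both given as nested lists.
    """
    m, k_dim, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for k in range(k_dim):
        col = [a[i][k] for i in range(m)]  # column k of A
        row = b[k]                         # row k of B
        for i in range(m):
            for j in range(n):
                # Rank-1 update: C += col (outer product) row
                c[i][j] += col[i] * row[j]
    return c


c = gemm_outer_product([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Each step of the k loop touches every element of C once, which is why the loads of the A column and B row are the part of the loop worth organizing carefully.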
GEMM: Memory Load
[Figure: per-thread memory load patterns for the outer product, contiguous vs. strided, shown for a single tile and for batched GEMM]
GEMM: Tile sizes
[Figure: thread and shared memory load layout for a 32 x 32 GEMM tile, a 32 x 64 GEMM tile, and batched GEMM 32 x 32 tiles]
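One way to see why tile size matters for small minibatches is to count how much of each tile holds real output. The sketch below assumes the minibatch dimension maps to the first tile dimension and that a partially filled tile still does a full tile of work. `tile_utilization` is a hypothetical helper, not part of any library.

```python
import math


def tile_utilization(m, n, tile_m, tile_n):
    """Fraction of tile work that lands on real elements of an m x n output."""
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    return (m * n) / (tiles * tile_m * tile_n)


# Hypothetical shapes: a minibatch of 32 rows and 3072 output columns.
small_tile = tile_utilization(32, 3072, 32, 32)    # every tile is full
large_tile = tile_utilization(32, 3072, 128, 64)   # only 32 of 128 tile rows are used
```

Under these assumptions the 128 x 64 tile spends three quarters of its work on padding at a minibatch of 32, while the 32 x 32 tile wastes none.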
hGEMM Results - NN
[Chart: Nx3072x3072 NN op. x-axis: Batch Size (N) at 32, 64, 96, 128; y-axis: GFLOPS, 0 to 6000; series: Nervana 32x32, cuBLAS 128x64]
hGEMM Results - TN
[Chart: Nx3072x3072 TN op. x-axis: Batch Size (N) at 32, 64, 96, 128; y-axis: GFLOPS, 0 to 6000; series: Nervana 32x32, cuBLAS 128x64]
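For reference, GEMM throughput in GFLOPS is conventionally derived from the operation count 2·M·N·K divided by the run time. The sketch below assumes "Nx3072x3072" means a minibatch of N rows with both other dimensions equal to 3072. The function names and the example timing are hypothetical and not taken from the charts.

```python
def gemm_flops(m, n, k):
    """Floating-point operations for C = AB with A of shape m x k and B of shape k x n.

    Each of the m * n outputs needs k multiplies and k adds.
    """
    return 2 * m * n * k


def gemm_gflops(m, n, k, seconds):
    """Throughput in GFLOPS for one GEMM that took `seconds` to run."""
    return gemm_flops(m, n, k) / seconds / 1e9


# Hypothetical timing of 1 ms for the N = 32 case; not a measurement from the talk.
example = gemm_gflops(32, 3072, 3072, 1e-3)
```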
Direct convolution is still relevant
• Striding
• Odd-size filters
• Placeholder until faster algo can be implemented
• Often faster for a single image or for the first layer, where C is small
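The cases above are easy to express in a direct formulation. The following is a plain reference sketch of direct convolution (cross-correlation, as used in CNNs) for one channel, with an arbitrary filter size, stride, and zero padding. It is not the optimized kernel; the name `direct_conv2d` and the use of P, Q for the output height and width are assumptions made for illustration.

```python
def direct_conv2d(image, filt, stride=1, pad=0):
    """Single-channel direct convolution with stride and zero padding."""
    H, W = len(image), len(image[0])
    R, S = len(filt), len(filt[0])
    P = (H + 2 * pad - R) // stride + 1  # output height
    Q = (W + 2 * pad - S) // stride + 1  # output width
    out = [[0] * Q for _ in range(P)]
    for p in range(P):
        for q in range(Q):
            acc = 0
            for r in range(R):
                for s in range(S):
                    y = p * stride + r - pad
                    x = q * stride + s - pad
                    # Positions outside the image are treated as zeros.
                    if 0 <= y < H and 0 <= x < W:
                        acc += image[y][x] * filt[r][s]
            out[p][q] = acc
    return out


ones5 = [[1] * 5 for _ in range(5)]
ones3 = [[1] * 3 for _ in range(3)]
strided = direct_conv2d(ones5, ones3, stride=2)
padded = direct_conv2d(ones5, ones3, stride=2, pad=1)
```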
Direct convolution: implementation details
• Batched GEMM for efficient transpose and higher occupancy
• Compound outer product block remapping
• Square wave pattern for P,Q block mapping
• Slicing: shared memory lookup + integer division
• N vs C contiguous
• Single P,Q vs tiled P,Q
• Bprop as upside down fprop
• Update specific optimizations
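One reading of "bprop as upside down fprop" is that, for stride 1, the gradient with respect to the input equals a forward pass over the zero-padded output gradient with the filter reversed. The 1D, single-channel sketch below checks that identity against a direct scatter-style gradient. All function names are hypothetical, and the multi-channel case (where the roles of C and K also swap) is not shown.

```python
def fprop(x, f, pad=0):
    """1D forward pass (cross-correlation), stride 1, optional zero padding."""
    xp = [0] * pad + list(x) + [0] * pad
    return [sum(xp[p + r] * f[r] for r in range(len(f)))
            for p in range(len(xp) - len(f) + 1)]


def bprop_direct(dout, f, n):
    """Input gradient from the definition: each output scatters into its inputs."""
    dx = [0] * n
    for p, g in enumerate(dout):
        for r, w in enumerate(f):
            dx[p + r] += g * w
    return dx


def bprop_as_fprop(dout, f):
    """Same gradient computed as fprop with the filter reversed and full padding."""
    return fprop(dout, f[::-1], pad=len(f) - 1)


f = [1, 2, 3]
dout = [1, -1, 2, 0]  # gradient for an input of length 6 and a filter of length 3
dx_direct = bprop_direct(dout, f, n=6)
dx_flipped = bprop_as_fprop(dout, f)
```

Reusing the forward kernel this way means one well-tuned code path can serve both passes.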
Winograd: input transform
[Figure: Input Feature Map, 4x4 tiles with stride 2]
• Input transform
• 2D Winograd is a nested product of 1D transforms
• Transforms can be simplified to remove zeros
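The nesting described above can be sketched for a 4x4 input tile, assuming the standard F(2x2, 3x3) variant of Winograd convolution. The slides do not print the matrix, so the matrix `BT` below is an assumption taken from that standard variant. The 1D transform is written with the zero terms removed and is checked against the full matrix product B^T d B.

```python
# Standard F(2,3) input transform matrix B^T (assumed; not shown in the slides).
BT = [[1, 0, -1, 0],
      [0, 1, 1, 0],
      [0, -1, 1, 0],
      [0, 1, 0, -1]]


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def input_transform_1d(d):
    """B^T d for one 4-vector, with the multiplications by zero dropped."""
    d0, d1, d2, d3 = d
    return [d0 - d2, d1 + d2, d2 - d1, d1 - d3]


def input_transform_2d(tile):
    """B^T d B as nested 1D transforms: first down the columns, then along the rows."""
    cols = [input_transform_1d(col) for col in transpose(tile)]
    partial = transpose(cols)  # this is B^T d
    return [input_transform_1d(row) for row in partial]


tile = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
nested = input_transform_2d(tile)
reference = matmul(matmul(BT, tile), transpose(BT))
```

With the zeros removed, the transform needs only additions and subtractions.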
Winograd: filter transform
• Filter transform
• Same as input but with different coefficients
• Transform each feature map independently
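A corresponding sketch for the filter side, again assuming the standard F(2x2, 3x3) variant: a 3x3 filter is expanded to 4x4 with the same nested structure but different coefficients, and every filter in a bank is transformed on its own. The matrix `G`, the `filters[k][c]` layout, and the function names are assumptions for illustration.

```python
# Standard F(2,3) filter transform matrix G (assumed; not shown in the slides).
G = [[1, 0, 0],
     [0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5],
     [0, 0, 1]]


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def filter_transform_1d(g):
    """G g for one 3-vector, written without the multiplications by zero."""
    g0, g1, g2 = g
    return [g0, (g0 + g1 + g2) / 2, (g0 - g1 + g2) / 2, g2]


def filter_transform_2d(f):
    """G g G^T as nested 1D transforms: a 3x3 filter becomes a 4x4 tile."""
    cols = [filter_transform_1d(col) for col in transpose(f)]
    partial = transpose(cols)  # this is G g, shape 4x3
    return [filter_transform_1d(row) for row in partial]


def transform_filter_bank(filters):
    """Transform every filters[k][c] independently."""
    return [[filter_transform_2d(f) for f in per_k] for per_k in filters]


g = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
nested = filter_transform_2d(g)
reference = matmul(matmul(G, g), transpose(G))
```

Because the filters do not change within a forward pass, this transform can be computed once and reused for every input tile.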
Winograd: batched GEMM
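In the transformed domain, the elementwise products are summed over input channels. Regrouping the data by position inside the 4x4 tile turns that sum into 16 independent matrix multiplies, which is a batched GEMM. The sketch below assumes K output channels, C input channels, and T input tiles, and uses random integers in place of real transformed data. All names are hypothetical.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def scatter(tiles, rows, cols):
    """Regroup tiles[r][c] (each a 4x4 tile) into 16 matrices of shape rows x cols."""
    return [[[tiles[r][c][i][j] for c in range(cols)] for r in range(rows)]
            for i in range(4) for j in range(4)]


def winograd_batched_gemm(u_tiles, v_tiles, K, C, T):
    """16 GEMMs of (K x C) by (C x T), one per position in the 4x4 tile."""
    u = scatter(u_tiles, K, C)
    v = scatter(v_tiles, C, T)
    return [matmul(uk, vk) for uk, vk in zip(u, v)]


def elementwise_reference(u_tiles, v_tiles, K, C, T):
    """The same result written as elementwise products summed over channels."""
    return [[[sum(u_tiles[k][c][i][j] * v_tiles[c][t][i][j] for c in range(C))
              for t in range(T)] for k in range(K)]
            for i in range(4) for j in range(4)]


def random_tiles(rng, rows, cols):
    return [[[[rng.randint(-3, 3) for _ in range(4)] for _ in range(4)]
             for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
K, C, T = 2, 3, 5
u_tiles = random_tiles(rng, K, C)  # stand-in for transformed filters
v_tiles = random_tiles(rng, C, T)  # stand-in for transformed input tiles
m = winograd_batched_gemm(u_tiles, v_tiles, K, C, T)
```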
Winograd: output transform
[Figure: Output Feature Map]
• Output transform
• Same as input and filter
• Transform back to pixel space to obtain 2x2 output tile
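To tie the three transforms together, the sketch below runs one 4x4 input tile and one 3x3 filter through the standard F(2x2, 3x3) pipeline and compares the 2x2 result with direct convolution. The matrices are those of the standard variant and are an assumption here, since the slides do not print them. Exact fractions are used so the comparison is not affected by rounding.

```python
from fractions import Fraction

H = Fraction(1, 2)
# Standard F(2x2, 3x3) matrices (assumed; not shown in the slides).
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[1, 0, 0], [H, H, H], [H, -H, H], [0, 0, 1]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def output_transform(m):
    """A^T m A: map a 4x4 tile of products back to a 2x2 tile of pixels."""
    return matmul(matmul(AT, m), transpose(AT))


def winograd_tile(d, g):
    """2x2 output for one 4x4 input tile d and one 3x3 filter g."""
    u = matmul(matmul(G, g), transpose(G))    # filter transform
    v = matmul(matmul(BT, d), transpose(BT))  # input transform
    m = [[u[i][j] * v[i][j] for j in range(4)] for i in range(4)]
    return output_transform(m)


def direct_tile(d, g):
    """The same 2x2 output computed directly in pixel space."""
    return [[sum(d[y + r][x + s] * g[r][s] for r in range(3) for s in range(3))
             for x in range(2)] for y in range(2)]


d = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
g = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
y_winograd = winograd_tile(d, g)
y_direct = direct_tile(d, g)
```

The Winograd path uses 16 multiplications per tile in the product step, compared with 36 for the direct 2x2 output of a 3x3 filter.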
Performance: VGG
[Chart: VGG fp32, totals by operation. x-axis: Batch Size at 64, 32, 16, 8, 4, 2, 1; y-axis: Algorithmic Speedup, 0 to 2; series: Winograd fp32 fprop, Winograd fp32 bprop, Winograd fp32 update, cuDNN fp32 fprop, cuDNN fp32 bprop, cuDNN fp32 update]
Performance: Alexnet convolutional layers
[Chart: Alexnet totals. x-axis: Batch Size at 128, 64, 32, 16, 8, 4; y-axis: Algorithmic Speedup, 0 to 2; series: Nervana fp16, Nervana fp32, cuBLAS fp16, cuBLAS fp32]
Compounding
Compounding inside of GEMM and conv for free:
• alpha / beta
• bias
• relu, prelu, tanh, …
• bprop relu, …
• bprop bias
• batchnorm mean
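A rough illustration of compounding: the scaling, bias, and activation are applied to each accumulated value before it is written out, rather than in separate passes over the output. The sketch assumes a per-column bias and the BLAS-style update alpha·AB + beta·C. `gemm_compound` is a hypothetical name, not an API of neon or cuBLAS.

```python
def gemm_compound(a, b, c, alpha=1.0, beta=0.0, bias=None, relu=False):
    """Compute act(alpha * AB + beta * C + bias) in a single pass over the output."""
    out = []
    for i, row in enumerate(a):
        out_row = []
        for j, col in enumerate(zip(*b)):
            acc = sum(x * y for x, y in zip(row, col))  # plain GEMM accumulation
            val = alpha * acc + beta * c[i][j]          # alpha / beta
            if bias is not None:
                val += bias[j]                          # bias, assumed per output column
            if relu:
                val = max(val, 0.0)                     # activation
            out_row.append(val)
        out.append(out_row)
    return out


result = gemm_compound(
    a=[[1, -2]],
    b=[[1, 0], [0, 1]],
    c=[[10, 10]],
    alpha=2.0,
    beta=0.5,
    bias=[1, -10],
    relu=True,
)
```

Since the accumulated value is already held by the thread that computed it, the extra operations add no additional reads or writes of the output.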
Summary
• Nervana has the fastest tools for deep learning
• neon with state-of-the-art Maxwell kernels
• Nervana Cloud with multi-GPU training
• Watch for Nervana Engine, our deep learning processor
