This session covers many of the techniques we use at Nervana in GPU programming to achieve state-of-the-art performance for deep learning networks. The main focus is the customization of dense linear algebra kernels: Winograd 3x3 convolution, direct convolution, and small-tile GEMM (matrix multiply). In particular, we'll look at how we achieve high utilization at very small minibatches, which is important for multi-GPU scaling and inference. In addition, we'll talk about where and how you can effectively leverage lower and mixed precision to further increase performance without loss of accuracy.
Nervana. Proprietary and confidential. Do not distribute.
High-Performance GPU kernels for deep learning
• Fast matrix multiply for small minibatches
• Direct convolution leveraging GEMM advances
• Even faster convolution with Winograd
GEMM: Memory Load
[Diagram: outer-product memory loads, contiguous vs. strided; thread and memory-load layout for a single GEMM tile and for batched GEMM]
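The memory-load pattern above can be sketched in numpy (an illustrative stand-in for the CUDA kernels, not Nervana's actual code): an outer-product GEMM loads one column of A and one row of B per step of the k-loop and accumulates a rank-1 update into the C tile. Whether each of those loads is contiguous or strided in memory depends on the storage order and the transpose mode (NN vs. TN), which is what the diagram contrasts.

```python
import numpy as np

def gemm_outer_product(A, B):
    """Outer-product GEMM: one rank-1 update of the C tile per k step.

    Each iteration loads a single column of A and a single row of B;
    contiguity of those loads depends on layout and transpose mode.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for k in range(K):
        C += np.outer(A[:, k], B[k, :])  # rank-1 accumulation
    return C

A = np.random.rand(8, 16).astype(np.float32)
B = np.random.rand(16, 4).astype(np.float32)
C = gemm_outer_product(A, B)
```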
GEMM: Tile sizes
[Diagram: thread and shared-memory load layout for GEMM tiles of 32x64 and 32x32, and batched GEMM tiles of 32x32]
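A minimal numpy sketch of the tiling the diagram shows (names like `a_smem` are illustrative; the real kernels are hand-written SASS): each 32x32 output tile stages tile-sized slices of A and B, the shared-memory loads in the diagram, and accumulates their product. Small tiles like 32x32 keep more blocks busy at small N than the 128x64 tiles typical of cuBLAS.

```python
import numpy as np

TILE = 32  # one 32x32 C tile per block, as on the slide

def gemm_tiled(A, B, tile=TILE):
    """Tiled GEMM sketch; dimensions assumed multiples of `tile`."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            acc = np.zeros((tile, tile), dtype=np.float32)
            for k in range(0, K, tile):
                a_smem = A[i:i+tile, k:k+tile]  # slice staged in "shared memory"
                b_smem = B[k:k+tile, j:j+tile]
                acc += a_smem @ b_smem          # per-tile accumulation
            C[i:i+tile, j:j+tile] = acc
    return C

A = np.random.rand(64, 96).astype(np.float32)
B = np.random.rand(96, 32).astype(np.float32)
C = gemm_tiled(A, B)
```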
hGEMM Results - NN
[Chart: Nx3072x3072 NN op. GFLOPS (0 to 6000) vs. batch size N (32, 64, 96, 128); Nervana 32x32 tile vs. cuBLAS 128x64 tile]
hGEMM Results - TN
[Chart: Nx3072x3072 TN op. GFLOPS (0 to 6000) vs. batch size N (32, 64, 96, 128); Nervana 32x32 tile vs. cuBLAS 128x64 tile]
Direct convolution is still relevant
• Striding
• Odd-size filters
• Placeholder until a faster algorithm can be implemented
• Often faster for a single image, or for the first layer with small C
Direct convolution: implementation details
• Batched GEMM for efficient transpose and higher occupancy
• Compound outer product block remapping
• Square wave pattern for P,Q block mapping
• Slicing: shared memory lookup + integer division
• N vs C contiguous
• Single P,Q vs tiled P,Q
• Bprop implemented as an upside-down fprop
• Update-specific optimizations
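The bullets above can be illustrated with a simplified numpy fprop (a sketch, not Nervana's kernel): direct convolution treated as an implicit GEMM, where the flat reduction index is decomposed into (c, r, s) with integer division, as in the "slicing" bullet. Padding, striding, and the batch dimension are omitted for brevity.

```python
import numpy as np

def direct_conv_fprop(I, F):
    """Direct convolution as an implicit GEMM (illustrative sketch).

    I: input (C, H, W); F: filters (K, C, R, S); valid padding, stride 1.
    """
    C, H, W = I.shape
    K, _, R, S = F.shape
    P, Q = H - R + 1, W - S + 1
    Fk = F.reshape(K, C * R * S)          # filters as a GEMM operand
    O = np.zeros((K, P, Q), dtype=I.dtype)
    for p in range(P):
        for q in range(Q):
            col = np.empty(C * R * S, dtype=I.dtype)
            for idx in range(C * R * S):
                c, rs = divmod(idx, R * S)  # integer division recovers
                r, s = divmod(rs, S)        # (c, r, s) from the flat index
                col[idx] = I[c, p + r, q + s]
            O[:, p, q] = Fk @ col           # one GEMM column per output pixel
    return O

I = np.random.rand(2, 5, 5).astype(np.float32)
F = np.random.rand(3, 2, 3, 3).astype(np.float32)
O = direct_conv_fprop(I, F)
```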
Winograd: input transform
[Diagram: input feature map tiled 4x4 with stride 2]
• Input transform
• 2D Winograd is a nested product of 1D transforms
• Transforms can be simplified to remove zeros
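A numpy sketch of the F(2x2, 3x3) input transform (the standard matrices from Lavin and Gray's Winograd paper; variable names are illustrative): the 2D transform is the nested row-then-column application of the 1D transform B^T, and 4x4 tiles are extracted from the feature map with stride 2, so neighbouring tiles overlap by two pixels.

```python
import numpy as np

# 1D input-transform matrix B^T for F(2x2, 3x3), zeros already removed
Bt = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)

def winograd_input_transform(d):
    """Transform one 4x4 input tile: nested 1D transforms (rows, then cols)."""
    return Bt @ d @ Bt.T

# 4x4 tiles with stride 2: each tile yields one 2x2 output tile
fmap = np.random.rand(8, 8).astype(np.float32)
tiles = [winograd_input_transform(fmap[y:y+4, x:x+4])
         for y in range(0, 5, 2) for x in range(0, 5, 2)]
```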
Winograd: filter transform
• Filter transform
• Same as the input transform but with different coefficients
• Transform each feature map independently
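The filter transform in the same numpy sketch (standard F(2x2, 3x3) coefficients, not Nervana-specific): identical nested-1D structure to the input transform, but using the matrix G, and applied to each 3x3 filter independently.

```python
import numpy as np

# Filter-transform matrix G for F(2x2, 3x3)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float32)

def winograd_filter_transform(g):
    """Lift one 3x3 filter into the 4x4 Winograd domain: G g G^T."""
    return G @ g @ G.T

g = np.random.rand(3, 3).astype(np.float32)
U = winograd_filter_transform(g)
```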
Winograd: output transform
[Diagram: output feature map]
• Output transform
• Same as the input and filter transforms
• Transform back to pixel space to obtain a 2x2 output tile
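Putting the three transforms together, a complete F(2x2, 3x3) tile in numpy (standard matrices; a sketch, not the production kernel): transform input and filter, multiply elementwise, then inverse-transform with A^T back to pixel space. In a real layer the elementwise products are summed over input channels, which is where the GEMM lives.

```python
import numpy as np

Bt = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
               [0, -1, 1, 0], [0, 1, 0, -1]], dtype=np.float32)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5], [0, 0, 1]], dtype=np.float32)
At = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=np.float32)

def winograd_2x2_3x3(d, g):
    """One 2x2 output tile from a 4x4 input tile d and a 3x3 filter g."""
    V = Bt @ d @ Bt.T   # input transform
    U = G @ g @ G.T     # filter transform
    M = U * V           # elementwise product (summed over C in a real layer)
    return At @ M @ At.T  # output transform: back to pixel space

# Sanity check against direct (cross-correlation) convolution
d = np.random.rand(4, 4).astype(np.float32)
g = np.random.rand(3, 3).astype(np.float32)
Y = winograd_2x2_3x3(d, g)
direct = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                   for i in range(2)], dtype=np.float32)
```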
Performance: VGG
[Chart: VGG fp32, totals by operation. Algorithmic speedup (0 to 2x) vs. batch size (64 down to 1); Winograd fp32 vs. cuDNN fp32, shown separately for fprop, bprop, and update]
Performance: AlexNet convolutional layers
[Chart: AlexNet totals. Algorithmic speedup (0 to 2x) vs. batch size (128 down to 4); Nervana fp16 and fp32 vs. cuBLAS fp16 and fp32]
Compounding
Compounding inside of GEMM and conv for free:
• alpha / beta
• bias
• relu, prelu, tanh, …
• bprop relu, …
• bprop bias
• batchnorm mean
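A numpy sketch of what compounding buys (the function and its parameters are illustrative, not Nervana's API): in the real kernels the alpha/beta scaling, bias add, and activation are applied while the output tile is still in registers, so they add no extra memory traffic, hence "for free".

```python
import numpy as np

def gemm_compound(A, B, C=None, alpha=1.0, beta=0.0,
                  bias=None, activation=None):
    """GEMM with a compounded epilogue: out = act(alpha*A@B + beta*C + bias).

    In the fused kernel these steps happen in registers before the
    store; here they are ordinary array ops for clarity.
    """
    out = alpha * (A @ B)
    if C is not None and beta != 0.0:
        out = out + beta * C        # alpha / beta blending
    if bias is not None:
        out = out + bias            # compounded bias add
    if activation is not None:
        out = activation(out)       # relu, prelu, tanh, ...
    return out

relu = lambda x: np.maximum(x, 0.0)
A = np.random.randn(4, 8).astype(np.float32)
B = np.random.randn(8, 3).astype(np.float32)
b = np.random.randn(3).astype(np.float32)
Y = gemm_compound(A, B, bias=b, activation=relu)
```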
Summary
• Nervana has the fastest tools for deep learning
• neon with state-of-the-art Maxwell kernels
• Nervana Cloud with multi-GPU training
• Watch for Nervana Engine, our deep learning processor