This session covers many of the techniques we use at Nervana in GPU programming to achieve state-of-the-art performance for deep learning networks. The main focus is the customization of dense linear algebra kernels: Winograd 3x3 convolution, direct convolution, and small-tile GEMM (matrix multiply). In particular, we'll look at how we achieve high utilization at very small minibatches, which is important for multi-GPU scaling and inference. In addition, we'll talk about where and how you can effectively leverage lower and mixed precision to further increase performance without loss of accuracy.
Nervana. Proprietary and confidential. Do not distribute.
High-Performance GPU kernels for deep learning
• Fast matrix multiply for small minibatches
• Direct convolution leveraging GEMM advances
• Even faster convolution with Winograd
GEMM: Memory Load
[Diagram: outer-product memory loads, contiguous vs. strided; thread and memory-load layout for a single GEMM tile and for batched GEMM]
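The memory-load pattern above can be sketched in numpy (an illustrative stand-in for the CUDA kernels, not Nervana's actual code): an outer-product GEMM loads one column of A and one row of B per step of the k-loop and accumulates a rank-1 update into the C tile. Whether each of those loads is contiguous or strided in memory depends on the storage order and the transpose mode (NN vs. TN), which is what the diagram contrasts.

```python
import numpy as np

def gemm_outer_product(A, B):
    """Outer-product GEMM: one rank-1 update of the C tile per k step.

    Each iteration loads a single column of A and a single row of B;
    contiguity of those loads depends on layout and transpose mode.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for k in range(K):
        C += np.outer(A[:, k], B[k, :])  # rank-1 accumulation
    return C

A = np.random.rand(8, 16).astype(np.float32)
B = np.random.rand(16, 4).astype(np.float32)
C = gemm_outer_product(A, B)
```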
GEMM: Tile sizes
[Diagram: thread and shared-memory load layout for GEMM tiles of 32x64 and 32x32, and batched GEMM tiles of 32x32]
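A minimal numpy sketch of the tiling the diagram shows (names like `a_smem` are illustrative; the real kernels are hand-written SASS): each 32x32 output tile stages tile-sized slices of A and B, the shared-memory loads in the diagram, and accumulates their product. Small tiles like 32x32 keep more blocks busy at small N than the 128x64 tiles typical of cuBLAS.

```python
import numpy as np

TILE = 32  # one 32x32 C tile per block, as on the slide

def gemm_tiled(A, B, tile=TILE):
    """Tiled GEMM sketch; dimensions assumed multiples of `tile`."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            acc = np.zeros((tile, tile), dtype=np.float32)
            for k in range(0, K, tile):
                a_smem = A[i:i+tile, k:k+tile]  # slice staged in "shared memory"
                b_smem = B[k:k+tile, j:j+tile]
                acc += a_smem @ b_smem          # per-tile accumulation
            C[i:i+tile, j:j+tile] = acc
    return C

A = np.random.rand(64, 96).astype(np.float32)
B = np.random.rand(96, 32).astype(np.float32)
C = gemm_tiled(A, B)
```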
hGEMM Results - NN
[Chart: Nx3072x3072 NN op. GFLOPS (0 to 6000) vs. batch size N (32, 64, 96, 128); Nervana 32x32 tile vs. cuBLAS 128x64 tile]
hGEMM Results - TN
[Chart: Nx3072x3072 TN op. GFLOPS (0 to 6000) vs. batch size N (32, 64, 96, 128); Nervana 32x32 tile vs. cuBLAS 128x64 tile]
Direct convolution is still relevant
• Striding
• Odd-size filters
• Placeholder until a faster algorithm can be implemented
• Often faster for a single image, or for the first layer with small C
Direct convolution: implementation details
• Batched GEMM for efficient transpose and higher occupancy
• Compound outer product block remapping
• Square wave pattern for P,Q block mapping
• Slicing: shared memory lookup + integer division
• N vs C contiguous
• Single P,Q vs tiled P,Q
• Bprop implemented as an upside-down fprop
• Update-specific optimizations
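The bullets above can be illustrated with a simplified numpy fprop (a sketch, not Nervana's kernel): direct convolution treated as an implicit GEMM, where the flat reduction index is decomposed into (c, r, s) with integer division, as in the "slicing" bullet. Padding, striding, and the batch dimension are omitted for brevity.

```python
import numpy as np

def direct_conv_fprop(I, F):
    """Direct convolution as an implicit GEMM (illustrative sketch).

    I: input (C, H, W); F: filters (K, C, R, S); valid padding, stride 1.
    """
    C, H, W = I.shape
    K, _, R, S = F.shape
    P, Q = H - R + 1, W - S + 1
    Fk = F.reshape(K, C * R * S)          # filters as a GEMM operand
    O = np.zeros((K, P, Q), dtype=I.dtype)
    for p in range(P):
        for q in range(Q):
            col = np.empty(C * R * S, dtype=I.dtype)
            for idx in range(C * R * S):
                c, rs = divmod(idx, R * S)  # integer division recovers
                r, s = divmod(rs, S)        # (c, r, s) from the flat index
                col[idx] = I[c, p + r, q + s]
            O[:, p, q] = Fk @ col           # one GEMM column per output pixel
    return O

I = np.random.rand(2, 5, 5).astype(np.float32)
F = np.random.rand(3, 2, 3, 3).astype(np.float32)
O = direct_conv_fprop(I, F)
```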
Winograd: input transform
[Diagram: input feature map tiled 4x4 with stride 2]
• Input transform
• 2D Winograd is a nested product of 1D transforms
• Transforms can be simplified to remove zeros
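A numpy sketch of the F(2x2, 3x3) input transform (the standard matrices from Lavin and Gray's Winograd paper; variable names are illustrative): the 2D transform is the nested row-then-column application of the 1D transform B^T, and 4x4 tiles are extracted from the feature map with stride 2, so neighbouring tiles overlap by two pixels.

```python
import numpy as np

# 1D input-transform matrix B^T for F(2x2, 3x3), zeros already removed
Bt = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)

def winograd_input_transform(d):
    """Transform one 4x4 input tile: nested 1D transforms (rows, then cols)."""
    return Bt @ d @ Bt.T

# 4x4 tiles with stride 2: each tile yields one 2x2 output tile
fmap = np.random.rand(8, 8).astype(np.float32)
tiles = [winograd_input_transform(fmap[y:y+4, x:x+4])
         for y in range(0, 5, 2) for x in range(0, 5, 2)]
```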
Winograd: filter transform
• Filter transform
• Same as the input transform but with different coefficients
• Transform each feature map independently
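The filter transform in the same numpy sketch (standard F(2x2, 3x3) coefficients, not Nervana-specific): identical nested-1D structure to the input transform, but using the matrix G, and applied to each 3x3 filter independently.

```python
import numpy as np

# Filter-transform matrix G for F(2x2, 3x3)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float32)

def winograd_filter_transform(g):
    """Lift one 3x3 filter into the 4x4 Winograd domain: G g G^T."""
    return G @ g @ G.T

g = np.random.rand(3, 3).astype(np.float32)
U = winograd_filter_transform(g)
```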
Winograd: output transform
[Diagram: output feature map]
• Output transform
• Same as the input and filter transforms
• Transform back to pixel space to obtain a 2x2 output tile
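Putting the three transforms together, a complete F(2x2, 3x3) tile in numpy (standard matrices; a sketch, not the production kernel): transform input and filter, multiply elementwise, then inverse-transform with A^T back to pixel space. In a real layer the elementwise products are summed over input channels, which is where the GEMM lives.

```python
import numpy as np

Bt = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
               [0, -1, 1, 0], [0, 1, 0, -1]], dtype=np.float32)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5], [0, 0, 1]], dtype=np.float32)
At = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=np.float32)

def winograd_2x2_3x3(d, g):
    """One 2x2 output tile from a 4x4 input tile d and a 3x3 filter g."""
    V = Bt @ d @ Bt.T   # input transform
    U = G @ g @ G.T     # filter transform
    M = U * V           # elementwise product (summed over C in a real layer)
    return At @ M @ At.T  # output transform: back to pixel space

# Sanity check against direct (cross-correlation) convolution
d = np.random.rand(4, 4).astype(np.float32)
g = np.random.rand(3, 3).astype(np.float32)
Y = winograd_2x2_3x3(d, g)
direct = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                   for i in range(2)], dtype=np.float32)
```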
Performance: VGG
[Chart: VGG fp32, totals by operation. Algorithmic speedup (0 to 2x) vs. batch size (64 down to 1); Winograd fp32 vs. cuDNN fp32, shown separately for fprop, bprop, and update]
Performance: AlexNet convolutional layers
[Chart: AlexNet totals. Algorithmic speedup (0 to 2x) vs. batch size (128 down to 4); Nervana fp16 and fp32 vs. cuBLAS fp16 and fp32]
Compounding
Compounding inside of GEMM and conv for free:
• alpha / beta
• bias
• relu, prelu, tanh, …
• bprop relu, …
• bprop bias
• batchnorm mean
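A numpy sketch of what compounding buys (the function and its parameters are illustrative, not Nervana's API): in the real kernels the alpha/beta scaling, bias add, and activation are applied while the output tile is still in registers, so they add no extra memory traffic, hence "for free".

```python
import numpy as np

def gemm_compound(A, B, C=None, alpha=1.0, beta=0.0,
                  bias=None, activation=None):
    """GEMM with a compounded epilogue: out = act(alpha*A@B + beta*C + bias).

    In the fused kernel these steps happen in registers before the
    store; here they are ordinary array ops for clarity.
    """
    out = alpha * (A @ B)
    if C is not None and beta != 0.0:
        out = out + beta * C        # alpha / beta blending
    if bias is not None:
        out = out + bias            # compounded bias add
    if activation is not None:
        out = activation(out)       # relu, prelu, tanh, ...
    return out

relu = lambda x: np.maximum(x, 0.0)
A = np.random.randn(4, 8).astype(np.float32)
B = np.random.randn(8, 3).astype(np.float32)
b = np.random.randn(3).astype(np.float32)
Y = gemm_compound(A, B, bias=b, activation=relu)
```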
Summary
• Nervana has the fastest tools for deep learning
• neon with state-of-the-art Maxwell kernels
• Nervana Cloud with multi-GPU training
• Watch for Nervana Engine, our deep learning processor