3. THE WORLD LEADER IN VISUAL COMPUTING
PC: GeForce | Quadro
DATA CENTER: Tesla | GRID
MOBILE: Tegra | SHIELD
GAMING
DESIGN
ENTERPRISE VIRTUALIZATION
AUTONOMOUS MACHINES
HPC & CLOUD SERVICE PROVIDERS
6. 3x the Performance of the Previous Generation (Fermi vs. Kepler)
                              Tesla M2090   Tesla K40
CUDA cores                    512           2880
Double precision, peak        665 GF        1.43 TF
Double precision, DGEMM       400 GF        1.33 TF
Single precision, peak        1.33 TF       4.29 TF
Single precision, SGEMM       0.89 TF       3.22 TF
Memory bandwidth              178 GB/s      288 GB/s
Memory size                   6 GB          12 GB
Power consumption             225 W         235 W
[Chart: Double Precision FLOPS (DGEMM), in TFLOPS. Tesla M2090: 0.40, Tesla K40: 1.33]
[Chart: Single Precision FLOPS (SGEMM), in TFLOPS. Tesla M2090: 0.89, Tesla K40: 3.22]
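As a quick check of the "3x" headline, the ratios below use only the measured GEMM figures quoted on this slide (not the theoretical peaks), rounded to one decimal place:

```latex
\frac{3.22\ \text{TFLOPS}}{0.89\ \text{TFLOPS}} \approx 3.6 \quad \text{(SGEMM)}, \qquad
\frac{1.33\ \text{TFLOPS}}{0.40\ \text{TFLOPS}} \approx 3.3 \quad \text{(DGEMM)}
```

Both measured ratios come out slightly above 3x for the Tesla K40 relative to the Tesla M2090.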
7. Tesla Kepler Family
World’s Fastest and Most Efficient HPC Accelerators
GPU   Single Precision    Double Precision    Memory   Memory Bandwidth   PCIe    System
      Peak (SGEMM)        Peak (DGEMM)        Size     (ECC off)          Gen     Solution
K80   5.6 TF              1.87 TF             24 GB    480 GB/s           Gen 3   Server
K40   4.29 TF (3.22 TF)   1.43 TF (1.33 TF)   12 GB    288 GB/s           Gen 3   Server + Workstation
Applications (K80 and K40): CFD, BioChemistry, Neural Networks, High Energy Physics, Graph analytics, Material Science, BioInformatics, M&E
8. The New Quadro Family
                     M6000             K6000             K5200             K4200           K2200           K620          K420
# CUDA Cores         3072              2880              2304              1344            640             384           192
Compute Capability   5.2               3.5               3.5               3.0             5.0             5.0           3.0
Single Precision     6.8 TFLOPs        5.2 TFLOPs        3.1 TFLOPs        2.1 TFLOPs      1.3 TFLOPs      0.8 TFLOPs    0.3 TFLOPs
Memory Size          12 GB             12 GB             8 GB              4 GB            4 GB            2 GB          1 GB
Memory BW            317 GB/s          288 GB/s          192 GB/s          173 GB/s        80 GB/s         29 GB/s       29 GB/s
Display Connectors   4x DP + 1x DVI*   2x DP + 2x DVI*   2x DP + 2x DVI*   2x DP + DVI*    2x DP + DVI*    DP + DVI*     DP + DVI*
Max Displays         4                 4                 4                 4               4               4             4
Board Power          250 W             225 W             150 W             108 W           60 W            41 W          41 W
PCIe Gen: 3.0 or 2.0, depending on model
Max Resolution: 4096 x 2160 or 3840 x 2160, depending on model
Pro Features: SDI, SYNC, STEREO, MOSAIC, NVIEW or MOSAIC, NVIEW, depending on model
* DisplayPort 1.2 multi-streaming can be used to drive multiple displays from a single DP connector
9. CUDA and GPU Computing
CUDA
– Compute Unified Device Architecture
– Runs on Linux, Windows, and Mac OS X (+ Android)
– Current version: 7.5
GPU Computing
– General-purpose computing on GPUs
– GPU = Graphics Processing Unit
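To make "general-purpose computing on GPUs" concrete, here is a minimal sketch of element-wise vector addition in CUDA. It is an illustration only and does not come from the slides; the names `vec_add_kernel` and `vec_add_on_gpu` are hypothetical, and the block size of 256 threads is an arbitrary assumption.

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

// Each GPU thread adds one pair of elements.
__global__ void vec_add_kernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // guard: the last block may be partly idle
}

// Hypothetical host helper: copies the inputs to the GPU, launches the kernel,
// and copies the result back. Returns true on success.
bool vec_add_on_gpu(const float* a, const float* b, float* c, int n) {
    assert(n > 0);
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;

    bool ok = cudaMalloc(&d_a, bytes) == cudaSuccess
           && cudaMalloc(&d_b, bytes) == cudaSuccess
           && cudaMalloc(&d_c, bytes) == cudaSuccess
           && cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        int threads = 256;                         // assumed block size
        int blocks = (n + threads - 1) / threads;  // round up to cover all n elements
        vec_add_kernel<<<blocks, threads>>>(d_a, d_b, d_c, n);
        // The device-to-host copy waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    // cudaFree on a null pointer is a no-op, so this is safe after a partial failure.
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return ok;
}
```

A caller passes three host arrays of length `n` and checks the return value. The host code is ordinary C++, and only the `__global__` function runs on the GPU.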
10. The Developer Base Is Growing at an Accelerating Pace
                      2008          2014
Academic Papers       4,000         57,000
CUDA Downloads        150,000       2.5 million
University Courses    60            770
CUDA-Capable GPUs     100 million   520 million
Supercomputers        1             44
34. Instructions Available Only Within a Warp
Example: SHFL (shuffle instruction)
Exchanges data between threads that belong to the same warp.
Does not use shared memory.
Exchanges 32-bit values.
Four variants (input lanes: a b c d e f g h):
__shfl()        Indexed any-to-any             h d f e a c c b
__shfl_up()     Shift right to nth neighbour   - - a b c d e f
__shfl_down()   Shift left to nth neighbour    c d e f g h - -
__shfl_xor()    Butterfly (XOR) exchange       c d a b g h e f
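A common use of the shift-left variant is a warp-level sum that never touches shared memory. The sketch below is an illustration, not code from the slides. It assumes a CUDA 9 or later toolkit, where the shuffle intrinsics take an explicit lane mask and carry a `_sync` suffix (`__shfl_down_sync`); with the CUDA 7.5 toolkit named in this deck, the equivalent call would be `__shfl_down(v, offset)`. The names `warp_reduce_sum`, `warp_sum_kernel`, and `warp_sum` are hypothetical.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Sum the values held by the 32 lanes of one warp using shuffle-down.
// Each step folds the upper half of the remaining partial sums onto the lower half.
__device__ float warp_reduce_sum(float v) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        // Every lane reads the value held by the lane `offset` positions above it.
        v += __shfl_down_sync(0xffffffffu, v, offset);
    }
    return v;  // only lane 0 is guaranteed to hold the full sum
}

// Launch with exactly one warp (32 threads); lane 0 writes the result.
__global__ void warp_sum_kernel(const float* in, float* out) {
    float v = warp_reduce_sum(in[threadIdx.x]);
    if (threadIdx.x == 0) *out = v;
}

// Hypothetical host helper: sums 32 floats on the GPU using a single warp.
float warp_sum(const float* host_in) {
    const int kLanes = 32;  // warp size assumed to be 32
    float *d_in = nullptr, *d_out = nullptr;
    float result = 0.0f;

    cudaError_t err = cudaMalloc(&d_in, kLanes * sizeof(float));
    assert(err == cudaSuccess);
    err = cudaMalloc(&d_out, sizeof(float));
    assert(err == cudaSuccess);
    err = cudaMemcpy(d_in, host_in, kLanes * sizeof(float), cudaMemcpyHostToDevice);
    assert(err == cudaSuccess);

    warp_sum_kernel<<<1, kLanes>>>(d_in, d_out);

    // The device-to-host copy waits for the kernel to finish.
    err = cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;  // silence unused-variable warnings when asserts are compiled out

    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Swapping `__shfl_down_sync` for `__shfl_xor_sync` with the same offsets gives the butterfly form, in which every lane (not just lane 0) ends up holding the total.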