1. Massively Parallel Computing
CS 264 / CSCI E-292
Lecture #4: Intermediate-level CUDA | February 15th, 2011
Nicolas Pinto (MIT, Harvard)
pinto@mit.edu
2. Administrivia
• HW1: due Fri 2/18/11 (this week)
• Projects: think about it, consult the staff
• New guest lecturers!
• Max Lin (Google), Kurt Messersmith et al. (Amazon),
David Rich et al. (Microsoft)
3. During this course, we'll try to reuse existing material, adapted for CS264 ;-)
13. API
• The host API is exposed as two different levels
  – The low-level Device API (prefix: cu)
  – The high-level Runtime API (prefix: cuda)
• Some things can be done through both APIs; others are specialized
• Can be mixed together (with care)
Spring 2008, Johns Hopkins University. Copyright © Matthew Bolitho 2008
14. API
• All GPU computing is performed on a device
• To allocate memory, run a program, etc. on the hardware, we need a device context
• Device contexts are bound 1:1 with host threads (just like OpenGL!)
• So, each host thread may have at most one device context
• And, each device context is accessible from only one host thread
15. API
• All device API calls return an error/success code of type CUresult
• All runtime API calls return an error/success code of type cudaError_t
• An integer value with zero = no error
• cudaGetLastError, cudaGetErrorString
16. API
• Runtime API calls automatically initialize
• Device API programs must first call cuInit
22. API
• The first (optional) step is to enumerate the available devices:
  – cuDeviceGetCount
  – cuDeviceGet
  – cuDeviceGetName
  – cuDeviceTotalMem
  – cuDeviceGetAttribute
  – ...
23. API
• Once we choose a device with cuDeviceGet, we get a device handle of type CUdevice
• Can now create a context with cuCtxCreate
24. API
• Once we have a context (CUcontext), we can allocate memory, call a GPU function, etc.
• Context is implicitly associated with the creating thread
• To synchronize all threads (CPU host with GPU threads), call cuCtxSynchronize
• Waits for all GPU tasks to finish
25. API
• Allocate/free memory:
  – cuMemAlloc, cuMemFree
• Initialize memory:
  – cuMemset
• Copy memory:
  – cuMemcpyHtoD, cuMemcpyDtoH, cuMemcpyDtoD
26. API
• A module is a blob of GPU code+data along with some type information
  – .cubin files
• A module is created by loading a .cubin with cuModuleLoad or cuModuleLoadData
• A module can be unloaded with cuModuleUnload
27. API
• Loading a module also copies it to the device
• Can then get the address of functions and global variables:
  – cuModuleGetFunction
  – cuModuleGetGlobal
  – cuModuleGetTexRef
28. API
• Once a module is loaded, and we have a function pointer, we can call a function
• We must set up the execution environment first
29. API
• Execution environment includes:
  – Thread block size
  – Shared memory size
  – Function parameters
  – Grid size
30. API
• Thread block size: cuFuncSetBlockShape
• Shared memory size: cuFuncSetSharedSize
• Function parameters: cuParamSetSize, cuParamSeti, cuParamSetf, cuParamSetv
31. API
• Grid size is set at the same time as the function invocation: cuLaunchGrid
32. Outline
• CUDA Language & APIs (overview)
• Threading/Execution (cont’d)
• Memory/Communication (cont’d)
• Tools
• Libraries
37. Indexing Arrays: Example
In this example, the red entry would have an index of 21:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
M = 8 threads/block
blockIdx.x = 2
int index = threadIdx.x + blockIdx.x * M;
= 5 + 2 * 8;
= 21;
38. Addition with Threads and Blocks
The blockDim.x built-in variable gives the number of threads per block:
int index = threadIdx.x + blockIdx.x * blockDim.x;
A combined version of our vector addition kernel using blocks and threads:
__global__ void add( int *a, int *b, int *c ) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}
58. Constant Memory
Constants set by CPU, read by GPU
Each SM has 8kiB cache for constants
Optimized for broadcast
Accessing different elements forces serialisation
Can speed some calculations
Can relieve register pressure
59. Constant Memory
Declared at file scope
__constant__ float dc_myConst;
Set via the cudaMemcpyToSymbol API call (the source argument must be a pointer)
float pi = 3.14f;
cudaMemcpyToSymbol( dc_myConst, &pi, sizeof(float) );
Accessed by name in kernel
__global__ void MyKernel( ... ) {
    ...
    float myVal = dc_myConst + 1;
    ...
}
60. Textures
Textures are essentially look up tables
Can only be written by the host
Cached on each multiprocessor (8kiB)
Optimised for 2D spatial locality
Hardware interpolation possible
Limited precision
Can clamp or wrap at boundaries
61. Textures
Declaration and setup rather involved
See programming guide
Accessed in kernels via texture fetches:
tex1D, tex2D, tex3D, etc.
Coordinates are at texel centres
Have to take care when accessing elements
62. Textures
Can improve load coalescing from global memory
If the whole texture fits in the 8kiB cache, it stays cached for the lifetime of the grid
Clamping/wrapping can aid edge case handling
Have to test to determine benefits
63. General Principles
Memory access patterns are crucial
Even CPUs are typically memory bound
GPUs have ~100x the CPU's floating-point throughput
but only ~10x its memory bandwidth
Have to keep the GPU busy
64. PC Architecture
[Diagram: the CPU connects to the Northbridge over the Front Side Bus (8 GB/s);
the Northbridge connects to DRAM over the Memory Bus (25+ GB/s) and to the
graphics card over the PCI Express bus; on the card, 160+ GB/s to VRAM;
the Southbridge handles SATA (3+ Gb/s), Ethernet, etc.]
modified from Matthew Bolitho
70. PCIe Transfers
PCIe 2.0 x16 bus has:
Latency of 10 µs (observed)
Bandwidth of 8 GB/s (theory), 5 GB/s (observed)
A lot of calculations can happen in these times
71. PCIe Transfers
PCIe transfers occur via DMA
GPU reads pages directly from CPU memory
Very bad if a page gets moved mid-transfer
CUDA maintains internal pinned memory buffers
Used for cudaMemcpy calls
Data staged through these
86. PCIe Transfers Optimization
PCIe bus is slow
Try to minimize transfers
Use pinned memory on host whenever possible
Try to perform copies asynchronously
87. Outline
• CUDA Language & APIs (overview)
• Threading/Execution (cont’d)
• Memory/Communication (cont’d)
• Tools
• Libraries
104. CUBLAS Performance
Up to 2x average speedup over CUBLAS 3.1
Less variation in performance for different dimensions vs. 3.1
[Bar chart: speedup vs. MKL for CUBLAS 3.1 and 3.2, matrix dimensions (NxN) from 1024 to 7168]
Average speedup of {S/D/C/Z}GEMM x {NN,NT,TN,TT}
CUBLAS 3.2 vs. 3.1 on NVIDIA Tesla C2050 GPU; MKL 10.2 on Quad-Core Intel Core i7 (Nehalem)
106. CULA (LAPACK for heterogeneous systems)
GPU-accelerated linear algebra, developed in partnership with NVIDIA
• Dense linear algebra
• C/C++ & FORTRAN
• 150+ routines
• MATLAB interface: 15+ functions
• Supercomputer speeds: 7x–10x speedups
107. CULA - Performance
Supercomputing Speeds
This graph shows the relative speed of many CULA functions on a Fermi-class GPU
compared to an Intel Core i7 860. More at www.culatools.com
109. Sparse Matrix Performance: CPU vs. GPU
Multiplication of a sparse matrix by multiple vectors
[Bar chart: 5x–35x speedup over MKL 10.2 for "non-transposed" and "transposed" cases]
Average speedup across S,D,C,Z
CUSPARSE 3.2 on NVIDIA Tesla C2050 GPU; MKL 10.2 on Quad-Core Intel Core i7 (Nehalem)
134. OpenVIDIA
Open source, supported by NVIDIA
Computer Vision Workbench (CVWB)
GPU imaging & computer vision
Demonstrates most commonly used image
processing primitives on CUDA
Demos, code & tutorials/information
http://openvidia.sourceforge.net
136. References
• CUDA C Programming Guide
• CUDA C Best Practices Guide
• CUDA Reference Manual
• API Reference, PTX ISA 2.2
• CUDA-GDB User Manual
• Visual Profiler Manual
• User Guides: CUBLAS, CUFFT, CUSPARSE, CURAND
http://developer.nvidia.com/object/gpucomputing.html