1. 6.963 CUDA@MIT / IAP09
Supercomputing on your desktop:
Programming the next generation of cheap
and massively parallel hardware using CUDA
Lecture 07 #2
CUDA - Advanced
Nicolas Pinto (MIT)
2. During this course, we'll try to "adapt" and use existing material for 6.963 ;-)
5. Here are the keys
to High-Performance in CUDA
6. Warning!
To optimize or not to optimize
Hoare said (and Knuth restated)
“Premature optimization is the root of all evil.”
slide by Johan Seland
7. Warning!
To optimize or not to optimize
Hoare said (and Knuth restated)
“We should forget about small efficiencies, say about
97% of the time:
Premature optimization is the root of all evil.”
⇓
3% of the time we really should worry about small efficiencies
(every 33rd code line)
slide by Johan Seland
8. 6.963 CUDA@MIT / IAP09
Strategy
Memory Optimizations
Execution Optimizations
9. 6.963 CUDA@MIT / IAP09
CUDA
Performance Strategies
10. Strategy
Optimization goals
We should strive to reach GPU performance
We must know the GPU performance
Vendor specifications
Synthetic benchmarks
Choose a performance metric
Memory bandwidth or GFLOPS?
Use clock() to measure
Experiment and profile!
slide by Johan Seland
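One way to act on this slide is to pick effective memory bandwidth as the metric and measure it directly. The following is a minimal sketch, not part of the original deck: the kernel name copyKernel, the problem size N, and the use of CUDA events (an alternative to clock()) are illustrative assumptions.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void copyKernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                 // reads 4 bytes, writes 4 bytes
}

int main()
{
    const int N = 1 << 24;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    copyKernel<<<(N + 255) / 256, 256>>>(d_out, d_in, N);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);    // milliseconds

    // Effective bandwidth: bytes read plus bytes written, per second.
    double gbps = (2.0 * N * sizeof(float)) / (ms * 1.0e6);
    printf("time = %.3f ms, effective bandwidth = %.1f GB/s\n", ms, gbps);

    cudaFree(d_in); cudaFree(d_out);
    return 0;
}

Comparing the measured number against the vendor peak quoted later in the deck (about 80 GB/s for a Quadro FX 5600) shows how far a kernel is from the hardware limit.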
18. 6.963 CUDA@MIT / IAP09
Memory
Optimizations
19. Memory
Memory optimizations
Optimizing memory transfers
Coalescing global memory accesses
Using shared memory effectively
20. Memory
Data Transfers
Device memory to host memory bandwidth much lower than device memory to device memory bandwidth
4 GB/s peak (PCI-e x16) vs. 80 GB/s peak (Quadro FX 5600)
8 GB/s for PCI-e 2.0
Minimize transfers
Intermediate data structures can be allocated, operated on, and deallocated without ever copying them to host memory
Group transfers
One large transfer much better than many small ones
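As an illustration of the "group transfers" advice, here is a minimal sketch that is not from the original slides; NUM_CHUNKS, CHUNK, and the host-side staging buffer are assumptions made for the example.

#include <cuda_runtime.h>
#include <cstring>

const int NUM_CHUNKS = 1024;
const int CHUNK      = 256;            // floats per chunk

void transfer_small(float *dst, float *chunks[NUM_CHUNKS])
{
    // Many small transfers: each cudaMemcpy pays the per-call overhead.
    for (int i = 0; i < NUM_CHUNKS; ++i)
        cudaMemcpy(dst + i * CHUNK, chunks[i], CHUNK * sizeof(float),
                   cudaMemcpyHostToDevice);
}

void transfer_grouped(float *dst, float *chunks[NUM_CHUNKS])
{
    // Pack on the host, then issue one large transfer.
    static float staging[NUM_CHUNKS * CHUNK];
    for (int i = 0; i < NUM_CHUNKS; ++i)
        memcpy(staging + i * CHUNK, chunks[i], CHUNK * sizeof(float));
    cudaMemcpy(dst, staging, sizeof(staging), cudaMemcpyHostToDevice);
}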
21. Memory
Page-Locked Memory Transfers
cudaMallocHost() allows allocation of page-locked host memory
Enables highest cudaMemcpy performance
3.2 GB/s common on PCI-express (x16)
~4 GB/s measured on nForce 680i motherboards (overclocked PCI-e)
See the "bandwidthTest" CUDA SDK sample
Use with caution
Allocating too much page-locked memory can reduce overall system performance
Test your systems and apps to learn their limits
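A minimal sketch of the cudaMallocHost() pattern described above (the 64 MB buffer size is an illustrative assumption):

#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 64 << 20;
    float *h_pinned = 0, *d_buf = 0;

    cudaMallocHost((void**)&h_pinned, bytes);   // page-locked host memory
    cudaMalloc((void**)&d_buf, bytes);

    // Transfers from pinned memory reach the highest cudaMemcpy bandwidth;
    // the SDK "bandwidthTest" sample measures the difference.
    cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);                     // not free()!
    return 0;
}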
22. gmem
Global Memory Reads/Writes
Highest latency instructions: 400-600 clock cycles
Likely to be a performance bottleneck
Optimizations can greatly increase performance
Coalescing: up to 10x speedup
Latency hiding: up to 2.5x speedup
23. gmem
Accessing global memory
4 cycles to issue a memory fetch
but 400-600 cycles of latency
The equivalent of 100 MADs
Likely to be a performance bottleneck
Order of magnitude speedups possible
Coalesce memory access
Use shared memory to re-order non-coalesced addressing
slide by Johan Seland
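A minimal sketch of the two access patterns this slide contrasts; the kernel names coalesced and strided are illustrative:

__global__ void coalesced(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // consecutive threads read
    if (i < n) out[i] = in[i] + 1.0f;                // consecutive addresses
}

__global__ void strided(float *out, const float *in, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;  // gaps between the
    if (i < n) out[i] = in[i] + 1.0f;                          // threads' addresses
}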
27. gmem
Coalescing: Timing Results
Experiment on G80:
Kernel: read a float, increment, write back
3M floats (12MB)
Times averaged over 10K runs
12K blocks x 256 threads:
356 µs - coalesced
357 µs - coalesced, some threads don't participate
3,494 µs - permuted/misaligned thread access
28. gmem
Coalescing: Structures of size ≠ 4, 8, 16 Bytes
Use a Structure of Arrays (SoA) instead of Array of Structures (AoS)
If SoA is not viable:
Force structure alignment: __align(X), where X = 4, 8, or 16
Use SMEM to achieve coalescing
x y z Point structure
x y z x y z x y z AoS
x x x y y y z z z SoA
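A minimal sketch of the layouts in the diagram above; the field and kernel names are illustrative, and __align__(16) is the CUDA alignment qualifier the slide writes as __align(X):

struct PointAoS { float x, y, z; };                 // 12-byte element: breaks coalescing

struct PointsSoA {                                  // one array per component
    float *x;
    float *y;
    float *z;
};

__global__ void scaleAoS(PointAoS *p, int n)        // thread i touches bytes 12*i .. 12*i+11
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x *= 2.0f;
}

__global__ void scaleSoA(PointsSoA p, int n)        // thread i touches p.x[i]: coalesced
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p.x[i] *= 2.0f;
}

// If AoS must be kept, forcing 16-byte alignment helps:
struct __align__(16) PointAligned { float x, y, z, pad; };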
29. gmem
Coalescing: Summary
Coalescing greatly improves throughput
Critical to memory-bound kernels
Reading structures of size other than 4, 8, or 16 bytes will break coalescing:
Prefer Structures of Arrays over AoS
If SoA is not viable, read/write through SMEM
Additional resources:
Aligned Types SDK Sample
30. smem
Parallel Memory Architecture
In a parallel machine, many threads access memory
Therefore, memory is divided into banks
Essential to achieve high bandwidth
Each bank can service one address per cycle
A memory can service as many simultaneous accesses as it has banks
Multiple simultaneous accesses to a bank result in a bank conflict
Conflicting accesses are serialized
[Figure: shared memory organized as Banks 0-15]
31. smem
Bank Addressing Examples
No Bank Conflicts: linear addressing, stride == 1
No Bank Conflicts: random 1:1 permutation
[Figure: Threads 0-15 mapping onto Banks 0-15 with no conflicts]
32. smem
Bank Addressing Examples
2-way Bank Conflicts: linear addressing, stride == 2
8-way Bank Conflicts: linear addressing, stride == 8
[Figure: Threads 0-15 mapping onto Banks 0-15 with 2-way (x2) and 8-way (x8) conflicts]
33. smem
How addresses map to banks on G80
Bandwidth of each bank is 32 bits per 2 clock cycles
Successive 32-bit words are assigned to successive banks
G80 has 16 banks
So bank = address % 16
Same as the size of a half-warp
No bank conflicts between different half-warps, only within a single half-warp
34. smem
Shared memory bank conflicts
Shared memory is as fast as registers if there are no bank conflicts
The fast case:
If all threads of a half-warp access different banks, there is no bank conflict
If all threads of a half-warp read the identical address, there is no bank conflict (broadcast)
The slow case:
Bank Conflict: multiple threads in the same half-warp access the same bank
Must serialize the accesses
Cost = max # of simultaneous accesses to a single bank
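A minimal sketch, not from the original slides, of the usual fix for the slow case: pad a 16-wide shared-memory tile by one column so that accesses with a stride of 16 words from a half-warp land in distinct banks. TILE and the kernel name are assumptions, and the kernel only transposes a single 16x16 tile.

#define TILE 16

__global__ void transposeTile(const float *in, float *out)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 column of padding

    int tx = threadIdx.x, ty = threadIdx.y;

    tile[ty][tx] = in[ty * TILE + tx];       // consecutive addresses: no conflict
    __syncthreads();

    // Read the transposed element: the threads of a half-warp (tx = 0..15, ty fixed)
    // access tile[0][ty], tile[1][ty], ... i.e. addresses TILE+1 words apart.
    // Without the padding the stride would be 16 words: a 16-way bank conflict.
    out[ty * TILE + tx] = tile[tx][ty];
}

This is the same +1 padding that appears in the transpose kernel later in the deck.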
35. Strategy
Use the right kind of memory
Constant memory:
Quite small, ≈ 20K
As fast as register access if all threads in a warp access the
same location
Texture memory:
Spatially cached
Optimized for 2D locality
Neighboring threads should read neighboring addresses
No need to think about coalescing
Constraint:
These memories can only be updated from the CPU
slide by Johan Seland
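A minimal sketch of the constant-memory case described above; the coefficient array, its size, and the Horner-evaluation kernel are illustrative assumptions:

#include <cuda_runtime.h>

#define NCOEF 16
__constant__ float d_coef[NCOEF];

__global__ void polyEval(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = NCOEF - 1; k >= 0; --k)     // Horner's rule; d_coef[k] is the same
        acc = acc * x[i] + d_coef[k];        // address for all threads of a warp
    y[i] = acc;
}

void uploadCoefficients(const float h_coef[NCOEF])
{
    // Constant memory can only be updated from the CPU side.
    cudaMemcpyToSymbol(d_coef, h_coef, NCOEF * sizeof(float));
}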
36. Strategy
Memory optimizations roundup
CUDA memory handling is complex
And I have not covered all topics...
Using memory correctly can lead to huge speedups
At least CUDA exposes the memory hierarchy, unlike CPUs
Get your algorithm up and running first, then optimize
Use shared memory to let threads cooperate
Be wary of “data ownership”
A thread does not have to read/write the data it calculates
slide by Johan Seland
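As a sketch of "use shared memory to let threads cooperate" and of loose data ownership, here is a standard block-wise sum; it is not from the original deck, and the names blockSum and blockResults are illustrative.

__global__ void blockSum(const float *in, float *blockResults, int n)
{
    extern __shared__ float sdata[];           // one float per thread
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    sdata[tid] = (i < n) ? in[i] : 0.0f;       // each thread loads one element
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];      // threads combine other threads' data
        __syncthreads();
    }

    if (tid == 0)                              // a single thread writes the result
        blockResults[blockIdx.x] = sdata[0];
}

// Launch as: blockSum<<<numBlocks, threads, threads * sizeof(float)>>>(...)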
37. Conflicts,
Coalescing, Warps...
I hate growing up.
38. Example
Optimization example: Matrix Transpose
46. Example
Coalesced transpose
__global__ void transpose(float *odata, float *idata, int width, int height)
{
    __shared__ float block[(BLOCK_DIM+1)*BLOCK_DIM];

    unsigned int xBlock = blockDim.x * blockIdx.x;
    unsigned int yBlock = blockDim.y * blockIdx.y;
    unsigned int xIndex = xBlock + threadIdx.x;
    unsigned int yIndex = yBlock + threadIdx.y;
    unsigned int index_out, index_transpose;

    if (xIndex < width && yIndex < height)
    {
        unsigned int index_in = width * yIndex + xIndex;
        unsigned int index_block = threadIdx.y * (BLOCK_DIM+1) + threadIdx.x;
        block[index_block] = idata[index_in];
        index_transpose = threadIdx.x * (BLOCK_DIM+1) + threadIdx.y;
        index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
    }
    __syncthreads();
    if (xIndex < width && yIndex < height)
        odata[index_out] = block[index_transpose];
}
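A possible host-side launch of the kernel above; this is a sketch, assuming BLOCK_DIM = 16, device buffers already allocated with cudaMalloc, and the helper name launchTranspose:

#define BLOCK_DIM 16

void launchTranspose(float *d_odata, float *d_idata, int width, int height)
{
    dim3 block(BLOCK_DIM, BLOCK_DIM);
    dim3 grid((width  + BLOCK_DIM - 1) / BLOCK_DIM,
              (height + BLOCK_DIM - 1) / BLOCK_DIM);
    transpose<<<grid, block>>>(d_odata, d_idata, width, height);
    cudaThreadSynchronize();   // pre-CUDA 4.0 API, as used in 2009
}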
47. Example
Coalesced transpose: Source code
__global__ void
transpose( float *out, float *in, int width, int height ) {
    __shared__ float block[BLOCK_DIM*BLOCK_DIM];

    unsigned int xBlock = blockDim.x * blockIdx.x;
    unsigned int yBlock = blockDim.y * blockIdx.y;
    unsigned int xIndex = xBlock + threadIdx.x;
    unsigned int yIndex = yBlock + threadIdx.y;
    unsigned int index_out, index_transpose;

    if ( xIndex < width && yIndex < height ) {
        unsigned int index_in = width * yIndex + xIndex;
        unsigned int index_block = threadIdx.y * BLOCK_DIM + threadIdx.x;
        block[index_block] = in[index_in];
        index_transpose = threadIdx.x * BLOCK_DIM + threadIdx.y;
        index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
    }
    __syncthreads();
    if ( xIndex < width && yIndex < height ) {
        out[index_out] = block[index_transpose];
    }
}
slide by Johan Seland
54. Example
Coalesced transpose: Source code
__global__ void
transpose( float *out, float *in, int width, int height ) {
    // Allocate shared memory.
    __shared__ float block[BLOCK_DIM*BLOCK_DIM];

    // Set up indexing.
    unsigned int xBlock = blockDim.x * blockIdx.x;
    unsigned int yBlock = blockDim.y * blockIdx.y;
    unsigned int xIndex = xBlock + threadIdx.x;
    unsigned int yIndex = yBlock + threadIdx.y;
    unsigned int index_out, index_transpose;

    // Check that we are within the domain, calculate more indices.
    if ( xIndex < width && yIndex < height ) {
        unsigned int index_in = width * yIndex + xIndex;
        unsigned int index_block = threadIdx.y * BLOCK_DIM + threadIdx.x;
        // Write to shared memory.
        block[index_block] = in[index_in];
        // Calculate output indices.
        index_transpose = threadIdx.x * BLOCK_DIM + threadIdx.y;
        index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
    }
    // Synchronize. NB: outside the if-clause.
    __syncthreads();
    // Write to global memory, with a different index.
    if ( xIndex < width && yIndex < height ) {
        out[index_out] = block[index_transpose];
    }
}
slide by Johan Seland
55. Example
Transpose timings
Was it worth the trouble?
Grid Size Coalesced Non-coalesced Speedup
128 × 128 0.011 ms 0.022 ms 2.0×
512 × 512 0.07 ms 0.33 ms 4.5×
1024 × 1024 0.30 ms 1.92 ms 6.4×
1024 × 2048 0.79 ms 6.6 ms 8.4×
For me, this is a clear yes.
slide by Johan Seland
68. Exec
Loop unrolling
Sometimes we know some kernel parameters at compile time:
# of loop iterations
Degrees of polynomials
Number of data elements
If we could “tell” this to the compiler, it can unroll loops and
optimize register usage
We need to be generic
Avoid code duplication, sizes unknown at compile time
Templates to the rescue
The same trick can be used for regular C++ sources
slide by Johan Seland
69. Exec
Example: de Casteljau algorithm
A standard algorithm for evaluating polynomials in Bernstein form
f(x) = b^d_00
Recursively defined:
b^k_(i,j) = x * b^(k-1)_(i+1,j) + (1 - x) * b^(k-1)_(i,j+1)
b_(i,j) are coefficients
[Figure: triangular evaluation scheme with b^d_00 at the top, b^(d-1)_10 and b^(d-1)_01 below, then b^(d-2)_20, b^(d-2)_11, b^(d-2)_02, edges weighted x and 1-x]
slide by Johan Seland
70. Exec
Implementation
The de Casteljau algorithm is usually implemented as nested for-loops
Coefficients are overwritten for each iteration

float deCasteljau( float *c, float x, int d )
{
    for ( uint i = 1; i <= d; ++i ) {
        for ( uint j = 0; j <= d - i; ++j )
            c[j] = (1.0f - x)*c[j] + x*c[j+1];
    }
    return c[0];
}

[Figure: the same triangular scheme, with c^d_00 at the top, c^(d-1)_10 and c^(d-1)_01 below, then c^(d-2)_20, c^(d-2)_11, c^(d-2)_02, edges weighted x and 1-x]
slide by Johan Seland
71. Exec
Template loop unrolling
We make d a template parameter

template<int d>
float deCasteljau( float *c, float x ) {
    for ( uint i = 1; i <= d; ++i ) {
        for ( uint j = 0; j <= d - i; ++j )
            c[j] = (1.0f - x)*c[j] + x*c[j+1];
    }
    return c[0];
}

Kernel is called as

switch ( d ) {
case 1:
    deCasteljau<1><<<dimGrid, dimBlock>>>( c, x ); break;
case 2:
    deCasteljau<2><<<dimGrid, dimBlock>>>( c, x ); break;
...
case MAXD:
    deCasteljau<MAXD><<<dimGrid, dimBlock>>>( c, x ); break;
}
slide by Johan Seland
72. Exec
Results
For the de Casteljau algorithm we see a relatively small
speedup
≈ 1.2× (20%...)
Very easy to implement
Can lead to long compile times
Conclusion:
Probably worth it near end of development cycle
slide by Johan Seland
74. Profiling
The CUDA Visual Profiler
Helps measure and find potential performance problems
GPU and CPU timing for all kernel invocations and memcpys
Time stamps
Access to hardware performance counters
75. Profiling
Signals
Events are tracked with hardware counters on signals in the chip:
timestamp
gld_incoherent, gld_coherent, gst_incoherent, gst_coherent: global memory loads/stores are coalesced (coherent) or non-coalesced (incoherent)
local_load, local_store: local loads/stores
branch, divergent_branch: total branches and divergent branches taken by threads
instructions: instruction count
warp_serialize: thread warps that serialize on address conflicts to shared or constant memory
cta_launched: executed thread blocks
83. 6.963 CUDA@MIT / IAP09
Misc
84. Tesla C1060 Computing Processor
Processor: 1x Tesla T10P
Core GHz: 1.33 GHz
Form factor: Full ATX, 4.736" (H) x 10.5" (L), dual slot wide
On-board memory: 4 GB
System I/O: PCIe x16 gen2
Memory I/O: 512-bit, 800MHz DDR; 102 GB/s peak bandwidth
Display outputs: None
Typical power: 160 W
M02: High Performance Computing with CUDA
85. Tesla S1070 1U System
Processors: 4 x Tesla T10P
Core GHz: 1.5 GHz
Form factor: 1U for an EIA 19" 4-post rack
Total 1U system memory: 16 GB (4.0 GB per GPU)
System I/O: 2 PCIe x16
Memory I/O per processor: 512-bit, 800MHz GDDR; 102 GB/s peak bandwidth
Display outputs: None
Typical power: 700 W
Chassis dimensions: 1.73" H x 17.5" W x 28.5" D
M02: High Performance Computing with CUDA
86. Double Precision Floating Point
(NVIDIA GPU vs. SSE2 vs. Cell SPE)
Precision: IEEE 754 | IEEE 754 | IEEE 754
Rounding modes for FADD and FMUL: all 4 IEEE (round to nearest, zero, inf, -inf) | all 4 IEEE (round to nearest, zero, inf, -inf) | round to zero/truncate only
Denormal handling: full speed | supported, costs 1000's of cycles | flush to zero
NaN support: yes | yes | no
Overflow and Infinity support: yes | yes | no infinity, clamps to max norm
Flags: no | yes | some
FMA: yes | no | yes
Square root: software with low-latency FMA-based convergence | hardware | software only
Division: software with low-latency FMA-based convergence | hardware | software only
Reciprocal estimate accuracy: 24 bit | 12 bit | 12 bit
Reciprocal sqrt estimate accuracy: 23 bit | 12 bit | 12 bit
log2(x) and 2^x estimates accuracy: 23 bit | no | no
M02: High Performance Computing with CUDA