1. 6.963 CUDA@MIT / IAP09
Supercomputing on your desktop:
Programming the next generation of cheap
and massively parallel hardware using CUDA
Lecture 07 #2
CUDA - Advanced
Nicolas Pinto (MIT)
2. During this course, we'll try to "adapt" and use existing material for 6.963 ;-)
5. Here are the keys
to High-Performance in CUDA
6. Warning!
To optimize or not to optimize
Hoare said (and Knuth restated)
“Premature optimization is the root of all evil.”
slide by Johan Seland
7. Warning!
To optimize or not to optimize
Hoare said (and Knuth restated)
“We should forget about small efficiencies, say about
97% of the time:
Premature optimization is the root of all evil.”
⇓
3% of the time we really should worry about small efficiencies
(every 33rd code line)
slide by Johan Seland
8. 6.963 CUDA@MIT / IAP09
Strategy
Memory Optimizations
Execution Optimizations
9. 6.963 CUDA@MIT / IAP09
CUDA
Performance Strategies
10. Strategy
Optimization goals
We should strive to reach GPU performance
We must know the GPU performance
Vendor specifications
Synthetic benchmarks
Choose a performance metric
Memory bandwidth or GFLOPS?
Use clock() to measure
Experiment and profile!
slide by Johan Seland
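One way to act on this slide is to pick effective memory bandwidth as the metric and measure it directly. The following is a minimal sketch, not part of the original deck: the kernel name copyKernel, the problem size N, and the use of CUDA events (an alternative to clock()) are illustrative assumptions.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void copyKernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                 // reads 4 bytes, writes 4 bytes
}

int main()
{
    const int N = 1 << 24;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    copyKernel<<<(N + 255) / 256, 256>>>(d_out, d_in, N);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);    // milliseconds

    // Effective bandwidth: bytes read plus bytes written, per second.
    double gbps = (2.0 * N * sizeof(float)) / (ms * 1.0e6);
    printf("time = %.3f ms, effective bandwidth = %.1f GB/s\n", ms, gbps);

    cudaFree(d_in); cudaFree(d_out);
    return 0;
}

Comparing the measured number against the vendor peak quoted later in the deck (about 80 GB/s for a Quadro FX 5600) shows how far a kernel is from the hardware limit.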
18. 6.963 CUDA@MIT / IAP09
Memory
Optimizations
19. Memory
Memory optimizations
Optimizing memory transfers
Coalescing global memory accesses
Using shared memory effectively
20. Memory
Data Transfers
Device memory to host memory bandwidth much lower than device memory to device memory bandwidth
4 GB/s peak (PCI-e x16) vs. 80 GB/s peak (Quadro FX 5600)
8 GB/s for PCI-e 2.0
Minimize transfers
Intermediate data structures can be allocated, operated on, and deallocated without ever copying them to host memory
Group transfers
One large transfer much better than many small ones
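As an illustration of the "group transfers" advice, here is a minimal sketch that is not from the original slides; NUM_CHUNKS, CHUNK, and the host-side staging buffer are assumptions made for the example.

#include <cuda_runtime.h>
#include <cstring>

const int NUM_CHUNKS = 1024;
const int CHUNK      = 256;            // floats per chunk

void transfer_small(float *dst, float *chunks[NUM_CHUNKS])
{
    // Many small transfers: each cudaMemcpy pays the per-call overhead.
    for (int i = 0; i < NUM_CHUNKS; ++i)
        cudaMemcpy(dst + i * CHUNK, chunks[i], CHUNK * sizeof(float),
                   cudaMemcpyHostToDevice);
}

void transfer_grouped(float *dst, float *chunks[NUM_CHUNKS])
{
    // Pack on the host, then issue one large transfer.
    static float staging[NUM_CHUNKS * CHUNK];
    for (int i = 0; i < NUM_CHUNKS; ++i)
        memcpy(staging + i * CHUNK, chunks[i], CHUNK * sizeof(float));
    cudaMemcpy(dst, staging, sizeof(staging), cudaMemcpyHostToDevice);
}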
21. Memory
Page-Locked Memory Transfers
cudaMallocHost() allows allocation of page-locked host memory
Enables highest cudaMemcpy performance
3.2 GB/s common on PCI-express (x16)
~4 GB/s measured on nForce 680i motherboards (overclocked PCI-e)
See the "bandwidthTest" CUDA SDK sample
Use with caution
Allocating too much page-locked memory can reduce overall system performance
Test your systems and apps to learn their limits
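A minimal sketch of the cudaMallocHost() pattern described above (the 64 MB buffer size is an illustrative assumption):

#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 64 << 20;
    float *h_pinned = 0, *d_buf = 0;

    cudaMallocHost((void**)&h_pinned, bytes);   // page-locked host memory
    cudaMalloc((void**)&d_buf, bytes);

    // Transfers from pinned memory reach the highest cudaMemcpy bandwidth;
    // the SDK "bandwidthTest" sample measures the difference.
    cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);                     // not free()!
    return 0;
}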
22. gmem
Global Memory Reads/Writes
Highest latency instructions: 400-600 clock cycles
Likely to be a performance bottleneck
Optimizations can greatly increase performance
Coalescing: up to 10x speedup
Latency hiding: up to 2.5x speedup
23. gmem
Accessing global memory
4 cycles to issue a memory fetch
but 400-600 cycles of latency
The equivalent of 100 MADs
Likely to be a performance bottleneck
Order of magnitude speedups possible
Coalesce memory access
Use shared memory to re-order non-coalesced addressing
slide by Johan Seland
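A minimal sketch of the two access patterns this slide contrasts; the kernel names coalesced and strided are illustrative:

__global__ void coalesced(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // consecutive threads read
    if (i < n) out[i] = in[i] + 1.0f;                // consecutive addresses
}

__global__ void strided(float *out, const float *in, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;  // gaps between the
    if (i < n) out[i] = in[i] + 1.0f;                          // threads' addresses
}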
27. gmem
Coalescing: Timing Results
Experiment on G80:
Kernel: read a float, increment, write back
3M floats (12MB)
Times averaged over 10K runs
12K blocks x 256 threads:
356 µs - coalesced
357 µs - coalesced, some threads don't participate
3,494 µs - permuted/misaligned thread access
28. gmem
Coalescing: Structures of size ≠ 4, 8, 16 Bytes
Use a Structure of Arrays (SoA) instead of Array of Structures (AoS)
If SoA is not viable:
Force structure alignment: __align(X), where X = 4, 8, or 16
Use SMEM to achieve coalescing
x y z Point structure
x y z x y z x y z AoS
x x x y y y z z z SoA
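A minimal sketch of the layouts in the diagram above; the field and kernel names are illustrative, and __align__(16) is the CUDA alignment qualifier the slide writes as __align(X):

struct PointAoS { float x, y, z; };                 // 12-byte element: breaks coalescing

struct PointsSoA {                                  // one array per component
    float *x;
    float *y;
    float *z;
};

__global__ void scaleAoS(PointAoS *p, int n)        // thread i touches bytes 12*i .. 12*i+11
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x *= 2.0f;
}

__global__ void scaleSoA(PointsSoA p, int n)        // thread i touches p.x[i]: coalesced
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p.x[i] *= 2.0f;
}

// If AoS must be kept, forcing 16-byte alignment helps:
struct __align__(16) PointAligned { float x, y, z, pad; };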
29. gmem
Coalescing: Summary
Coalescing greatly improves throughput
Critical to memory-bound kernels
Reading structures of size other than 4, 8, or 16 bytes will break coalescing:
Prefer Structures of Arrays over AoS
If SoA is not viable, read/write through SMEM
Additional resources:
Aligned Types SDK Sample
30. smem
Parallel Memory Architecture
In a parallel machine, many threads access memory
Therefore, memory is divided into banks
Essential to achieve high bandwidth
Each bank can service one address per cycle
A memory can service as many simultaneous accesses as it has banks
Multiple simultaneous accesses to a bank result in a bank conflict
Conflicting accesses are serialized
[Figure: shared memory organized as Banks 0-15]
31. smem
Bank Addressing Examples
No Bank Conflicts: linear addressing, stride == 1
No Bank Conflicts: random 1:1 permutation
[Figure: Threads 0-15 mapping onto Banks 0-15 with no conflicts]
32. smem
Bank Addressing Examples
2-way Bank Conflicts: linear addressing, stride == 2
8-way Bank Conflicts: linear addressing, stride == 8
[Figure: Threads 0-15 mapping onto Banks 0-15 with 2-way (x2) and 8-way (x8) conflicts]
33. smem
How addresses map to banks on G80
Bandwidth of each bank is 32 bits per 2 clock cycles
Successive 32-bit words are assigned to successive banks
G80 has 16 banks
So bank = address % 16
Same as the size of a half-warp
No bank conflicts between different half-warps, only within a single half-warp
34. smem
Shared memory bank conflicts
Shared memory is as fast as registers if there are no bank conflicts
The fast case:
If all threads of a half-warp access different banks, there is no bank conflict
If all threads of a half-warp read the identical address, there is no bank conflict (broadcast)
The slow case:
Bank Conflict: multiple threads in the same half-warp access the same bank
Must serialize the accesses
Cost = max # of simultaneous accesses to a single bank
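A minimal sketch, not from the original slides, of the usual fix for the slow case: pad a 16-wide shared-memory tile by one column so that accesses with a stride of 16 words from a half-warp land in distinct banks. TILE and the kernel name are assumptions, and the kernel only transposes a single 16x16 tile.

#define TILE 16

__global__ void transposeTile(const float *in, float *out)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 column of padding

    int tx = threadIdx.x, ty = threadIdx.y;

    tile[ty][tx] = in[ty * TILE + tx];       // consecutive addresses: no conflict
    __syncthreads();

    // Read the transposed element: the threads of a half-warp (tx = 0..15, ty fixed)
    // access tile[0][ty], tile[1][ty], ... i.e. addresses TILE+1 words apart.
    // Without the padding the stride would be 16 words: a 16-way bank conflict.
    out[ty * TILE + tx] = tile[tx][ty];
}

This is the same +1 padding that appears in the transpose kernel later in the deck.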
35. Strategy
Use the right kind of memory
Constant memory:
Quite small, ≈ 20K
As fast as register access if all threads in a warp access the
same location
Texture memory:
Spatially cached
Optimized for 2D locality
Neighboring threads should read neighboring addresses
No need to think about coalescing
Constraint:
These memories can only be updated from the CPU
slide by Johan Seland
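A minimal sketch of the constant-memory case described above; the coefficient array, its size, and the Horner-evaluation kernel are illustrative assumptions:

#include <cuda_runtime.h>

#define NCOEF 16
__constant__ float d_coef[NCOEF];

__global__ void polyEval(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = NCOEF - 1; k >= 0; --k)     // Horner's rule; d_coef[k] is the same
        acc = acc * x[i] + d_coef[k];        // address for all threads of a warp
    y[i] = acc;
}

void uploadCoefficients(const float h_coef[NCOEF])
{
    // Constant memory can only be updated from the CPU side.
    cudaMemcpyToSymbol(d_coef, h_coef, NCOEF * sizeof(float));
}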
36. Strategy
Memory optimizations roundup
CUDA memory handling is complex
And I have not covered all topics...
Using memory correctly can lead to huge speedups
At least CUDA exposes the memory hierarchy, unlike CPUs
Get your algorithm up and running first, then optimize
Use shared memory to let threads cooperate
Be wary of “data ownership”
A thread does not have to read/write the data it calculates
slide by Johan Seland
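As a sketch of "use shared memory to let threads cooperate" and of loose data ownership, here is a standard block-wise sum; it is not from the original deck, and the names blockSum and blockResults are illustrative.

__global__ void blockSum(const float *in, float *blockResults, int n)
{
    extern __shared__ float sdata[];           // one float per thread
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    sdata[tid] = (i < n) ? in[i] : 0.0f;       // each thread loads one element
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];      // threads combine other threads' data
        __syncthreads();
    }

    if (tid == 0)                              // a single thread writes the result
        blockResults[blockIdx.x] = sdata[0];
}

// Launch as: blockSum<<<numBlocks, threads, threads * sizeof(float)>>>(...)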
37. Conflicts,
Coalescing, Warps...
I hate growing up.
38. Example
Optimization example: Matrix Transpose
46. Example
Coalesced transpose
__global__ void transpose(float *odata, float *idata, int width, int height)
{
    __shared__ float block[(BLOCK_DIM+1)*BLOCK_DIM];

    unsigned int xBlock = blockDim.x * blockIdx.x;
    unsigned int yBlock = blockDim.y * blockIdx.y;
    unsigned int xIndex = xBlock + threadIdx.x;
    unsigned int yIndex = yBlock + threadIdx.y;
    unsigned int index_out, index_transpose;

    if (xIndex < width && yIndex < height)
    {
        unsigned int index_in = width * yIndex + xIndex;
        unsigned int index_block = threadIdx.y * (BLOCK_DIM+1) + threadIdx.x;
        block[index_block] = idata[index_in];
        index_transpose = threadIdx.x * (BLOCK_DIM+1) + threadIdx.y;
        index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
    }
    __syncthreads();
    if (xIndex < width && yIndex < height)
        odata[index_out] = block[index_transpose];
}
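A possible host-side launch of the kernel above; this is a sketch, assuming BLOCK_DIM = 16, device buffers already allocated with cudaMalloc, and the helper name launchTranspose:

#define BLOCK_DIM 16

void launchTranspose(float *d_odata, float *d_idata, int width, int height)
{
    dim3 block(BLOCK_DIM, BLOCK_DIM);
    dim3 grid((width  + BLOCK_DIM - 1) / BLOCK_DIM,
              (height + BLOCK_DIM - 1) / BLOCK_DIM);
    transpose<<<grid, block>>>(d_odata, d_idata, width, height);
    cudaThreadSynchronize();   // pre-CUDA 4.0 API, as used in 2009
}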
47. Example
Coalesced transpose: Source code
__global__ void
transpose( float *out, float *in, int width, int height ) {
    __shared__ float block[BLOCK_DIM*BLOCK_DIM];

    unsigned int xBlock = blockDim.x * blockIdx.x;
    unsigned int yBlock = blockDim.y * blockIdx.y;
    unsigned int xIndex = xBlock + threadIdx.x;
    unsigned int yIndex = yBlock + threadIdx.y;
    unsigned int index_out, index_transpose;

    if ( xIndex < width && yIndex < height ) {
        unsigned int index_in = width * yIndex + xIndex;
        unsigned int index_block = threadIdx.y * BLOCK_DIM + threadIdx.x;
        block[index_block] = in[index_in];
        index_transpose = threadIdx.x * BLOCK_DIM + threadIdx.y;
        index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
    }
    __syncthreads();
    if ( xIndex < width && yIndex < height ) {
        out[index_out] = block[index_transpose];
    }
}
slide by Johan Seland
54. Example
Coalesced transpose: Source code
__global__ void
transpose( float *out, float *in, int width, int height ) {
    // Allocate shared memory.
    __shared__ float block[BLOCK_DIM*BLOCK_DIM];

    // Set up indexing.
    unsigned int xBlock = blockDim.x * blockIdx.x;
    unsigned int yBlock = blockDim.y * blockIdx.y;
    unsigned int xIndex = xBlock + threadIdx.x;
    unsigned int yIndex = yBlock + threadIdx.y;
    unsigned int index_out, index_transpose;

    // Check that we are within the domain, calculate more indices.
    if ( xIndex < width && yIndex < height ) {
        unsigned int index_in = width * yIndex + xIndex;
        unsigned int index_block = threadIdx.y * BLOCK_DIM + threadIdx.x;
        // Write to shared memory.
        block[index_block] = in[index_in];
        // Calculate output indices.
        index_transpose = threadIdx.x * BLOCK_DIM + threadIdx.y;
        index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
    }
    // Synchronize. NB: outside the if-clause.
    __syncthreads();
    // Write to global memory, with a different index.
    if ( xIndex < width && yIndex < height ) {
        out[index_out] = block[index_transpose];
    }
}
slide by Johan Seland
55. Example
Transpose timings
Was it worth the trouble?
Grid Size Coalesced Non-coalesced Speedup
128 × 128 0.011 ms 0.022 ms 2.0×
512 × 512 0.07 ms 0.33 ms 4.5×
1024 × 1024 0.30 ms 1.92 ms 6.4×
1024 × 2048 0.79 ms 6.6 ms 8.4×
For me, this is a clear yes.
slide by Johan Seland
68. Exec
Loop unrolling
Sometimes we know some kernel parameters at compile time:
# of loop iterations
Degrees of polynomials
Number of data elements
If we could “tell” this to the compiler, it can unroll loops and
optimize register usage
We need to be generic
Avoid code duplication, sizes unknown at compile time
Templates to the rescue
The same trick can be used for regular C++ sources
slide by Johan Seland
69. Exec
Example: de Casteljau algorithm
A standard algorithm for evaluating polynomials in Bernstein form
f(x) = b^d_00
Recursively defined:
b^k_(i,j) = x * b^(k-1)_(i+1,j) + (1 - x) * b^(k-1)_(i,j+1)
b_(i,j) are coefficients
[Figure: triangular evaluation scheme with b^d_00 at the top, b^(d-1)_10 and b^(d-1)_01 below, then b^(d-2)_20, b^(d-2)_11, b^(d-2)_02, edges weighted x and 1-x]
slide by Johan Seland
70. Exec
Implementation
The de Casteljau algorithm is usually implemented as nested for-loops
Coefficients are overwritten for each iteration

float deCasteljau( float *c, float x, int d )
{
    for ( uint i = 1; i <= d; ++i ) {
        for ( uint j = 0; j <= d - i; ++j )
            c[j] = (1.0f - x)*c[j] + x*c[j+1];
    }
    return c[0];
}

[Figure: the same triangular scheme, with c^d_00 at the top, c^(d-1)_10 and c^(d-1)_01 below, then c^(d-2)_20, c^(d-2)_11, c^(d-2)_02, edges weighted x and 1-x]
slide by Johan Seland
71. Exec
Template loop unrolling
We make d a template parameter

template<int d>
float deCasteljau( float *c, float x ) {
    for ( uint i = 1; i <= d; ++i ) {
        for ( uint j = 0; j <= d - i; ++j )
            c[j] = (1.0f - x)*c[j] + x*c[j+1];
    }
    return c[0];
}

Kernel is called as

switch ( d ) {
case 1:
    deCasteljau<1><<<dimGrid, dimBlock>>>( c, x ); break;
case 2:
    deCasteljau<2><<<dimGrid, dimBlock>>>( c, x ); break;
...
case MAXD:
    deCasteljau<MAXD><<<dimGrid, dimBlock>>>( c, x ); break;
}
slide by Johan Seland
72. Exec
Results
For the de Casteljau algorithm we see a relatively small
speedup
≈ 1.2× (20%...)
Very easy to implement
Can lead to long compile times
Conclusion:
Probably worth it near end of development cycle
slide by Johan Seland
74. Profiling
The CUDA Visual Profiler
Helps measure and find potential performance problems
GPU and CPU timing for all kernel invocations and memcpys
Time stamps
Access to hardware performance counters
75. Profiling
Signals
Events are tracked with hardware counters on signals in the chip:
timestamp
gld_incoherent, gld_coherent, gst_incoherent, gst_coherent: global memory loads/stores are coalesced (coherent) or non-coalesced (incoherent)
local_load, local_store: local loads/stores
branch, divergent_branch: total branches and divergent branches taken by threads
instructions: instruction count
warp_serialize: thread warps that serialize on address conflicts to shared or constant memory
cta_launched: executed thread blocks
83. 6.963 CUDA@MIT / IAP09
Misc
84. Tesla C1060 Computing Processor
Processor: 1x Tesla T10P
Core GHz: 1.33 GHz
Form factor: Full ATX, 4.736" (H) x 10.5" (L), dual slot wide
On-board memory: 4 GB
System I/O: PCIe x16 gen2
Memory I/O: 512-bit, 800MHz DDR; 102 GB/s peak bandwidth
Display outputs: None
Typical power: 160 W
M02: High Performance Computing with CUDA
85. Tesla S1070 1U System
Processors: 4 x Tesla T10P
Core GHz: 1.5 GHz
Form factor: 1U for an EIA 19" 4-post rack
Total 1U system memory: 16 GB (4.0 GB per GPU)
System I/O: 2 PCIe x16
Memory I/O per processor: 512-bit, 800MHz GDDR; 102 GB/s peak bandwidth
Display outputs: None
Typical power: 700 W
Chassis dimensions: 1.73" H x 17.5" W x 28.5" D
M02: High Performance Computing with CUDA
86. Double Precision Floating Point
(NVIDIA GPU vs. SSE2 vs. Cell SPE)
Precision: IEEE 754 | IEEE 754 | IEEE 754
Rounding modes for FADD and FMUL: all 4 IEEE (round to nearest, zero, inf, -inf) | all 4 IEEE (round to nearest, zero, inf, -inf) | round to zero/truncate only
Denormal handling: full speed | supported, costs 1000's of cycles | flush to zero
NaN support: yes | yes | no
Overflow and Infinity support: yes | yes | no infinity, clamps to max norm
Flags: no | yes | some
FMA: yes | no | yes
Square root: software with low-latency FMA-based convergence | hardware | software only
Division: software with low-latency FMA-based convergence | hardware | software only
Reciprocal estimate accuracy: 24 bit | 12 bit | 12 bit
Reciprocal sqrt estimate accuracy: 23 bit | 12 bit | 12 bit
log2(x) and 2^x estimates accuracy: 23 bit | no | no
M02: High Performance Computing with CUDA