We present a program implemented to execute the Adaptive Merge Sort algorithm in parallel on a GPU-based system. The parallel implementation exploits the large number of cores in a GPU-based system to execute independent operations concurrently, yielding better runtime performance than a serial implementation. Results from the parallel implementation are given and compared with the serial implementation on a runtime basis. The parallel version is implemented with the CUDA platform on a system with an NVIDIA GPU (GTX 650).
2. Objectives
• Understand the GPU architecture by implementing the Adaptive Merge Sort
algorithm
• Minimize execution time
• Draw up a standard design so that oversized data does not backfire in the
long run by adding overhead to the total execution time
– Reducing communication overhead (CPU <-> GPU)
3. What is GPGPU
• General Purpose Graphics Processing Unit (GPGPU)
• Very high performance at low cost
– 30-100X speedup over CPU
• Architecture well suited to a wide range of parallel applications
– data parallel processing (SIMD/SPMD)
• Integrated programmable unit
5. Architecture of GPU: GTX 650
• Registers hold values; up to 65,536 per SMX; the fastest memory access
• The warp scheduler groups threads into warps and dispatches their
instructions to the SPs through the dispatch units.
• Shared memory / L1 cache, plus a separate read-only data cache
7. Application: Adaptive Merge Sort [1]
• Works in three (3) steps:
– Partition the data set into sub-lists (nodes) based on their existing order
– Formulate the nodes so that all are in ascending order
– Merge all nodes
8. Serial Implementation
• Experiment with N = 7,168 random numbers
• After partitioning:
– total P = 2,968 nodes, of which 1,491 were in descending order
– all descending-order nodes converted to ascending order
• Merging two nodes then proceeds in three (3) steps:
– determine the node in which the selected value resides,
– determine its position in the newly merged node, and
– update the old nodes' information after the newly merged node is introduced
• The merging process repeats for every data item on each level of merging
until all fragments are converged into one.
9. How Many Levels
The merging step recurs a number of times equal to the height of the merge
tree. With 2,968 nodes, the height is 11.
So the merge function is called 11 × 7,168 = 78,848 times.
The total time for the calculation = 0.161 s.
One execution of the merge function therefore takes about 2.04 µs (theoretical).
10. Bottleneck
Q: What happens when merging billions of data items spread over millions of nodes?
A: The merge function has to be called billions of times and would
take hours to compute.
Necessity of parallel computation:
execute multiple merging operations in the
same stride of time (in parallel) and reduce
the time consumed.
12. Implementation in GPU
Kernel
• CUDA C allows the programmer to define C functions, called kernels, that
execute a number of times equal to the number of threads specified when
they are called from host or device code.
– A kernel function is defined with the __global__ declaration specifier
– The number of threads and blocks is specified inside the "<<<…>>>" execution
configuration syntax
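A minimal CUDA sketch of a kernel definition and launch follows. The kernel body is illustrative only (a simple per-element scale, not this work's merge kernel); `scaleKernel`, `threads`, and `blocks` are assumed names.

```cuda
#include <cstdio>

// Illustrative kernel: each thread handles one element.
__global__ void scaleKernel(float *d_data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) d_data[i] *= factor;
}

int main() {
    const int n = 7168;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // <<<blocks, threads>>> execution configuration: enough blocks of
    // 256 threads to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scaleKernel<<<blocks, threads>>>(d_data, n, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```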
20. GPU vs. CPU Time Comparison

Data (N)   GPU (s)     CPU (s)
4096       0.006472    0.031
7168       0.013       0.21
10240      0.0258      0.205
13312      0.0298      0.345

[Figure: GPU vs. CPU execution time for each N, shown as absolute times and as percentage shares.]
21. Static Gridsize and Variable Blocksize
Execution time stays nearly the same for a fixed grid size as the block size varies (up to the maximum of 1024).
22. Variable Gridsize and Static Blocksize
[Figure: execution time (µs) for grid configurations 1024×4 through 1024×64 at a static block size.]
Avg time per iteration = 1.011 µs
26. Execution Time Analysis
• Parallel block # (PB) = max threads per block / user-defined BLOCKSIZE
• If PB < 16:
– Parallel code loop # = user-defined GRIDSIZE / (PB × # of SMX)
• If PB > 16:
– Parallel code loop # = user-defined GRIDSIZE / (16 × # of SMX)
28. GPU Execution Time vs. Memory Transfer Time

Data (N)   Execution time (s)   Memory copy time (s)
4K         0.006472             0.0725
7K         0.013                0.0825
10K        0.0258               0.0975
13K        0.0298               0.125

[Figure: execution time vs. host<->device memory-copy time for each N, shown as absolute times and as percentage shares.]
29. Conclusion
• The investigation was successful
• The GPU's computational prowess should be harnessed to solve more
merging problems
• The examples and the design presented here should be followed to gain an
upper hand before approaching such a problem
30. Future Work
• Interpolation Merge Sort
• Better efficiency and memory handling in future revisions
• Grid-level parallelism (requires multiple GPUs)
31. References
• [1] Shamim Akhter et al., 2010. "Sorting N-elements Using Natural Order: A New Adaptive Sorting Algorithm." Journal of Computer Science 6(2): 163-167.