We present a program implemented to execute the Adaptive Merge Sort algorithm in parallel on a GPU-based system. The parallel implementation exploits the large number of cores in a GPU-based system to execute independent operations concurrently, yielding better runtime performance than a serial implementation. Results from the parallel implementation are given and compared with the serial implementation on a runtime basis. The parallel version is implemented with the CUDA platform on a system with an NVIDIA GPU (GTX 650).
2. Objectives
• Understand the GPU architecture by implementing the Adaptive Merge Sort
algorithm
• Minimize execution time
• Draw up a standard design so that oversized data does not backfire in the
long run by adding overhead to the total execution time
– Reducing communication overhead (CPU <-> GPU)
3. What is GPGPU
• General Purpose Graphics Processing Unit (GPGPU)
• Very high performance at low cost
– 30-100X speedup over CPU
• Architecture well suited to a wide range of parallel applications
– data parallel processing (SIMD/SPMD)
• Integrated programmable unit
5. Architecture of GPU: GTX 650
• Registers hold values; up to 65,536 per SMX; the fastest memory access
• The warp scheduler groups threads into warps and dispatches their
instructions to the SPs through the dispatch units.
• Shared memory / L1 cache, plus a separate read-only data cache
7. Application: Adaptive Merge Sort [1]
• Works in three (3) steps:
– Partition the data set into sub-lists (nodes) based on their existing order
– Formulate the nodes so that all are in ascending order
– Merge all nodes
8. Serial Implementation
• Experiment with N = 7,168 random numbers
• After partitioning:
– total P = 2,968 nodes, of which 1,491 were in descending order
– all descending-order nodes converted to ascending order
• Merging two nodes then proceeds in three (3) steps:
– determine the node in which the selected value resides,
– determine its position in the newly merged node, and
– update the old nodes' information after the newly merged node is introduced
• The merging process repeats for every data item on each level of merging
until all fragments are converged into one.
9. How Many Levels
The merging step recurs a number of times equal to the height of the merge
tree. With 2,968 nodes, the height is 11.
So the merge function is called 11 × 7,168 = 78,848 times.
The total time for the calculation = 0.161 s.
One execution of the merge function therefore takes about 2.04 µs (theoretical).
10. Bottleneck
Q: What happens when merging billions of data items spread over millions of nodes?
A: The merge function has to be called billions of times and would
take hours to compute.
Necessity of parallel computation:
execute multiple merging operations in the
same stride of time (in parallel) and reduce
the time consumed.
12. Implementation in GPU
Kernel
• CUDA C allows the programmer to define C functions, called kernels, that
execute a number of times equal to the number of threads specified when
they are called from host or device code.
– A kernel function is defined with the __global__ declaration specifier
– The number of threads and blocks is specified inside the "<<<…>>>" execution
configuration syntax
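A minimal CUDA sketch of a kernel definition and launch follows. The kernel body is illustrative only (a simple per-element scale, not this work's merge kernel); `scaleKernel`, `threads`, and `blocks` are assumed names.

```cuda
#include <cstdio>

// Illustrative kernel: each thread handles one element.
__global__ void scaleKernel(float *d_data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) d_data[i] *= factor;
}

int main() {
    const int n = 7168;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // <<<blocks, threads>>> execution configuration: enough blocks of
    // 256 threads to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scaleKernel<<<blocks, threads>>>(d_data, n, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```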
20. GPU vs. CPU Time Comparison

Data (N)   GPU (s)     CPU (s)
4096       0.006472    0.031
7168       0.013       0.21
10240      0.0258      0.205
13312      0.0298      0.345

[Figure: GPU vs. CPU execution time for each N, shown as absolute times and as percentage shares.]
21. Static Gridsize and Variable Blocksize
Execution time stays nearly the same for a fixed grid size as the block size varies (up to the maximum of 1024).
22. Variable Gridsize and Static Blocksize
[Figure: execution time (µs) for grid configurations 1024×4 through 1024×64 at a static block size.]
Avg time per iteration = 1.011 µs
26. Execution Time Analysis
• Parallel block # (PB) = max threads per block / user-defined BLOCKSIZE
• If PB < 16:
– Parallel code loop # = user-defined GRIDSIZE / (PB × # of SMX)
• If PB > 16:
– Parallel code loop # = user-defined GRIDSIZE / (16 × # of SMX)
28. GPU Execution Time vs. Memory Transfer Time

Data (N)   Execution time (s)   Memory copy time (s)
4K         0.006472             0.0725
7K         0.013                0.0825
10K        0.0258               0.0975
13K        0.0298               0.125

[Figure: execution time vs. host<->device memory-copy time for each N, shown as absolute times and as percentage shares.]
29. Conclusion
• The investigation was successful
• The GPU's computational prowess should be harnessed to solve more
merging problems
• The examples and the design presented here should be followed to gain an
upper hand before approaching such a problem
30. Future Work
• Interpolation Merge Sort
• Better efficiency and memory handling in future revisions
• Grid-level parallelism (requires multiple GPUs)
31. References
• [1] Shamim Akhter et al., 2010. "Sorting N-elements Using Natural Order: A New Adaptive Sorting Algorithm." Journal of Computer Science 6(2): 163-167.