1. Master Thesis:
Automatic Program Parallelization
for GPU Environment
Candidate: Džanan Bajgorić
Supervisor: Jan Kwiatkowski DSc
Wrocław, 2016
2. Agenda
1. Automatic Parallelization
• Data Dependence Analysis
• Loop Transformations
2. C2CUDA Compiler
• Compiler Organization
• Process Diagram
3. Experiments and Conclusions
• Matrix Addition
• Naïve Matrix Multiplication
• Magnetic Resonance Imaging – Q
4. Future and Potential Improvements
3. What is Automatic Parallelization?
• Automatic parallelization is a technique by which a compiler
automatically translates a sequential program into an
equivalent parallel program.
4. Main Goals.
Data Dependence Analysis
• Mathematical apparatus
• Linear data dependence problem
• General and Uniform dependence algorithm
Loop Transformations and
Restructuring
• Unimodular loop transformations
• Outer loop parallelization
• Inner loop parallelization
GPU Architecture and
CUDA Model
• Intimate understanding of GPU architecture
• CUDA architecture and programming model
C2CUDA Compiler
• Design and implementation of C2CUDA
automatic parallelizer
• Applying C2CUDA to a number of test
programs and recording the results
5. Data Dependence Analysis
• Data dependence analysis consists of finding all dependent statements in a
given program. This thesis is limited to solving the data dependence
problem at the level of a single perfect loop nest.
L1: do I1 = p1, q1
      ⋮
Lm:   do Im = pm, qm
S:      X(I·A + a0) = ⋯
T:      ⋯ = X(I·B + b0)
      enddo
      ⋮
    enddo
Dependence problem:
Find all data dependences
between the statements in
the nest body H(I).
• Form a system of linear diophantine equations:
i·A + a0 = j·B + b0
• Use the echelon reduction algorithm to solve the
system
• If the system has an integer solution, a dependence
may exist; otherwise, there is no dependence
• If the diophantine system has a solution, it must
still be checked against the loop nest bounds
• This leads to a system of linear inequalities that
may be solved with the Fourier method of elimination
• If that system has an integer solution, the dependence
exists; otherwise, no dependence exists
within the loop bounds.
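As a sketch of the first step, consider the one-dimensional special case: the equation a·i + b·j = c has an integer solution exactly when gcd(a, b) divides c; the echelon reduction algorithm generalizes this divisibility test to whole systems. The function names below are illustrative, not part of C2CUDA:

```c
#include <stdlib.h>

/* Greatest common divisor via Euclid's algorithm. */
static int gcd(int a, int b)
{
    a = abs(a);
    b = abs(b);
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* One-dimensional dependence test: a*i + b*j = c has an integer
 * solution if and only if gcd(a, b) divides c. If it does not,
 * there can be no dependence; if it does, a dependence MAY exist
 * and must still be checked against the loop bounds. */
static int dependence_possible(int a, int b, int c)
{
    int g = gcd(a, b);
    return g != 0 && c % g == 0;
}
```

For example, references X(2i) and X(2j + 1) yield 2i − 2j = 1; since gcd(2, −2) = 2 does not divide 1, the two references can never touch the same element.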
6. Loop Transformations
• Loop transformations are used to restructure the loop nest in order to better expose
the parallelism considering the target architecture.
L1: do I1 = 1, 100
L2:   do I2 = 1, 100000
S:      X(I1, I2) = X(I1 - 1, I2)
      enddo
    enddo
Loop transformation problem:
Based on the dependence constraints and
target architecture details, transform the given
loop nest to better expose the parallelism
while respecting the dependence constraints.
• Decide which transformation (if any) to apply to
better expose the parallelism
• C2CUDA supports the family of so-called unimodular
transformations
• Outer loop parallelization, inner loop parallelization,
loop permutation, loop interchange
In the original nest the outermost loop carries the dependence.
Outer loop parallelization (via loop interchange) transforms it into:

LU1: do K1 = 1, 100000
LU2:   do K2 = 1, 100
S:       X(K2, K1) = X(K2 - 1, K1)
       enddo
     enddo

In the transformed nest the innermost loop carries the dependence,
so the iterations of the outer loop can run in parallel.
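The interchange above corresponds to the unimodular matrix U = [[0, 1], [1, 0]] applied to each iteration vector (row-vector convention, K = I·U). A minimal sketch, with illustrative names not taken from C2CUDA:

```c
/* A 2x2 matrix is unimodular when its determinant is +1 or -1,
 * which guarantees the transformation is invertible over the
 * integers and preserves the set of iterations. */
static int is_unimodular(const int U[2][2])
{
    int det = U[0][0] * U[1][1] - U[0][1] * U[1][0];
    return det == 1 || det == -1;
}

/* Apply a 2x2 unimodular matrix U to an iteration vector I,
 * producing the transformed iteration vector K = I * U
 * (row-vector convention). Loop interchange corresponds to
 * U = [[0, 1], [1, 0]]. */
static void apply_unimodular(const int U[2][2], const int I[2], int K[2])
{
    K[0] = I[0] * U[0][0] + I[1] * U[1][0];
    K[1] = I[0] * U[0][1] + I[1] * U[1][1];
}
```

With the interchange matrix, iteration (i1, i2) maps to (i2, i1), which is exactly the swap of loop levels shown above.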
7. C2CUDA Compiler
• C2CUDA is an automatic parallelizer designed and implemented as part of this thesis.
• It is capable of translating relatively simple ANSI C programs into their CUDA C
equivalents.
• C2CUDA Utilization: backbone for the rest of the components (echelon
reduction, diagonalization, diophantine solver, Fourier elimination
solver, common matrix/vector operations)
• C2CUDA Dependence: implements the data dependence analyzer
(general and uniform data dependence algorithms)
• C2CUDA Transformation: implements the loop transformation and
restructuring engine (supports outer and inner loop parallelization)
• C2CUDA Frontend: implements a parser (uses a CFG as the intermediate
representation), the transformation frontend and the code generator
that produces the final CUDA C code
8. C2CUDA Process Diagram
• The following is a high-level process diagram of the C2CUDA compiler,
showing how the components interoperate:

Sequential C code → C2CUDA Frontend → (loop nest) → C2CUDA Dependence →
(dependence information) → C2CUDA Transformation → (transformation) →
C2CUDA Frontend → CUDA C code

C2CUDA Utilization underpins all of the other components.
9. Experiments and Conclusions. Matrix
Addition.
Matrix dimension   Sequential   Parallel (without      Parallel (with
                                transfer overhead)     transfer overhead)
1024 × 1024        0.005        0.001                  0.011
2048 × 3072        0.030        0.003                  0.059
4096 × 2048        0.040        0.004                  0.078
4096 × 4096        0.075        0.008                  0.150
8192 × 4096        0.147        0.016                  0.302
8192 × 8192        0.314        0.032                  0.598
10240 × 8192       0.409        0.040                  0.761
8192 × 16384       0.689        0.060                  1.212
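The speed-ups implied by the table can be checked directly; taking the 1024 × 1024 row as an example:

```c
/* Speed-up is the ratio of sequential to parallel execution time.
 * A value below 1.0 means the parallel version is slower overall. */
static double speedup(double t_seq, double t_par)
{
    return t_seq / t_par;
}
```

For the 1024 × 1024 case this gives 0.005 / 0.001 = 5× ignoring transfers, but 0.005 / 0.011 ≈ 0.45×, i.e. a net slowdown, once the host-device transfer overhead is counted.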
for (unsigned int i = 0; i < n_rows; ++i)
{
    for (unsigned int j = 0; j < n_cols; ++j)
        c[i][j] = a[i][j] + b[i][j];
}
Sequential C program
Dependence-free nest
__global__ void MatrixAdditionC2CUDA(int *a, int *b, int *c,
                                     const unsigned int n_rows,
                                     const unsigned int n_cols)
{
    unsigned int i = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_rows || j >= n_cols)
        return;
    unsigned int idx = i * n_cols + j;
    c[idx] = a[idx] + b[idx];
}
C2CUDA generated matrix addition
Every CUDA thread maps to exactly one nest iteration!
Execution times in seconds.
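The one-thread-per-iteration mapping requires the grid to be sized so that every (i, j) is covered. A host-side sketch of the arithmetic involved (the 16 × 16 block shape is an assumption for illustration, not necessarily what C2CUDA emits):

```c
/* Ceiling division: the number of blocks needed so that
 * n_blocks * block_dim >= n along one matrix axis. */
static unsigned int div_up(unsigned int n, unsigned int block_dim)
{
    return (n + block_dim - 1) / block_dim;
}

/* Global index each thread computes inside the kernel:
 * i = blockIdx.y * blockDim.y + threadIdx.y, likewise for j.
 * Threads whose index falls past the matrix edge are masked
 * out by the bounds check in the kernel. */
static unsigned int global_index(unsigned int block_idx,
                                 unsigned int block_dim,
                                 unsigned int thread_idx)
{
    return block_idx * block_dim + thread_idx;
}
```

A 1024 × 1024 matrix with 16 × 16 blocks needs a 64 × 64 grid; the last thread of the last block computes index 63 · 16 + 15 = 1023, exactly the final row/column.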
15. Improvements and Future Work.
Improvements:
• Negative loop strides
• Non-unit loop strides
• Extend the spectrum of supported transformations
• Improve the heuristic for selecting transformations
• Improve the mapping of CUDA threads to loop iterations
• Support for conditional statements
• Utilize CUDA shared and constant memory
• Support other loop types (while, do-while, mixed)
• Speed-up gain estimation algorithm
• Runtime analysis
• Pointer analysis