Master Thesis:
Automatic Program Parallelization
for GPU Environment
Candidate: Džanan Bajgorić
Supervisor: Jan Kwiatkowski, DSc
Wrocław, 2016
Agenda
1. Automatic Parallelization
   • Data Dependence Analysis
   • Loop Transformations
2. C2CUDA Compiler
   • Compiler Organization
   • Process Diagram
3. Experiments and Conclusions
   • Matrix Addition
   • Naïve Matrix Multiplication
   • Magnetic Resonance Imaging – Q
4. Future Work and Potential Improvements
What is Automatic Parallelization?
• Automatic parallelization is a technique by which a compiler automatically translates a sequential program into its parallel equivalent.
Main Goals.

Data Dependence Analysis
• Mathematical apparatus
• Linear data dependence problem
• General and uniform dependence algorithms

Loop Transformations and Restructuring
• Unimodular loop transformations
• Outer loop parallelization
• Inner loop parallelization

GPU Architecture and CUDA Model
• Intimate understanding of the GPU architecture
• CUDA architecture and programming model

C2CUDA Compiler
• Design and implementation of the C2CUDA automatic parallelizer
• Applying C2CUDA to a number of test programs and recording the results
Data Dependence Analysis
• Data dependence analysis consists of finding all dependent statements in a given program. This thesis is limited to solving the data dependence problem at the level of a single perfect loop nest.
$L_1$:  do $I_1 = p_1, q_1$
          ⋮
$L_m$:    do $I_m = p_m, q_m$
$S$:        $X(\mathbf{I}A + \mathbf{a}_0) = \cdots$
$T$:        $\cdots = X(\mathbf{I}B + \mathbf{b}_0)$
          enddo
          ⋮
        enddo
Dependence problem: find all data dependences between the statements in the nest body $H(\mathbf{I})$.
• Form a system of linear diophantine equations: $\mathbf{i}A + \mathbf{a}_0 = \mathbf{j}B + \mathbf{b}_0$
• Use the echelon reduction algorithm to solve the system
• If the system has an integer solution, the dependence may exist; otherwise, there is no dependence
• A solution of the diophantine system must still be checked against the loop nest bounds
• This leads to a system of linear inequalities that may be solved with the Fourier method of elimination (see the sketch below)
• If that system has an integer solution the dependence exists; otherwise, no dependence exists within the loop bounds
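For the one-dimensional case (a single subscript pair, $X[a_1 i + a_0]$ written and $X[b_1 j + b_0]$ read), the two stages can be sketched in a few lines of C. This is an illustrative helper, not the thesis implementation: the integer-solution test is the classic GCD test, and the bounds check is done by enumeration instead of Fourier elimination to keep the sketch short.

```c
#include <stdbool.h>

/* Greatest common divisor (Euclid's algorithm). */
static int gcd(int a, int b)
{
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a < 0 ? -a : a;
}

/* One-dimensional dependence test for the statement pair
 *   S: X[a1*i + a0] = ...      T: ... = X[b1*j + b0]
 * inside "do i = p, q". Stage 1: the diophantine equation
 * a1*i - b1*j = b0 - a0 has an integer solution iff
 * gcd(a1, b1) divides b0 - a0. Stage 2: check the loop
 * bounds (by enumeration here, standing in for Fourier
 * elimination). */
static bool depends_1d(int a1, int a0, int b1, int b0, int p, int q)
{
    int g = gcd(a1, b1);
    if (g != 0 && (b0 - a0) % g != 0)
        return false;                 /* no integer solution at all */
    for (int i = p; i <= q; ++i)      /* stage 2: bounds check */
        for (int j = p; j <= q; ++j)
            if (a1 * i + a0 == b1 * j + b0)
                return true;          /* dependence within bounds */
    return false;
}
```

For example, `depends_1d(1, 0, 1, -1, 1, 100)` confirms the dependence in a nest body of the form `X[i] = X[i-1]`.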
Loop Transformations
• Loop transformations restructure the loop nest in order to better expose parallelism for the target architecture.
$L_1$: do $I_1 = 1, 100$
$L_2$:   do $I_2 = 1, 100000$
$S$:       $X(I_1, I_2) = X(I_1 - 1, I_2)$
         enddo
       enddo
Loop transformation problem: based on the dependence constraints and the target architecture details, transform the given loop nest to better expose parallelism while respecting the dependence constraints.
• Decide which transformation (if any) to apply to better expose parallelism
• C2CUDA supports the family of so-called unimodular transformations
• Outer loop parallelization, inner loop parallelization, loop permutation, loop interchange
In the nest above, the outermost loop carries the dependence. Outer loop parallelization yields:

$LU_1$: do $K_1 = 1, 100000$
$LU_2$:   do $K_2 = 1, 100$
$S$:        $X(K_2, K_1) = X(K_2 - 1, K_1)$
          enddo
        enddo

Now the innermost loop carries the dependence, so the iterations of the outer loop are independent.
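For this example, outer loop parallelization amounts to loop interchange, and its legality and effect can be read off the dependence distance vector (a standard calculation in the unimodular framework, written out here for the example above):

```latex
% S: X(I_1, I_2) = X(I_1 - 1, I_2) has distance vector d = (1, 0),
% i.e. the dependence is carried by the outermost loop. Interchange
% is the unimodular map K = I U, and distances transform the same way:
U = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix},
\qquad
d' = d\,U = (1,\ 0)\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} = (0,\ 1).
% d' = (0, 1) is lexicographically positive (so the interchange is
% legal) and is carried by the innermost loop, so the new outer loop
% K_1 can run in parallel.
```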
C2CUDA Compiler
• C2CUDA is an automatic parallelizer designed and implemented as part of this thesis.
• It is capable of translating relatively simple ANSI C programs into their CUDA C
equivalents.
• C2CUDA Utilization: backbone for the rest of the components (echelon reduction, diagonalization, diophantine solver, Fourier elimination solver, common matrix/vector operations)
• C2CUDA Dependence: implements the data dependence analyzer (general and uniform data dependence algorithms)
• C2CUDA Transformation: implements the loop transformation and restructuring engine (supports outer and inner loop parallelization)
• C2CUDA Frontend: implements a parser (uses a CFG as the intermediate representation), the transformation frontend and the code generator (which produces the final CUDA C code)
C2CUDA Process Diagram
• The following is a high-level process diagram of the C2CUDA compiler, showing how the components mutually interoperate:

Sequential C code → C2CUDA Frontend → (loop nest) → C2CUDA Dependence → (dependence information) → C2CUDA Transformation → (transformation) → C2CUDA Frontend → CUDA C code

All of the components build on C2CUDA Utilization.
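A driver for this pipeline might look as follows. This is a minimal sketch: every type and function name here is hypothetical, invented to illustrate the data flow of the diagram, and is not the actual C2CUDA API.

```c
#include <stdio.h>

/* Hypothetical stub types standing in for C2CUDA's real IR structures. */
typedef struct { const char *src; } LoopNest;
typedef struct { int carried_by_outer; } DepInfo;
typedef struct { const char *name; } Transform;

/* Stubs for the components (names invented for illustration). */
static LoopNest frontend_extract_nest(const char *c_source)
{ return (LoopNest){ c_source }; }                /* parse, extract nest */

static DepInfo dependence_analyze(const LoopNest *nest)
{ (void)nest; return (DepInfo){ 1 }; }            /* dependence analysis */

static Transform transformation_select(const LoopNest *nest, const DepInfo *deps)
{ (void)nest;
  return (Transform){ deps->carried_by_outer
                      ? "outer loop parallelization" : "none" }; }

static int frontend_emit_cuda(const LoopNest *nest, const Transform *t,
                              const char *out)
{ printf("emit %s via %s -> %s\n", nest->src, t->name, out); return 0; }

/* The pipeline of the process diagram, stage by stage. */
int c2cuda_compile(const char *c_source, const char *cuda_out)
{
    LoopNest nest = frontend_extract_nest(c_source);
    DepInfo deps = dependence_analyze(&nest);
    Transform t = transformation_select(&nest, &deps);
    return frontend_emit_cuda(&nest, &t, cuda_out);
}
```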
Conclusions and Experiments. Matrix Addition.

Execution times in seconds:

| Matrix dimension | Sequential | Parallel (without transfer overhead) | Parallel (with transfer overhead) |
|---|---|---|---|
| 1024 × 1024 | 0.005 | 0.001 | 0.011 |
| 2048 × 3072 | 0.030 | 0.003 | 0.059 |
| 4096 × 2048 | 0.04 | 0.004 | 0.078 |
| 4096 × 4096 | 0.075 | 0.008 | 0.15 |
| 8192 × 4096 | 0.147 | 0.016 | 0.302 |
| 8192 × 8192 | 0.314 | 0.032 | 0.598 |
| 10240 × 8192 | 0.409 | 0.04 | 0.761 |
| 8192 × 16384 | 0.689 | 0.06 | 1.212 |
unsigned int i = 0;
for (; i < n_rows; ++i)
{
unsigned int j = 0;
for (; j < n_cols; ++j)
c[i][j] = a[i][j] + b[i][j];
}
Sequential C program
Dependence-free nest
__global__ void MatrixAdditionC2CUDA(int *a, int *b, int *c,
    const unsigned int n_rows, const unsigned int n_cols)
{
    unsigned int i = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int j = blockIdx.x * blockDim.x + threadIdx.x;
    /* Discard threads outside the matrix (the grid is rounded
       up to whole blocks). */
    if (i >= n_rows || j >= n_cols)
        return;
    unsigned int idx = i * n_cols + j;  /* row-major flattened index */
    c[idx] = a[idx] + b[idx];
}
C2CUDA generated matrix addition
Every CUDA thread maps to exactly one nest iteration!
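The host side of the launch is not shown on the slide; a minimal sketch might look like the following, where the 16×16 block size is an illustrative assumption (not necessarily what C2CUDA emits) and the device buffers are assumed to be already allocated and populated.

```cuda
// Hypothetical host-side launch for the generated kernel above.
// d_a, d_b, d_c are assumed to be device buffers allocated with
// cudaMalloc and filled via cudaMemcpy beforehand.
void LaunchMatrixAddition(int *d_a, int *d_b, int *d_c,
                          unsigned int n_rows, unsigned int n_cols)
{
    dim3 block(16, 16);  // illustrative block size
    // Round the grid up so every matrix element gets a thread;
    // the kernel's bounds check discards the excess threads.
    dim3 grid((n_cols + block.x - 1) / block.x,
              (n_rows + block.y - 1) / block.y);
    MatrixAdditionC2CUDA<<<grid, block>>>(d_a, d_b, d_c, n_rows, n_cols);
    cudaDeviceSynchronize();
}
```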
Conclusions and Experiments. Matrix Addition.

Speedups:

| ID | Matrix dimension | Speedup (without transfer overhead) | Speedup (with transfer overhead) |
|---|---|---|---|
| MD_1 | 1024 × 1024 | 6.06 | 0.43 |
| MD_2 | 2048 × 3072 | 9.64 | 0.52 |
| MD_3 | 4096 × 2048 | 9.48 | 0.50 |
| MD_4 | 4096 × 4096 | 9.23 | 0.49 |
| MD_5 | 8192 × 4096 | 9.01 | 0.48 |
| MD_6 | 8192 × 8192 | 9.67 | 0.52 |
| MD_7 | 10240 × 8192 | 10.26 | 0.54 |
| MD_8 | 8192 × 16384 | 11.44 | 0.57 |
Conclusions and Experiments. Naïve Matrix Multiplication.
unsigned int i = 0;
for (; i < a_n_rows; ++i)
{
unsigned int j = 0;
for (; j < b_n_cols; ++j)
{
c[i][j] = 0;
unsigned int k = 0;
for (; k < a_n_cols; ++k)
{
c[i][j] = c[i][j] +
a[i][k] * b[k][j];
}
}
}
Sequential C program
Innermost loop carries dependence
__global__ void MatrixMultC2CUDA(const int *a, const int *b,
int *c, const unsigned int a_n_rows, const unsigned int
a_n_cols, const unsigned int b_n_cols)
{
unsigned int i = blockIdx.y * blockDim.y + threadIdx.y;
unsigned int j = blockIdx.x * blockDim.x + threadIdx.x;
if (i >= a_n_rows || j >= b_n_cols)
return;
unsigned int c_idx = i * b_n_cols + j;
c[c_idx] = 0;
unsigned int k = 0;
for (; k < a_n_cols; ++k)
{
unsigned int a_idx = i * a_n_cols + k;
unsigned int b_idx = k * b_n_cols + j;
c[c_idx] = c[c_idx] + a[a_idx] * b[b_idx];
}
}
C2CUDA generated matrix multiplication
Each CUDA thread executes the entire innermost loop.
Execution times in seconds:

| Matrices dimensions | Sequential | Parallel (without transfer overhead) | Parallel (with transfer overhead) |
|---|---|---|---|
| A = 256 × 512, B = 512 × 256 | 0.077 | 0.006 | 0.007 |
| A = 512 × 512, B = 512 × 512 | 0.297 | 0.024 | 0.026 |
| A = 1024 × 512, B = 512 × 1024 | 1.251 | 0.095 | 0.1 |
| A = 1024 × 1024, B = 1024 × 1024 | 3.662 | 0.192 | 0.201 |
| A = 2048 × 1024, B = 1024 × 512 | 2.489 | 0.195 | 0.206 |
| A = 2048 × 1024, B = 1024 × 2048 | 32.719 | 0.746 | 0.770 |
| A = 2048 × 2048, B = 2048 × 2048 | 74.143 | 1.493 | 1.527 |
| A = 4096 × 512, B = 512 × 4096 | 33.952 | 1.475 | 1.537 |
| A = 4096 × 2048, B = 2048 × 1024 | 71.344 | 1.469 | 1.509 |
Conclusions and Experiments. Naïve Matrix Multiplication.

Speedups:

| ID | Matrices dimensions | Speedup (without transfer overhead) | Speedup (with transfer overhead) |
|---|---|---|---|
| MD_1 | A = 256 × 512, B = 512 × 256 | 12.84 | 10.54 |
| MD_2 | A = 512 × 512, B = 512 × 512 | 12.34 | 11.26 |
| MD_3 | A = 1024 × 512, B = 512 × 1024 | 13.14 | 12.39 |
| MD_4 | A = 1024 × 1024, B = 1024 × 1024 | 19.04 | 18.21 |
| MD_5 | A = 2048 × 1024, B = 1024 × 512 | 12.76 | 12.10 |
| MD_6 | A = 2048 × 1024, B = 1024 × 2048 | 43.83 | 42.47 |
| MD_7 | A = 2048 × 2048, B = 2048 × 2048 | 48.98 | 47.87 |
| MD_8 | A = 4096 × 512, B = 512 × 4096 | 23.02 | 22.02 |
| MD_9 | A = 4096 × 2048, B = 2048 × 1024 | 48.54 | 47.26 |
Conclusions and Experiments. Magnetic Resonance Imaging – Q.
void ComputeQCPU(int numK, int numX, kValues *kVals,
float* x, float* y, float* z, float *Qr, float *Qi)
{
int indexK, indexX;
for (indexK = 0; indexK < numK; indexK++)
{
for (indexX = 0; indexX < numX; indexX++)
{
float expArg = PIx2 * (kVals[indexK].Kx * x[indexX] +
kVals[indexK].Ky * y[indexX] +
kVals[indexK].Kz * z[indexX]);
float cosArg = cosf(expArg);
float sinArg = sinf(expArg);
float phi = kVals[indexK].PhiMag;
Qr[indexX] += phi * cosArg;
Qi[indexX] += phi * sinArg;
}
}
}
Sequential C program
The outermost loop carries the dependence. Outer loop parallelization yields:
int k1, k2;
for (k1 = 0; k1 < numX; k1++)
{
    for (k2 = 0; k2 < numK; k2++)
    {
        float expArg = PIx2 * (kVals[k2].Kx * x[k1] +
                               kVals[k2].Ky * y[k1] +
                               kVals[k2].Kz * z[k1]);
        float cosArg = cosf(expArg);
        float sinArg = sinf(expArg);
        float phi = kVals[k2].PhiMag;
        Qr[k1] += phi * cosArg;
        Qi[k1] += phi * sinArg;
    }
}
Dependence pushed down to the innermost loop.
C2CUDA generated MRI-Q computation
__global__ void ComputeQC2CUDA(int numK, int numX, kValues *kVals, float* x,
                               float* y, float* z, float *Qr, float *Qi) {
    unsigned int k1 = blockIdx.x * blockDim.x + threadIdx.x;
    /* Discard threads outside the dataset (the grid is rounded
       up to whole blocks). */
    if (k1 >= numX)
        return;
    Qr[k1] = 0.0f;
    Qi[k1] = 0.0f;
    unsigned int k2 = 0;
    for (; k2 < numK; k2++) {
        float expArg = PIx2 * (kVals[k2].Kx * x[k1] +
                               kVals[k2].Ky * y[k1] +
                               kVals[k2].Kz * z[k1]);
        float cosArg = cosf(expArg);
        float sinArg = sinf(expArg);
        float phi = kVals[k2].PhiMag;
        Qr[k1] += phi * cosArg;
        Qi[k1] += phi * sinArg;
    }
}
Each CUDA thread executes the complete innermost loop.
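Unlike the matrix kernels, which use a two-dimensional grid, this kernel maps the single parallel loop to a one-dimensional grid. The host side is not shown on the slide; a minimal sketch follows, with the block size of 256 being an illustrative assumption rather than the thesis setting.

```cuda
// Hypothetical 1-D launch for the MRI-Q kernel above. The device
// buffers are assumed to be cudaMalloc'ed and populated beforehand.
void LaunchComputeQ(int numK, int numX, kValues *d_kVals,
                    float *d_x, float *d_y, float *d_z,
                    float *d_Qr, float *d_Qi)
{
    const int block = 256;                        // illustrative block size
    const int grid = (numX + block - 1) / block;  // one thread per k1
    ComputeQC2CUDA<<<grid, block>>>(numK, numX, d_kVals,
                                    d_x, d_y, d_z, d_Qr, d_Qi);
    cudaDeviceSynchronize();
}
```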
Conclusions and Experiments. Magnetic Resonance Imaging – Q.

Execution times in seconds:

| Dataset size | Sequential | Parallel (without transfer overhead) | Parallel (with transfer overhead) |
|---|---|---|---|
| numK = 512, numX = 32768 | 0.492 | 0.008 | 0.009 |
| numK = 1024, numX = 65536 | 1.855 | 0.034 | 0.034 |
| numK = 1024, numX = 131072 | 3.704 | 0.067 | 0.068 |
| numK = 2048, numX = 131072 | 7.415 | 0.133 | 0.134 |
| numK = 2048, numX = 262144 | 22.203 | 0.261 | 0.263 |
| numK = 2048, numX = 393216 | 33.317 | 0.381 | 0.384 |
| numK = 2048, numX = 524288 | 44.563 | 0.493 | 0.497 |
| numK = 4096, numX = 786432 | 133.301 | 1.389 | 1.395 |

Speedups:

| ID | Dataset size | Speedup (without transfer overhead) | Speedup (with transfer overhead) |
|---|---|---|---|
| DS_1 | numK = 512, numX = 32768 | 58.53 | 54.17 |
| DS_2 | numK = 1024, numX = 65536 | 55.32 | 54.15 |
| DS_3 | numK = 1024, numX = 131072 | 55.33 | 54.38 |
| DS_4 | numK = 2048, numX = 131072 | 56.07 | 55.49 |
| DS_5 | numK = 2048, numX = 262144 | 85.01 | 84.34 |
| DS_6 | numK = 2048, numX = 393216 | 87.53 | 86.87 |
| DS_7 | numK = 2048, numX = 524288 | 90.35 | 89.69 |
| DS_8 | numK = 4096, numX = 786432 | 95.95 | 95.59 |
Improvements and Future Work. C2CUDA.

Improvements:
• Negative loop strides
• Non-unit loop strides
• Extend the spectrum of supported transformations
• Improve the heuristic for selecting transformations
• Improve the mapping of CUDA threads to loop iterations
• Support for conditional statements
• Utilize CUDA shared and constant memory
• Support other loop types (while, do-while, mixed)
• Speed-up gain estimation algorithm
• Runtime analysis
• Pointer analysis