Master Thesis:
Automatic Program Parallelization
for GPU Environment
Candidate: Džanan Bajgorić
Supervisor: Jan Kwiatkowski, DSc
Wrocław, 2016
Agenda
1. Automatic Parallelization
   • Data Dependence Analysis
   • Loop Transformations
2. C2CUDA Compiler
   • Compiler Organization
   • Process Diagram
3. Experiments and Conclusions
   • Matrix Addition
   • Naïve Matrix Multiplication
   • Magnetic Resonance Imaging – Q
4. Future Work and Potential Improvements
What is Automatic Parallelization?
• Automatic parallelization is a technique by which a compiler automatically translates a sequential program into its parallel equivalent.
Main Goals.

Data Dependence Analysis
• Mathematical apparatus
• Linear data dependence problem
• General and uniform dependence algorithms

Loop Transformations and Restructuring
• Unimodular loop transformations
• Outer loop parallelization
• Inner loop parallelization

GPU Architecture and CUDA Model
• Intimate understanding of the GPU architecture
• CUDA architecture and programming model

C2CUDA Compiler
• Design and implementation of the C2CUDA automatic parallelizer
• Applying C2CUDA to a number of test programs and recording the results
Data Dependence Analysis
• Data dependence analysis consists of finding all dependent statements in a given program. This thesis is limited to solving the data dependence problem at the level of a single perfect loop nest.
$L_1$:  do $I_1 = p_1, q_1$
          ⋮
$L_m$:    do $I_m = p_m, q_m$
$S$:        $X(\mathbf{I}A + \mathbf{a}_0) = \cdots$
$T$:        $\cdots = X(\mathbf{I}B + \mathbf{b}_0)$
          enddo
          ⋮
        enddo
Dependence problem: find all data dependences between the statements in the nest body $H(\mathbf{I})$.
• Form a system of linear diophantine equations: $\mathbf{i}A + \mathbf{a}_0 = \mathbf{j}B + \mathbf{b}_0$
• Use the echelon reduction algorithm to solve the system
• If the system has an integer solution, the dependence may exist; otherwise, there is no dependence
• A solution of the diophantine system must still be checked against the loop nest bounds
• This leads to a system of linear inequalities that may be solved with the Fourier method of elimination (see the sketch below)
• If that system has an integer solution the dependence exists; otherwise, no dependence exists within the loop bounds
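For the one-dimensional case (a single subscript pair, $X[a_1 i + a_0]$ written and $X[b_1 j + b_0]$ read), the two stages can be sketched in a few lines of C. This is an illustrative helper, not the thesis implementation: the integer-solution test is the classic GCD test, and the bounds check is done by enumeration instead of Fourier elimination to keep the sketch short.

```c
#include <stdbool.h>

/* Greatest common divisor (Euclid's algorithm). */
static int gcd(int a, int b)
{
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a < 0 ? -a : a;
}

/* One-dimensional dependence test for the statement pair
 *   S: X[a1*i + a0] = ...      T: ... = X[b1*j + b0]
 * inside "do i = p, q". Stage 1: the diophantine equation
 * a1*i - b1*j = b0 - a0 has an integer solution iff
 * gcd(a1, b1) divides b0 - a0. Stage 2: check the loop
 * bounds (by enumeration here, standing in for Fourier
 * elimination). */
static bool depends_1d(int a1, int a0, int b1, int b0, int p, int q)
{
    int g = gcd(a1, b1);
    if (g != 0 && (b0 - a0) % g != 0)
        return false;                 /* no integer solution at all */
    for (int i = p; i <= q; ++i)      /* stage 2: bounds check */
        for (int j = p; j <= q; ++j)
            if (a1 * i + a0 == b1 * j + b0)
                return true;          /* dependence within bounds */
    return false;
}
```

For example, `depends_1d(1, 0, 1, -1, 1, 100)` confirms the dependence in a nest body of the form `X[i] = X[i-1]`.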
Loop Transformations
• Loop transformations restructure the loop nest in order to better expose parallelism for the target architecture.
$L_1$: do $I_1 = 1, 100$
$L_2$:   do $I_2 = 1, 100000$
$S$:       $X(I_1, I_2) = X(I_1 - 1, I_2)$
         enddo
       enddo
Loop transformation problem: based on the dependence constraints and the target architecture details, transform the given loop nest to better expose parallelism while respecting the dependence constraints.
• Decide which transformation (if any) to apply to better expose parallelism
• C2CUDA supports the family of so-called unimodular transformations
• Outer loop parallelization, inner loop parallelization, loop permutation, loop interchange
In the nest above, the outermost loop carries the dependence. Outer loop parallelization yields:

$LU_1$: do $K_1 = 1, 100000$
$LU_2$:   do $K_2 = 1, 100$
$S$:        $X(K_2, K_1) = X(K_2 - 1, K_1)$
          enddo
        enddo

Now the innermost loop carries the dependence, so the iterations of the outer loop are independent.
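For this example, outer loop parallelization amounts to loop interchange, and its legality and effect can be read off the dependence distance vector (a standard calculation in the unimodular framework, written out here for the example above):

```latex
% S: X(I_1, I_2) = X(I_1 - 1, I_2) has distance vector d = (1, 0),
% i.e. the dependence is carried by the outermost loop. Interchange
% is the unimodular map K = I U, and distances transform the same way:
U = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix},
\qquad
d' = d\,U = (1,\ 0)\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} = (0,\ 1).
% d' = (0, 1) is lexicographically positive (so the interchange is
% legal) and is carried by the innermost loop, so the new outer loop
% K_1 can run in parallel.
```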
C2CUDA Compiler
• C2CUDA is an automatic parallelizer designed and implemented as part of this thesis.
• It is capable of translating relatively simple ANSI C programs into their CUDA C
equivalents.
• C2CUDA Utilization: backbone for the rest of the components (echelon reduction, diagonalization, diophantine solver, Fourier elimination solver, common matrix/vector operations)
• C2CUDA Dependence: implements the data dependence analyzer (general and uniform data dependence algorithms)
• C2CUDA Transformation: implements the loop transformation and restructuring engine (supports outer and inner loop parallelization)
• C2CUDA Frontend: implements a parser (uses a CFG as the intermediate representation), the transformation frontend and the code generator (which produces the final CUDA C code)
C2CUDA Process Diagram
• The following is a high-level process diagram of the C2CUDA compiler, showing how the components mutually interoperate:

Sequential C code → C2CUDA Frontend → (loop nest) → C2CUDA Dependence → (dependence information) → C2CUDA Transformation → (transformation) → C2CUDA Frontend → CUDA C code

All of the components build on C2CUDA Utilization.
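A driver for this pipeline might look as follows. This is a minimal sketch: every type and function name here is hypothetical, invented to illustrate the data flow of the diagram, and is not the actual C2CUDA API.

```c
#include <stdio.h>

/* Hypothetical stub types standing in for C2CUDA's real IR structures. */
typedef struct { const char *src; } LoopNest;
typedef struct { int carried_by_outer; } DepInfo;
typedef struct { const char *name; } Transform;

/* Stubs for the components (names invented for illustration). */
static LoopNest frontend_extract_nest(const char *c_source)
{ return (LoopNest){ c_source }; }                /* parse, extract nest */

static DepInfo dependence_analyze(const LoopNest *nest)
{ (void)nest; return (DepInfo){ 1 }; }            /* dependence analysis */

static Transform transformation_select(const LoopNest *nest, const DepInfo *deps)
{ (void)nest;
  return (Transform){ deps->carried_by_outer
                      ? "outer loop parallelization" : "none" }; }

static int frontend_emit_cuda(const LoopNest *nest, const Transform *t,
                              const char *out)
{ printf("emit %s via %s -> %s\n", nest->src, t->name, out); return 0; }

/* The pipeline of the process diagram, stage by stage. */
int c2cuda_compile(const char *c_source, const char *cuda_out)
{
    LoopNest nest = frontend_extract_nest(c_source);
    DepInfo deps = dependence_analyze(&nest);
    Transform t = transformation_select(&nest, &deps);
    return frontend_emit_cuda(&nest, &t, cuda_out);
}
```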
Conclusions and Experiments. Matrix Addition.

Execution times in seconds:

| Matrix dimension | Sequential | Parallel (without transfer overhead) | Parallel (with transfer overhead) |
|---|---|---|---|
| 1024 × 1024 | 0.005 | 0.001 | 0.011 |
| 2048 × 3072 | 0.030 | 0.003 | 0.059 |
| 4096 × 2048 | 0.04 | 0.004 | 0.078 |
| 4096 × 4096 | 0.075 | 0.008 | 0.15 |
| 8192 × 4096 | 0.147 | 0.016 | 0.302 |
| 8192 × 8192 | 0.314 | 0.032 | 0.598 |
| 10240 × 8192 | 0.409 | 0.04 | 0.761 |
| 8192 × 16384 | 0.689 | 0.06 | 1.212 |
unsigned int i = 0;
for (; i < n_rows; ++i)
{
unsigned int j = 0;
for (; j < n_cols; ++j)
c[i][j] = a[i][j] + b[i][j];
}
Sequential C program
Dependence-free nest
__global__ void MatrixAdditionC2CUDA(int *a, int *b, int *c,
    const unsigned int n_rows, const unsigned int n_cols)
{
    unsigned int i = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int j = blockIdx.x * blockDim.x + threadIdx.x;
    /* Discard threads outside the matrix (the grid is rounded
       up to whole blocks). */
    if (i >= n_rows || j >= n_cols)
        return;
    unsigned int idx = i * n_cols + j;  /* row-major flattened index */
    c[idx] = a[idx] + b[idx];
}
C2CUDA generated matrix addition
Every CUDA thread maps to exactly one nest iteration!
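The host side of the launch is not shown on the slide; a minimal sketch might look like the following, where the 16×16 block size is an illustrative assumption (not necessarily what C2CUDA emits) and the device buffers are assumed to be already allocated and populated.

```cuda
// Hypothetical host-side launch for the generated kernel above.
// d_a, d_b, d_c are assumed to be device buffers allocated with
// cudaMalloc and filled via cudaMemcpy beforehand.
void LaunchMatrixAddition(int *d_a, int *d_b, int *d_c,
                          unsigned int n_rows, unsigned int n_cols)
{
    dim3 block(16, 16);  // illustrative block size
    // Round the grid up so every matrix element gets a thread;
    // the kernel's bounds check discards the excess threads.
    dim3 grid((n_cols + block.x - 1) / block.x,
              (n_rows + block.y - 1) / block.y);
    MatrixAdditionC2CUDA<<<grid, block>>>(d_a, d_b, d_c, n_rows, n_cols);
    cudaDeviceSynchronize();
}
```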
Conclusions and Experiments. Matrix Addition.

Speedups:

| ID | Matrix dimension | Speedup (without transfer overhead) | Speedup (with transfer overhead) |
|---|---|---|---|
| MD_1 | 1024 × 1024 | 6.06 | 0.43 |
| MD_2 | 2048 × 3072 | 9.64 | 0.52 |
| MD_3 | 4096 × 2048 | 9.48 | 0.50 |
| MD_4 | 4096 × 4096 | 9.23 | 0.49 |
| MD_5 | 8192 × 4096 | 9.01 | 0.48 |
| MD_6 | 8192 × 8192 | 9.67 | 0.52 |
| MD_7 | 10240 × 8192 | 10.26 | 0.54 |
| MD_8 | 8192 × 16384 | 11.44 | 0.57 |
Conclusions and Experiments. Naïve Matrix Multiplication.
unsigned int i = 0;
for (; i < a_n_rows; ++i)
{
unsigned int j = 0;
for (; j < b_n_cols; ++j)
{
c[i][j] = 0;
unsigned int k = 0;
for (; k < a_n_cols; ++k)
{
c[i][j] = c[i][j] +
a[i][k] * b[k][j];
}
}
}
Sequential C program
Innermost loop carries dependence
__global__ void MatrixMultC2CUDA(const int *a, const int *b,
int *c, const unsigned int a_n_rows, const unsigned int
a_n_cols, const unsigned int b_n_cols)
{
unsigned int i = blockIdx.y * blockDim.y + threadIdx.y;
unsigned int j = blockIdx.x * blockDim.x + threadIdx.x;
if (i >= a_n_rows || j >= b_n_cols)
return;
unsigned int c_idx = i * b_n_cols + j;
c[c_idx] = 0;
unsigned int k = 0;
for (; k < a_n_cols; ++k)
{
unsigned int a_idx = i * a_n_cols + k;
unsigned int b_idx = k * b_n_cols + j;
c[c_idx] = c[c_idx] + a[a_idx] * b[b_idx];
}
}
C2CUDA generated matrix multiplication
Each CUDA thread executes the entire innermost loop.
Execution times in seconds:

| Matrices dimensions | Sequential | Parallel (without transfer overhead) | Parallel (with transfer overhead) |
|---|---|---|---|
| A = 256 × 512, B = 512 × 256 | 0.077 | 0.006 | 0.007 |
| A = 512 × 512, B = 512 × 512 | 0.297 | 0.024 | 0.026 |
| A = 1024 × 512, B = 512 × 1024 | 1.251 | 0.095 | 0.1 |
| A = 1024 × 1024, B = 1024 × 1024 | 3.662 | 0.192 | 0.201 |
| A = 2048 × 1024, B = 1024 × 512 | 2.489 | 0.195 | 0.206 |
| A = 2048 × 1024, B = 1024 × 2048 | 32.719 | 0.746 | 0.770 |
| A = 2048 × 2048, B = 2048 × 2048 | 74.143 | 1.493 | 1.527 |
| A = 4096 × 512, B = 512 × 4096 | 33.952 | 1.475 | 1.537 |
| A = 4096 × 2048, B = 2048 × 1024 | 71.344 | 1.469 | 1.509 |
Conclusions and Experiments. Naïve Matrix Multiplication.

Speedups:

| ID | Matrices dimensions | Speedup (without transfer overhead) | Speedup (with transfer overhead) |
|---|---|---|---|
| MD_1 | A = 256 × 512, B = 512 × 256 | 12.84 | 10.54 |
| MD_2 | A = 512 × 512, B = 512 × 512 | 12.34 | 11.26 |
| MD_3 | A = 1024 × 512, B = 512 × 1024 | 13.14 | 12.39 |
| MD_4 | A = 1024 × 1024, B = 1024 × 1024 | 19.04 | 18.21 |
| MD_5 | A = 2048 × 1024, B = 1024 × 512 | 12.76 | 12.10 |
| MD_6 | A = 2048 × 1024, B = 1024 × 2048 | 43.83 | 42.47 |
| MD_7 | A = 2048 × 2048, B = 2048 × 2048 | 48.98 | 47.87 |
| MD_8 | A = 4096 × 512, B = 512 × 4096 | 23.02 | 22.02 |
| MD_9 | A = 4096 × 2048, B = 2048 × 1024 | 48.54 | 47.26 |
Conclusions and Experiments. Magnetic Resonance Imaging – Q.
void ComputeQCPU(int numK, int numX, kValues *kVals,
float* x, float* y, float* z, float *Qr, float *Qi)
{
int indexK, indexX;
for (indexK = 0; indexK < numK; indexK++)
{
for (indexX = 0; indexX < numX; indexX++)
{
float expArg = PIx2 * (kVals[indexK].Kx * x[indexX] +
kVals[indexK].Ky * y[indexX] +
kVals[indexK].Kz * z[indexX]);
float cosArg = cosf(expArg);
float sinArg = sinf(expArg);
float phi = kVals[indexK].PhiMag;
Qr[indexX] += phi * cosArg;
Qi[indexX] += phi * sinArg;
}
}
}
Sequential C program
The outermost loop carries the dependence. Outer loop parallelization yields:
int k1, k2;
for (k1 = 0; k1 < numX; k1++)
{
    for (k2 = 0; k2 < numK; k2++)
    {
        float expArg = PIx2 * (kVals[k2].Kx * x[k1] +
                               kVals[k2].Ky * y[k1] +
                               kVals[k2].Kz * z[k1]);
        float cosArg = cosf(expArg);
        float sinArg = sinf(expArg);
        float phi = kVals[k2].PhiMag;
        Qr[k1] += phi * cosArg;
        Qi[k1] += phi * sinArg;
    }
}
Dependence pushed down to the innermost loop.
C2CUDA generated MRI-Q computation
__global__ void ComputeQC2CUDA(int numK, int numX, kValues *kVals, float* x,
                               float* y, float* z, float *Qr, float *Qi) {
    unsigned int k1 = blockIdx.x * blockDim.x + threadIdx.x;
    /* Discard threads outside the dataset (the grid is rounded
       up to whole blocks). */
    if (k1 >= numX)
        return;
    Qr[k1] = 0.0f;
    Qi[k1] = 0.0f;
    unsigned int k2 = 0;
    for (; k2 < numK; k2++) {
        float expArg = PIx2 * (kVals[k2].Kx * x[k1] +
                               kVals[k2].Ky * y[k1] +
                               kVals[k2].Kz * z[k1]);
        float cosArg = cosf(expArg);
        float sinArg = sinf(expArg);
        float phi = kVals[k2].PhiMag;
        Qr[k1] += phi * cosArg;
        Qi[k1] += phi * sinArg;
    }
}
Each CUDA thread executes the complete innermost loop.
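Unlike the matrix kernels, which use a two-dimensional grid, this kernel maps the single parallel loop to a one-dimensional grid. The host side is not shown on the slide; a minimal sketch follows, with the block size of 256 being an illustrative assumption rather than the thesis setting.

```cuda
// Hypothetical 1-D launch for the MRI-Q kernel above. The device
// buffers are assumed to be cudaMalloc'ed and populated beforehand.
void LaunchComputeQ(int numK, int numX, kValues *d_kVals,
                    float *d_x, float *d_y, float *d_z,
                    float *d_Qr, float *d_Qi)
{
    const int block = 256;                        // illustrative block size
    const int grid = (numX + block - 1) / block;  // one thread per k1
    ComputeQC2CUDA<<<grid, block>>>(numK, numX, d_kVals,
                                    d_x, d_y, d_z, d_Qr, d_Qi);
    cudaDeviceSynchronize();
}
```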
Conclusions and Experiments. Magnetic Resonance Imaging – Q.

Execution times in seconds:

| Dataset size | Sequential | Parallel (without transfer overhead) | Parallel (with transfer overhead) |
|---|---|---|---|
| numK = 512, numX = 32768 | 0.492 | 0.008 | 0.009 |
| numK = 1024, numX = 65536 | 1.855 | 0.034 | 0.034 |
| numK = 1024, numX = 131072 | 3.704 | 0.067 | 0.068 |
| numK = 2048, numX = 131072 | 7.415 | 0.133 | 0.134 |
| numK = 2048, numX = 262144 | 22.203 | 0.261 | 0.263 |
| numK = 2048, numX = 393216 | 33.317 | 0.381 | 0.384 |
| numK = 2048, numX = 524288 | 44.563 | 0.493 | 0.497 |
| numK = 4096, numX = 786432 | 133.301 | 1.389 | 1.395 |

Speedups:

| ID | Dataset size | Speedup (without transfer overhead) | Speedup (with transfer overhead) |
|---|---|---|---|
| DS_1 | numK = 512, numX = 32768 | 58.53 | 54.17 |
| DS_2 | numK = 1024, numX = 65536 | 55.32 | 54.15 |
| DS_3 | numK = 1024, numX = 131072 | 55.33 | 54.38 |
| DS_4 | numK = 2048, numX = 131072 | 56.07 | 55.49 |
| DS_5 | numK = 2048, numX = 262144 | 85.01 | 84.34 |
| DS_6 | numK = 2048, numX = 393216 | 87.53 | 86.87 |
| DS_7 | numK = 2048, numX = 524288 | 90.35 | 89.69 |
| DS_8 | numK = 4096, numX = 786432 | 95.95 | 95.59 |
Improvements and Future Work. C2CUDA.

Improvements:
• Negative loop strides
• Non-unit loop strides
• Extend the spectrum of supported transformations
• Improve the heuristic for selecting transformations
• Improve the mapping of CUDA threads to loop iterations
• Support for conditional statements
• Utilize CUDA shared and constant memory
• Support other loop types (while, do-while, mixed)
• Speed-up gain estimation algorithm
• Runtime analysis
• Pointer analysis