GPUs for GEC Competition @ GECCO-2013

GECCO 2013 GPUs for GEC
GECCO 2013 GPUs for Genetic and
Evolutionary Computation Competition
Daniele Loiacono and Antonino Tumeo

Why GPUs?
!  The GPU has evolved into a very flexible and powerful processor:
" It’s programmable using high-level languages
" Now supports 32-bit and 64-bit floating point IEEE-754 precision
" It offers lots of GFLOPS
!  GPU in every PC and workstation

!  Goal
" Attract the applications of genetic and evolutionary
computation that can maximally exploit the parallelism
provided by low-cost consumer graphical cards.
!  Evaluation
" 50% – Quality and Performance
" 30% - Relevance for EC community
" 20% – Novelty
!  Panel
… and myself
This competition…
Simon Harding El-Ghazali TalbiAntonino TumeoJaume Bacardit

Entries
“Fast QAP Solver with ACO and Taboo Search on Multiple GPUs
with the Move-Cost Adjusted Thread Assignment”.
Shigeyoshi Tsutsui and Noriyuki Fujimoto
“GPOCL: A Massively Parallel GP Implementation in OpenCL”
Douglas A. Augusto Helio J.C. Barbosa

GECCO Competitions: GPUs%for%Gene+c%and%Evolu+onary%Computa+on,%Evolu+onary%
Fast QAP Solver with ACO and Taboo
Search on Multiple GPUs with the Move-
Cost Adjusted Thread Assignment

Quadratic Assignment Problem (QAP)
•  One of the hardest combinatorial optimization
problem
–  There are many real-world applications:
•  Optimum location allocation of factories in a multinational company
•  Optimum section allocation in a big building
•  …
•  Definition:
–  Given n locations and n facilities, the task is to assign the
facilities to the locations to minimize the cost
•  aij is the distance matrix for each pair of locations i and j
•  bij is the flow matrix for each pair of facilities i and j
∑∑= =
=
n
i
ji
n
j
ijbaf
1
)()(
1
)( φφφ

Initialize Pheromone density τ"
Update pheromone density τ"
Construct solutions based on τ"
Apply local search (Tabu search)
start
end
ACO+TS on a Single GPU
Pheromone
Density
Matrix
τ"
Initialize Pheromone density τ"
Construct solutions based on τ"
Apply local search (Tabu search)
Update pheromone density τ"
Terminate?Terminate?
Instances
Construction
of solusions
TS
Updating
Trail
tai40a 0.007% 99.992% 0.001%
tai50a 0.005% 99.994% 0.000%
tai60a 0.004% 99.996% 0.000%
tai80a 0.002% 99.997% 0.000%
tai100a 0.002% 99.998% 0.000%
tai50b 0.022% 99.976% 0.002%
tai60b 0.017% 99.982% 0.001%
tai80b 0.011% 99.988% 0.001%
tai100b 0.008% 99.991% 0.000%
tai150b 0.005% 99.995% 0.000%
Time distribution in sequential run on CPU
•  We combined ACO and Taboo Search (TS)

•  A neighbor φ ’ of φ in QAP
•  Neighborhood size of N(φ) is Nsize=n(n-1)/2
•  To choose the best φ’, we need to calculate
costs for all of Nsize neighbors
2 1 0 3φ
Neighborhood in the QAP
0 1 2 3φ’
swap

Computation Cost of a Neighboring Solution
•  Fast update [Taillard 04]:
–  if we have memory of Δ(φ, r, s) for all pairs r, s,
–  and {u, v} ∩ {r, s}= satisfies, Δ(φ’, u, v) can be obtained as:
∑
−
≠=
$
$
%
&
'
'
(
)
−+−
+−+−
+−+−
+−+−=Δ
1
,,0 )()()()()()()()(
)()()()()()()()(
)()()()()()()()(
)()()()()()()()(
)()(
)()(
)()(
)()(),,(
n
srkk kskrskkrksrk
skrkksrkskkr
ssrrssrssrsr
srrsrsrrssrr
bbabba
bbabba
bbabba
bbabbasr
φφφφφφφφ
φφφφφφφφ
φφφφφφφφ
φφφφφφφφφ
)(nO
))((
))((
),,(),,'(
)(')(')(')(')(')(')(')('
)(')(')(')(')(')(')(')('
rurvsvsuusvsvrur
urvrvsussusvrvru
bbbbaaaa
bbbbaaaa
vuvu
φφφφφφφφ
φφφφφφφφ
φφ
−+−−+−
+−+−−+−
+Δ=Δ
)1(O
•  Let φ’ be a neighbor of φ obtained by exchanging r-th and s-th elements of φ,
then move cost Δ(φ, r, s)=f(φ’) - f(φ) can be obtained as

Parallel computation of move cost
-The simplest threads assignment-
threadIdx.x=0
threadIdx.x=1
threadIdx.x=2
.
.
.
.
.
threadIdx.x=Nsize-1
blockIdx.x=0
Assign m agents to blocks
Assignmovecalculationstothreads
blockIdx.x=1 blockIdx.x= m-1
Nsize=n(n-1)/2
threadIdx.x=0
threadIdx.x=1
threadIdx.x=2
.
.
.
.
.
threadIdx.x=Nsize-1
threadIdx.x=0
threadIdx.x=1
threadIdx.x=2
.
.
.
.
.
threadIdx.x=Nsize-1
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0
1 0
2 1 2
3 3 4 5
4 6 7 8 9
5 10 11 12 13 14
6 15 16 17 18 19 20
7 21 22 23 24 25 26 27
8 28 29 30 31 32 33 34 35
9 36 37 38 39 40 41 42 43 44
10 45 46 47 48 49 50 51 52 53 54
11 55 56 57 58 59 60 61 62 63 64 65
12 66 67 68 69 70 71 72 73 74 75 76 77
13 78 79 80 81 82 83 84 85 86 87 88 89 90
u
v

Move-Cost Adjusted Thread Assignment (MATA)
Computational time
warp 0
warp 1
0
1
2
3
15
16
30
31
32
33
3232
thread index
Computational time
No branch
divergence
in each warp !
0
1
2
3
4
5
6
28
29
30
31
warp 0
thread index
32
33
34
35
36
37
38
60
61
62
63
warp 1
O(1) O(n)
Delay Caused by Branch Divergence

Speedup on a Single GTX480
tai50a
tai60a
tai80a
tai100a
tai50b
tai60b
tai80b
tai100b
Average
0
5
10
15
20
25
30
35
40
Speedup
3.7
26.1
3.4
27.7
4.3
20.3
3.4
18.3
3.9
24.9
4.6
35.5
4.2
21.4
5.4
29.5
4.1
25.5
CPU: i7 965, 3.2GHz
QAP Instances

Implementation on Multiple GPUs
CPU
ACO0
ACO2
ACO1ACO3
CPU
work
memory
: solutions
GPU0
GPU1
GPU2
GPU3
ACO3
ACO0
ACO1

4 Types of Island Models
•  We implemented following 4 types of
island models
1.  IM-INDP: Island model with independent
runs
2.  IM-ELIT: Island model with elitist
3.  IM-RING: Island model with ring connected
4.  IM-ELMR: Island model with elitist and
massive ring connected

IM-INDP:
Island model with independent runs
CPU
ACO0
ACO1
ACO3
ACO2

IM-ELIT:
Island model with elitist
worst guy
best guy
global best guy
ACO0
ACO1
ACO3
ACO2

IM-RING:
Island model with ring connected
worst guy
best guy
ACO1
ACO2
ACO3
ACO0

IM-ELMR: Island model with elitist
and massive ring connected
CPU
IM-ELIT +
ACO1
ACO2
ACO3
ACO0

Results of Island Models with 4
GPUstai50a
tai60a
tai80a
tai100a
tai50b
tai60b
tai80b
tai100b
Average
0
1
2
3
4
5
6
7
Speedup
2.1
2.6
2.9
3.3
1.9
2.22.3
2.5 2.42.5
2.7
2.9
1.7
2.12.2
2.5
1.5
2.3
2.5
3
1.2
1.4
1.9
2.6
1.5
4.7
4.3
6.5
1.4
2.32.3
3.2
1.7
2.52.6
3.3

Conclusion
•  On a single GPU with “MATA”
– 25.5 times speedup to CPU (i7 965, 3.2GHz)
•  On 4-GPU (GTX480)
– IM_ELMR model has 3.3 times speedup to
single GPU
•  As a result, 25.5×3.3 = 84.2 times speedup
compared with the CPU computation

GPOCL:
A Massively Parallel GP
Implementation in OpenCL
Douglas A. Augusto Helio J.C. Barbosa
douglas@lncc.br hcbm@lncc.br
Laboratório Nacional de Computa¸cão Cient´ıfica (LNCC)
Rio de Janeiro, Brazil

GPOCL’s Features
2 / 12
n Fast and e cient C/C++ implementation based on a compact
linear tree representation.
n Massively parallel tree interpretation using OpenCL.
n It can be executed on virtually any parallel device, comprising dif-
ferent architectures and vendors.
n It implements three di↵erent parallel strategies (fitness-based,
population-based, and a mixture of both).
n To improve diversity it can evolve loosely-coupled subpopulations
(neighborhoods).
n It has a rich set of command-line options, including primitives’ set
definition, probabilities of the genetic operators, stopping crite-
ria, minimum and maximum tree sizes, and the configuration of
neighborhoods.
n It is Free Software (http://gpocl.sf.net).

Open Computing Language (OpenCL)
3 / 12
n Open Computing Language, or simply OpenCL, is an open-
standard programming language for heterogeneous parallel com-
puting.1
n It aims at e ciently exploiting the computing power of all process-
ing devices, such as traditional processors (CPU) and accelerators
(GPU, FPGA, DSP, Intel’s MIC, and so forth).
n It provides a uniform programming interface, which saves the pro-
grammer from writing di↵erent codes in di↵erent languages when
targeting multiple compute architectures, thus providing portabil-
ity.
n It is very ﬂexible (low-level language).
1
http://www.khronos.org

GPOCL
4 / 12
GPOCL implements a GP system using a prefix linear tree represen-
tation. Its main routine performs the following high-level procedures:
1. OpenCL initialization: This is the step where the general
OpenCL-related tasks are initialized.
2. Calculating n-dimensional ranges: Defines how much paral-
lel work there will be and how they are distributed among the
compute units.
3. Memory bu↵ers creation: In this phase all global memory re-
gions accessed by the OpenCL kernels are allocated on the device
and possibly initialized. The fitness cases are transferred and
enough space is reserved for the population and error vectors.
4. Kernel building: An OpenCL kernel, relative to a given strategy
of parallelization, is compiled just-in-time, targeting the compute
device.
5. Evolving: This iterative routine implements the actual genetic
programming dynamics.

Main Evolutionary Algorithm
5 / 12
Create (randomly) the initial population P;
22 Evaluate(P);
for generation 1 to NG do
Copy the best (elitism) programs of P to the temporary population Ptmp;
while |Ptmp| < |P| do
Select and copy from P two ﬁt programs, p1 and p2;
if [probabilistically] crossover then
Recombine p1 and p2, generating p0
1 and p0
2;
p1 p0
1; p2 p0
2;
end
if [probabilistically] mutation then
Apply mutation in p1 and p2, creating p0
1 and p0
2;
p1 p0
1; p2 p0
2;
end
Insert p1 and p2 into Ptmp;
end
P Ptmp; then reset Ptmp;
1818 Evaluate(P);
end
return the best program found;

Evaluate(P)
6 / 12
The evaluation step itself does not do much—the hard work is done
mostly by the OpenCL kernels. Basically, three things happen within
Evaluate(P):
1. Population transfer: All programs of P are transferred to the
target compute device.
2. Kernel execution: For any non-trivial problem, this is the most
demanding phase. Here, the entire recently transferred popula-
tion is evaluated—by interpreting each program over each ﬁtness
case—on the compute device. Fortunately, this step can be done
both in parallel as well accelerated by GPUs.
3. Error retrieval: After being computed and accumulated in the
previous step, the population’s prediction errors need to be trans-
ferred to the host so that this information is available to the
evolutionary process.

Overall Best Parallelization Strategy
7 / 12
n The population of programs and ﬁtness cases are parallelized.
n A mixture of the ﬁtness- and population-based strategies.
n While di↵erent programs are evaluated simultaneously on di↵erent
compute units (CU), the processing elements (PE) within each CU
take care, in parallel, of the whole training data set.
n Since internally to each CU the PEs will be interpreting the same
program, the event of instruction divergence is unlikely.

Some benchmarks on a NVIDIA
GTX-285 GPU
An old generation GPU (released in early 2009)
8 / 12

Fitness-based Parallelization Strategy
9 / 12
100
1000
5000
10000
25000
50000
1000
5000
10000
25000
50000
1.000
2.000
3.000
4.000
5.000
6.000
7.000
8.000
9.000
10.000
BillionGPop/s
Population size Data set size
BillionGPop/s
9.540 Billion GPop/s
(good performance, but requires a lot of ﬁtness cases)

Population-based Parallelization Strategy
10 / 12
100
1000
5000
10000
25000
50000
1000
5000
10000
25000
50000
0.100
0.200
0.300
0.400
0.500
0.600
0.700
BillionGPop/s
BillionGPop/s
(bad performance, causes a lot of instruction divergence)

Combined Fitness- and Population-based
Parallelization Strategy
11 / 12
100
1000
5000
10000
25000
50000
1000
5000
10000
25000
50000
7.000
8.000
9.000
10.000
11.000
12.000
BillionGPop/s
BillionGPop/s

Shigeyoshi Tsutsui, Hannan University
and
Noriyuki Fujimoto, Osaka Prefecture University
Fast QAP Solver with ACO and Taboo Search on Multiple
GPUs with the Move-Cost Adjusted Thread Assignment
And the winner is....

GPUs for GEC Competition @ GECCO-2013

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to GPUs for GEC Competition @ GECCO-2013

Similar to GPUs for GEC Competition @ GECCO-2013 (20)

More from Daniele Loiacono

More from Daniele Loiacono (20)

Recently uploaded

Recently uploaded (20)

GPUs for GEC Competition @ GECCO-2013