GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~

PGconf.EU 2015 Wien
NEC Business Creation Division
The PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

Today’s topic
GPGPU: very powerful computing capability
PostgreSQL: very functional and well-used database
PG-Strom: glue between these two promising technologies

Self Introduction
▌Name: KaiGai Kohei
▌Works for: NEC
▌Role:
 Lead of the PG-Strom project
 Contribution to PostgreSQL and other OSS projects
 Creating business opportunities with this technology
▌PG-Strom Project
 Mission: to deliver the power of semiconductor evolution to everyone who wants it
 Launched in Jan-2012 as a personal development project
 Project is now funded by NEC
 Fully open source project

Agenda
1. GPU characteristics, and why GPU?
2. What mediates between PostgreSQL and GPU
3. How GPU processes SQL workloads
4. Performance results and analysis
5. Further development

Features of GPU (Graphics Processing Unit)
▌Characteristics
 Massively parallel cores
 Much higher DRAM bandwidth
 Better price / performance ratio
 Advantages for massive but simple arithmetic operations
 Software algorithms must be written in a different manner
[Figure. SOURCE: CUDA C Programming Guide]

                                GPU                          CPU
Model                           Nvidia GTX TITAN X           Intel Xeon E5-2690 v3
Architecture                    Maxwell                      Haswell
Launch                          Mar-2015                     Sep-2014
# of transistors                8.0 billion                  3.84 billion
# of cores                      3072 (simple)                12 (functional)
Core clock                      1.0GHz                       2.6GHz, up to 3.5GHz
Peak FLOPS (single precision)   6.6TFLOPS                    998.4GFLOPS (with AVX2)
DRAM size                       12GB, GDDR5 (384bits bus)    768GB/socket, DDR4
Memory bandwidth                336.5GB/s                    68GB/s
Power consumption               250W                         135W
Price                           $999                         $2,094
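The peak single-precision figures in the table can be roughly reproduced from core count and clock. The sketch below is only a back-of-the-envelope check and rests on assumptions that are not on the slide: 2 FLOPs per GPU core per cycle (one fused multiply-add), a GPU boost clock of about 1.075GHz, and 32 FLOPs per CPU core per cycle with AVX2 (two FMA units x 8 single-precision lanes x 2 FLOPs). The function name `peak_gflops` is made up for illustration.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak = cores x clock x FLOPs issued per core per cycle."""
    return cores * clock_ghz * flops_per_cycle

# Assumption: FMA counts as 2 FLOPs; 1.075GHz boost clock is not on the slide.
gpu_gflops = peak_gflops(cores=3072, clock_ghz=1.075, flops_per_cycle=2)

# Assumption: 2 FMA units x 8 lanes (256-bit AVX2) x 2 FLOPs = 32 per cycle.
cpu_gflops = peak_gflops(cores=12, clock_ghz=2.6, flops_per_cycle=32)

# Figures taken directly from the table.
bandwidth_ratio = 336.5 / 68.0
usd_per_gflops_gpu = 999 / gpu_gflops
usd_per_gflops_cpu = 2094 / cpu_gflops

print(f"GPU peak: {gpu_gflops / 1000:.1f} TFLOPS, CPU peak: {cpu_gflops:.1f} GFLOPS")
print(f"Memory bandwidth ratio (GPU/CPU): {bandwidth_ratio:.1f}x")
print(f"Price per GFLOPS: GPU ${usd_per_gflops_gpu:.3f}, CPU ${usd_per_gflops_cpu:.3f}")
```

Peak numbers like these say nothing about achievable throughput on SQL workloads; they only illustrate why the price / performance bullet is plausible.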

How GPU cores work
[Figure: calculation of item[i] (i = 0 … N−1) with GPU cores. Sixteen values, item[0] to item[15], are added pairwise over step.1 to step.4, so the sum of items[] takes log2(N) steps.]
Inter-core synchronization by HW functionality
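The log2(N) summation can be sketched on the CPU as follows. This is a plain Python simulation, not GPU code: each iteration of the inner loop stands for the work of one GPU thread, and the end of each round stands for the hardware synchronization point. The function name `tree_reduce_sum` is hypothetical.

```python
def tree_reduce_sum(items):
    """Sum a sequence in ceil(log2(N)) rounds of pairwise additions."""
    buf = list(items)
    n = len(buf)
    if n == 0:
        return 0, 0
    stride = 1
    steps = 0
    while stride < n:
        # On a GPU, all additions of one round run concurrently,
        # followed by a barrier before the next round starts.
        for i in range(0, n - stride, 2 * stride):
            buf[i] += buf[i + stride]
        stride *= 2
        steps += 1
    return buf[0], steps

total, steps = tree_reduce_sum(range(16))
print(total, steps)
```

With 16 items the loop runs four rounds, matching step.1 to step.4 in the figure.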

Dark Silicon and Semiconductor Trend
▌Shrinking the process also increases power leakage.
▌Dark Silicon: die area that cannot be powered on because of thermal constraints.
▌People asked: how can we utilize these extra transistors?
 Domain-specific processors, like GPU
[Figure: Intel Haswell on-chip layout (4-core system), followed by next-generation layouts in which part of the die is dark silicon]

Towards heterogeneous era...
Software must be designed to mediate between CPU and GPU

What is PG-Strom
▌Core ideas
① GPU native code generation on the fly
② Asynchronous, massively parallel execution
▌Advantages
 Transparent acceleration with 100% query compatibility
 Commodity H/W and less system integration cost
Query: SELECT * FROM l_tbl JOIN r_tbl ON l_tbl.lid = r_tbl.rid;
[Figure: the query passes through the Parser, Planner and Executor. Through the Custom-Scan/Join interface, PG-Strom generates CUDA source code, builds it with nvrtc, and uses the CUDA driver for DMA data transfer and massively parallel execution.]

How a query is processed
▌Custom-Scan/Join interface (new in v9.5!)
▌PG-Strom provides its own scan/join logic using GPUs
▌It generates the same results, but in a different way
[Figure: SQL (text) → Query Parser → Parse Tree → Query Optimizer → Query Plan → Query Executor. PostgreSQL and extensions interact through the Custom Scan/Join interface. On the PG-Strom side: CUDA code, data chunks, and the NVIDIA Run-Time Compiler.]

(OT) Dive into Custom-Join
Built-in JOIN
 A Join node (NestedLoop, MergeJoin or HashJoin) sits on top of an inner scan of table-X and an outer scan of table-Y.
 It takes an X-tuple (TupleDesc of table-X) and a Y-tuple (TupleDesc of table-Y) and produces a joined-tuple (TupleDesc of Join X-Y).
Custom JOIN
 A CustomScan node (scanrelid==0) runs logic implemented by the extension.
 It loads data from table-X, table-Y or some other data source in whatever way is comfortable for the extension.
 It produces a joined-tuple in a compatible tuple format; the definition is made at planning time.

SQL-to-GPU native code generation (1/3)
postgres=# EXPLAIN VERBOSE
           SELECT cat, count(*), avg(x) FROM t0
            WHERE x between y and y + 20.0 GROUP BY cat;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 HashAggregate  (cost=167646.33..167646.36 rows=3 width=12)
   Output: cat, pgstrom.count((pgstrom.nrows())),
           pgstrom.avg((pgstrom.nrows((x IS NOT NULL))), (pgstrom.psum(x)))
   Group Key: t0.cat
   ->  Custom Scan (GpuPreAgg)  (cost=24150.22..163315.53 rows=258 width=48)
         Output: cat, pgstrom.nrows(), pgstrom.nrows((x IS NOT NULL)), pgstrom.psum(x)
         Bulkload: On (density: 100.00%)
         Reduction: Local + Global
         Device Filter: ((t0.x >= t0.y) AND (t0.x <= (t0.y + '20'::double precision)))
         Features: format: tuple-slot, bulkload: unsupported
         Kernel Source: /opt/pgsql/base/pgsql_tmp/pgsql_tmp_strom_24905.4.gpu
         ->  Custom Scan (BulkScan) on public.t0  (cost=21150.22..159312.94 rows=10000060 width=12)
               Output: id, cat, aid, bid, cid, did, eid, x, y, z
               Features: format: heap-tuple, bulkload: supported
(13 rows)

$ less /opt/pgsql/base/pgsql_tmp/pgsql_tmp_strom_24905.4.gpu
    :
STATIC_FUNCTION(bool)
gpupreagg_qual_eval(kern_context *kcxt,
                    kern_data_store *kds,
                    kern_data_store *ktoast,
                    size_t kds_index)
{
  pg_float8_t KPARAM_1 = pg_float8_param(kcxt,1);            /* constant 20.0 */
  pg_float8_t KVAR_8 = pg_float8_vref(kds,kcxt,7,kds_index); /* var of X */
  pg_float8_t KVAR_9 = pg_float8_vref(kds,kcxt,8,kds_index); /* var of Y */

  return EVAL(pgfn_boolop_and_2(kcxt,
                pgfn_float8ge(kcxt, KVAR_8, KVAR_9),
                pgfn_float8le(kcxt, KVAR_8,
                  pgfn_float8pl(kcxt, KVAR_9, KPARAM_1))));
}
    :
 Formula expression based on the WHERE clause
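How an expression tree could be turned into such a device expression can be sketched as below. This is a toy generator and not PG-Strom's actual code generator: the classes `Var`, `Param` and `Call` and the function `emit` are hypothetical names, and only the `pgfn_*` function names and the `KVAR_n` / `KPARAM_n` naming are taken from the listing above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    attno: int      # 1-based column number, rendered as KVAR_<attno>

@dataclass(frozen=True)
class Param:
    index: int      # slot in the kernel parameter buffer, rendered as KPARAM_<index>

@dataclass(frozen=True)
class Call:
    fn: str         # device function name, e.g. "pgfn_float8ge"
    args: tuple     # child nodes

def emit(node):
    """Render one node of the expression tree as a C expression string."""
    if isinstance(node, Var):
        return f"KVAR_{node.attno}"
    if isinstance(node, Param):
        return f"KPARAM_{node.index}"
    inner = ", ".join(emit(arg) for arg in node.args)
    return f"{node.fn}(kcxt, {inner})"

# WHERE x BETWEEN y AND y + 20.0, with x as column 8 and y as column 9
x, y, const_20 = Var(8), Var(9), Param(1)
qual = Call("pgfn_boolop_and_2", (
    Call("pgfn_float8ge", (x, y)),
    Call("pgfn_float8le", (x, Call("pgfn_float8pl", (y, const_20)))),
))
source = f"return EVAL({emit(qual)});"
print(source)
```

A real generator also has to emit the variable declarations, pick type-specific functions and handle NULLs; the sketch covers only the recursive walk over the WHERE clause.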

[Figure: from the SQL query, the PG-Strom GPU code generator produces dynamically constructed GPU code (the per-query variant). It is combined with statically constructed GPU code (the algorithm core: GpuScan, GpuHashJoin, GpuPreAgg, GpuSort, ...) and passed to the NVIDIA Run-Time Compiler (nvrtc). The resulting GPU binary is used for execution and cached for next time.]

PG-Strom functionalities as of Sep-2015
▌Logics
 GpuScan ... Simple loop extraction by GPU multithread
 GpuHashJoin ... GPU multithread based N-way hash-join
 GpuNestLoop ... GPU multithread based N-way nested-loop
 GpuPreAgg ... Row reduction prior to CPU aggregation
 GpuSort ... GPU bitonic + CPU merge, hybrid sorting
▌Data Types
 Numeric ... int2/4/8, float4/8, numeric
 Date and Time ... date, time(tz), timestamp(tz), interval
 Text ... simple matching only, at this moment
 others ... currency
▌Functions
 Comparison operators ... <, <=, !=, =, >=, >
 Arithmetic operators ... +, -, *, /, %, ...
 Mathematical functions ... sqrt, log, exp, ...
 Aggregate functions ... min, max, sum, avg, stddev, ...

Asynchronous & Pipelining Execution (1/3)
▌Karma of discrete GPU device
 We have to move the data to be processed.
 Thus, people say data-intensive tasks are not a good fit for GPU.
 Correct, but we can reduce the impact.
 Our solution: Pipelining
[Figure: timeline of CPU and GPU activity over time: DMA Send, GPU kernel execution, DMA Recv]

[Figure: pipelined execution. While the table scan moves to the next chunk, chunk-i to chunk-(i+3) are each in a different stage (Buffer Read, DMA Send, GPU Kernel Exec, DMA Recv) across Step-i to Step-(i+4), so the stages of consecutive chunks overlap.]
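The benefit of overlapping the stages can be estimated with a simple timing model. The sketch below assumes identical chunks, an ideal pipeline in which every stage can run concurrently with the others, and made-up stage times; none of these numbers come from the slides.

```python
def sequential_time(n_chunks, stage_times):
    """Each chunk runs all stages before the next chunk starts."""
    return n_chunks * sum(stage_times)

def pipelined_time(n_chunks, stage_times):
    """Ideal pipeline: fill it once, then the slowest stage sets the pace."""
    if n_chunks == 0:
        return 0
    return sum(stage_times) + (n_chunks - 1) * max(stage_times)

# Hypothetical per-chunk times in ms:
# Buffer Read, DMA Send, GPU Kernel Exec, DMA Recv
stages = (3, 2, 4, 1)

for n in (1, 10, 100):
    print(n, sequential_time(n, stages), pipelined_time(n, stages))
```

With a single chunk there is nothing to overlap, so both models give the same time; the gain only appears once enough chunks are in flight.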

▌Downside of the pipelining
 Too small workloads
 Pipeline hazard

SQL Logic on GPU (1/3) – GpuHashJoin
Built-in Hash-Join: the inner relation is loaded into a hash table. For the outer relation, the CPU performs a sequential hash table search and then a sequential projection before the next step.
GpuHashJoin of PG-Strom: the GPU performs the hash table search and the projection in parallel. All the CPU does is reference the result of the join and pass it to the next step.
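The division of work in a hash join can be sketched as below. This is a CPU-only illustration with hypothetical function names: the build phase fills the hash table from the inner relation, and `probe_one` represents what a single GPU thread would do for one outer row, assuming one thread per outer row. The column names `lid` and `rid` follow the example query shown earlier.

```python
from collections import defaultdict

def build_hash_table(inner_rows, inner_key):
    """Build phase: group inner rows by join key."""
    table = defaultdict(list)
    for row in inner_rows:
        table[row[inner_key]].append(row)
    return table

def probe_one(outer_row, table, outer_key):
    """Search and projection for one outer row (the work of one thread)."""
    matches = table.get(outer_row[outer_key], [])
    return [{**outer_row, **inner_row} for inner_row in matches]

def hash_join(outer_rows, inner_rows, outer_key, inner_key):
    table = build_hash_table(inner_rows, inner_key)
    results = []
    # The iterations are independent of each other, which is what
    # allows a GPU to run them in parallel.
    for outer_row in outer_rows:
        results.extend(probe_one(outer_row, table, outer_key))
    return results

l_tbl = [{"lid": 1, "lval": "a"}, {"lid": 2, "lval": "b"}, {"lid": 3, "lval": "c"}]
r_tbl = [{"rid": 1, "rval": "x"}, {"rid": 3, "rval": "y"}, {"rid": 3, "rval": "z"}]
joined = hash_join(l_tbl, r_tbl, "lid", "rid")
print(joined)
```

On a real GPU the threads also need a way to reserve space in a shared result buffer, which is where the overflow problem discussed next comes from.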

OT: Non-uniformly Distributed Hash-Table
[Figure: each GPU core (Core-0 … Core-M) takes one item of the outer relation, applies the hash function, and uses the hash value to search the inner hash table built from the inner relation. Many inner items share the slot matched by Item[3], so the join output overflows the result buffer.]

OT: Non-uniformly Distributed Hash-Table
[Figure: same layout, with the inner hash table split into virtual partitions with extra randomness. The join is retried for each virtual partition, so that each partial result just fits the result buffer.]
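The retry-per-partition idea can be sketched as follows. This is a simplified model and not PG-Strom's implementation: inner rows are assigned to virtual partitions by their position (a stand-in for the "extra randomness", since partitioning by hash value alone would not split rows with the same key), and the partition count is doubled after every overflow. All function names are hypothetical.

```python
def probe_partition(outer_keys, inner_keys, n_parts, part_no, buffer_size):
    """Join the outer keys against one virtual partition of the inner side.
    Returns None if the result buffer would overflow."""
    table = {}
    for idx, key in enumerate(inner_keys):
        if idx % n_parts == part_no:            # assumed partitioning rule
            table.setdefault(key, []).append(idx)
    results = []
    for outer in outer_keys:
        for idx in table.get(outer, ()):
            if len(results) == buffer_size:
                return None                     # result buffer overflow
            results.append((outer, idx))
    return results

def join_with_virtual_partitions(outer_keys, inner_keys, buffer_size):
    n_parts = 1
    while True:
        output, overflow = [], False
        for part_no in range(n_parts):
            part = probe_partition(outer_keys, inner_keys,
                                   n_parts, part_no, buffer_size)
            if part is None:
                overflow = True
                break
            output.extend(part)                 # buffer is drained per partition
        if not overflow:
            return output, n_parts
        if n_parts >= max(1, len(inner_keys)):
            raise RuntimeError("result buffer too small for a single inner row")
        n_parts *= 2                            # retry with finer partitions

# Skewed data: key 3 appears many times on both sides.
outer_keys = [3, 3, 3, 3, 1]
inner_keys = [3] * 10 + [1, 2]
rows, parts = join_with_virtual_partitions(outer_keys, inner_keys, buffer_size=16)
print(len(rows), parts)
```

Every retry repeats work, which is why a good up-front estimate of the result size matters.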

SQL Logic on GPU (2/3) – GpuNestLoop (unparameterized)
[Figure: two-dimensional GPU kernel launch. The outer relation (Nx: usually larger, split chunk-by-chunk on demand) runs along blockDim.x and the inner relation (Ny: relatively small) along blockDim.y; one thread handles one pair, e.g. Thread (X=2, Y=3).]
Only edge threads reference DRAM to fetch values.
Nx:32 x Ny:32 = 1024
A matrix can be evaluated with only 64 DRAM accesses
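The access count can be checked with a small simulation. The sketch below models one 32 x 32 tile on the CPU: the lists `tile_x` and `tile_y` stand in for on-chip shared memory, and the counter tracks how many values are fetched from DRAM. The function name and the counting scheme are illustrative assumptions.

```python
def nestloop_tile(outer_vals, inner_vals, qual):
    """Evaluate qual for every (outer, inner) pair of one tile."""
    dram_reads = 0
    tile_x = []                       # stand-in for on-chip shared memory
    for value in outer_vals:          # threads on one edge fetch outer values
        tile_x.append(value)
        dram_reads += 1
    tile_y = []
    for value in inner_vals:          # threads on the other edge fetch inner values
        tile_y.append(value)
        dram_reads += 1
    # Every (ix, iy) pair corresponds to one GPU thread; all of them read
    # their operands from the tile, not from DRAM.
    matches = [(ix, iy)
               for iy, y in enumerate(tile_y)
               for ix, x in enumerate(tile_x)
               if qual(x, y)]
    return matches, dram_reads

outer = list(range(32))
inner = list(range(32))
matches, reads = nestloop_tile(outer, inner, lambda x, y: x == y)
naive_reads = 2 * len(outer) * len(inner)   # each thread fetching both operands
print(len(matches), reads, naive_reads)
```

Under this model 1024 pair evaluations need 32 + 32 = 64 fetches, compared with 2048 if every thread fetched both of its operands itself.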

SQL Logic on GPU (3/3) – GpuPreAgg
▌GpuPreAgg, placed before Aggregate, reduces the number of rows to be processed.
▌The query is rewritten so that the intermediate aggregation yields equivalent results.
SELECT count(*), AVG(X), Y FROM Tbl_1 GROUP BY Y;
Without GpuPreAgg (rows flow from bottom to top):
  Result Set
  GroupAggregate   outputs Y, count(*), avg(X)                    several rows
  Sort             outputs X, Y                                   several million rows
  SeqScan          outputs X, Y                                   several million rows
  Tbl_1            X, Y, Z (table definition)
With GpuPreAgg:
  Result Set
  GroupAggregate   outputs Y, sum(nrows), avg_ex(nrows, psum)     several rows
  Sort             outputs Y, nrows, psum_X                       several hundred rows
  GpuPreAgg        outputs Y, nrows, psum_X                       several hundred rows
  GpuScan          outputs X, Y                                   several million rows
  Tbl_1            X, Y, Z (table definition)
GpuPreAgg reduces the number of rows to be processed by the upper nodes.
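The rewrite into partial and final aggregation can be sketched on the CPU as below. The names `gpu_preagg` and `final_aggregate` are hypothetical; the partial step produces the (Y, nrows, psum_X) rows shown in the plan, and the final step combines them into count(*) and avg(X). The sketch assumes X is never NULL.

```python
from collections import defaultdict

def gpu_preagg(chunk):
    """Partial aggregation of one chunk of (x, y) rows -> (y, nrows, psum_x)."""
    partial = defaultdict(lambda: [0, 0.0])
    for x, y in chunk:
        partial[y][0] += 1
        partial[y][1] += x
    return [(y, nrows, psum) for y, (nrows, psum) in partial.items()]

def final_aggregate(partial_rows):
    """Final step: count(*) = sum(nrows), avg(X) = sum(psum) / sum(nrows)."""
    total = defaultdict(lambda: [0, 0.0])
    for y, nrows, psum in partial_rows:
        total[y][0] += nrows
        total[y][1] += psum
    return {y: (nrows, psum / nrows) for y, (nrows, psum) in total.items()}

rows = [(1.0, "a"), (3.0, "a"), (10.0, "b"), (5.0, "a"), (20.0, "b")]
chunks = [rows[:2], rows[2:]]            # each chunk is reduced independently

partial_rows = []
for chunk in chunks:
    partial_rows.extend(gpu_preagg(chunk))

result = final_aggregate(partial_rows)
print(result)
```

Averages cannot be averaged directly across chunks, which is why the partial step carries a row count and a partial sum instead of a per-chunk avg.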

Microbenchmark (1/2) – total response time
▌SELECT cat, AVG(x) FROM t0 NATURAL JOIN t1 [, ...] GROUP BY cat;
Measures query response time as the number of inner relations increases
 t0: 100M rows, t1 to t10: 100K rows each; all the data was preloaded.
▌Environment:
 PostgreSQL v9.5beta1 + PG-Strom (22-Oct), CUDA 7.0 + RHEL6.6 (x86_64)
 CPU: Xeon E5-2670v3, RAM: 384GB, GPU: NVIDIA TESLA K20c (2496cores)

Query Response Time vs. Number of Tables Joined [sec]
# of tables involved   2       3       4       5       6       7       8
PostgreSQL 9.5β        58.89   82.51   102.15  125.46  159.83  194.50  241.88
PG-Strom               12.44   17.88   17.56   21.98   33.33   38.14   50.29
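The speed-up per table count follows directly from the two series of the chart. The sketch below assumes that the larger series belongs to PostgreSQL 9.5β and the smaller one to PG-Strom, which is how the legend order reads; the variable names are made up.

```python
n_tables = [2, 3, 4, 5, 6, 7, 8]
postgres_sec = [58.89, 82.51, 102.15, 125.46, 159.83, 194.50, 241.88]
pgstrom_sec = [12.44, 17.88, 17.56, 21.98, 33.33, 38.14, 50.29]

# Speed-up = PostgreSQL response time / PG-Strom response time
speedup = {n: pg / gpu
           for n, pg, gpu in zip(n_tables, postgres_sec, pgstrom_sec)}

for n, ratio in speedup.items():
    print(f"{n} tables: {ratio:.2f}x")
```

The ratio stays within a fairly narrow band across all table counts in this microbenchmark.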

Microbenchmark (2/2) – Time consumption per component
▌Good acceleration results for JOIN and AGGREGATION
▌Lower acceleration ratio when N is larger than 6
 Caused by inaccurate problem size estimation. Why?
[Figure: stacked bar chart, time consumption per component (PostgreSQL v9.5β vs PG-Strom), split into Scan, Join, Aggregate and Others, for 2 to 8 tables involved]

OT: Why accurate problem size estimation is important
▌GpuJoin results must be smaller than GPU RAM size.
▌Virtual inner-partition limits the scale of the Join results.
 Better size estimation reduces the number of retries.
 We can determine the problem size using dynamic parallelism
 Downside: supported only by Kepler or later generations.
[Figure: GpuJoin with virtual inner-partition and iteration. outerRel × innerRel (depth-1) × innerRel (depth-2) × innerRel (depth-3) × innerRel (depth-4) = Results. In one case the result size is larger than GPU RAM, in the other it stays within GPU RAM.]

TPC-DS Benchmark (1/2) – Total Results
▌What is TPC-DS?
 Successor of TPC-H; defines 103 (99+4) kinds of queries
 Simulates decision support queries in the retail industry
▌About this benchmark
 Runs the series of TPC-DS queries sequentially.
 We could collect results for 96 of the 103 queries.
 Environment:
• PostgreSQL v9.5beta1 + PG-Strom (22-Oct), CUDA 7.0 + RHEL6.6 (x86_64)
• CPU: Xeon E5-2670v3, RAM: 384GB, GPU: NVIDIA TESLA K20c (2496cores)
[Figure: bar chart per query, PostgreSQL v9.5β vs PG-Strom, axis from 0 to 2000; one outlier is marked "anomaly?"]

TPC-DS Benchmark (2/2) – Total Result
▌Summary
 PG-Strom beats PostgreSQL: 76 of 96
 PostgreSQL beats PG-Strom: 20 of 96
 PG-Strom shows advantages on most of the simple queries
▌Challenges
 Wiser plan construction, so that GpuXXX nodes do not appear where they should not
 Functionality improvement for Aggregation and Scan
[Figure: bar chart per query, PostgreSQL v9.5β vs PG-Strom, axis from 0 to 500]

Time consumption per workloads (1/3) – PostgreSQL v9.5
▌Scan, Join and Aggregation clearly account for the time consumed in TPC-DS workloads
[Figure: Time consumption per workloads (PostgreSQL v9.5beta), stacked bars from 0% to 100%]

Time consumption per workloads (2/3) – PG-Strom
▌PG-Strom dramatically reduced the time for Joins.
▌Share of time spent in Joins: 37% → 17%
[Figure: Time consumption per workloads (PostgreSQL v9.5beta), stacked bars from 0% to 100%]

Time consumption per workloads (3/3) – Wow! GpuJoin
GpuJoin can explain 84% of performance improvement
y = 0.962x - 7.289
R² = 0.8442
[Figure: scatter plot with regression line. x-axis: Improvement of Join Time [sec] = (Join time of PgSQL) – (Join time of PG-Strom). y-axis: Improvement of Total Time [sec] = (Total time of PgSQL) – (Total time of PG-Strom). Both axes run from -50 to 250.]

What we learn from TPC-DS
▌GpuJoin may be a killer feature of PG-Strom
 ...as long as problem size estimation is accurate.
 The planner needs to be wiser and not inject GpuJoin in situations where it is not suitable.
▌For Aggregation
 GpuPreAgg didn’t appear as often as we expected
 Is there anything the GPU can do to help CPU aggregation?
▌For Scan
 Scan performance == data stream bandwidth of GPU input
 A single thread limits the bandwidth of data supply

Feature enhancement / improvement
▌Target-list calculation on GPU
 Pre-calculation of hash-value for HashAggregate
 Projection off-load for complicated mathematical expressions
▌Custom-Scan under Gather node
 Allows multiple workers to provide data streams to GPUs
▌Partial Aggregation before Join
 Wider utilization of GpuPreAgg to reduce the number of rows to be joined by CPU
▌Utilization of dynamic parallelism
 A wiser approach to the result buffer overflow problem in Join
▌... and more...

Short term development
1. Software quality improvement
2. Corner case handling
3. Documentation
4. Target-List calculation in GPU
5. Software quality improvement

WIP Project – In-place Computing
Why export the entire data to run a mathematical algorithm?
▌Run complicated mathematical expressions on GPUs; the CPU then just references the result
 Algorithms written in SQL (even partially) automatically move to GPUs
 WIP: Data-clustering algorithms, like Min-Max, K-means, ...
[Figure: two ways to get from the data to the things one wants to know through a mathematical algorithm and its result. Labels: Data Export; Limited computing capability; Principle of Near-Data Computing.]

Future Direction of PG-Strom
Real-life workloads are the only compass for development

WE WANT YOU TO GET INVOLVED
▌PostgreSQL Users
 who are concerned about query execution performance on their workloads
▌Application Developers
 who are interested in trying yet another database acceleration methodology
▌PostgreSQL Developers
 who can bet on the future of the heterogeneous era

Resources
▌Github
 https://github.com/pg-strom/devel
 https://github.com/pg-strom/devel/issues
▌Documentation
 Under construction.... sorry
▌Today’s slides
 Will be uploaded soon; check #pgconfeu on Twitter
▌Author’s contact
 e-mail: kaigai@ak.jp.nec.com
 Tw: @kkaigai
