GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
1. 1
GPGPU Accelerates PostgreSQL
~Unlock the power of multi-thousand cores~
PGconf.EU 2015 Wien
NEC Business Creation Division
The PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
2. 2
Today’s topic
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
Very powerful
computing
capability
Very functional
& well-used
database
PG-Strom:
Glue of the both promising
technologies
GPGPU
3. 3
Self Introduction
▌Name: KaiGai Kohei
▌Works for:NEC
▌Role:
Lead of the PG-Strom project
Contribution to PostgreSQL and other OSS
projects
Making business opportunity using this
technology
▌PG-Strom Project
Mission: To deliver the power of semiconductor evolution
for every people who want
Launched at Jan-2012, as a personal development project
Project is now funded by NEC
Fully open source project
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
4. 4
Agenda
1. GPU characteristics and why?
2. What intermediates PostgreSQL and GPU
3. How GPU processes SQL workloads
4. Performance results and analysis
5. Further development
5. 5
Features of GPU (Graphic Processor Unit)
▌Characteristics
Massive parallel cores
Much higher DRAM bandwidth
Better price / performance ratio
advantages for massive but
simple arithmetic operations
Different manner to write
software algorithm
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
SOURCE: CUDA C Programming Guide
GPU CPU
Model
Nvidia GTX
TITAN X
Intel Xeon
E5-2690 v3
Architecture Maxwell Haswell
Launch Mar-2015 Sep-2014
# of transistors 8.0billion 3.84billion
# of cores
3072
(simple)
12
(functional)
Core clock 1.0GHz
2.6GHz,
up to 3.5GHz
Peak Flops
(single
precision)
6.6TFLOPS
998.4GFLOPS
(with AVX2)
DRAM size
12GB, GDDR5
(384bits bus)
768GB/socket,
DDR4
Memory band 336.5GB/s 68GB/s
Power
consumption
250W 135W
Price $999 $2,094
6. 6
How GPU cores works
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
●item[0]
step.1 step.2 step.4step.3
Calculation of
𝑖𝑡𝑒𝑚[𝑖]
𝑖=0…𝑁−1
with GPU cores
◆
●
▲ ■ ★
● ◆
●
● ◆ ▲
●
● ◆
●
● ◆ ▲ ■
●
● ◆
●
● ◆ ▲
●
● ◆
●
item[1]
item[2]
item[3]
item[4]
item[5]
item[6]
item[7]
item[8]
item[9]
item[10]
item[11]
item[12]
item[13]
item[14]
item[15]
Sum of items[]
by log2N steps
Inter-core synchronization by HW functionality
7. 7
Dark Silicon and Semiconductor Trend
▌Processor shrink also increases power leakage.
▌Dark Silicon: area cannot power-on due to thermal problem.
▌People thought: How to utilize these extra transistors?
Domain specific processors, like GPU
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
Core Core Core Core
Core Core Core Core
Dark SiliconCore Core
Core Core
Dark Silicon
Core
Core
Core
Core
Core
Core
Next GenNext Gen
Intel Haswell
on-chip layout
(4core system)
9. 9
Agenda
1. GPU characteristics and why?
2. What intermediates PostgreSQL and GPU
3. How GPU processes SQL workloads
4. Performance results and analysis
5. Further development
10. 10
What is PG-Strom
▌Core ideas
① GPU native code generation on the fly
② Asynchronous massive parallel execution
▌Advantages
Transparent acceleration with 100% query compatibility
Commodity H/W and less system integration cost
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
Parser
Planner
Executor
Custom-
Scan/Join
Interface
Query: SELECT * FROM l_tbl JOIN r_tbl on l_tbl.lid = r_tbl.rid;
PG-Strom
CUDA
driver
nvrtc
DMA Data Transfer
CUDA
Source
code
Massive
Parallel
Execution
11. 11
How query is processed
▌Custom-Scan/Join interface (v9.5 new!)
▌PG-Strom provides own scan/join logics using GPUs
▌It generates same results, but different way
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
Query Parser
Query Optimizer
Query Executor
SQL
(text)
Parse
Tree
Query Plan
Custom
Scan/Join
Interface
PG-Strom
Interactions
between
PostgreSQL
and
Extensions
CUDA
code
Data
Chunk
NVIDIA
Run-Time
Compiler
12. 12
definition
at planning
time
(OT) Dive into Custom-Join
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
Built-in JOIN Custom JOIN
CustomScan
(scanrelid==0)
Logic implemented
by extension
Inner Scan Outer Scan
table-X table-Y
Join
• NestedLoop
• MergeJoin
• HashJoin
TupleDesc
of table-X
TupleDesc
of table-Y
X-tuple Y-tuple
joined-tuple
TupleDesc
of Join X-Y
table-Y Something other
data source
table-X
Data Load with extension’s
comfortable way
joined-tuple
compatible tuple format
13. 13
SQL-to-GPU native code generation (1/3)
postgres=# EXPLAIN VERBOSE
SELECT cat, count(*), avg(x) FROM t0
WHERE x between y and y + 20.0 GROUP BY cat;
QUERY PLAN
-------------------------------------------------------------------------------------
HashAggregate (cost=167646.33..167646.36 rows=3 width=12)
Output: cat, pgstrom.count((pgstrom.nrows())),
pgstrom.avg((pgstrom.nrows((x IS NOT NULL))), (pgstrom.psum(x)))
Group Key: t0.cat
-> Custom Scan (GpuPreAgg) (cost=24150.22..163315.53 rows=258 width=48)
Output: cat, pgstrom.nrows(), pgstrom.nrows((x IS NOT NULL)), pgstrom.psum(x)
Bulkload: On (density: 100.00%)
Reduction: Local + Global
Device Filter: ((t0.x >= t0.y) AND (t0.x <= (t0.y + '20'::double precision)))
Features: format: tuple-slot, bulkload: unsupported
Kernel Source: /opt/pgsql/base/pgsql_tmp/pgsql_tmp_strom_24905.4.gpu
-> Custom Scan (BulkScan) on public.t0 (cost=21150.22..159312.94
rows=10000060 width=12)
Output: id, cat, aid, bid, cid, did, eid, x, y, z
Features: format: heap-tuple, bulkload: supported
(13 rows)
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
14. 14
SQL-to-GPU native code generation (2/3)
$ less /opt/pgsql/base/pgsql_tmp/pgsql_tmp_strom_24905.4.gpu
:
STATIC_FUNCTION(bool)
gpupreagg_qual_eval(kern_context *kcxt,
kern_data_store *kds,
kern_data_store *ktoast,
size_t kds_index)
{
pg_float8_t KPARAM_1 = pg_float8_param(kcxt,1); /* constant 20.0 */
pg_float8_t KVAR_8 = pg_float8_vref(kds,kcxt,7,kds_index); /* var of X */
pg_float8_t KVAR_9 = pg_float8_vref(kds,kcxt,8,kds_index); /* var of Y */
return EVAL(pgfn_boolop_and_2(kcxt,
pgfn_float8ge(kcxt, KVAR_8, KVAR_9),
pgfn_float8le(kcxt, KVAR_8,
pgfn_float8pl(kcxt, KVAR_9, KPARAM_1))));
}
:
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
formula expression based on WHERE clause
16. 16
PG-Strom functionalities at Sep-2015
▌Logics
GpuScan ... Simple loop extraction by GPU multithread
GpuHashJoin ... GPU multithread based N-way hash-join
GpuNestLoop ... GPU multithread based N-way nested-loop
GpuPreAgg ... Row reduction prior to CPU aggregation
GpuSort ... GPU bitonic + CPU merge, hybrid sorting
▌Data Types
Numeric ... int2/4/8, float4/8, numeric
Date and Time ... date, time(tz), timestamp(tz), interval
Text ... simple matching only, at this moment
others ... currency
▌Functions
Comparison operator ... <, <=, !=, =, >=, >
Arithmetic operators ... +, -, *, /, %, ...
Mathematical functions ... sqrt, log, exp, ...
Aggregate functions ... min, max, sum, avg, stddev, ...
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
17. 17
Agenda
1. GPU characteristics and why?
2. What intermediates PostgreSQL and GPU
3. How GPU processes SQL workloads
4. Performance results and analysis
5. Further development
18. 18
Asynchronous & Pipelining Execution (1/3)
▌Karma of discrete GPU device
We have to move the data to be processed.
Thus, people say data intensive tasks are not good for GPU.
Correct, but we can reduce the impact.
Our solution: Pipelining
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
GPU
CPU
GPU kernel
execution
time
DMA
Send
DMA
Recv
20. 20
Asynchronous & Pipelining Execution (3/3)
▌Karma of discrete GPU device
We have to move the data to be processed.
Thus, people say data intensive tasks are not good for GPU.
Correct, but we can reduce the impact
Our solution: Pipelining
▌Downside of the pipelining
Too small workloads
Pipeline hazard
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
GPU
CPU
GPU kernel
execution
time
DMA
Send
DMA
Recv
21. 21
SQL Logic on GPU (1/3) – GpuHashJoin
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
Inner
relation
Outer
relation
Inner
relation
Outer
relation
Hash table Hash table
Next step Next step
All CPU does
is just
references
the result of
relations join
Sequential
Hash table
search by CPU
Sequential
Projection
by CPU
Parallel
Projection
by GPU
Parallel Hash-
table search
by GPU
Built-in Hash-Join GpuHashJoin of PG-Strom
23. 23
Inner Hash table
OT: Ununiformly Distributed Hash-Table
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
GPU Core-0
GPU Core-1
GPU Core-2
GPU Core-3
GPU Core-4
GPU Core-5
GPU Core-M
Outer
relation
Item[0]
Item[1]
Item[2]
Item[3]
Item[4]
Item[5]
Item[N]
Item[M]
Item Item Item Item Item
Item Item Item Item Item
Inner
relation
Hash
Value
Hash
Function
Item[3] Item Item[3] Item
Virtual Partition
with Extra Randomness
Just fit for
Result Buffer
x Retry for each
virtual partition
24. 24
SQL Logic on GPU (2/3) – GpuNestLoop (unparametalized)
Outer-Relation
(Nx:usuallylarger)
※splittochunk-by-chunkon
demand
●
●
●
●
●
●
●
●
●
●
●●●●●●●Two
dimensional
GPU kernel
launch
blockDim.x
blockDim.y
Ny
Nx
Thread
(X=2, Y=3)
Inner-Relation
(Ny: relatively small)
Only edge thread references
DRAM to fetch values.
Nx:32 x Ny:32 = 1024
A matrix can be evaluated with
only 64 times DRAM accesses
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
25. 25
SQL Logic on GPU (3/3) – GpuPreAgg
▌GpuPreAgg, prior to Aggregate, reduces number of rows to be processed.
▌Query rewrite to make equivalent results from intermediate aggregation.
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
GroupAggregate
Sort
SeqScan
Tbl_1
Result Set
several rows
several millions rows
several
millions
rows
Y, count(*),
avg(X)
X, Y
X, Y
X, Y, Z (table definition)
GroupAggregate
Sort
GpuScan
Tbl_1
Result Set
several rows
several hundreds rows
several millions rows
Y, sum(nrows),
avg_ex(nrows, psum)
Y, nrows, psum_X
X, Y
X, Y, Z (table definition)
GpuPreAgg Y, nrows, psum_X
several hundreds rows
SELECT count(*), AVG(X), Y FROM Tbl_1 GROUP BY Y;
reduction
of rows to be
processed
26. 26
Agenda
1. GPU characteristics and why?
2. What intermediates PostgreSQL and GPU
3. How GPU processes SQL workloads
4. Performance results and analysis
5. Further development
27. 27
Microbenchmark (1/2) – total response time
▌SELECT cat, AVG(x) FROM t0 NATURAL JOIN t1 [, ...] GROUP BY cat;
measurement of query response time with increasing of inner relations
t0: 100M rows, t1~t10: 100K rows for each, all the data was preloaded.
▌Environment:
PostgreSQL v9.5beta1 + PG-Strom (22-Oct), CUDA 7.0 + RHEL6.6 (x86_64)
CPU: Xeon E5-2670v3, RAM: 384GB, GPU: NVIDIA TESLA K20c (2496cores)
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
58.89
82.51
102.15
125.46
159.83
194.50
241.88
12.44 17.88 17.56 21.98
33.33 38.14
50.29
0
50
100
150
200
250
300
2 3 4 5 6 7 8
QueryResponseTime[sec]
# of tables involved
Query Response Time towards Numer of Tables Joined
PostgreSQL 9.5β PG-Strom
28. 28
Microbenchmark (2/2) – Time consumption per component
▌Good acceleration results towards JOIN and AGGREGATION
▌Less acceleration ratio when N is larger than 6
By inaccurate problem size estimation, why?
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
0
50
100
150
200
250
300
PgSQL Strom PgSQL Strom PgSQL Strom PgSQL Strom PgSQL Strom PgSQL Strom PgSQL Strom
2 3 4 5 6 7 8
QueryResponseTime[sec]
# of tables involved
Time consumption per component (PostgreSQL v9.5β vs PG-Strom)
Scan Join Aggregate Others
29. 29
OT: Why accurate problem size estimation is important
▌GpuJoin results must be smaller than GPU RAM size.
▌Virtual inner-partition saves scale of the Join results.
Better size estimation reduces number of retries.
We can determine the problem size using dynamic parallelism
downside: Supported by only Kepler or later generation.
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
innerRel
(depth-1)
innerRel
(depth-2)
innerRel
(depth-3)
innerRel
(depth-4)
outerRel
Results
× × × × =
× × × × =
GpuJoin with virtual inner-partition and iteration
Larger
than
GPU RAM
Within
GPU RAM
Size
30. 30
TPC-DS Benchmark (1/2) – Total Results
▌What is TPC-DS?
Successor of TPC-H, defines 103 (99+4) kind of queries
Simulation of decision support queries in retail industry
▌About this benchmark
Run the series of TPC-DS queries sequentially.
We could collect 96 results of 103.
Environment:
• PostgreSQL v9.5beta1 + PG-Strom (22-Oct), CUDA 7.0 + RHEL6.6 (x86_64)
• CPU: Xeon E5-2670v3, RAM: 384GB, GPU: NVIDIA TESLA K20c (2496cores)
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
0
500
1000
1500
2000
QueryResponseTime[sec]
PostgreSQL v9.5β PG-Strom
anomaly?
31. 31
TPC-DS Benchmark (2/2) – Total Result
▌Summary
PG-Strom wins to PostgreSQL: 76 of 96
PostgreSQL wins to PG-Strom: 20 of 96
shows advantages on most of simple queries
▌Challenges
Wise plan construction, to avoid GpuXXX node not to be here
Functionality improvement for Aggregation and Scan
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
0
100
200
300
400
500
QueryResponseTime[sec]
PostgreSQL v9.5β PG-Strom
32. 32
Time consumption per workloads (1/3) – PostgreSQL v9.5
▌Scan, Join and Aggregation provides a clear explanation for
the time consumption factor in TPC-DS workloads
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Time consumption per workloads (PostgreSQL v9.5beta)
Scan Join Aggregate Others
33. 33
Time consumption per workloads (2/3) – PG-Strom
▌PG-Strom dramatically reduced time for Joins.
▌Time for Joins are: 37% 17%
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Time consumption per workloads (PostgreSQL v9.5beta)
Scan Join Aggregate Others
34. 34
Time consumption per workloads (3/3) – Wow! GpuJoin
GpuJoin can explain 84% of performance improvement
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
y = 0.962x - 7.289
R² = 0.8442
-50.00
0.00
50.00
100.00
150.00
200.00
250.00
-50.00 0.00 50.00 100.00 150.00 200.00 250.00
ImprovementofTotalTime[sec]
(TotaltimeofPgSQL)–(TotaltimeofPG-Strom)
Improvement of Join Time [sec]
(Join time of PgSQL) – (Join time of PG-Strom)
35. 35
Agenda
1. GPU characteristics and why?
2. What intermediates PostgreSQL and GPU
3. How GPU processes SQL workloads
4. Performance results and analysis
5. Further development
36. 36
What we learn from TPC-DS
▌GpuJoin may be a killer feature of PG-Strom
....unless problem size estimation is accurate.
Planner needs to be wiser, not to inject GpuJoin towards
the situation not suitable.
▌For Aggregation
GpuPreAgg didn’t appear as we expected
Anything GPU can help CPU aggregation?
▌For Scan
Scan performance == data stream bandwidth of GPU input
Single thread is a restriction for bandwidth of data supply
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
37. 37
Features enhancement / improvement
▌Target-list calculation on GPU
Pre-calculation of hash-value for HashAggregate
Projection off-load if complicated mathematical expression
▌Custom-Scan under Gather node
Allows multiple workers to provide data stream to GPUs
▌Partial Aggregation before Join
Wider utilization of GpuPreAgg to reduce number of rows
to be joined by CPU
▌Utilization of dynamic parallelism
Wise approach towards result buffer overflow problem in Join
▌... and more...
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
38. 38
Short term development
1. Software quality improvement
2. Corner case handling
3. Documentation
4. Target-List calculation in GPU
5. Software quality improvement
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
39. 39
WIP Project – In-place Computing
Why export entire data to run mathematical algorithm?
▌Runs complicated mathematical expression on GPUs, then
CPU just references the result
Algorithms in SQL (even partially) automatically moves to GPUs
WIP: Data-Clustering algorithms, like Min-Max, K-means, ...
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
mathematical
algorithm
Things
wants to
know
Things
wants to
know
result
result
mathematical
algorithm
Principle of
Near-Data
Computing
Limited
computing
capability
Data
Export
40. 40
Future Direction of PG-Strom
Real-life workloads are the only compass for development
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
41. 41
WE WANT YOU GET INVOLVED
▌PostgreSQL Users
who concern about query
execution performance on
your workloads
▌Application Developers
who are interested in trying
yet another database
acceleration methodology
▌PostgreSQL Developers
who can bet the future of
heterogeneous era
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL