B17 Eliminating the database bottleneck
- 1. Eliminating the Database Bottleneck
What makes Vectorwise so fast
Mark Van de Wiel
Director Product Management, Vectorwise
Thursday, November 01, 2012
Confidential © 2012 Actian Corporation
- 2. Agenda
Why traditional RDBMSs are slow for analytics
Why Vectorwise is fast
The I/O challenge
Efficient updates
- 3. 100x (+) Performance Difference – 2003
Custom C versus Relational Database
TPC-H 1 GB query 1 (runtime in seconds):
MySQL: 28.1
DBMS 'X': 26.2
C program: 0.2
Vectorwise: 0.6
- 5. Inefficient Storage for Analytics
Row-based storage model
Predominant in 2003, still very common today
Works well for OLTP
101 Joe 27 Black
103 Edward 21 Scissorhand
- 6. Inefficient Storage – Row-based
Pages on disk – example
101 27 Joe Black
103 21 Edward Scissorhand
(diagram: slotted pages with variable-width attribute pointers and pointers to tuples)
- 7. Issues with Row-based Storage
Always read all attributes
Poor bandwidth
Poor use of memory buffer
Complex row structure and navigation
E.g. compressing out null fields
E.g. row chaining
- 8. Efficient Storage for Analytics
Columnar storage: store attributes separately
Retrieve only attributes required by the query
Used by “traditional” column stores, e.g. Sybase IQ, Vertica
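A back-of-the-envelope sketch of why this matters for I/O. The row counts, attribute widths, and the "2 of 10 attributes" query shape are invented for illustration, not taken from the slides:

```python
# Toy comparison (all numbers invented): a query touching 2 of 10
# fixed-width attributes reads 2/10 of the data from a column store,
# but every attribute of every row from a row store.

num_rows = 1_000_000
num_attrs = 10
bytes_per_attr = 8
attrs_needed = 2  # e.g. name and salary

row_store_bytes = num_rows * num_attrs * bytes_per_attr      # all attributes read
col_store_bytes = num_rows * attrs_needed * bytes_per_attr   # only the two columns

print(row_store_bytes // col_store_bytes)  # row store reads 5x more here
```

Under these assumptions the row store moves 5x the bytes for the same answer; the gap grows with wider tables.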
- 9. Inefficient Processing
How a traditional database runs a query
Query:
SELECT
name,
salary*.19 AS tax
FROM
employee
WHERE
age > 25
- 10. Inefficient Processing
How a traditional database runs a query
Tuple-at-a-time iterator interface:
- open()
- next(): tuple
- close()
next() is called:
- for each operator
- for each tuple
Complex code repeated over and over
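The open()/next()/close() model above can be sketched as follows. The toy table, operator classes, and the next()-call counter are invented for illustration; a real engine's operators are far more involved:

```python
# Toy Volcano-style plan for: SELECT name, salary*.19 AS tax
#                             FROM employee WHERE age > 25
# Counts next() calls to show the per-operator, per-tuple overhead.

EMPLOYEES = [
    {"name": "Joe", "age": 27, "salary": 40000.0},
    {"name": "Edward", "age": 21, "salary": 30000.0},
    {"name": "Ann", "age": 35, "salary": 50000.0},
]

next_calls = 0  # total next() invocations across all operators

class Scan:
    def open(self):
        self.it = iter(EMPLOYEES)
    def next(self):
        global next_calls
        next_calls += 1
        return next(self.it, None)
    def close(self):
        pass

class Select:  # WHERE age > 25
    def __init__(self, child):
        self.child = child
    def open(self):
        self.child.open()
    def next(self):
        global next_calls
        next_calls += 1
        while True:
            t = self.child.next()
            if t is None or t["age"] > 25:
                return t
    def close(self):
        self.child.close()

class Project:  # SELECT name, salary * .19 AS tax
    def __init__(self, child):
        self.child = child
    def open(self):
        self.child.open()
    def next(self):
        global next_calls
        next_calls += 1
        t = self.child.next()
        if t is None:
            return None
        return {"name": t["name"], "tax": t["salary"] * 0.19}
    def close(self):
        self.child.close()

plan = Project(Select(Scan()))
plan.open()
rows = []
while (row := plan.next()) is not None:
    rows.append(row)
plan.close()
print(rows)        # the two qualifying tuples
print(next_calls)  # one next() per operator per tuple adds up quickly
```

Even this three-row, three-operator plan makes 10 next() calls; at millions of rows the function-call and interpretation overhead dominates the actual data work.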
- 11. Inefficient Processing
How a traditional database runs a query
Data-specific computational functionality
Called once for every operation on every tuple
Worse for complex tuple representations
- 12. Inefficient Processing (Part 1 of 2)
Lots of repeated, unnecessary code
Operator logic
Function calls
Attribute access
Most instructions interpreting a query
Very few instructions processing actual data!
Many instructions per tuple
- 13. CPU Features – Inefficient Processing Part 2
In the last 20 years…
On-chip caches, because RAM access is slow and contended
Branch-sensitive CPU pipelines
Superscalar features
SIMD instructions (SSE and AVX)
Great for multimedia processing, scientific computing…
… but NOT for traditional relational databases
Complex code: function calls, branches
Poor use of CPU cache (both data and instructions)
Processing one value at a time
- 15. Vectorwise – Vector-based Processing
Query:
SELECT
name,
salary*.19 AS tax
FROM
employee
WHERE
age > 25
- 16. Vectorwise – Vector-based Processing
A vector contains data of multiple tuples (up to 1024)
All operations consume and produce entire vectors
Effect: far fewer operator.next() and primitive calls
AND: pipelined query evaluation
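The vector-at-a-time model can be sketched for the same query. The primitive names, toy data, and chunking helper are invented; plain Python lists stand in for the tight array loops a real engine compiles to cache- and SIMD-friendly code:

```python
# Toy vectorized plan for: SELECT name, salary*.19 AS tax
#                          FROM employee WHERE age > 25
# Each primitive call processes a whole vector, not one tuple.

VECTOR_SIZE = 1024

def scan_vectors(names, ages, salaries):
    """Yield the columns one vector (chunk of up to 1024 values) at a time."""
    for i in range(0, len(names), VECTOR_SIZE):
        yield (names[i:i + VECTOR_SIZE],
               ages[i:i + VECTOR_SIZE],
               salaries[i:i + VECTOR_SIZE])

def select_gt(values, constant):
    """Selection primitive: positions where value > constant."""
    return [i for i, v in enumerate(values) if v > constant]

def mul_const(values, constant, sel):
    """Arithmetic primitive over the selected positions only."""
    return [values[i] * constant for i in sel]

def gather(values, sel):
    """Fetch the selected positions from a vector."""
    return [values[i] for i in sel]

# Toy data: 3000 employees, so the whole query runs over just 3 vectors.
n = 3000
names = [f"emp{i}" for i in range(n)]
ages = [20 + (i % 30) for i in range(n)]
salaries = [1000.0 + i for i in range(n)]

out_names, out_tax = [], []
for v_names, v_ages, v_sal in scan_vectors(names, ages, salaries):
    sel = select_gt(v_ages, 25)             # WHERE age > 25
    out_names += gather(v_names, sel)       # SELECT name
    out_tax += mul_const(v_sal, 0.19, sel)  # salary * .19 AS tax

print(len(out_names))  # qualifying rows, produced with ~3 calls per primitive
```

Note the interpretation cost (function calls, operator dispatch) is paid once per vector of up to 1024 values instead of once per tuple, while the vectors still stream through the plan, keeping evaluation pipelined.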
- 17. Why is Vectorwise so Fast?
Reduced interpretation overhead
100+ times fewer function calls
Good CPU cache use
High locality in primitives
Cache-conscious algorithms
No tuple navigation
Primitives only see arrays
Vectorization allows algorithmic optimization
CPU and compiler-friendly function bodies
Multiple work units, loop-pipelining, SIMD…
BONUS: PARALLEL QUERY
- 18. Some Numbers
Traditional RDBMS: <200 MB/s per core
Vectorwise (lab environment): >1.5 GB/s per core
- 19. Addressing the I/O Challenge
Columnar storage
Smart column buffer (memory)
Data compression
On disk: less I/O
In memory: best use of column buffer
Ultra-efficient decompression algorithms to get sufficient throughput
Large contiguous data blocks for optimum disk I/O
In-memory min-max indexes per block (i.e. per column)
Eliminate data blocks based on implicit/explicit filter criteria
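The min-max elimination idea can be sketched like this. The block size, column values, and helper names are invented for illustration:

```python
# Toy per-block min-max index: each column block keeps its (min, max)
# in memory, and a scan with a filter skips blocks whose value range
# cannot possibly satisfy the predicate.

def build_minmax(column, block_size):
    """Per-block (min, max) pairs kept in memory."""
    index = []
    for i in range(0, len(column), block_size):
        block = column[i:i + block_size]
        index.append((min(block), max(block)))
    return index

def blocks_to_read(index, predicate_low):
    """Block ids that may contain values > predicate_low."""
    return [b for b, (_, hi) in enumerate(index) if hi > predicate_low]

# A date-like column that arrives roughly in order, so blocks have
# narrow value ranges and eliminate well.
column = list(range(10_000))
idx = build_minmax(column, block_size=1000)

# Filter: value > 7500 -> only the last 3 of 10 blocks need any I/O.
print(blocks_to_read(idx, 7500))
```

Elimination works best when values correlate with load order (dates, sequence numbers), since each block then covers a narrow range.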
- 20. Efficient Updates in a Column Store
Positional Delta Trees (PDTs)
In-memory representation of small data changes
Efficiently merged with on-disk data
Periodically propagated to disk
Provide snapshot read consistency
ACID compliant
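A toy sketch of the core merge idea behind positional deltas: small in-memory changes are addressed by tuple position and applied on the fly while scanning the immutable on-disk column. This illustrates only the merge step; the delta representation and dict here are invented, and the real PDT is a tree supporting snapshots and efficient position arithmetic:

```python
# Merge in-memory positional deltas with an on-disk column during a scan.

disk_column = [10, 20, 30, 40, 50]  # immutable on-disk data

# Deltas keyed by position: ("upd", value) replaces the value,
# ("del",) drops it, ("ins", value) inserts before that position.
deltas = {1: ("upd", 21), 3: ("del",), 4: ("ins", 45)}

def merged_scan(column, deltas):
    out = []
    for pos, value in enumerate(column):
        d = deltas.get(pos)
        if d is None:
            out.append(value)           # unchanged on-disk value
        elif d[0] == "upd":
            out.append(d[1])            # updated value from memory
        elif d[0] == "ins":
            out.append(d[1])            # inserted value, then the original
            out.append(value)
        # "del": skip the on-disk value entirely
    return out

print(merged_scan(disk_column, deltas))  # deltas applied without touching disk pages
```

Readers always see the merged view, so small updates stay cheap in memory until a background propagation rewrites the compressed on-disk blocks.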
- 21. Agenda
Why traditional RDBMSs are slow for analytics
Why Vectorwise is fast
The I/O challenge
Efficient updates