Sub-second-sql-on-hadoop-at-scale

Sub-second SQL on Hadoop at Scale
Yifeng Jiang
Solutions Engineer, Hortonworks
2015/11/23
© Hortonworks Inc. 2011 – 2015. All Rights Reserved

About Me
蒋燚峰 (Yifeng Jiang)
•  Solutions Engineer, Hortonworks
•  Apache HBase book author
•  I like hiking
•  Twitter: @uprush

SQL on Hadoop Solutions
•  Hive on Tez
•  The de facto standard of SQL on Hadoop for interactive SQL
•  One tool, all big data SQL use cases: ETL, reporting, BI, analytics, etc.
•  Hive LLAP
•  To make Hive even faster

SQL on Hadoop Solutions – Cont.
•  Phoenix
•  High performance relational database layer over HBase for low latency applications
•  Spark SQL
•  Spark's module for working with structured data
•  Kylin
•  Extreme OLAP engine for big data open sourced by eBay
•  Not supported by Hortonworks

Agenda
Two Real-life SQL on Hadoop Use Cases
•  Use Case #1: Highly Parallel Workload Over Massive Data
•  Use Case #2: Sub-second SQL on Hadoop at Scale

Highly Parallel Workload Over Massive Data

Use Case #1
Batch Reporting
Massive Dataset
•  13 months, 450B+ rows of data
•  Adding 1.3B rows of data per day
Highly Parallel Workload
•  100K reports per day
•  15K reports per hour
Input Dataset

Key Hive Optimization
Hive on Tez selected
Four Hive on Tez optimization points
•  Partitioning
•  Data loading
•  Query execution
•  Parallel tuning

Partitioning
Maximize the number of partitions
•  Basic and most important for performance
•  Only read relevant data
Keep the total number under a couple thousand partitions
•  Hive seems to handle this many partitions very well in queries
CREATE TABLE access_logs (
host string,
path string,
referrer string,
…
) PARTITIONED BY (
site int,
ymd date
)
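As a rough sanity check on the partition budget, the sketch below multiplies the two partition columns of access_logs. The site count and the 2,000 ceiling are illustrative assumptions (the slide only says "a couple thousand"), not measured limits.

```python
# Hypothetical estimate of the partition count for a table
# partitioned by (site, ymd). All numbers are illustrative only.

def partition_count(num_sites: int, num_days: int) -> int:
    """Each (site, day) pair becomes one Hive partition."""
    return num_sites * num_days


def within_budget(num_sites: int, num_days: int, budget: int = 2000) -> bool:
    """True if the table stays at or under the assumed partition budget."""
    return partition_count(num_sites, num_days) <= budget


if __name__ == "__main__":
    # Assumed: 5 sites and roughly 13 months (395 days) of daily partitions.
    print(partition_count(5, 395), within_budget(5, 395))
```

If the estimate exceeds the budget, a coarser partition column (for example month instead of day) is one way to bring the count back down.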

Data Loading
Load data into Hive table stored as ORC
Three main ORC parameters
•  File system block size: 256MB
•  Stripe size: 64MB
•  Compression: ZLIB
•  ZLIB is highly optimized in new Hive versions

Data Loading – Cont.
Make sure ORC files are big enough
•  Between 1 and 10 HDFS blocks if possible
•  Avoid having lots of reducers that write to all partitions
•  Enable optimize sort dynamic partitioning
•  Or use DISTRIBUTE BY clause
•  We chose DISTRIBUTE BY for fine grained control
-- assumes country is the last column of daily_sales (dynamic partition column)
INSERT INTO TABLE orc_sales PARTITION ( country ) SELECT * FROM daily_sales
DISTRIBUTE BY country, gender;
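The sizing guidance above is simple arithmetic, sketched here under stated assumptions. The block and stripe sizes come from the slide; the target of 4 blocks per file and the partition sizes in the usage line are made up for illustration.

```python
import math

BLOCK_MB = 256   # file system block size from the slide
STRIPE_MB = 64   # ORC stripe size from the slide


def stripes_per_block() -> int:
    """How many full stripes fit into one file system block."""
    return BLOCK_MB // STRIPE_MB


def writers_for_partition(partition_mb: float, target_blocks: int = 4) -> int:
    """Number of writers (reducers) for one partition so that each ORC file
    is roughly target_blocks blocks large. target_blocks is an assumed
    value and must stay in the 1-10 range recommended on the slide."""
    if not 1 <= target_blocks <= 10:
        raise ValueError("target_blocks should be between 1 and 10")
    return max(1, math.ceil(partition_mb / (target_blocks * BLOCK_MB)))


if __name__ == "__main__":
    # Assumed partition sizes: 100 MB and 5 GB.
    print(stripes_per_block(), writers_for_partition(100), writers_for_partition(5120))
```

The result can guide how many distinct values the DISTRIBUTE BY keys should produce per partition.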

Query Execution
Query execution time is essentially the sum of
•  Client execution [ 0s if done correctly ]
•  Optimization [HiveServer2] [~ 0.1s]
•  HCatalog lookups [HCatalog, Metastore] [ very fast in Hive 0.14 ]
•  Application Master creation [4-5s]
•  Container Allocation [3-5s]
•  Query Execution
[Diagram: a client running the testing tool opens N connections to each of two HiveServer2 instances (Server #1, Server #2); components shown are Metastore, Metastore DB, Tez AM, Tez containers, YARN and HDFS]
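To see why the fixed start-up steps dominate short queries, the following sketch adds up the ranges quoted above. Only the three steps with numbers on the slide are included; the dictionary keys and the function name are hypothetical.

```python
# Fixed start-up costs in seconds (low, high), taken from the slide.
OVERHEAD_S = {
    "optimization": (0.1, 0.1),
    "am_creation": (4.0, 5.0),
    "container_allocation": (3.0, 5.0),
}


def startup_overhead(skip=()):
    """Return the (low, high) overhead in seconds, leaving out any step
    named in skip, e.g. steps avoided through session or container reuse."""
    low = sum(v[0] for k, v in OVERHEAD_S.items() if k not in skip)
    high = sum(v[1] for k, v in OVERHEAD_S.items() if k not in skip)
    return low, high


if __name__ == "__main__":
    print("cold:", startup_overhead())
    print("warm:", startup_overhead(skip=("am_creation", "container_allocation")))
```

A cold query pays roughly 7 to 10 seconds before any data is read, while a query that can skip AM creation and container allocation pays about 0.1 seconds.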

Query Execution – Cont.
Connection setup has high overhead
•  Open one connection and execute a large number of queries
•  Standard connection pooling
Distribute Queries to 2 Hive Servers
•  HiveServer2 becomes a bottleneck at roughly 8-15 queries/s
•  Deploy multiple Hive Servers through Ambari
•  A fix that parallelizes query compilation is available in later versions
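A minimal sketch of the two ideas above: keep one long-lived connection per HiveServer2 and spread queries across the servers in turn. The connect factory is caller-supplied and hypothetical (in practice it would wrap whatever Hive client is in use); FakeConnection exists only so the example runs without a cluster.

```python
from itertools import cycle


class FakeConnection:
    """Stand-in for a real Hive connection, for illustration only."""

    def __init__(self, url):
        self.url = url

    def execute(self, sql):
        # A real client would send the query; here we just report the target.
        return f"{self.url}: {sql}"


class RoundRobinPool:
    """Opens one connection per server up front and reuses it for every query."""

    def __init__(self, urls, connect):
        self._conns = [connect(u) for u in urls]  # pay connection setup once
        self._next = cycle(self._conns)           # rotate across servers

    def execute(self, sql):
        return next(self._next).execute(sql)


if __name__ == "__main__":
    pool = RoundRobinPool(["hs2-a", "hs2-b"], FakeConnection)
    for _ in range(4):
        print(pool.execute("SELECT 1"))
```

Round robin is the simplest policy; a production pool would also need health checks and per-connection locking, which are left out here.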

Re-use Tez Session, Pre-warming
•  Reinitializing Tez session takes 5+ seconds
•  Turn on Tez session re-use
•  Tez sessions can be pre-initialized with pre-warm (with some drawbacks).
•  With pre-warm, full speed is practically instantaneous

Re-use Tez container
•  Re-creating containers takes 3 seconds
•  Enable container reuse and keep idle containers for a short period of time
•  Key is to reach 100% utilization without wasting resources
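The reuse and pre-warm switches map to Hive and Tez properties. The sketch below renders a candidate property list; the property names are existing Hive/Tez settings, but every value is an assumption for illustration, and whether a property belongs in hive-site.xml, tez-site.xml or a session depends on the deployment.

```python
# Candidate settings for session reuse, pre-warm and container reuse.
# Values marked "assumed" are illustrative, not taken from the slides.
SETTINGS = {
    "hive.execution.engine": "tez",
    "hive.server2.tez.initialize.default.sessions": "true",
    "hive.prewarm.enabled": "true",
    "hive.prewarm.numcontainers": "10",                            # assumed
    "tez.am.container.reuse.enabled": "true",
    "tez.am.container.idle.release-timeout-min.millis": "10000",   # assumed
    "tez.am.container.idle.release-timeout-max.millis": "20000",   # assumed
}


def render_properties(settings):
    """Render the settings as key=value lines for review or templating."""
    return [f"{key}={value}" for key, value in settings.items()]


if __name__ == "__main__":
    print("\n".join(render_properties(SETTINGS)))
```

The idle release timeouts control how long containers are kept after a query, which is the knob behind trading utilization against wasted resources.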

Tuning for Parallel Execution
The most important point for many real-world scenarios
•  In most query tuning scenarios this is at first ignored
•  Oftentimes single queries benefit from additional resources but this can reduce throughput
Tez memory settings are key for parallelization
•  With optimized Tez memory settings, we achieved 90+% CPU utilization in cluster
[Chart: queries per second and cluster utilization (memory) against query concurrency, from 24 to 228]
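The trade-off between per-query resources and throughput can be sketched with integer arithmetic. Every number in the usage line (cluster memory, container size, containers per query, AM size) is hypothetical and only shows the shape of the calculation.

```python
def max_concurrent_queries(cluster_mem_mb: int, container_mb: int,
                           containers_per_query: int, am_mb: int) -> int:
    """Upper bound on queries that fit in memory at once, assuming each query
    holds one Tez AM plus a fixed number of equally sized containers."""
    per_query_mb = am_mb + containers_per_query * container_mb
    return cluster_mem_mb // per_query_mb


if __name__ == "__main__":
    cluster = 2_048_000  # assumed total YARN memory in MB
    for container in (4096, 8192):
        print(container, max_concurrent_queries(cluster, container, 10, 4096))
```

Doubling the container size in this example roughly halves the number of queries that can run side by side, which is why a setting that speeds up one query can lower overall throughput.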

Hive vs. Impala Performance Benchmark
Hive performance
•  Most queries respond within 20s
•  Up to 70s for big result sets
Impala performance
•  Many queries took 30s to 90s
•  Queries with big result sets took more than 10m
•  Notable performance degradation during parallel execution
Benchmark Blog
[Chart: number of queries by response time]

Sub-second SQL on Hadoop at Scale

Sub-second SQL Use Case
Online Reporting
•  Interactive online reporting
•  Query is relatively simple
•  Massive dataset
•  Low latency requirement
•  Sub-second response for most queries
•  Up to several seconds for big queries
•  Highly parallel
SELECT account, yyyymmdd,
sum(total_imps),
sum(total_click),
...
FROM table_x
WHERE yyyymmdd >= xxx
AND yyyymmdd < xxx
AND account = xxx
...
GROUP BY account, yyyymmdd, ...;

Which One to Choose?
Apache Kylin
Hive LLAP

Hive Performance Recap
Hive is fast: interactive response
•  Tez execution engine (replacing MapReduce)
•  ORC columnar file format
•  Cost based optimizer (CBO)
•  Vectorized SQL engine
[Diagram: Hive 0.10 (batch processing) to Hive 0.14 (human interactive, 5 seconds), a 100-150x query speedup]

Hive LLAP
LLAP = Live Long And Process
LLAP process runs on multiple nodes, accelerating Tez tasks
[Diagram: a Hive query is served by LLAP processes, one on each node, on top of HDFS]

LLAP Query Execution
•  Number of concurrent queries throttled by Hive Server
•  Hive Server compiles and optimizes queries
•  Each query coordinated independently by a Tez AM
•  Hive operators used for processing
•  Tez Runtime components used for data transfer
•  Hive decides where query fragments run (LLAP, Container, AM)
[Diagram: clients connect to HiveServer (Query/AM Controller); in the YARN cluster, AM1 and AM2 coordinate query fragments that run in llapd daemons and in containers owned by each AM]

Hive LLAP -- Key Benefits
Performance benefits
•  Reduced startup time
•  Columnar data cache
•  Long-lived process is easy to optimize
•  JIT, concurrent I/O, etc.

Tez vs. LLAP
•  LLAP is 7 times faster for very small queries
•  2 times faster for heavy queries
•  1.5 times faster for high result size
[Chart: Tez vs. LLAP latency for avg, max Day20, max Day200, max DMA20, max DMA200 and max Landing (max q1 to max q5)]

LLAP Scalability
•  LLAP scales to 30 q/s on our cluster
•  Additional Hive Server needed after 20 threads
•  Timeline server needs to be disabled
•  Impact on query latency around 48 threads and 25 q/s
[Chart: throughput (Q/S) at 5 threads, 10 threads with 1 HS, 20 threads with 1 HS, and 20, 48, 72 and 96 threads with 3 HS]
[Chart: average and max latency (Day20, Day200, DMA20, DMA200, Landing; max q1 to max q5) for the same configurations. Annotation: query latency impacted]
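Thread count, throughput and latency in such a load test are tied together by Little's law (concurrency = throughput x mean latency). The sketch below applies it to the figures quoted above, assuming a closed-loop test in which every thread issues its next query immediately; the slides do not state the test setup, so treat the result as an estimate.

```python
def mean_latency_s(threads: int, throughput_qps: float) -> float:
    """Mean end-to-end latency implied by Little's law for a closed-loop
    test with no think time: latency = concurrency / throughput."""
    if throughput_qps <= 0:
        raise ValueError("throughput must be positive")
    return threads / throughput_qps


if __name__ == "__main__":
    # 48 threads at 25 q/s, the point where the slide notes a latency impact.
    print(round(mean_latency_s(48, 25), 2))
```

Under that assumption, 48 threads at 25 q/s imply a mean latency of about 1.9 seconds, so adding threads beyond this point mostly adds queueing time rather than throughput.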

LLAP vs. Phoenix: Average Latency
•  All averages under 2s response time
•  Fastest average latency with Phoenix at 15 q/s
[Chart: average latency for Daily 20, Daily 200, DMA 20, DMA 200, Landing and all queries (avg q1 to avg q6). Series: LLAP 5 threads; LLAP 20 threads 3HS (15 q/s); Phoenix 256 4 threads (15 q/s); Phoenix 1024 10 threads (15 q/s)]

LLAP vs. Phoenix: Max Latency
•  Result size and scan size impact latency
•  Fastest is Phoenix with 1024 regions
•  Bottleneck on LLAP seems to be transferring results through HS to the client
•  Patch in progress: HIVE-12049
[Chart: max latency for Daily 20, Daily 200, DMA 20, DMA 200 and Landing (max q1 to max q5). Series: LLAP 5 threads; LLAP 20 threads 3HS (15 q/s); Phoenix 256 4 threads (15 q/s); Phoenix 1024 10 threads (15 q/s)]

Phoenix Details
•  Phoenix scales at least up to 40 q/s with 3 clients on our cluster
[Chart: throughput (Q/S) for 256 RS 4 threads, 256 RS 10 threads, 1024 RS 4 threads, 1024 RS 10 threads, and 1024 RS 10 threads with 3 clients. Annotation: 40% higher throughput at 256 regions]
[Chart: average and max latency (Day20, Day200, DMA20, DMA200, Landing; max q1 to max q5) for the same five configurations. Annotation: 3x faster big scan query at 1024 regions]

Tuning Points for Phoenix
•  Skip Scan
•  Merge Sort Patch for Client
•  Splitting Table using Hash
•  HBase & Phoenix Configurations

Phoenix Skip Scan
•  Using Skip Scan improved query performance by up to an order of magnitude
•  5-10 times faster
•  Allows Phoenix to skip unneeded sub-keys
•  E.g., skip from day to day
SELECT * from T
WHERE ((KEY1 >='a' AND KEY1 <= 'b') OR
(KEY1 > 'c' AND KEY1 <= 'e')) AND
KEY2 IN (1, 2)
Ref:
http://phoenix-hbase.blogspot.jp/2013/05/demystifying-skip-scan-in-phoenix.html
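The idea behind a skip scan can be shown on a sorted in-memory list: instead of reading every row, seek directly to each wanted key combination. This is a simplified illustration, not Phoenix's implementation: it enumerates explicit key values rather than ranges, and the row layout (day, account, value) is invented for the example.

```python
from bisect import bisect_left


def full_scan(rows, key1_values, key2_values):
    """Baseline: test every row against both filters."""
    return [r for r in rows if r[0] in key1_values and r[1] in key2_values]


def skip_scan(rows, key1_values, key2_values):
    """Seek to each (key1, key2) prefix in the sorted rows and read only the
    matching run. Returns the hits and the number of seeks performed."""
    hits, seeks = [], 0
    for k1 in sorted(key1_values):
        for k2 in sorted(key2_values):
            i = bisect_left(rows, (k1, k2))  # jump to first row with this prefix
            seeks += 1
            while i < len(rows) and rows[i][:2] == (k1, k2):
                hits.append(rows[i])
                i += 1
    return hits, seeks


if __name__ == "__main__":
    # 30 days x 100 accounts, sorted by (day, account) like a composite row key.
    rows = sorted((d, a, d * 100 + a) for d in range(1, 31) for a in range(1, 101))
    hits, seeks = skip_scan(rows, {10, 11}, {1, 2})
    print(len(hits), seeks, len(rows))
```

Here four seeks replace a pass over 3,000 rows, which mirrors the "skip from day to day" behavior described above.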

Phoenix Merge Sort Patch
•  Phoenix bottleneck for large result sets was client side merge
•  Patch: PHOENIX-2126
•  6 times faster for biggest result query by fixing slow merge-sort
[Chart: max latency for Landing, Daily 20, Daily 200, DMA 20 and DMA 200 (max q5, max q1 to max q4), Phoenix unpatched vs. Phoenix patched]
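For context, a client-side merge combines per-region result streams that are each already sorted. The sketch below shows a generic k-way merge with Python's heapq.merge; it illustrates the general technique only and makes no claim about how PHOENIX-2126 changes the Phoenix client.

```python
import heapq


def merge_region_results(per_region_rows):
    """Merge several individually sorted result lists into one sorted list.
    heapq.merge keeps only one pending row per input on its heap."""
    return list(heapq.merge(*per_region_rows))


if __name__ == "__main__":
    # Hypothetical sorted results from three regions.
    regions = [[1, 4, 7], [2, 5], [3, 6, 8]]
    print(merge_region_results(regions))
```

The cost of such a merge grows with the total number of rows returned, which is why large result sets expose any inefficiency in this step.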

Phoenix Splitting Table
•  Salted Tables
•  Automatically salted on row key
•  Manual split point definition
•  Minimize result set size ( client side merge )
•  Increase Parallelization ( server side aggregation )
•  Create original salt based on group by keys
•  Divide table into N regions in order to maximize CPU usage
CREATE TABLE my_table (
  salt CHAR(2) NOT NULL,
  user_id INTEGER NOT NULL,
  ymd INTEGER NOT NULL,
  clicks BIGINT,
  CONSTRAINT pk PRIMARY KEY (salt, user_id, ymd))
SPLIT ON ('01', '02', … , 'ff');
CREATE TABLE my_table (a_key VARCHAR PRIMARY KEY, a_col VARCHAR) SALT_BUCKETS = 20;
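A manual salt for the first table could be derived from the group-by key as sketched below. The slides do not say which hash was used, so CRC32 is an assumption, as are the function names; the split points match the two-character hex values in the SPLIT ON clause.

```python
import zlib


def salt_for(user_id: int, buckets: int = 256) -> str:
    """Two-character hex salt derived from the group-by key (user_id).
    CRC32 is an assumed choice; any stable hash would work the same way."""
    if not 1 <= buckets <= 256:
        raise ValueError("two hex characters can address at most 256 buckets")
    digest = zlib.crc32(str(user_id).encode("utf-8"))
    return format(digest % buckets, "02x")


def split_points(buckets: int = 256):
    """Split points '01' .. 'ff', giving one region per salt value."""
    return [format(i, "02x") for i in range(1, buckets)]


if __name__ == "__main__":
    print(salt_for(42), split_points()[:3], split_points()[-1])
```

Because all rows of one user share a salt, they land in the same region, so aggregation for that user can happen on the server side while different users spread across all regions.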

Phoenix Configurations
•  Increase Cache use
•  -XX:MaxDirectMemorySize = 30720MB
•  hbase.bucketcache.size = 20480MB
•  Remove bottlenecks
•  hbase.regionserver.handler.count = 240
•  phoenix.query.queueSize = 100000
•  phoenix.query.threadPoolSize = 2048
•  Prevent AutoSplit
•  hbase.hregion.max.filesize = 100GB

Sub-second SQL Use Case Summary
•  Hive LLAP/Tez
•  Tez is proven and scalable in heavy batch tasks
•  LLAP reduces Hive latency significantly
•  LLAP is under active development
•  Phoenix [ winner for this particular use case ]
•  Sub-second queries possible today
•  Simple SQL plays to Phoenix strengths
•  PHOENIX-2126 fixes client bottlenecks for large queries

Summary

SQL on Hadoop at Scale
Hive is the de facto standard of SQL on Hadoop
•  Hive on Tez for batch and interactive SQL
•  Best solution for all general big data SQL use cases: ETL, reporting, BI, analytics, etc.
•  LLAP to make Hive even faster
•  Hive on Tez and LLAP proved performance at scale
Other SQL on Hadoop Options
•  Phoenix, Spark SQL, Kylin
•  Great tools for particular use cases
