Sub-second SQL on Hadoop at Scale
Yifeng Jiang
Solutions Engineer, Hortonworks
2015/11/23
© Hortonworks Inc. 2011 – 2015. All Rights Reserved
About Me
蒋 燚峰 (Yifeng Jiang)
•  Solutions Engineer, Hortonworks
•  Apache HBase book author
•  I like hiking
•  Twitter: @uprush
SQL on Hadoop Solutions
•  Hive on Tez
•  The de facto standard of SQL on Hadoop for interactive SQL
•  One tool, all big data SQL use cases: ETL, reporting, BI, analytics, etc.
•  Hive LLAP
•  To make Hive even faster
SQL on Hadoop Solutions – Cont.
•  Phoenix
•  High performance relational database layer over HBase for low latency applications
•  Spark SQL
•  Spark's module for working with structured data
•  Kylin
•  Extreme OLAP engine for big data open sourced by eBay
•  Not supported by Hortonworks
Agenda
Two Real-life SQL on Hadoop Use Cases
•  Use Case #1: Highly Parallel Workload Over Massive Data
•  Use Case #2: Sub-second SQL on Hadoop at Scale
Highly Parallel Workload Over Massive Data
Use Case #1
Batch Reporting
Massive Dataset
•  13 months, 450B+ rows of data
•  Adding 1.3B rows of data per day
Highly Parallel Workload
•  100K reports per day
•  15K reports per hour
Input Dataset
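To put the figures above in perspective, here is a back-of-the-envelope calculation. It is only a sketch: the average month length of 30.4 days is an assumption that does not appear in the slides.

```python
# Back-of-the-envelope figures for Use Case #1 (inputs taken from the slide).
ROWS_PER_DAY = 1.3e9        # rows added per day
REPORTS_PER_HOUR = 15_000   # hourly report volume
REPORTS_PER_DAY = 100_000   # daily report volume
MONTHS_RETAINED = 13
DAYS_PER_MONTH = 30.4       # assumption: average month length

retained_rows = ROWS_PER_DAY * MONTHS_RETAINED * DAYS_PER_MONTH
peak_reports_per_sec = REPORTS_PER_HOUR / 3600
avg_reports_per_sec = REPORTS_PER_DAY / 86_400

print(f"rows retained at today's ingest rate: {retained_rows / 1e9:.0f}B")
print(f"hourly report rate: {peak_reports_per_sec:.2f}/s")
print(f"daily average report rate: {avg_reports_per_sec:.2f}/s")
```

At 15K reports per hour the cluster has to finish a little over four reports every second, which is why per-query startup overhead matters so much in this use case.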
Key Hive Optimization
Hive on Tez selected
Four Hive on Tez optimization points
•  Partitioning
•  Data loading
•  Query execution
•  Parallel tuning
Partitioning
Maximize the number of partitions
•  The most basic and most important optimization for performance
•  Queries read only the relevant data
Keep the total number of partitions under a couple of thousand
•  Hive seems to handle this many partitions very well for queries
CREATE TABLE access_logs (
host string,
path string,
referrer string,
…
) PARTITIONED BY (
site int,
ymd date
)
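A quick way to check a partitioning scheme like the one above against the "couple of thousand" guideline is to count the partitions it would create. This is a minimal sketch; the site count and date range are hypothetical and are not taken from the slides.

```python
from datetime import date

def partition_count(num_sites: int, first_day: date, last_day: date) -> int:
    """Number of (site, ymd) partitions if every site has data every day."""
    days = (last_day - first_day).days + 1
    return num_sites * days

# Hypothetical figures: 5 sites and 13 months of daily partitions.
n = partition_count(5, date(2014, 11, 1), date(2015, 11, 30))
print(n)
```

With these assumed inputs the table stays just under 2,000 partitions; ten times as many sites would push it far past the guideline and suggest a coarser partition key.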
Data Loading
Load data into Hive table stored as ORC
Three main ORC parameters
•  File system block size: 256MB
•  Stripe size: 64MB
•  Compression: ZLIB
•  ZLIB is highly optimized in newer Hive versions
Data Loading – Cont.
Make sure ORC files are big enough
•  Between 1 and 10 HDFS blocks if possible
•  Avoid having lots of reducers that each write to all partitions
•  Enable sorted dynamic partitioning (hive.optimize.sort.dynamic.partition)
•  Or use the DISTRIBUTE BY clause
•  We chose DISTRIBUTE BY for fine-grained control
INSERT INTO orc_sales PARTITION ( country ) SELECT * FROM daily_sales
DISTRIBUTE BY country, gender;
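The file-size guideline can be turned into a rough writer count per partition. The sketch below uses the 256MB block and 64MB stripe sizes from the previous slide; the target of 4 blocks per file and the 20,000 MB partition size are assumptions chosen for illustration.

```python
import math

BLOCK_MB = 256    # file system block size from the slide
STRIPE_MB = 64    # ORC stripe size from the slide

def writers_for_partition(partition_mb: float, target_blocks: int = 4) -> int:
    """Writers (one output file each) that keep files near target_blocks blocks."""
    if not 1 <= target_blocks <= 10:
        raise ValueError("target_blocks should stay within the 1-10 block guideline")
    return max(1, math.ceil(partition_mb / (BLOCK_MB * target_blocks)))

stripes_per_block = BLOCK_MB // STRIPE_MB
writers = writers_for_partition(20_000)   # hypothetical 20,000 MB partition
file_mb = 20_000 / writers
print(stripes_per_block, writers, file_mb)
```

A very small partition still gets one writer and therefore a file smaller than one block; the guideline can only be met when the partition itself holds enough data.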
Query Execution
A query execution is essentially composed of
•  Client execution [ 0s if done correctly ]
•  Optimization [HiveServer2] [~ 0.1s]
•  HCatalog lookups [HCatalog, Metastore] [ very fast in Hive 0.14 ]
•  Application Master creation [4-5s]
•  Container allocation [3-5s]
•  Query execution
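Adding up the fixed costs listed above shows how much time passes before any real work starts. This is a sketch that uses only the phases with a figure on the slide; treating Application Master creation and container allocation as fully avoidable through reuse is an assumption.

```python
# (low, high) seconds per phase, taken from the slide; phases without a
# figure on the slide are left out.
PHASES = {
    "optimization (HiveServer2)": (0.1, 0.1),
    "application master creation": (4.0, 5.0),
    "container allocation": (3.0, 5.0),
}

def overhead(skip=()):
    """Sum the (low, high) overhead of all phases not listed in skip."""
    kept = [v for k, v in PHASES.items() if k not in skip]
    return (sum(lo for lo, _ in kept), sum(hi for _, hi in kept))

cold = overhead()
warm = overhead(skip=("application master creation", "container allocation"))
print(f"cold start: {cold[0]:.1f}-{cold[1]:.1f}s, warm: {warm[0]:.1f}-{warm[1]:.1f}s")
```

Under these assumptions a cold query pays roughly 7 to 10 seconds of overhead, while a query that finds a running Application Master and idle containers pays about a tenth of a second.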
[Diagram: Client (running testing tool) with N connections to each of HiveServer2 Server #1 and Server #2; Metastore and Metastore DB; Tez AM and Tez containers on YARN and HDFS]
Query Execution – Cont.
Connection setup has high overhead
•  Open one connection and execute a large number of queries
•  Standard connection pooling
Distribute queries to 2 Hive servers
•  HiveServer2 becomes a bottleneck at roughly 8-15 queries/s
•  Deploy multiple Hive servers through Ambari
•  A fix that parallelizes query compilation is available in later versions
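One simple way to distribute queries over two HiveServer2 instances from the client side is round-robin selection of the endpoint for each new pooled connection. The class below is a hypothetical helper, not part of any Hive client library, and the host names are made up; the slides do not say how the load was actually balanced.

```python
import itertools
from collections import Counter

class RoundRobinRouter:
    """Hands out HiveServer2 endpoints in turn so load spreads evenly."""

    def __init__(self, endpoints):
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._cycle = itertools.cycle(endpoints)

    def next_endpoint(self):
        return next(self._cycle)

# Hypothetical host names; 10000 is HiveServer2's default Thrift port.
router = RoundRobinRouter(["hs2-a.example.com:10000", "hs2-b.example.com:10000"])
assignments = Counter(router.next_endpoint() for _ in range(100))
print(assignments)
```

Each pooled connection would be opened once against the endpoint it was assigned and then reused for many queries, which keeps the connection setup cost off the query path.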
Query Execution – Cont.
Re-use Tez sessions and pre-warm
•  Reinitializing a Tez session takes 5+ seconds
•  Turn on Tez session re-use
•  Tez sessions can be pre-initialized with pre-warm (with some drawbacks)
•  With pre-warm, queries reach full speed practically instantaneously
Query Execution – Cont.
Re-use Tez containers
•  Re-creating containers takes 3 seconds
•  Enable container reuse and keep idle containers for a short period of time
•  The key is to reach 100% utilization without wasting resources
Tuning for Parallel Execution
The most important point for many real-world scenarios
•  In most query tuning efforts this is ignored at first
•  Single queries often benefit from additional resources, but this can reduce overall throughput
Tez memory settings are key for parallelization
•  With optimized Tez memory settings, we achieved 90+% CPU utilization in the cluster
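The trade-off between per-query resources and throughput can be illustrated with a simple capacity model. This is a sketch under stated assumptions: the cluster memory, the container sizes and the figure of 10 containers per query are all hypothetical, and real Tez scheduling is more dynamic than a fixed division.

```python
def max_concurrent_queries(cluster_mb: int, container_mb: int,
                           containers_per_query: int) -> int:
    """Queries that fit at once if each holds containers_per_query containers."""
    total_containers = cluster_mb // container_mb
    return total_containers // containers_per_query

CLUSTER_MB = 2_000_000   # hypothetical: about 2 TB of YARN memory
for container_mb in (2048, 4096, 8192):
    q = max_concurrent_queries(CLUSTER_MB, container_mb, containers_per_query=10)
    print(f"{container_mb} MB containers -> up to {q} concurrent queries")
```

Doubling the container size may speed up a single query, but in this model it halves the number of queries that can run at the same time.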
[Chart: queries per second and cluster utilization (memory) versus query concurrency]
Hive vs. Impala Performance Benchmark
Hive performance
•  Most queries respond within 20s
•  Up to 70s for big result sets
Impala performance
•  Many queries took 30s to 90s
•  Queries with big result sets took more than 10 minutes
•  Notable performance degradation during parallel execution
Benchmark Blog
[Chart: number of queries by response time]
Sub-second SQL on Hadoop at Scale
Sub-second SQL Use Case
Online Reporting
•  Interactive online reporting
•  Query is relatively simple
•  Massive dataset
•  Low latency requirement
•  Sub-second response for most queries
•  Up to several seconds for big queries
•  Highly parallel
SELECT account, yyyymmdd,
sum(total_imps),
sum(total_click),
...
FROM table_x
WHERE yyyymmdd >= xxx
AND yyyymmdd < xxx
AND account = xxx
...
GROUP BY account, yyyymmdd, ...;
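The shape of this query can be tried out locally. The sketch below uses Python's built-in SQLite as a stand-in, not Hive or Phoenix; the column types and sample rows are assumptions, and only the columns visible on the slide are included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_x (account INTEGER, yyyymmdd INTEGER, "
    "total_imps INTEGER, total_click INTEGER)"
)
conn.executemany(
    "INSERT INTO table_x VALUES (?, ?, ?, ?)",
    [
        (1, 20151101, 100, 3),
        (1, 20151101, 50, 1),
        (1, 20151102, 70, 2),
        (2, 20151101, 999, 9),   # other account, filtered out
        (1, 20151201, 10, 0),    # outside the date range, filtered out
    ],
)
rows = conn.execute(
    "SELECT account, yyyymmdd, SUM(total_imps), SUM(total_click) "
    "FROM table_x "
    "WHERE yyyymmdd >= ? AND yyyymmdd < ? AND account = ? "
    "GROUP BY account, yyyymmdd ORDER BY yyyymmdd",
    (20151101, 20151201, 1),
).fetchall()
print(rows)
```

The query filters on one account and a date range and then aggregates per day, so a store that can seek directly to the matching key range only has to read a small slice of the table.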
Which One to Choose?
Apache Kylin
Hive LLAP
Hive Performance Recap
Hive is fast: interactive response
•  Tez execution engine (replacing MapReduce)
•  ORC columnar file format
•  Cost based optimizer (CBO)
•  Vectorized SQL engine
[Diagram: Hive 0.10 (batch processing) to Hive 0.14 (human interactive, 5 seconds): 100-150x query speedup]
Hive LLAP
LLAP process runs on multiple nodes, accelerating Tez tasks
LLAP = Live Long And Process
[Diagram: a Hive query served by LLAP processes running on each node, on top of HDFS]
LLAP Query Execution
•  The number of concurrent queries is throttled by HiveServer
•  HiveServer compiles and optimizes queries
•  Each query is coordinated independently by a Tez AM
•  Hive operators are used for processing
•  Tez runtime components are used for data transfer
•  Hive decides where query fragments run (LLAP, Container, AM)
[Diagram: clients, HiveServer with Query/AM controller, and a YARN cluster running AM1 and AM2 with their containers and llapd daemons]
Hive LLAP -- Key Benefits
Performance benefits
•  Reduced startup time
•  Columnar data cache
•  Long-lived process is easy to optimize
•  JIT, concurrent I/O, etc.
Tez vs. LLAP
•  LLAP is 7 times faster for very small queries
•  2 times faster for heavy queries
•  1.5 times faster for queries with large result sets
[Chart: Tez vs. LLAP latency for avg and max q1 to max q5 (Day20, Day200, DMA20, DMA200, Landing)]
LLAP Scalability
•  LLAP scales to 30 q/s on our cluster
•  Additional Hive servers are needed beyond 20 threads
•  Timeline server needs to be disabled
•  Query latency is impacted at around 48 threads and 25 q/s
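Thread count and throughput can be related to latency with Little's law. The sketch below assumes a closed-loop test in which each thread issues its next query as soon as the previous one returns, with no think time; the slides do not state how the load generator behaved.

```python
def mean_latency_s(threads: int, queries_per_sec: float) -> float:
    """Little's law for a closed loop: latency = concurrency / throughput."""
    if queries_per_sec <= 0:
        raise ValueError("queries_per_sec must be positive")
    return threads / queries_per_sec

print(f"{mean_latency_s(48, 25):.2f}s")
```

Under that assumption, 48 threads at 25 q/s implies a mean latency of just under two seconds per query, which fits the observation that latency starts to suffer at this load.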
[Chart: throughput (Q/S) for 5 threads, 10 threads 1 HS, 20 threads 1 HS, 20 threads 3 HS, 48 threads 3 HS, 72 threads 3 HS and 96 threads 3 HS]
[Chart: average and max q1 to max q5 latency (Day20, Day200, DMA20, DMA200, Landing) for the same configurations, annotated "query latency impacted"]
LLAP vs. Phoenix: Average Latency
•  All averages under 2s response time
•  Fastest average latency with Phoenix at 15 q/s
[Chart: average latency for avg q1 to avg q5 (Daily 20, Daily 200, DMA 20, DMA 200, Landing) and the overall average, for four series: LLAP 5 threads; LLAP 20 threads 3HS (15 q/s); Phoenix 256 4 threads (15 q/s); Phoenix 1024 10 threads (15 q/s)]
LLAP vs. Phoenix: Max Latency
•  Result size and scan size impact latency
•  Fastest is Phoenix with 1024 regions
•  Bottleneck on LLAP seems to be transferring results through HS to the client
•  Patch in progress: HIVE-12049
[Chart: max latency for max q1 to max q5 (Daily 20, Daily 200, DMA 20, DMA 200, Landing), for the same four series]
Phoenix Details
•  Phoenix scales to at least 40 q/s with 3 clients on our cluster
[Chart: throughput (Q/S) for 256 RS 4 threads, 256 RS 10 threads, 1024 RS 4 threads, 1024 RS 10 threads and 1024 RS 10 threads 3 clients; annotation: 40% higher throughput at 256 regions]
[Chart: average and max q1 to max q5 latency (Day20, Day200, DMA20, DMA200, Landing) for the same five configurations; annotation: 3x faster big scan query at 1024 regions]
Tuning Points for Phoenix
•  Skip Scan
•  Merge Sort Patch for Client
•  Splitting Table using Hash
•  HBase & Phoenix Configurations
Phoenix Skip Scan
•  Using Skip Scan improved query performance by roughly an order of magnitude
•  5-10 times faster
•  Allows Phoenix to skip unneeded sub-keys
•  E.g., skip from day to day
SELECT * FROM T
WHERE ((KEY1 >= 'a' AND KEY1 <= 'b') OR
(KEY1 > 'c' AND KEY1 <= 'e')) AND
KEY2 IN (1, 2)
Ref:
http://phoenix-hbase.blogspot.jp/2013/05/demystifying-skip-scan-in-phoenix.html
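The idea of skipping unneeded sub-keys can be shown on a sorted in-memory list. This is a toy illustration of seeking over sorted composite keys, not Phoenix's actual implementation; the function name and the sample data are made up, and the filter mirrors the WHERE clause of the example query.

```python
import bisect
import math

def skip_scan(rows, key1_ok, key2_values):
    """Return rows matching the filter, seeking instead of reading every row.

    rows is a sorted list of (key1, key2) tuples standing in for row keys.
    """
    wanted = sorted(key2_values)
    hits, i = [], 0
    while i < len(rows):
        key1 = rows[i][0]
        if key1_ok(key1):
            for value in wanted:
                # Seek straight to the first row for (key1, value).
                j = bisect.bisect_left(rows, (key1, value), i)
                while j < len(rows) and rows[j] == (key1, value):
                    hits.append(rows[j])
                    j += 1
        # Seek past every remaining row that shares this key1.
        i = bisect.bisect_right(rows, (key1, math.inf), i)
    return hits

rows = sorted((k1, k2) for k1 in "abcdef" for k2 in range(1, 6))
in_range = lambda k: "a" <= k <= "b" or "c" < k <= "e"
result = skip_scan(rows, in_range, {1, 2})
print(result)
```

The loop never walks through the rows with KEY2 values 3 to 5 or through the rows of excluded KEY1 values; it jumps over them with binary-search seeks, which is the effect the slide describes as skipping from day to day.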
Phoenix Merge Sort Patch
•  The Phoenix bottleneck for large result sets was the client-side merge
•  Patch: PHOENIX-2126
•  6 times faster for the query with the biggest result set, by fixing the slow merge sort
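For readers unfamiliar with a client-side merge, the sketch below shows a generic k-way merge of already-sorted per-region results. It illustrates the general technique only and is not the change made in PHOENIX-2126; the sample data is invented.

```python
import heapq

def merge_region_results(per_region_rows):
    """Lazily merge already-sorted per-region row lists into one sorted stream."""
    return heapq.merge(*per_region_rows)

regions = [[1, 4, 9], [2, 3, 10], [5, 6, 7, 8]]   # hypothetical sorted results
merged = list(merge_region_results(regions))
print(merged)
```

The merge cost grows with both the number of rows and the number of sorted inputs, so a large result set spread over many regions puts real work on the client.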
[Chart: max latency for max q1 to max q5 (Daily 20, Daily 200, DMA 20, DMA 200, Landing), Phoenix unpatched vs. Phoenix patched]
Phoenix Splitting Table
•  Salted tables
•  Automatically salted on row key
•  Manual split point definition
•  Minimize result set size (client-side merge)
•  Increase parallelization (server-side aggregation)
•  Create a custom salt based on the GROUP BY keys
•  Divide the table into N regions in order to maximize CPU usage
CREATE TABLE my_table (
salt CHAR(2) NOT NULL,
user_id INTEGER NOT NULL,
ymd INTEGER NOT NULL,
clicks BIGINT,
CONSTRAINT pk PRIMARY KEY (salt, user_id, ymd))
SPLIT ON ('01', '02', … , 'ff');
CREATE TABLE my_table (a_key VARCHAR PRIMARY KEY, a_col VARCHAR) SALT_BUCKETS = 20;
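A custom salt derived from the grouping key could be computed on the client before each write. The sketch below is one possible approach, assuming user_id is the GROUP BY key and using CRC32 as the hash; the slides do not say which hash function or key was actually used.

```python
import zlib

NUM_REGIONS = 256  # '00'..'ff' prefixes, matching the two-character salt column

def salt_for(user_id: int) -> str:
    """Two-hex-digit salt derived from the GROUP BY key (user_id here)."""
    return format(zlib.crc32(str(user_id).encode()) % NUM_REGIONS, "02x")

# Split points '01'..'ff': 255 boundaries produce 256 regions.
split_points = [format(i, "02x") for i in range(1, NUM_REGIONS)]
print(len(split_points), split_points[0], split_points[-1], salt_for(42))
```

Because every row of a given user gets the same salt, all of that user's rows land in one region, so a GROUP BY on that key can be aggregated on the server side and the client has less to merge.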
Phoenix Configurations
•  Increase cache usage
•  -XX:MaxDirectMemorySize = 30720MB
•  hbase.bucketcache.size = 20480MB
•  Remove bottlenecks
•  hbase.regionserver.handler.count = 240
•  phoenix.query.queueSize = 100000
•  phoenix.query.threadPoolSize = 2048
•  Prevent auto-split
•  hbase.hregion.max.filesize = 100GB
Sub-second SQL Use Case Summary
•  Hive LLAP/Tez
•  Tez is proven and scalable in heavy batch tasks
•  LLAP reduces Hive latency significantly
•  LLAP is under active development
•  Phoenix [ winner for this particular use case ]
•  Sub-second queries possible today
•  Simple SQL plays to Phoenix strengths
•  PHOENIX-2126 fixes client bottlenecks for large queries
Summary
SQL on Hadoop at Scale
Hive is the de facto standard for SQL on Hadoop
•  Hive on Tez for batch and interactive SQL
•  Best solution for all general big data SQL use cases: ETL, reporting, BI, analytics, etc.
•  LLAP makes Hive even faster
•  Hive on Tez and LLAP have proven their performance at scale
Other SQL on Hadoop Options
•  Phoenix, Spark SQL, Kylin
•  Great tools for particular use cases
Tweet: #hadooproadshow
Thank You

Más contenido relacionado

La actualidad más candente

An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseDataWorks Summit
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloudgluent.
 
Performance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkPerformance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkDataWorks Summit
 
Running a container cloud on YARN
Running a container cloud on YARNRunning a container cloud on YARN
Running a container cloud on YARNDataWorks Summit
 
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...DataWorks Summit
 
Hadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise HadoopHadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise HadoopYifeng Jiang
 
Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014alanfgates
 
Network for the Large-scale Hadoop cluster at Yahoo! JAPAN
Network for the Large-scale Hadoop cluster at Yahoo! JAPANNetwork for the Large-scale Hadoop cluster at Yahoo! JAPAN
Network for the Large-scale Hadoop cluster at Yahoo! JAPANDataWorks Summit/Hadoop Summit
 
How the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownHow the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownDataWorks Summit
 
High throughput data replication over RAFT
High throughput data replication over RAFTHigh throughput data replication over RAFT
High throughput data replication over RAFTDataWorks Summit
 
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016alanfgates
 
Apache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and FutureApache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and FutureDataWorks Summit
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017alanfgates
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)Chris Nauroth
 
Ozone- Object store for Apache Hadoop
Ozone- Object store for Apache HadoopOzone- Object store for Apache Hadoop
Ozone- Object store for Apache HadoopHortonworks
 

La actualidad más candente (20)

An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
 
Performance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkPerformance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache Spark
 
Running a container cloud on YARN
Running a container cloud on YARNRunning a container cloud on YARN
Running a container cloud on YARN
 
A Multi Colored YARN
A Multi Colored YARNA Multi Colored YARN
A Multi Colored YARN
 
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...
 
Hadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise HadoopHadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise Hadoop
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014
 
Network for the Large-scale Hadoop cluster at Yahoo! JAPAN
Network for the Large-scale Hadoop cluster at Yahoo! JAPANNetwork for the Large-scale Hadoop cluster at Yahoo! JAPAN
Network for the Large-scale Hadoop cluster at Yahoo! JAPAN
 
How the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownHow the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside Down
 
High throughput data replication over RAFT
High throughput data replication over RAFTHigh throughput data replication over RAFT
High throughput data replication over RAFT
 
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
 
Apache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and FutureApache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and Future
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
 
Ozone- Object store for Apache Hadoop
Ozone- Object store for Apache HadoopOzone- Object store for Apache Hadoop
Ozone- Object store for Apache Hadoop
 
Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem
 

Destacado

Big Data: Working with Big SQL data from Spark
Big Data:  Working with Big SQL data from Spark Big Data:  Working with Big SQL data from Spark
Big Data: Working with Big SQL data from Spark Cynthia Saracco
 
The Fifth Elephant 2016: Self-Serve Performance Tuning for Hadoop and Spark
The Fifth Elephant 2016: Self-Serve Performance Tuning for Hadoop and SparkThe Fifth Elephant 2016: Self-Serve Performance Tuning for Hadoop and Spark
The Fifth Elephant 2016: Self-Serve Performance Tuning for Hadoop and SparkAkshay Rai
 
NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill Carol McDonald
 
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseApache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseDataWorks Summit/Hadoop Summit
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Spark Summit
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Spark Summit
 
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-MallaKerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-MallaSpark Summit
 
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...Spark Summit
 
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...Spark Summit
 
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Spark Summit
 
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...Spark Summit
 
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark Summit
 
Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...
Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...
Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...Spark Summit
 
Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...
Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...
Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...Spark Summit
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Cloudera, Inc.
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Casesnzhang
 

Destacado (20)

LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Cost-based Query Optimization
Cost-based Query Optimization Cost-based Query Optimization
Cost-based Query Optimization
 
Big Data: Working with Big SQL data from Spark
Big Data:  Working with Big SQL data from Spark Big Data:  Working with Big SQL data from Spark
Big Data: Working with Big SQL data from Spark
 
The Fifth Elephant 2016: Self-Serve Performance Tuning for Hadoop and Spark
The Fifth Elephant 2016: Self-Serve Performance Tuning for Hadoop and SparkThe Fifth Elephant 2016: Self-Serve Performance Tuning for Hadoop and Spark
The Fifth Elephant 2016: Self-Serve Performance Tuning for Hadoop and Spark
 
NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill
 
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseApache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
 
Apache Zeppelin, Helium and Beyond
Apache Zeppelin, Helium and BeyondApache Zeppelin, Helium and Beyond
Apache Zeppelin, Helium and Beyond
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
 
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-MallaKerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
 
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...
 
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...
 
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
 
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
 
Apache Phoenix + Apache HBase
Apache Phoenix + Apache HBaseApache Phoenix + Apache HBase
Apache Phoenix + Apache HBase
 
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
 
Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...
Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...
Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...
 
Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...
Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...
Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 

Sub-second-sql-on-hadoop-at-scale

  • 1. Sub-second SQL on Hadoop at Scale Yifeng Jiang Solutions Engineer, Hortonworks 2015/11/23 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
  • 2. About Me 蒋 燚峰 (Yifeng Jiang) •  Solutions Engineer, Hortonworks •  Apache HBase book author •  I like hiking •  Twitter: @uprush
  • 3. SQL on Hadoop Solutions •  Hive on Tez •  The de facto standard of SQL on Hadoop for interactive SQL •  One tool, all big data SQL use cases: ETL, reporting, BI, analytics, etc. •  Hive LLAP •  To make Hive even faster
  • 4. SQL on Hadoop Solutions – Cont. •  Phoenix •  High performance relational database layer over HBase for low latency applications •  Spark SQL •  Spark's module for working with structured data •  Kylin •  Extreme OLAP engine for big data open sourced by eBay •  Not supported by Hortonworks
  • 5. Agenda Two Real-life SQL on Hadoop Use Cases •  Use Case #1: Highly Parallel Workload Over Massive Data •  Use Case #2: Sub-second SQL on Hadoop at Scale
  • 6. Highly Parallel Workload Over Massive Data
  • 7. Use Case #1 Batch Reporting Massive Dataset •  13 months, 450B+ rows of data •  Adding 1.3B rows of data per day Highly Parallel Workload •  100K reports per day •  15K reports per hour Input Dataset
  • 8. Key Hive Optimization Hive on Tez selected Four Hive on Tez optimization points •  Partitioning •  Data loading •  Query execution •  Parallel tuning
  • 9. Partitioning Maximize the number of partitions •  Basic and most important for performance •  Only read relevant data Keep the total number under a couple thousand partitions •  Hive seems to handle this many partitions very well for queries CREATE TABLE access_logs ( host string, path string, referrer string, … ) PARTITIONED BY ( site int, ymd date )
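As a rough illustration of why partitioning matters, the sketch below queries the access_logs table from the slide with filters on both partition columns, so Hive only needs to read the matching partitions. The site id, the date range, and the hits alias are made-up values for illustration, not taken from the talk.

```sql
-- Filters on the partition columns (site, ymd) let Hive prune partitions
-- and read only the matching directories. Values are illustrative.
SELECT path, count(*) AS hits
FROM access_logs
WHERE site = 42
  AND ymd >= DATE '2015-11-01'
  AND ymd <  DATE '2015-11-08'
GROUP BY path;

-- List the partitions that exist for one site, to keep an eye on the total count.
SHOW PARTITIONS access_logs PARTITION (site = 42);
```

Running EXPLAIN on the SELECT is one way to check that only the expected partitions are scanned.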
  • 10. Data Loading Load data into Hive table stored as ORC Three main ORC parameters •  File system block size: 256MB •  Stripe size: 64MB •  Compression: ZLIB •  ZLIB is highly optimized in newer Hive versions
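The three ORC parameters above could be applied roughly as follows. This is a sketch only: the table name orc_access_logs and its shortened column list are hypothetical, and the block size is set through the Hive default property before the data is written, which is one of several places it can be configured.

```sql
-- 256 MB file system block size for ORC files written in this session.
SET hive.exec.orc.default.block.size=268435456;

-- Hypothetical target table; the column list is shortened for illustration.
CREATE TABLE orc_access_logs (
  host     string,
  path     string,
  referrer string
)
PARTITIONED BY (site int, ymd date)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress'    = 'ZLIB',
  'orc.stripe.size' = '67108864'
);
```

Here 67108864 bytes is the 64MB stripe size, and 268435456 bytes is the 256MB block size from the slide.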
  • 11. Data Loading – Cont. Make sure ORC files are big enough •  Between 1 and 10 HDFS blocks if possible •  Avoid having lots of reducers that write to all partitions •  Enable sorted dynamic partition optimization •  Or use a DISTRIBUTE BY clause •  We chose DISTRIBUTE BY for fine-grained control INSERT INTO orc_sales PARTITION ( country ) SELECT * FROM daily_sales DISTRIBUTE BY country, gender;
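For comparison with the DISTRIBUTE BY approach the speakers chose, the alternative they mention (sorted dynamic partitioning) might look like the sketch below. It reuses the orc_sales and daily_sales names from the slide and assumes that country is the last column of daily_sales, since a dynamic partition insert takes the partition value from the trailing column of the SELECT.

```sql
-- Allow all partition columns to be resolved dynamically.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Sort rows by partition key so each reducer keeps only one ORC writer open at a time.
SET hive.optimize.sort.dynamic.partition=true;

-- Assumes country is the last column of daily_sales.
INSERT INTO TABLE orc_sales PARTITION (country)
SELECT * FROM daily_sales;
```

The trade-off is control: the setting is simpler to apply, while DISTRIBUTE BY lets you decide exactly which keys determine how rows are spread across reducers.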
  • 12. Query Execution Total query time is essentially the sum of •  Client execution [ 0s if done correctly ] •  Optimization [HiveServer2] [~ 0.1s] •  HCatalog lookups [HCatalog, Metastore] [ very fast in Hive 0.14 ] •  Application Master creation [4-5s] •  Container allocation [3-5s] •  Query execution [Diagram: a client running the testing tool opens N connections to each of HiveServer2 Server #1 and Server #2, backed by the Metastore and Metastore DB, with a Tez AM and Tez containers on YARN and HDFS]
  • 13. Query Execution – Cont. Connection setup has high overhead •  Open one connection and execute a large number of queries over it •  Standard connection pooling Distribute queries to 2 Hive Servers •  HiveServer2 becomes a bottleneck at roughly 8-15 queries/s •  Deploy multiple Hive Servers through Ambari •  A fix that parallelizes query compilation is available in later versions
  • 14. Query Execution – Cont. Re-use Tez Session, Pre-warming •  Reinitializing Tez session takes 5+ seconds •  Turn on Tez session re-use •  Tez sessions can be pre-initialized with pre-warm (with some drawbacks). •  With pre-warm, full speed is practically instantaneous
  • 15. Query Execution – Cont. Re-use Tez containers •  Re-creating containers takes 3 seconds •  Enable container reuse and keep containers alive for a short period of time •  Key is to reach 100% utilization without wasting resources
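The session and container reuse described in the last two slides is controlled by Hive and Tez configuration properties. The talk does not list the values it used, so the numbers below (container count, timeouts, sessions per queue) are placeholders; treat this as a sketch of where the knobs are rather than a recommended configuration.

```sql
-- Run queries on Tez.
SET hive.execution.engine=tez;

-- Pre-warm containers when a Tez session starts; the count is illustrative.
SET hive.prewarm.enabled=true;
SET hive.prewarm.numcontainers=10;

-- Reuse containers across tasks and hold idle ones briefly before releasing them.
SET tez.am.container.reuse.enabled=true;
SET tez.am.container.idle.release-timeout-min.millis=10000;
SET tez.am.container.idle.release-timeout-max.millis=20000;

-- HiveServer2-side session pooling is configured in hive-site.xml at startup,
-- not per session (values are placeholders):
--   hive.server2.tez.initialize.default.sessions = true
--   hive.server2.tez.default.queues              = default
--   hive.server2.tez.sessions.per.default.queue  = 4
```

Longer idle timeouts keep latency low for bursty workloads but hold cluster memory that other jobs could use, which is the utilization trade-off the slide points at.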
  • 16. Tuning for Parallel Execution The most important point for many real-world scenarios •  In most query tuning scenarios this is at first ignored •  Oftentimes single queries benefit from additional resources but this can reduce throughput Tez memory settings are key for parallelization •  With optimized Tez memory settings, we achieved 90+% CPU utilization in cluster [Chart: queries per second and cluster utilization (memory) versus query concurrency]
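The slide does not say which Tez memory settings were tuned or to what values. A minimal sketch of the properties usually involved is shown below, with placeholder sizes; smaller containers let more tasks run concurrently, at the cost of less memory per task.

```sql
-- Memory per Tez task container, in MB (placeholder value).
SET hive.tez.container.size=4096;

-- JVM heap for tasks, kept below the container size to leave headroom.
SET hive.tez.java.opts=-Xmx3276m;

-- Memory for the Tez application master, in MB (placeholder value).
SET tez.am.resource.memory.mb=4096;
```

A reasonable way to tune these is to fix a target concurrency, then lower the container size until queries start to spill or fail, and back off from there.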
  • 17. Hive vs. Impala Performance Benchmark Hive performance •  Most queries respond within 20s •  Up to 70s for big result sets Impala performance •  Many queries took 30s to 90s •  Queries with big result sets took more than 10 minutes •  Notable performance degradation during parallel execution Benchmark Blog [Chart: number of queries by response time]
  • 18. Sub-second SQL on Hadoop at Scale
  • 19. Sub-second SQL Use Case Online Reporting •  Interactive online reporting •  Query is relatively simple •  Massive dataset •  Low latency requirement •  Sub-second response for most queries •  Up to several seconds for big queries •  Highly parallel SELECT account, yyyymmdd, sum(total_imps), sum(total_click), ... FROM table_x WHERE yyyymmdd >= xxx AND yyyymmdd < xxx AND account = xxx ... GROUP BY account, yyyymmdd, ...;
  • 20. Which One to Choose? Apache Kylin, Hive LLAP
  • 21. Hive Performance Recap Hive is fast: interactive response •  Tez execution engine (replacing MapReduce) •  ORC columnar file format •  Cost based optimizer (CBO) •  Vectorized SQL engine •  Hive 0.10 (batch processing) to Hive 0.14 (human interactive, 5 seconds): 100-150x query speedup
  • 22. Hive LLAP LLAP process runs on multiple nodes, accelerating Tez tasks. LLAP = Live Long And Process [Diagram: a Hive query served by LLAP processes running on multiple nodes on top of HDFS]
  • 23. LLAP Query Execution •  Number of concurrent queries throttled by HiveServer •  HiveServer compiles and optimizes queries •  Each query coordinated independently by a Tez AM •  Hive operators used for processing •  Tez runtime components used for data transfer •  Hive decides where query fragments run (LLAP, Container, AM) [Diagram: clients, the HiveServer query/AM controller, and a YARN cluster running Tez AMs, llapd daemons, and containers]
  • 24. Hive LLAP -- Key Benefits Performance benefits •  Reduce starting time •  Columnar data cache •  Long-lived process is easy to optimize •  JIT, concurrent I/O, etc.
  • 25. Tez vs. LLAP •  LLAP is 7 times faster for very small queries •  2 times faster for heavy queries •  1.5 times faster for large result sets [Chart: Tez vs. LLAP latency, average and per-query max for Day20, Day200, DMA20, DMA200, and Landing]
  • 26. LLAP Scalability •  LLAP scales to 30 q/s on our cluster •  An additional HiveServer is needed beyond 20 threads •  Timeline server needs to be disabled •  Query latency is impacted at around 48 threads and 25 q/s [Charts: throughput (Q/S) and latency (average and per-query max for Day20, Day200, DMA20, DMA200, Landing) at 5, 10, 20, 48, 72, and 96 threads with 1 or 3 HiveServers]
  • 27. LLAP vs. Phoenix: Average Latency •  All averages under 2s response time •  Fastest average latency with Phoenix at 15 q/s [Chart: average latency for Daily 20, Daily 200, DMA 20, DMA 200, Landing, and All; series: LLAP 5 threads, LLAP 20 threads 3HS (15 q/s), Phoenix 256 4 threads (15 q/s), Phoenix 1024 10 threads (15 q/s)]
  • 28. LLAP vs. Phoenix: Max Latency •  Result size and scan size impact latency •  Fastest is Phoenix with 1024 regions •  Bottleneck on LLAP seems to be transferring results through HiveServer to the client •  Patch in progress: HIVE-12049 [Chart: max latency for Daily 20, Daily 200, DMA 20, DMA 200, and Landing; series: LLAP 5 threads, LLAP 20 threads 3HS (15 q/s), Phoenix 256 4 threads (15 q/s), Phoenix 1024 10 threads (15 q/s)]
  • 29. Phoenix Details •  Phoenix scales at least up to 40 q/s with 3 clients on our cluster •  40% higher throughput at 256 regions •  3x faster big scan query at 1024 regions [Charts: throughput (Q/S) and latency (average and per-query max for Day20, Day200, DMA20, DMA200, Landing) for 256 RS 4 threads, 256 RS 10 threads, 1024 RS 4 threads, 1024 RS 10 threads, and 1024 RS 10 threads with 3 clients]
  • 30. Tuning Points for Phoenix •  Skip Scan •  Merge Sort Patch for Client •  Splitting Table using Hash •  HBase & Phoenix Configurations
  • 31. Phoenix Skip Scan •  Using Skip Scan improved query performance by up to an order of magnitude (5-10 times faster) •  Allows Phoenix to skip unneeded sub-keys •  E.g., skip from day to day SELECT * FROM T WHERE ((KEY1 >= 'a' AND KEY1 <= 'b') OR (KEY1 > 'c' AND KEY1 <= 'e')) AND KEY2 IN (1, 2) Ref: http://phoenix-hbase.blogspot.jp/2013/05/demystifying-skip-scan-in-phoenix.html
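To make the skip scan idea concrete for the reporting workload in this talk, here is a small sketch. The daily_stats table, its columns, and the literal values are hypothetical; only the SKIP_SCAN hint and the EXPLAIN statement are standard Phoenix features. Phoenix often chooses a skip scan on its own when the leading row key columns are constrained by IN lists or ranges, so the hint is shown mainly to make the intent explicit.

```sql
-- Hypothetical table with a composite row key (account, ymd).
CREATE TABLE IF NOT EXISTS daily_stats (
  account    INTEGER NOT NULL,
  ymd        INTEGER NOT NULL,
  total_imps BIGINT,
  CONSTRAINT pk PRIMARY KEY (account, ymd)
);

-- Constraints on the leading key columns let Phoenix jump between key ranges
-- instead of scanning every row in between.
SELECT /*+ SKIP_SCAN */ account, ymd, SUM(total_imps)
FROM daily_stats
WHERE account IN (1001, 1002)
  AND ymd >= 20151101 AND ymd < 20151108
GROUP BY account, ymd;

-- Inspect the plan and look for a skip scan step.
EXPLAIN
SELECT account, ymd, SUM(total_imps)
FROM daily_stats
WHERE account IN (1001, 1002)
  AND ymd >= 20151101 AND ymd < 20151108
GROUP BY account, ymd;
```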
  • 32. Phoenix Merge Sort Patch •  Phoenix bottleneck for large result sets was client side merge •  Patch: PHOENIX-2126 •  6 times faster for biggest result query by fixing slow merge-sort [Chart: max latency, Phoenix unpatched vs. patched, for Landing, Daily 20, Daily 200, DMA 20, and DMA 200]
  • 33. Phoenix Splitting Table
      •  Salted Tables
          •  Automatically salted on row key
      CREATE TABLE my_table (a_key VARCHAR PRIMARY KEY, a_col VARCHAR)
      SALT_BUCKETS = 20;
      •  Manual split point definition
          •  Minimize result set size (client side merge)
          •  Increase parallelization (server side aggregation)
          •  Create original salt based on group by keys
          •  Divide table into N regions in order to maximize CPU usage
      CREATE TABLE my_table (
          salt CHAR(2) NOT NULL,
          user_id INTEGER NOT NULL,
          ymd INTEGER NOT NULL,
          clicks BIGINT,
          CONSTRAINT pk PRIMARY KEY (salt, user_id, ymd))
      SPLIT ON ('01', '02', … , 'ff');
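The manual approach, a salt derived from the group-by key plus explicit split points, could look roughly like the Python sketch below. The slide does not say how the salt was computed, so the CRC32 hash, the two-digit hex encoding, the bucket count of 256, and the function names are all assumptions made for illustration.

```python
import zlib


def salt_for(user_id, buckets=256):
    """Derive a two-character hex salt from the group-by key (assumed: user_id).

    Every row of one user gets the same salt, so a GROUP BY user_id can be
    fully aggregated inside a single region on the server side.
    """
    digest = zlib.crc32(str(user_id).encode("utf-8"))
    return format(digest % buckets, "02x")


def split_points(buckets=256):
    """Split points that place each salt value in its own region."""
    return [format(i, "02x") for i in range(1, buckets)]


def split_on_clause(buckets=256):
    """Render the split points as a SPLIT ON clause for a CREATE TABLE."""
    quoted = ", ".join("'{}'".format(p) for p in split_points(buckets))
    return "SPLIT ON ({})".format(quoted)


# Hypothetical rows: (user_id, ymd, clicks), turned into salted row keys.
rows = [(42, 20151101, 3), (42, 20151102, 5), (7, 20151101, 1)]
salted = [(salt_for(user_id), user_id, ymd, clicks) for user_id, ymd, clicks in rows]
print(salted)
print(split_on_clause(4))
```

The trade-off against `SALT_BUCKETS` is that the writer must compute the salt itself, but rows sharing a group-by key stay together, which keeps the per-region partial results small before the client merges them.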
  • 34. Phoenix Configurations
      •  Increase cache use
          •  -XX:MaxDirectMemorySize = 30720MB
          •  hbase.bucketcache.size = 20480MB
      •  Remove bottlenecks
          •  hbase.regionserver.handler.count = 240
          •  phoenix.query.queueSize = 100000
          •  phoenix.query.threadPoolSize = 2048
      •  Prevent AutoSplit
          •  hbase.hregion.max.filesize = 100GB
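As a rough illustration of how the property values above might be written out, the Python sketch below renders them as `hbase-site.xml` property elements. The unit conversions are assumptions: it treats `hbase.bucketcache.size` as megabytes and `hbase.hregion.max.filesize` as bytes. The JVM flag `-XX:MaxDirectMemorySize` is not an `hbase-site.xml` property, so it is left out; the helper name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Values from the slide. Units are assumptions: bucket cache size in MB,
# max region file size in bytes (100 GB).
SETTINGS = {
    "hbase.bucketcache.size": 20480,
    "hbase.regionserver.handler.count": 240,
    "phoenix.query.queueSize": 100000,
    "phoenix.query.threadPoolSize": 2048,
    "hbase.hregion.max.filesize": 100 * 1024 ** 3,
}


def render_hbase_site(settings):
    """Render a dict of settings as <configuration><property> elements."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


xml_text = render_hbase_site(SETTINGS)
print(xml_text)
```

Which file each property belongs in (server side or client side) depends on the deployment and is not stated on the slide.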
  • 35. Sub-second SQL Use Case Summary
      •  Hive LLAP/Tez
          •  Tez is proven and scalable in heavy batch tasks
          •  LLAP reduces Hive latency significantly
          •  LLAP is under active development
      •  Phoenix [winner for this particular use case]
          •  Sub-second queries possible today
          •  Simple SQL plays to Phoenix strengths
          •  PHOENIX-2126 fixes client bottlenecks for large queries
  • 36. Summary
  • 37. SQL on Hadoop at Scale
      Hive is the de facto standard of SQL on Hadoop
      •  Hive on Tez for batch and interactive SQL
      •  Best solution for all general big data SQL use cases: ETL, reporting, BI, analytics, etc.
      •  LLAP to make Hive even faster
      •  Hive on Tez and LLAP have proved their performance at scale
      Other SQL on Hadoop Options
      •  Phoenix, Spark SQL, Kylin
      •  Great tools for particular use cases
  • 38. Thank You
      Tweet: #hadooproadshow