SlideShare una empresa de Scribd logo
1 de 46
Page1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
LLAP: Sub-Second Analytical Queries in Hive
Sergey Shelukhin, Siddharth Seth
Page2 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
LLAP: long-lived execution in Hive
What is LLAP?+
+ LLAP in Hive; performance primer+
+ LLAP as a unified data access layer+
+ Deploying and managing LLAP+
+ Future work+
+
+ How does LLAP make queries faster?+
Page3 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
What is LLAP?
Page4 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
What is LLAP?
• Hybrid model combining daemons and containers for
fast, concurrent execution of analytical workloads
(e.g. Hive SQL queries)
• Concurrent queries without specialized YARN queue setup
• Multi-threaded execution of vectorized operator pipelines
• Asynchronous IO and efficient in-memory caching
• Relational view of the data available thru the API
• High performance scans, execution code pushdown
• Centralized data security
Node
LLAP Process
Cache
Query Fragment
HDFS
Query Fragment
Page5 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
What LLAP isn't
• Not another "execution engine" (like Tez, MR, Spark…)
• Tez, Spark (work in progress) can work on top of LLAP
– Engines provide coordination and scheduling
– some work (e.g. large shuffles) can be scheduled in containers
• Not a storage substrate
• Daemons are stateless and read (and cache) data from HDFS/S3/Azure/..
Page6 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
LLAP
Queue
Technical overview – execution
• LLAP daemon has a number of executors (think
containers) that execute work "fragments"
• Fragments are parts of one, or multiple parallel
workloads (e.g. Hive SQL queries)
• Work queue with pluggable priority
• Geared towards low latency queries over long-
running queries (by default)
• I/O is similar to containers – read/write to HDFS,
shuffle, other storages and formats
• Streaming output for data API
Executor
Q1 Map 1
Executor
External read
Executor
Q3 Reducer 3
Q1 Map 1
Q1 Map 1
Q3 Map 19
HDFS
Waiting for
shuffle inputs
HBase
Container
(shuffle input)
Spark
executor
Page7 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Executor
Technical overview – IO layer
• Optional when executing inside LLAP
• Wraps existing InputFormat
• Currently only ORC is supported
• Asynchronous IO for Hive
• Transparent, compressed in-memory cache
• Format-specific, extensible
• SSD cache is work in progress
RecordReade
r
Fragment
Cache
IO thread
Plan & decode
Read, decompress
Metadata cache
Actual data (HDFS, S3, …)
What to read Data buffers
Splits Vectorized data
Page8 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
LLAP in Hive; performance primer
Page9 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Overview - Hive + LLAP
• Transparent to Hive users, BI tools, etc.
• Except the queries become faster :)
• Number of concurrent queries throttled
by Hive Server
• Hive decides where query fragments run
(LLAP, Container, AM) based on
configuration, data size, format, etc.
• Each Query coordinated independently by
a Tez AM
• Hive Operators used for processing
• Tez Runtime used for data transfer
HiveServer2
Query/AM
Controller
Client(s) YARN Cluster
AM1
llapd
llapd
Container AM1
Container AM1
llapd
Container AM2
AM2
AM3
llapd
Page10 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
How does LLAP speed up the queries?
• Eliminates container startup costs
• JIT optimizer has a chance to work
• Especially important for vectorized processing
• Asynchronous IO
• Caching
• Data sharing (hash join tables, etc.)
• Optimizations for parallel queries
Page11 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Industry benchmark – 10Tb scale
0
5
10
15
20
25
30
35
40
45
50
query3 query12 query20 query21 query26 query27 query42 query52 query55 query73 query89 query91 query98
Time(seconds)
LLAP vs Hive 1.x 10TB Scale
Hive 1.x LLAP90% faster
Page12 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Evaluation for a real use case
0
20000
1 3 5 7 9 11 13 15 17 19 21 23 25 27
• Aggregate daily statistics for a time interval:
SELECT yyyymmdd,
sum(total_1),
sum(total_2),
...
from table
where yyyymmdd >= xxx
and yyyymmdd < xxx
and userid = xxx
group by userid, yyyymmdd;
Max
Max
Max
Avg
Avg Avg
0
1
2
3
4
5
6
7
8
D W M Y D W M Y D W M Y
Tez Phoenix LLAP
Executiontime,s
Page13 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Evaluation for a real use case
ExecutionTime,s.
• Display a large report
• Phoenix runtimes not displayed (30-250s
on large ranges)
Execution Time in seconds over time range
Max
Max
Avg
Avg
0
5
10
15
20
D W M Y D W M Y
Tez LLAP
Executiontime,s
Page14 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
How does LLAP make queries faster?
Page15 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Persistent daemons (recap)
• Reduced startup time
• Better use of JIT optimizer
• Hash join hash tables, fragment plans are shared
• Multiple tasks do not all generate the table or deserialize the plans
Page16 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Performance – first query
• Cold LLAP is nearly as fast as
shared pre-warmed containers
(impractical on real clusters)
• Realistic (long-running) LLAP
~3x faster than realistic (no
prewarm) Tez
No
prewarm,
20.94
Shared
prewarm
11.96
Cold LLAP
13.23
Realistic
LLAP
7.49
0
5
10
15
20
25
Firstqueryruntime,s
Page17 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Performance – repeated query
• Cache disabled!
13.23
7.93 7.7
6.29
5.44 5.3 5.01 4.89 4.83
7.49
6.1
4.61 4.59
4.93
4.62 4.63 4.45 4.25
0
2
4
6
8
10
12
14
0 1 2 3 4 5 6 7 8
Queryruntime,s
Query run
Cold LLAP
LLAP after an
unrelated workload
Page18 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Parallel queries – priorities, preemption
• Lower-priority fragments can be preempted
• For example, a fragment can start running before its inputs are
ready, for better pipelining; such fragments may be preempted
• LLAP work queue examines the DAG parameters to give
preference to interactive (BI) queries
LLAP
QueueExecutor
Executor
Interactive
query map 1/3
…
Interactive
query map 3/3
Executor
Interactive
query map 2/3
Wide query
reduce waiting
Time
ClusterUtilization
Long-Running Query
Short-Running Query
Page19 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Parallel query execution – LLAP vs Hive 1.2
3131
2416
1741
1420
1508
531
291
147
94 75
100%
91% 90%
71%
44%
45%
13%
0.00%
20.00%
40.00%
60.00%
80.00%
100.00%
0
500
1000
1500
2000
2500
3000
3500
1 2 4 8 16
Scaling(comparedtolinear)
Totalruntimeforallqueries,s.
Number of concurrent users
Hive 1.2 linear scaling
Hive 1.2 total runtime
LLAP linear scaling
LLAP total runtime
Hive+LLAP efficiency
Hive 1.2 efficiency
Page20 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Parallel query execution – 10Tb scale
531
291
147
94
75
100%
91% 90%
71%
44%
0.00%
20.00%
40.00%
60.00%
80.00%
100.00%
0
100
200
300
400
500
600
1 2 4 8 16
Scaling(comparedtolinear)
Totalruntimeforallqueries,s.
Number of concurrent users
LLAP linear scaling
LLAP total runtime
Hive+LLAP efficiency
Page21 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
IO elevator and caching
• Reading, decoding and processing are parallel
• Unlike in regular Hive, where processing can wait for e.g. an S3 read
• Logically-compresses (e.g. ORC-encoded) data is cached off-heap
• Metadata is cached (currently on-heap) with high priority
• Replacement policy is pluggable (LRFU is the default)
• Tez schedules work to preferred location to improve locality
• Tradeoff between HDFS reads and operator speed
• Cache takes memory away from operators (sort buffers, hash join tables, etc.)
• Depends on how much memory you have, workflow, dataset size, etc.
• Ex. 4Gb per executor x 16 CPUs – 64Gb for executors, rest for cache
Page22 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
I/OExecution
In-memory processing – present and future
Decoder
Distributed FS
Fragment
Hive
operator
Hive
operator
Vectorized
processin
g
col1 col2
Low-level pushdown
(e.g. filter)*
SSD cache*
Off-heap cache Compression
codec
Native data
vectors
- work in
progress
*
Compact encoded data
Compressed data
col1
col2
Page23 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Fragment
I/OExecution
In-memory processing – future?
Distributed FSHive
operator
Hive
operator
col1 col2
SSD cache*
Off-heap cache Compression
codec
- work in
progress
*
Compact encoded data
Compressed data
Processing on
encoded
data, codegen
Zero copy
- next
steps?
*
Low-level operators
(e.g. filter) *
Page24 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Performance – cache on slow FS
• Industry-standard cloud FS,
much slower than on-prem
HDFS
• Large scan on 1Tb scale
• Local HD/SDD cache (WIP)
• Massive improvement on slow
storage with little memory cost
0
50
100
150
200
250
300
350
400
Cold cache Hot cache
Runtime,s
Entire query
Scan portion
Page25 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Performance – cache on HDFS, 1Tb scale
20.2
18.3
17.6
16.2 16.2
16.8
14.5
11.3
0.00%
20.00%
40.00%
60.00%
80.00%
100.00%
0
5
10
15
20
25
24 26 28 30 32 34 36 38 40 42
Cachehitrate
Queryruntime,s
Cache size, Gb
Query runtime, after
unrelated workload
Query runtime, after
related workload
Metadata hit rate
Data hit rate
Page26 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Perf overview – a demo in a gif
Page27 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
LLAP as the unified data access layer
Page28 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Overview
• LLAP can provide a "relational datanode" view of the data
• Security via endpoint ACLs, or fragment signing (for external engines)
• Granular, e.g. column-level, security is possible
• Anyone (with access) can push the (approved) code in, from complex query
fragments to simple data reads
• SparkSQL/LLAP integration is based on SparkSQL data source APIs
• DataFrame can be created with LlapInputFormat
• Hive execution engines and other DAG executors can use LLAP directly
• like Tez does now
Page29 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Example - SparkSQL integration
• Allowing direct access to Hive
table data from Spark breaks
security model for warehouses
secured using Ranger or SQL
Standard Authorization
• Other features may not work
correctly unless support added in
Spark
• Row/Cell level security (when
implemented), ACID, Schema
Evolution, …
• SparkSQL has interfaces for external
sources; they can be implemented to
access Hive data via HS2/LLAP
• SparkSQL Catalyst optimizer hooks
can be used to push processing into
LLAP (e.g. filters on the tables)
• Some optimizations, like dynamic partition
pruning, can be used by Hive even if SparkSQL
doesn't support them; if execution pushdown
is advanced enough
Page30 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Example - SparkSQL integration – execution flow
Page 30
HadoopRDD.compute()
LlapInputFormat.getRecordReader()
- Open socket for incoming data
- Send package/plan to LLAP
Check permissions
Generate splits w/LLAP locations
Return securely signed splits
HiveServer2
Spark Executor
HadoopRDD.getPartitions()
Get Hive splits
Cluster nodes
Verify splits
Run query plan
Send data back
LLAP
Partitions/
Splits
Request
splits
: :
var llapContext =
LlapContext.newInstance(
sparkContext, jdbcUrl)
var df: DataFrame =
llapContext.sql("select *
from tpch_text_5.region")
DataFrame for Hive/LLAP data
Page31 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Example - Tez/LLAP integration
• Tez DAG coordination remains mostly unchanged
• Tez plugins (TEZ-2003)
• Scheduling – allows scheduling parts of a DAG as fragments on LLAP
• Task communication – custom execution specifications, protocols – allows
talking to LLAP, manages security tokens, etc.
• Hive operators used for processing
• Few or no changes necessary to support any Hive query complexity
• All new Hive features automatically supported
Page32 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Deploying and managing LLAP
Page33 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Availability
• First version shipped in Apache Hive 2.0
• Hive 2.0.1 contains important bugfixes to make it production ready
• Hive 2.1 would have new features and further perf improvements
• A solid platform for the basic scenario – run LLAP-only queries
on a secure cluster
• or a fraction thereof
• Work in progress to support additional features (ACID, UDFs,
hybrid deployments, etc.)
• Unified data access layer is the next step
Page34 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
General overview
• LLAP daemons run on YARN; deployed via Apache Slider
• Easy to bring up, tear down, flex clusters via slider commands
• Hive-on-Tez should be set up with a compatible version (0.8.2+)
• Zookeeper for service discovery and coordination
• On HS2, using doAs is not recommended; use SQL auth or Ranger
• However, you can keep using storage-based auth for other queries, as usual
• Java 8 strongly recommended
• April 2015 called and they want their Java 7 End-of-Lived
Page35 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Starting the cluster
• Apache Ambari integration – WIP (will do all of this for you)
• Hive service (hive --service llap)
• Generates a slider package, and a script to deploy it and start the cluster
• Run as the correct user – slider paths are user-specific!
• HADOOP_HOME, JAVA_HOME should be set
• kinit on secure cluster
• Specify a name, e.g. --name llap0; number of instances (e.g. one per machine);
memory, cache size, number of executors (e.g. one per CPU); see --help
• G1 GC recommended (e.g. --args " -XX:+UseG1GC -XX:+ResizeTLAB -
XX:-ResizePLAB")
Page36 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Monitoring
• LLAP exposes a UI for
monitoring
• Also has jmx endpoint with
much more data, logs and
jstack endpoints as usual
• Aggregate monitoring UI is
work in progress
Page37 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Running queries
• Once the cluster is started, queries can be run from HS2 and CLI
• --hiveconf hive.execution.mode=llap --hiveconf hive.llap.execution.mode=all
--hiveconf hive.llap.io.enabled=true
--hiveconf hive.llap.daemon.service.hosts=@<cluster name>
• Set the usual perf settings (recommended, if not already set)
• Enable vectorization, PPD, map join; use ORC
• LLAP-specific: set hive.tez.input.generate.consistent.splits=true;
• For interactive queries, disable CBO
• Parallel compilation on HS2 – otherwise your parallel queries might bottleneck there!
• Good to go!
Page38 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Watching queries – Tez UI integration
Page39 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Security
• Again, Ambari will do this for you soon
• SQL or Ranger auth recommended, with hive user running queries
• LLAP uses keytabs and ACLs, similar to datanode, to secure endpoints
• Keytabs need to be set up; certs may need to be set up for Web UI
• HS2 (or client) obtains a token to access a particular LLAP cluster, and
gives it to Tez AM
• LLAP-s share tokens using secure ZK, similar to HA token sharing
• Signing fragments on HS2 and granular security checks are work in
progress
Page40 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Summary and future work
Page41 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Future work
• Data view for external services
• Column-level security
• Better integration and deployment
• Ambari, monitoring UI, fine-grained log aggregation
• Hybrid deployment
• Resizing LLAP to accommodate containers, YARN resource delegation, etc.
• Feature work
• ACID, text cache, better cache policies, UDFs, SSD cache, etc.
• More performance work
Page42 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Summary
• LLAP is a new execution substrate for fast, concurrent
analytical workloads, harnessing Hive vectorized SQL
engine and efficient in-memory caching layer
• Provides secure relational view of the data thru a
standard InputFormat API
• Available in Hive 2, with better ecosystem integration in
the works
Page43 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Questions?
?
Interested? Stop by the Hortonworks booth to learn more
Page44 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Backup slides
Page45 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Example execution: MR vs Tez vs Tez+LLAP
M M M
R R
M M
R
M M
R
M M
R
HDFS
HDFS
HDFS
T T T
T T
R
T T
T
R
M M M
R R
R
M M
R
R
HDFS
MapReduce Tez Tez with LLAP
Page46 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
IO Elevator
• Reading, decoding and
processing are parallel
• Unlike in regular Hive
• HDFS reads are expensive
(more so on S3, Azure)
• Data decompression and
decoding is expensive

Más contenido relacionado

La actualidad más candente

Evolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemEvolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemDataWorks Summit/Hadoop Summit
 
Tune up Yarn and Hive
Tune up Yarn and HiveTune up Yarn and Hive
Tune up Yarn and Hiverxu
 
How the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownHow the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownDataWorks Summit
 
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016alanfgates
 
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3DataWorks Summit
 
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...DataWorks Summit/Hadoop Summit
 
Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance DataWorks Summit/Hadoop Summit
 
Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5Chris Nauroth
 
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!Mich Talebzadeh (Ph.D.)
 
What's new in hadoop 3.0
What's new in hadoop 3.0What's new in hadoop 3.0
What's new in hadoop 3.0Heiko Loewe
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive QueriesOwen O'Malley
 
Sub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleSub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleYifeng Jiang
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaDataWorks Summit
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresDataWorks Summit
 
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...DataWorks Summit
 
Network for the Large-scale Hadoop cluster at Yahoo! JAPAN
Network for the Large-scale Hadoop cluster at Yahoo! JAPANNetwork for the Large-scale Hadoop cluster at Yahoo! JAPAN
Network for the Large-scale Hadoop cluster at Yahoo! JAPANDataWorks Summit/Hadoop Summit
 

La actualidad más candente (20)

Evolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemEvolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage Subsystem
 
ORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, SmallerORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, Smaller
 
Tune up Yarn and Hive
Tune up Yarn and HiveTune up Yarn and Hive
Tune up Yarn and Hive
 
How the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownHow the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside Down
 
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
 
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
 
Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance
 
Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5
 
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
 
What's new in hadoop 3.0
What's new in hadoop 3.0What's new in hadoop 3.0
What's new in hadoop 3.0
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Sub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleSub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scale
 
HDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFSHDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFS
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache Samza
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value Stores
 
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...
 
Network for the Large-scale Hadoop cluster at Yahoo! JAPAN
Network for the Large-scale Hadoop cluster at Yahoo! JAPANNetwork for the Large-scale Hadoop cluster at Yahoo! JAPAN
Network for the Large-scale Hadoop cluster at Yahoo! JAPAN
 

Similar a LLAP: Sub-Second Analytical Queries in Hive

LLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveLLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveDataWorks Summit
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksData Con LA
 
LLAP: Building Cloud First BI
LLAP: Building Cloud First BILLAP: Building Cloud First BI
LLAP: Building Cloud First BIDataWorks Summit
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_featuresAlberto Romero
 
Kinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-diveKinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-diveYifeng Jiang
 
Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019alanfgates
 
Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?DataWorks Summit
 
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseDataWorks Summit
 
Curing the Kafka blindness—Streams Messaging Manager
Curing the Kafka blindness—Streams Messaging ManagerCuring the Kafka blindness—Streams Messaging Manager
Curing the Kafka blindness—Streams Messaging ManagerDataWorks Summit
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?DataWorks Summit
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoDataWorks Summit
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017alanfgates
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseDataWorks Summit
 
What is New in Apache Hive 3.0?
What is New in Apache Hive 3.0?What is New in Apache Hive 3.0?
What is New in Apache Hive 3.0?DataWorks Summit
 
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019alanfgates
 
Using Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise
Using Spark Streaming and NiFi for the Next Generation of ETL in the EnterpriseUsing Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise
Using Spark Streaming and NiFi for the Next Generation of ETL in the EnterpriseDataWorks Summit
 

Similar a LLAP: Sub-Second Analytical Queries in Hive (20)

LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
LLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveLLAP: long-lived execution in Hive
LLAP: long-lived execution in Hive
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of Hortonworks
 
LLAP: Building Cloud First BI
LLAP: Building Cloud First BILLAP: Building Cloud First BI
LLAP: Building Cloud First BI
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
 
Kinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-diveKinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-dive
 
Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019
 
Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?
 
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
 
Curing the Kafka blindness—Streams Messaging Manager
Curing the Kafka blindness—Streams Messaging ManagerCuring the Kafka blindness—Streams Messaging Manager
Curing the Kafka blindness—Streams Messaging Manager
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
 
What is New in Apache Hive 3.0?
What is New in Apache Hive 3.0?What is New in Apache Hive 3.0?
What is New in Apache Hive 3.0?
 
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
 
Using Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise
Using Spark Streaming and NiFi for the Next Generation of ETL in the EnterpriseUsing Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise
Using Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise
 

Más de DataWorks Summit/Hadoop Summit

Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerDataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformDataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLDataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...DataWorks Summit/Hadoop Summit
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit/Hadoop Summit
 

Más de DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 

Último

The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 

Último (20)

The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 

LLAP: Sub-Second Analytical Queries in Hive

  • 1. Page1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved LLAP: Sub-Second Analytical Queries in Hive Sergey Shelukhin, Siddharth Seth
  • 2. Page2 © Hortonworks Inc. 2011 – 2015. All Rights Reserved LLAP: long-lived execution in Hive What is LLAP?+ + LLAP in Hive; performance primer+ + LLAP as a unified data access layer+ + Deploying and managing LLAP+ + Future work+ + + How does LLAP make queries faster?+
  • 3. Page3 © Hortonworks Inc. 2011 – 2015. All Rights Reserved What is LLAP?
  • 4. Page4 © Hortonworks Inc. 2011 – 2015. All Rights Reserved What is LLAP? • Hybrid model combining daemons and containers for fast, concurrent execution of analytical workloads (e.g. Hive SQL queries) • Concurrent queries without specialized YARN queue setup • Multi-threaded execution of vectorized operator pipelines • Asynchronous IO and efficient in-memory caching • Relational view of the data available thru the API • High performance scans, execution code pushdown • Centralized data security Node LLAP Process Cache Query Fragment HDFS Query Fragment
  • 5. Page5 © Hortonworks Inc. 2011 – 2015. All Rights Reserved What LLAP isn't • Not another "execution engine" (like Tez, MR, Spark…) • Tez, Spark (work in progress) can work on top of LLAP – Engines provide coordination and scheduling – some work (e.g. large shuffles) can be scheduled in containers • Not a storage substrate • Daemons are stateless and read (and cache) data from HDFS/S3/Azure/..
  • 6. Page6 © Hortonworks Inc. 2011 – 2015. All Rights Reserved LLAP Queue Technical overview – execution • LLAP daemon has a number of executors (think containers) that execute work "fragments" • Fragments are parts of one, or multiple parallel workloads (e.g. Hive SQL queries) • Work queue with pluggable priority • Geared towards low latency queries over long- running queries (by default) • I/O is similar to containers – read/write to HDFS, shuffle, other storages and formats • Streaming output for data API Executor Q1 Map 1 Executor External read Executor Q3 Reducer 3 Q1 Map 1 Q1 Map 1 Q3 Map 19 HDFS Waiting for shuffle inputs HBase Container (shuffle input) Spark executor
  • 7. Page7 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Executor Technical overview – IO layer • Optional when executing inside LLAP • Wraps existing InputFormat • Currently only ORC is supported • Asynchronous IO for Hive • Transparent, compressed in-memory cache • Format-specific, extensible • SSD cache is work in progress RecordReade r Fragment Cache IO thread Plan & decode Read, decompress Metadata cache Actual data (HDFS, S3, …) What to read Data buffers Splits Vectorized data
  • 8. Page8 © Hortonworks Inc. 2011 – 2015. All Rights Reserved LLAP in Hive; performance primer
  • 9. Page9 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Overview - Hive + LLAP • Transparent to Hive users, BI tools, etc. • Except the queries become faster :) • Number of concurrent queries throttled by Hive Server • Hive decides where query fragments run (LLAP, Container, AM) based on configuration, data size, format, etc. • Each Query coordinated independently by a Tez AM • Hive Operators used for processing • Tez Runtime used for data transfer HiveServer2 Query/AM Controller Client(s) YARN Cluster AM1 llapd llapd Container AM1 Container AM1 llapd Container AM2 AM2 AM3 llapd
  • 10. Page10 © Hortonworks Inc. 2011 – 2015. All Rights Reserved How does LLAP speed up the queries? • Eliminates container startup costs • JIT optimizer has a chance to work • Especially important for vectorized processing • Asynchronous IO • Caching • Data sharing (hash join tables, etc.) • Optimizations for parallel queries
  • 11. Page11 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Industry benchmark – 10Tb scale 0 5 10 15 20 25 30 35 40 45 50 query3 query12 query20 query21 query26 query27 query42 query52 query55 query73 query89 query91 query98 Time(seconds) LLAP vs Hive 1.x 10TB Scale Hive 1.x LLAP90% faster
  • 12. Page12 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Evaluation for a real use case 0 20000 1 3 5 7 9 11 13 15 17 19 21 23 25 27 • Aggregate daily statistics for a time interval: SELECT yyyymmdd, sum(total_1), sum(total_2), ... from table where yyyymmdd >= xxx and yyyymmdd < xxx and userid = xxx group by userid, yyyymmdd; Max Max Max Avg Avg Avg 0 1 2 3 4 5 6 7 8 D W M Y D W M Y D W M Y Tez Phoenix LLAP Executiontime,s
  • 13. Page13 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Evaluation for a real use case ExecutionTime,s. • Display a large report • Phoenix runtimes not displayed (30-250s on large ranges) Execution Time in seconds over time range Max Max Avg Avg 0 5 10 15 20 D W M Y D W M Y Tez LLAP Executiontime,s
  • 14. Page14 © Hortonworks Inc. 2011 – 2015. All Rights Reserved How does LLAP make queries faster?
  • 15. Page15 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Persistent daemons (recap) • Reduced startup time • Better use of JIT optimizer • Hash join hash tables, fragment plans are shared • Multiple tasks do not all generate the table or deserialize the plans
  • 16. Page16 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Performance – first query • Cold LLAP is nearly as fast as shared pre-warmed containers (impractical on real clusters) • Realistic (long-running) LLAP ~3x faster than realistic (no prewarm) Tez No prewarm, 20.94 Shared prewarm 11.96 Cold LLAP 13.23 Realistic LLAP 7.49 0 5 10 15 20 25 Firstqueryruntime,s
  • 17. Page17 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Performance – repeated query • Cache disabled! 13.23 7.93 7.7 6.29 5.44 5.3 5.01 4.89 4.83 7.49 6.1 4.61 4.59 4.93 4.62 4.63 4.45 4.25 0 2 4 6 8 10 12 14 0 1 2 3 4 5 6 7 8 Queryruntime,s Query run Cold LLAP LLAP after an unrelated workload
  • 18. Page18 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Parallel queries – priorities, preemption • Lower-priority fragments can be preempted • For example, a fragment can start running before its inputs are ready, for better pipelining; such fragments may be preempted • LLAP work queue examines the DAG parameters to give preference to interactive (BI) queries LLAP QueueExecutor Executor Interactive query map 1/3 … Interactive query map 3/3 Executor Interactive query map 2/3 Wide query reduce waiting Time ClusterUtilization Long-Running Query Short-Running Query
  • 19. Page19 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Parallel query execution – LLAP vs Hive 1.2 3131 2416 1741 1420 1508 531 291 147 94 75 100% 91% 90% 71% 44% 45% 13% 0.00% 20.00% 40.00% 60.00% 80.00% 100.00% 0 500 1000 1500 2000 2500 3000 3500 1 2 4 8 16 Scaling(comparedtolinear) Totalruntimeforallqueries,s. Number of concurrent users Hive 1.2 linear scaling Hive 1.2 total runtime LLAP linear scaling LLAP total runtime Hive+LLAP efficiency Hive 1.2 efficiency
  • 20. Page20 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Parallel query execution – 10Tb scale 531 291 147 94 75 100% 91% 90% 71% 44% 0.00% 20.00% 40.00% 60.00% 80.00% 100.00% 0 100 200 300 400 500 600 1 2 4 8 16 Scaling(comparedtolinear) Totalruntimeforallqueries,s. Number of concurrent users LLAP linear scaling LLAP total runtime Hive+LLAP efficiency
  • 21. Page21 © Hortonworks Inc. 2011 – 2015. All Rights Reserved IO elevator and caching • Reading, decoding and processing are parallel • Unlike in regular Hive, where processing can wait for e.g. an S3 read • Logically-compresses (e.g. ORC-encoded) data is cached off-heap • Metadata is cached (currently on-heap) with high priority • Replacement policy is pluggable (LRFU is the default) • Tez schedules work to preferred location to improve locality • Tradeoff between HDFS reads and operator speed • Cache takes memory away from operators (sort buffers, hash join tables, etc.) • Depends on how much memory you have, workflow, dataset size, etc. • Ex. 4Gb per executor x 16 CPUs – 64Gb for executors, rest for cache
  • 22. Page22 © Hortonworks Inc. 2011 – 2015. All Rights Reserved I/OExecution In-memory processing – present and future Decoder Distributed FS Fragment Hive operator Hive operator Vectorized processin g col1 col2 Low-level pushdown (e.g. filter)* SSD cache* Off-heap cache Compression codec Native data vectors - work in progress * Compact encoded data Compressed data col1 col2
  • 23. Page23 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Fragment I/OExecution In-memory processing – future? Distributed FSHive operator Hive operator col1 col2 SSD cache* Off-heap cache Compression codec - work in progress * Compact encoded data Compressed data Processing on encoded data, codegen Zero copy - next steps? * Low-level operators (e.g. filter) *
  • 24. Page24 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Performance – cache on slow FS • Industry-standard cloud FS, much slower than on-prem HDFS • Large scan on 1Tb scale • Local HD/SDD cache (WIP) • Massive improvement on slow storage with little memory cost 0 50 100 150 200 250 300 350 400 Cold cache Hot cache Runtime,s Entire query Scan portion
  • 25. Page25 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Performance – cache on HDFS, 1Tb scale 20.2 18.3 17.6 16.2 16.2 16.8 14.5 11.3 0.00% 20.00% 40.00% 60.00% 80.00% 100.00% 0 5 10 15 20 25 24 26 28 30 32 34 36 38 40 42 Cachehitrate Queryruntime,s Cache size, Gb Query runtime, after unrelated workload Query runtime, after related workload Metadata hit rate Data hit rate
  • 26. Page26 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Perf overview – a demo in a gif
  • 27. Page27 © Hortonworks Inc. 2011 – 2015. All Rights Reserved LLAP as the unified data access layer
  • 28. Page28 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Overview • LLAP can provide a "relational datanode" view of the data • Security via endpoint ACLs, or fragment signing (for external engines) • Granular, e.g. column-level, security is possible • Anyone (with access) can push the (approved) code in, from complex query fragments to simple data reads • SparkSQL/LLAP integration is based on SparkSQL data source APIs • DataFrame can be created with LlapInputFormat • Hive execution engines and other DAG executors can use LLAP directly • like Tez does now
  • 29. Page29 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Example - SparkSQL integration • Allowing direct access to Hive table data from Spark breaks security model for warehouses secured using Ranger or SQL Standard Authorization • Other features may not work correctly unless support added in Spark • Row/Cell level security (when implemented), ACID, Schema Evolution, … • SparkSQL has interfaces for external sources; they can be implemented to access Hive data via HS2/LLAP • SparkSQL Catalyst optimizer hooks can be used to push processing into LLAP (e.g. filters on the tables) • Some optimizations, like dynamic partition pruning, can be used by Hive even if SparkSQL doesn't support them; if execution pushdown is advanced enough
  • 30. Page30 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Example - SparkSQL integration – execution flow Page 30 HadoopRDD.compute() LlapInputFormat.getRecordReader() - Open socket for incoming data - Send package/plan to LLAP Check permissions Generate splits w/LLAP locations Return securely signed splits HiveServer2 Spark Executor HadoopRDD.getPartitions() Get Hive splits Cluster nodes Verify splits Run query plan Send data back LLAP Partitions/ Splits Request splits : : var llapContext = LlapContext.newInstance( sparkContext, jdbcUrl) var df: DataFrame = llapContext.sql("select * from tpch_text_5.region") DataFrame for Hive/LLAP data
  • 31. Page31 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Example - Tez/LLAP integration • Tez DAG coordination remains mostly unchanged • Tez plugins (TEZ-2003) • Scheduling – allows scheduling parts of a DAG as fragments on LLAP • Task communication – custom execution specifications, protocols – allows talking to LLAP, manages security tokens, etc. • Hive operators used for processing • Few or no changes necessary to support any Hive query complexity • All new Hive features automatically supported
  • 32. Page32 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Deploying and managing LLAP
  • 33. Page33 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Availability • First version shipped in Apache Hive 2.0 • Hive 2.0.1 contains important bugfixes to make it production ready • Hive 2.1 would have new features and further perf improvements • A solid platform for the basic scenario – run LLAP-only queries on a secure cluster • or a fraction thereof • Work in progress to support additional features (ACID, UDFs, hybrid deployments, etc.) • Unified data access layer is the next step
  • 34. Page34 © Hortonworks Inc. 2011 – 2015. All Rights Reserved General overview • LLAP daemons run on YARN; deployed via Apache Slider • Easy to bring up, tear down, flex clusters via slider commands • Hive-on-Tez should be set up with a compatible version (0.8.2+) • Zookeeper for service discovery and coordination • On HS2, using doAs is not recommended; use SQL auth or Ranger • However, you can keep using storage-based auth for other queries, as usual • Java 8 strongly recommended • April 2015 called and they want their Java 7 End-of-Lived
  • 35. Page35 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Starting the cluster • Apache Ambari integration – WIP (will do all of this for you) • Hive service (hive --service llap) • Generates a slider package, and a script to deploy it and start the cluster • Run as the correct user – slider paths are user-specific! • HADOOP_HOME, JAVA_HOME should be set • kinit on secure cluster • Specify a name, e.g. --name llap0; number of instances (e.g. one per machine); memory, cache size, number of executors (e.g. one per CPU); see --help • G1 GC recommended (e.g. --args " -XX:+UseG1GC -XX:+ResizeTLAB - XX:-ResizePLAB")
  • 36. Page36 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Monitoring • LLAP exposes a UI for monitoring • Also has jmx endpoint with much more data, logs and jstack endpoints as usual • Aggregate monitoring UI is work in progress
  • 37. Page37 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Running queries • Once the cluster is started, queries can be run from HS2 and CLI • --hiveconf hive.execution.mode=llap --hiveconf hive.llap.execution.mode=all --hiveconf hive.llap.io.enabled=true --hiveconf hive.llap.daemon.service.hosts=@<cluster name> • Set the usual perf settings (recommended, if not already set) • Enable vectorization, PPD, map join; use ORC • LLAP-specific: set hive.tez.input.generate.consistent.splits=true; • For interactive queries, disable CBO • Parallel compilation on HS2 – otherwise your parallel queries might bottleneck there! • Good to go!
  • 38. Page38 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Watching queries – Tez UI integration
  • 39. Page39 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Security • Again, Ambari will do this for you soon • SQL or Ranger auth recommended, with hive user running queries • LLAP uses keytabs and ACLs, similar to datanode, to secure endpoints • Keytabs need to be set up; certs may need to be set up for Web UI • HS2 (or client) obtains a token to access a particular LLAP cluster, and gives it to Tez AM • LLAP-s share tokens using secure ZK, similar to HA token sharing • Signing fragments on HS2 and granular security checks are work in progress
  • 40. Page40 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Summary and future work
  • 41. Page41 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Future work • Data view for external services • Column-level security • Better integration and deployment • Ambari, monitoring UI, fine-grained log aggregation • Hybrid deployment • Resizing LLAP to accommodate containers, YARN resource delegation, etc. • Feature work • ACID, text cache, better cache policies, UDFs, SSD cache, etc. • More performance work
  • 42. Page42 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Summary • LLAP is a new execution substrate for fast, concurrent analytical workloads, harnessing Hive vectorized SQL engine and efficient in-memory caching layer • Provides secure relational view of the data thru a standard InputFormat API • Available in Hive 2, with better ecosystem integration in the works
  • 43. Page43 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Questions? ? Interested? Stop by the Hortonworks booth to learn more
  • 44. Page44 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Backup slides
  • 45. Page45 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Example execution: MR vs Tez vs Tez+LLAP M M M R R M M R M M R M M R HDFS HDFS HDFS T T T T T R T T T R M M M R R R M M R R HDFS MapReduce Tez Tez with LLAP
  • 46. Page46 © Hortonworks Inc. 2011 – 2015. All Rights Reserved IO Elevator • Reading, decoding and processing are parallel • Unlike in regular Hive • HDFS reads are expensive (more so on S3, Azure) • Data decompression and decoding is expensive

Notas del editor

  1. 6 months note