Hive2.0 big dataspain-nov-2016

Apache Hive 2:
SQL, Speed, Scale
Alan Gates
Hive PMC Member
Co-founder Hortonworks

2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Acknowledgements
 The Apache Hive community for building all this awesome tech
 Content of some of these slides based on earlier presentations by Sergey Shelukhin
and Siddarth Seth
 alias Hive=‘Apache Hive’
alias Hadoop=‘Apache Hadoop’
alias Spark=‘Apache Spark’
alias Tez=‘Apache Tez’
alias Parquet=‘Apache Parquet’
alias ORC=‘Apache ORC’
alias Calcite=‘Apache Calcite’

Apache Hive History
 Initially Hive provided SQL on Hadoop
– Provided a table view instead of file view of data
– Translated SQL to MapReduce
– Mostly used for ETL (Extract Transform Load)
– Big, batch, high start up time
 Around 2012 it became clear users wanted to do all data warehousing on Hadoop,
not just batch ETL
 Hive has shifted over time to focus on traditional data warehousing problems
– Still does large ETL well
– Now also can be used for analytics, reporting
– Work being done to better support BI (Business Intelligence) tools
 Not OLTP, very focused on backend analytics

Hive 1.x and 2.x
 New feature development in Hive moving at a fast pace
– Stressful for those who use Hive for its original purpose (ETL type SQL on MapReduce)
– Realizing the full potential of Hive as data warehouse on Hadoop requires more changes
 Compromise: follow Hadoop’s example, split into stable and new feature lines
 1.x
– Stable
– Backwards compatible
– Ongoing bug fixes
 2.x
– Major new features
– Backwards compatible where possible, but some things will be broken
– Hive 2.0 released February 15, 2016 – Not considered production ready
– Hive 2.1 released June 20, 2016 – Getting closer, but still beta

Hive 2 Releases
 2.0 released February 2016
– 1039 JIRAs resolved with 2.0 as fix version
• 666 bugs
• 140 improvements or new features
– 2.0.1 released in May with 68 bug fixes
 2.1 released June 2016
– 633 JIRAs resolved
• 391 bugs
• 128 new features or improvements

Hive 2.0 New Features Overview
 HPLSQL
 LLAP
 Hive-On-Spark Improvements
 Cost Based Optimizer Improvements
 Many, many new features and bug fixes I will not have time to cover
 Plus some really cool stuff being worked on now that I will touch on at the end

Adding Procedural SQL: HPLSQL
 Procedural SQL, akin to Oracle’s PL/SQL and Teradata’s stored procedures
– Adds cursors, loops (FOR, WHILE, LOOP), branches (IF), HPLSQL procedures, exceptions (SIGNAL)
 Aims to be compatible with all major dialects of procedural SQL to maximize re-use of
existing scripts
 Currently external to Hive, communicates with Hive via JDBC.
– User runs command using hplsql binary
– Goal is to tightly integrate it so that Hive’s parser can execute HPLSQL, store HPLSQL procedures,
etc.

Sub-second Queries in Hive: LLAP (Live Long and Process)
 Persistent daemons
– Saves time on process start up (eliminates container allocation and JVM start up time)
– All code JITed within a query or two
 Data caching with an async I/O elevator
– Hot data cached in memory (columnar aware, so only hot columns cached)
– When possible work scheduled on node with data cached, if not work will be run in other node
 Operators can be executed inside LLAP when it makes sense
– Large, ETL style queries usually don’t make sense
– User code not run in LLAP for security
 Working on interface to allow other data engines to read securely in parallel
 Beta in 2.0, 2.1

Hive With LLAP Execution Options
AM AM
T T T
R R
R
T T
T
R
M M M
R R
R
M M
R
R
Tez Only LLAP + Tez
T T T
R R
R
T T
T
R
LLAP only

© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hive 2 with LLAP: 25+x Performance Boost
0
5
10
15
20
25
30
35
40
45
50
0
50
100
150
200
250
Speedup(xFactor)
QueryTime(s)(LowerisBetter)
Hive 2 with LLAP averages 26x faster than Hive 1
Hive 1 / Tez Time (s) Hive 2 / LLAP Time(s) Speedup (x Factor)

© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Performance Compared with an MPP System (Apache Impala)

LLAP On Going Work
 Currently in beta, hope to have it stable by early next year
 Read only, write path being worked on (see HIVE-14535)
 ACID integration being worked on
 User must decide whether query runs fully in LLAP, mixed mode, or not at all
– Should be handled by CBO
 Currently only reads ORC files, other options being added
 Currently only integrates with Tez as an engine

Improvements to Hive on Spark
 Dynamic partition pruning
 Make use of spark persistence for self-join, self-union, and CTEs
 Vectorized map-join and other map-join improvements
 Parallel order by
 Pre-warming of containers
 Support for Spark 1.5
 Many bug fixes

Cost Base Optimizer (CBO) Improvements
 Hive’s CBO uses Calcite
– Not all optimization rules migrated yet, but 2.0 continues work towards that
 CBO on by default in 2.0 (wasn’t in in 1.x)
 Main focus of CBO work has been BI queries (using TPC-DS as guide)
– Some work on machine generated queries, since tools generate some funky queries
 Focus on improving stats collection and estimating stats more accurately between
operators in the plan

And Many, Many More
• SQL Standard Auth is the default authorization (actually works)
• First pass at storing metadata in HBase instead of RDBMS
• CLI mode for beeline (WIP to replace and deprecate CLI in Hive 2.*)
• Codahale-based metrics (also in 1.3)
• HS2 Web UI
• Stability Improvements and bugfixes for ACID (almost production ready now)
• Native vectorized mapjoin, vectorized reducesink, improved vectorized GBY, etc.
• Improvements to Parquet performance (PPD, memory manager, etc.)
• ORC schema evolution
• Improvement to windowing functions, refactoring ORC before split, SIMD
optimizations, new LIMIT syntax, parallel compilation in HS2, improvements to Tez
session management, many more

Features Being Worked on in Hive 2
 LLAP GA
 Vastly improved performance for ACID tables
– Reworked the way ACID delta files are stored and organized
 Adding MERGE for ACID
 Support for transactional inserts for non-ACID tables
 Vast readability improvements in EXPLAIN
 Materialized views
 Integration with Druid
 Continued progress towards full SQL 2011
– INTERSECT, EXCEPT, extended subquery support, non-equijoins
 Hive on Spark support for Spark 2.0

Hive 2 Incompatibilities
 Java 7 & 8 supported, 6 no longer supported
 Requires Hadoop 2.x, Hadoop 1.x no longer supported
 MapReduce deprecated, Tez or Spark recommended instead
– At some future date MR will be removed
 Some configuration defaults changed, e.g.
– bucketing enforced by default
– metadata schema no longer created if it is missing
– SQL Standard authorization used by default
 We plan to remove Hive CLI in the future and replace with beeline CLI
– Makes it easier for users to deploy secure clusters where all access is via [OJ]DBC
– It is cleaner to maintain one code path

Thank You

Hive2.0 big dataspain-nov-2016

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Destacado

Destacado (12)

Similar a Hive2.0 big dataspain-nov-2016

Similar a Hive2.0 big dataspain-nov-2016 (20)

Último

Último (20)

Hive2.0 big dataspain-nov-2016

Notas del editor