What's New in Apache Hive

© Cloudera, Inc. All rights reserved.
WHAT’S NEW IN APACHE HIVE 3 FOR HDP 3.1
Jason Dere
Apache Hive PMC Member
jdere@cloudera.com

AGENDA
Apache Hive 3
Data Analytics Studio
Coming Soon

Apache Hive 3

Hive LLAP - MPP Performance at Hadoop Scale
[Architecture diagram: SQL queries arrive over ODBC/JDBC at HiveServer2 (the query endpoint), which works with a set of query coordinators. The Hadoop cluster runs several LLAP daemons, each with query executors and an in-memory cache that is shared across all users. Deep storage is HDFS and compatible stores: S3, WASB, Isilon.]

Hive 3: Focus on the Enterprise Data Warehouse
[Diagram: BI tools and the Hive 3 feature areas listed below]
Materialized view, surrogate key, constraints
• Invisible tuning of DB from users' perspective
• More "tools" for optimizer to use
• More "tools" for DBAs to tune/optimize
Query result cache
• Results return from HDFS/cache directly
• Reduce load from repetitive queries
• Allows more queries to be run in parallel
Workload management
• Reduce resource starvation in large clusters
• Active/Passive HA
ACID v2
• ACID v2 is as fast as regular tables
Cloud storage
• Hive 3 is optimized for S3/WASB/GCP
Connectors
• Support for JDBC/Kafka/Druid out of the box

New SQL Features

Materialized view
Optimizing workloads and queries without changing the SQL
SELECT distinct dest,origin
FROM flights;
SELECT origin, count(*)
FROM flights
GROUP BY origin
HAVING origin = 'OAK';
CREATE MATERIALIZED VIEW flight_agg
AS
SELECT dest,origin,count(*)
FROM flights
GROUP BY dest,origin;
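
The rewrite idea can be checked outside Hive. The following is a minimal sketch, assuming an in-memory SQLite database as a stand-in for Hive and a handful of made-up flight rows. A plain table plays the role of flight_agg, and the rewritten queries are written by hand; in Hive the optimizer is what substitutes the materialized view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (dest TEXT, origin TEXT)")
# Hypothetical sample rows, not taken from the presentation.
conn.executemany(
    "INSERT INTO flights VALUES (?, ?)",
    [("SFO", "OAK"), ("SFO", "OAK"), ("LAX", "OAK"), ("JFK", "SJC"), ("LAX", "SJC")],
)

# A plain table stands in for the materialized view flight_agg.
conn.execute(
    "CREATE TABLE flight_agg AS "
    "SELECT dest, origin, count(*) AS cnt FROM flights GROUP BY dest, origin"
)


def rows(sql):
    """Run a query and return its rows in a stable order."""
    return sorted(conn.execute(sql).fetchall())


# Query 1: original form, then a hand-written rewrite over the aggregate.
q1_base = rows("SELECT DISTINCT dest, origin FROM flights")
q1_view = rows("SELECT dest, origin FROM flight_agg")

# Query 2: the per-origin count is the sum of the pre-aggregated counts.
q2_base = rows(
    "SELECT origin, count(*) FROM flights GROUP BY origin HAVING origin = 'OAK'"
)
q2_view = rows(
    "SELECT origin, sum(cnt) FROM flight_agg WHERE origin = 'OAK' GROUP BY origin"
)

print(q1_base == q1_view, q2_base == q2_view)
```

Both comparisons hold because flight_agg keeps every (dest, origin) pair along with its row count, which is all either query needs.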

Materialized view - Maintenance
• Partial table rewrites are supported
– Typical: Denormalize last month of data only
– Rewrite engine will produce union of latest and historical data
• Updates to base tables
– Invalidates views, but
– Can choose to allow stale views (max staleness) for performance
– Can partially match views and compute delta after updates
• Incremental updates
– Common classes of views allow for incremental updates
– Others need full refresh
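
As an illustration of why some view classes can be maintained incrementally, here is a minimal sketch assuming a count-per-group view and insert-only changes to the base table. The function names and sample rows are hypothetical; this is not how Hive implements its rebuild.

```python
from collections import Counter


def full_refresh(base_rows):
    """Recompute the count-per-(dest, origin) view from the whole base table."""
    return Counter(base_rows)


def incremental_update(view, inserted_rows):
    """Fold only the newly inserted rows into an existing view."""
    updated = Counter(view)
    updated.update(inserted_rows)
    return updated


base = [("SFO", "OAK"), ("LAX", "OAK")]
view = full_refresh(base)

# New rows arrive in the base table after the view was built.
delta = [("SFO", "OAK"), ("JFK", "SJC")]
view_inc = incremental_update(view, delta)
view_full = full_refresh(base + delta)

print(view_inc == view_full)
```

Counts can be merged from a delta alone, so the incremental result matches a full recomputation. A view whose result cannot be derived from the old result plus the delta would need the full refresh path.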

Constraints & defaults
• Helps optimizer to produce better plans
• BI tool integrations
• Data Integrity
• hive.constraint.notnull.enforce = true
• SQL compatibility & offload scenarios
Example:
CREATE TABLE Persons (
ID Int NOT NULL,
Name String NOT NULL,
Age Int,
Creator String DEFAULT CURRENT_USER(),
CreateDate Date DEFAULT CURRENT_DATE(),
PRIMARY KEY (ID) DISABLE NOVALIDATE
);
CREATE TABLE BusinessUnit (
ID Int NOT NULL,
Head Int NOT NULL,
Creator String DEFAULT CURRENT_USER(),
CreateDate Date DEFAULT CURRENT_DATE(),
PRIMARY KEY (ID) DISABLE NOVALIDATE,
CONSTRAINT fk FOREIGN KEY (Head)
REFERENCES Persons(ID) DISABLE
NOVALIDATE
);
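
To show what NOT NULL enforcement and column defaults mean for inserts, here is a minimal sketch using an in-memory SQLite database as a stand-in. SQLite's constraint handling is not Hive's (there is no DISABLE NOVALIDATE here), and the simplified persons table is an assumption made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE persons ("
    " id INTEGER NOT NULL,"
    " name TEXT NOT NULL,"
    " age INTEGER,"
    " create_date TEXT DEFAULT CURRENT_DATE)"
)

# create_date is omitted, so the column default fills it in.
conn.execute("INSERT INTO persons (id, name) VALUES (1, 'Ada')")
create_date = conn.execute(
    "SELECT create_date FROM persons WHERE id = 1"
).fetchone()[0]

# A NULL in a NOT NULL column is rejected when the constraint is enforced.
try:
    conn.execute("INSERT INTO persons (id, name) VALUES (2, NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(create_date, rejected)
```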

HIVE-1010: Information schema & sysdb
Question:
Find which tables have a column with ‘ssn’
as part of the column name?
use information_schema;
SELECT table_schema, table_name
FROM information_schema.columns
WHERE column_name LIKE '%ssn%';
Question:
Find the biggest tables in the system.
use sys;
SELECT tbl_name, total_size
FROM table_stats_view v, tbls t
WHERE t.tbl_id = v.tbl_id ORDER BY
cast(v.total_size as int) DESC LIMIT 3;

Connectors

JDBC connector
• How did we build the information_schema?
• We mapped the metastore into Hive's table space!
• Uses the Hive JDBC connector
• Read-only for now
• Supports automatic pushdown of full subqueries
• Cost-based optimizer decides which part of a query runs in the RDBMS and which part runs in Hive
• Joins, aggregates, filters, projections, etc.
In Postgres:
CREATE TABLE postgres_table (
id INT,
name varchar
);
In Hive:
CREATE EXTERNAL TABLE hive_table (
id INT,
name STRING
) STORED BY
'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
"hive.sql.database.type" = "POSTGRES",
"hive.sql.jdbc.driver"="org.postgresql.Driver",
"hive.sql.jdbc.url"="jdbc:postgresql://...",
"hive.sql.dbcp.username"="jdbctest",
"hive.sql.dbcp.password"="",
"hive.sql.query"="select * from postgres_table",
"hive.sql.column.mapping" = "id=ID, name=NAME",
"hive.jdbc.update.on.duplicate" = "true"
);

Druid Connector - Joins between Hive and real-time data in Druid
Bloom filter pushdown greatly reduces data transfer
Send a promotional email to all customers from CA who purchased more than $1000 worth of merchandise today.
create external table sales (`__time` timestamp, quantity int, sales_price double, customer_id bigint, item_id int, store_id int)
stored by 'org.apache.hadoop.hive.druid.DruidStorageHandler'
tblproperties ("kafka.bootstrap.servers" = "localhost:9092", "kafka.topic" = "sales-topic",
"druid.kafka.ingestion.maxRowsInMemory" = "5");
create table customers (customer_id bigint, first_name string, last_name string, email string, state string);
select email from customers join sales using (customer_id)
where to_date(sales.`__time`) = date '2018-09-06'
and quantity * sales_price > 1000 and customers.state = 'CA';

Kafka connector
Transformation over stream in real time
I want to compute a moving average over a sliding window on a stock ticker Kafka stream and write the result back to Kafka.
create external table
tickers (`__time` timestamp, stock_id bigint, stock_sym varchar(4), price decimal(10,2), exchange_id int)
stored by 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
tblproperties ("kafka.topic" = "stock-topic", "kafka.bootstrap.servers"="localhost:9092",
"kafka.serde.class"="org.apache.hadoop.hive.serde2.JsonSerDe");
create external table
moving_avg (`__time` timestamp, stock_id bigint, avg_price decimal(10,2))
stored by 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
tblproperties ("kafka.topic" = "averages-topic", "kafka.bootstrap.servers"="localhost:9092",
"kafka.serde.class"="org.apache.hadoop.hive.serde2.JsonSerDe");
insert into table moving_avg
select CURRENT_TIMESTAMP, stock_id, avg(price)
from tickers
where `__timestamp` > to_unix_timestamp(CURRENT_TIMESTAMP - interval '5' minutes) * 1000
group by stock_id;
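
The windowed aggregation above can be sketched in plain Python. This is a minimal illustration, assuming ticks arrive as (timestamp in milliseconds, stock_id, price) tuples; the function name and sample values are hypothetical and no Kafka client is involved.

```python
from collections import defaultdict

WINDOW_MS = 5 * 60 * 1000  # five-minute window, as in the query


def moving_average(ticks, now_ms, window_ms=WINDOW_MS):
    """Average price per stock over ticks newer than now_ms - window_ms."""
    totals = defaultdict(lambda: [0.0, 0])  # stock_id -> [price sum, tick count]
    for ts, stock_id, price in ticks:
        if ts > now_ms - window_ms:
            totals[stock_id][0] += price
            totals[stock_id][1] += 1
    return {stock_id: s / n for stock_id, (s, n) in totals.items()}


now = 1_000_000
ticks = [
    (now - 400_000, 1, 10.0),  # older than five minutes, ignored
    (now - 200_000, 1, 20.0),
    (now - 100_000, 1, 30.0),
    (now - 50_000, 2, 5.0),
]
averages = moving_average(ticks, now)
print(averages)
```

Running the function on a schedule with a fresh now_ms gives the sliding behavior; each run corresponds to one execution of the INSERT statement.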

ACID v2

V1: CREATE TABLE hello_acid (load_date date, key int, value int)
CLUSTERED BY(key) INTO 3 BUCKETS
STORED AS ORC TBLPROPERTIES ('transactional'='true');
V2: CREATE TABLE hello_acid_v2 (load_date date, key int, value int);
• Performance just as good as non-ACID tables
• No bucketing required
• Non-ORC formats supported (INSERT & SELECT only)
• Fully compatible with native cloud storage

Workload Management

LLAP workload management
⬢ Effectively share LLAP cluster resources
– Resource allocation per user policy; separate ETL and BI, etc.
⬢ Resource-based guardrails
– Protect against long-running queries and high memory usage
⬢ Improved, query-aware scheduling
– Scheduler is aware of query characteristics, types, etc.
– Fragments are easier to pre-empt than containers
– Queries get guaranteed fractions of the cluster, but can use empty space

Guardrail Example
Common Triggers
● ELAPSED_TIME
● EXECUTION_TIME
● TOTAL_TASKS
● HDFS_BYTES_READ, HDFS_BYTES_WRITTEN
● CREATED_FILES
● CREATED_DYNAMIC_PARTITIONS
Example
CREATE RESOURCE PLAN guardrail;
CREATE TRIGGER guardrail.long_running WHEN EXECUTION_TIME > 2000 DO KILL;
ALTER TRIGGER guardrail.long_running ADD TO UNMANAGED;
ALTER RESOURCE PLAN guardrail ENABLE ACTIVATE;

Resource plans example
CREATE RESOURCE PLAN daytime;
CREATE POOL daytime.bi WITH ALLOC_FRACTION=0.8, QUERY_PARALLELISM=5;
CREATE POOL daytime.etl WITH ALLOC_FRACTION=0.2, QUERY_PARALLELISM=20;
CREATE TRIGGER daytime.downgrade WHEN total_runtime > 3000 DO MOVE TO etl;
ALTER TRIGGER daytime.downgrade ADD TO POOL bi;
CREATE APPLICATION MAPPING 'tableau' IN daytime TO bi;
ALTER RESOURCE PLAN daytime SET DEFAULT POOL = etl;
ALTER RESOURCE PLAN daytime ENABLE ACTIVATE;
[Diagram: resource plan daytime with pools bi (80%) and etl (20%); queries in bi are downgraded to etl when total_runtime > 3000]
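
The behavior the daytime plan describes can be modeled in a few lines. This is a toy sketch, not Hive's scheduler: the Pool class, the helper functions, and the notion of a fixed number of cluster executors are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    alloc_fraction: float
    query_parallelism: int


# Mirrors the pools, mapping and default pool of the daytime plan.
pools = {"bi": Pool(0.8, 5), "etl": Pool(0.2, 20)}
app_mappings = {"tableau": "bi"}
DEFAULT_POOL = "etl"


def route(application):
    """Pick the pool for a query based on its application mapping."""
    return app_mappings.get(application, DEFAULT_POOL)


def apply_downgrade(pool, total_runtime, limit=3000):
    """Trigger attached to bi: move the query to etl once it runs too long."""
    if pool == "bi" and total_runtime > limit:
        return "etl"
    return pool


def guaranteed_executors(pool, cluster_executors):
    """Guaranteed share of a (hypothetical) executor count for a pool."""
    return round(pools[pool].alloc_fraction * cluster_executors)


pool = route("tableau")
print(pool, apply_downgrade(pool, 3500), guaranteed_executors(pool, 100))
```

A Tableau query starts in bi, keeps its 80% guarantee while it is short, and is moved to etl once it crosses the runtime limit, which protects the interactive pool from long-running work.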

Performance

Benchmark journey
• Ran all 99 TPCDS queries
• Total query runtime has improved multifold in each release!
[Chart: TPCDS at 10TB scale on a 10-node cluster, 2016 to 2018. Configurations in order: HDP 2.5 Hive1, HDP 2.5 LLAP, HDP 2.6 LLAP, HDP 3.0 LLAP (ACID tables). Speedup labels in order: 25x, 3x, 2x.]

Postech University benchmark
• Performed by Postech University (Korea)
• Compares LLAP, Spark, Presto, Tez, and MR3
• Shows Hive3/LLAP fastest in aggregate and for most queries
• Indigo cluster: 20 nodes, 96GB, 2 disks, 3TB TPCDS
MR3 benchmark

OLAP Vectorization
• Faster analytical queries with improved vectorization in HDP 3.0
• Vectorized execution of PTF, rollup and grouping sets
• Perf gain compared to HDP 2.6
• TPCDS query67 ~ 10x!

Shared scan - 4x improvement!
TPCDS SQL query 28 joins 6 instances of store_sales table
SELECT * FROM
( SELECT AVG(ss_list_price) B1_LP,
COUNT(ss_list_price) B1_CNT,
COUNT(DISTINCT ss_list_price) B1_CNTD
FROM store_sales
WHERE ss_quantity BETWEEN 0 AND 5 AND
(ss_list_price BETWEEN 11 AND 11+10 OR
ss_coupon_amt BETWEEN 460 AND 460+1000 OR
ss_wholesale_cost BETWEEN 14 AND 14+20)) B1,
( SELECT AVG(ss_list_price) B2_LP,
COUNT(ss_list_price) B2_CNT,
COUNT(DISTINCT ss_list_price) B2_CNTD
FROM store_sales
WHERE ss_quantity BETWEEN 6 AND 10 AND
(ss_list_price BETWEEN 91 AND 91+10 OR
ss_coupon_amt BETWEEN 1430 AND 1430+1000 OR
ss_wholesale_cost BETWEEN 32 AND 32+20)) B2,
. . .
LIMIT 100;
[Plan diagram: one scan of store_sales with the combined OR'ed B1-B6 filters, followed by per-branch filters (B1 Filter to B5 Filter shown), each with a reduce sink (RS), feeding the join]
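
The shared-scan idea can be sketched as one pass over the table that feeds every branch. This is a minimal illustration with simplified branch predicates (only ss_quantity and ss_list_price are checked) and made-up rows; the function name is hypothetical and this is not Hive's operator implementation.

```python
def shared_scan(rows, branches):
    """Compute (avg, count, distinct count) of ss_list_price per branch in one pass."""
    stats = {name: {"sum": 0.0, "count": 0, "distinct": set()} for name in branches}
    for row in rows:  # the table is read once, not once per branch
        for name, predicate in branches.items():
            if predicate(row):
                s = stats[name]
                s["sum"] += row["ss_list_price"]
                s["count"] += 1
                s["distinct"].add(row["ss_list_price"])
    return {
        name: (
            s["sum"] / s["count"] if s["count"] else None,
            s["count"],
            len(s["distinct"]),
        )
        for name, s in stats.items()
    }


# Simplified versions of the B1 and B2 filters from query 28.
branches = {
    "B1": lambda r: 0 <= r["ss_quantity"] <= 5 and 11 <= r["ss_list_price"] <= 21,
    "B2": lambda r: 6 <= r["ss_quantity"] <= 10 and 91 <= r["ss_list_price"] <= 101,
}

store_sales = [
    {"ss_quantity": 3, "ss_list_price": 12.0},
    {"ss_quantity": 4, "ss_list_price": 20.0},
    {"ss_quantity": 4, "ss_list_price": 12.0},
    {"ss_quantity": 8, "ss_list_price": 95.0},
    {"ss_quantity": 8, "ss_list_price": 50.0},
]

result = shared_scan(store_sales, branches)
print(result)
```

Without sharing, each of the six subqueries would scan store_sales on its own; here the cost of reading the rows is paid once and only the cheap per-branch filtering is repeated.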

Dynamic Semijoin Reduction - 7x improvement for q72
• Dramatically improves performance of very selective joins
• Builds a bloom filter from one side of the join and uses it to filter rows from the other side
• Skips the scan and further evaluation of rows that would not qualify for the join
SELECT …
FROM sales JOIN time ON
sales.time_id = time.time_id
WHERE time.year = 2014 AND
time.quarter IN ('Q1', 'Q2')
[Diagram label: Reduced scan on sales]
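
The bloom-filter step described above can be sketched as follows. This is a minimal illustration, assuming a small hand-rolled bloom filter built on SHA-256 and made-up time and sales rows; the class, sizes and hash scheme are assumptions, not Hive's implementation.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


# (time_id, year, quarter) and (sale_id, time_id): hypothetical sample data.
time_rows = [(1, 2014, "Q1"), (2, 2014, "Q2"), (3, 2014, "Q3"), (4, 2013, "Q1")]
sales_rows = [(100, 1), (101, 3), (102, 2), (103, 4), (104, 1)]

# Build the filter from the small, filtered side of the join.
qualifying_ids = {t for t, year, q in time_rows if year == 2014 and q in ("Q1", "Q2")}
bloom = BloomFilter()
for time_id in qualifying_ids:
    bloom.add(time_id)

# Apply it while scanning the large side, before the real join.
candidates = [row for row in sales_rows if bloom.might_contain(row[1])]

# The exact join still runs afterwards and removes any false positives.
joined = [row for row in candidates if row[1] in qualifying_ids]
print(joined)
```

The filter may let a few non-matching rows through, so the join itself is still needed, but every row it drops is one that never has to be shuffled or probed.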

Data Analytics Studio

SOLUTIONS: Full-featured auto-complete, direct download of results, quick data preview, and many other quality-of-life improvements

SOLUTIONS: Pre-defined searches to quickly narrow down problematic queries in a large cluster

SOLUTIONS: Heuristic recommendation engine; fully self-service query and storage optimization

Query compare allows side-by-side comparison of query details, explain plan, configuration, and execution details

SOLUTIONS: Data Analytics Studio provides a database heatmap to quickly discover which parts of your cluster are utilized most

One of the Extensible DataPlane Services
⬢ DAS 1.2 available now for HDP 3.1!
⬢ Replaces Hive & Tez Views
⬢ Monthly release cadence
⬢ Separate install from stack
[Diagram: DataPlane Service. Extensible services: Data Lifecycle Manager, Data Steward Studio, Data Analytics Studio, + other (partner). Core capabilities: data source integration, data services catalog, security controls. Multiple clusters and sources: multi, hybrid.]

Coming Soon

Hive On CDP
• Hive on Kubernetes
– Easy creation/deployment of new Hive compute clusters
– Integration with shared catalog/security/governance (SDX)
– Multiple versions of Hive
– Rolling patch upgrades
• Data Analytics Studio
– More recommendations, including materialized views
– New visualizations for query execution

• Connectors
• Integration with managed streaming/relational services
• Query Scheduler
• Micro-batch streaming queries with Kafka
• Automatic materialized view maintenance
• Automatic statistics collection/update
• ACID
• Provide APIs for native integration with other apps (Impala, Spark, BigSQL)

THANK YOU
