Hive Data Modeling and Query Optimization
Eyad Garelnabi
Improve your Hive query performance through effective modeling and query optimization.
1.
Hive Data Modeling & Query Optimization
Eyad Garelnabi
© Hortonworks Inc. 2011 – 2015. All Rights Reserved
2.
Agenda
• File Formats
• Hive Table Types
• Hive Data Layout
• What About Data Modeling
• Hive Join Strategies
• Optimizing Queries
3.
File Formats: Text, Parquet, ORC, etc…
4.
Text
• Requires SerDes
  – CSV: comma delimited
  – Additional SerDes online
• Does not compress well
• Row-based separation
• Slow to read and write
• Usually used for initial data load
5.
Parquet
• Faster access to data
• Efficient compression
• Effective for select queries
6.
ORCFile
• High Performance: Splittable, columnar storage file
• Efficient Reads: Data is broken into large “stripes” for efficient reads
• Fast Filtering: Built-in index, min/max, and metadata for fast filtering of blocks; bloom filters if desired
• Efficient Compression: Complex row types are decomposed into primitives, giving massive compression and efficient comparisons for filtering
• Precomputation: Built-in aggregates per block (min, max, count, sum, etc.)
• Proven at 300 PB scale: Facebook uses ORC for their 300 PB Hive Warehouse
7.
etc…
• Avro
  – Schema defined in JSON; data stored in a binary, row-based format
  – Good for select * queries
  – Slow to read for other queries
• Sequence
  – Optimized for Java MapReduce jobs
  – Inefficient for Hive
  – Rarely used
8.
High Compression with ORCFile
9.
Hive Tables: External, Managed, Views
10.
External Tables
• Hive manages schema/metadata
• When dropped, only schema is deleted

    CREATE EXTERNAL TABLE my_external_table (
      `id` int,
      `name` string,
      `department` string,
      `country` string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS ORC;
11.
Internal/Managed Tables
• Hive manages schema and data
• Data is saved by default in /user/hive/warehouse/my_managed_table
• When dropped, both schema and data are deleted

    CREATE TABLE my_managed_table (
      `id` int,
      `name` string,
      `department` string,
      `country` string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS PARQUET
    -- LOCATION overrides the default warehouse directory
    LOCATION '/usr/Scotiabank/demo';
12.
Views
• Virtual table
• No data is stored to HDFS
• When dropped, only schema is deleted

    -- A view's column list takes names only; types come from the SELECT
    CREATE VIEW my_view (`id`, `name`, `department`, `country`)
    AS {select_statement};
13.
Hive Data Layout: Partitioning, Bucketing and Skews
14.
Data Abstractions in Hive
Partitions, buckets and skews facilitate faster, more direct data access.
[Diagram: Database, Table, Partition, Bucket, Skewed Keys, Unskewed Keys; marked "Optional Per Table"]
15.
Partitioning
• Breaks up data horizontally by column value sets
• When partitioning, you use one or more “virtual” columns to break up the data
• Virtual columns cause directories to be created in HDFS
  – Files for that partition are stored within that subdirectory
• Partitioning makes queries go fast
  – Partitioning works particularly well when querying with the “virtual column”
  – If queries use various columns, it may be hard to decide which columns to partition by
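To make the directory layout concrete, here is a minimal Python sketch (a toy model, not Hive code) of how a filter on the partition column lets a reader skip whole directories. The table root `/warehouse/employees` and the helper names are hypothetical; only the `column=value` directory naming follows Hive's convention.

```python
# Toy model of partition pruning: each partition is a directory named
# "<column>=<value>" under the table root (the root path is hypothetical).
TABLE_ROOT = "/warehouse/employees"

def partition_dir(column, value):
    """Build the directory path for one partition value."""
    return f"{TABLE_ROOT}/{column}={value}"

def prune(partition_dirs, column, wanted_values):
    """Keep only the directories whose partition value matches the filter."""
    wanted = {f"{column}={v}" for v in wanted_values}
    return [d for d in partition_dirs if d.rsplit("/", 1)[-1] in wanted]

all_dirs = [partition_dir("country", c) for c in ("canada", "usa", "mexico")]
# A query with WHERE country = 'canada' only needs to read one directory.
to_scan = prune(all_dirs, "country", ["canada"])
print(to_scan)
```

A filter on a non-partition column gives the reader nothing to prune on, which is why the choice of partition column matters.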
16.
Partitioning
• Static Partitioning
  – Partition values are specified explicitly when loading data

    CREATE TABLE static_partitioned_table (
      `id` int,
      `name` string,
      `department` string
    )
    PARTITIONED BY (`country` string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS ORC;

    -- The target partition is named explicitly in the INSERT
    INSERT OVERWRITE TABLE static_partitioned_table
    PARTITION (country='canada')
    SELECT id, name, department
    FROM my_external_table
    WHERE country='canada';
17.
Partitioning
• Dynamic Partitioning
  – Partition values are taken automatically from the data being inserted

    CREATE TABLE dynamic_partitioned_table (
      `id` int,
      `name` string,
      `department` string
    )
    PARTITIONED BY (`country` string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS ORC;

    -- The partition column must come last in the SELECT list
    INSERT OVERWRITE TABLE dynamic_partitioned_table
    PARTITION (country)
    SELECT id, name, department, country
    FROM my_external_table;
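The routing that dynamic partitioning performs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows are made up, and `route_rows` is a hypothetical helper, not part of Hive.

```python
from collections import defaultdict

# Made-up rows as (id, name, department, country); the last field plays the
# role of the dynamic partition column.
rows = [
    (1, "Ana", "sales", "canada"),
    (2, "Bo", "hr", "usa"),
    (3, "Cy", "sales", "canada"),
]

def route_rows(rows):
    """Group rows by the value of the trailing partition column."""
    partitions = defaultdict(list)
    for *data_columns, country in rows:
        # The partition value lives in the directory name, not in the file.
        partitions[f"country={country}"].append(tuple(data_columns))
    return dict(partitions)

layout = route_rows(rows)
for directory, contents in sorted(layout.items()):
    print(directory, contents)
```

Each distinct value produces its own directory, so a high-cardinality column would create a very large number of partitions.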
18.
Partitioning
• IMPORTANT: dynamic partitioning will not work by default
  – Before loading data, make sure:
  – set hive.exec.dynamic.partition=true
• Also, set the maximum number of partitions to avoid going overboard

    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    SET hive.exec.max.dynamic.partitions=1000;
    SET hive.exec.max.dynamic.partitions.pernode=1000;
19.
Partitioning
• Multi-layer partitioning is possible but often not efficient
  – The number of partitions becomes too large and will overwhelm the Metastore
• Limit the number of partitions. Fewer may be better
  – 1000 partitions will often perform better than 10000
• Hadoop likes big files: avoid creating partitions with mostly small files
• Only use when
  – Data is very large and there are lots of table scans
  – Data is queried against a particular column frequently
  – Column data must have low cardinality
20.
Partitioning
• Often better to partition by Date, not Year/Month
  – By date you will have at most about 365 partitions per year
  – Partitioning by date will allow you to easily perform queries that require 'BETWEEN' and 'IN'. ( https://community.hortonworks.com/questions/29031/best-pratices-for-hive-partitioning-especially-by.html )

    SELECT * FROM TableA
    WHERE DateStamp IN ('2015-01-01', '2015-02-03', '2016-01-01')

VS

    SELECT * FROM TableB
    WHERE (YEAR=2015 AND MONTH=01 AND DAY=01)
       OR (YEAR=2015 AND MONTH=02 AND DAY=03)
       OR (YEAR=2016 AND MONTH=01 AND DAY=01)
21.
Bucketing
• Breaks up data vertically by hashed key sets
• When bucketing, you specify the number of buckets
• Works particularly well when a lot of queries contain joins

    CREATE TABLE bucketed_table (
      `id` int,
      `name` string,
      `department` string,
      `country` string
    )
    CLUSTERED BY (id) INTO 12 BUCKETS
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS ORC;
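Bucket assignment can be pictured as "hash the key, then take it modulo the bucket count". The Python sketch below assumes that the hash of an integer key is the integer itself; the hash function actually used depends on the Hive version, so treat `bucket_for` as a hypothetical stand-in rather than Hive's implementation.

```python
NUM_BUCKETS = 12  # same count as INTO 12 BUCKETS in the DDL

def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Assign an integer key to a bucket: non-negative hash modulo bucket count.

    Assumption: the hash of an integer is the integer itself.
    """
    return (key & 0x7FFFFFFF) % num_buckets

ids = [11911, 11912, 11913, 11914, 11915]
assignment = {i: bucket_for(i) for i in ids}
print(assignment)
```

Because the assignment depends only on the key and the bucket count, two tables bucketed on the same key with the same count place matching rows in buckets with the same number, which is what makes bucketed joins cheap.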
22.
Bucketing
• IMPORTANT: the bucketing specified at table creation is NOT enforced when the table is written to…
• So when writing data, make sure:
  – hive.enforce.bucketing = true

    SET hive.enforce.bucketing = true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    INSERT INTO TABLE sale PARTITION (xdate, state)
    SELECT * FROM staging_table;
23.
Bucketing
• Works well when there is a very large data volume and most queries are joins
• Partitioning and bucketing may be combined, of course
  – Be careful not to wind up with very many small files that can overwhelm the NameNode
  – Ideal file size is 200–500 MB
• Partition and bucket frequently joined tables in a similar way to improve join efficiency

    CREATE TABLE sale (
      id int,
      amount decimal,
      ...
    )
    PARTITIONED BY (xdate string, state string)
    CLUSTERED BY (id) SORTED BY (id) INTO 256 BUCKETS;
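A quick back-of-the-envelope check helps when combining partitions and buckets. The Python sketch below assumes data is spread evenly and that each bucket in each partition becomes one file; the table size and partition counts are hypothetical numbers chosen for illustration.

```python
def average_file_size_mb(total_size_gb, num_partitions, buckets_per_partition):
    """Average file size if data is spread evenly, one file per bucket."""
    num_files = num_partitions * buckets_per_partition
    return total_size_gb * 1024 / num_files

# Hypothetical table: 1 TB of data, 365 daily partitions, 256 buckets each.
size = average_file_size_mb(1024, 365, 256)
print(f"{size:.1f} MB per file")

# Same data with 8 buckets per partition.
size_fewer = average_file_size_mb(1024, 365, 8)
print(f"{size_fewer:.1f} MB per file")
```

Under these assumptions the first layout yields files of roughly 11 MB, far below the 200–500 MB target, while the second lands at about 359 MB.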
24.
Skewed Tables and List Bucketing
• Use when a table is skewed, with one or more column values taking up most of the space
• By specifying the values that appear most often in the keys (in this example 'key1' and 'key2'), Hive will split those into separate files automatically and take this into account during queries, so that it can skip a whole file if possible
• “STORED AS DIRECTORIES” is called “list bucketing”
  – Table is skewed, but each part is also stored as a separate directory
  – 1 directory for each skewed key value, 1 directory for all other keys

    CREATE TABLE mytable (
      key STRING,
      value STRING,
      …
    )
    SKEWED BY (key) ON ('key1', 'key2')
    STORED AS DIRECTORIES;
25.
Data Abstractions in Hive
Partitions, buckets and skews facilitate faster, more direct data access.
[Diagram: Database, Table, Partition, Bucket, Skewed Keys, Unskewed Keys; marked "Optional Per Table"]
26.
Best Practice: When to use Partitioning/Bucketing/Skews
• Partitioning is useful for chronological columns that don’t have a very high number of possible values
  – You don’t want to end up with millions of partitions
• Bucketing is most useful for tables that are “most often” joined together on the same key
  – For example: joins by a patient-ID or customer-ID
  – Make sure the bucket count matches on both tables involved in the join
• Skews are useful when one or two column values dominate the table
  – Hive can avoid whole files when querying
27.
What About Data Modeling?
28.
Data Modeling in Hadoop
• No data modeling à la DW/RDBMS
• Decisions on data layout happen at the file/folder level
  – This is where partitioning, bucketing and skewing come in
• How far should we denormalize?
  – As far as it makes sense
  – Usually denormalize frequently joined tables
  – Be mindful of the memory implications of very wide tables (thousands of columns)
29.
Data Modeling in Hadoop
• Can we alter an existing table to add partitions or buckets?
  – No
  – Create a new partitioned/bucketed table and copy the data over
• Are there limits on the number of columns possible in Hive?
  – No “hard” limit from Hive
  – File format memory requirements may limit us though
  – ORC tested with up to 20,000 columns before getting out-of-memory
  – Be mindful of memory implications when designing wide tables
30.
Hive Join Strategies: Choose the right JOIN
31.
Shuffle Joins – the default

customer
first    last      id
Nick     Toner     11911
Jessie   Simonds   11912
Kasi     Lamers    11913
Rodger   Clayton   11914
Verona   Hollen    11915

order
cid      price   quantity
4150     10.50   3
11914    12.25   27
3491     5.99    5
2934     39.99   22
11914    40.50   10

    -- order is a reserved word, so it is quoted with backticks
    SELECT * FROM customer JOIN `order` ON customer.id = `order`.cid;

M
  { id: 11911, { first: Nick, last: Toner }}
  { id: 11914, { first: Rodger, last: Clayton }}
  …
M
  { cid: 4150, { price: 10.50, quantity: 3 }}
  { cid: 11914, { price: 12.25, quantity: 27 }}
  …
R
  { id: 11914, { first: Rodger, last: Clayton }}
  { cid: 11914, { price: 12.25, quantity: 27 }}
R
  { id: 11911, { first: Nick, last: Toner }}
  { cid: 4150, { price: 10.50, quantity: 3 }}
  …

Identical keys are shuffled to the same reducer. The join is done reduce-side. Expensive from a network utilization standpoint.
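The map, shuffle and reduce steps can be sketched in plain Python using the slide's sample rows. This is a simplified single-process model, not how Hive is implemented; the reducer count and the `key % NUM_REDUCERS` routing rule are assumptions made for illustration.

```python
from collections import defaultdict

customers = [("Nick", "Toner", 11911), ("Jessie", "Simonds", 11912),
             ("Kasi", "Lamers", 11913), ("Rodger", "Clayton", 11914),
             ("Verona", "Hollen", 11915)]
orders = [(4150, 10.50, 3), (11914, 12.25, 27), (3491, 5.99, 5),
          (2934, 39.99, 22), (11914, 40.50, 10)]

NUM_REDUCERS = 2  # hypothetical reducer count

def map_phase():
    """Emit (join_key, tagged_record) pairs from both tables."""
    for first, last, cid in customers:
        yield cid, ("customer", (first, last))
    for cid, price, quantity in orders:
        yield cid, ("order", (price, quantity))

def shuffle(pairs):
    """Send every record with the same key to the same reducer."""
    reducers = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for key, record in pairs:
        reducers[key % NUM_REDUCERS][key].append(record)
    return reducers

def reduce_phase(reducers):
    """Within each reducer, pair customers and orders that share a key."""
    joined = []
    for groups in reducers:
        for key, records in sorted(groups.items()):
            custs = [r for tag, r in records if tag == "customer"]
            ords = [r for tag, r in records if tag == "order"]
            joined += [(key, *c, *o) for c in custs for o in ords]
    return joined

result = reduce_phase(shuffle(map_phase()))
print(result)
```

Every record from both tables crosses the shuffle step, including the ones that never find a match, which is where the network cost comes from.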
32.
Broadcast Join (aka Map-side Join)
• Star schemas (e.g. dimension tables)
• Good when table is small enough to fit in RAM
33.
Using Broadcast Join
• Set hive.auto.convert.join = true
• Hive then automatically uses a broadcast join, if possible
  – Small tables are held in memory by all nodes
• Used for star-schema type joins common in data warehousing use cases
• hive.auto.convert.join.noconditionaltask.size determines the data size for automatic conversion to a broadcast join:
  – Default 10MB is too low (check your default)
  – Recommended: 256MB for 4GB container
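At its core, a broadcast join is a hash join: the small table is loaded into an in-memory lookup and the large table is streamed past it once. The Python sketch below illustrates that idea with the sample customer and order rows used elsewhere in the deck; `broadcast_join` is a hypothetical helper, and a real engine would ship the lookup to every node.

```python
def broadcast_join(small_table, large_table):
    """Hash join: build a dict from the small table, stream the large one."""
    lookup = {cid: (first, last) for first, last, cid in small_table}
    for cid, price, quantity in large_table:
        if cid in lookup:
            first, last = lookup[cid]
            yield (cid, first, last, price, quantity)

customers = [("Nick", "Toner", 11911), ("Jessie", "Simonds", 11912),
             ("Kasi", "Lamers", 11913), ("Rodger", "Clayton", 11914),
             ("Verona", "Hollen", 11915)]
orders = [(4150, 10.50, 3), (11914, 12.25, 27), (3491, 5.99, 5),
          (2934, 39.99, 22), (11914, 40.50, 10)]

joined = list(broadcast_join(customers, orders))
print(joined)
```

No shuffle is needed because each reader of the large table already holds the whole small table, which is why the small side must fit in memory.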
34.
Sort-Merge-Bucket join: When both are too large for memory

customer
first    last      id
Nick     Toner     11911
Jessie   Simonds   11912
Kasi     Lamers    11913
Rodger   Clayton   11914
Verona   Hollen    11915

order
cid      price   quantity
4150     10.50   3
11914    12.25   27
11914    40.50   10
12337    39.99   22
15912    40.50   10

    -- order is a reserved word, so it is quoted with backticks
    SELECT * FROM customer JOIN `order` ON customer.id = `order`.cid;

    CREATE TABLE customer (id int, first string, last string)
    CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS;

    CREATE TABLE `order` (cid int, price float, quantity int)
    CLUSTERED BY (cid) SORTED BY (cid) INTO 32 BUCKETS;
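Once both sides are sorted on the join key, matching rows can be found in a single forward pass over each input. The Python sketch below shows that merge step for one pair of buckets, using the slide's sample rows; it is a simplified illustration, and `sort_merge_join` is a hypothetical helper rather than Hive code.

```python
def sort_merge_join(left, right):
    """Join two lists of (key, value) pairs that are already sorted by key."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the run of equal keys on each side, then emit their product.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lkey:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            for _, lval in left[i:i_end]:
                for _, rval in right[j:j_end]:
                    out.append((lkey, lval, rval))
            i, j = i_end, j_end
    return out

customers = [(11911, "Nick Toner"), (11912, "Jessie Simonds"),
             (11913, "Kasi Lamers"), (11914, "Rodger Clayton"),
             (11915, "Verona Hollen")]
orders = [(4150, 10.50), (11914, 12.25), (11914, 40.50),
          (12337, 39.99), (15912, 40.50)]
merged = sort_merge_join(customers, orders)
print(merged)
```

Neither input has to fit in memory, since the merge only ever looks at the current position on each side.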
35.
Hive Join Strategies

Type | Approach | Pros | Cons
Shuffle Join | Join keys are shuffled using map/reduce and joins performed reduce side. | Works regardless of data size or layout. | Most resource-intensive and slowest join type.
Broadcast Join | Small tables are loaded into memory in all nodes, mapper scans through the large table and joins. | Very fast, single scan through largest table. | All but one table must be small enough to fit in RAM.
Sort-Merge-Bucket Join | Mappers take advantage of co-location of keys to do efficient joins. | Very fast for tables of any size. | Data must be sorted and bucketed ahead of time.

All join types are now more efficient with Tez
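The trade-offs in the table can be condensed into a small decision rule. The Python sketch below is a deliberate simplification, not Hive's optimizer: `choose_join` is a hypothetical helper, and `threshold_mb` stands in for hive.auto.convert.join.noconditionaltask.size, with 256 taken from the recommendation earlier in the deck.

```python
def choose_join(table_sizes_mb, threshold_mb=256, bucketed_and_sorted=False):
    """Pick a join strategy from table sizes (a simplified decision rule)."""
    # Broadcast is possible when every table except the largest fits in memory.
    small_total = sum(sorted(table_sizes_mb)[:-1])
    if small_total <= threshold_mb:
        return "broadcast"
    # Two large tables: a merge join needs matching buckets sorted on the key.
    if bucketed_and_sorted:
        return "sort-merge-bucket"
    # Fallback that works for any size or layout.
    return "shuffle"

print(choose_join([50_000, 40]))
print(choose_join([50_000, 30_000], bucketed_and_sorted=True))
print(choose_join([50_000, 30_000]))
```

The sizes in the usage lines are made-up values in MB; a real plan also depends on statistics, join type and engine settings.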
36.
More Join Strategies
• Take a look at this blog post for an explanation of joins: http://henning.kropponline.de/2016/10/09/hive-join-strategies/
• A search on Google will return more join strategies than what has been covered here
• Keep in mind that most benchmarks were done using MapReduce processing rather than Tez. Your performance should be better due to the in-memory processing nature of Tez.
37.
Writing fast queries: Techniques to optimize your queries
38.
Optimizing Hive queries
1. Use Tez
2. Use ORCFile
3. Use Vectorization
4. Use Cost Based Optimization (CBO)
5. Write good SQL
6. Use Hive Explain
7. Consider Hive LLAP
39.
Technique #1: Tez vs MR
40.
Understanding Tez vs MapReduce
41.
Technique #2: Use ORCFile
42.
ORCFile – Efficient Columnar Format
• High Performance: Splittable, columnar storage file
• Efficient Reads: Data is broken into large “stripes” for efficient reads
• Fast Filtering: Built-in index, min/max, and metadata for fast filtering of blocks; bloom filters if desired
• Efficient Compression: Complex row types are decomposed into primitives, giving massive compression and efficient comparisons for filtering
• Precomputation: Built-in aggregates per block (min, max, count, sum, etc.)
• Proven at 300 PB scale: Facebook uses ORC for their 300 PB Hive Warehouse
43.
Technique #3: Use Vectorization
44.
Using Vectorization
• Vectorized query execution is a Hive feature that greatly reduces CPU usage for typical query operations such as scans, filters, aggregates, and joins
• Vectorized query execution streamlines operations by processing a block of 1024 rows at a time (instead of one row at a time)
• ONLY works with ORCFiles
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled = true;
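One way to confirm that a query is actually vectorized is to inspect its plan. This is a sketch under assumptions: `web_logs_orc` is a hypothetical ORC table, and the exact wording of the plan output varies by Hive version.

```sql
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;

EXPLAIN
SELECT status, COUNT(*) AS hits
FROM web_logs_orc            -- hypothetical table stored as ORC
WHERE status >= 500
GROUP BY status;

-- In the plan, a vectorized stage is typically marked with a line such as:
--   Execution mode: vectorized
-- If that marker is missing, check the storage format and the column
-- types or functions used, since unsupported ones disable vectorization.
```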
45.
Technique #4: Use Cost-Based Optimization
46.
Hive Cost-Based Optimization (CBO)
• The Cost-Based Optimization (CBO) engine uses statistics within Hive tables to produce optimal query plans
• Two types of stats are used for optimization:
o Table stats
o Column stats
• Uses an open-source framework called Calcite (formerly Optiq)
47.
Step 1: Ensure Hive has table statistics
• Stats are collected at the table level automatically when hive.stats.autogather is enabled:
SET hive.stats.autogather=true;
• If you have an existing table without stats collected:
ANALYZE TABLE table-name COMPUTE STATISTICS;
• For column-level statistics:
– HDP 2.1: ANALYZE TABLE table-name COMPUTE STATISTICS FOR COLUMNS col1, col2;
– HDP 2.2: ANALYZE TABLE table-name COMPUTE STATISTICS FOR COLUMNS;
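To check that statistics were actually gathered, the table metadata can be inspected afterwards. A small sketch, assuming a hypothetical unpartitioned table `orders` with a column `customer_id`:

```sql
-- Gather table-level and column-level statistics
ANALYZE TABLE orders COMPUTE STATISTICS;
ANALYZE TABLE orders COMPUTE STATISTICS FOR COLUMNS;

-- Table-level stats appear among the table parameters
-- (for example numRows, rawDataSize, totalSize)
DESCRIBE FORMATTED orders;

-- Column-level stats for one column
-- (for example min, max, number of nulls, distinct count)
DESCRIBE FORMATTED orders customer_id;
```

Statistics go stale as data changes, so it is common to rerun ANALYZE after large loads.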
48.
CBO with Partitioned Tables
• When a table is partitioned, you need to specify the partition when collecting statistics:
ANALYZE TABLE table-name PARTITION (col1='x') COMPUTE STATISTICS;
ANALYZE TABLE table-name PARTITION (col1='x') COMPUTE STATISTICS FOR COLUMNS;
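A concrete sketch, assuming a hypothetical table `page_views` partitioned by a string column `ds`. Naming the partition column without a value asks Hive to analyze every partition, which can be expensive on large tables.

```sql
-- Statistics for a single partition
ANALYZE TABLE page_views PARTITION (ds='2015-06-01') COMPUTE STATISTICS;
ANALYZE TABLE page_views PARTITION (ds='2015-06-01') COMPUTE STATISTICS FOR COLUMNS;

-- Statistics for all partitions: give the partition column with no value
ANALYZE TABLE page_views PARTITION (ds) COMPUTE STATISTICS;

-- Inspect the statistics recorded for one partition
DESCRIBE FORMATTED page_views PARTITION (ds='2015-06-01');
```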
49.
Step 2: Set Hive properties to enable CBO
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
And now every query you run will use CBO…
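As an illustration of what these properties buy you, the sketch below adds one related setting, `hive.stats.fetch.column.stats`, which is a standard Hive property but is not mentioned on the slide. The `orders` table and its `order_date` column are hypothetical.

```sql
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;   -- assumption: lets the planner read column stats from the metastore

-- With up-to-date table and column statistics, simple aggregates like
-- these can often be answered from the metastore without scanning data
SELECT COUNT(*) FROM orders;
SELECT MIN(order_date), MAX(order_date) FROM orders;
```

If the statistics are missing or stale, Hive falls back to running the query normally, so the settings are safe to leave on.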
50.
Technique #5: Write Smart SQL
51.
Query design matters
• This is Big Data we're talking about
• So consider performance in every query you write
• There are many ways to write SQL with the same functional results, but often with varying performance characteristics
• Avoid joins when possible, and choose the right join when not
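As one hedged example of two queries with the same result but different cost, the sketch below replaces a self-join with a single scan. The `events` table, its `ds` partition column, and the `event_type` values are hypothetical.

```sql
-- Version 1: two scans of the same table plus a join
SELECT v.user_id, v.views, p.purchases
FROM (
  SELECT user_id, COUNT(*) AS views
  FROM events
  WHERE ds = '2015-06-01' AND event_type = 'view'
  GROUP BY user_id
) v
JOIN (
  SELECT user_id, COUNT(*) AS purchases
  FROM events
  WHERE ds = '2015-06-01' AND event_type = 'purchase'
  GROUP BY user_id
) p
ON v.user_id = p.user_id;

-- Version 2: one scan, no join, using conditional aggregation.
-- The HAVING clause keeps only users with both event types,
-- matching the inner join above.
SELECT user_id,
       SUM(CASE WHEN event_type = 'view' THEN 1 ELSE 0 END)     AS views,
       SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) AS purchases
FROM events
WHERE ds = '2015-06-01'
  AND event_type IN ('view', 'purchase')
GROUP BY user_id
HAVING SUM(CASE WHEN event_type = 'view' THEN 1 ELSE 0 END) > 0
   AND SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) > 0;
```

Both versions filter on the partition column `ds`, so only one partition is read, and both select only the columns they need.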
52.
Technique #6: Use Hive Explain
53.
HIVE EXPLAIN – understanding your query plan
EXPLAIN {Hive Query}
• It is an advanced tool to debug what Hive is doing.
• Look at the sequence of operations and make sure it looks reasonable.
• Validate the join type (e.g., we asked for a map-side join; did it get executed that way?).
At the end of the day, if the plan is bad, everything else (ORC, vectorization, etc.) may not matter.
Take a look at the link below on how to understand and analyze your query plan: https://www.slideshare.net/HadoopSummit/how-to-understand-and-analyze-apache-hive-query-execution-plan-for-performance-debugging
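A short sketch of validating the join type with EXPLAIN. The tables `sales` and `stores` are hypothetical, and the operator names in the comments are what plans commonly show; the exact labels depend on the Hive version and execution engine.

```sql
EXPLAIN
SELECT st.region, SUM(s.amount) AS total_amount
FROM sales s                 -- hypothetical large table, partitioned by ds
JOIN stores st               -- hypothetical small table
  ON s.store_id = st.store_id
WHERE s.ds = '2015-06-01'
GROUP BY st.region;

-- Things to look for in the output:
--   "Map Join Operator"      -> a broadcast (map-side) join was chosen
--   "Merge Join Operator" or a "Join Operator" in a reduce stage
--                            -> a shuffle join was chosen
--   "Statistics:" lines      -> whether row counts and sizes are known
--
-- EXPLAIN EXTENDED prints additional detail for the same plan.
```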
54.
Technique #7: Consider Hive LLAP
55.
LLAP Key Benefits
• Uses persistent query servers to avoid long startup times and deliver fast SQL.
• Enables queries as fast as sub-second in Hive by keeping all data and servers running and in memory all the time.
• Shares its in-memory cache among all SQL users, maximizing the use of this scarce resource.
• Has fine-grained resource management and preemption, making it great for concurrent access across many users.
• Great for cloud because it caches data in memory and keeps it compressed, overcoming long cloud storage access times and stretching the amount of data you can fit in RAM.
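A hedged sketch of routing a session's queries to LLAP, assuming LLAP daemons have already been provisioned by an administrator and that the Hive version is 2.x or later. The property names are standard Hive settings; the `sales` table is hypothetical, and many distributions set these values cluster-wide rather than per session.

```sql
-- LLAP runs on top of Tez
SET hive.execution.engine=tez;

-- Ask Hive to run query fragments inside the LLAP daemons
SET hive.execution.mode=llap;
SET hive.llap.execution.mode=all;   -- other documented values include none, map, auto, only

SELECT store_id, SUM(amount) AS total_amount
FROM sales                   -- hypothetical table
WHERE ds = '2015-06-01'
GROUP BY store_id;
```

Repeating the query shortly afterwards usually shows the benefit of the shared cache, since the data no longer has to be read from storage.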
56.
Thank You