Column Statistics in Hive

09/27/2012
Column Statistics Project
Shreepadma Venugopalan | Platform Engineering

Outline

• Motivation
• New Statistics
• Computing and Persisting Statistics
• Summary
• Further Readings

©2012 Cloudera, Inc. All Rights Reserved.

Why Column Statistics?

• RDBMS Query Optimizer
• Is cost-based
• Uses the statistical properties of the data to cost
alternate execution plans
• Picks the execution plan with the lowest cost

• Hive Query Optimizer
• Is rule-based
• Uses rules of thumb to optimize the execution plan
• Cannot always pick the most efficient execution
plan


Why Column Statistics?
• Statistics in an RDBMS
• Maintained on a per-table, per-partition, and per-column
basis
• Used for a wide range of cost-based query optimizations

• Statistics in Hive
• Maintained at the per-table and per-partition level
• Can be used for some cost-based optimizations,
such as choosing the join method
• Insufficient for other cost-based optimizations, such as join
reordering and two-stage aggregation

Solution: Maintain statistics on columns in Hive
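
To illustrate why join reordering needs column-level statistics, here is a minimal sketch of the textbook equi-join cardinality estimate, which divides the product of the row counts by the larger NDV of the join keys. The table names, row counts, and NDVs are hypothetical, and this is not presented as Hive's optimizer code.

```python
def estimate_join_rows(rows_r, ndv_r, rows_s, ndv_s):
    """Textbook equi-join estimate: |R JOIN S| ~= |R| * |S| / max(NDV_R, NDV_S)."""
    return rows_r * rows_s / max(ndv_r, ndv_s)


# Hypothetical row counts and join-key NDVs, chosen only for illustration.
orders_rows, orders_ndv = 1_000_000, 50_000        # orders.customer_id
customers_rows, customers_ndv = 50_000, 50_000     # customers.id
clicks_rows, clicks_ndv = 20_000_000, 40_000       # clicks.customer_id

orders_customers = estimate_join_rows(
    orders_rows, orders_ndv, customers_rows, customers_ndv)
orders_clicks = estimate_join_rows(
    orders_rows, orders_ndv, clicks_rows, clicks_ndv)

# The smaller intermediate result suggests which join to run first.
first_join = "customers" if orders_customers < orders_clicks else "clicks"
print(first_join, orders_customers, orders_clicks)
```

Without the per-column NDVs, the two candidate join orders cannot be told apart from table-level row counts alone.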


What are the New Statistics?

• Min Column Value
• Max Column Value
• Average Length of Column Value
• Max Length of Column Value
• Number of Distinct Values in a Column
• Number of Null Values in a Column
• Equi-height Histograms
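
As a rough illustration of how an optimizer might use these statistics, the sketch below turns them into predicate selectivity estimates. The class and field names are hypothetical, and the uniform-distribution assumptions are the standard textbook ones, not something the slides specify.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    # Hypothetical field names that mirror the statistics listed above.
    num_rows: int
    num_nulls: int
    ndv: int
    min_value: float
    max_value: float


def non_null_fraction(s):
    return (s.num_rows - s.num_nulls) / s.num_rows


def eq_selectivity(s):
    """col = constant: assume non-null rows are spread evenly over the distinct values."""
    if s.ndv == 0:
        return 0.0
    return non_null_fraction(s) / s.ndv


def lt_selectivity(s, bound):
    """col < bound: assume values are uniformly distributed between min and max."""
    if bound <= s.min_value:
        return 0.0
    if bound > s.max_value:
        return non_null_fraction(s)
    return non_null_fraction(s) * (bound - s.min_value) / (s.max_value - s.min_value)


def is_null_selectivity(s):
    """col IS NULL: read directly from the null count."""
    return s.num_nulls / s.num_rows


stats = ColumnStats(num_rows=1000, num_nulls=100, ndv=90,
                    min_value=0.0, max_value=200.0)
print(eq_selectivity(stats), lt_selectivity(stats, 50.0), is_null_selectivity(stats))
```

A histogram would replace the uniform assumption in `lt_selectivity` with per-bin counts, which matters for skewed columns.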


How to Compute Column Statistics?

• Explicit Computation
• Triggered through an ANALYZE command
• Pros: Admin has fine-grained control over the stats
job
• Cons: Doesn’t piggyback on other operations such as
a scan
• Implicit Computation
• Incrementally compute statistics while loading data
• Pros: Avoids an additional table scan; more efficient
than explicit computation
• Cons: Impacts LOAD performance

How to Compute Column Statistics?

• Aggregate function of the column data
rolled up by table/partition
• Fits nicely into Hive’s UDAF framework
• Expect to scan TBs of data at a time
• Requirement #1: Memory usage has to scale
sub-linearly with data size
• Requirement #2: Stats task has to complete in a
reasonable amount of time
• Given these requirements, some statistics, such
as NDV and histograms, are hard to compute!
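
The scalar statistics fit this model because each one can be kept as a small partial result that is updated per row and merged across tasks. Below is a minimal sketch of that idea for a string column. The method names loosely follow the iterate, merge, and terminate phases of a UDAF, but the class is hypothetical and is not Hive's Java API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PartialStats:
    """Constant-size partial aggregate for one string column (hypothetical)."""
    count: int = 0
    nulls: int = 0
    total_len: int = 0
    max_len: int = 0
    min_val: Optional[str] = None
    max_val: Optional[str] = None

    def iterate(self, value):
        """Fold one row into the partial result."""
        if value is None:
            self.nulls += 1
            return
        self.count += 1
        self.total_len += len(value)
        self.max_len = max(self.max_len, len(value))
        if self.min_val is None or value < self.min_val:
            self.min_val = value
        if self.max_val is None or value > self.max_val:
            self.max_val = value

    def merge(self, other):
        """Combine a partial result produced by another task."""
        self.count += other.count
        self.nulls += other.nulls
        self.total_len += other.total_len
        self.max_len = max(self.max_len, other.max_len)
        for v in (other.min_val, other.max_val):
            if v is None:
                continue
            if self.min_val is None or v < self.min_val:
                self.min_val = v
            if self.max_val is None or v > self.max_val:
                self.max_val = v

    def terminate(self):
        """Produce the final statistics."""
        avg_len = self.total_len / self.count if self.count else 0.0
        return {"nulls": self.nulls, "avg_len": avg_len, "max_len": self.max_len,
                "min": self.min_val, "max": self.max_val}


# Two hypothetical map tasks, each seeing part of the column.
mapper_a, mapper_b = PartialStats(), PartialStats()
for v in ["apple", None, "fig"]:
    mapper_a.iterate(v)
for v in ["banana", "kiwi"]:
    mapper_b.iterate(v)
mapper_a.merge(mapper_b)
result = mapper_a.terminate()
print(result)
```

The partial result stays the same size no matter how many rows are scanned, which is what Requirement #1 asks for. NDV and histograms have no exact partial result of this kind, which is why they need approximation.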


How to Compute NDVs?

• Naïve approach
• Maintain the set of distinct values seen in a column and count them
• Impractical given memory requirements
• Flajolet-Martin approach
• Use probabilistic sketches to estimate NDV
• Memory required is logarithmic in size of data
• Estimates are within 10% of the actual value
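
The Flajolet-Martin idea can be sketched as follows: hash every value, record the position of the lowest set bit in a small bitmap, and estimate the NDV from the lowest bit position that was never recorded. This is a simplified illustration, not the code in the Hive patch. The number of bitmaps, the use of SHA-256 as the hash, and all names are assumptions made for the example.

```python
import hashlib

NUM_BITMAPS = 64   # assumed value; more bitmaps lower the variance of the estimate
PHI = 0.77351      # correction constant from the Flajolet-Martin analysis


def _hash64(value, seed):
    """Seeded 64-bit hash built from SHA-256 (an arbitrary choice for this sketch)."""
    digest = hashlib.sha256(f"{seed}:{value!r}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def _trailing_zeros(h):
    """Index of the lowest set bit; an all-zero hash is mapped to the top position."""
    return (h & -h).bit_length() - 1 if h else 63


class FMSketch:
    def __init__(self):
        self.bitmaps = [0] * NUM_BITMAPS

    def add(self, value):
        for seed in range(NUM_BITMAPS):
            self.bitmaps[seed] |= 1 << _trailing_zeros(_hash64(value, seed))

    def merge(self, other):
        """Sketches from different tasks combine with a bitwise OR."""
        self.bitmaps = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]

    def estimate(self):
        total = 0
        for bitmap in self.bitmaps:
            r = 0
            while (bitmap >> r) & 1:   # find the lowest bit that was never set
                r += 1
            total += r
        return 2 ** (total / NUM_BITMAPS) / PHI


# 3000 rows with 2000 distinct values, split across two hypothetical tasks.
values = [f"user{i % 2000}" for i in range(3000)]
task_a, task_b, whole = FMSketch(), FMSketch(), FMSketch()
for v in values[:1500]:
    task_a.add(v)
for v in values[1500:]:
    task_b.add(v)
for v in values:
    whole.add(v)
task_a.merge(task_b)
print(round(task_a.estimate()))
```

Each bitmap fits in 64 bits regardless of how many rows are scanned, and duplicates never change the sketch, so the memory footprint stays fixed while the merged result is identical to a sketch built over all the data at once.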


How to Compute Histograms?

• Computing equi-height histograms is a quantile
computation/estimation problem
• Merging the quantiles computed at the mappers is
non-trivial
• Deterministic parallel algorithms such as QDigest are
prohibitive in terms of memory required
• Probabilistic stream-counting algorithms
such as Count-Min Sketch can be adapted to
estimate quantiles
• Memory required is logarithmic in size of data
• Computationally expensive!
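
To make the link between equi-height histograms and quantiles concrete, the sketch below computes exact bin boundaries by sorting the whole column. It is a baseline for illustration only: it holds every value in memory, which is exactly what the requirements above rule out at scale. The function name is hypothetical.

```python
def equi_height_boundaries(values, num_bins):
    """Upper boundary of each bin, chosen so that bins hold roughly equal row counts.

    The k-th boundary is the (k / num_bins) quantile of the non-null values.
    """
    data = sorted(v for v in values if v is not None)
    if not data:
        return []
    n = len(data)
    # ceil(k * n / num_bins) - 1 is the index of the last row in bin k.
    return [data[(k * n + num_bins - 1) // num_bins - 1]
            for k in range(1, num_bins + 1)]


uniform = list(range(1, 101))
skewed = [1] * 90 + list(range(2, 12))

print(equi_height_boundaries(uniform, 4))   # [25, 50, 75, 100]
print(equi_height_boundaries(skewed, 4))    # [1, 1, 1, 11]
```

On the skewed column three of the four boundaries collapse onto the same value, which tells an optimizer that most rows share that value. An approximate quantile algorithm would aim to produce similar boundaries without sorting the full column.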


How to Store Column Statistics?

• Extend metastore schema to store new statistics
• Extend metastore Thrift API to update, query and
delete new statistics
• Size of the column statistics record in metastore is
independent of table/partition size
• ~32 bytes/column if histograms are not computed
• ~320 bytes/column for a 20-bin histogram
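
A back-of-the-envelope calculation shows what these record sizes mean for the metastore. The sketch assumes one statistics record per column per partition and uses a hypothetical table shape. Only the two per-column sizes come from the figures above.

```python
BYTES_PER_COLUMN_SCALAR = 32       # without histograms
BYTES_PER_COLUMN_HISTOGRAM = 320   # with a 20-bin histogram


def metastore_bytes(num_columns, num_partitions, with_histogram=False):
    """Approximate metastore footprint, assuming one record per column per partition."""
    per_column = (BYTES_PER_COLUMN_HISTOGRAM if with_histogram
                  else BYTES_PER_COLUMN_SCALAR)
    return num_columns * num_partitions * per_column


# Hypothetical table: 50 columns, 1,000 partitions.
scalar_only = metastore_bytes(50, 1000)
with_hist = metastore_bytes(50, 1000, with_histogram=True)
print(scalar_only, with_hist)   # 1600000 16000000
```

Under these assumptions the footprint grows with the number of columns and partitions, not with the number of rows stored in them.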


Summary

• Scalar statistics have been implemented for primitive
type columns in both tables and partitions

• Patch is available on JIRA (HIVE-1362)

• Computing Equi-Height Histograms is a WIP


Questions?


Further Readings

• Blog
• http://www.cloudera.com/blog/2012/08/column-statistics-in-hive/
• Academic
• A. Gruenheid et al., Query Optimization Using Column
Statistics in Hive.
• S. Chaudhuri, An Overview of Query Optimization in
Relational Systems.
• P. Flajolet and G. N. Martin, Probabilistic Counting
Algorithms for Data Base Applications.
Contact: shreepadma@cloudera.com



Editor's Notes

Explain what cost means in the context of this discussion: the CPU and I/O cost of executing a query plan.
Talk about how each of the statistics will be useful.
Talk about the algorithms used: Flajolet-Martin, histogram construction.
