09/27/2012
Column Statistics Project
Shreepadma Venugopalan | Platform Engineering
Outline

    •   Motivation
    •   New Statistics
    •   Computing and Persisting Statistics
    •   Summary
    •   Further Readings




                      ©2012 Cloudera, Inc. All Rights Reserved.
Why Column Statistics?

    • RDBMS Query Optimizer
      • Is cost-based
      • Uses the statistical properties of the data to cost
        alternate execution plans
      • Picks the execution plan with the lowest cost


    • Hive Query Optimizer
      • Is rule-based
      • Uses rules of thumb to optimize the execution plan
      • Unable to always pick the most efficient execution
        plan


Why Column Statistics?
• Statistics in an RDBMS
     • Maintained on a per-table, per-partition, and per-column
       basis
     • Used for a wide range of cost based query optimizations

• Statistics in Hive
     • Maintained at the per-table and per-partition level
     • Can be used to perform some cost-based optimizations,
       such as choosing the join method
     • Insufficient for other cost-based optimizations, such as join
       reordering and two-stage aggregation

           Solution: Maintain statistics on columns in Hive



What are the New Statistics?

    •   Min Column Value
    •   Max Column Value
    •   Average Length of Column Value
    •   Max Length of Column Value
    •   Number of Distinct Values in a Column
    •   Number of Null Values in a Column
    •   Equi-height Histograms




How to Compute Column Statistics?

    • Explicit Computation
      • Triggered through an ANALYZE command
      • Pros: Admin has fine-grained control over the stats
        job
      • Cons: Needs its own pass over the data; doesn’t
        piggyback on other operations such as a scan
    • Implicit Computation
      • Incrementally compute statistics while loading data
      • Pros: Avoid an additional table scan, more efficient
        than explicit computation
      • Cons: Impacts LOAD performance
How to Compute Column Statistics?

    • Aggregate function of the column data
      rolled up by table/partition
    • Fits nicely into Hive’s UDAF framework
    • Expect to scan TBs of data at a time
      • Requirement #1: Memory usage has to scale
        sub-linearly with data size
      • Requirement #2: Stats task has to complete in a
        reasonable amount of time
      • Given these requirements, some statistics, such as the
        number of distinct values (NDV) and histograms, are
        hard to compute!
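
The aggregate-function view above can be sketched in Python. This is only an illustration of a mergeable partial aggregate in the spirit of a UDAF (iterate over rows, merge partial results, read off the final values); the class and method names are hypothetical and are not Hive's actual UDAF interfaces. The state is a handful of counters, so memory stays constant regardless of how many rows are scanned.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class StringColumnStats:
    """Partial aggregate for one string column (hypothetical name, not a Hive class)."""
    num_values: int = 0    # non-null values seen so far
    num_nulls: int = 0     # null values seen so far
    total_length: int = 0  # sum of value lengths, needed for the average
    max_length: int = 0

    def iterate(self, value: Optional[str]) -> None:
        """Fold one row into the partial aggregate."""
        if value is None:
            self.num_nulls += 1
            return
        self.num_values += 1
        self.total_length += len(value)
        self.max_length = max(self.max_length, len(value))

    def merge(self, other: "StringColumnStats") -> None:
        """Combine with a partial aggregate computed over another split."""
        self.num_values += other.num_values
        self.num_nulls += other.num_nulls
        self.total_length += other.total_length
        self.max_length = max(self.max_length, other.max_length)

    def avg_length(self) -> float:
        return self.total_length / self.num_values if self.num_values else 0.0


def partial(values: Iterable[Optional[str]]) -> StringColumnStats:
    """Compute the partial aggregate for one split of the data."""
    stats = StringColumnStats()
    for value in values:
        stats.iterate(value)
    return stats


# Two splits processed independently, then merged.
merged = partial(["a", None, "abc"])
merged.merge(partial(["ab", None, None, "abcd"]))
```

Min, max, null count and the length statistics all merge this way. NDV and histograms do not, because their exact state grows with the data.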


How to Compute NDVs?

    • Naïve approach
      • Track every distinct value seen in a column and count them
      • Impractical: memory grows with the number of distinct values
    • Flajolet-Martin approach
      • Use probabilistic sketches to estimate NDV
      • Memory required is logarithmic in size of data
      • Estimates are within 10% of the actual value
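
A minimal Python sketch of the Flajolet-Martin idea follows, using the stochastic-averaging variant with several bitmaps. It is an illustration under stated assumptions, not the code in the Hive patch: the class name, the choice of BLAKE2b as the hash and the default of 64 bitmaps are all picked for this example. The constant 0.77351 is the correction factor from Flajolet and Martin's analysis, and the relative standard error is roughly 0.78/sqrt(m) for m bitmaps, which is a little under 10% for m = 64.

```python
import hashlib

PHI = 0.77351  # correction constant from Flajolet and Martin's analysis


class FMSketch:
    """Flajolet-Martin sketch with stochastic averaging (illustrative only)."""

    def __init__(self, num_bitmaps: int = 64):
        self.num_bitmaps = num_bitmaps
        self.bitmaps = [0] * num_bitmaps  # each bitmap is a small integer

    @staticmethod
    def _hash64(value) -> int:
        # Assumption: any well-mixed 64-bit hash works; BLAKE2b is used here.
        digest = hashlib.blake2b(repr(value).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, value) -> None:
        h = self._hash64(value)
        index = h % self.num_bitmaps   # which bitmap this value updates
        rest = h // self.num_bitmaps   # remaining hash bits
        # rho = position of the least-significant 1 bit in the remaining bits
        rho = (rest & -rest).bit_length() - 1 if rest else 63
        self.bitmaps[index] |= 1 << rho

    def merge(self, other: "FMSketch") -> None:
        """Sketches built on different splits combine with a bitwise OR."""
        self.bitmaps = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]

    def estimate(self) -> float:
        total = 0
        for bitmap in self.bitmaps:
            r = 0
            while bitmap & (1 << r):  # index of the lowest unset bit
                r += 1
            total += r
        return self.num_bitmaps / PHI * 2 ** (total / self.num_bitmaps)
```

Because duplicates set the same bit again, repeated values do not change the sketch, and because merging is a bitwise OR, sketches built at the mappers can be combined at the reducer without loss. The estimate is known to be biased when the number of distinct values is small relative to the number of bitmaps.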



How to Compute Histograms?

    • Computing equi-height histograms is a quantile
      computation/estimation problem
    • Merging the quantiles computed at the mappers is
      non-trivial
    • Deterministic parallel algorithms such as QDigest are
      prohibitive in terms of memory required
    • Probabilistic stream-counting algorithms such as
      Count-Min Sketch can be tweaked to estimate
      quantiles
       • Memory required is logarithmic in size of data
       • Computationally expensive!
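
To make the link between equi-height histograms and quantiles concrete, here is a small exact computation in Python. It sorts the whole column in memory, which is precisely what does not scale to the data sizes discussed here, so it only illustrates the definition and is not a stand-in for the streaming algorithms named above. The function name is hypothetical.

```python
from typing import List, Sequence


def equi_height_boundaries(values: Sequence[float], num_bins: int) -> List[float]:
    """Return num_bins + 1 boundaries so that each bin holds about
    len(values) / num_bins rows. Exact, sort-based, in-memory."""
    if not values or num_bins < 1:
        raise ValueError("need at least one value and one bin")
    ordered = sorted(values)
    n = len(ordered)
    boundaries = [ordered[0]]
    for i in range(1, num_bins):
        # The i-th interior boundary is the i/num_bins quantile.
        boundaries.append(ordered[(i * n) // num_bins])
    boundaries.append(ordered[-1])
    return boundaries


uniform = equi_height_boundaries(list(range(1, 101)), 4)
skewed = equi_height_boundaries([1] * 90 + list(range(2, 12)), 4)
```

On the uniform data the boundaries come out evenly spaced (`[1, 26, 51, 76, 100]`), while on the skewed data they collapse onto the heavy value (`[1, 1, 1, 1, 11]`). That sensitivity to skew is the information an optimizer wants from the histogram.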


How to Store Column Statistics?

 • Extend metastore schema to store new statistics
 • Extend metastore Thrift API to update, query and
   delete new statistics
 • Size of the column statistics record in metastore is
   independent of table/partition size
 • ~32 bytes/column if histograms are not computed
 • ~320 bytes/column for 20 bin histogram
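
The per-column figures above allow a rough estimate of the metastore footprint. The helper below is a back-of-the-envelope sketch: it assumes one statistics record per column per partition and reads the ~320 bytes as the whole record size when a 20-bin histogram is stored. Neither assumption is stated on the slide, and the function name is hypothetical.

```python
SCALAR_BYTES_PER_COLUMN = 32      # approximate figure from the slide
HISTOGRAM_BYTES_PER_COLUMN = 320  # approximate figure for a 20-bin histogram


def metastore_stats_bytes(num_columns: int,
                          num_partitions: int = 1,
                          with_histograms: bool = False) -> int:
    """Approximate bytes of column statistics stored for one table.

    Assumes one record per column per partition (an assumption of this sketch).
    """
    per_column = HISTOGRAM_BYTES_PER_COLUMN if with_histograms else SCALAR_BYTES_PER_COLUMN
    return num_columns * max(num_partitions, 1) * per_column


# A hypothetical table with 50 columns and 1,000 partitions.
scalar_only = metastore_stats_bytes(50, 1000)                     # 1,600,000 bytes
with_hist = metastore_stats_bytes(50, 1000, with_histograms=True)  # 16,000,000 bytes
```

Under these assumptions the size depends only on the number of columns and partitions, never on the number of rows.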




Summary

 • Scalar statistics have been implemented for primitive
   type columns in both tables and partitions

 • Patch is available on JIRA (HIVE-1362)

 • Computing Equi-Height Histograms is a WIP




Questions?




Further Readings

 • Blog
     • http://www.cloudera.com/blog/2012/08/column-statistics-in-hive/
 • Academic
     • A. Gruenheid, et al., Query Optimization using Column
       Statistics in Hive.
     • S. Chaudhuri, An Overview of Query Optimization in
       Relational Systems.
     • P. Flajolet and G. N. Martin, Probabilistic Counting
       Algorithms for Data Base Applications.
                Contact: shreepadma@cloudera.com




Editor's Notes

  1. Explain what cost means in the context of this discussion: the CPU and I/O cost of executing a query plan
  2. Talk about how each of the stats will be useful
  3. Talk about the algorithms used: Flajolet-Martin, histogram construction