Column Statistics in Hive
- 2. Outline
• Motivation
• New Statistics
• Computing and Persisting Statistics
• Summary
• Further Readings
©2012 Cloudera, Inc. All Rights Reserved.
- 3. Why Column Statistics?
• RDBMS Query Optimizer
• Is cost-based
• Uses the statistical properties of the data to cost alternative execution plans
• Picks the execution plan with the lowest cost
• Hive Query Optimizer
• Is rule-based
• Uses rules of thumb to optimize the execution plan
• Cannot always pick the most efficient execution plan
- 4. Why Column Statistics?
• Statistics in an RDBMS
• Maintained on a per-table, per-partition, and per-column basis
• Used for a wide range of cost-based query optimizations
• Statistics in Hive
• Maintained at the table and partition level
• Can be used for some cost-based optimizations, such as choosing a join method
• Insufficient for other cost-based optimizations, such as join reordering and two-stage aggregation
Solution: Maintain statistics on columns in Hive
- 5. What are the New Statistics?
• Min Column Value
• Max Column Value
• Average Length of Column Value
• Max Length of Column Value
• Number of Distinct Values in a Column
• Number of Null Values in a Column
• Equi-height Histograms
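As a rough illustration of the scalar statistics listed above, the following Python sketch computes them in one pass over a small in-memory column. The `ColumnStats` class and `compute_column_stats` function are hypothetical names, not part of Hive. For brevity the sketch applies every statistic to a single string column and counts distinct values exactly with a set, which only works for toy inputs.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ColumnStats:
    """Hypothetical container for the scalar statistics on this slide."""
    min_value: Optional[str]
    max_value: Optional[str]
    avg_length: float
    max_length: int
    num_distinct: int
    num_nulls: int


def compute_column_stats(values: Iterable[Optional[str]]) -> ColumnStats:
    """Single pass over a string column; None stands in for SQL NULL."""
    min_value = max_value = None
    total_length = max_length = non_null = num_nulls = 0
    distinct = set()  # exact NDV: acceptable for a toy input, not for TBs of data
    for v in values:
        if v is None:
            num_nulls += 1
            continue
        non_null += 1
        if min_value is None or v < min_value:
            min_value = v
        if max_value is None or v > max_value:
            max_value = v
        total_length += len(v)
        max_length = max(max_length, len(v))
        distinct.add(v)
    avg_length = total_length / non_null if non_null else 0.0
    return ColumnStats(min_value, max_value, avg_length,
                       max_length, len(distinct), num_nulls)


stats = compute_column_stats(["hive", None, "pig", "hive", "impala"])
print(stats)
```

Histograms are left out here because they need the quantile machinery discussed on a later slide.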
- 6. How to Compute Column Statistics?
• Explicit Computation
• Triggered through an ANALYZE command
• Pros: Admin has fine-grained control over the stats job
• Cons: Doesn’t piggyback on other operations such as scans
• Implicit Computation
• Incrementally compute statistics while loading data
• Pros: Avoids an additional table scan; more efficient than explicit computation
• Cons: Impacts LOAD performance
- 7. How to Compute Column Statistics?
• Each statistic is an aggregate function of the column data, rolled up by table/partition
• Fits nicely into Hive’s UDAF framework
• Expect to scan TBs of data at a time
• Requirement #1: Memory usage has to scale sub-linearly with data size
• Requirement #2: Stats task has to complete in a reasonable amount of time
• Given these requirements, some statistics, such as NDV (number of distinct values) and histograms, are hard to compute!
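To make the aggregate-function view concrete, here is a minimal Python sketch of a constant-size partial aggregate that can be built independently on each chunk of data and then merged. The class `StringColumnPartial` and its method names loosely mirror the iterate, merge, and terminate phases of a UDAF, but they are illustrative assumptions, not Hive's actual API.

```python
class StringColumnPartial:
    """Hypothetical partial aggregate; its size does not grow with the data."""

    def __init__(self):
        self.count = 0          # non-null values seen
        self.num_nulls = 0
        self.total_length = 0
        self.max_length = 0

    def iterate(self, value):
        """Fold one column value into the partial state (None means NULL)."""
        if value is None:
            self.num_nulls += 1
            return
        self.count += 1
        self.total_length += len(value)
        self.max_length = max(self.max_length, len(value))

    def merge(self, other):
        """Combine a partial state produced on another chunk of the data."""
        self.count += other.count
        self.num_nulls += other.num_nulls
        self.total_length += other.total_length
        self.max_length = max(self.max_length, other.max_length)

    def terminate(self):
        """Produce the final statistics from the merged state."""
        avg = self.total_length / self.count if self.count else 0.0
        return {"num_nulls": self.num_nulls,
                "avg_length": avg,
                "max_length": self.max_length}


def aggregate(chunk):
    partial = StringColumnPartial()
    for v in chunk:
        partial.iterate(v)
    return partial


# Two chunks stand in for the inputs of two mappers.
mapper_a = aggregate(["hive", None, "pig"])
mapper_b = aggregate(["impala", "hive", None])
mapper_a.merge(mapper_b)
result = mapper_a.terminate()
print(result)
```

Statistics like these merge trivially with four integers of state. NDV and histograms do not, which is why they need the sketch-based approaches on the following slides.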
- 8. How to Compute NDVs?
• Naïve approach
• Keep every distinct value seen in a column and count them
• Impractical given memory requirements
• Flajolet-Martin approach
• Use probabilistic sketches to estimate NDV
• Memory required is logarithmic in the size of the data
• Estimates are within 10% of the actual value
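The following Python sketch shows one way the Flajolet-Martin idea can be realized, using the stochastic-averaging variant (PCSA) from the 1985 paper with 64 bitmaps. It is a sketch under stated assumptions, not the code in the Hive patch: the class name `FMSketch`, the choice of BLAKE2b as the hash function, and the number of bitmaps are all illustrative.

```python
import hashlib

PHI = 0.77351  # correction constant from Flajolet and Martin (1985)


class FMSketch:
    """Illustrative PCSA sketch; not the implementation used in Hive."""

    def __init__(self, num_bitmaps=64):
        self.m = num_bitmaps
        self.bitmaps = [0] * num_bitmaps  # each bitmap is a small integer

    @staticmethod
    def _hash(value):
        # Deterministic 64-bit hash (Python's built-in hash() is salted per process).
        digest = hashlib.blake2b(str(value).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, value):
        h = self._hash(value)
        bucket = h % self.m   # which bitmap this value updates
        rest = h // self.m    # remaining hash bits
        # Index of the least-significant 1-bit of the remaining bits.
        rho = (rest & -rest).bit_length() - 1 if rest else 58
        self.bitmaps[bucket] |= 1 << rho

    def merge(self, other):
        # Bitwise OR makes sketches from different mappers mergeable.
        self.bitmaps = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]

    def estimate(self):
        total = 0
        for bitmap in self.bitmaps:
            r = 0
            while bitmap & (1 << r):  # position of the lowest zero bit
                r += 1
            total += r
        return self.m / PHI * 2 ** (total / self.m)


# Two overlapping chunks; the true NDV of their union is 100000.
first_half = FMSketch()
second_half = FMSketch()
for i in range(50_000):
    first_half.add(f"user-{i}")
for i in range(25_000, 100_000):
    second_half.add(f"user-{i}")
first_half.merge(second_half)
estimate = first_half.estimate()
print(round(estimate))  # an approximation of 100000, not an exact count
```

Each bitmap needs only a number of bits logarithmic in the number of rows, and duplicates never change the state. With 64 bitmaps the paper's analysis gives a standard error of roughly 0.78/sqrt(64), or about 10%.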
- 9. How to Compute Histograms?
• Computing equi-height histograms is a quantile
computation/estimation problem
• Merging the quantiles computed at the mappers is
non-trivial
• Deterministic parallel algorithms such as QDigest are prohibitive in terms of memory required
• Probabilistic stream-counting algorithms such as Count-Min Sketch can be adapted to estimate quantiles
• Memory required is logarithmic in the size of the data
• Computationally expensive!
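To show why an equi-height histogram is a quantile problem, this Python sketch computes exact bin boundaries by sorting the whole column in memory. That is precisely what does not scale to large tables, so treat it as a definition of the target output rather than a usable algorithm; the function name `equi_height_boundaries` is hypothetical.

```python
def equi_height_boundaries(values, num_bins):
    """Return num_bins + 1 boundaries so that each bin holds roughly
    the same number of rows. Exact, but needs all values sorted in memory."""
    data = sorted(values)
    n = len(data)
    if n == 0 or num_bins < 1:
        raise ValueError("need at least one value and one bin")
    # The i-th boundary is the (i / num_bins) quantile of the column.
    boundaries = [data[(i * n) // num_bins] for i in range(num_bins)]
    boundaries.append(data[-1])
    return boundaries


uniform = equi_height_boundaries(range(1, 101), 4)
print(uniform)  # [1, 26, 51, 76, 100]

# A skewed column: 90 rows hold the value 1, the rest are 2..11.
skewed = equi_height_boundaries([1] * 90 + list(range(2, 12)), 4)
print(skewed)   # [1, 1, 1, 1, 11]
```

On the skewed input the boundaries bunch up around the frequent value, which is the information a cost-based optimizer wants. Estimating the same quantiles from mergeable per-mapper summaries is the hard part described above.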
- 10. How to Store Column Statistics?
• Extend metastore schema to store new statistics
• Extend metastore Thrift API to update, query and
delete new statistics
• Size of the column statistics record in metastore is
independent of table/partition size
• ~32 bytes/column if histograms are not computed
• ~320 bytes/column for a 20-bin histogram
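A quick back-of-the-envelope calculation with the approximate per-column figures above. This assumes one statistics record per column per partition, which the slide implies but does not state outright; the function name and the example table shape are made up for illustration.

```python
BYTES_PER_COLUMN_SCALAR = 32      # approximate figure from the slide
BYTES_PER_COLUMN_HISTOGRAM = 320  # approximate figure from the slide (20 bins)


def metastore_stats_bytes(num_columns, num_partitions, with_histograms):
    """Rough metastore footprint, assuming one record per column per partition."""
    per_column = (BYTES_PER_COLUMN_HISTOGRAM if with_histograms
                  else BYTES_PER_COLUMN_SCALAR)
    return num_columns * num_partitions * per_column


# Hypothetical table: 50 columns, 1000 partitions.
scalar_only = metastore_stats_bytes(50, 1000, with_histograms=False)
with_hist = metastore_stats_bytes(50, 1000, with_histograms=True)
print(scalar_only)  # 1600000 bytes, about 1.6 MB
print(with_hist)    # 16000000 bytes, about 16 MB
```

Either way the footprint depends on the number of columns and partitions, not on how many rows they hold.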
- 11. Summary
• Scalar statistics have been implemented for primitive-type columns in both tables and partitions
• Patch is available on JIRA (HIVE-1362)
• Computing Equi-Height Histograms is a WIP
- 13. Further Readings
• Blog
• http://www.cloudera.com/blog/2012/08/column-statistics-in-hive/
• Academic
• A. Gruenheid et al., Query Optimization using Column Statistics in Hive.
• S. Chaudhuri, An Overview of Query Optimization in Relational Systems.
• P. Flajolet and G. N. Martin, Probabilistic Counting Algorithms for Data Base Applications.
Contact: shreepadma@cloudera.com
Editor's Notes
- Explain what cost means in the context of this discussion: the CPU and I/O cost of executing a query plan
- Talk about how each one of the stats will be useful
- Talk about algorithms used: Flajolet-Martin, histogram construction