Ron Hu, Zhenhua Wang
Huawei Technologies, Inc.
Cardinality Estimation through
Histogram in Apache Spark 2.3
#DevSAIS13
Agenda
• Catalyst Architecture
• Cost Based Optimizer in Spark 2.2
• Statistics Collected
• Histogram Support in Spark 2.3
• Configuration Parameters
• Q & A
Catalyst Architecture
[Figure: Catalyst architecture diagram, with the annotation "Spark optimizes query plan here"]
Reference: Deep Dive into Spark SQL’s Catalyst Optimizer, a Databricks engineering blog
Query Optimizer in Spark SQL
• Spark SQL’s query optimizer is based on both
rules and cost.
• Most of Spark SQL optimizer’s rules are
heuristic rules.
– PushDownPredicate, ColumnPruning,
ConstantFolding, etc.
• Cost based optimization (CBO) was added in
Spark 2.2.
Cost Based Optimizer in Spark 2.2
• It was a good and working CBO framework to start
with.
• Focused on
– Statistics collection,
– Cardinality estimation,
– Build side selection, broadcast vs. shuffled join, join
reordering, etc.
• Used a heuristic formula for the cost function in terms
of the cardinality and data size of each operator.
Statistics Collected
• Collect Table Statistics information
• Collect Column Statistics information
• Goal:
– Calculate the cost for each operator in terms of
number of output rows, size of output, etc.
– Based on the cost calculation, adjust the query
execution plan
Table Statistics Collected
• Command to collect statistics of a table.
– Ex: ANALYZE TABLE table-name COMPUTE
STATISTICS
• It collects table-level statistics and saves them into the
metastore.
– Number of rows
– Table size in bytes
Column Statistics Collected
• Command to collect column level statistics of individual columns.
– Ex: ANALYZE TABLE table-name COMPUTE STATISTICS
FOR COLUMNS column-name1, column-name2, ….
• It collects column-level statistics and saves them into the metastore.
String/Binary type
✓ Distinct count
✓ Null count
✓ Average length
✓ Max length
Numeric/Date/Timestamp type
✓ Distinct count
✓ Max
✓ Min
✓ Null count
✓ Average length (fixed length)
✓ Max length (fixed length)
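As a minimal sketch, the two ANALYZE commands above can be generated for a given table. The helper name `analyze_statements` and the table and column names are hypothetical; only the statement syntax comes from the slides.

```python
def analyze_statements(table, columns=()):
    """Return the ANALYZE TABLE statements for table-level and, optionally,
    column-level statistics, using the syntax shown on the slides."""
    statements = [f"ANALYZE TABLE {table} COMPUTE STATISTICS"]
    if columns:
        column_list = ", ".join(columns)
        statements.append(
            f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS {column_list}"
        )
    return statements


# Hypothetical table and column names, for illustration only.
for statement in analyze_statements("customers", ["age", "city"]):
    print(statement)
```

Assuming a live SparkSession bound to the name `spark`, each string could then be passed to `spark.sql(...)`.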
Real World Data Are Often Skewed
#DevSAIS13 – Cardinality Estimation by Hu and Wang
Histogram Support in Spark 2.3
• A histogram is effective in handling
skewed distributions.
• We developed the equi-height histogram
in Spark 2.3.
• An equi-height histogram is better than
an equi-width histogram:
• An equi-height histogram can use multiple
buckets to represent a very skewed value.
• An equi-width histogram cannot give the right
frequency when a skewed value falls in
the same bucket as other values.
[Figure: two panels, Equi-Width and Equi-Height; axes: column interval vs. frequency / frequency density]
Histogram Algorithm
– Each histogram has a default of 254 buckets.
• The height of a histogram is number of non-null values divided
by number of buckets.
– Each histogram bucket contains
• Range values of a bucket
• Number of distinct values in a bucket
– We use two table scans to generate the equi-height
histograms for all columns specified in the ANALYZE
command.
• Use ApproximatePercentile class to get end points of all
histogram buckets
• Use HyperLogLog++ algorithm to compute the number of
distinct values in each bucket.
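The bucket layout described above can be illustrated with a small in-memory sketch. It is not the Spark implementation: it sorts the values exactly and counts distinct values with a set, where the slides name ApproximatePercentile and HyperLogLog++, and it makes no attempt to mirror the two table scans. The `Bucket` class and function name are hypothetical; the input is the 25-value age column used in the filter examples of this deck.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    lo: float   # lower end point of the bucket
    hi: float   # upper end point of the bucket
    ndv: int    # number of distinct values in the bucket
    rows: int   # number of rows in the bucket


def equi_height_histogram(values, num_buckets=254):
    """Build an equi-height histogram with exact (not approximate) statistics."""
    data = sorted(v for v in values if v is not None)
    # Height = number of non-null values divided by number of buckets.
    height = len(data) / num_buckets
    buckets = []
    for i in range(num_buckets):
        chunk = data[round(i * height):round((i + 1) * height)]
        if not chunk:
            continue  # more buckets requested than there are values
        # Each bucket starts where the previous one ended.
        lo = buckets[-1].hi if buckets else chunk[0]
        buckets.append(Bucket(lo, chunk[-1], len(set(chunk)), len(chunk)))
    return buckets


ages = [20, 21, 23, 24, 25,
        25, 27, 27, 27, 28,
        28, 28, 28, 28, 28,
        29, 36, 36, 39, 40,
        45, 47, 55, 63, 80]

for bucket in equi_height_histogram(ages, num_buckets=5):
    print(bucket)
```

Note how the skewed value 28 fills one bucket entirely and spills into the end of the previous one, which is what lets an equi-height histogram capture skew.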
Filter Cardinality Estimation
• Between logical expressions: AND, OR, NOT
• In each logical expression: =, <, <=, >, >=, in, etc.
• Currently supported types in expressions
– For <, <=, >, >=, <=>: Integer, Double, Date, Timestamp, etc.
– For =, <=>: String, Integer, Double, Date, Timestamp, etc.
• Example: A <= B
– Based on A’s and B’s min/max/distinct count/null count values, decide
the relationship between A and B. After evaluating this
expression, we set the new min/max/distinct count/null count values.
– Assume all the data is evenly distributed if no histogram
information is available.
Filter Operator without Histogram
• Column A (op) literal B
– (op) can be “=”, “<”, “<=”, “>”, “>=”, “like”
– Column’s max/min/distinct count/null count should be updated
– Example: Column A < value B
• B is to the left of A.min: Filtering Factor = 0%; need to change A’s statistics
• B is to the right of A.max: Filtering Factor = 100%; no need to change A’s statistics
• B falls between A.min and A.max:
Filtering Factor = (B.value – A.min) / (A.max – A.min)
A.min = no change
A.max = B.value
A.ndv = A.ndv * Filtering Factor
• Without a histogram, we prorate over
the entire column range.
• This works only if the data is evenly distributed.
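The three cases for "Column A < value B" can be sketched as a small function. The function name and the choice to return `None` for min and max when no row qualifies are assumptions made for illustration; the formulas are the ones on the slide. The predicate `age < 40` used in the example is hypothetical and reuses the age statistics from this deck (min 20, max 80, ndv 17).

```python
def estimate_less_than(col_min, col_max, ndv, literal):
    """Estimate `column < literal` assuming uniformly distributed values.

    Returns (filtering_factor, new_min, new_max, new_ndv).
    """
    if literal <= col_min:
        # Filtering factor 0%: no row qualifies, so the column statistics
        # must change (represented here as empty: an assumption).
        return 0.0, None, None, 0.0
    if literal > col_max:
        # Filtering factor 100%: statistics stay unchanged.
        return 1.0, col_min, col_max, float(ndv)
    # Literal falls inside the column range: prorate over the whole range.
    factor = (literal - col_min) / (col_max - col_min)
    return factor, col_min, literal, ndv * factor


factor, new_min, new_max, new_ndv = estimate_less_than(20, 80, 17, 40)
print(f"filtering factor = {factor:.3f}")
print(f"estimated rows   = {25 * factor:.2f}")
print(f"new min/max/ndv  = {new_min}/{new_max}/{new_ndv:.2f}")
```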
Filter Operator with Histogram
• With a histogram, we check the range values of a
bucket to see if it should be included in the
estimation.
• We prorate only the boundary bucket.
• This improves the accuracy of the
estimation, since we prorate (or guess) over only the
much smaller set of records in a single bucket.
Histogram for Filter Example 1
Age distribution of a restaurant:
• Estimate row count for
predicate “age > 40”. Correct
answer is 5.
• Without histogram, estimate:
25 * (80 – 40)/(80 – 20) = 16.7
• With histogram, estimate:
1.0 * // only 5th bucket
5 // 5 records per bucket
= 5
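Equi-height histogram of age (5 buckets, 5 records per bucket):
Bucket 1 [20, 25]: 20, 21, 23, 24, 25 (ndv=5)
Bucket 2 [25, 28]: 25, 27, 27, 27, 28 (ndv=3)
Bucket 3 [28, 28]: 28, 28, 28, 28, 28 (ndv=1)
Bucket 4 [28, 40]: 29, 36, 36, 39, 40 (ndv=4)
Bucket 5 [40, 80]: 45, 47, 55, 63, 80 (ndv=5)
Total row count: 25
age min = 20
age max = 80
age ndv = 17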
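The arithmetic of Example 1 can be reproduced with a short sketch. The function names are hypothetical, and the bucket-inclusion rule is a simplification chosen to match the slide (the lower end point of a bucket is treated as exclusive); it is not taken from the Spark source.

```python
# (lo, hi, ndv) for each equi-height bucket of the age column.
BUCKETS = [(20, 25, 5), (25, 28, 3), (28, 28, 1), (28, 40, 4), (40, 80, 5)]
ROWS_PER_BUCKET = 5
TOTAL_ROWS, AGE_MIN, AGE_MAX = 25, 20, 80


def greater_than_without_histogram(literal):
    """Prorate over the entire column range, assuming uniform data."""
    fraction = (AGE_MAX - literal) / (AGE_MAX - AGE_MIN)
    return TOTAL_ROWS * fraction


def greater_than_with_histogram(literal):
    """Count whole buckets above the literal; prorate only a boundary bucket."""
    covered = 0.0
    for lo, hi, _ndv in BUCKETS:
        if lo >= literal:
            covered += 1.0                         # bucket lies above the literal
        elif hi > literal:
            covered += (hi - literal) / (hi - lo)  # boundary bucket, prorated
    return covered * ROWS_PER_BUCKET


print(round(greater_than_without_histogram(40), 1))  # estimate without histogram
print(greater_than_with_histogram(40))               # estimate with histogram
```

For `age > 40` only the fifth bucket is counted, so no proration is needed and the estimate equals the bucket height.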
Histogram for Filter Example 2
Age distribution of a restaurant:
• Estimate row count for predicate
“age = 28”. Correct answer is 6.
• Without histogram, estimate:
25 * 1 / 17 = 1.47
• With histogram, estimate:
( 1/3 // prorate the 2nd bucket
+ 1.0 // for 3rd bucket
) * 5 // 5 records per bucket
= 6.67
[Figure: same age histogram as in Example 1: 25 rows in 5 equi-height buckets with boundaries 20, 25, 28, 28, 40, 80 and ndv 5, 3, 1, 4, 5; age min = 20, age max = 80, age ndv = 17]
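Example 2 can be checked the same way. In this sketch a bucket that contains the literal contributes 1/ndv of its rows, which gives 1/3 for the second bucket and 1.0 for the third; the containment rule (lower end point exclusive unless the bucket is the first one or holds a single value) is an assumption chosen to match the slide, and the function names are hypothetical.

```python
# (lo, hi, ndv) for each equi-height bucket of the age column.
BUCKETS = [(20, 25, 5), (25, 28, 3), (28, 28, 1), (28, 40, 4), (40, 80, 5)]
ROWS_PER_BUCKET = 5
TOTAL_ROWS, AGE_NDV = 25, 17


def equals_without_histogram():
    """Assume every distinct value is equally frequent."""
    return TOTAL_ROWS / AGE_NDV


def equals_with_histogram(literal):
    """Each bucket containing the literal contributes 1/ndv of its rows."""
    covered = 0.0
    for index, (lo, hi, ndv) in enumerate(BUCKETS):
        in_range = lo < literal <= hi
        at_lower_end = literal == lo and (index == 0 or lo == hi)
        if in_range or at_lower_end:
            covered += 1.0 / ndv
    return covered * ROWS_PER_BUCKET


print(round(equals_without_histogram(), 2))  # estimate without histogram
print(round(equals_with_histogram(28), 2))   # estimate with histogram
```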
Join Cardinality without Histogram
• Inner-Join: The number of rows of “A join B on A.k1 = B.k1” is
estimated as:
num(A ⟗ B) = num(A) * num(B) / max(distinct(A.k1),
distinct(B.k1)),
– where num(A) is the number of records in table A, distinct is the number of
distinct values of that column.
– The underlying assumption for this formula is that each value of the smaller
domain is included in the larger domain.
– Assuming uniform distribution for entire range of both join columns.
• We similarly estimate cardinalities for Left-Outer Join, Right-Outer
Join and Full-Outer Join
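The inner-join formula above translates directly into code. This is a sketch with a hypothetical function name; the inputs in the example are the table statistics used in this deck (25 and 20 rows, 17 distinct join-key values on each side).

```python
def inner_join_cardinality(rows_a, rows_b, ndv_a, ndv_b):
    """num(A) * num(B) / max(distinct(A.k1), distinct(B.k1))."""
    largest_ndv = max(ndv_a, ndv_b)
    if largest_ndv == 0:
        return 0.0  # an empty join column cannot produce matches
    return rows_a * rows_b / largest_ndv


print(round(inner_join_cardinality(25, 20, 17, 17), 1))
```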
Join Cardinality without Histogram
Table A, join column k1
Total row count: 25
k1 min = 20
k1 max = 80
k1 ndv = 17
Values: 20, 21, 23, 24, 25, 25, 27, 27, 27, 28, 28, 28, 28, 28, 28, 29, 36, 36, 39, 40, 45, 47, 55, 63, 80
Table B, join column k1
Total row count: 20
k1 min = 20
k1 max = 90
k1 ndv = 17
Values: 20, 21, 21, 25, 26, 28, 28, 30, 36, 39, 45, 50, 55, 60, 65, 70, 75, 80, 90, 90
Without histogram, join cardinality estimate is 25 * 20 / 17 = 29.4
The correct answer is 20.
Join Cardinality with Histogram
• The number of rows of “A join B on A.k1 = B.k1” is estimated as:
num(A ⟗ B) = Σ_{i,j} num(A_i) * num(B_j) / max(ndv(A_i.k1), ndv(B_j.k1))
– where num(A_i) is the number of records in bucket i of table A, and ndv is the number
of distinct values of that column in the corresponding bucket.
– We compute the join cardinality bucket by bucket, and then add up the total
count.
• If the buckets of two join tables do not align,
– We split the bucket on the boundary values into more than 1 bucket.
– In the split buckets, we prorate ndv and bucket height based on the boundary
values of the newly split buckets by assuming uniform distribution within a given
bucket.
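The bucket-by-bucket computation and the splitting step can be sketched as follows. The function names and the two small histograms are hypothetical and are not the tables from this deck, so the result is not meant to reproduce the deck's figures. Buckets are written as `(lo, hi, rows, ndv)`, and the sum runs over pairs of buckets that share the same boundaries after alignment.

```python
def split_bucket(lo, hi, rows, ndv, at):
    """Split bucket [lo, hi] at `at`, prorating rows and ndv by range width
    (uniform distribution within the bucket is assumed)."""
    if not lo < at < hi:
        raise ValueError("split point must lie strictly inside the bucket")
    left = (at - lo) / (hi - lo)
    left_rows, left_ndv = rows * left, ndv * left
    return ((lo, at, left_rows, left_ndv),
            (at, hi, rows - left_rows, ndv - left_ndv))


def join_cardinality(aligned_a, aligned_b):
    """Sum rows_a * rows_b / max(ndv_a, ndv_b) over aligned bucket pairs."""
    total = 0.0
    for bucket_a, bucket_b in zip(aligned_a, aligned_b):
        lo_a, hi_a, rows_a, ndv_a = bucket_a
        lo_b, hi_b, rows_b, ndv_b = bucket_b
        if (lo_a, hi_a) != (lo_b, hi_b):
            raise ValueError("buckets are not aligned")
        largest_ndv = max(ndv_a, ndv_b)
        if largest_ndv > 0:
            total += rows_a * rows_b / largest_ndv
    return total


# Hypothetical input: table A has two buckets, table B has one wide bucket
# that must be split at A's boundary (10) before the buckets align.
table_a = [(0, 10, 10, 5), (10, 20, 10, 4)]
table_b = list(split_bucket(0, 20, 12, 6, at=10))

print(table_b)
print(join_cardinality(table_a, table_b))
```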
Aligning Histogram Buckets for Join
• Form new buckets to align buckets properly
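[Figure: histogram bucket boundaries of both tables before and after alignment]
Table A, join column k1, histogram buckets:
original bucket boundaries 20, 25, 28, 28, 40, 80; extra new bucket boundaries 30, 50, 70
Table B, join column k1, histogram buckets:
original bucket boundaries 20, 25, 30, 50, 70, 90; extra new bucket boundaries 28, 28, 40, 80
Extra new bucket boundaries form additional buckets.
The bucket from 80 to 90 in Table B is excluded from the computation.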
Table A, join column k1, histogram buckets after alignment:
Total row count: 25
min = 20, max = 80
ndv = 17
[20, 25]: 20, 21, 23, 24, 25 (ndv=5)
[25, 28]: 25, 27, 27, 27, 28 (ndv=3)
[28, 28]: 28, 28, 28, 28, 28 (ndv=1)
[28, 30]: 29 (ndv=1)
[30, 40]: 36, 36, 39, 40 (ndv=3)
[40, 50]: 45, 47 (ndv=2)
[50, 70]: 55, 63 (ndv=2)
[70, 80]: 80 (ndv=1)
Table B, join column k1, histogram buckets after alignment:
Total row count: 20
min = 20, max = 90
ndv = 17
[20, 25]: 20, 21, 21, 25 (ndv=3)
[25, 28]: 26 (ndv=1)
[28, 28]: 28, 28 (ndv=1)
[28, 30]: 30 (ndv=1)
[30, 40]: 36, 39 (ndv=2)
[40, 50]: 45, 50 (ndv=2)
[50, 70]: 55, 60, 65, 70 (ndv=4)
[70, 80]: 75, 80 (ndv=2)
[80, 90]: 90, 90 (ndv=1)
- With histogram, the join cardinality estimate is 21.8, obtained by
computing the cardinality of each aligned bucket one by one.
- Without histogram, the join cardinality estimate is 29.4.
- The correct answer is 20.
Other Operator Estimation
• Project: does not change row count
• Aggregate: consider uniqueness of group-by
columns
• Limit, Sample, etc.
Statistics Propagation
[Figure: query plan with Join (t1.a = t2.b) above Scan t1 and Scan t2]
• Scan t1 output: a: min, max, ndv, …
• Scan t2 output: b: min, max, ndv, …
• Join output: a: newMin, newMax, newNdv, …; b: newMin, newMax, newNdv, …
• Top-down statistics requests
• Bottom-up statistics propagation
Statistics inference
• Statistics collected:
– Number of records for a table
– Number of distinct values for a column
• Can make these inferences:
– If the above two numbers are close, we can determine if a
column is a unique key.
– Can infer if it is a primary-key to foreign-key join.
– Can detect if a star schema exists.
– Can help determine the output size of group-by operator if
multiple columns of same tables appear in group-by
expression.
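The first inference, deciding whether a column looks like a unique key, can be sketched as a simple comparison. The function name and the 5% tolerance are assumptions for illustration; the slides only say that the two numbers should be "close".

```python
def looks_like_unique_key(row_count, distinct_count, tolerance=0.05):
    """Return True when the distinct count is within `tolerance` (a fraction
    of the row count) of the number of rows. The threshold is hypothetical."""
    if row_count <= 0:
        return False
    return abs(row_count - distinct_count) <= tolerance * row_count


print(looks_like_unique_key(1000, 980))  # distinct count close to row count
print(looks_like_unique_key(25, 17))     # many repeated values
```

A tolerance is needed because the distinct count is itself an estimate, so it rarely equals the row count exactly even for a true key.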
Configuration Parameters
Configuration Parameter                        Default Value   Suggested Value
spark.sql.cbo.enabled                          False           True
spark.sql.cbo.joinReorder.enabled              False           True
spark.sql.cbo.joinReorder.dp.threshold         12              12
spark.sql.cbo.joinReorder.card.weight          0.7             0.7
spark.sql.statistics.size.autoUpdate.enabled   False           True
spark.sql.statistics.histogram.enabled         False           True
spark.sql.statistics.histogram.numBins         254             254
spark.sql.statistics.ndv.maxError              0.05            0.05
spark.sql.statistics.percentile.accuracy       10000           10000
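As a sketch, the four parameters whose suggested value differs from the default can be kept in one place and rendered as SQL `SET` statements. The dictionary and helper names are hypothetical; the keys and values come from the table above.

```python
# Parameters whose suggested value differs from the default.
SUGGESTED_SETTINGS = {
    "spark.sql.cbo.enabled": "true",
    "spark.sql.cbo.joinReorder.enabled": "true",
    "spark.sql.statistics.size.autoUpdate.enabled": "true",
    "spark.sql.statistics.histogram.enabled": "true",
}


def as_set_statements(settings):
    """Render each setting as a SQL SET statement."""
    return [f"SET {key}={value}" for key, value in settings.items()]


for statement in as_set_statements(SUGGESTED_SETTINGS):
    print(statement)
```

Assuming an existing SparkSession bound to the name `spark`, the same dictionary could be applied with `spark.conf.set(key, value)` for each entry. Histograms are only built when statistics are collected, so the histogram setting would need to be in effect before ANALYZE is run.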
Reference
• SPARK-16026: Cost-Based Optimizer
Framework
– https://issues.apache.org/jira/browse/SPARK-16026
– It has 45 sub-tasks.
• SPARK-21975: Histogram support in cost-based
optimizer
– https://issues.apache.org/jira/browse/SPARK-21975
– It has 10 sub-tasks.
Summary
• Cost Based Optimizer in Spark 2.2
• Statistics Collected
• Histogram Support in Spark 2.3
– Skewed data distributions are intrinsic in real world
data.
– Turn on histogram configuration parameter
“spark.sql.statistics.histogram.enabled” to deal with
skew.
Q & A
ron.hu@huawei.com
wangzhenhua@huawei.com

Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and Zhenhua Wang

  • 1. Ron Hu, Zhenhua Wang Huawei Technologies, Inc. Cardinality Estimation through Histogram in Apache Spark 2.3 #DevSAIS13
  • 2. Agenda
      • Catalyst Architecture
      • Cost Based Optimizer in Spark 2.2
      • Statistics Collected
      • Histogram Support in Spark 2.3
      • Configuration Parameters
      • Q & A
  • 3. Catalyst Architecture
      [Figure: Catalyst architecture diagram, annotated "Spark optimizes query plan here".]
      Reference: "Deep Dive into Spark SQL’s Catalyst Optimizer", a Databricks engineering blog.
  • 4. Query Optimizer in Spark SQL
      • Spark SQL’s query optimizer is based on both rules and cost.
      • Most of the optimizer’s rules are heuristic rules.
          – PushDownPredicate, ColumnPruning, ConstantFolding, …
      • Cost-based optimization (CBO) was added in Spark 2.2.
  • 5. Cost Based Optimizer in Spark 2.2
      • It was a good, working CBO framework to start with.
      • It focused on:
          – Statistics collection
          – Cardinality estimation
          – Build side selection, broadcast vs. shuffled join, join reordering, etc.
      • It used a heuristic formula for the cost function, expressed in terms of the cardinality and data size of each operator.
  • 6. Statistics Collected
      • Collect table statistics
      • Collect column statistics
      • Goal:
          – Calculate the cost of each operator in terms of number of output rows, size of output, etc.
          – Based on the cost calculation, adjust the query execution plan.
  • 7. Table Statistics Collected
      • Command to collect statistics of a table:
          – Ex: ANALYZE TABLE table-name COMPUTE STATISTICS
      • It collects table-level statistics and saves them into the metastore:
          – Number of rows
          – Table size in bytes
  • 8. Column Statistics Collected
      • Command to collect column-level statistics of individual columns:
          – Ex: ANALYZE TABLE table-name COMPUTE STATISTICS FOR COLUMNS column-name1, column-name2, …
      • It collects column-level statistics and saves them into the metastore.
      • String/Binary type:
          ✓ Distinct count
          ✓ Null count
          ✓ Average length
          ✓ Max length
      • Numeric/Date/Timestamp type:
          ✓ Distinct count
          ✓ Max
          ✓ Min
          ✓ Null count
          ✓ Average length (fixed length)
          ✓ Max length (fixed length)
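The two ANALYZE commands on slides 7 and 8 can be generated for a given table. The sketch below only builds the statement strings; the helper name `analyze_statements` and the table and column names are hypothetical, and running the statements would need a live Spark session (for example via `spark.sql(statement)` in PySpark).

```python
def analyze_statements(table, columns=()):
    """Build the table-level and, optionally, the column-level ANALYZE statement."""
    statements = [f"ANALYZE TABLE {table} COMPUTE STATISTICS"]
    if columns:
        column_list = ", ".join(columns)
        statements.append(
            f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS {column_list}"
        )
    return statements


# Hypothetical table and column names, for illustration only.
for statement in analyze_statements("customers", ["age", "city"]):
    print(statement)  # with a live SparkSession: spark.sql(statement)
```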
  • 9. Real World Data Are Often Skewed
      #DevSAIS13 – Cardinality Estimation by Hu and Wang
  • 10. Histogram Support in Spark 2.3
      • A histogram is effective in handling skewed distributions.
      • We developed the equi-height histogram in Spark 2.3.
      • An equi-height histogram is better than an equi-width histogram.
      • An equi-height histogram can use multiple buckets to represent a very skewed value.
      • An equi-width histogram cannot give the right frequency when a skewed value falls in the same bucket as other values.
      [Figure: Equi-Width and Equi-Height panels; axis labels: Column interval, Frequency, Density.]
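To make the equi-width versus equi-height contrast concrete, here is a minimal sketch on the 25 restaurant ages used in the filter examples of this deck, where 28 is the skewed value. The function names are illustrative and the code is not Spark's implementation; the equi-height helper assumes the row count divides evenly by the bucket count.

```python
# Ages from the restaurant example in this deck (25 rows; 28 occurs six times).
ages = [20, 21, 23, 24, 25, 25, 27, 27, 27, 28, 28, 28, 28,
        28, 28, 29, 36, 36, 39, 40, 45, 47, 55, 63, 80]


def equi_width_counts(values, num_buckets):
    """Count rows per bucket when every bucket spans the same value range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # The maximum value is placed in the last bucket.
        index = min(int((v - lo) / width), num_buckets - 1)
        counts[index] += 1
    return counts


def equi_height_bounds(values, num_buckets):
    """Return bucket end points so that every bucket holds the same row count."""
    ordered = sorted(values)
    height = len(ordered) // num_buckets  # assumes an exact division
    return [ordered[0]] + [ordered[(i + 1) * height - 1] for i in range(num_buckets)]


print(equi_width_counts(ages, 5))   # rows per equal-width bucket
print(equi_height_bounds(ages, 5))  # end points of equal-height buckets
```

With five buckets, the first equi-width bucket absorbs 16 of the 25 rows, so the frequency of 28 is blended with its neighbours. The equi-height end points come out as 20, 25, 28, 28, 40, 80, so the skewed value occupies a bucket of its own.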
  • 11. Histogram Algorithm
      – Each histogram has a default of 254 buckets.
          • The height of a histogram is the number of non-null values divided by the number of buckets.
      – Each histogram bucket contains:
          • The range values of the bucket
          • The number of distinct values in the bucket
      – We use two table scans to generate the equi-height histograms for all columns specified in the ANALYZE command.
          • Use the ApproximatePercentile class to get the end points of all histogram buckets.
          • Use the HyperLogLog++ algorithm to compute the number of distinct values in each bucket.
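The bucket layout described on slide 11 can be sketched in a few lines. This is an illustration only: it replaces the approximate algorithms named on the slide with exact stand-ins (a full sort instead of ApproximatePercentile, a Python `set` instead of HyperLogLog++), and it assumes there are at least as many non-null rows as buckets. The `Bucket` type and the function name are hypothetical.

```python
from typing import NamedTuple


class Bucket(NamedTuple):
    lo: float   # lower end point of the bucket range
    hi: float   # upper end point of the bucket range
    ndv: int    # number of distinct values that fall in the bucket


def build_equi_height(values, num_buckets):
    """Return (height, buckets) for an equi-height histogram over non-null values."""
    non_null = sorted(v for v in values if v is not None)
    height = len(non_null) / num_buckets  # non-null rows per bucket
    buckets = []
    lo = non_null[0]
    for i in range(num_buckets):
        chunk = non_null[round(i * height):round((i + 1) * height)]
        buckets.append(Bucket(lo, chunk[-1], len(set(chunk))))
        lo = chunk[-1]  # the next bucket starts where this one ends
    return height, buckets


# Restaurant ages from this deck, plus two nulls that the histogram ignores.
ages = [20, 21, 23, 24, 25, 25, 27, 27, 27, 28, 28, 28, 28,
        28, 28, 29, 36, 36, 39, 40, 45, 47, 55, 63, 80, None, None]
height, buckets = build_equi_height(ages, 5)
for bucket in buckets:
    print(bucket)
```

On this data the sketch yields a height of 5 and per-bucket distinct counts of 5, 3, 1, 4 and 5, matching the buckets drawn in the filter examples.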
  • 12. Filter Cardinality Estimation
      • Between logical expressions: AND, OR, NOT
      • In each logical expression: =, <, <=, >, >=, in, etc.
      • Currently supported types in expressions:
          – For <, <=, >, >=, <=>: Integer, Double, Date, Timestamp, etc.
          – For =, <=>: String, Integer, Double, Date, Timestamp, etc.
      • Example: A <= B
          – Based on A’s and B’s min/max/distinct count/null count values, decide the relationship between A and B. After evaluating this expression, we set the new min/max/distinct count/null count.
          – Assume all the data is evenly distributed if there is no histogram information.
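The slide lists AND, OR and NOT but does not spell out how the selectivities of sub-expressions are combined. A common convention, assumed here rather than taken from the slides, is to treat the predicates as independent. The function names and the input selectivities below are hypothetical.

```python
def and_selectivity(left, right):
    """Both predicates must hold; independence assumed."""
    return left * right


def or_selectivity(left, right):
    """Either predicate holds; subtract the overlap counted twice."""
    return left + right - left * right


def not_selectivity(inner):
    """Complement of the inner predicate (nulls ignored in this sketch)."""
    return 1.0 - inner


# Hypothetical selectivities of two predicates on a 1,000-row table.
rows = 1000
p1, p2 = 0.2, 0.5
print(round(rows * and_selectivity(p1, p2)))
print(round(rows * or_selectivity(p1, p2)))
print(round(rows * not_selectivity(p1)))
```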
  • 13. Filter Operator without Histogram
      • Column A (op) literal B
          – (op) can be “=”, “<”, “<=”, “>”, “>=”, “like”
          – The column’s max/min/distinct count/null count should be updated
          – Example: Column A < value B
              [Figure: number line for column A from A.min to A.max, with three possible positions of B.]
              • B below A.min: Filtering Factor = 0%; need to change A’s statistics
              • B above A.max: Filtering Factor = 100%; no need to change A’s statistics
              • B between A.min and A.max: Filtering Factor = (B.value - A.min) / (A.max - A.min); A.min = no change; A.max = B.value; A.ndv = A.ndv * Filtering Factor
      • Without a histogram, we prorate over the entire column range.
      • This works only if the data is evenly distributed.
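The three cases for `A < B` without a histogram can be written as one small function. This is a sketch under the slide's uniform-distribution assumption; the `ColumnStats` type and the function name are hypothetical, and returning `None` for the statistics when no rows survive is a choice made for this illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ColumnStats:
    min_value: float
    max_value: float
    ndv: float


def estimate_less_than(stats, literal):
    """Return (filtering factor, updated stats) for the predicate `column < literal`."""
    if literal <= stats.min_value:
        return 0.0, None   # no row qualifies, so the column has no values left
    if literal > stats.max_value:
        return 1.0, stats  # every row qualifies, statistics unchanged
    factor = (literal - stats.min_value) / (stats.max_value - stats.min_value)
    # min is unchanged, max becomes the literal, ndv is prorated by the factor.
    return factor, replace(stats, max_value=literal, ndv=stats.ndv * factor)


# Age column from the examples in this deck: min 20, max 80, ndv 17, 25 rows.
age = ColumnStats(min_value=20, max_value=80, ndv=17)
factor, updated = estimate_less_than(age, 40)
print(f"factor={factor:.3f} rows={25 * factor:.1f} new stats={updated}")
```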
  • 14. Filter Operator with Histogram
      • With a histogram, we check the range values of each bucket to see whether it should be included in the estimate.
      • We prorate only the boundary bucket.
      • This improves the accuracy of the estimate, since we prorate (or guess) over only the much smaller set of records in one bucket.
  • 15. Histogram for Filter Example 1
      Age distribution of a restaurant:
          Total row count: 25, age min = 20, age max = 80, age ndv = 17
          Equi-height histogram, 5 records per bucket:
              Bucket 1 [20, 25]: 20 21 23 24 25 (ndv=5)
              Bucket 2 [25, 28]: 25 27 27 27 28 (ndv=3)
              Bucket 3 [28, 28]: 28 28 28 28 28 (ndv=1)
              Bucket 4 [28, 40]: 29 36 36 39 40 (ndv=4)
              Bucket 5 [40, 80]: 45 47 55 63 80 (ndv=5)
      • Estimate the row count for predicate “age > 40”. The correct answer is 5.
      • Without histogram, estimate: 25 * (80 - 40) / (80 - 20) = 16.7
      • With histogram, estimate: 1.0 (only the 5th bucket) * 5 (5 records per bucket) = 5
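Example 1 can be reproduced with a short sketch. It assumes each bucket covers the half-open range (lo, hi], which is how the slide's arithmetic treats the boundary at 40. The function names are illustrative and this is not Spark's code.

```python
# Bucket end points from the slide; each bucket holds 5 of the 25 rows.
BUCKETS = [(20, 25), (25, 28), (28, 28), (28, 40), (40, 80)]
ROWS_PER_BUCKET = 5
TOTAL_ROWS, AGE_MIN, AGE_MAX = 25, 20, 80


def greater_than_without_histogram(value):
    """Prorate over the whole column range, assuming a uniform distribution."""
    return TOTAL_ROWS * (AGE_MAX - value) / (AGE_MAX - AGE_MIN)


def greater_than_with_histogram(value):
    """Count whole buckets above `value` and prorate only the boundary bucket."""
    covered = 0.0
    for lo, hi in BUCKETS:
        if hi <= value:
            continue                               # bucket entirely at or below value
        if lo >= value:
            covered += 1.0                         # bucket entirely above value
        else:
            covered += (hi - value) / (hi - lo)    # boundary bucket
    return covered * ROWS_PER_BUCKET


print(round(greater_than_without_histogram(40), 1))
print(greater_than_with_histogram(40))
```

For "age > 40" the two functions give about 16.7 and exactly 5, the figures on the slide. For a value inside a bucket, such as 60, only the last bucket is prorated.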
  • 16. Histogram for Filter Example 2
      Age distribution of a restaurant:
          Total row count: 25, age min = 20, age max = 80, age ndv = 17; same five buckets as in Example 1.
      • Estimate the row count for predicate “age = 28”. The correct answer is 6.
      • Without histogram, estimate: 25 * 1 / 17 = 1.47
      • With histogram, estimate: (1/3 (prorate the 2nd bucket) + 1.0 (for the 3rd bucket)) * 5 (5 records per bucket) = 6.67
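The equality estimate of Example 2 follows the same pattern: every bucket that can contain the value contributes 1/ndv of its rows. The sketch assumes buckets are (lo, hi] ranges, except that the first bucket includes its lower end and a single-value bucket (lo equal to hi) matches its value. The names are illustrative.

```python
# (lo, hi, ndv) per bucket, as shown on the slide; 5 rows per bucket.
BUCKETS = [(20, 25, 5), (25, 28, 3), (28, 28, 1), (28, 40, 4), (40, 80, 5)]
ROWS_PER_BUCKET = 5


def equals_without_histogram(total_rows, ndv):
    """Uniform assumption: every distinct value has the same number of rows."""
    return total_rows / ndv


def equals_with_histogram(value):
    """Sum 1/ndv over the buckets whose range can contain `value`."""
    share = 0.0
    for index, (lo, hi, ndv) in enumerate(BUCKETS):
        single_value = lo == hi == value
        in_range = lo < value <= hi or (index == 0 and value == lo)
        if single_value or in_range:
            share += 1.0 / ndv
    return share * ROWS_PER_BUCKET


print(round(equals_without_histogram(25, 17), 2))
print(round(equals_with_histogram(28), 2))
```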
  • 17. Join Cardinality without Histogram
      • Inner join: the number of rows of “A join B on A.k1 = B.k1” is estimated as:
          $\mathrm{num}(A \bowtie B) = \mathrm{num}(A) \cdot \mathrm{num}(B) / \max(\mathrm{distinct}(A.k1), \mathrm{distinct}(B.k1))$
          – where num(A) is the number of records in table A, and distinct is the number of distinct values of that column.
          – The underlying assumption for this formula is that each value of the smaller domain is included in the larger domain.
          – It also assumes a uniform distribution over the entire range of both join columns.
      • We similarly estimate cardinalities for left outer join, right outer join and full outer join.
  • 18. Join Cardinality without Histogram
      Table A, join column k1: total row count: 25, k1 min = 20, k1 max = 80, k1 ndv = 17
          20 21 23 24 25 25 27 27 27 28 28 28 28 28 28 29 36 36 39 40 45 47 55 63 80
      Table B, join column k1: total row count: 20, k1 min = 20, k1 max = 90, k1 ndv = 17
          20 21 21 25 26 28 28 30 36 39 45 50 55 60 65 70 75 80 90 90
      Without histogram, the join cardinality estimate is 25 * 20 / 17 = 29.4. The correct answer is 20.
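The no-histogram join estimate is a one-line formula. The sketch below applies it to the table statistics quoted on the slide; the function name is illustrative.

```python
def inner_join_rows(rows_a, rows_b, ndv_a, ndv_b):
    """Estimate |A join B| from row counts and distinct counts of the join key."""
    return rows_a * rows_b / max(ndv_a, ndv_b)


# Statistics as stated on the slide: A has 25 rows, B has 20 rows, both ndv = 17.
estimate = inner_join_rows(rows_a=25, rows_b=20, ndv_a=17, ndv_b=17)
print(round(estimate, 1))  # 29.4
```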
  • 19. Join Cardinality with Histogram
      • The number of rows of “A join B on A.k1 = B.k1” is estimated as:
          $\mathrm{num}(A \bowtie B) = \sum_{i,j} \mathrm{num}(A_i) \cdot \mathrm{num}(B_j) / \max(\mathrm{ndv}(A_i.k1), \mathrm{ndv}(B_j.k1))$
          – where num(A_i) is the number of records in bucket i of table A, and ndv is the number of distinct values of that column in the corresponding bucket.
          – We compute the join cardinality bucket by bucket, and then add up the total count.
      • If the buckets of the two join tables do not align:
          – We split the bucket on the boundary values into more than one bucket.
          – In the split buckets, we prorate ndv and bucket height based on the boundary values of the newly split buckets, assuming uniform distribution within a given bucket.
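A minimal sketch of the bucket-by-bucket computation, including the split-and-prorate step for buckets that do not align. It is an illustration under simplifying assumptions, not Spark's implementation: buckets are (lo, hi, rows, ndv) tuples with lo strictly less than hi, single-value buckets are not handled, and the two small histograms are made up for the example.

```python
def split(bucket, cuts):
    """Prorate one (lo, hi, rows, ndv) bucket over the sub-ranges created by `cuts`."""
    lo, hi, rows, ndv = bucket
    edges = [lo] + [c for c in cuts if lo < c < hi] + [hi]
    pieces = {}
    for left, right in zip(edges, edges[1:]):
        share = (right - left) / (hi - lo)  # uniform distribution inside the bucket
        pieces[(left, right)] = (rows * share, ndv * share)
    return pieces


def join_rows(hist_a, hist_b):
    """Align both histograms on the union of their boundaries, then sum per range."""
    cuts = sorted({edge for lo, hi, _, _ in hist_a + hist_b for edge in (lo, hi)})
    pieces_a, pieces_b = {}, {}
    for bucket in hist_a:
        pieces_a.update(split(bucket, cuts))
    for bucket in hist_b:
        pieces_b.update(split(bucket, cuts))
    total = 0.0
    # Ranges present on only one side cannot produce matches and are skipped.
    for key in pieces_a.keys() & pieces_b.keys():
        rows_a, ndv_a = pieces_a[key]
        rows_b, ndv_b = pieces_b[key]
        total += rows_a * rows_b / max(ndv_a, ndv_b)
    return total


# Hypothetical histograms whose bucket boundaries do not line up.
hist_a = [(0, 10, 10, 10), (10, 20, 10, 5)]
hist_b = [(0, 5, 4, 4), (5, 25, 8, 8)]
print(join_rows(hist_a, hist_b))
```

Here the aligned ranges are (0, 5), (5, 10) and (10, 20); the part of B's second bucket beyond 20 has no counterpart in A and is excluded from the sum.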
  • 20. Aligning Histogram Buckets for Join
      • Form new buckets to align buckets properly.
      [Figure: histogram buckets of join column k1 for Table A and Table B, with boundaries at 20, 25, 28, 28, 30, 40, 50, 70, 80 and 90. Legend: original bucket boundary; extra new bucket boundary, added to form additional buckets. One bucket is marked "This bucket is excluded in computation".]
  • 21. Table A, join column k1, histogram buckets (total row count: 25, min = 20, max = 80, ndv = 17):
          [20, 25]: 20 21 23 24 25 (ndv=5)
          [25, 28]: 25 27 27 27 28 (ndv=3)
          [28, 28]: 28 28 28 28 28 (ndv=1)
          [28, 30]: 29 (ndv=1)
          [30, 40]: 36 36 39 40 (ndv=3)
          [40, 50]: 45 47 (ndv=2)
          [50, 70]: 55 63 (ndv=2)
          [70, 80]: 80 (ndv=1)
      Table B, join column k1, histogram buckets (total row count: 20, min = 20, max = 90, ndv = 17):
          [20, 25]: 20 21 21 25 (ndv=3)
          [25, 28]: 26 (ndv=1)
          [28, 28]: 28 28 (ndv=1)
          [28, 30]: 30 (ndv=1)
          [30, 40]: 36 39 (ndv=2)
          [40, 50]: 45 50 (ndv=2)
          [50, 70]: 55 60 65 70 (ndv=4)
          [70, 80]: 75 80 (ndv=2)
          [80, 90]: 90 90 (ndv=1)
      – With histogram, the join cardinality estimate is 21.8, obtained by computing the aligned buckets’ cardinalities one by one.
      – Without histogram, the join cardinality estimate is 29.4.
      – The correct answer is 20.
  • 22. Other Operator Estimation
      • Project: does not change row count
      • Aggregate: consider uniqueness of group-by columns
      • Limit, Sample, etc.
  • 23. Statistics Propagation
      [Figure: plan tree with Join (t1.a = t2.b) above Scan t1 and Scan t2. Scan t1 supplies a: min, max, ndv, …; Scan t2 supplies b: min, max, ndv, …; the Join outputs a: newMin, newMax, newNdv, … and b: newMin, newMax, newNdv, …]
      • Top-down statistics requests
      • Bottom-up statistics propagation
  • 24. Statistics Inference
      • Statistics collected:
          – Number of records for a table
          – Number of distinct values for a column
      • We can make these inferences:
          – If the above two numbers are close, we can conclude that the column is a unique key.
          – We can infer whether a join is a primary-key to foreign-key join.
          – We can detect whether a star schema exists.
          – We can help determine the output size of a group-by operator if multiple columns of the same table appear in the group-by expression.
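The first two inferences can be sketched as simple checks on row count and distinct count. The slide only says the two numbers should be "close"; the 5% tolerance and the function names below are hypothetical choices for illustration.

```python
def looks_like_unique_key(row_count, ndv, tolerance=0.05):
    """True when the distinct count is within `tolerance` of the row count."""
    if row_count == 0:
        return False
    return abs(row_count - ndv) <= tolerance * row_count


def looks_like_pk_fk_join(left, right):
    """Each side is a (row_count, ndv) pair for its join column.

    Treated as a primary-key to foreign-key join when exactly one side's
    join column looks unique.
    """
    return looks_like_unique_key(*left) != looks_like_unique_key(*right)


# Hypothetical statistics: a dimension key and a fact-table foreign key.
print(looks_like_unique_key(1000, 980))
print(looks_like_pk_fk_join((1000, 980), (500000, 990)))
```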
  • 25. Configuration Parameters
      Configuration Parameter                         Default Value   Suggested Value
      spark.sql.cbo.enabled                           False           True
      spark.sql.cbo.joinReorder.enabled               False           True
      spark.sql.cbo.joinReorder.dp.threshold          12              12
      spark.sql.cbo.joinReorder.card.weight           0.7             0.7
      spark.sql.statistics.size.autoUpdate.enabled    False           True
      spark.sql.statistics.histogram.enabled          False           True
      spark.sql.statistics.histogram.numBins          254             254
      spark.sql.statistics.ndv.maxError               0.05            0.05
      spark.sql.statistics.percentile.accuracy        10000           10000
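The four parameters whose suggested value differs from the default can be collected in one place and turned into `--conf key=value` arguments for `spark-submit`. The dictionary and helper names are illustrative; with a live PySpark session the same pairs could instead be applied with `spark.conf.set(key, value)`.

```python
# Settings from the table above whose suggested value differs from the default.
SUGGESTED_SETTINGS = {
    "spark.sql.cbo.enabled": "true",
    "spark.sql.cbo.joinReorder.enabled": "true",
    "spark.sql.statistics.size.autoUpdate.enabled": "true",
    "spark.sql.statistics.histogram.enabled": "true",
}


def to_submit_args(settings):
    """Render settings as a flat list of spark-submit `--conf key=value` arguments."""
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(SUGGESTED_SETTINGS)))
```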
  • 26. Reference
      • SPARK-16026: Cost-Based Optimizer Framework
          – https://issues.apache.org/jira/browse/SPARK-16026
          – It has 45 sub-tasks.
      • SPARK-21975: Histogram support in cost-based optimizer
          – https://issues.apache.org/jira/browse/SPARK-21975
          – It has 10 sub-tasks.
  • 27. Summary
      • Cost Based Optimizer in Spark 2.2
      • Statistics Collected
      • Histogram Support in Spark 2.3
          – Skewed data distributions are intrinsic to real-world data.
          – Turn on the histogram configuration parameter “spark.sql.statistics.histogram.enabled” to deal with skew.