Dynamic Partition Pruning in Apache Spark


In data analytics frameworks such as Spark, it is important to detect and avoid scanning data that is irrelevant to the executed query, an optimization known as partition pruning. Dynamic partition pruning applies when the optimizer cannot identify at parse time which partitions it has to eliminate. In particular, we consider a star schema, which consists of one or more fact tables referencing any number of dimension tables. In such join operations, we can prune the partitions the join reads from a fact table by identifying the partitions that survive the filters on the dimension tables. In this talk, we present a mechanism for performing dynamic partition pruning at runtime by reusing the dimension table broadcast results in hash joins, and we show significant improvements for most TPC-DS queries.


Dynamic Partition Pruning in Apache Spark
Spark + AI Summit, Amsterdam
Bogdan Ghit and Juliusz Sompolski

About Us
BI Experience team in the Databricks Amsterdam European Development Centre
● Working on improving the experience and performance of Business Intelligence / SQL analytics workloads using Databricks
○ JDBC / ODBC connectivity to Databricks clusters
○ Integrations with BI tools such as Tableau
○ But also: core performance improvements in Apache Spark for common SQL analytics query patterns
Bogdan Ghit
Juliusz Sompolski

TPC-DS Q98 on 10 TB
How to Make a Query 100x Faster?

Static Partition Pruning
SELECT * FROM Sales WHERE day_of_week = 'Mon'
[Diagram: Filter and Scan operators in three panels: Basic data-flow, Filter Push-down, and Partition files with multi-columnar data]
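
The idea on this slide can be sketched with a toy model in plain Python. This is not Spark code: the `sales_partitions` dictionary and the `scan_with_static_pruning` helper are hypothetical names, and the sketch assumes the table is partitioned by `day_of_week`. Because the filter value is a literal known before execution, only the matching partition needs to be read.

```python
# Toy model (not Spark code): a table partitioned by day_of_week,
# stored as one list of rows per partition value.
sales_partitions = {
    "Mon": [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}],
    "Tue": [{"id": 3, "amount": 7}],
    "Wed": [{"id": 4, "amount": 12}, {"id": 5, "amount": 3}],
}


def scan_with_static_pruning(partitions, wanted_day):
    """Read only the partitions whose key matches the literal in the filter."""
    partitions_read = [day for day in partitions if day == wanted_day]
    rows = [row for day in partitions_read for row in partitions[day]]
    return rows, partitions_read


rows, partitions_read = scan_with_static_pruning(sales_partitions, "Mon")
print(partitions_read)  # ['Mon']
print(len(rows))        # 2
```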

Table Denormalization
SELECT * FROM Sales JOIN Date
WHERE Date.day_of_week = 'Mon'
[Diagram: query plans built from Scan Sales, Scan Date, Filter day_of_week = 'Mon', and Join operators, in two panels: "Static pruning not possible" and "Simple workaround"]

This Talk
Dynamic pruning
SELECT * FROM Sales JOIN Date
WHERE Date.day_of_week = 'Mon'
[Diagram: Scan Sales, Filter day_of_week = 'Mon', Join, Scan Countries]
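
A minimal sketch of dynamic pruning for the query on this slide, again as a plain-Python toy model, not Spark's implementation. The names are hypothetical, and the sketch assumes the fact table is partitioned by the join key (`date_id` here). The point it illustrates is that the set of partitions to read is only known after the dimension filter has run.

```python
# Toy model (not Spark code). The fact table is partitioned by date_id;
# the dimension table maps each date_id to a day_of_week.
sales_partitions = {
    1: [{"date_id": 1, "amount": 10}],
    2: [{"date_id": 2, "amount": 7}],
    3: [{"date_id": 3, "amount": 12}],
    8: [{"date_id": 8, "amount": 30}],
}
date_dim = [
    {"date_id": 1, "day_of_week": "Mon"},
    {"date_id": 2, "day_of_week": "Tue"},
    {"date_id": 3, "day_of_week": "Wed"},
    {"date_id": 8, "day_of_week": "Mon"},
]

# Step 1: evaluate the dimension filter at run time.
matching_date_ids = {d["date_id"] for d in date_dim if d["day_of_week"] == "Mon"}

# Step 2: use the resulting keys to decide which fact partitions to read.
partitions_read = sorted(pid for pid in sales_partitions if pid in matching_date_ids)

# Step 3: join only the rows from the surviving partitions.
day_by_id = {d["date_id"]: d["day_of_week"] for d in date_dim}
joined = [
    {**row, "day_of_week": day_by_id[row["date_id"]]}
    for pid in partitions_read
    for row in sales_partitions[pid]
]

print(partitions_read)  # [1, 8]
print(len(joined))      # 2
```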

Spark In a Nutshell
[Diagram: pipeline of Query, Logical Plan, Optimization, Physical Plan, Selection, RDD batches, Cluster slots; annotated with APIs, Rule-based transformations, and Stats-based cost model]

Optimization Opportunities
Data Layout
- Scan FACT TABLE: partition files with multi-columnar data
- Scan DIM TABLE: non-partitioned dataset
Query Shape
- Filter DIM
- Join on partition id

A Simple Approach
[Diagram: Scan FACT TABLE (partition files with multi-columnar data) and Scan DIM TABLE (non-partitioned dataset) with Filter DIM, joined on partition id; a second Scan DIM TABLE with Filter DIM appears in the plan]
- Work duplication may be expensive
- Heuristics based on inaccurate stats
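
The work duplication mentioned on this slide can be illustrated with a toy model in plain Python. This is a sketch under assumptions, not Spark's planner: `CountingTable` is a hypothetical helper that counts full scans, and the fact table is assumed to be partitioned by `date_id`. The dimension table is scanned and filtered twice, once to produce the pruning keys and once for the join itself.

```python
class CountingTable:
    """Hypothetical helper: a list of rows that counts how often it is scanned."""

    def __init__(self, rows):
        self.rows = rows
        self.scans = 0

    def scan(self):
        self.scans += 1
        return list(self.rows)


date_dim = CountingTable([
    {"date_id": 1, "day_of_week": "Mon"},
    {"date_id": 2, "day_of_week": "Tue"},
    {"date_id": 8, "day_of_week": "Mon"},
])
sales_partitions = {
    1: [{"date_id": 1, "amount": 10}],
    2: [{"date_id": 2, "amount": 7}],
    8: [{"date_id": 8, "amount": 30}],
}


def is_monday(d):
    return d["day_of_week"] == "Mon"


# First scan + filter: a subquery whose only job is to produce the partition ids to keep.
keep = {d["date_id"] for d in date_dim.scan() if is_monday(d)}
fact_rows = [row for pid, rows in sales_partitions.items() if pid in keep for row in rows]

# Second scan + filter: the join's own dimension side repeats the same work.
dim_rows = {d["date_id"]: d for d in date_dim.scan() if is_monday(d)}
joined = [{**row, **dim_rows[row["date_id"]]} for row in fact_rows]

print(date_dim.scans)  # 2
print(len(joined))     # 2
```

Whether the extra scan pays off depends on how selective the dimension filter turns out to be, which is why the slide notes that the decision rests on heuristics and statistics that may be inaccurate.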

Broadcast Hash Join
[Diagram: FileScan and FileScan with Dim Filter (non-partitioned dataset), BroadcastExchange, Broadcast Hash Join]
- Execute the build side of the join
- Place the result in a broadcast variable
- Broadcast the build side results
- Execute the join locally without a shuffle
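
The four steps listed on this slide can be sketched as a toy, single-process hash join in plain Python. This is an illustration only, not Spark's implementation: the function name and the data are hypothetical, "broadcasting" is simulated by sharing one dictionary, and the sketch assumes each join key appears at most once on the build side.

```python
def broadcast_hash_join(fact_partitions, dim_rows, key):
    """Toy broadcast hash join: build once, then probe locally per partition."""
    # Execute the build side and place the result in a hash table
    # (stands in for the broadcast variable).
    build_table = {row[key]: row for row in dim_rows}

    # Every partition probes the same read-only table, so no fact rows
    # have to be moved between partitions (no shuffle).
    output = []
    for partition in fact_partitions:
        for row in partition:
            match = build_table.get(row[key])
            if match is not None:
                output.append({**row, **match})
    return output


fact_partitions = [
    [{"date_id": 1, "amount": 10}, {"date_id": 2, "amount": 7}],
    [{"date_id": 8, "amount": 30}],
]
filtered_dim = [
    {"date_id": 1, "day_of_week": "Mon"},
    {"date_id": 8, "day_of_week": "Mon"},
]

result = broadcast_hash_join(fact_partitions, filtered_dim, "date_id")
print([r["amount"] for r in result])  # [10, 30]
```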

Reusing Broadcast Results
[Diagram: FileScan (partition files with multi-columnar data) with Dynamic Filter, FileScan with Dim Filter (non-partitioned dataset), BroadcastExchange, Broadcast Hash Join]
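
A minimal sketch of the reuse idea, as a plain-Python toy model with hypothetical names (not Spark's implementation). It assumes the fact table is partitioned by the join key. The hash table built for the join is also used as the dynamic filter on the fact scan, so the dimension table is scanned and filtered only once.

```python
# Toy model (not Spark code); all names are hypothetical.
sales_partitions = {
    1: [{"date_id": 1, "amount": 10}],
    2: [{"date_id": 2, "amount": 7}],
    8: [{"date_id": 8, "amount": 30}],
}
date_dim = [
    {"date_id": 1, "day_of_week": "Mon"},
    {"date_id": 2, "day_of_week": "Tue"},
    {"date_id": 8, "day_of_week": "Mon"},
]

dim_scans = 0


def build_broadcast_table():
    """Scan and filter the dimension table once; return a hash table keyed by date_id."""
    global dim_scans
    dim_scans += 1
    return {d["date_id"]: d for d in date_dim if d["day_of_week"] == "Mon"}


broadcast_table = build_broadcast_table()

# Dynamic filter: the keys of the broadcast table select the fact partitions to read.
partitions_read = sorted(pid for pid in sales_partitions if pid in broadcast_table)

# The join probes the very same table, so the dimension side is not recomputed.
joined = [
    {**row, **broadcast_table[row["date_id"]]}
    for pid in partitions_read
    for row in sales_partitions[pid]
]

print(partitions_read)  # [1, 8]
print(dim_scans)        # 1
```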

Experimental Setup
Workload Selection
- TPC-DS scale factors 1-10 TB
Cluster Configuration
- 10 i3.xlarge machines
Data-Processing Framework
- Apache Spark 3.0

TPC-DS 1 TB
60 of 102 queries show a speedup between 2x and 18x

Top Queries
Very good speedups for top 10% of the queries

Data Skipped
Very effective in skipping data

TPC-DS 10 TB
Even better speedups at 10x the scale

Query 98
SELECT i_item_desc, i_category, i_class, i_current_price,
       sum(ss_ext_sales_price) AS itemrevenue,
       sum(ss_ext_sales_price) * 100 / sum(sum(ss_ext_sales_price)) OVER
           (PARTITION BY i_class) AS revenueratio
FROM store_sales, item, date_dim
WHERE ss_item_sk = i_item_sk
  AND i_category IN ('Sports', 'Books', 'Home')
  AND ss_sold_date_sk = d_date_sk
  AND cast(d_date AS date) BETWEEN cast('1999-02-22' AS date)
      AND (cast('1999-02-22' AS date) + interval '30' day)
GROUP BY i_item_id, i_item_desc, i_category, i_class, i_current_price
ORDER BY i_category, i_class, i_item_id, i_item_desc, revenueratio

TPC-DS 10 TB
Highly selective dimension filter that retains only one month out of 5 years of data
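
A rough back-of-the-envelope calculation shows why this filter is so selective. The sketch assumes, for illustration only, that the fact table has one partition per sold date and that five years correspond to 5 x 365 days; neither detail is stated on the slides.

```python
from datetime import date, timedelta

# Date range from the Query 98 filter; BETWEEN is inclusive on both ends.
start = date(1999, 2, 22)
end = start + timedelta(days=30)
days_kept = (end - start).days + 1

# Assumption: five years of daily partitions, ignoring leap days.
total_days = 5 * 365

fraction_kept = days_kept / total_days
print(days_kept)                # 31
print(f"{fraction_kept:.1%}")   # 1.7%
```

Under these assumptions, fewer than 2% of the date partitions can contribute to the join result, so almost all of the fact table can be skipped.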

Conclusion
Apache Spark 3.0 introduces Dynamic Partition Pruning
- Strawman approach at logical planning time
- Optimized approach during execution time
Significant speedup, exhibited in many TPC-DS queries
With this optimization, Spark can now work well with star-schema queries, making it unnecessary to ETL denormalized tables.

Thanks!
Bogdan Ghit - linkedin.com/in/bogdanghit
Juliusz Sompolski - linkedin.com/in/juliuszsompolski
