Improving Spark SQL Performance by 30%:
How We Optimize Parquet Filter Pushdown
and Parquet Reader
Ke Sun (sunke3296@gmail.com)
Senior Engineer of Data Engine Team, ByteDance
Who We Are
▪ Data Engine team of ByteDance
▪ Build a one-stop OLAP platform on which users can analyze EB-level data by writing SQL, without caring about the underlying execution engine
What We Do
▪ Manage Spark SQL / Presto / Hive workloads
▪ Offer an open API and a serverless OLAP platform
▪ Optimize the Spark SQL / Presto / Hudi / Hive engines
▪ Design the data architecture for most business lines at ByteDance
Agenda
Spark SQL at ByteDance
How Spark Reads Parquet
Optimization of Parquet Filter
Pushdown and Parquet Reader
at ByteDance
Spark SQL at ByteDance
Spark SQL at ByteDance
2016: Small Scale Experiments
2017: Ad-hoc Workload
2018: Few ETL Workloads
2019: Full-production deployment & migration
2020: Main engine in the DW area
Spark SQL at ByteDance
▪ Spark SQL covers 98%+ of the ETL workload
▪ Parquet is the default file format in the data warehouse, and the vectorized reader is enabled by default
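For reference, both behaviors correspond to standard Spark SQL settings (shown here with their defaults in recent Spark versions):

```sql
-- Push filters down to the Parquet reader (default: true)
SET spark.sql.parquet.filterPushdown=true;
-- Use the vectorized Parquet reader for batch decoding (default: true)
SET spark.sql.parquet.enableVectorizedReader=true;
```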
How Spark Reads Parquet
▪ Overview of Parquet
▪ Procedure of Parquet Reading
▪ What We Can Optimize
How Spark Reads Parquet
▪ Overview of Parquet
▪ Column pruning
▪ More efficient compression
▪ Parquet can skip useless data via Spark filter pushdown, using Footer & RowGroup statistics
https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif
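The RowGroup-skipping idea can be sketched outside Spark. The following is a simplified model, not the actual Parquet or Spark implementation, of how a reader compares an equality predicate against per-RowGroup min/max statistics:

```python
# Simplified model of RowGroup skipping via min/max statistics.
# Illustrative only -- not the actual Parquet/Spark code.

def may_contain(stats, value):
    """A RowGroup may contain `value` only if it lies in [min, max]."""
    return stats["min"] <= value <= stats["max"]

def row_groups_to_read(row_groups, column, value):
    """Return the indices of RowGroups that cannot be skipped."""
    return [
        i for i, rg in enumerate(row_groups)
        if may_contain(rg[column], value)
    ]

row_groups = [
    {"category": {"min": "a1", "max": "z1"}},
    {"category": {"min": "a2", "max": "z2"}},
    {"category": {"min": "a3", "max": "z3"}},
]

# With overlapping ranges, 'test' may appear in every RowGroup,
# so nothing can be skipped.
print(row_groups_to_read(row_groups, "category", "test"))  # [0, 1, 2]
```

This is exactly the situation shown on the next slides: when the min/max ranges of every RowGroup overlap, the statistics are not distinguishable and pushdown skips nothing.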
How Spark Reads Parquet
▪ Procedure of Parquet Reading
▪ VectorizedParquetRecordReader skips useless RowGroups using the pushed-down filters translated by ParquetFilters
▪ VectorizedParquetRecordReader builds a column reader for every target column, and these column readers read data together in batch
DataSourceScanExec.inputRDD
→ ParquetFileFormat.buildReaderWithPartitionValues
→ VectorizedParquetRecordReader.nextBatch
How Spark Reads Parquet
▪ Optimization – Statistics are not distinguishable
select * from table_name where date = '***' and category = 'test'
(date is the partition column and category is a predicate column)
In this example, Spark reads all 3 RowGroups because the statistics are not distinguishable.

RowGroup | min of category | max of category | read or not
RowGroup1 | a1 | z1 | Yes
RowGroup2 | a2 | z2 | Yes
RowGroup3 | a3 | z3 | Yes
Min/Max Statistics of RowGroup for a Parquet File
How Spark Reads Parquet
▪ Optimization – Statistics are not distinguishable
Parquet filter pushdown works poorly when the predicate columns are unsorted in the Parquet files, and this behavior is expected.
It is valuable to sort the commonly used predicate columns in Parquet files to reduce IO.
How Spark Reads Parquet
▪ Optimization – Spark reads too much unnecessary data
select col1 from table_name where date = '***' and col2 = 'test'
[Figure: a ParquetFile with RowGroup1, RowGroup2, and RowGroup3, each containing col1, col2, and col3]
▪ RowGroup1 is skipped by filter pushdown
▪ col3 is skipped by column pruning
▪ col1 and col2 are read together by the vectorized reader
How Spark Reads Parquet
▪ Optimization – Spark reads too much unnecessary data
select col1 from table_name where date = '***' and col2 = 'test'
[Figure: the same ParquetFile with RowGroup1, RowGroup2, and RowGroup3]
▪ Most of the data of col1 is unnecessary to read if the filter ratio of col2 = 'test' is very high
▪ It is valuable to read and filter the data of the filter columns first, and only then read the data of the other columns
Optimization of Parquet Filter Pushdown and
Parquet Reader at ByteDance
Optimization of Parquet Filter Pushdown
▪ Statistics are not distinguishable
▪ Goal: increase the discrimination of Parquet statistics
▪ Low overhead: sorting all data is expensive or even impossible
▪ Automation: users do not need to update their ETL jobs
LocalSort: add a SortExec node before the InsertIntoHiveTable node
▪ Which columns should be sorted?
▪ Analyze the query history and choose the most commonly used predicate columns
▪ Configure the sort columns as a table property of the Hive table; Spark SQL will read this property
▪ It is an automatic procedure without manual intervention
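As a sketch, the configuration could look like the statement below. The property key `spark.sql.sort.columns` is hypothetical; the talk does not state the exact key ByteDance uses:

```sql
-- Hypothetical table property recording the sort columns
-- chosen from the query history.
ALTER TABLE table_name SET TBLPROPERTIES ('spark.sql.sort.columns' = 'category');
```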
Optimization of Parquet Filter Pushdown
Before: … → Project → InsertIntoHiveTable
After: … → Project → SortExec → InsertIntoHiveTable
Optimization of Parquet Filter Pushdown
▪ Spark reads less data because the statistics are more discriminative
▪ The Parquet file size is much smaller because sorted data compresses better
▪ Only about 5% overhead
Spark reads only one RowGroup after sorting the data by column category

RowGroup | min of category | max of category | read or not
RowGroup1 | a1 | g1 | No
RowGroup2 | g2 | u2 | Yes
RowGroup3 | u3 | z3 | No
Min/Max Statistics of RowGroup for a Parquet File
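A quick simulation of why LocalSort helps: splitting sorted values into RowGroups yields disjoint min/max ranges, so most groups can be skipped. This is illustrative only, not how Spark or Parquet actually writes files:

```python
# Illustrative only: compare RowGroup min/max discrimination
# for unsorted vs. sorted data.

def make_row_groups(values, group_size):
    """Split values into fixed-size groups and record min/max per group."""
    groups = [values[i:i + group_size] for i in range(0, len(values), group_size)]
    return [{"min": min(g), "max": max(g)} for g in groups]

def groups_to_read(stats, value):
    """Indices of groups whose [min, max] range may contain `value`."""
    return [i for i, s in enumerate(stats) if s["min"] <= value <= s["max"]]

values = ["m", "z", "a", "t", "b", "n", "c", "u", "o"]

unsorted_stats = make_row_groups(values, 3)        # ranges overlap heavily
sorted_stats = make_row_groups(sorted(values), 3)  # ranges are disjoint

print(groups_to_read(unsorted_stats, "t"))  # [0, 1, 2] -- every group read
print(groups_to_read(sorted_stats, "t"))    # [2] -- only one group read
```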
Optimization of Parquet Reader
▪ Spark reads too much unnecessary data
▪ Filter unnecessary data as early as possible
Prewhere: read the data of the filter columns in batch first, and skip the other columns if no row matches (a good idea borrowed from ClickHouse)
Optimization of Parquet Reader
▪ Split the Parquet reader into 2 readers: a FilterReader for the filter columns and a NonFilterReader for the other columns
[Figure: VectorizedParquetRecordReader reads col1, col2, and col3 of every RowGroup; PrewhereVectorizedParquetRecordReader splits the same work between a FilterReader and a NonFilterReader]
Optimization of Parquet Reader
▪ FilterReader reads the data of the filter columns in batch
▪ The filter expressions are applied to the data
▪ If no row matches, the batch is skipped
▪ If some rows match, NonFilterReader reads the other columns in batch and skips the unnecessary data
▪ The data is unioned and returned in batch
Potential Benefit:
▪ Skip RowGroup
▪ Skip Page
▪ Skip Decoding
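The Prewhere flow above can be sketched as follows. This is a simplified model in plain Python; the real readers operate on Parquet RowGroups, pages, and encoded column batches, not Python lists:

```python
# Simplified model of the Prewhere split: a FilterReader scans only the
# filter column; the NonFilterReader is consulted only for matching rows.
# Illustrative only -- not the actual PrewhereVectorizedParquetRecordReader.

def prewhere_read(filter_batches, other_batches, predicate):
    """filter_batches[i] and other_batches[i] hold the same rows of batch i.

    Returns (rows materialized from the non-filter columns, result rows).
    """
    rows_read = 0
    result = []
    for fcol, ocol in zip(filter_batches, other_batches):
        mask = [predicate(v) for v in fcol]
        if not any(mask):
            continue                  # whole batch skipped: no extra IO/decoding
        rows_read += sum(mask)        # only matched rows are materialized
        result.extend(
            (o, f) for o, f, m in zip(ocol, fcol, mask) if m
        )
    return rows_read, result

col2_batches = [["x", "y"], ["test", "x"], ["y", "test"]]  # filter column
col1_batches = [[1, 2], [3, 4], [5, 6]]                    # projected column

rows_read, rows = prewhere_read(col2_batches, col1_batches,
                                lambda v: v == "test")
print(rows_read)  # 2 -- instead of 6 rows with a plain vectorized reader
print(rows)       # [(3, 'test'), (6, 'test')]
```

When the filter ratio is high (few matching rows), most batches of the non-filter columns are never materialized, which is exactly the saving the slide describes.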
Optimization of Parquet Reader
Supported Data Type of Filter Column:
▪ ByteType
▪ ShortType
▪ IntegerType
▪ LongType
▪ FloatType
▪ DoubleType
▪ StringType
Supported Filter Type:
▪ >
▪ >=
▪ <
▪ <=
▪ =
▪ In
▪ isNull
▪ isNotNull
Databricks simplifies data and AI
so data teams can innovate faster
Feedback
Your feedback is important to us.
Don't forget to rate and review the sessions.