The document discusses optimizations made to Spark SQL performance when working with Parquet files at ByteDance. It describes how Spark reads Parquet files and identifies two main areas for optimization: Parquet filter pushdown and the Parquet reader. For filter pushdown, sorting commonly used predicate columns improved the min/max statistics and reduced data reads by 30%. For the reader, splitting it so that filter columns are read and filtered first prevents loading unnecessary data from the other columns. These changes improved Spark SQL performance at ByteDance without requiring any changes to user jobs.
Improving Spark SQL Performance by 30%: How We Optimize Parquet Filter Pushdown and Parquet Reader
1. Improving Spark SQL Performance by 30%:
How We Optimize Parquet Filter Pushdown
and Parquet Reader
Ke Sun (sunke3296@gmail.com)
Senior Engineer of Data Engine Team, ByteDance
2. Who We Are
▪ Data Engine team of ByteDance
▪ We build a one-stop OLAP platform on which users can analyze EB-level data by writing SQL, without caring about the underlying execution engine
3. What We Do
▪ Manage Spark SQL / Presto / Hive workloads
▪ Offer an open API and a serverless OLAP platform
▪ Optimize the Spark SQL / Presto / Hudi / Hive engines
▪ Design the data architecture for most business lines in ByteDance
4. Agenda
Spark SQL at ByteDance
How Spark Reads Parquet
Optimization of Parquet Filter
Pushdown and Parquet Reader
at ByteDance
6. Spark SQL at ByteDance
2016: Small-scale experiments
2017: Ad-hoc workload
2018: A few ETL workloads
2019: Full-production deployment & migration
2020: Main engine in the DW area
7. Spark SQL at ByteDance
▪ Spark SQL covers 98%+ ETL workload
▪ Parquet is the default file format in data warehouse and
vectorizedReader is also enabled by default
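For reference, these defaults correspond to the stock Spark configuration keys below (both are Spark's out-of-the-box values in recent releases):

```properties
spark.sql.sources.default=parquet
spark.sql.parquet.enableVectorizedReader=true
```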
9. How Spark Reads Parquet
▪ Overview of Parquet
▪ Procedure of Parquet Reading
▪ What We Can Optimize
10. How Spark Reads Parquet
▪ Overview of Parquet
Column Pruning
More efficient compression
Parquet can skip useless data via Spark filter pushdown, using Footer & RowGroup statistics
https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif
11. How Spark Reads Parquet
▪ Procedure of Parquet Reading
▪ VectorizedParquetRecordReader skips useless RowGroups via filter pushdown, with predicates translated by ParquetFilters
▪ VectorizedParquetRecordReader builds a column reader for every target column, and these column readers read data together in batches
DataSourceScanExec.inputRDD
→ ParquetFileFormat.buildReaderWithPartitionValues
→ VectorizedParquetRecordReader.nextBatch
12. How Spark Reads Parquet
▪ Optimization – Statistics are not distinguishable
select * from table_name where date = '***' and category = 'test'
(date is a partition column and category is a predicate column)
For example, Spark reads all 3 RowGroups because the statistics are not distinguishable:
            min of category   max of category   read or not
RowGroup1   a1                z1                Yes
RowGroup2   a2                z2                Yes
RowGroup3   a3                z3                Yes
Min/Max Statistics of RowGroups for a Parquet File
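The pruning decision can be sketched with a toy min/max check in plain Python (a simplified model, not Spark's or parquet-mr's actual code; row-group names and values follow the table above):

```python
# Toy model of RowGroup pruning for the predicate: category = 'test'.
# A group can be skipped only when the constant falls outside [min, max].

def can_skip(stats, value):
    lo, hi = stats
    return not (lo <= value <= hi)

# Statistics from the slide: data is unsorted, so every group spans a1..z*.
row_groups = {
    "RowGroup1": ("a1", "z1"),
    "RowGroup2": ("a2", "z2"),
    "RowGroup3": ("a3", "z3"),
}

to_read = [name for name, st in row_groups.items() if not can_skip(st, "test")]
print(to_read)  # all three RowGroups must be read
```

Because 'test' falls inside every [min, max] range, the statistics eliminate nothing and all three groups are read.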
13. How Spark Reads Parquet
▪ Optimization – Statistics are not distinguishable
Parquet filter pushdown works poorly when the predicate columns are unsorted in Parquet files; this behavior is expected.
It is worthwhile to sort the commonly used predicate columns in Parquet files to reduce IO.
14. How Spark Reads Parquet
▪ Optimization – Spark reads too much unnecessary data
select col1 from table_name where date = '***' and col2 = 'test'
col1 col2 col3
col1 col2 col3
col1 col2 col3
ParquetFile
RowGroup1
RowGroup2
RowGroup3
▪ RowGroup1 is skipped by filter pushdown
▪ col3 is skipped by column pruning
▪ col1 and col2 are read together by the vectorized reader
15. How Spark Reads Parquet
▪ Optimization – Spark reads too much unnecessary data
select col1 from table_name where date = '***' and col2 = 'test'
col1 col2 col3
col1 col2 col3
col1 col2 col3
ParquetFile
RowGroup1
RowGroup2
RowGroup3
▪ Most of col1's data does not need to be read
if the filter ratio of col2 = 'test' is very high
▪ It is valuable to first read & filter data by the filter columns
and then read the data of the other columns
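This two-step read can be sketched in plain Python (a toy model with invented batch contents; real readers operate on encoded Parquet pages, and col1/col2 follow the query above):

```python
# Toy model: materialize only the filter column first, evaluate the
# predicate, then decode the projected column only for matching positions.

batch = {
    "col1": ["v0", "v1", "v2", "v3"],
    "col2": ["x", "test", "x", "test"],
}

# Step 1: read the filter column (col2) and build a selection mask.
mask = [v == "test" for v in batch["col2"]]

# Step 2: decode col1 only where the mask matches; other rows are skipped.
result = [v for v, keep in zip(batch["col1"], mask) if keep]
print(result)  # ['v1', 'v3']
```

When the filter is highly selective, most of col1 never needs to be decoded at all.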
17. Optimization of Parquet Filter Pushdown
▪ Statistics are not distinguishable
▪ Increase parquet statistics discrimination
▪ Low overhead: sorting all data is expensive or even impossible
▪ Automation: users do not need to update ETL jobs
LocalSort: Add a SortExec node before InsertIntoHiveTable node
18. ▪ Which columns should be sorted?
▪ Analyze the historical queries and choose the most commonly used predicate columns
▪ Store the sort columns as a table property of the Hive table; Spark SQL will read this property
▪ It is an automatic procedure without manual intervention
Optimization of Parquet Filter Pushdown
Before: … → Project → InsertIntoHiveTable
After:  … → Project → SortExec → InsertIntoHiveTable
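The offline column-selection step could be sketched as follows (a toy model: the query history is invented, and real predicate extraction would use Spark's SQL parser rather than a regex):

```python
# Toy sketch: scan historical queries, count which columns appear in
# equality predicates, and pick the most common ones as sort columns.
import re
from collections import Counter

history = [
    "select col1 from t where date = '***' and category = 'test'",
    "select *    from t where category = 'abc'",
    "select col2 from t where category = 'x' and col2 = 'y'",
]

counts = Counter()
for q in history:
    where_clause = q.split("where", 1)[1]
    counts.update(re.findall(r"(\w+)\s*=", where_clause))

counts.pop("date", None)  # partition columns are pruned before pushdown
print(counts.most_common(1))  # [('category', 3)]
```

The winning columns would then be written into the Hive table property that the SortExec insertion reads.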
19. Optimization of Parquet Filter Pushdown
▪ Spark reads less data thanks to more selective statistics
▪ Parquet files are much smaller because sorted data compresses more efficiently
▪ Only about 5% overhead
Spark reads only one RowGroup after sorting the data by the category column:
            min of category   max of category   read or not
RowGroup1   a1                g1                No
RowGroup2   g2                u2                Yes
RowGroup3   u3                z3                No
Min/Max Statistics of RowGroups for a Parquet File
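The effect can be sketched with a toy min/max pruning check in plain Python (values from the table above; not Spark's actual code):

```python
# Toy model of RowGroup pruning after LocalSort on `category`:
# sorting makes the per-group ranges disjoint, so min/max becomes selective.

def can_skip(stats, value):
    lo, hi = stats
    return not (lo <= value <= hi)

sorted_row_groups = {
    "RowGroup1": ("a1", "g1"),
    "RowGroup2": ("g2", "u2"),
    "RowGroup3": ("u3", "z3"),
}

to_read = [n for n, st in sorted_row_groups.items() if not can_skip(st, "test")]
print(to_read)  # only RowGroup2 contains 'test'
```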
20. Optimization of Parquet Reader
▪ Spark reads too much unnecessary data
▪ Filter unnecessary data as soon as possible
Prewhere: first read the data of the filter columns in batches, and skip the
other columns for unmatched rows (an idea borrowed from ClickHouse)
21. Optimization of Parquet Reader
▪ Split the Parquet reader into two readers: a FilterReader for the filter
columns and a NonFilterReader for the other columns
col1 col2 col3
col1 col2 col3
col1 col2 col3
RowGroup1
RowGroup2
RowGroup3
VectorizedParquetRecordReader
col1 col2 col3
col1 col2 col3
col1 col2 col3
FilterReader
col1 col2 col3
col1 col2 col3
col1 col2 col3
NonFilterReader
PrewhereVectorizedParquetRecordReader
22. Optimization of Parquet Reader
Potential benefits:
▪ Skip RowGroup
▪ Skip Page
▪ Skip Decoding

FilterReader reads data in batch
→ Apply filter expressions to the data
→ match?
  N → skip this batch
  Y → NonFilterReader reads data in batch and skips unnecessary data
→ Union the data and return it in batch
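The batch loop can be modeled in plain Python (a toy illustration with invented batch contents; a real implementation skips encoded RowGroups and pages, not Python lists):

```python
# Toy model of the split reader: the FilterReader produces the filter
# column (col2) in batches; when nothing in a batch matches, the
# NonFilterReader never decodes that batch's pages of col1.

filter_batches = [["x", "y"], ["test", "x"]]  # col2, read by FilterReader
other_batches  = [["a", "b"], ["c", "d"]]     # col1, decoded lazily

decoded_other = 0
out = []
for fb, ob in zip(filter_batches, other_batches):
    mask = [v == "test" for v in fb]
    if not any(mask):
        continue            # whole batch skipped: col1 is never decoded
    decoded_other += 1      # NonFilterReader decodes only this batch
    out.extend(v for v, keep in zip(ob, mask) if keep)

print(out, decoded_other)  # ['c'] 1 -- only one of two batches was decoded
```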
23. Optimization of Parquet Reader
Supported Data Types of Filter Column:
▪ ByteType
▪ ShortType
▪ IntegerType
▪ LongType
▪ FloatType
▪ DoubleType
▪ StringType

Supported Filter Types:
▪ >
▪ >=
▪ <
▪ <=
▪ =
▪ In
▪ isNull
▪ isNotNull
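These supported filters can be modeled as simple per-batch predicates (a plain-Python illustration with None standing in for SQL NULL; not the reader's actual code, and `evaluate` is a name invented here):

```python
# Toy model: evaluate one supported filter over a column batch,
# producing a boolean mask. NULL (None) never matches a comparison.
import operator

ops = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "=": operator.eq,
}

def evaluate(op, column, arg=None):
    if op == "isNull":
        return [v is None for v in column]
    if op == "isNotNull":
        return [v is not None for v in column]
    if op == "In":
        return [v in arg if v is not None else False for v in column]
    f = ops[op]
    return [f(v, arg) if v is not None else False for v in column]

col = [1, 5, None, 9]
print(evaluate(">", col, 4))        # [False, True, False, True]
print(evaluate("In", col, {1, 9}))  # [True, False, False, True]
print(evaluate("isNull", col))      # [False, False, True, False]
```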