ORC improvement in Apache Spark 2.3

1 © Hortonworks Inc. 2011–2018. All rights reserved
ORC Improvement in Apache Spark 2.3
Dongjoon Hyun
Principal Software Engineer @ Hortonworks Data Science Team
April 2018

Dongjoon Hyun
• Hortonworks
• Principal Software Engineer @ Data Science Team
• Apache Project
• Apache REEF Project Management Committee(PMC) Member & Committer
• Apache Spark Project Contributor
• GitHub
• https://github.com/dongjoon-hyun

Agenda
• What’s New in Apache Spark 2.3
• Previous ORC issues in Apache Spark
• Current Approach & Demo
• Performance & Limitation
• Future roadmap

• Vectorized ORC Reader
• Structured Streaming with ORC
• Schema evolution with ORC
• PySpark Performance Enhancements
with Apache Arrow and ORC
• Structured stream-stream joins
• Spark History Server V2
• Spark on Kubernetes
• Data source API V2
• Streaming API V2
• Continuous Structured Streaming
Processing
Major Features Experimental Features
What’s New in Apache Spark 2.3

Spark’s file-based data sources
• TEXT The simplest one with one string column schema
• CSV Popular for data science workloads
• JSON The most flexible one for schema changes
• PARQUET The only one with vectorized reader
• ORC Popular for shared Hive tables

Motivation
• TEXT The simplest one with one string column schema
• CSV Popular for data science workloads
• JSON The most flexible one for schema changes
• PARQUET The only one with vectorized reader
• ORC Popular for shared Hive tables
Fast
Flexible
Hive Table Access

Previous ORC Issues in Spark

Background – The history of Spark and ORC
• Before Apache ORC
• Hive 1.2.1 (2015 JUN)  SPARK-2883 (Hive ORC is used since Spark 1.4)
• After Apache ORC
• v1.0.0 (2016 JAN)
• v1.1.0 (2016 JUN)
• v1.2.0 (2016 AUG)
• v1.3.0 (2017 JAN)
• v1.4.0 (2017 MAY)  SPARK-21422 (Apache ORC is added since Spark 2.3)
• v1.4.1 (2017 OCT)  SPARK-22300
• v1.4.3 (2018 FEB)  SPARK-23340 (Spark 2.4)

Six Issue Categories
• ORC Writer Versions
• Performance
• Structured streaming
• Column names
• Hive tables and schema evolution
• Robustness

Issues with ORC Writer Versions
• ORIGINAL
• HIVE_8732 (2014) ORC string statistics are not merged correctly
• HIVE_4243 (2015) Fix column names in FileSinkOperator
• HIVE_12055(2015) Create row-by-row shims for the write path
• HIVE_13083(2016) Writing HiveDecimal can wrongly suppress
present stream
• ORC_101 (2016) Correct the use of the default charset in bloomfilter
• ORC_135 (2018) PPD for timestamp is wrong when reader/writer
timezones are different

Issues with performance
• Vectorized ORC Reader (SPARK-16060)
• Fast read partition-column only (SPARK-22712)
• Pushing down filters for DateType (SPARK-21787)

• `FileNotFoundException` at writing
empty partitions as ORC
• Create structured steam with ORC files
Write (SPARK-15474) Read (SPARK-22781)
Issues with structured streaming
spark.readStream.orc(path)

Issues with column names
• Unicode column names (SPARK-23072)
• Column names with dot (SPARK-21791)
• Should not create invalid column names (SPARK-21912)

Issues with Hive tables and schema evolution
• Support `ALTER TABLE ADD COLUMNS` (SPARK-21929)
• Introduced at Spark 2.2, but throws AnalysisException for ORC
• Support column positional mismatch (SPARK-22267)
• Return wrong result if ORC file schema is different from Hive MetaStore schema order
• `convertMetastore` ignore storage property (SPARK-22158, Fixed at 2.2.1)
• `convertMetastoreOrc` is introduced in Spark 2.0, but it had several issues.

Issues with robustness
• ORC metadata exceed ProtoBuf message size limit (SPARK-19109)
• NullPointerException on zero-size ORC file (SPARK-19809)
• Support `ignoreCorruptFiles` (SPARK-23049)
• Support `ignoreMissingFiles` (SPARK-23305)
• `FileNotFound` at file names with special chars (SPARK-22146, Fixed in 2.2.1)

Current Approach

Supports two ORC file formats
• Adding a new OrcFileFormat (SPARK-20682)
FileFormat
TextBasedFileFormat
ParquetFileFormat
OrcFileFormat
HiveFileFormat
JsonFileFormat
LibSVMFileFormat
CSVFileFormat
TextFileFormat
o.a.s.sql.execution.datasources
o.a.s.ml.source.libsvmo.a.s.sql.hive.orc
OrcFileFormat
`hive` OrcFileFormat
from Hive 1.2.1
`native` OrcFileFormat
with ORC 1.4.3

In Reality – Four cases for ORC Reader/Writer
`hive` Reader`native` Reader
`hive` Writer
`native` Writer
• New Data
• New Apps
• Best performance
(Vectorized Reader)
• New Data
• Old Apps
• Improved performance
(Non-vectorized Reader)
• Old Data
• New Apps
• Improved performance
(Vectorized Reader)
• Old Data
• Old Apps
• As-Is performance
(Non-vectorized Reader)
1
2
3
4

Performance – Single column scan from wide tables
Number of columns
Time
(ms)
1M rows with all BIGINT columns
0
200
400
600
800
1000
1200
100 200 300
native writer / native reader hive writer / native reader
native writer / hive reader hive writer / hive reader
4x 1
2
3
4
https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala

How to specify `native` OrcFileFormat directly
CREATE TABLE people (name string, age int)
USING org.apache.spark.sql.execution.datasources.orc
df.write
.format("org.apache.spark.sql.execution.datasources.orc")
.save(path)
spark.read
.format("org.apache.spark.sql.execution.datasources.orc")
.load(path)
Read Dataset
Write Dataset
Create ORC Table

Switch ORC implementation (SPARK-20728)
• spark.sql.orc.impl=native (default: `hive`)
USING ORC OPTIONS (orc.compress 'ZLIB')
spark.read.orc(path)
df.write.orc(path)
spark.read.format("orc").load (path)
df.write.format("orc").save(path)
Read/Write Dataset
Read/Write Dataset
Create ORC Table

Switch ORC implementation (SPARK-20728) – Cont.
• spark.sql.orc.impl=native (default: `hive`)
spark.readStream.orc(path)
spark.readStream.format("orc").load(path)
df.writeStream
.option("checkpointLocation", path1)
.format("orc")
.option("path", path2)
.start
Read/Write
Structured Stream

ORC Readers with `spark.sql.` configurations
orc.impl
# of cols <= codegen.maxFields
`native`
`hive` ORC Reader
`hive`
true
spark.sql.codegen.maxFields=100 (default)
false
`native` ORC Columnar Batch Reader
all atomic types
true
false
`native` ORC Record Reader
orc.enableVectorizedReader false
true

ORC Readers with `spark.sql.` configurations – Cont.
orc.enableVectorizedReader
Wrapping
ORC ColumnVector 
Spark OrcColumnVector
orc.copyBatchToSpark
true
false
Copying
Spark OnHeapColumnVector
true
columnVector.offheap.enabled
true
Copying
Spark OffHeapColumnVector
false
`native` ORC Columnar Batch Reader

Support vectorized read on Hive ORC Tables
• spark.sql.hive.convertMetastoreOrc=true (default: false)
• `spark.sql.orc.impl=native` is required, too.
STORED AS ORC
USING HIVE OPTIONS (fileFormat 'ORC', orc.compress 'gzip')
SPARK-23355

Schema evolution at reading file-based data sources
• Frequently, new files can have wider column types or new columns
• Before SPARK-21929, users drop and recreate ORC table with an updated schema.
• User-defined schema reduces schema inference cost and handles upcasting
• boolean -> byte -> short -> int -> long
• float -> double
spark.read.schema("col1 int").orc(path)
spark.read.schema("col1 long, col2 long").orc(path)

Schema evolution at reading file-based data sources – Cont.
1. Native Vectorized ORC Reader
2. Only safe change via upcasting
3. JSON is the most flexible for changing types
File Format TEXT CSV JSON ORC
`hive`
ORC
`native`1
PARQUET
Add Column At The End ✔️ ✔️ ✔️ ✔️ ✔️
Hide Trailing Column ✔️ ✔️ ✔️ ✔️ ✔️
Hide Column ✔️ ✔️ ✔️
Change Type2 ✔️ ✔️3 ✔️
Change Position ✔️ ✔️ ✔️

Demo 1
ORC configuration

Demo 2
PySpark with ORC

Performance

Micro Benchmark
• Target
• Apache Spark 2.3.0
• Apache ORC 1.4.1
• Machine
• MacBook Pro (2015 Mid)
• Intel® Core™ i7-4770JQ CPI @ 2.20GHz
• Mac OS X 10.13.4
• JDK 1.8.0_161

Performance – Single column scan from wide tables
Number of columns
Time
(ms)
1M rows with all BIGINT columns
0
200
400
600
800
1000
1200
100 200 300
native writer / native reader hive writer / hive reader
4x

Performance – Vectorized Read
0
500
1000
1500
2000
2500
TINYINT SMALLINT INT BIGINT FLOAT DOULBE
native hive
15M rows in a single-column table
Time
(ms)
10x
5x
11x

Performance – Partitioned table read
0
500
1000
1500
2000
2500
Data column Partition column Both columns
native hive
Time
(ms)
21x7x
15M rows in a partitioned table

Predicate Pushdown
0 2000 4000 6000 8000 10000 12000 14000 16000 18000
Select 10% rows (id < value)
Select all rows (id IS NOT NULL)
parquet native Time (ms)
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala
15M rows with 5 data columns and 1 sequential id column

Limitation
Future Roadmap

Limitation
• Spark vectorization supports atomic types only
• Limited simple schema evolution. JSON provides more
• boolean -> byte -> short -> int -> long
• float -> double
• `convertMetastore` ignores `STORED AS` table properties (SPARK-23355)
• Both ORC/Parquet

Future Roadmap – Apache Spark 2.4 (2018 Fall)
• Feature Parity for ORC with Parquet (SPARK-20901)
• Use `native` ORC implementation by default (SPARK-23456)
• Use ORC predicate pushdown by default (SPARK-21783)
• Use `convertMetastoreOrc` by default (SPARK-22279)
• Test ORC as default data source format (SPARK-23553)
• Test and support Bloom Filters (SPARK-12417)

Future Roadmap – On-going work
• Support VectorUDT/MatrixUDT (SPARK-22320)
• Support CHAR/VARCHAR Types
• Vectorized Writer with DataSource V2
• ALTER TABLE … CHANGE column type (SPARK-18727)

Summary
• Apache Spark 2.3 starts to take advantage of Apache ORC
• Native vectorized ORC reader
• boosts Spark ORC performance
• provides better schema evolution ability
• Structured streaming starts to work with ORC (both reader/writer)
• Spark is going to become faster and faster with ORC

Reference
• https://youtu.be/ZVSD9EsQl-8, ORC configuration in Apache Spark 2.3
• https://youtu.be/zJZ1gtzu-rs, Apache Spark 2.3 ORC with Apache Arrow
• https://community.hortonworks.com/articles/148917/orc-improvements-for-apache-
spark-22.html
• https://www.slideshare.net/Hadoop_Summit/performance-update-when-apache-orc-
met-apache-spark-81023199, Dataworks Summit 2017 Sydney
• https://www.slideshare.net/Hadoop_Summit/orc-file-optimizing-your-big-data,
Dataworks Summit 2017 San Jose

Questions?

Thank you

ORC improvement in Apache Spark 2.3

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Similar a ORC improvement in Apache Spark 2.3

Similar a ORC improvement in Apache Spark 2.3 (20)

Más de DataWorks Summit

Más de DataWorks Summit (20)

Último

Último (20)

ORC improvement in Apache Spark 2.3