Apache Spark is increasingly adopted as an alternative processing framework to MapReduce because it speeds up batch, interactive, and streaming analytics. Spark enables new analytics use cases such as machine learning and graph analysis with its rich, easy-to-use programming libraries. It also offers the flexibility to run analytics on data stored in Hadoop, across object stores, and within traditional databases. This makes Spark an ideal platform for accelerating cross-platform analytics on-premises and in the cloud. Building on the success of the Spark 1.x releases, Spark 2.x delivers major improvements in the areas of API, performance, and Structured Streaming. In this paper, we give a high-level view of the Apache Spark framework and then focus on what we consider to be very important improvements made in Apache Spark 2.x. We then share the results of a real-world benchmark effort, detail the Spark and environment configuration changes made in our lab, discuss the benchmark results, and provide a reference architecture example for those interested in taking Spark 2.x for their own test drive. This presentation stresses the value of upgrading from Spark 1 to Spark 2: performance testing shows a 2.3x improvement on SparkSQL workloads similar to TPC Benchmark™ DS (TPC-DS).
MARK LOCHBIHLER, Principal Architect, Hortonworks and VIPLAVA MADASU, Big Data Systems Engineer, Hewlett Packard Enterprise
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
1. Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
• Mark Lochbihler, Hortonworks - Principal Architect
• Viplava Madasu, HPE - Big Data Systems Engineer
San Jose, California
JUNE 17–21, 2018
Tuesday, June 19
4:00 PM - 4:40 PM
Executive Ballroom
210C/D/G/H
2. Today’s Agenda
• What’s New with Spark 2.x – Mark
• Spark Architecture
• Spark on YARN
• What’s New
• Spark 2.x Benchmark - Viplava
• What was Benchmarked
• Configuration and Tuning
• Infrastructure Used
• Results
• Questions / More Info – Mark and Viplava
3. Apache Spark
Apache Spark is a fast general-purpose engine for large-scale data
processing. Spark was developed in response to limitations in Hadoop’s
two-stage disk-based MapReduce processing framework.
Orchestration:
Spark’s standalone cluster manager, Apache Mesos, or Hadoop YARN
4. Spark on Hadoop YARN
YARN has the concept of labels for groupings of Hadoop Worker nodes.
Spark on YARN is an optimal way to schedule and run Spark jobs on a Hadoop cluster alongside a variety of
other data-processing frameworks, leveraging existing clusters using queue placement policies, and enabling
security by running on Kerberos-enabled clusters.
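• Client Mode: the Spark Driver runs in the Client process; the App Master and Executors run on the cluster
• Cluster Mode: the Spark Driver runs inside the App Master on the cluster, alongside the Executors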
5. Spark 2.x vs Spark 1.x
Apache Spark 2.x is a major release update of Spark 1.x and includes
significant updates in the following areas:
• API usability
• SQL 2003 support
• Performance improvements
• Structured streaming
• R UDF support
• Operational improvements
6. Spark 2.x – New and Updated APIs
Including:
• Unifying DataFrame and Dataset APIs providing type safety for
DataFrames
• New SparkSession API with a new entry point that replaces the old
SQLContext and HiveContext for DataFrame and Dataset APIs
• New streamlined configuration API for SparkSession
• New improved Aggregator API for typed aggregation in Datasets
7. Spark 2.x – Improved SQL Functionality
• ANSI SQL 2003 support
• Enables running all 99 TPC-DS queries
• A native SQL parser that supports both ANSI-SQL as well as Hive QL
• Native DDL command implementations
• Subquery support
• Native CSV data source
• Off-heap memory management for both caching and runtime
execution
• Hive-style bucketing support
8. Spark 2.x – Performance Improvements
• By implementing a new technique called “whole stage code
generation”, Spark 2.x improves the performance 2-10 times for
common operators in SQL and DataFrames.
• Other performance improvements include:
• Improved Parquet scan throughput through vectorization
• Improved ORC performance
• Many improvements in the Catalyst query optimizer for common workloads
• Improved window function performance via native implementations
for all window functions.
9. Spark 2.x – Spark Machine Learning API
• Spark 2.x replaces the RDD-based APIs in the spark.mllib package (put in
maintenance mode) with the DataFrame-based API in the spark.ml
package.
• New features in the Spark 2.x Machine Learning API include:
• ML persistence to support saving and loading ML models and Pipelines
• New MLlib APIs in R for generalized linear models
• Naive Bayes
• K-Means Clustering
• Survival regression
• New MLlib APIs in Python for
• LDA, Gaussian Mixture Model, Generalized Linear Regression, etc.
10. Spark 2.x – Spark Streaming
• Spark 2.x introduced a new high-level streaming API, called
Structured Streaming, built on top of Spark SQL and the Catalyst
optimizer.
• Structured Streaming enables users to program against streaming
sources and sinks using the same DataFrame/Dataset API as in static
data sources, leveraging the Catalyst optimizer to automatically
incrementalize the query plans.
11. Hortonworks Data Platform 2.6.5 – Just Released
HDP 2.6.5 / 3.0 includes Apache Spark 2.3
• ORC/Parquet feature parity
– Spark extends its vectorized read capability to ORC data sources.
– Structured Streaming officially supports the ORC data source, with API and documentation.
• Python Pandas UDFs, which perform well and are easy to use for Pandas users. This feature supports financial analysis use cases.
• Structured Streaming now supports stream-stream joins.
• Structured Streaming with millisecond latency (Alpha). The new continuous processing mode provides the best performance by minimizing latency, without waiting in idle status.
12. Evaluation of Spark SQL with Spark 2.x versus Spark 1.6
• Benchmark Performed
• Hive testbench, which is similar to TPC-DS benchmark
• Tuning for the benchmark
13. Why Cluster tuning matters
• Spark/Hadoop default configurations are not optimal for most enterprise
applications
• Large number of configuration parameters
• Tuning cluster will benefit all the applications
• Can further tune job level configuration
• More important if using disaggregated compute/storage layers as in HPE
Reference Architecture
• Useful for cloud too
14. Factors to consider for Spark performance tuning
• Hardware
• CPU, Memory, Storage systems, Local disks, Network
• Hadoop configuration
• HDFS
• YARN
• Spark configuration
• Executor cores, Executor memory, Shuffle partitions, Compression etc.
15. General Hardware Guidelines
• Sizing hardware for Spark depends on the use case, but Spark benefits from
• More CPU cores
• More memory
• Flash storage for temporary storage
• Faster network fabric
• CPU Cores
• Spark scales well to tens of CPU cores per machine
• Most Spark applications are CPU bound, so provision at least 8-16 cores per machine.
• Memory
• Spark can make use of hundreds of gigabytes of memory per machine
• Allocate at most 75% of the memory to Spark; leave the rest for the operating
system and buffer cache.
• The Storage tab of Spark’s monitoring UI shows how much memory a cached dataset occupies, which helps with sizing.
• Use at most 200GB per executor.
16. General Hardware Guidelines …
• Network
• For Group-By, Reduce-By, and SQL join operations, network performance
becomes important due to the Shuffles involved
• 10 Gigabit network is the recommended choice
• Local Disks
• Spark uses local disks to store data that doesn’t fit in RAM, as well as to preserve
intermediate output between stages
• SSDs are recommended
• Mount disks with noatime option to reduce unnecessary writes
18. Useful HDFS configuration settings
• Increase the dfs.blocksize value to allow more data to be processed by
each map task
• Also reduces NameNode memory consumption
• dfs.blocksize 256/512MB
• Increase the dfs.namenode.handler.count value to better manage
multiple HDFS operations from multiple clients
• dfs.namenode.handler.count 100
• To eliminate timeout exceptions (java.io.IOException: Unable to close file
because the last block does not have enough number of replicas),
19. Useful YARN configuration settings
• YARN is the popular cluster manager for Spark on Hadoop, so it is
important that YARN and Spark configurations are tuned in tandem.
• Settings of Spark executor memory and executor cores result in
allocation requests to YARN with the same values and YARN should be
configured to accommodate the desired Spark settings
• Amount of physical memory that can be allocated for containers per
node
• yarn.nodemanager.resource.memory-mb 384 GiB
• Amount of vcores available on a compute node that can be allocated for
containers
• yarn.nodemanager.resource.cpu-vcores 48
20. YARN tuning …
• Number of YARN containers depends on the nature of the workload
• Assuming total of 384 GiB on each node, a workload that needs 24 GiB containers
will result in 16 total containers
• Assuming 12 worker nodes, number of 24 GiB containers = 16 * 12 – 1 = 191
• One container per YARN application master
• General guideline is to configure containers in a way that maximizes the
utilization of the memory and vcores on each node in the cluster
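The container arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not part of the benchmark tooling: the function name and parameters are hypothetical, and it assumes that memory (not vcores) is the limiting resource and that each application master occupies one container of the same size as the others.

```python
def yarn_container_capacity(node_memory_gib, container_gib, worker_nodes, app_masters=1):
    """Return (containers per node, containers left for tasks across the cluster).

    Hypothetical helper: assumes memory is the only constraint and that each
    YARN application master occupies one container of the same size.
    """
    per_node = node_memory_gib // container_gib   # whole containers that fit on one node
    cluster_total = per_node * worker_nodes       # containers across all worker nodes
    return per_node, cluster_total - app_masters  # reserve one per application master


# Values from the slide: 384 GiB nodes, 24 GiB containers, 12 worker nodes.
per_node, usable = yarn_container_capacity(384, 24, 12)
print(per_node, usable)  # 16 191
```

Running more than one application at a time would reserve one container per application master, which the app_masters parameter is meant to approximate.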
21. YARN tuning …
• Location of YARN intermediate files on the compute nodes
• yarn.nodemanager.local-dirs /data1/hadoop/yarn/local, /data2/hadoop/yarn/local,
/data3/hadoop/yarn/local, /data4/hadoop/yarn/local
• Setting of spark.local.dir is ignored for YARN cluster mode
• The node-locality-delay specifies how many scheduling intervals to let
pass attempting to find a node local slot to run on prior to searching for a
rack local slot
• Important for small jobs that do not have a large number of tasks as it will better
utilize the compute nodes
• yarn.scheduler.capacity.node-locality-delay 1
22. Tuning Spark – Executor cores
• Unlike Hadoop MapReduce where each map or reduce task is always started in a new
process, Spark can efficiently use process threads (cores) to distribute task processing
• Results in a need to tune Spark executors with respect to the amount of memory
and number of cores each executor can use
• Has to work within the configuration boundaries of YARN
• Number of cores per executor can be controlled by
• the configuration setting spark.executor.cores
• the --executor-cores option of the spark-submit command
• The default is 1 for Spark on YARN
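As a rough sketch of how these options fit together on the command line, the Python helper below assembles a spark-submit invocation for YARN. The helper name, the application file benchmark_job.py, and the numeric values are placeholders chosen for illustration, not settings taken from the benchmark; only the spark-submit options themselves are standard.

```python
import shlex


def spark_submit_command(app, executor_cores, executor_memory, num_executors):
    """Build a spark-submit command line for YARN cluster mode (illustrative helper)."""
    args = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--num-executors", str(num_executors),
        "--executor-cores", str(executor_cores),  # same effect as spark.executor.cores
        "--executor-memory", executor_memory,     # same effect as spark.executor.memory
        app,
    ]
    # Quote each argument so the result can be pasted into a shell.
    return " ".join(shlex.quote(a) for a in args)


# Placeholder values: adjust to what YARN is configured to grant.
print(spark_submit_command("benchmark_job.py", executor_cores=4,
                           executor_memory="16g", num_executors=10))
```

The requested cores and memory still have to fit within the YARN container limits described earlier, otherwise the allocation request is rejected or capped.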
23. Tuning Spark – Executor cores
• Simplest but inefficient approach would be to configure one executor per core and divide the memory
equally among the number of executors
• Since each partition cannot be computed on more than one executor, the size of each partition is
limited and causes memory problems, or spilling to disk for shuffles
• If the executors have only one core, then at most one task can run in each executor, which throws
away the benefits of broadcast variables, which have to be sent to each executor once.
• Each executor has some memory overhead (minimum of 384MB), so many small
executors add up to a lot of memory overhead
• Giving many cores to each executor also has issues
• GC issues: a larger JVM heap delays the time until a GC event is triggered, resulting in
longer GC pauses
• Poor HDFS throughput because of handling many concurrent threads
• spark.executor.cores – experiment and set this based on your workloads. We found 9
was the right setting for this configuration and bench test in our lab.
24. Tuning Spark – Memory
• Memory for each Spark job is application specific
• Configure Executor memory in proportion to the number of partitions and cores per
executor
• Divide the total amount of memory on each node by the number of executors on the node
• Should be less than the maximum YARN container size - so YARN maximum container size may
need to be adjusted accordingly
• Configuration setting spark.executor.memory or the --executor-memory option of the
spark-submit command
• JVM runs into issues with very large heaps (above 80GB).
• Spark Driver memory
• If the driver collects too much data, the job may run into OOM errors.
• Increase the driver memory using spark.driver.memory, and raise the cap on collected results using spark.driver.maxResultSize
25. Spark 2.x – Memory Model
• Each executor has memory overhead for things like VM
overheads, interned strings, other native overheads
• spark.yarn.executor.memoryOverhead
• Default value is spark.executor.memory * 0.10, with minimum of
384MB.
• Prior to Spark 1.6, separate tuning was needed for
storage (RDD) memory and execution/shuffle memory
via spark.storage.memoryFraction and
spark.shuffle.memoryFraction
• Spark 1.6 introduced a new “UnifiedMemoryManager”
• When no Storage memory is used, Execution can acquire all the
available memory and vice versa
• As a result, applications that do not use caching can use the
entire space for execution, obviating unnecessary disk spills.
• Applications that do use caching can reserve a minimum storage
space where their data blocks are immune to being evicted
• spark.memory.storageFraction is tunable, but the default gives good
out-of-the-box performance
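The overhead rule at the top of this slide determines how large a YARN container each executor actually requests. The sketch below works that out in Python. The function name is hypothetical, and it deliberately ignores YARN rounding the request up to its allocation increment, so treat the result as a lower bound.

```python
def executor_container_mb(executor_memory_mb, overhead_fraction=0.10, min_overhead_mb=384):
    """Approximate YARN container size for one executor, in MB.

    Mirrors the default described above: overhead is 10% of
    spark.executor.memory, but never less than 384 MB.
    """
    overhead_mb = max(int(executor_memory_mb * overhead_fraction), min_overhead_mb)
    return executor_memory_mb + overhead_mb


# A small 2 GiB executor hits the 384 MB floor.
print(executor_container_mb(2048))   # 2432
# A 24 GiB executor pays the 10% overhead instead.
print(executor_container_mb(24576))  # 27033
```

This is why many small executors waste memory: each one pays at least the 384 MB floor on top of its heap.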
26. Tuning Spark – Shuffle partitions
• Spark SQL, by default, sets the number of reduce side
partitions to 200 when doing a shuffle for wide
transformations, e.g., groupByKey, reduceByKey,
sortByKey etc.
• Not optimal for many cases as it will use only 200 cores for
processing tasks after the shuffle
• For large datasets, this might result in shuffle block overflow
resulting in job failures
• The number of shuffle partitions should be at least
equal to the number of total executor cores or a
multiple of it in case of large data sets.
• spark.sql.shuffle.partitions setting
• It also helps to choose a prime number of partitions, which improves
hash distribution.
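One way to turn the guidance above into a starting value for spark.sql.shuffle.partitions is sketched below. The helper names and the example cluster size are assumptions made for illustration; the right multiple still has to be found by experiment on the actual workload.

```python
def is_prime(n):
    """Trial-division primality check, adequate for partition-count sized numbers."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def suggest_shuffle_partitions(num_executors, cores_per_executor, multiple=1, prefer_prime=False):
    """Suggest a shuffle partition count that is a multiple of the total executor cores.

    With prefer_prime=True the count is bumped up to the next prime number.
    """
    partitions = num_executors * cores_per_executor * multiple
    if prefer_prime:
        while not is_prime(partitions):
            partitions += 1
    return partitions


# Hypothetical cluster: 60 executors with 9 cores each, two partitions per core.
print(suggest_shuffle_partitions(60, 9, multiple=2))                     # 1080
print(suggest_shuffle_partitions(60, 9, multiple=2, prefer_prime=True))  # 1087
```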
27. Tuning Spark – Compression
• Using compression in Spark can improve performance in a meaningful
way as compression results in less disk I/O and network I/O
• Even though compressing the data results in some CPU cycles being
used, the performance improvements with compression outweigh the
CPU overhead when a large amount of data is involved
• Also compression results in reduced storage requirements for storing
data on disk, e.g., intermediate shuffle files
28. Tuning Spark – Compression
• spark.io.compression.codec setting to decide the codec
• three codecs provided: lz4, lzf, and snappy
• default codec is lz4
• Four main places where Spark makes use of compression
• Compress map output files during a shuffle operation using
spark.shuffle.compress setting (Default true)
• Compress data spilled during shuffles using spark.shuffle.spill.compress setting
(Default true)
• Compress broadcast variables before sending them using
spark.broadcast.compress setting (Default true)
• Compress serialized RDD partitions using spark.rdd.compress setting
(Default false)
29. Tuning Spark – Serialization type
• Serialization plays an important role in the performance of any distributed application
• Spark memory usage is greatly affected by storage level and serialization format
• By default, Spark serializes objects using Java Serializer which can work with any class that implements
java.io.Serializable interface
• For custom data types, Kryo Serialization is more compact and efficient than Java Serialization
• but user classes need to be explicitly registered with the Kryo Serializer
• spark.serializer org.apache.spark.serializer.KryoSerializer
• Spark SQL automatically uses Kryo serialization for DataFrames internally in Spark 2.x
• For customer applications that still use RDDs, Kryo Serialization should result in a significant
performance boost
30. Tuning Spark – Other configuration settings
• When using ORC/parquet format for the data, Spark SQL can push the filter
down to ORC/parquet, thus avoiding large data transfer.
• spark.sql.orc.filterPushdown (Default false)
• spark.sql.parquet.filterPushdown (Default true)
• For large data sets, you may encounter various network timeouts. You can tune
the individual timeout values:
• spark.core.connection.ack.wait.timeout
• spark.storage.blockManagerSlaveTimeoutMs
• spark.shuffle.io.connectionTimeout
• spark.rpc.askTimeout
• spark.rpc.lookupTimeout
• “Umbrella” setting for all these timeouts, spark.network.timeout (Default is
120 seconds). For 10TB dataset, this value should be something like 600
seconds.
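The settings on this slide can be collected in one place and written out in the whitespace-separated "key value" layout used by spark-defaults.conf. The sketch below does that in Python. The helper name is hypothetical, and the values are the ones suggested above for a 10TB dataset, shown for illustration only.

```python
def render_spark_defaults(conf):
    """Render a dict of Spark settings as aligned 'key  value' lines."""
    width = max(len(key) for key in conf)
    return "\n".join(f"{key.ljust(width)}  {value}" for key, value in sorted(conf.items()))


# Illustrative values for a large (10TB) dataset, per the slide.
settings = {
    "spark.sql.orc.filterPushdown": "true",      # off by default, so enable it for ORC data
    "spark.sql.parquet.filterPushdown": "true",  # already the default
    "spark.network.timeout": "600s",             # umbrella timeout, default 120s
}

print(render_spark_defaults(settings))
```

The same key/value pairs can instead be passed per job with repeated --conf key=value options to spark-submit.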
31. HPE’s Elastic Platform for Big Data Analytics (EPA)
Modular building blocks of compute and storage optimized for modern workloads
• Compute: Apollo 2000, DL360, Apollo 6500 w/ NVIDIA GPU, Synergy
• Storage: Apollo 4200, DL380, Apollo 4510
• Labels shown in the diagram: Hot, Cold, Object, Purpose-built
• Network: FlexFabric 5950/5940
32. HPE EPA - Single-Rack Reference Architecture for Spark 2.x
33. HPE EPA - Single-Rack Reference Architecture for Spark 2.x
34. HPE EPA - Multi-Rack configuration
Base Rack
• (1) DL360 Control Block – (1) Management Node, (2) Head Nodes
• (8) Apollo 2000 Compute Blocks – (32) XL170r Worker Nodes
• (10) Apollo 4200 Storage Blocks – (10) Apollo 4200 Data Nodes
• (1) Network Block
• (1) Rack Block
Aggregation Rack
• (8) Apollo 2000 Compute Blocks – (32) XL170r Worker Nodes
• (10) Apollo 4200 Storage Blocks – (10) Apollo 4200 Data Nodes
• (1) Network Block
• (1) Aggregation Switch Block - (2) HPE 5950 32QSFP28
• (1) Rack Block
Expansion Rack
• (8) Apollo 2000 Compute Blocks – (32) XL170r Worker Nodes
• (10) Apollo 4200 Storage Blocks – (10) Apollo 4200 Data Nodes
• (1) Network Block
• (1) Rack Block
Expansion Rack
• (8) Apollo 2000 Compute Blocks – (32) XL170r Worker Nodes
• (10) Apollo 4200 Storage Blocks – (10) Apollo 4200 Data Nodes
• (1) Network Block
• (1) Rack Block
35. Spark 2.x - Effect of cores per executor on query performance
36. Spark 2.x – Effect of shuffle partitions on query performance
37. Spark 2.x – Effect of compression codec on query performance
38. Evaluation of Spark SQL with Spark 2.x versus Spark 1.6
• Hive testbench (similar to TPC-DS) with 1000 SF (1TB size) and 10000 SF
(10TB size) used for testing
• Hive testbench used to generate the data
• ORC format used for storing the data
• ANSI SQL compatibility
• Spark 2.x could run all Hive testbench queries whereas Spark 1.6 could run only
50 queries
• Spark SQL robustness
• With 10TB dataset size, Spark 2.x could finish all of the queries whereas Spark 1.6
could finish only about 40 queries
39. Spark 2.x performance improvements over Spark 1.6
- with 10000 SF (10TB)
40. Spark 2.x performance improvements over Spark 1.6
- with 10000 SF (10TB)
41. Spark 2.x - Scaling performance by adding Compute Nodes
without data rebalancing
43. Questions?
Mark Lochbihler, Hortonworks - Principal Architect
mlochbihler@hortonworks.com
Viplava Madasu, HPE - Big Data Systems Engineer
viplava.madasu@hpe.com