Spark started at Facebook as an experiment when the project was still in its early phases. Its appeal stemmed from its ease of use and an integrated environment for running SQL, MLlib, and custom applications. At that time, the system was used by a handful of people to process small amounts of data. We've come a long way since then: today, Spark is one of the primary SQL engines at Facebook in addition to being the primary system for writing custom batch applications. This talk covers the story of how we optimized, tuned, and scaled Apache Spark at Facebook to run on tens of thousands of machines, process hundreds of petabytes of data, and serve thousands of data scientists, engineers, and product analysts every day. We'll focus on three areas:

* *Scaling Compute*: How Facebook runs Spark efficiently and reliably on tens of thousands of heterogeneous machines in disaggregated (shared-storage) clusters.
* *Optimizing the Core Engine*: How we continuously tune, optimize, and add features to the core engine in order to maximize the useful work done per second.
* *Scaling Users*: How we make Spark easy to use and faster to debug, to seamlessly onboard new users.
Speakers: Ankit Agarwal, Sameer Agarwal
1. Scaling Apache Spark at Facebook
Sameer Agarwal & Ankit Agarwal
Spark Summit | San Francisco | 24th April 2019
2. Sameer Agarwal
- Software Engineer at Facebook (Data Warehouse Team)
- Apache Spark Committer (Spark Core/SQL)
- Previously at Databricks and UC Berkeley
Ankit Agarwal
- Production Engineering Manager at Facebook (Data Warehouse Team)
- Data Infrastructure Team at Facebook since 2012
- Previously worked on the search team at Yahoo!
About Us
3. 1. Spark at Facebook
2. Hardware Trends: A tale of two bottlenecks
3. Evolving the Core Engine
- History Based Tuning
- Join Optimizations
4. Our Users and their Use-cases
5. The Road Ahead
Agenda
4. Agenda: up next, Spark at Facebook
6. 2.7 Billion MAU
2 Billion DAU
Source: Facebook Q4 2018 earnings call transcript
7. 2015: Small-scale experiments
2016: A few pipelines in production
2017: Running 60TB+ shuffle pipelines
2018: Full-production deployment; successor to Apache Hive at Facebook
2019: Scaling Spark; the largest compute engine at Facebook by CPU
The Journey
8. Agenda: up next, Hardware Trends: A tale of two bottlenecks
10. Hardware Trends
CPU, DRAM, and Disk
1. The industry is optimizing for throughput by adding more cores
2. To optimize performance per watt, next-generation processors will have more cores that run at lower frequencies
11. Hardware Trends
CPU, DRAM, and Disk
1. The price of DRAM continued to rise throughout 2016-2018 and has started fluctuating this year
2. Need to reduce our over-dependence on DRAM
12. Hardware Trends
CPU, DRAM, and Disk
1. Disk sizes continue to increase, but the number of random accesses per second isn't increasing
2. IOPS becomes a bottleneck
13. What does this mean for Spark?
1. Optimize Spark for an increasing core-to-memory ratio (see the config sketch below)
2. Run Spark on disaggregated compute/storage clusters
- Use server types optimized for compute and for storage
- Scale/upgrade clusters independently over time, depending on whether CPU or IOPS is the bottleneck
3. Scale extremely diverse workloads (SQL, ML, etc.) on Spark over clusters of tens of thousands of heterogeneous machines
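As a rough illustration of point 1, here is what leaning executors toward more cores per gigabyte of heap might look like. The config keys are standard Spark settings, but the values are assumptions for this sketch, not Facebook's production configuration:

import org.apache.spark.sql.SparkSession

// Illustrative session for a higher core-to-memory ratio; all values are
// hypothetical. More cores share a fixed memory budget, and off-heap memory
// takes pressure off the JVM heap (and GC) as DRAM gets scarcer.
val spark = SparkSession.builder()
  .appName("core-memory-ratio-sketch")
  .config("spark.executor.cores", "8")             // more tasks per executor
  .config("spark.executor.memory", "16g")          // only ~2 GB of heap per core
  .config("spark.memory.offHeap.enabled", "true")  // move execution memory off-heap
  .config("spark.memory.offHeap.size", "8g")
  .getOrCreate()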
17. Spark Architecture at Facebook
[Architecture diagram] Compute clusters are disaggregated from storage clusters: the Tangram scheduler places executors on heterogeneous hardware (purchased over 0-5 years) in the compute cluster, while spill, cache, and shuffle data flow to multiple distributed-FS instances in the storage cluster.
Brian Cho and Dmitry Borovsky, Cosco: An Efficient Facebook-Scale Shuffle Service (today at 4:30 PM, Developer Track)
Rui Jian and Hao Lin, Tangram: Distributed Scheduling for Spark at Facebook (tomorrow at 11:50 AM, Developer Track)
18. Agenda: up next, Evolving the Core Engine (History-Based Tuning, Join Optimizations)
Contributed 100+ patches upstream
23. History-Based Tuning
1. Need to tune Spark on a per-job or per-stage basis
2. Leverage historical characteristics of the job to tune resources (a sketch follows):
• Peak executor memory and spill sizes to tune executor off-heap memory
• Shuffle size to optionally skip inserting partial aggregates in the query plan
• Predicting the number of shuffle partitions (at the job and stage level)
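A minimal sketch of the last idea, assuming a hypothetical HistoricalRun record and target partition size; this is illustrative, not Facebook's actual tuning service:

// Derive a shuffle-partition count from past runs of the same query template.
case class HistoricalRun(shuffleBytes: Long)

def shufflePartitions(history: Seq[HistoricalRun],
                      targetBytesPerPartition: Long = 512L * 1024 * 1024): Int =
  if (history.isEmpty) 200 // Spark's default spark.sql.shuffle.partitions
  else {
    // Size off the largest recent shuffle so day-to-day growth doesn't overflow.
    val maxShuffle = history.map(_.shuffleBytes).max
    math.max(1, math.ceil(maxShuffle.toDouble / targetBytesPerPartition).toInt)
  }

// spark.conf.set("spark.sql.shuffle.partitions", shufflePartitions(pastRuns).toString)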
25. History-Based Tuning
[Flow diagram] A new query is matched against its query plan template, derived from historical job runs. If the template has had no regressions or failures in the past N days, the config override rules are applied; otherwise, conservative defaults are applied.
26. 1. Broadcast Join: Broadcast the small table to all nodes, stream the larger table; skew-resistant
2. Shuffle-Hash Join: Shuffle both tables, build a hash table from the smaller table, and stream the larger table
3. Sort-Merge Join: Shuffle and sort both tables, buffer one side, and stream the other (hint examples follow)
Joins in Spark
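In Spark 3.x releases, each of these strategies can be requested explicitly through join hints (at the time of this talk, the DataFrame API exposed only the broadcast hint). The DataFrames large and small are hypothetical:

import org.apache.spark.sql.functions.broadcast

// Force each strategy via a hint; without one, the planner chooses.
val byBroadcast   = large.join(broadcast(small), Seq("id"))           // 1. broadcast join
val byShuffleHash = large.hint("shuffle_hash").join(small, Seq("id")) // 2. shuffle-hash join
val bySortMerge   = large.hint("merge").join(small, Seq("id"))        // 3. sort-merge join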
27. 1. Bucketing is a way to shuffle (and optionally sort) output data based on certain columns of a table
2. Ideal for write-once, read-many datasets
3. A variant of sort-merge join in Spark; overrides outputPartitioning and outputOrdering for HiveTableScanExec and stitches the partitioning/ordering metadata throughout the query plan (a bucketed-write example follows)
Sort-Merge-Bucket (SMB) Join
SPARK-19256
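A minimal example of producing such a table with the public DataFrameWriter API; the table and column names are illustrative:

// Bucket and sort on the join key at write time so that later joins on userid
// can skip both the shuffle and the sort. bucketBy requires saveAsTable.
df.write
  .bucketBy(1024, "userid")
  .sortBy("userid")
  .format("parquet")
  .saveAsTable("warehouse.events_bucketed")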
28. A hybrid join algorithm wherein each task starts off by executing a shuffle-hash join. During execution, should the hash table exceed a certain size (and risk an OOM), the task automatically reconstructs/sorts the iterators and falls back to a sort-merge join (sketched below).
Dynamic Join
SPARK-21505
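A conceptual sketch of that fallback in plain Scala. The real SPARK-21505 implementation works on Spark's internal row iterators with byte-level memory accounting; here the threshold is an entry count over ordinary collections:

import scala.collection.mutable

def dynamicJoin[K, V, W](build: Seq[(K, V)], stream: Seq[(K, W)],
                         maxHashEntries: Int)
                        (implicit ord: Ordering[K]): Seq[(K, (V, W))] = {
  val table = mutable.HashMap.empty[K, mutable.Buffer[V]]
  // Build the hash table, bailing out as soon as it crosses the threshold.
  val overflow = build.exists { case (k, v) =>
    table.getOrElseUpdate(k, mutable.Buffer.empty) += v
    table.size > maxHashEntries
  }
  if (!overflow) {
    // Shuffle-hash path: stream the other side against the completed table.
    stream.flatMap { case (k, w) =>
      table.getOrElse(k, mutable.Buffer.empty).map(v => (k, (v, w)))
    }
  } else {
    // Sort-merge path (condensed): Spark re-sorts both iterators and streams
    // one side; here we just sort the build side and probe per key group.
    val byKey = stream.groupBy(_._1)
    build.sortBy(_._1).flatMap { case (k, v) =>
      byKey.getOrElse(k, Nil).map { case (_, w) => (k, (v, w)) }
    }
  }
}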
29. Skew Join
A hybrid join algorithm that processes skewed keys via a broadcast join and non-skewed keys via a shuffle-hash or sort-merge join (a hand-written equivalent follows the example):

SELECT /*+ SKEWED_ON(a.userid='10001') */ a.userid
FROM table_A a INNER JOIN table_B b
ON a.userid = b.userid
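For intuition, here is the same plan written out by hand in the DataFrame API, assuming hypothetical tableA/tableB DataFrames; the hint automates exactly this split-and-union:

import org.apache.spark.sql.functions.{broadcast, col}

// Route the hot key through a broadcast join; everything else goes through
// the planner's usual shuffle-hash/sort-merge join, then union the results.
val hotKey  = "10001"
val hotOut  = tableA.where(col("userid") === hotKey)
  .join(broadcast(tableB.where(col("userid") === hotKey)), "userid")
val coldOut = tableA.where(col("userid") =!= hotKey).join(tableB, "userid")
val joined  = hotOut.unionByName(coldOut)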
30. Agenda: up next, Our Users and their Use-cases
38. • Hard limits on config values (a guardrail sketch follows this list)
• Capacity Quotas (Storage and Compute)
• Strict resource limits (containerization)
Defensive Deployment
Guardrails for us (and users)
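A minimal sketch of the first guardrail. The config keys are real Spark settings, but the caps and the enforcement point are assumptions for illustration, not Facebook's actual policy:

// Clamp user-supplied config values to hard caps before a job is submitted.
val hardCaps: Map[String, Int] = Map(
  "spark.dynamicAllocation.maxExecutors" -> 2000,
  "spark.sql.shuffle.partitions"         -> 100000
)

def enforceCaps(userConf: Map[String, String]): Map[String, String] =
  userConf.map { case (key, value) =>
    hardCaps.get(key) match {
      case Some(cap) => key -> math.min(value.toInt, cap).toString
      case None      => key -> value
    }
  }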
45. Memory.what?
memory.high is the memory usage throttle limit. This is the main mechanism to control a cgroup's memory use: if a cgroup's memory use goes over the high boundary specified here, the cgroup's processes are throttled and put under heavy reclaim pressure. The default is max, meaning there is no limit.
memory.max is the memory usage hard limit, acting as the final protection mechanism: if a cgroup's memory usage reaches this limit and can't be reduced, the system OOM killer is invoked on the cgroup.
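On a cgroup v2 host, both knobs are plain files under the cgroup's directory. A sketch of setting them; the cgroup path and limit values are illustrative, and the writes require appropriate privileges:

import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets.UTF_8

// Throttle-and-reclaim above memory.high; OOM-kill the cgroup above memory.max.
val cg = Paths.get("/sys/fs/cgroup/spark-executor-0") // hypothetical cgroup
Files.write(cg.resolve("memory.high"), "48G".getBytes(UTF_8))
Files.write(cg.resolve("memory.max"),  "56G".getBytes(UTF_8))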
49. • Our cgroup configuration was wrong
• History-based scheduling
So… What happened?
50. • Cgroups configuration can be tricky
• Find the right balance between efficiency and reliability
• Bonus: Better resource control on IO
Takeaways
51. Agenda: up next, The Road Ahead