Spark started at Facebook as an experiment when the project was still in its early phases. Its appeal stemmed from its ease of use and an integrated environment for running SQL, MLlib, and custom applications. At that time, the system was used by a handful of people to process small amounts of data. We've come a long way since then: today, Spark is one of the primary SQL engines at Facebook in addition to being the primary system for writing custom batch applications. This talk covers the story of how we optimized, tuned, and scaled Apache Spark at Facebook to run on tens of thousands of machines, process hundreds of petabytes of data, and serve thousands of data scientists, engineers, and product analysts every day. We'll focus on three areas:

* *Scaling Compute*: How Facebook runs Spark efficiently and reliably on tens of thousands of heterogeneous machines in disaggregated (shared-storage) clusters.
* *Optimizing the Core Engine*: How we continuously tune, optimize, and add features to the core engine in order to maximize the useful work done per second.
* *Scaling Users*: How we make Spark easy to use and faster to debug, to seamlessly onboard new users.
Speakers: Ankit Agarwal, Sameer Agarwal
1. Scaling Apache Spark at Facebook
Sameer Agarwal & Ankit Agarwal
Spark Summit | San Francisco | 24th April 2019
2. Sameer Agarwal
- Software Engineer at Facebook (Data Warehouse Team)
- Apache Spark Committer (Spark Core/SQL)
- Previously at Databricks and UC Berkeley
Ankit Agarwal
- Production Engineering Manager at Facebook (Data Warehouse Team)
- Data Infrastructure Team at Facebook since 2012
- Previously worked on the search team at Yahoo!
About Us
3. 1. Spark at Facebook
2. Hardware Trends: A tale of two bottlenecks
3. Evolving the Core Engine
- History Based Tuning
- Join Optimizations
4. Our Users and their Use-cases
5. The Road Ahead
Agenda
4. Agenda: up next, Spark at Facebook
6. 2.7 Billion MAU
2 Billion DAU
Source: Facebook Q4 2018 earnings call transcript
7. 2015: Small-scale experiments
2016: A few pipelines in production
2017: Running 60TB+ shuffle pipelines
2018: Full-production deployment; successor to Apache Hive at Facebook
2019: Scaling Spark; the largest compute engine at Facebook by CPU
The Journey
8. Agenda: up next, Hardware Trends: A tale of two bottlenecks
10. Hardware Trends
CPU, DRAM, and Disk
1. The industry is optimizing for throughput by adding more cores
2. To optimize performance per watt, next-generation processors will have more cores that run at lower frequencies
11. Hardware Trends
CPU, DRAM, and Disk
1. The price of DRAM continued to rise throughout 2016-2018 and has started fluctuating this year
2. Need to reduce our over-dependence on DRAM
12. Hardware Trends
CPU, DRAM, and Disk
1. Disk sizes continue to increase, but the number of random accesses per second isn't increasing
2. IOPS becomes a bottleneck
13. What does this mean for Spark?
1. Optimize Spark for an increasing core-to-memory ratio (see the config sketch below)
2. Run Spark on disaggregated compute/storage clusters
- Use server types optimized for compute and for storage
- Scale/upgrade clusters independently over time, depending on whether CPU or IOPS is the bottleneck
3. Scale extremely diverse workloads (SQL, ML, etc.) on Spark over clusters of tens of thousands of heterogeneous machines
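As a rough illustration of point 1, here is what leaning executors toward more cores per gigabyte of heap might look like. The config keys are standard Spark settings, but the values are assumptions for this sketch, not Facebook's production configuration:

import org.apache.spark.sql.SparkSession

// Illustrative session for a higher core-to-memory ratio; all values are
// hypothetical. More cores share a fixed memory budget, and off-heap memory
// takes pressure off the JVM heap (and GC) as DRAM gets scarcer.
val spark = SparkSession.builder()
  .appName("core-memory-ratio-sketch")
  .config("spark.executor.cores", "8")             // more tasks per executor
  .config("spark.executor.memory", "16g")          // only ~2 GB of heap per core
  .config("spark.memory.offHeap.enabled", "true")  // move execution memory off-heap
  .config("spark.memory.offHeap.size", "8g")
  .getOrCreate()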
17. Spark Architecture at Facebook
[Architecture diagram] Compute clusters are disaggregated from storage clusters: the Tangram scheduler places executors on heterogeneous hardware (purchased over 0-5 years) in the compute cluster, while spill, cache, and shuffle data flow to multiple distributed-FS instances in the storage cluster.
Brian Cho and Dmitry Borovsky, Cosco: An Efficient Facebook-Scale Shuffle Service (today at 4:30 PM, Developer Track)
Rui Jian and Hao Lin, Tangram: Distributed Scheduling for Spark at Facebook (tomorrow at 11:50 AM, Developer Track)
18. Agenda: up next, Evolving the Core Engine (History-Based Tuning, Join Optimizations)
Contributed 100+ patches upstream
23. History-Based Tuning
1. Need to tune Spark on a per-job or per-stage basis
2. Leverage historical characteristics of the job to tune resources (a sketch follows):
• Peak executor memory and spill sizes to tune executor off-heap memory
• Shuffle size to optionally skip inserting partial aggregates in the query plan
• Predicting the number of shuffle partitions (at the job and stage level)
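A minimal sketch of the last idea, assuming a hypothetical HistoricalRun record and target partition size; this is illustrative, not Facebook's actual tuning service:

// Derive a shuffle-partition count from past runs of the same query template.
case class HistoricalRun(shuffleBytes: Long)

def shufflePartitions(history: Seq[HistoricalRun],
                      targetBytesPerPartition: Long = 512L * 1024 * 1024): Int =
  if (history.isEmpty) 200 // Spark's default spark.sql.shuffle.partitions
  else {
    // Size off the largest recent shuffle so day-to-day growth doesn't overflow.
    val maxShuffle = history.map(_.shuffleBytes).max
    math.max(1, math.ceil(maxShuffle.toDouble / targetBytesPerPartition).toInt)
  }

// spark.conf.set("spark.sql.shuffle.partitions", shufflePartitions(pastRuns).toString)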
25. History-Based Tuning
[Flow diagram] A new query is matched against its query plan template, derived from historical job runs. If the template has had no regressions or failures in the past N days, the config override rules are applied; otherwise, conservative defaults are applied.
26. 1. Broadcast Join: Broadcast the small table to all nodes, stream the larger table; skew-resistant
2. Shuffle-Hash Join: Shuffle both tables, build a hash table from the smaller table, and stream the larger table
3. Sort-Merge Join: Shuffle and sort both tables, buffer one side, and stream the other (hint examples follow)
Joins in Spark
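In Spark 3.x releases, each of these strategies can be requested explicitly through join hints (at the time of this talk, the DataFrame API exposed only the broadcast hint). The DataFrames large and small are hypothetical:

import org.apache.spark.sql.functions.broadcast

// Force each strategy via a hint; without one, the planner chooses.
val byBroadcast   = large.join(broadcast(small), Seq("id"))           // 1. broadcast join
val byShuffleHash = large.hint("shuffle_hash").join(small, Seq("id")) // 2. shuffle-hash join
val bySortMerge   = large.hint("merge").join(small, Seq("id"))        // 3. sort-merge join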
27. 1. Bucketing is a way to shuffle (and optionally sort) output data based on certain columns of a table
2. Ideal for write-once, read-many datasets
3. A variant of sort-merge join in Spark; overrides outputPartitioning and outputOrdering for HiveTableScanExec and stitches the partitioning/ordering metadata throughout the query plan (a bucketed-write example follows)
Sort-Merge-Bucket (SMB) Join
SPARK-19256
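A minimal example of producing such a table with the public DataFrameWriter API; the table and column names are illustrative:

// Bucket and sort on the join key at write time so that later joins on userid
// can skip both the shuffle and the sort. bucketBy requires saveAsTable.
df.write
  .bucketBy(1024, "userid")
  .sortBy("userid")
  .format("parquet")
  .saveAsTable("warehouse.events_bucketed")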
28. A hybrid join algorithm wherein each task starts off by executing a shuffle-hash join. During execution, should the hash table exceed a certain size (and risk an OOM), the task automatically reconstructs/sorts the iterators and falls back to a sort-merge join (sketched below).
Dynamic Join
SPARK-21505
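A conceptual sketch of that fallback in plain Scala. The real SPARK-21505 implementation works on Spark's internal row iterators with byte-level memory accounting; here the threshold is an entry count over ordinary collections:

import scala.collection.mutable

def dynamicJoin[K, V, W](build: Seq[(K, V)], stream: Seq[(K, W)],
                         maxHashEntries: Int)
                        (implicit ord: Ordering[K]): Seq[(K, (V, W))] = {
  val table = mutable.HashMap.empty[K, mutable.Buffer[V]]
  // Build the hash table, bailing out as soon as it crosses the threshold.
  val overflow = build.exists { case (k, v) =>
    table.getOrElseUpdate(k, mutable.Buffer.empty) += v
    table.size > maxHashEntries
  }
  if (!overflow) {
    // Shuffle-hash path: stream the other side against the completed table.
    stream.flatMap { case (k, w) =>
      table.getOrElse(k, mutable.Buffer.empty).map(v => (k, (v, w)))
    }
  } else {
    // Sort-merge path (condensed): Spark re-sorts both iterators and streams
    // one side; here we just sort the build side and probe per key group.
    val byKey = stream.groupBy(_._1)
    build.sortBy(_._1).flatMap { case (k, v) =>
      byKey.getOrElse(k, Nil).map { case (_, w) => (k, (v, w)) }
    }
  }
}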
29. Skew Join
A hybrid join algorithm that processes skewed keys via a broadcast join and non-skewed keys via a shuffle-hash or sort-merge join (a hand-written equivalent follows the example):

SELECT /*+ SKEWED_ON(a.userid='10001') */ a.userid
FROM table_A a INNER JOIN table_B b
ON a.userid = b.userid
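For intuition, here is the same plan written out by hand in the DataFrame API, assuming hypothetical tableA/tableB DataFrames; the hint automates exactly this split-and-union:

import org.apache.spark.sql.functions.{broadcast, col}

// Route the hot key through a broadcast join; everything else goes through
// the planner's usual shuffle-hash/sort-merge join, then union the results.
val hotKey  = "10001"
val hotOut  = tableA.where(col("userid") === hotKey)
  .join(broadcast(tableB.where(col("userid") === hotKey)), "userid")
val coldOut = tableA.where(col("userid") =!= hotKey).join(tableB, "userid")
val joined  = hotOut.unionByName(coldOut)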
30. Agenda: up next, Our Users and their Use-cases
38. • Hard limits on config values (a guardrail sketch follows this list)
• Capacity Quotas (Storage and Compute)
• Strict resource limits (containerization)
Defensive Deployment
Guardrails for us (and users)
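A minimal sketch of the first guardrail. The config keys are real Spark settings, but the caps and the enforcement point are assumptions for illustration, not Facebook's actual policy:

// Clamp user-supplied config values to hard caps before a job is submitted.
val hardCaps: Map[String, Int] = Map(
  "spark.dynamicAllocation.maxExecutors" -> 2000,
  "spark.sql.shuffle.partitions"         -> 100000
)

def enforceCaps(userConf: Map[String, String]): Map[String, String] =
  userConf.map { case (key, value) =>
    hardCaps.get(key) match {
      case Some(cap) => key -> math.min(value.toInt, cap).toString
      case None      => key -> value
    }
  }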
45. Memory.what?
memory.high is the memory usage throttle limit. This is the main mechanism to control a cgroup's memory use: if a cgroup's memory use goes over the high boundary specified here, the cgroup's processes are throttled and put under heavy reclaim pressure. The default is max, meaning there is no limit.
memory.max is the memory usage hard limit, acting as the final protection mechanism: if a cgroup's memory usage reaches this limit and can't be reduced, the system OOM killer is invoked on the cgroup.
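On a cgroup v2 host, both knobs are plain files under the cgroup's directory. A sketch of setting them; the cgroup path and limit values are illustrative, and the writes require appropriate privileges:

import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets.UTF_8

// Throttle-and-reclaim above memory.high; OOM-kill the cgroup above memory.max.
val cg = Paths.get("/sys/fs/cgroup/spark-executor-0") // hypothetical cgroup
Files.write(cg.resolve("memory.high"), "48G".getBytes(UTF_8))
Files.write(cg.resolve("memory.max"),  "56G".getBytes(UTF_8))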
49. • Our cgroup configuration was wrong
• History-based scheduling
So… What happened?
50. • Cgroups configuration can be tricky
• Find the right balance between efficiency and reliability
• Bonus: Better resource control on IO
Takeaways
51. Agenda: up next, The Road Ahead