Scaling Apache Spark at Facebook
Sameer Agarwal & Ankit Agarwal
Spark Summit | San Francisco | 24th April 2019
Sameer Agarwal
- Software Engineer at Facebook (Data Warehouse Team)
- Apache Spark Committer (Spark Core/SQL)
- Previously at Databricks and UC Berkeley
Ankit Agarwal
- Production Engineering Manager at Facebook (Data Warehouse Team)
- Data Infrastructure Team at Facebook since 2012
- Previously worked on the search team at Yahoo!
About Us
1. Spark at Facebook
2. Hardware Trends: A tale of two bottlenecks
3. Evolving the Core Engine
- History Based Tuning
- Join Optimizations
4. Our Users and their Use-cases
5. The Road Ahead
Agenda
Data at Facebook
2.7 Billion MAU
2 Billion DAU
Source: Facebook Q4 2018 earnings call transcript
2015: Small-scale experiments
2016: Few pipelines in production
2017: Running 60TB+ shuffle pipelines
2018: Full-production deployment; successor to Apache Hive at Facebook
2019: Scaling Spark; largest compute engine at Facebook by CPU
The Journey
Hardware Trends
CPU, DRAM, and Disk
1. The industry is optimizing for throughput by adding more cores
2. To optimize performance/watt, next generation processors will have more cores that run at lower frequency
Hardware Trends
CPU, DRAM, and Disk
1. The price of DRAM continued to rise throughout 2016-2018 and has started fluctuating this year
2. Need to reduce our over-dependence on DRAM
Hardware Trends
CPU, DRAM, and Disk
1. Disk sizes continue to increase but the number of random accesses per second isn't increasing
2. IOPS becomes a bottleneck
What does this mean for Spark?
1. Optimize Spark for increasing core-memory ratio
2. Run Spark on disaggregated compute/storage clusters
- Use server types optimized for compute and storage
- Scale/upgrade clusters independently over time depending on whether CPU or IOPS is the bottleneck
3. Scale extremely diverse workloads (SQL, ML etc.) on Spark over clusters of tens of thousands of heterogeneous machines
Spark Architecture at Facebook
[Diagram: Compute Cluster (Executor #1, Executor #2; Tangram Scheduler; heterogeneous hardware purchased over 0-5 years) and Storage Cluster (Distributed FS instances #1, #2, #3), with Spill, Cache, and Shuffle shown between executors and storage]
Brian Cho and Dmitry Borovsky, Cosco: An Efficient Facebook-Scale Shuffle Service
Today at 4:30PM (Developer Track)
Rui Jian and Hao Lin, Tangram: Distributed Scheduling for Spark at Facebook
Tomorrow at 11:50AM (Developer Track)
Contributed 100+
patches upstream
History-Based Tuning: Motivation
[Chart: cluster memory utilization over 1 week; max (80-100%), p95 (55-70%), p50 (10-60%)]
One-size-fits-all configs result in under-utilization of resources
History-Based Tuning: Motivation
[Chart: percentage of Spark tasks (CDF) vs. peak execution memory bytes]
75% of Spark tasks use less than 600 MB of peak execution memory
Resource requirements vary widely across individual Spark tasks
History-Based Tuning
1. Need to tune Spark on a per-job or a per-stage basis
2. Leverage historical characteristics of the job to tune resources:
• Peak executor memory and spill sizes to tune executor off-heap memory
• Shuffle size to optionally not insert partial aggregates in the query plan
• Predicting the number of shuffle partitions (job level and stage level)
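A minimal sketch of the last bullet, predicting a shuffle partition count from the shuffle sizes of past runs of the same job. The target of 256 MB per partition, the bounds, and the function name are assumptions for illustration; the talk does not state the actual heuristic.

```python
import math

# Hypothetical tuning constants; the talk does not give real values.
TARGET_BYTES_PER_PARTITION = 256 * 1024 * 1024
MIN_PARTITIONS = 1
MAX_PARTITIONS = 100_000


def predict_shuffle_partitions(historical_shuffle_bytes):
    """Suggest a shuffle partition count from past shuffle sizes of a job."""
    if not historical_shuffle_bytes:
        return None  # no history: the caller keeps its default
    # Size for the largest past run so a growing job is not under-provisioned.
    expected = max(historical_shuffle_bytes)
    partitions = math.ceil(expected / TARGET_BYTES_PER_PARTITION)
    return max(MIN_PARTITIONS, min(MAX_PARTITIONS, partitions))
```

The result could then be passed to the job as spark.sql.shuffle.partitions; a per-stage variant would need the same history at stage granularity.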
History-Based Tuning
New Query -> Query Plan Template:
InsertIntoHiveTable [partitions: ds,country]
+- *Project [cast(key as int) AS key, value]
   +- *HiveTableScan (db.test) [col: key,value] [part: ds]
Query Plan Template + Historical Job Runs -> Config Override Rules:
• No regressions/failures in the past N days: apply config overrides
• Regressions/failures in the past N days: apply conservative defaults
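The flow above can be sketched as follows. This is an illustration under stated assumptions, not Facebook's implementation: the normalization rules in plan_template, the shape of the overrides and recent_failures dictionaries, and all function names are hypothetical.

```python
import hashlib
import re


def plan_template(plan_text):
    """Normalize a query plan so runs of the same pipeline share one key.

    Assumption: quoted strings and bare numbers vary between runs (for
    example the ds partition value), so they are replaced by a placeholder.
    """
    normalized = re.sub(r"'[^']*'", "?", plan_text)
    normalized = re.sub(r"\b\d+\b", "?", normalized)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def choose_config(template_key, overrides, recent_failures, defaults):
    """Apply learned overrides unless the template regressed or failed recently."""
    if recent_failures.get(template_key, 0) > 0:
        return dict(defaults)          # fall back to conservative defaults
    config = dict(defaults)
    config.update(overrides.get(template_key, {}))
    return config
```

Keying on a normalized plan rather than on the raw query text lets daily runs of one pipeline share history even though their partition values differ.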
1. Broadcast Join: Broadcast small table to all nodes, stream the larger table; skew resistant
2. Shuffle-Hash Join: Shuffle both tables, create a hashmap with smaller table and stream the larger table
3. Sort-Merge Join: Shuffle and sort both tables, buffer one side and stream the other side
Joins in Spark
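A minimal single-task sketch of the "build a hashmap with the smaller table and stream the larger table" step that broadcast and shuffle-hash joins share. Rows are modeled as plain dictionaries and the function name is hypothetical; Spark's own operators work on internal row formats.

```python
from collections import defaultdict


def hash_join(small_rows, large_rows, key):
    """Inner join: build a hash map on the smaller side, stream the larger side."""
    table = defaultdict(list)
    for row in small_rows:
        table[row[key]].append(row)
    for row in large_rows:  # streamed one row at a time, never materialized
        for match in table.get(row[key], ()):
            yield {**match, **row}
```

In a broadcast join every task receives the whole small table; in a shuffle-hash join both sides are first partitioned by key so that each task only builds a map for its own partition.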
1. Bucketing is a way to shuffle (and optionally sort) output data based on certain columns of a table
2. Ideal for write-once, read-many datasets
3. Variant of Sort Merge Join in Spark; overrides outputPartitioning and outputOrdering for HiveTableScanExec and stitches partitioning/ordering metadata throughout the query plan
Sort-Merge-Bucket (SMB) Join
SPARK-19256
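A sketch of why bucketing helps, assuming both tables were written with the same number of buckets, the same hash function on the join key, and sorted within each bucket. Under those assumptions bucket i of one table can be merge-joined with bucket i of the other without any shuffle or sort at read time. The function names and the (key, value) row shape are hypothetical.

```python
def merge_join(left, right):
    """Inner-join two lists of (key, value) pairs already sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Buffer the run of equal keys on the right, stream the left.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                for _, rv in right[j:j_end]:
                    out.append((lk, left[i][1], rv))
                i += 1
            j = j_end
    return out


def smb_join(left_buckets, right_buckets):
    """Join matching buckets pairwise; no shuffle or sort is needed."""
    if len(left_buckets) != len(right_buckets):
        raise ValueError("tables must have the same number of buckets")
    result = []
    for lb, rb in zip(left_buckets, right_buckets):
        result.extend(merge_join(lb, rb))
    return result
```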
A hybrid join algorithm wherein each task starts off by executing a shuffle-hash join. If, during execution, the hash table exceeds a certain size (and would otherwise OOM), the task automatically reconstructs and sorts the iterators and falls back to a sort-merge join
Dynamic Join
SPARK-21505
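A simplified sketch of the fallback idea. It uses a row-count limit (max_build_rows, a hypothetical parameter) as a stand-in for the hash table's memory footprint, and sorts in memory where a real engine would spill to disk. It is not the code from SPARK-21505.

```python
def merge_join(left, right):
    """Inner-join two lists of (key, value) pairs that are sorted by key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], rv) for _, rv in right[j:j_end])
                i += 1
            j = j_end
    return out


def dynamic_join(build_rows, stream_rows, max_build_rows):
    """Inner-join (key, value) pairs; returns (strategy_used, joined_rows)."""
    table = {}
    consumed = []
    build_iter = iter(build_rows)
    for key, value in build_iter:
        consumed.append((key, value))
        if len(consumed) > max_build_rows:
            # The hash table would grow too large: rebuild the build side from
            # the rows already consumed plus the remainder, sort both sides,
            # and finish as a sort-merge join.
            build_sorted = sorted(consumed + list(build_iter), key=lambda r: r[0])
            stream_sorted = sorted(stream_rows, key=lambda r: r[0])
            return "sort-merge", merge_join(build_sorted, stream_sorted)
        table.setdefault(key, []).append(value)
    rows = [(k, bv, sv) for k, sv in stream_rows for bv in table.get(k, [])]
    return "shuffle-hash", rows
```

Both paths produce the same set of joined rows; only the order and the memory profile differ.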
Skew Join
A hybrid join algorithm that processes skewed keys via a broadcast join and non-skewed keys via a shuffle-hash or sort-merge join
SELECT /*+ SKEWED_ON(a.userid='10001') */ a.userid
FROM table_A a INNER JOIN table_B b
ON a.userid = b.userid
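A sketch of how the two paths might divide the work across tasks, assuming the skewed keys are known up front (as with the SKEWED_ON hint). Rows are (key, value) pairs; the function name, the round-robin placement, and the use of Python's hash for partitioning are illustrative assumptions.

```python
from collections import defaultdict


def skew_join(table_a, table_b, skewed_keys, num_tasks):
    """Inner-join (key, value) pairs, returning one output list per task.

    Rows of table_a with a skewed key are spread round-robin over all tasks
    and joined against a broadcast copy of the matching table_b rows. All
    other rows are hash-partitioned by key, as in a shuffle-hash join.
    """
    broadcast = defaultdict(list)  # conceptually copied to every task
    b_parts = [defaultdict(list) for _ in range(num_tasks)]
    for key, value in table_b:
        if key in skewed_keys:
            broadcast[key].append(value)
        else:
            b_parts[hash(key) % num_tasks][key].append(value)

    outputs = [[] for _ in range(num_tasks)]
    next_task = 0
    for key, value in table_a:
        if key in skewed_keys:
            task, lookup = next_task, broadcast
            next_task = (next_task + 1) % num_tasks
        else:
            task = hash(key) % num_tasks
            lookup = b_parts[task]
        for b_value in lookup.get(key, ()):
            outputs[task].append((key, value, b_value))
    return outputs
```

Without the split, every row for the hot key would hash to the same task; with it, the hot key's rows are shared evenly.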
Data Scientists (10%)
Data Engineers (15%)
Software Engineers (60%)
Others (15%)
Who uses Spark?
Error Classification
• System vs. User
• Retriability
• Root Cause
Showing actionable error messages
Automatic Error Classification
aka Failure Attribution
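A rule-based sketch of the three classification axes above (system vs. user, retriability, root cause) plus an actionable message. The rules, root-cause labels, and messages are hypothetical examples, not Facebook's rule set.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Classification:
    source: str       # "user" or "system"
    retriable: bool
    root_cause: str
    message: str      # actionable text shown to the user


# Hypothetical rules for illustration; the first matching rule wins.
RULES = [
    (re.compile(r"Table or view not found"),
     Classification("user", False, "missing_table",
                    "Check the table name and that the table exists.")),
    (re.compile(r"OutOfMemoryError"),
     Classification("system", True, "executor_oom",
                    "The job ran out of memory and can be retried with more memory.")),
    (re.compile(r"FetchFailed"),
     Classification("system", True, "shuffle_fetch_failure",
                    "A shuffle fetch failed; the stage can be retried.")),
]

UNKNOWN = Classification("system", False, "unknown",
                         "Unclassified failure; needs manual investigation.")


def classify(error_text):
    """Return the classification of the first rule matching the error text."""
    for pattern, classification in RULES:
        if pattern.search(error_text):
            return classification
    return UNKNOWN
```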
How Spark is used?
Pure SQL (54%)
Pure SQL (72%)
UDF & Transforms
(45%)
UDF & Transforms
(20%)
DataFrames (1%)
DataFrames (8%)
[Chart: Diversity of Workload; series: Query Count, CPU]
Data Driven Decisions
Standardized Testing
Change X -> Standardized Tests -> Log Metrics -> Evaluate Results
Data Driven Decisions
Shadow Testing
Change X -> Create a tag -> Shadow Testing (tag-based selection) -> Log Metrics -> Evaluate Results
• New Features
• Regular Releases
• Configuration Updates
• Hardware Testing
Where do we use it?
Workload Prioritization
[Diagram: Spark Cluster divided between Team 1 and Team 2 (60% / 40%); queues: Backfill, Pipelines, Fastlane, Interactive; scheduling policies: FIFO, User Fair Share, DRF]
• Hard limits on config values
• Capacity Quotas (Storage and Compute)
• Strict resource limits (containerization)
Defensive Deployment
Guardrails for us (and users)
Resource Limits
Cgroup v2
Spark Executor
/cgroup2/task_container/exec1
Memory Oversubscription
Finding the balance
/cgroup2/task_container/: 40 GB
exec1, exec2, exec3, exec4: memory.max = 12 GB each
A tale of two resources
[Chart: resource utilization (%) for CPU and Memory, Mar 2nd to Mar 10th 2019]
A tale of THREE resources
Or my love-hate relationship with cgroups
[Chart: resource utilization (%) for CPU, Memory, and Disk IO, Mar 2nd to Mar 10th 2019]
Memory Oversubscription
The full picture
/cgroup2/task_container/: 40 GB
exec1, exec2, exec3, exec4: memory.max = 12 GB each, memory.high = 10 GB each
Memory.what?
memory.high is the memory usage throttle limit. This is the main mechanism to control
a cgroup’s memory use. If a cgroup's memory use goes over the high boundary specified
here, the cgroup’s processes are throttled and put under heavy reclaim pressure. The
default is max, meaning there is no limit.
memory.max is the memory usage hard limit, acting as the final protection mechanism:
If a cgroup's memory usage reaches this limit and can't be reduced, the system OOM
killer is invoked on the cgroup.
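A coarse model of the two limits described above, using the figures from the slides (a 40 GB container holding four executors with memory.max of 12 GB each). The function names and state labels are hypothetical, and the model ignores that the kernel first tries to reclaim memory before invoking the OOM killer.

```python
GB = 1024 ** 3


def oversubscription_ratio(parent_limit, child_maxes):
    """How much memory the children may claim relative to the parent's limit."""
    return sum(child_maxes) / parent_limit


def memory_state(usage, high, maximum):
    """Classify a cgroup's usage against memory.high and memory.max."""
    if usage >= maximum:
        return "oom-kill candidate"
    if usage > high:
        return "throttled, under reclaim pressure"
    return "unrestricted"
```

With these numbers the four executors can together claim 48 GB against a 40 GB container, a ratio of 1.2. Lowering memory.high widens the band in which an executor is throttled before it reaches memory.max.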
Memory Pressure?
[Chart: Memory Pressure, with memory.max and memory.high marked]
Memory Oversubscription
The full picture
/cgroup2/task_container/: 40 GB
exec1, exec2, exec3, exec4: memory.max = 12 GB each, memory.high = 4 GB each
Memory Pressure?
[Chart: Memory Pressure, with memory.max and memory.high marked; region labeled Thrashing]
• Our cgroup configuration was wrong
• History Based scheduling
So… What happened?
• Cgroups configuration can be tricky
• Find the right balance between efficiency and reliability
• Bonus: Better resource control on IO
Takeaways
• Scaling Spark 10X
• Redefining “Warehouse”
• Beyond SQL
The Road Ahead
INFRASTRUCTURE
Sameer Agarwal: sag@fb.com
Ankit Agarwal: ankitag@fb.com

 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxBoston Institute of Analytics
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.pptamreenkhanum0307
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in collegessuser7a7cd61
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 

Último (20)

专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.ppt
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in college
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 

Scaling Apache Spark at Facebook

  • 1. Scaling Apache Spark at Facebook Sameer Agarwal & Ankit Agarwal Spark Summit | San Francisco | 24th April 2019
  • 2. About Us. Sameer Agarwal: Software Engineer at Facebook (Data Warehouse Team); Apache Spark Committer (Spark Core/SQL); previously at Databricks and UC Berkeley. Ankit Agarwal: Production Engineering Manager at Facebook (Data Warehouse Team); Data Infrastructure Team at Facebook since 2012; previously worked on the search team at Yahoo!
  • 3. Agenda. 1. Spark at Facebook 2. Hardware Trends: A tale of two bottlenecks 3. Evolving the Core Engine (History Based Tuning, Join Optimizations) 4. Our Users and their Use-cases 5. The Road Ahead
  • 6. 2.7 Billion MAU, 2 Billion DAU (Source: Facebook Q4 2018 earnings call transcript)
  • 7. The Journey. 2015: Small Scale Experiments. 2016: Few Pipelines in Production. 2017: Running 60TB+ shuffle pipelines. 2018: Full-production deployment; Successor to Apache Hive at Facebook. 2019: Scaling Spark; Largest Compute Engine at Facebook by CPU
  • 10. Hardware Trends CPU, DRAM, and Disk 1. The industry is optimizing for throughput by adding more cores 2. To optimize performance/watt, next generation processors will have more cores that run at lower frequency
  • 11. Hardware Trends CPU, DRAM, and Disk 1. The price of DRAM continued to rise throughout 2016-2018 and has started fluctuating this year 2. Need to reduce our over-dependence on DRAM
  • 12. Hardware Trends CPU, DRAM, and Disk 1. Disk sizes continue to increase but the number of random accesses per second isn't increasing 2. IOPS becomes a bottleneck
  • 13. What does this mean for Spark? 1. Optimize Spark for increasing core-memory ratio 2. Run Spark on disaggregated compute/storage clusters - Use server types optimized for compute and storage - Scale/upgrade clusters independently over time depending on whether CPU or IOPS is the bottleneck 3. Scale extremely diverse workloads (SQL, ML etc.) on Spark over clusters of tens of thousands of heterogeneous machines
  • 17. Spark Architecture at Facebook. Diagram labels: Compute Cluster (Executor #1, Executors #2); Storage Cluster (Distributed FS instance #1, #2, #3); Spill, Cache, Shuffle; Tangram Scheduler; Heterogeneous Hardware (purchased over 0-5 years). Related talks: Brian Cho and Dmitry Borovsky, "Cosco: An Efficient Facebook-Scale Shuffle Service", today at 4:30PM (Developer Track); Rui Jian and Hao Lin, "Tangram: Distributed Scheduling for Spark at Facebook", tomorrow at 11:50AM (Developer Track)
  • 18. Agenda (as on slide 3). Contributed 100+ patches upstream
  • 20. History-Based Tuning: Motivation. Chart: Cluster Memory Utilization over 1 week: max (80-100%), p95 (55-70%), p50 (10-60%). One-size-fits-all configs result in under-utilization of resources
  • 22. History-Based Tuning: Motivation. Chart: Percentage of Spark Tasks (CDF) vs. Peak Execution Memory Bytes. 75% of Spark tasks use less than 600 MB of peak execution memory. Individual resource requirements for each Spark task have a huge variance
  • 23. History-Based Tuning 1. Need to tune Spark on a per-job or a per-stage basis 2. Leverage historical characteristics of the job to tune resources: • Peak executor memory and spill sizes to tune executor off-heap memory • Shuffle size to optionally not insert partial aggregates in the query plan • Predicting the number of shuffle partitions (job level and stage level)
  • 24. History-Based Tuning. New Query; Query Plan Template:
        InsertIntoHiveTable [partitions: ds,country]
        +- *Project [cast(key as int) AS key, value]
           +- *HiveTableScan (db.test) [col: key,value] [part: ds]
  • 25. History-Based Tuning. Flow: New Query; Query Plan Template; Historical Job Runs; Config Override Rules. No Regressions/Failures in the past N days: Apply Config Overrides. Regressions/Failures in the past N days: Apply Conservative Defaults
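The decision on slide 25 can be sketched roughly as follows. This is only an illustration under stated assumptions: the `JobRun` record, the `choose_config` helper, and the concrete values are hypothetical and do not come from the talk; `spark.sql.shuffle.partitions` is a standard Spark setting used here as a stand-in for whatever the override rules actually tune.

```python
from dataclasses import dataclass


@dataclass
class JobRun:
    """One historical run of a query plan template (hypothetical record shape)."""
    days_ago: int
    failed: bool
    regressed: bool


# Hypothetical values; the talk does not give concrete numbers.
CONSERVATIVE_DEFAULTS = {"spark.sql.shuffle.partitions": "2000"}


def choose_config(history, overrides, lookback_days):
    """Apply tuned overrides only if the template ran cleanly in the lookback window."""
    recent = [run for run in history if run.days_ago <= lookback_days]
    if not recent:
        # No recent history for this template: nothing to base tuning on.
        return dict(CONSERVATIVE_DEFAULTS)
    if any(run.failed or run.regressed for run in recent):
        # A regression or failure in the window: fall back to safe settings.
        return dict(CONSERVATIVE_DEFAULTS)
    return {**CONSERVATIVE_DEFAULTS, **overrides}
```

Treating "no recent history" the same as "recent failure" is a design choice of this sketch, not something the slides state.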
  • 26. Joins in Spark. 1. Broadcast Join: Broadcast small table to all nodes, stream the larger table; skew resistant 2. Shuffle-Hash Join: Shuffle both tables, create a hashmap with smaller table and stream the larger table 3. Sort-Merge Join: Shuffle and sort both tables, buffer one side and stream the other side
  • 27. Sort-Merge-Bucket (SMB) Join (SPARK-19256). 1. Bucketing is a way to shuffle (and optionally sort) output data based on certain columns of a table 2. Ideal for write-once, read-many datasets 3. Variant of Sort-Merge Join in Spark; overrides outputPartitioning and outputOrdering for HiveTableScanExec and stitches partitioning/ordering metadata throughout the query plan
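As a rough illustration of why bucketing helps, the single-process sketch below pays the shuffle-and-sort cost once at write time and then joins matching buckets pairwise at read time. It is not Spark's implementation: the function names are made up for this example, keys are assumed to be integers, and both tables are assumed to use the same bucket count.

```python
def bucket_and_sort(rows, num_buckets):
    """Write-time step: place each (key, value) row in bucket key % num_buckets, sorted by key."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[key % num_buckets].append((key, value))
    return [sorted(bucket, key=lambda row: row[0]) for bucket in buckets]


def merge_join(left, right):
    """Join two key-sorted lists of (key, value) rows, handling duplicate keys."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right, then pair it with each left row.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], right[k][1]) for k in range(j, j_end))
                i += 1
            j = j_end
    return out


def smb_join(left_buckets, right_buckets):
    """Read-time step: join matching buckets pairwise; no further shuffle or sort is needed."""
    result = []
    for left_bucket, right_bucket in zip(left_buckets, right_buckets):
        result.extend(merge_join(left_bucket, right_bucket))
    return result
```

Because rows with the same key always land in the same bucket index on both sides, each bucket pair can be joined independently, which is what makes the layout attractive for write-once, read-many tables.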
  • 28. Dynamic Join (SPARK-21505). A hybrid join algorithm wherein each task starts off by executing a shuffle-hash join. During execution, should the hash table exceed a certain size (and OOM), it automatically reconstructs/sorts the iterators and falls back to a sort-merge join
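The fallback idea can be sketched in plain Python. This is a simplified stand-in, not the SPARK-21505 patch: the row-count budget `max_build_rows` is a hypothetical proxy for the hash table's memory limit, and the fallback here simply restarts from the materialized inputs rather than reconstructing partially consumed iterators.

```python
def hash_join(build, probe, max_build_rows):
    """Shuffle-hash style join; returns None if the build side exceeds the budget."""
    table = {}
    for count, (key, value) in enumerate(build, start=1):
        if count > max_build_rows:
            return None  # Stand-in for the hash table growing past its size limit.
        table.setdefault(key, []).append(value)
    return [(key, b, p) for key, p in probe for b in table.get(key, [])]


def sort_merge_join(build, probe):
    """Sort both sides by key, then merge; handles duplicate keys."""
    left = sorted(build, key=lambda row: row[0])
    right = sorted(probe, key=lambda row: row[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], right[k][1]) for k in range(j, j_end))
                i += 1
            j = j_end
    return out


def dynamic_join(build, probe, max_build_rows):
    """Try the hash join first; fall back to sort-merge if the build side is too large."""
    result = hash_join(build, probe, max_build_rows)
    if result is None:
        return sort_merge_join(build, probe), "sort-merge"
    return result, "shuffle-hash"
```

Both paths return the same set of `(key, build_value, probe_value)` rows; only the strategy string and the row order differ.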
  • 29. Skew Join. A hybrid join algorithm that processes skewed keys via a broadcast join and non-skewed keys via a shuffle-hash or sort-merge join. Example: SELECT /*+ SKEWED_ON(a.userid='10001') */ a.userid FROM table_A a INNER JOIN table_B b ON a.userid = b.userid
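To see why routing skewed keys through a broadcast path evens out task load, consider the toy simulation below. It is an assumption-laden sketch, not Facebook's implementation: `skew_join`, the integer keys, and the modulo partitioning are all illustrative, and "tasks" are just list indices.

```python
from collections import defaultdict


def skew_join(big, small, skewed_keys, num_tasks):
    """Return (joined rows, number of `big` rows handled by each task)."""
    # Broadcast path: small-side rows for skewed keys are copied to every task.
    broadcast = defaultdict(list)
    for key, value in small:
        if key in skewed_keys:
            broadcast[key].append(value)

    # Shuffle path: small-side rows for the remaining keys are partitioned by key.
    shuffled_small = [defaultdict(list) for _ in range(num_tasks)]
    for key, value in small:
        if key not in skewed_keys:
            shuffled_small[key % num_tasks][key].append(value)

    joined, load = [], [0] * num_tasks
    for position, (key, value) in enumerate(big):
        if key in skewed_keys:
            task = position % num_tasks  # Skewed rows stay spread across all tasks.
            matches = broadcast.get(key, [])
        else:
            task = key % num_tasks  # Non-skewed rows go to the key's partition.
            matches = shuffled_small[task].get(key, [])
        load[task] += 1
        joined.extend((key, value, match) for match in matches)
    return joined, load
```

With one hot key carrying most of the rows, passing it in `skewed_keys` spreads its rows over every task, whereas leaving `skewed_keys` empty sends them all to a single partition while producing the same join result.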
  • 31. Who uses Spark? Data Scientists (10%), Data Engineers (15%), Software Engineers (60%), Others (15%)
  • 32. Automatic Error Classification aka Failure Attribution. Error Classification: • System vs. User • Retriability • Root Cause. Showing actionable error messages
  • 33. How is Spark used? Diversity of Workload. Chart: share by query count vs. share of CPU. Pure SQL: 54% of queries, 72% of CPU. UDF & Transforms: 45% of queries, 20% of CPU. DataFrames: 1% of queries, 8% of CPU
  • 34. Data Driven Decisions: Standardized Testing. Steps: Change X; Standardized Tests; Log Metrics; Evaluate Results
  • 35. Data Driven Decisions: Shadow Testing. Steps: Change X; Create a tag; Shadow Testing (tag based selection); Log Metrics; Evaluate Results
  • 36. Where do we use it? • New Features • Regular Releases • Configuration Updates • Hardware Testing
  • 37. Workload Prioritization. Diagram labels: Spark Cluster; Team 1 and Team 2 (60% / 40%); queues: Backfill, Pipelines, Fastlane, Interactive; scheduling policies shown: FIFO, User Fair Share, DRF
  • 38. Defensive Deployment: Guardrails for us (and users). • Hard limits on config values • Capacity Quotas (Storage and Compute) • Strict resource limits (containerization)
  • 39. Resource Limits. Cgroup v2; Spark Executor at /cgroup2/task_container/exec1
  • 40. Memory Oversubscription: Finding the balance. /cgroup2/task_container/ (40 GB) containing exec1, exec2, exec3, exec4; memory.max: 12 GB each
  • 41. A tale of two resources. Chart: Resource Utilization % for CPU and Memory, Mar 2nd – Mar 10th 2019
  • 42. A tale of THREE resources, or my love-hate relationship with cgroups. Chart: Resource Utilization % for CPU, Memory, and Disk IO, Mar 2nd – Mar 10th 2019
  • 44. Memory Oversubscription: The full picture. /cgroup2/task_container/ (40 GB) containing exec1, exec2, exec3, exec4; memory.max: 12 GB each; memory.high: 10 GB each
  • 45. Memory.what? memory.high is the memory usage throttle limit. This is the main mechanism to control a cgroup’s memory use. If a cgroup's memory use goes over the high boundary specified here, the cgroup’s processes are throttled and put under heavy reclaim pressure. The default is max, meaning there is no limit. memory.max is the memory usage hard limit, acting as the final protection mechanism: If a cgroup's memory usage reaches this limit and can't be reduced, the system OOM killer is invoked on the cgroup.
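The arithmetic behind slides 44 and 45 can be checked with a small sketch. The helper names and state labels below are hypothetical; the numbers (a 40 GB parent, four executors with memory.max of 12 GB and memory.high of 10 GB) are taken from slide 44, and the three states paraphrase the memory.high and memory.max descriptions on slide 45.

```python
GB = 1024 ** 3


def oversubscription_ratio(child_limits, parent_limit):
    """Sum of the per-executor limits divided by the parent cgroup's limit."""
    return sum(child_limits) / parent_limit


def memory_state(usage, high, maximum):
    """Classify an executor's usage against its memory.high and memory.max."""
    if usage >= maximum:
        return "at hard limit"  # OOM killer is invoked if usage cannot be reduced.
    if usage > high:
        return "throttled"  # Processes are throttled and put under reclaim pressure.
    return "ok"


parent = 40 * GB
hard_limits = [12 * GB] * 4  # memory.max per executor
soft_limits = [10 * GB] * 4  # memory.high per executor

max_ratio = oversubscription_ratio(hard_limits, parent)
high_ratio = oversubscription_ratio(soft_limits, parent)
```

Under these numbers the hard limits add up to 48 GB against a 40 GB parent (a ratio of 1.2), while the throttle limits add up to exactly 40 GB, so executors are squeezed by reclaim pressure before the container as a whole runs out.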
  • 47. Memory Oversubscription: The full picture. /cgroup2/task_container/ (40 GB) containing exec1, exec2, exec3, exec4; memory.max: 12 GB each; memory.high: 4 GB each
  • 49. So… What happened? • Our cgroup configuration was wrong • History Based scheduling
  • 50. Takeaways. • Cgroups configuration can be tricky • Find the right balance between efficiency and reliability • Bonus: Better resource control on IO
  • 52. The Road Ahead. • Scaling Spark 10X • Redefining “Warehouse” • Beyond SQL