SlideShare una empresa de Scribd logo
1 de 31
Descargar para leer sin conexión
Robert Xue
Software Engineer, Groupon
CONTINUOUS APPLICATION
WITH FAIR SCHEDULER
About me
Groupon Online Defense
• Malicious behavior
detection
• A log stream processing
engine
Tech stack
• Kafka
• Storm
• Spark on Yarn
• In-house hardware
Software engineer, Behavioral Modeling team
Scheduler
• Code name: SchedulableBuilder
• FIFO and FAIR.
• Update spark.scheduler.mode to switch
Job pool scheduling mode
• Code name SchedulingAlgorithm
• FIFO and FAIR, applies to FAIR
scheduler only
• Update fairscheduler.xml to decide
Application
• Created by spark-submit
Job
• A group of tasks
• Unit of work to be submitted
Task
• Unit of work to be scheduled
Glossary
What does a scheduler do?
• Determine who can use the resources
• Tweak to optimize
‣ Total execution time
‣ Resource utilization
‣ Total CPU seconds
When does your scheduler work?
• Only when you have more than one job
submitted to spark context
• Writing your rdd operations line by line != multiple
jobs submitted at the same time
Range(0, 24).foreach { i =>
sc.textFile(s”file-$i").count()
}
Run - 1
Sequentially submitting 24 file reading jobs
• 4 parts, ~500MB files * 24
• 8 cores
• 110 seconds
Range(0, 24).foreach { i =>
sc.textFile(s”file-$i").count()
}
Node_local - Rack_local - Any - Unused
• 8 * 110 * 11.33% = 99.68
cpu seconds
• ~20 individual jobs clearly
visible
Submit your job in parallel
• Java Runnable
• Python threading
• Scala Future
• Scalaz Task
• Monix Task
Range(0, 24).foreach { i =>
sc.textFile(s”file-$i").count()
}
Range(0, 24).par.foreach { i =>
sc.textFile(s”file-$i").count()
}
Run - 2
Parallel submit & FIFO(24)
• 8 cores
• 110 22 seconds
• 8 * 22 * 82.04% =
99.68144.32 cpu
seconds
• 015 seconds 100%
utilization period
Range(0, 24).par.foreach { i =>
sc.textFile(s”file-$i").count()
}
Node_local - Rack_local - Any - Unused
Turn on FAIR scheduler
Provide --conf spark.scheduler.mode=FAIR in your
spark submit…
… it is that simple
Run - 3
Parallel submit & FAIR(24)
• 8 cores
• 22 15 seconds
• 8 * 15 * 73.74% =
144.3288.48record
cpu
seconds
• BadGreat locality
Node_local - Rack_local - Any - Unused
Range(0, 24).par.foreach { i =>
sc.textFile(s”file-$i").count()
}
Tweak your locality config
Provide --conf spark.locality.wait=1s in your spark
submit…
• spark.locality.wait.process/node/rack available as
well
• default is 3s
Run - 4
Parallel submit & FAIR(24) & locality.wait=1s
• 8 cores
• 15 12record
seconds
• 8 * 12 * 97.64% =
88.4893.68 cpu
seconds
Node_local - Rack_local - Any - Unused
Range(0, 24).par.foreach { i =>
sc.textFile(s”file-$i").count()
}
Summary
Cluster
utilization
Execution time Locality
tweaking
Miscellaneous
Sequential Poor Long Hard Default
behavior in
Spark-shell
Parallel {FIFO} Good Short Hard Default
behavior in
Notebooks
Parallel {FAIR} Good Short Easy
Standalone applications
Stream, Batch, Adhoc query, submitted from multiple apps
Read data
Model
Execution
Stream
Process
Reporting
Continuous application
Stream, Batch, Adhoc query, parallel submitted in one app
Read data ◼
Model
Execution ◼
Stream
Process ◼
Reporting ◼
Multiple standalone applications Continuous application
Sharing
context
Share file content using database,
file system, network, etc
Simply pass RDD reference
around
Resource
allocation
Static; configure more cores than
typical load to handle peak traffic
Dynamic; less important tasks yield
CPU critical tasks
Job
scheduling
Crontab, oozie, airflow… — all
approaches leads to spark-submit,
a typically 20s - 120s overhead
Spark-submit once and only once
Continuous Application
At t=2.5, SeriesA and B finished processing stream input at t=1.0
A round of model execution had been triggered at t=2.5
Model was triggered using all available data at that moment
At t=3.1, app received stream input at t=3.0
SeriesA was still processing input at t=2.0
SeriesB started processing the new input.
At t=3.6, SeriesA finished processing stream input at t=2.0
SeriesA started processing stream input at t=3.0
Model had a pending execution triggered at t=3.6
At t=4.3, Series A and B finished processing stream input at t=3.0
Model had a pending execution triggered at t=4.3
Model’s execution at t=3.6 was cancelled
At t=4.6, model finished it’s execution at t=2.5
Series B finished processing stream input at t=4.0
Model started execution that was triggered at t=4.3
Model used data that was available at t=4.6 as input
Example:
4 models executing in parallel from a 16 cores app
Practices for Continuous Application
Decouple stream processing from batch processing
• Batch interval is merely your minimum resolution
Emphasis tuning for streaming part
• Assign to a scheduler pool with minimum core guarantee
Execute only the latest batch invocation
• Assign to a scheduler pool with high weight
Reporting and query onsite
• Assign to a scheduler pool with low weight
Summary
Coding
• Share data by passing RDD/DStream/DF/DS
• Always launch jobs in a separate thread
• No complex logic in the streaming operation
• Push batch job invocation into a queue
• Execute only the latest batch job
App submission
• Turn on FAIR scheduler mode
• Provide fairscheduler.xml
• Turn off stream back pressure for Streaming
app
• Turn off dynamic allocation for Streaming app
• Turn on dynamic allocation for long-lived batch
app
Job bootstrapping
• sc.setJobGroup(“group”, “description”)
• sc.setLocalProperty("spark.scheduler.pool",
“pool”)
• rdd.setName(“my data at t=0”).persist()
Packaging
• Multiple logic, one repo, one jar
• Batch app can be long-lived as well
• Replace crontab with continuous app + REST
Tuning
• Understand your job’s opportunity cost
• Tweak spark.locality.wait parameters
• Config cores that yields acceptable SLA with
good resource utilization
Thank You.
Robxue Xue, Software Engineer, Groupon
rxue@groupon.com | @roboxue
Tool used in this deck: groupon/sparklint
Open sourced on github since Spark Summit EU 2016
Q & A
Why do you need a scheduler - 3
Parallelly submitting 24 file reading jobs - FIFO(4)
• 8 cores
• 22 27 seconds
• 8 * 27 * 46.74% =
99.68100.88 cpu
seconds
• BadImproved locality
implicit val ec =
ExecutionContext.fromExecutor(
new ForkJoinPool(4))
Range(0, 24).foreach(i=> Future{
sc.textFile(s”file-$i”).count()
})
Node_local - Rack_local - Any - Unused
Why do you need a scheduler - 4
Parallelly submitting 24 file reading jobs - FAIR(4)
• 8 cores
• 22 32 seconds
• 8 * 32 * 40.92% =
99.68104.72 cpu
seconds
• BadGreat locality
// --conf spark.scheduler.mode=FAIR
implicit val ec =
ExecutionContext.fromExecutor(
new ForkJoinPool(4))
Range(0, 24).foreach(i=> Future{
sc.textFile(s”file-$i”).count()
}) Node_local - Rack_local - Any - Unused
Back up:
locality settings matters
spark.scheduler.mode=FAIR
spark.locality.wait=500ms
spark.scheduler.mode=FAIR
spark.locality.wait=20s
Node_local - Rack_local - Any - Unused
Backup:
Scheduler summary
Sequential job submit Parallel job submit
Under Parallelized
Parallel job submit
Perfect Parallelized
Parallel job submit
Over Parallelized /
Poor locality settings
FIFO scheduler Under-utilized cluster
Good locality
Low core usage
Long execution time
Under-utilized cluster
Poor locality
High core usage
Short execution time
Well-utilized cluster
Poor locality
High core usage
Short execution time Well-utilized cluster
Poor locality
High core usage
Long execution time
FAIR scheduler N/A Under-utilized cluster
Good locality
Low core usage
Short execution time
Well-utilized cluster
Good locality
Low core usage
Short execution time
Node_local - Rack_local - Any - Unused
Backup:
What’s wrong with Dynamic Allocation in streaming
… if you are analyzing time series data / “tensor” …
- Streaming app, always up. Workloads comes periodically and sequentially.
- Core usage graph has a saw-tooth pattern
- Less likely to return executors, if your batch interval <
executorIdleTimeout
- Dynamic allocation is off if executor has cache on it
- “by default executors containing cached data are never removed”
- Dynamic allocation == “Resource blackhole” ++ “poor utilization”

Más contenido relacionado

La actualidad más candente

Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationDatabricks
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsDatabricks
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internalsKostas Tzoumas
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDatabricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLDatabricks
 
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...Spark Summit
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationDatabricks
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Spark Summit
 
Getting The Best Performance With PySpark
Getting The Best Performance With PySparkGetting The Best Performance With PySpark
Getting The Best Performance With PySparkSpark Summit
 
SQL Performance Improvements at a Glance in Apache Spark 3.0
SQL Performance Improvements at a Glance in Apache Spark 3.0SQL Performance Improvements at a Glance in Apache Spark 3.0
SQL Performance Improvements at a Glance in Apache Spark 3.0Databricks
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkDatabricks
 
How to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They WorkHow to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They WorkIlya Ganelin
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxData
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 

La actualidad más candente (20)

Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 
Apache Spark Components
Apache Spark ComponentsApache Spark Components
Apache Spark Components
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
 
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
 
Getting The Best Performance With PySpark
Getting The Best Performance With PySparkGetting The Best Performance With PySpark
Getting The Best Performance With PySpark
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
SQL Performance Improvements at a Glance in Apache Spark 3.0
SQL Performance Improvements at a Glance in Apache Spark 3.0SQL Performance Improvements at a Glance in Apache Spark 3.0
SQL Performance Improvements at a Glance in Apache Spark 3.0
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
How to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They WorkHow to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They Work
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 

Similar a Continuous Application with FAIR Scheduler with Robert Xue

Groovy concurrency
Groovy concurrencyGroovy concurrency
Groovy concurrencyAlex Miller
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scalethelabdude
 
Spark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital KediaSpark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital KediaSpark Summit
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...SignalFx
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedwhoschek
 
Performance van Java 8 en verder - Jeroen Borgers
Performance van Java 8 en verder - Jeroen BorgersPerformance van Java 8 en verder - Jeroen Borgers
Performance van Java 8 en verder - Jeroen BorgersNLJUG
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
Async and parallel patterns and application design - TechDays2013 NL
Async and parallel patterns and application design - TechDays2013 NLAsync and parallel patterns and application design - TechDays2013 NL
Async and parallel patterns and application design - TechDays2013 NLArie Leeuwesteijn
 
OGG Architecture Performance
OGG Architecture PerformanceOGG Architecture Performance
OGG Architecture PerformanceEnkitec
 
Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)
Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)
Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)Bobby Curtis
 
Oracle GoldenGate Architecture Performance
Oracle GoldenGate Architecture PerformanceOracle GoldenGate Architecture Performance
Oracle GoldenGate Architecture PerformanceEnkitec
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACKristofferson A
 
Building source code level profiler for C++.pdf
Building source code level profiler for C++.pdfBuilding source code level profiler for C++.pdf
Building source code level profiler for C++.pdfssuser28de9e
 
.NET Multithreading/Multitasking
.NET Multithreading/Multitasking.NET Multithreading/Multitasking
.NET Multithreading/MultitaskingSasha Kravchuk
 
Stream processing from single node to a cluster
Stream processing from single node to a clusterStream processing from single node to a cluster
Stream processing from single node to a clusterGal Marder
 
Performance Benchmarking: Tips, Tricks, and Lessons Learned
Performance Benchmarking: Tips, Tricks, and Lessons LearnedPerformance Benchmarking: Tips, Tricks, and Lessons Learned
Performance Benchmarking: Tips, Tricks, and Lessons LearnedTim Callaghan
 
Speed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorSpeed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorDatabricks
 
Spark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca CanaliSpark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca CanaliSpark Summit
 

Similar a Continuous Application with FAIR Scheduler with Robert Xue (20)

Groovy concurrency
Groovy concurrencyGroovy concurrency
Groovy concurrency
 
Concurrency
ConcurrencyConcurrency
Concurrency
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scale
 
Spark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital KediaSpark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital Kedia
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
 
Performance van Java 8 en verder - Jeroen Borgers
Performance van Java 8 en verder - Jeroen BorgersPerformance van Java 8 en verder - Jeroen Borgers
Performance van Java 8 en verder - Jeroen Borgers
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Async and parallel patterns and application design - TechDays2013 NL
Async and parallel patterns and application design - TechDays2013 NLAsync and parallel patterns and application design - TechDays2013 NL
Async and parallel patterns and application design - TechDays2013 NL
 
OGG Architecture Performance
OGG Architecture PerformanceOGG Architecture Performance
OGG Architecture Performance
 
Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)
Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)
Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)
 
Oracle GoldenGate Architecture Performance
Oracle GoldenGate Architecture PerformanceOracle GoldenGate Architecture Performance
Oracle GoldenGate Architecture Performance
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
 
Building source code level profiler for C++.pdf
Building source code level profiler for C++.pdfBuilding source code level profiler for C++.pdf
Building source code level profiler for C++.pdf
 
.NET Multithreading/Multitasking
.NET Multithreading/Multitasking.NET Multithreading/Multitasking
.NET Multithreading/Multitasking
 
Stream processing from single node to a cluster
Stream processing from single node to a clusterStream processing from single node to a cluster
Stream processing from single node to a cluster
 
Shooting the Rapids
Shooting the RapidsShooting the Rapids
Shooting the Rapids
 
Performance Benchmarking: Tips, Tricks, and Lessons Learned
Performance Benchmarking: Tips, Tricks, and Lessons LearnedPerformance Benchmarking: Tips, Tricks, and Lessons Learned
Performance Benchmarking: Tips, Tricks, and Lessons Learned
 
Speed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorSpeed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS Accelerator
 
Spark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca CanaliSpark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca Canali
 

Más de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Más de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Business Analytics using Microsoft Excel
Business Analytics using Microsoft ExcelBusiness Analytics using Microsoft Excel
Business Analytics using Microsoft Excelysmaelreyes
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一F La
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 

Último (20)

专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Business Analytics using Microsoft Excel
Business Analytics using Microsoft ExcelBusiness Analytics using Microsoft Excel
Business Analytics using Microsoft Excel
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 

Continuous Application with FAIR Scheduler with Robert Xue

  • 1. Robert Xue Software Engineer, Groupon CONTINUOUS APPLICATION WITH FAIR SCHEDULER
  • 2. About me Groupon Online Defense • Malicious behavior detection • A log stream processing engine Tech stack • Kafka • Storm • Spark on Yarn • In-house hardware Software engineer, Behavioral Modeling team
  • 3. Scheduler • Code name: SchedulableBuilder • FIFO and FAIR. • Update spark.scheduler.mode to switch Job pool scheduling mode • Code name SchedulingAlgorithm • FIFO and FAIR, applies to FAIR scheduler only • Update fairscheduler.xml to decide Application • Created by spark-submit Job • A group of tasks • Unit of work to be submitted Task • Unit of work to be scheduled Glossary
  • 4. What does a scheduler do? • Determine who can use the resources • Tweak to optimize ‣ Total execution time ‣ Resource utilization ‣ Total CPU seconds
  • 5. When does your scheduler work? • Only when you have more than one job submitted to spark context • Writing your rdd operations line by line != multiple jobs submitted at the same time Range(0, 24).foreach { i => sc.textFile(s”file-$i").count() }
  • 6. Run - 1 Sequentially submitting 24 file reading jobs • 4 parts, ~500MB files * 24 • 8 cores • 110 seconds Range(0, 24).foreach { i => sc.textFile(s”file-$i").count() } Node_local - Rack_local - Any - Unused • 8 * 110 * 11.33% = 99.68 cpu seconds • ~20 individual jobs clearly visible
  • 7. Submit your job in parallel • Java Runnable • Python threading • Scala Future • Scalaz Task • Monix Task Range(0, 24).foreach { i => sc.textFile(s”file-$i").count() } Range(0, 24).par.foreach { i => sc.textFile(s”file-$i").count() }
  • 8. Run - 2 Parallel submit & FIFO(24) • 8 cores • 110 22 seconds • 8 * 22 * 82.04% = 99.68144.32 cpu seconds • 015 seconds 100% utilization period Range(0, 24).par.foreach { i => sc.textFile(s”file-$i").count() } Node_local - Rack_local - Any - Unused
  • 9. Turn on FAIR scheduler Provide --conf spark.scheduler.mode=FAIR in your spark submit… … it is that simple
  • 10. Run - 3 Parallel submit & FAIR(24) • 8 cores • 22 15 seconds • 8 * 15 * 73.74% = 144.3288.48record cpu seconds • BadGreat locality Node_local - Rack_local - Any - Unused Range(0, 24).par.foreach { i => sc.textFile(s”file-$i").count() }
  • 11. Tweak your locality config Provide --conf spark.locality.wait=1s in your spark submit… • spark.locality.wait.process/node/rack available as well • default is 3s
  • 12. Run - 4 Parallel submit & FAIR(24) & locality.wait=1s • 8 cores • 15 12record seconds • 8 * 12 * 97.64% = 88.4893.68 cpu seconds Node_local - Rack_local - Any - Unused Range(0, 24).par.foreach { i => sc.textFile(s”file-$i").count() }
  • 13. Summary Cluster utilization Execution time Locality tweaking Miscellaneous Sequential Poor Long Hard Default behavior in Spark-shell Parallel {FIFO} Good Short Hard Default behavior in Notebooks Parallel {FAIR} Good Short Easy
  • 14. Standalone applications Stream, Batch, Adhoc query, submitted from multiple apps Read data Model Execution Stream Process Reporting
  • 15. Continuous application Stream, Batch, Adhoc query, parallel submitted in one app Read data ◼ Model Execution ◼ Stream Process ◼ Reporting ◼
  • 16. Multiple standalone applications Continuous application Sharing context Share file content using database, file system, network, etc Simply pass RDD reference around Resource allocation Static; configure more cores than typical load to handle peak traffic Dynamic; less important tasks yield CPU critical tasks Job scheduling Crontab, oozie, airflow… — all approaches leads to spark-submit, a typically 20s - 120s overhead Spark-submit once and only once Continuous Application
  • 17. At t=2.5, SeriesA and B finished processing stream input at t=1.0 A round of model execution had been triggered at t=2.5 Model was triggered using all available data at that moment
  • 18. At t=3.1, app received stream input at t=3.0 SeriesA was still processing input at t=2.0 SeriesB started processing the new input.
  • 19. At t=3.6, SeriesA finished processing stream input at t=2.0 SeriesA started processing stream input at t=3.0 Model had a pending execution triggered at t=3.6
  • 20. At t=4.3, Series A and B finished processing stream input at t=3.0 Model had a pending execution triggered at t=4.3 Model’s execution at t=3.6 was cancelled
  • 21. At t=4.6, model finished it’s execution at t=2.5 Series B finished processing stream input at t=4.0 Model started execution that was triggered at t=4.3 Model used data that was available at t=4.6 as input
  • 22. Example: 4 models executing in parallel from a 16 cores app
  • 23. Practices for Continuous Application Decouple stream processing from batch processing • Batch interval is merely your minimum resolution Emphasis tuning for streaming part • Assign to a scheduler pool with minimum core guarantee Execute only the latest batch invocation • Assign to a scheduler pool with high weight Reporting and query onsite • Assign to a scheduler pool with low weight
  • 24. Summary Coding • Share data by passing RDD/DStream/DF/DS • Always launch jobs in a separate thread • No complex logic in the streaming operation • Push batch job invocation into a queue • Execute only the latest batch job App submission • Turn on FAIR scheduler mode • Provide fairscheduler.xml • Turn off stream back pressure for Streaming app • Turn off dynamic allocation for Streaming app • Turn on dynamic allocation for long-lived batch app Job bootstrapping • sc.setJobGroup(“group”, “description”) • sc.setLocalProperty("spark.scheduler.pool", “pool”) • rdd.setName(“my data at t=0”).persist() Packaging • Multiple logic, one repo, one jar • Batch app can be long-lived as well • Replace crontab with continuous app + REST Tuning • Understand your job’s opportunity cost • Tweak spark.locality.wait parameters • Config cores that yields acceptable SLA with good resource utilization
  • 25. Thank You. Robxue Xue, Software Engineer, Groupon rxue@groupon.com | @roboxue Tool used in this deck: groupon/sparklint Open sourced on github since Spark Summit EU 2016
  • 26. Q & A
  • 27. Why do you need a scheduler - 3 Parallelly submitting 24 file reading jobs - FIFO(4) • 8 cores • 22 27 seconds • 8 * 27 * 46.74% = 99.68100.88 cpu seconds • BadImproved locality implicit val ec = ExecutionContext.fromExecutor( new ForkJoinPool(4)) Range(0, 24).foreach(i=> Future{ sc.textFile(s”file-$i”).count() }) Node_local - Rack_local - Any - Unused
  • 28. Why do you need a scheduler - 4 Parallelly submitting 24 file reading jobs - FAIR(4) • 8 cores • 22 32 seconds • 8 * 32 * 40.92% = 99.68104.72 cpu seconds • BadGreat locality // --conf spark.scheduler.mode=FAIR implicit val ec = ExecutionContext.fromExecutor( new ForkJoinPool(4)) Range(0, 24).foreach(i=> Future{ sc.textFile(s”file-$i”).count() }) Node_local - Rack_local - Any - Unused
  • 29. Back up: locality settings matters spark.scheduler.mode=FAIR spark.locality.wait=500ms spark.scheduler.mode=FAIR spark.locality.wait=20s Node_local - Rack_local - Any - Unused
  • 30. Backup: Scheduler summary Sequential job submit Parallel job submit Under Parallelized Parallel job submit Perfect Parallelized Parallel job submit Over Parallelized / Poor locality settings FIFO scheduler Under-utilized cluster Good locality Low core usage Long execution time Under-utilized cluster Poor locality High core usage Short execution time Well-utilized cluster Poor locality High core usage Short execution time Well-utilized cluster Poor locality High core usage Long execution time FAIR scheduler N/A Under-utilized cluster Good locality Low core usage Short execution time Well-utilized cluster Good locality Low core usage Short execution time Node_local - Rack_local - Any - Unused
  • 31. Backup: What’s wrong with Dynamic Allocation in streaming … if you are analyzing time series data / “tensor” … - Streaming app, always up. Workloads comes periodically and sequentially. - Core usage graph has a saw-tooth pattern - Less likely to return executors, if your batch interval < executorIdleTimeout - Dynamic allocation is off if executor has cache on it - “by default executors containing cached data are never removed” - Dynamic allocation == “Resource blackhole” ++ “poor utilization”