SlideShare una empresa de Scribd logo
1 de 28
Descargar para leer sin conexión
How to
performance-tune
spark applications in
large clusters
- Omkar Joshi
Omkar Joshi
ojoshi@netflix.com
● Software engineer @ Netflix
● Architect & author of Marmaray
(Generic ingestion framework) @
Uber.
● Architected Object store & NFS
solutions at Hedvig
● Hadoop Yarn committer
● Enjoy gardening & hiking in free
time
01 JVM Profiler
02 Hadoop Profiler
03 Spark Listener
04 Auto tune
05 Storage improvements
06 Cpu / Runtime improvements
07 Efficiency improvements
08 Memory improvements
Agenda
JVM Profiler: Distributed
Profiling at a Large Scale
● Help tracking memory/cpu/stacktrace for large amount of Spark
executors
● Open sourced (https://github.com/uber-common/jvm-profiler)
● Presented in previous Spark Summit
● Some new update
Hadoop Profiler (Recap)
● Java Agent attached to each
executor
● Collects metrics via JMX and
/proc
● Instruments arbitrary Java user
code
● Emits to Kafka, InfluxDB, and
Redis and other data sinks
Method
Argument
Profiler
Method
duration
Profiler
CPU/Memory
Profiler
Reporter
Java Agent
Kafka InfluxDB
https://github.com/uber-common/jvm-profiler
Spark Listener
● Plugable Listener
○ spark.extraListeners=com.foo.MySparkListener
● Modify Spark Code and Send Execution Plan to Spark Listener
● Generate Data Lineage Information
● Offline Analysis for Spark Task Execution
Auto Tune
● Problem: data scientist using team level Spark conf template
● Known Daily Applications
○ Use historical run to set Spark configurations (memory, vcore, etc.) *
● Ad-hoc Repeated (Daily) Applications
○ Use Machine Learning to predict resource usage
○ Challenge: Feature Engineering
■ Execution Plan
Project [user_id, product_id, price]
Filter (date = 2019-01-31 and product_id = xyz)
UnresolvedRelation datalake.user_purchase
Open Sourced in September 2018
https://github.com/uber/marmaray
Blog Post:
https://eng.uber.com/marmaray-hado
op-ingestion-open-source/
High-Level Architecture
Chain of converters
Schema Service
Input
Storage
System
Source
Connector
M3 Monitoring System
Work
Unit
Calculator
Metadata Manager
(Checkpoint store)
Converter1 Converter 2
Sink
Connector
Output
Storage
System
Error Tables
Datafeed Config Store
Spark execution framework
Spark job
improvements
Storage
improvements
Effective data layout (parquet)
● Parquet uses columnar compression
● Columnar compression savings outperform gz or snappy compression
savings
● Align records such that adjacent rows have identical column values
○ Eg. For state column California.
● Sort records with increasing value of cardinality.
● Generalization sometimes is not possible; If it is a framework provide custom
sorting.
User Id First Name City State Rider score
abc123011101 John New York City New York 4.95
abc123011102 Michael San Francisco California 4.92
abc123011103 Andrea Seattle Washington 4.93
abc123011104 Robert Atlanta City Georgia 4.95
User Id First Name City State Rider score
cas123032203 Sheetal Atlanta City Georgia 4.97
dsc123022320 Nikki Atlanta City Georgia 4.95
ssd012320212 Dhiraj Atlanta City Georgia 4.94
abc123011104 Robert Atlanta City Georgia 4.95
CPU / Runtime
improvements
Custom Spark accumulators
● Problem
○ Given a set of ride records; remove duplicate ride records and also count duplicates per state
1. RDD<String> rideRecords =
javaSparkContext.readParquet(some_path);
2. Map<String, Long> ridesPerStateBeforeDup
= rideRecords.map(r ->
getState(r)).countByKey();
3. RDD<String> dedupRideRecords =
dedup(rideRecords);
4. Map<String, Long> ridesPerStateAfterDup =
dedupRideRecords.map(r ->
getState(r)).countByKey();
5. dedupRideRecords.write(some_hdfs_path);
6. Duplicates = Diff(ridesPerStateAfterDup,
ridesPerStateBeforeDup)
7. # spark stages = 5 ( 3 for countByKey)
1. Class RidesPerStateAccumulator extends
AccumulatorV2<Map<String, Long>>
2. RidesPerStateAccumulator
riderPerStateBeforeDup, riderPerStateAfterDup;
3. dedup(javaSparkContext.readParquet(some_pat
h).map(r -> {riderPerStateBeforeDup.add(r);
return r;})).map(r ->
{riderPerStateAfterDup.add(r); return
r;}).write(some_hdfs_path);
4. Duplicates = Diff(ridesPerStateAfterDup,
ridesPerStateBeforeDup)
5. # spark Stages = 2 (no counting overhead!!)
Kafka topic
256 Partitions
3Billion messsages per
run
Increasing kafka read parallelism
Kafka
Broker
1 Kafka
consumer
/ partition
(256
tasks)
1 Kafka
consumer per
Kafka partition
Each consumer
reading 12Million
messages
Spark Stage 1
Sparkshuffleservice
Shuffle
Write
(1.3TB)
Spark Stage 2
8192
tasks
Shuffle
Read
(1.3TB)
Kafka
Broker
>1 Kafka consumer
per Kafka partition
Each consumer reading
384K messages
1 Kafka partition split
into 32 virtual
partitions
Spark Stage 2
8192
tasks
Increasing kafka read parallelism contd..
2
1
● Why Kryo?
○ Lesser memory footprint than Java serializer.
○ Faster and supports custom serializer
● Bug fix to truly enable kryo serialization
○ Spark kryo config prefix change.
● What all is needed to take advantage of that
○ Set “spark.serializer” to “org.apache.spark.serializer.KryoSerializer”
○ Registering avro schemas to spark conf (sparkConf.registerAvroSchemas())
■ Useful if you are using Avro GenericRecord (Schema + Data)
○ Register all classes which are needed while doing spark transformation
■ Use “spark.kryo.classesToRegister”
■ Use “spark.kryo.registrationRequired” to find missing classes
Kryo serialization
Kryo serialization contd..
Data (128)
Data (128)
Data (128)
Data (128)
Data (128)
Avro Schema (4K)
Avro Schema (4K)
Avro Schema (4K)
Avro Schema (4K)
Avro Schema (4K)
Data (128)
Data (128)
Data (128)
Data (128)
Data (128)
Schema identifier(4)
Schema identifier(4)
Schema identifier(4)
Schema identifier(4)
Schema identifier(4)
1 Record = 4228 Bytes 1 Record = 132 Bytes (97% savings)
Reduce ser/deser time by restructuring payload
1. @AllArgsConstructor
2. @Getter
3. private class SparkPayload {
4. private final String sortingKey;
5. // Map with 1000+ entries.
6. private final Map<String, GenericRecord>
data;
7. }
8.
1. @Getter
2. private class SparkPayload {
3. private final String sortingKey;
4. // Map with 1000+ entries.
5. private final byte[] serializedData;
6.
7. public SparkPayload(final String sortingKey,
Map<String, GenericRecord> data) {
8. this.sortingKey = sortingKey;
9. this.serializedData =
KryoSerializer.serialize(data);
10. }
11.
12. public Map<String, GenericRecord> getData() {
13. return
KryoSerializer.deserialize(this.serializedData,
Map.class);
14. }
15. }
Parallelizing spark’s iterator
jsc.mapPartitions (iterator -> while (i.hasnext) { parquetWriter.write(i.next); })
MapPartitions Stage (~45min) MapPartitions Stage (~25min)
Reading from disk
Fetching record from
Spark’s Iterator
Writing to Parquet
Writer in memory
Parquet columnar
compression
FileSystem write
Reading from disk
Fetching record from
Spark’s Iterator
Writing to Parquet
Writer in memory
Parquet columnar
compression
FileSystem write
Spark’s Thread
New writer
Thread
Memory
buffer
Spark’s Thread
Efficiency
improvements
Improve utilization by sharing same spark resources
JavaSparkContext.readFromKafka(“topic
1”).writeToHdfs();
JavaSparkContext.readFromKafka(“topic
1”).writeToHdfs();
JavaSparkContext.readFromKafka(“topic
3”).writeToHdfs();
JavaSparkContext.readFromKafka(“topic
2”).writeToHdfs();
JavaSparkContext.readFromKafka(“topic
N”).writeToHdfs();Threadpoolwith#threads=parallelism
needed
Improve utilization by sharing same spark resources
Time
ExecutorId
Stage Boundaries
Lot of wastage!!!
Improve utilization by sharing same spark resources
Time
ExecutorId
Stage Boundaries
Memory
improvements
Off heap memory improvements (work in progress)
● Symptom
○ Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used. Consider
boosting spark.yarn.executor.memoryOverhead
● Solution as per stack overflow :)
○ Increase “spark.yarn.executor.memoryOverhead”
○ Result - Huge memory wastage.
● Spark memory distribution
○ Heap & off-heap memory (direct & memory mapped)
● Possible solutions
○ Avoid memory mapping or perform memory mapping chunk by chunk instead of entire file
(4-16MB vs 1GB)
● Current vs Target
○ Current - 7GB [Heap(4gb) + Off-heap(3gb)]
○ Target - 5GB [Heap(4gb) + Off-heap(1gb)] - ~28% memory reduction (per container)
Thank You!!
We are hiring!!
Please reach out to us if you would like to
work on the amazing problems.
Omkar Joshi (ojoshi@netflix.com)

Más contenido relacionado

La actualidad más candente

Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
 
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...Flink Forward
 
ELK: Moose-ively scaling your log system
ELK: Moose-ively scaling your log systemELK: Moose-ively scaling your log system
ELK: Moose-ively scaling your log systemAvleen Vig
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingDatabricks
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Cloudera, Inc.
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLDatabricks
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaDatabricks
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationDatabricks
 
Linux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performanceLinux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performancePostgreSQL-Consulting
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...Lucidworks
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...Databricks
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDatabricks
 
Spark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookSpark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookDatabricks
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQLDatabricks
 

La actualidad más candente (20)

Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
 
ELK: Moose-ively scaling your log system
ELK: Moose-ively scaling your log systemELK: Moose-ively scaling your log system
ELK: Moose-ively scaling your log system
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
 
Linux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performanceLinux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performance
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
 
Spark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookSpark SQL Join Improvement at Facebook
Spark SQL Join Improvement at Facebook
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 

Similar a How to Performance-Tune Apache Spark Applications in Large Clusters

SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...Chester Chen
 
Speed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorSpeed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorDatabricks
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPDatabricks
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterDon Drake
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Databricks
 
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...Databricks
 
Advanced .NET Data Access with Dapper
Advanced .NET Data Access with Dapper Advanced .NET Data Access with Dapper
Advanced .NET Data Access with Dapper David Paquette
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-MallaKerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-MallaSpark Summit
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarSpark Summit
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Аліна Шепшелей
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...Inhacking
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkRde:code 2017
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteChris Baynes
 
DAGScheduler - The Internals of Apache Spark.pdf
DAGScheduler - The Internals of Apache Spark.pdfDAGScheduler - The Internals of Apache Spark.pdf
DAGScheduler - The Internals of Apache Spark.pdfJoeKibangu
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming JobsDatabricks
 

Similar a How to Performance-Tune Apache Spark Applications in Large Clusters (20)

SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
 
Speed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorSpeed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS Accelerator
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
 
Apache Spark Workshop
Apache Spark WorkshopApache Spark Workshop
Apache Spark Workshop
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
 
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
 
Advanced .NET Data Access with Dapper
Advanced .NET Data Access with Dapper Advanced .NET Data Access with Dapper
Advanced .NET Data Access with Dapper
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-MallaKerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
 
DAGScheduler - The Internals of Apache Spark.pdf
DAGScheduler - The Internals of Apache Spark.pdfDAGScheduler - The Internals of Apache Spark.pdf
DAGScheduler - The Internals of Apache Spark.pdf
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 

Más de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Más de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...amitlee9823
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 

Último (20)

Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 

How to Performance-Tune Apache Spark Applications in Large Clusters

  • 1. How to performance-tune spark applications in large clusters - Omkar Joshi
  • 2. Omkar Joshi ojoshi@netflix.com ● Software engineer @ Netflix ● Architect & author of Marmaray (Generic ingestion framework) @ Uber. ● Architected Object store & NFS solutions at Hedvig ● Hadoop Yarn committer ● Enjoy gardening & hiking in free time
  • 3. 01 JVM Profiler 02 Hadoop Profiler 03 Spark Listener 04 Auto tune 05 Storage improvements 06 Cpu / Runtime improvements 07 Efficiency improvements 08 Memory improvements Agenda
  • 4. JVM Profiler: Distributed Profiling at a Large Scale ● Help tracking memory/cpu/stacktrace for large amount of Spark executors ● Open sourced (https://github.com/uber-common/jvm-profiler) ● Presented in previous Spark Summit ● Some new update
  • 5. Hadoop Profiler (Recap) ● Java Agent attached to each executor ● Collects metrics via JMX and /proc ● Instruments arbitrary Java user code ● Emits to Kafka, InfluxDB, and Redis and other data sinks Method Argument Profiler Method duration Profiler CPU/Memory Profiler Reporter Java Agent Kafka InfluxDB https://github.com/uber-common/jvm-profiler
  • 6. Spark Listener ● Plugable Listener ○ spark.extraListeners=com.foo.MySparkListener ● Modify Spark Code and Send Execution Plan to Spark Listener ● Generate Data Lineage Information ● Offline Analysis for Spark Task Execution
  • 7. Auto Tune ● Problem: data scientist using team level Spark conf template ● Known Daily Applications ○ Use historical run to set Spark configurations (memory, vcore, etc.) * ● Ad-hoc Repeated (Daily) Applications ○ Use Machine Learning to predict resource usage ○ Challenge: Feature Engineering ■ Execution Plan Project [user_id, product_id, price] Filter (date = 2019-01-31 and product_id = xyz) UnresolvedRelation datalake.user_purchase
  • 8. Open Sourced in September 2018 https://github.com/uber/marmaray Blog Post: https://eng.uber.com/marmaray-hado op-ingestion-open-source/
  • 9. High-Level Architecture Chain of converters Schema Service Input Storage System Source Connector M3 Monitoring System Work Unit Calculator Metadata Manager (Checkpoint store) Converter1 Converter 2 Sink Connector Output Storage System Error Tables Datafeed Config Store Spark execution framework
  • 12. Effective data layout (parquet) ● Parquet uses columnar compression ● Columnar compression savings outperform gz or snappy compression savings ● Align records such that adjacent rows have identical column values ○ Eg. For state column California. ● Sort records with increasing value of cardinality. ● Generalization sometimes is not possible; If it is a framework provide custom sorting.
  • 13. User Id First Name City State Rider score abc123011101 John New York City New York 4.95 abc123011102 Michael San Francisco California 4.92 abc123011103 Andrea Seattle Washington 4.93 abc123011104 Robert Atlanta City Georgia 4.95 User Id First Name City State Rider score cas123032203 Sheetal Atlanta City Georgia 4.97 dsc123022320 Nikki Atlanta City Georgia 4.95 ssd012320212 Dhiraj Atlanta City Georgia 4.94 abc123011104 Robert Atlanta City Georgia 4.95
  • 15. Custom Spark accumulators ● Problem ○ Given a set of ride records; remove duplicate ride records and also count duplicates per state 1. RDD<String> rideRecords = javaSparkContext.readParquet(some_path); 2. Map<String, Long> ridesPerStateBeforeDup = rideRecords.map(r -> getState(r)).countByKey(); 3. RDD<String> dedupRideRecords = dedup(rideRecords); 4. Map<String, Long> ridesPerStateAfterDup = dedupRideRecords.map(r -> getState(r)).countByKey(); 5. dedupRideRecords.write(some_hdfs_path); 6. Duplicates = Diff(ridesPerStateAfterDup, ridesPerStateBeforeDup) 7. # spark stages = 5 ( 3 for countByKey) 1. Class RidesPerStateAccumulator extends AccumulatorV2<Map<String, Long>> 2. RidesPerStateAccumulator riderPerStateBeforeDup, riderPerStateAfterDup; 3. dedup(javaSparkContext.readParquet(some_pat h).map(r -> {riderPerStateBeforeDup.add(r); return r;})).map(r -> {riderPerStateAfterDup.add(r); return r;}).write(some_hdfs_path); 4. Duplicates = Diff(ridesPerStateAfterDup, ridesPerStateBeforeDup) 5. # spark Stages = 2 (no counting overhead!!)
  • 16. Kafka topic 256 Partitions 3Billion messsages per run Increasing kafka read parallelism Kafka Broker 1 Kafka consumer / partition (256 tasks) 1 Kafka consumer per Kafka partition Each consumer reading 12Million messages Spark Stage 1 Sparkshuffleservice Shuffle Write (1.3TB) Spark Stage 2 8192 tasks Shuffle Read (1.3TB) Kafka Broker >1 Kafka consumer per Kafka partition Each consumer reading 384K messages 1 Kafka partition split into 32 virtual partitions Spark Stage 2 8192 tasks
  • 17. Increasing kafka read parallelism contd.. 2 1
  • 18. ● Why Kryo? ○ Lesser memory footprint than Java serializer. ○ Faster and supports custom serializer ● Bug fix to truly enable kryo serialization ○ Spark kryo config prefix change. ● What all is needed to take advantage of that ○ Set “spark.serializer” to “org.apache.spark.serializer.KryoSerializer” ○ Registering avro schemas to spark conf (sparkConf.registerAvroSchemas()) ■ Useful if you are using Avro GenericRecord (Schema + Data) ○ Register all classes which are needed while doing spark transformation ■ Use “spark.kryo.classesToRegister” ■ Use “spark.kryo.registrationRequired” to find missing classes Kryo serialization
  • 19. Kryo serialization contd.. Data (128) Data (128) Data (128) Data (128) Data (128) Avro Schema (4K) Avro Schema (4K) Avro Schema (4K) Avro Schema (4K) Avro Schema (4K) Data (128) Data (128) Data (128) Data (128) Data (128) Schema identifier(4) Schema identifier(4) Schema identifier(4) Schema identifier(4) Schema identifier(4) 1 Record = 4228 Bytes 1 Record = 132 Bytes (97% savings)
  • 20. Reduce ser/deser time by restructuring payload 1. @AllArgsConstructor 2. @Getter 3. private class SparkPayload { 4. private final String sortingKey; 5. // Map with 1000+ entries. 6. private final Map<String, GenericRecord> data; 7. } 8. 1. @Getter 2. private class SparkPayload { 3. private final String sortingKey; 4. // Map with 1000+ entries. 5. private final byte[] serializedData; 6. 7. public SparkPayload(final String sortingKey, Map<String, GenericRecord> data) { 8. this.sortingKey = sortingKey; 9. this.serializedData = KryoSerializer.serialize(data); 10. } 11. 12. public Map<String, GenericRecord> getData() { 13. return KryoSerializer.deserialize(this.serializedData, Map.class); 14. } 15. }
  • 21. Parallelizing spark’s iterator jsc.mapPartitions (iterator -> while (i.hasnext) { parquetWriter.write(i.next); }) MapPartitions Stage (~45min) MapPartitions Stage (~25min) Reading from disk Fetching record from Spark’s Iterator Writing to Parquet Writer in memory Parquet columnar compression FileSystem write Reading from disk Fetching record from Spark’s Iterator Writing to Parquet Writer in memory Parquet columnar compression FileSystem write Spark’s Thread New writer Thread Memory buffer Spark’s Thread
  • 23. Improve utilization by sharing same spark resources JavaSparkContext.readFromKafka(“topic 1”).writeToHdfs(); JavaSparkContext.readFromKafka(“topic 1”).writeToHdfs(); JavaSparkContext.readFromKafka(“topic 3”).writeToHdfs(); JavaSparkContext.readFromKafka(“topic 2”).writeToHdfs(); JavaSparkContext.readFromKafka(“topic N”).writeToHdfs();Threadpoolwith#threads=parallelism needed
  • 24. Improve utilization by sharing same spark resources Time ExecutorId Stage Boundaries Lot of wastage!!!
  • 25. Improve utilization by sharing same spark resources Time ExecutorId Stage Boundaries
  • 27. Off heap memory improvements (work in progress) ● Symptom ○ Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead ● Solution as per stack overflow :) ○ Increase “spark.yarn.executor.memoryOverhead” ○ Result - Huge memory wastage. ● Spark memory distribution ○ Heap & off-heap memory (direct & memory mapped) ● Possible solutions ○ Avoid memory mapping or perform memory mapping chunk by chunk instead of entire file (4-16MB vs 1GB) ● Current vs Target ○ Current - 7GB [Heap(4gb) + Off-heap(3gb)] ○ Target - 5GB [Heap(4gb) + Off-heap(1gb)] - ~28% memory reduction (per container)
  • 28. Thank You!! We are hiring!! Please reach out to us if you would like to work on the amazing problems. Omkar Joshi (ojoshi@netflix.com)