Spark
Understanding & Performance Issues
Key-points of Spark
• A better implementation of MapReduce paradigm
• Handles batch, iterative and real-time applications
within a single framework.
• Most computations map into many maps and reduces with dependencies among them
• Spark’s RDD programming model captures these dependencies as a DAG.
Spark Goals
• Generality: diverse workloads, operators, job sizes
• Low latency: sub second
• Fault Tolerance: faults shouldn’t be a special case
• Simplicity: offer a high-level API without boilerplate code
Programming Point of
View
• High Level API (accessible to data scientists)
• Native integration with Java, Python and Scala
• Due to its flexible programming model, applications can customize how shuffling or aggregation is done.
• Optionally, applications can choose to keep datasets in memory (see the sketch below)
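A minimal sketch of the last point (Scala, assuming an existing SparkContext `sc`; the input path is hypothetical): transformations only record lineage, and cache() marks the dataset to be kept in memory once an action materialises it.

  // Build an RDD lazily; nothing is read yet.
  val errors = sc.textFile("hdfs:///data/events.log")   // hypothetical path
    .filter(_.contains("ERROR"))

  // Opt in to keeping the dataset in memory (shorthand for persist(MEMORY_ONLY)).
  errors.cache()

  // The first action computes and caches the RDD; later actions reuse the cached copy.
  val total  = errors.count()
  val sample = errors.take(10)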
Engineering Point of
View
• Uses RPCs for task dispatching and scheduling
• Uses a thread pool for task execution (rather than the pool of JVM processes that Hadoop uses)
• The above two enable Spark to schedule tasks in milliseconds, whereas MapReduce scheduling takes
seconds or minutes on busy clusters.
• Supports checkpoint-based recovery (like Hadoop) + lineage-based recovery (much faster)
• Spark caches the data to be processed
• Each application gets its own executor processes, which stay up for the duration of the whole
application and run tasks in multiple threads.
• Benefit: Isolating applications from each other, on both the scheduling side (each driver
schedules its own tasks) and executor side (tasks from different applications run in different
JVMs).
• Disadvantage: it also means that data cannot be shared across different Spark applications
(instances of SparkContext) without writing it to an external storage system.
Spark Jargons (1/2)
• Driver: The program/process responsible for
running the Job over Spark Engine
• Executor: The process responsible for executing a
task
• Master: The machine where the Driver program runs
• Slave/Worker: The machine where the Executor processes run (see the sketch below).
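A hedged sketch of how these roles map onto code (Scala): the driver is simply the process running main() and owning the SparkContext; the master URL and application name below are hypothetical.

  import org.apache.spark.{SparkConf, SparkContext}

  object JargonDemo {                                  // hypothetical driver program
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf()
        .setAppName("jargon-demo")                     // hypothetical app name
        .setMaster("spark://master-host:7077")         // hypothetical standalone master
      val sc = new SparkContext(conf)                  // lives in the driver process

      // The closures below are shipped as tasks to executor processes on the workers.
      val doubledSum = sc.parallelize(1 to 1000).map(_ * 2).reduce(_ + _)
      println(doubledSum)

      sc.stop()
    }
  }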
Spark’s Master/Slave
Architecture
Spark Jargons (2/2)
• Job: A parallel computation consisting of multiple tasks that
gets spawned in response to a Spark action (e.g. save,
collect)
• Stages: Each job gets divided into smaller sets of tasks
called stages that depend on each other (similar to the map
and reduce stages in MapReduce)
• Tasks: Each stage consists of tasks, one per partition. One task is executed on one partition of data on one executor (see the sketch below).
• DAG: stands for Directed Acyclic Graph; in the present context it is a DAG of operators.
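To make the jargon concrete, a small sketch (Scala, assuming a SparkContext `sc`; the path and partition count are hypothetical): the single action triggers one job, which is split into two stages at the shuffle, each stage running one task per partition.

  // Read with 8 partitions (hypothetical) and build a word-count pipeline.
  val words = sc.textFile("hdfs:///data/corpus.txt", 8)   // hypothetical path
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1))

  // Still only transformations: no job has been launched yet.
  val counts = words.reduceByKey(_ + _)

  // collect() is an action: it spawns ONE job with TWO stages
  // (before and after the shuffle), each with one task per partition.
  val result = counts.collect()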
RDDs
• Resilient Distributed Datasets are the primary abstraction in Spark: a fault-tolerant collection of elements that can be operated on in parallel.
• There are currently two types:
1. Parallelized collections: take an existing Scala collection and run functions on it in parallel.
2. Hadoop Datasets: run functions on each record of a file in HDFS or any other storage supported by Hadoop.
• They support two types of operations, Transformations and Actions:
1. Transformations: lazy operations on an RDD that create one or many new RDDs, e.g. map, filter, reduceByKey, join, randomSplit.
2. Actions: computed immediately. They consist of running all the previous transformations in order to get back an actual result. In other words, an RDD operation that returns a value of any type other than RDD[T] is an action. (Actions are synchronous.)
• An RDD can be persisted to disk or cached in memory (see the sketch below).
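A short sketch of the above (Scala, assuming a SparkContext `sc`; the HDFS path is hypothetical):

  import org.apache.spark.storage.StorageLevel

  // 1. Parallelized collection: distribute an existing Scala collection.
  val nums  = sc.parallelize(1 to 1000000)

  // 2. Hadoop dataset: one record per line of a file in HDFS (or other Hadoop storage).
  val lines = sc.textFile("hdfs:///data/input.txt")        // hypothetical path

  // Transformations are lazy: this only records lineage, nothing runs yet.
  val evens = nums.filter(_ % 2 == 0).map(_ * 10)

  // Persist in memory, spilling to disk if it does not fit.
  evens.persist(StorageLevel.MEMORY_AND_DISK)

  // Actions run the recorded transformations and return a non-RDD value.
  val howMany = evens.count()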
Transformations
• There are two kinds of transformations:
1. Narrow transformations: operate on the data of a single partition only, e.g. map, filter.
• Spark groups narrow transformations into one stage, which is called pipelining.
2. Wide/Shuffle transformations: e.g. groupByKey and reduceByKey. The data required to
compute the records in a single partition may exist in many partitions of the parent RDD.
• All of the tuples with the same key must end up in the same partition, processed by
the same task.
• To satisfy these operations, Spark must execute an RDD shuffle, which transfers data
across the cluster and results in a new stage with a new set of partitions (see the sketch below).
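A hedged sketch of the stage split (Scala, assuming a SparkContext `sc`; the path is hypothetical): the narrow map/filter steps are pipelined into one stage, while reduceByKey introduces a shuffle and therefore a new stage, which toDebugString makes visible in the lineage.

  val pairs = sc.textFile("hdfs:///data/pairs.csv")        // hypothetical path
    .map(_.split(","))                                     // narrow
    .filter(_.length == 2)                                 // narrow: pipelined with map
    .map(a => (a(0), a(1).toInt))                          // narrow: same stage

  // Wide transformation: records with the same key may live in many parent
  // partitions, so Spark shuffles and starts a new stage with new partitions.
  val perKey = pairs.reduceByKey(_ + _)

  // Prints the lineage, including the shuffle boundary (ShuffledRDD).
  println(perKey.toDebugString)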
Transformations |
Actions
• map( function )
• filter( function )
• flatMap( function )
• sample( withReplacement, fraction, [seed] )
• union( otherDataset )
• distinct( [numTasks] )
• groupByKey( [numTasks] )
• reduceByKey( function, [numTasks] )
• sortByKey( [ascending], [numTasks] )
• join( otherDataset, [numTasks] ) etc…
• reduce(function)
• collect()
• count()
• first()
• take(n)
• takeSample(..)
• saveAsTextFile(path)
• saveAsSequenceFile(path)
• countByKey()
• foreach( function ) etc…
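A small, hedged example chaining several of the listed operators (Scala, assuming a SparkContext `sc`; the log path and field layout are hypothetical):

  val logs = sc.textFile("hdfs:///data/access.log")        // hypothetical path

  val firstTen = logs
    .map(line => (line.split(" ")(0), 1))   // assumes the user id is the first field
    .reduceByKey(_ + _)                     // wide transformation
    .sortByKey()                            // transformation: sort by key
    .take(10)                               // action: ships 10 (user, count) pairs to the driver

  val distinctUsers = logs.map(_.split(" ")(0)).distinct().count()   // distinct + count (action)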
RDD Shuffle
• Shuffling is the process of redistributing data across partitions (aka repartitioning), which may or may not cause data to move across JVM processes or even over the wire (between executors on separate machines).
• “This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.” (as stated on Spark’s website)
Spark’s System
Layers
Different Deployment Modes
• Spark Standalone
• Spark on YARN
• Spark on Mesos (the corresponding master URLs are sketched below)
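The same application code can target any of these modes by changing the master URL, as in the hedged sketch below (Scala; host names and ports are hypothetical, and on Spark 1.x the YARN master was specified as yarn-client/yarn-cluster rather than "yarn"):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf().setAppName("deploy-demo")     // hypothetical app name

  conf.setMaster("spark://master-host:7077")   // Spark Standalone
  // conf.setMaster("yarn")                    // Spark on YARN (resources managed by YARN)
  // conf.setMaster("mesos://mesos-host:5050") // Spark on Mesos

  val sc = new SparkContext(conf)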
Common Performance
Issues
• Adequate Parallelism / partitioning: smaller/more numerous partitions allow work to be distributed
among more workers, but larger/fewer partitions allow work to be done in larger chunks, which may
result in the work getting done more quickly as long as all workers are kept busy, due to reduced
overhead.
• Re-partitioning on the Fly: each execution stage may have a different optimal degree of
parallelism, and the data shuffling between stages becomes an opportunity to adjust the partitioning
accordingly (see the sketch after this list).
• Wrong ordering of the transformations: shuffling more data than necessary
• Data Layout: OO languages add a layer of abstraction, but this increases memory overhead.
Furthermore, these frameworks run on top of the JVM, and its garbage collector is known to be
sensitive to memory layout and access patterns.
• Task Placement: the co-allocation of heterogeneous tasks has the potential for creating unexpected
performance issues.
• Load Balancing: assuming applications execute stages sequentially, any imbalance
in a stage’s tasks leads to resource idleness.
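A minimal sketch of adjusting parallelism and repartitioning on the fly (Scala, assuming a SparkContext `sc`; the path and partition counts are hypothetical, chosen only for illustration):

  val raw = sc.textFile("hdfs:///data/big.csv")            // hypothetical path

  // Too few partitions under-uses the workers; too many adds scheduling overhead.
  println(raw.partitions.length)

  // Increase parallelism for a heavy stage (triggers a shuffle).
  val wide = raw.repartition(200)

  // Decrease parallelism without a full shuffle, e.g. before writing output.
  val narrow = wide.coalesce(50)

  // Many wide transformations also accept an explicit partition count.
  val counts = raw.map(line => (line.take(2), 1)).reduceByKey(_ + _, 100)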
Other issues (1/2): Too
many Shuffle files
• It has been observed that a bottleneck Spark currently faces is specific to the existing
implementation of how shuffle files are defined.
• Each map task creates one shuffle file for each reducer, so if we have 5,000 maps and 1,024
reducers we end up with over 5 million shuffle files in total.
• This can lead to:
1. Poor performance due to communication via sockets
2. Suffering from random I/O
Solutions
• Unsuccessful:
• Extra processing stage
• TritonSort (tries to bottleneck every resource at the same time)
• Optimizations from a static point of view, or when the structure of the data is known
• Successful:
1. Shuffle File Consolidation -
proposed by A. Davidson et al., “Optimizing Shuffle Performance in Spark” (a configuration sketch follows this list)
2. RDMA in Spark -
proposed by W. Rahman et al., “Accelerating Spark with RDMA for Big Data Processing:
Early Experiences”
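For reference, a hedged sketch of how shuffle file consolidation was switched on in early, hash-shuffle-based Spark releases; the property name below matches what was documented at the time but has since been removed in favour of the sort-based shuffle, so treat it as historical:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("shuffle-consolidation-demo")              // hypothetical app name
    // Reuse one shuffle file per (core, reducer) instead of one per (map task, reducer),
    // sharply reducing the total number of shuffle files.
    .set("spark.shuffle.consolidateFiles", "true")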
Other issues (2/2): Data
Shuffle blocks
• It is not feasible to gather all the shuffle data before it is consumed, because:
1. The data transfers would take a long time to complete
2. A large amount of memory and local storage would be needed to cache it.
• So a producer-consumer model of shuffling and reducing is adopted.
1. This creates a complex all-to-all communication pattern that puts a
significant burden on the networking infrastructure.
2. CPUs blocked waiting for a missing shuffle block vs. an explosion of memory
utilization due to the accumulation of many shuffle blocks (see the sketch below).
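One knob in this trade-off, sketched below with an illustrative value (Scala): Spark bounds how much shuffle data each reduce task fetches concurrently, so raising it favours fetch throughput while lowering it bounds consumer-side memory. The property exists in recent Spark versions; check your version's documentation for the exact default.

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("shuffle-fetch-demo")                      // hypothetical app name
    // Cap on in-flight shuffle blocks fetched per reduce task (commonly defaults to 48m).
    .set("spark.reducer.maxSizeInFlight", "96m")           // illustrative value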
Evaluation tools
• Spark monitoring web UI (offers a precise event timeline, DAG visualisation and other
monitoring tools; an event-log sketch follows this list)
• sar, to report IOPS for I/O usage (provided as part of the sysstat package)
• iostat
• htop
• free
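To keep the web UI's event timeline and DAG visualisation available after an application finishes, event logging can be enabled so the history server can replay it; a hedged sketch (Scala, the log directory is hypothetical):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("monitoring-demo")                         // hypothetical app name
    .set("spark.eventLog.enabled", "true")
    .set("spark.eventLog.dir", "hdfs:///spark-logs")       // hypothetical directory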
Evaluation
applications/Benchmarks
• SparkBench (benchmark suite) - by IBM, “SPARKBENCH: A Comprehensive Benchmarking Suite For In Memory Data Analytic Platform Spark”
• GroupBy Test (commonly used Spark benchmark) - used by “Accelerating Spark with RDMA for Big Data Processing: Early Experiences”
• Twidd (application) - used by “Diagnosing Performance Bottlenecks in Massive Data Parallel Programs”
• Elcat (application) - used by “Diagnosing Performance Bottlenecks in Massive Data Parallel Programs”
• PageRank (application) - used by “Diagnosing Performance Bottlenecks in Massive Data Parallel Programs”
• BDBench (benchmark)
• TPC-DS (benchmark)
Blocked Time Analysis
• Issues:
1. Per-task utilization cannot be measured in Spark because all tasks run in a single process.
2. Instrumentation should be light in terms of memory.
3. Instrumentation shouldn’t add to job time.
4. Logging needed to be added in HDFS.
Usage of Memory
• Execution: memory used for shuffles, sorts and aggregations
• Storage: memory used to cache data that will be reused later
• 1st approach: static assignment
• 2nd approach: unified memory (storage always spills to disk)
• 3rd approach: dynamic assignment across cores (each task is now assigned 1/N of the memory) -> helps with stragglers (a configuration sketch follows this list).
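In the unified model the execution/storage split is governed by configuration; a hedged sketch with the property names used by recent Spark versions (values are illustrative, not recommendations):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("memory-demo")                             // hypothetical app name
    // Fraction of the heap shared by execution and storage (unified memory model).
    .set("spark.memory.fraction", "0.6")
    // Portion of that region within which cached blocks are not evicted by execution.
    .set("spark.memory.storageFraction", "0.5")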
References
• https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
• “SPARKBENCH: A Comprehensive Benchmarking Suite For In Memory Data Analytic Platform Spark”
• “Diagnosing Performance Bottlenecks in Massive Data Parallel Programs”
• A. Davidson et al., “Optimizing Shuffle Performance in Spark”
• W. Rahman et al., “Accelerating Spark with RDMA for Big Data Processing: Early Experiences”
