Apache SPARK fundamentals
Agenda
• Overview of job execution in Hadoop map reduce
• Recap of the drawbacks in map reduce
• Overview of spark architecture
• Spark deployment
• Job execution in SPARK
• RDDs
• Actions and transformations
• Demo
Overview of job execution in HADOOP
A job submitted by the user is picked up by the name node and the resource
manager
The job gets submitted to the name node, and eventually the resource manager is
responsible for scheduling the execution of the job on the data nodes in the
cluster
The data nodes in the cluster contain the data blocks on which the user’s
program will be executed in parallel
[Diagram: job client → name node → resource manager/YARN → data nodes]
The map and reduce stage
Disc I/O problem in Hadoop Map-Reduce
• The above example demonstrates a map-reduce job involving 3 mappers on 3 input splits
• There is 1 reducer
• Each input split resides on the hard disc of a data node; a mapper reading it performs a disc
read operation
• There would be 3 disc read operations from all the 3 mappers put together
• Merging in the reduce stage involves 1 disc write operation
• The reducer would write the final output file to HDFS, which is another disc write operation
• In total there are a minimum of 5 disc I/O operations in the above example (3 from the map stage and
2 from the reduce stage)
• The number of disc read operations from the map stage is equal to the number of input splits
Calculating the number of disc I/O operations
on a large data set
• Typically, an HDFS input split is 128 MB
• Consider a file of size 100 TB; the number of file blocks on HDFS would be
(100 * 1024 * 1024 MB) / 128 MB = 819,200
• Around 8.2 lakh (820,000) mappers would need to run on the above data set
once a job is launched using Hadoop map reduce
• 8.2 lakh mappers means 8.2 lakh disc read operations
• A disc read operation is 10 times slower than a memory read
operation
• Map-Reduce does not inherently support iterations on the data set
• Several rounds of Map-Reduce jobs need to be chained to achieve the result of
an iterative job in Hadoop
• Most machine learning algorithms involve an iterative approach
• 10 rounds of iterations in a single job lead to 8.2 lakh × 10 disc I/O operations
SPARK’s approach to problem solving
• Spark allows the results of a computation to be saved in memory for future re-use
• Reading data from memory is much faster than reading it from the disc
• Caching a result in memory is under the programmer’s control (see the sketch after this list)
• It is not always possible to save such results completely in memory, especially when the
object is too large and memory is low
• In such cases the objects need to be moved to the disc
• Spark, therefore, is not a completely in-memory parallel processing platform
• Spark, however, is 3X to 10X faster than Hadoop in most jobs
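A minimal sketch in SCALA (assuming an existing SparkContext named sc and a made-up HDFS path) of the programmer-controlled caching described above:

    // Load once from the disc, then keep the filtered result in memory
    val logs = sc.textFile("hdfs:///data/logs.txt")        // hypothetical input path
    val errors = logs.filter(_.contains("ERROR")).cache()  // cache() = keep in memory
    errors.count()  // 1st action: reads from disc, computes, caches
    errors.count()  // 2nd action: served from memory, no disc read

cache() is shorthand for persisting with the memory-only storage level.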
Understanding SPARK architecture
Spark cluster on Ubuntu host machines
Simplifying SPARK architecture
• Driver is the starting point of a job submission (this can be compared to the driver code in Java MR)
• Cluster Manager can be compared to the Resource Manager in Hadoop
• Worker is a software service running on the slave nodes, similar to the data nodes in a HADOOP cluster
• The executor is a container which is responsible for running the tasks
Job execution in a SPARK cluster
Spark deployment modes
Image courtesy : https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
Spark deployment modes
Standalone mode :
All the spark services run on a single machine but in separate JVMs. Mainly
used for learning and development purposes (something like the pseudo-
distributed mode of a Hadoop deployment)
Cluster mode with YARN or MESOS:
This is the fully distributed mode of SPARK used in a production environment
Spark in Map Reduce (SIMR) :
Allows Hadoop MR1 users to run Spark jobs on an existing map reduce cluster
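A minimal sketch (the master URLs are standard Spark values; the application name is made up) showing how the deployment mode is picked through the master URL when the SparkContext is created:

    import org.apache.spark.{SparkConf, SparkContext}

    // "local[*]"          -> everything in a single JVM (learning/development)
    // "spark://host:7077" -> standalone cluster manager
    // "yarn" / "mesos://host:5050" -> fully distributed cluster mode
    val conf = new SparkConf()
      .setAppName("fundamentals-demo")  // hypothetical name
      .setMaster("local[*]")            // change this URL to change the mode
    val sc = new SparkContext(conf)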
Loading spark data objects (RDD)
• The data loaded into a SPARK object is called an RDD
• A detailed discussion about RDDs will be covered shortly
[Diagram: data source → data loaded into the spark object and subjected to manipulations → capture the final result]
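A minimal sketch of the flow in the diagram (assuming sc and a made-up HDFS path): load from the data source, manipulate, capture the final result:

    val lines = sc.textFile("hdfs:///data/input.txt")  // data source -> RDD
    val lengths = lines.map(_.length)                  // manipulation
    val total = lengths.reduce(_ + _)                  // capture the final result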
Under the hood
• Job execution starts with loading the data from a data source (e.g. HDFS) into the spark environment
• The data is read from the hard drives of the worker nodes and loaded into the RAM of multiple machines
• The data could be spread out across different files (each file could be a block in HDFS)
• After the computation, the final result is captured
Partitions and data locality
• Loading of the data from the hard drives to the RAM of the worker nodes is based on data locality
• The data in the data blocks is illustrated in the block diagram below
[Diagram: Block 1 holds (1,2), Block 2 holds (3,4), Block 3 holds (5)]
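A minimal sketch (assuming sc) of how partitions line up with blocks: an RDD read from HDFS gets one partition per block, and the partition count of a small in-memory collection can be set by hand:

    val fromHdfs = sc.textFile("hdfs:///data/nums.txt")  // hypothetical file with 3 blocks
    println(fromHdfs.getNumPartitions)                   // -> 3, one partition per block
    val small = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 3)  // (1,2) (3,4) (5)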
Transforming the data object
[Diagram: Object 1 with partitions (1,2), (3,4), (5) → user-defined transformation → Object 2 with partitions (2,3), (4,5), (6)]
• The data in the objects cannot be modified in place, as SPARK objects are by their very nature
immutable, and the data in these objects is partitioned & distributed across nodes (see the sketch below)
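A minimal sketch of the transformation in the diagram: incrementing every element leaves Object 1 untouched and yields a new, equally partitioned Object 2:

    val object1 = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 3)
    val object2 = object1.map(_ + 1)  // user-defined transformation: increment
    // object1 still holds (1,2,3,4,5); object2 holds (2,3,4,5,6)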
3 important properties of an RDD
• We have just understood 3 important properties of an RDD in spark
1) They are immutable
2) They are partitioned
3) They are distributed and spread across multiple nodes in a cluster
Task execution
[Diagram: the map(increment) transformation runs as one task per partition inside the executors, turning (1,2), (3,4), (5) into (2,3), (4,5), (6)]
Workflow
[Diagram: a job is launched from the driver through the SparkContext (SC), the cluster manager (MESOS/YARN) schedules it on the workers, and the result is brought back to the driver]
RDD lazy evaluation
(DAG creation)
[Diagram: Base RDD (1,2,3,4,5) → map(increment) → RDD 1 (2,3,4,5,6) → filter(even) → RDD 2 (2,4,6)]
• Let’s start calling these objects RDDs hereafter
• RDDs are immutable & partitioned
• RDDs mostly reside in the RAM (memory), unless the RAM (memory) is running out of
space
Execution starts only when an ACTION is invoked
[Diagram: Base RDD (1,2,3,4,5) → map(increment) → Object 1 (2,3,4,5,6) → filter(even nos) → Object 2 (2,4,6) → collect the output → display the output or save it in a persistent file]
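A minimal sketch of the pipeline from the two slides above; the transformations only record the DAG, and nothing runs until the collect action:

    val base = sc.parallelize(Seq(1, 2, 3, 4, 5))  // Base RDD
    val incremented = base.map(_ + 1)              // transformation: lazy
    val evens = incremented.filter(_ % 2 == 0)     // transformation: lazy
    val result = evens.collect()                   // ACTION: triggers execution
    // result: Array(2, 4, 6) -- display it or save it in a persistent file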
RDDs are fault tolerant (resilient)
[Diagram: Base RDD → RDD1 → LOST RDD → RDD3 → RDD4 → final output]
• RDDs lost or corrupted during the course of execution can be reconstructed from the lineage:
0. Create the Base RDD
1. Increment the data elements
2. Filter the even numbers
3. Pick only those divisible by 6
4. Select only those greater than 78
RDDs are fault tolerant (resilient)
[Diagram: Base RDD → RDD1 → RDD2 (reconstructed) → RDD3 → RDD4 → final output, following the same steps 0–4 as above]
• Lineage is a history of how an RDD was created from its parent RDD through a transformation
• The steps in the transformation are re-executed to recreate a lost RDD, as sketched below
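A minimal sketch of the lineage on these slides; toDebugString prints the recorded lineage that Spark replays to rebuild a lost RDD:

    val base = sc.parallelize(1 to 100)   // 0. create the Base RDD
    val rdd1 = base.map(_ + 1)            // 1. increment the data elements
    val rdd2 = rdd1.filter(_ % 2 == 0)    // 2. filter the even numbers
    val rdd3 = rdd2.filter(_ % 6 == 0)    // 3. pick only those divisible by 6
    val rdd4 = rdd3.filter(_ > 78)        // 4. select only those greater than 78
    println(rdd4.toDebugString)           // the lineage used for reconstruction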
Properties of RDD
• They are RESILIENT DISTRIBUTED DATA sets
• Resilient (fault tolerant), thanks to the lineage feature in SPARK
• They are distributed and spread across many data nodes
• They are in-memory objects
• They are immutable
Caching the RDDs
SPARK’s approach to problem solving
• Spark reads the data from the disc once, initially, and loads it into its
memory
• The in-memory data objects are called RDDs in spark
• Spark can read the data from HDFS, where large files are split into smaller
blocks and distributed across several data nodes
• Data nodes are called worker nodes in the Spark eco system
• Spark’s way of problem solving also involves map and reduce operations
• The results of a computation can be saved in memory if they are going
to be re-used as part of an iterative job
• Saving a SPARK object (RDD) in memory for future re-use is called caching (see the sketch below)
Note : RDDs are not always cached in the RAM (memory) by default. They will have to
be written on to the disc when the system faces a low-memory condition due to too
many RDDs already in the RAM. Hence SPARK is not a completely in-memory
computing framework
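A minimal sketch of the note above (assuming sc and a made-up path): with the MEMORY_AND_DISK storage level, partitions that do not fit in the RAM are spilled to the disc instead of failing:

    import org.apache.spark.storage.StorageLevel

    val big = sc.textFile("hdfs:///data/huge.txt")    // hypothetical large input
      .map(_.toUpperCase)
      .persist(StorageLevel.MEMORY_AND_DISK)          // spill to disc under memory pressure
    big.count()  // first action materialises and caches the RDD
    big.count()  // re-use: served from memory (or disc, for spilled partitions)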
3 different ways of creating an RDD in SPARK
• Created by reading a big data file directly from an external file system;
this is used while working on large data sets
• Using the parallelize API; this is usually used on small data sets
• Using the makeRDD API (all three routes are sketched below)
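A minimal sketch of the three routes (the HDFS path is made up):

    // 1. Read a big data file directly from an external file system
    val fromFile = sc.textFile("hdfs:///data/big.txt")
    // 2. parallelize a driver-side collection (small data sets)
    val fromSeq = sc.parallelize(Seq(1, 2, 3, 4, 5))
    // 3. makeRDD, a close variant of parallelize
    val made = sc.makeRDD(Seq("a", "b", "c"))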
Actions and transformations
• Transformations are operations that derive a new RDD from an existing one during the
course of analysis
• A SPARK job is a sequence of several TRANSFORMATIONS
• Such a job is usually a program written in SCALA or Python
• Actions are those operations which trigger the execution of a sequence of transformations
(a word-count sketch follows this list)
• There are over 2 dozen transformations and 1 dozen actions
• A glimpse of the actions and transformations in SPARK can be found in the official SPARK
programming guide
https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html
• Most of them will be discussed in detail during the demo
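A minimal word-count sketch (assuming sc and a made-up input path) pairing a few of these transformations with the actions that trigger them:

    val lines = sc.textFile("hdfs:///data/input.txt")  // hypothetical input
    val counts = lines
      .flatMap(_.split(" "))         // transformation
      .map(word => (word, 1))        // transformation
      .reduceByKey(_ + _)            // transformation
    counts.take(5).foreach(println)  // action: triggers the execution
    println(counts.count())          // action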
Summary
• A quick recap of the drawbacks of Map-Reduce in Hadoop
• Spark architecture
• Spark deployment modes
• Job execution in SPARK
• RDDs