MapReduce:
Simplified Data Processing on Large Clusters	
2015-08-30 さとうかずま
Jeffrey Dean and Sanjay Ghemawat (2004), MapReduce: Simplified Data Processing on Large Clusters
The reasons for the success of MapReduce	
The MapReduce programming model has been
successfully used at Google 	
Reasons
•  MapReduce is easy to use
•  Many problems are easily expressible
as MapReduce computations
•  An implementation of MapReduce can run on a large
cluster of commodity machines and is highly scalable
The reasons why MapReduce is easy to use
MapReduce hides the messy details of the following	
•  Parallelization
•  Fault-tolerance
•  Locality optimization
•  Load balancing
MapReduce computation
Computation is expressed as two functions: Map and Reduce
Input and output are sets of key/value pairs
Types
•  map: (k1, v1) → list(k2, v2)
•  reduce: (k2, list(v2)) → list(v2)
[Figure: A flow of an execution (counting the number of occurrences of each word in a document). The input file is split into the pairs (n1, s1) and (n2, s2), which are processed on different machines (PCs) over time. In the map phase each split yields (w1, 1) and (w2, 1); in the reduce phase (w1, (1, 1)) and (w2, (1, 1)) each yield 2.]
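The two type signatures can be read as a small sequential program. The Python sketch below is only an illustration of the data flow, not the distributed implementation described in the paper; the names `run_mapreduce`, `count_map`, and `count_reduce` are hypothetical, and splitting text on whitespace is an assumption.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Sequential sketch: map every (k1, v1), group by k2, then reduce each group."""
    groups = defaultdict(list)
    for k1, v1 in inputs:
        # map: (k1, v1) -> list of (k2, v2)
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # reduce: (k2, list(v2)) -> list(v2)
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}


def count_map(name, text):
    # Emit (word, 1) for every word in the split (whitespace split is an assumption).
    return [(word, 1) for word in text.split()]


def count_reduce(word, counts):
    # Collapse the list of ones into a single total.
    return [sum(counts)]


result = run_mapreduce(
    [("n1", "w1 w2"), ("n2", "w1 w2")], count_map, count_reduce
)
```

With the two splits from the figure, each word ends up with the single value 2, matching the reduce phase shown there.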
Map
Map, written by the user, takes an input pair and produces a set of
intermediate key/value pairs
Map example
(Counting the number of occurrences of each word in a document)
Map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");
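As a rough runnable counterpart to the pseudocode, the Map function could look like this in Python. The name `map_words` is hypothetical, and treating a "word" as a whitespace-separated token is an assumption; the yielded pairs stand in for calls to `EmitIntermediate`.

```python
def map_words(key: str, value: str):
    """key: document name; value: document contents."""
    for word in value.split():  # assumption: words are whitespace-separated
        # Counts are emitted as strings, as in the pseudocode.
        yield (word, "1")


pairs = list(map_words("doc1", "to be or not to be"))
```

Every occurrence produces its own pair, so repeated words appear several times in `pairs`; summing them is left to Reduce.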
  
Reduce
Reduce accepts an intermediate key and a set of values for
that key, then merges these values to form a possibly smaller set of values
Reduce example
(Counting the number of occurrences of each word in a document)
Reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
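A possible Python rendering of the Reduce function, together with the grouping step that has to happen between Map and Reduce, is sketched below. The name `reduce_counts` and the use of `sort` plus `itertools.groupby` for grouping are assumptions made for illustration; the real library groups intermediate pairs across machines.

```python
from itertools import groupby
from operator import itemgetter


def reduce_counts(key, values):
    """key: a word; values: a list of counts (as strings)."""
    result = 0
    for v in values:
        result += int(v)
    return str(result)


# Intermediate pairs as a Map function might have emitted them.
intermediate = [("to", "1"), ("be", "1"), ("or", "1"),
                ("not", "1"), ("to", "1"), ("be", "1")]

# Group values by key: groupby needs its input sorted by that key.
intermediate.sort(key=itemgetter(0))
totals = {
    word: reduce_counts(word, [v for _, v in group])
    for word, group in groupby(intermediate, key=itemgetter(0))
}
```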
  	
  
Parallelization
Map and reduce allow programmers to parallelize
computations easily	
•  They are inspired by map and reduce present in
functional languages
•  Referential transparency is one of the principles of
functional programming: a call depends only on its arguments,
so independent calls can run in any order
•  Referential transparency encourages language-based
parallelism of computation
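To illustrate why referential transparency helps, the sketch below applies a pure function to several splits, first sequentially and then concurrently, and obtains the same result. The function `word_count` and the sample splits are hypothetical; a thread pool is used only to keep the example self-contained, and in CPython it demonstrates independence of the calls rather than a CPU speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def word_count(text: str) -> int:
    # Pure function: the result depends only on the argument,
    # so calls on different splits cannot interfere with each other.
    return len(text.split())


splits = ["a b c", "d e", "f"]

sequential = [word_count(s) for s in splits]

with ThreadPoolExecutor(max_workers=3) as pool:
    # Executor.map returns results in input order, whatever the completion order.
    parallel = list(pool.map(word_count, splits))
```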
MapReduce operation
First, the MapReduce library starts copies of the user program
on a cluster of machines
[Figure: Copies of the user program on a cluster of machines. The user program forks the copies.]
The master and workers
One of the copies is the master; the rest are workers, to which
the master assigns map tasks and reduce tasks
[Figure: Copies of the user program on a cluster of machines. The master assigns a map task to one worker and a reduce task to another.]
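The master's role can be sketched as a small scheduling function. This is a simplification under stated assumptions: the name `assign_tasks` is hypothetical, tasks are handed out in round-robin order (the slides do not say how the master chooses a worker), and nothing is actually executed.

```python
from collections import deque


def assign_tasks(map_tasks, reduce_tasks, workers):
    """Hand out map tasks first, then reduce tasks, round-robin (an assumption)."""
    pending = deque(
        [("map", t) for t in map_tasks] + [("reduce", t) for t in reduce_tasks]
    )
    assignment = {w: [] for w in workers}
    i = 0
    while pending:
        worker = workers[i % len(workers)]
        assignment[worker].append(pending.popleft())
        i += 1
    return assignment


plan = assign_tasks(
    ["split-0", "split-1", "split-2"], ["part-0"], ["worker-1", "worker-2"]
)
```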
Fault-tolerance | worker
Any map task or reduce task in progress on a failed worker
becomes eligible for rescheduling	
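[Figure: Task scheduling on a failure. The master assigns an idle task to machine A, and the task becomes in-progress. Machine A fails (exception) and does not respond to the master's ping, so the task goes back to idle. The master assigns it to machine B, where it is in-progress again, and machine B answers the ping with a pong.]
The master stores the state (idle, in-progress, or completed) of each task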
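The rescheduling rule can be expressed as a small state table. The class `Master` and its methods `assign` and `handle_ping` are hypothetical names for this sketch, and the ping result is passed in as a boolean rather than sent over a network.

```python
IDLE, IN_PROGRESS, COMPLETED = "idle", "in-progress", "completed"


class Master:
    def __init__(self, tasks):
        self.state = {t: IDLE for t in tasks}  # task -> state
        self.owner = {}                        # task -> worker running it

    def assign(self, task, worker):
        self.state[task] = IN_PROGRESS
        self.owner[task] = worker

    def handle_ping(self, worker, responded):
        """Reset the in-progress tasks of a worker that did not answer."""
        if responded:
            return
        for task, owner in list(self.owner.items()):
            if owner == worker and self.state[task] == IN_PROGRESS:
                self.state[task] = IDLE  # eligible for rescheduling
                del self.owner[task]


master = Master(["map-0"])
master.assign("map-0", "machine A")
master.handle_ping("machine A", responded=False)  # A has failed
state_after_failure = master.state["map-0"]
master.assign("map-0", "machine B")
master.handle_ping("machine B", responded=True)   # B answers with a pong
```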
Fault-tolerance | master
If the master task dies, a new copy can be started from
the last checkpoint state	
[Figure: Restart of the master task from a checkpoint. When the master task fails with an exception, a new copy is started from the checkpoint.]
The master writes periodic checkpoints of its data structures
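A minimal sketch of checkpoint and restart follows. The slides do not specify a checkpoint format, so JSON, the file name `master.ckpt`, and the helper names `write_checkpoint` and `restore_checkpoint` are all assumptions.

```python
import json
import os
import tempfile


def write_checkpoint(path, task_state):
    """Write the master's task table, replacing the previous checkpoint atomically."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(task_state, f)
    os.replace(tmp, path)  # a crash mid-write leaves the old checkpoint intact


def restore_checkpoint(path):
    """Load the last checkpoint so a new master copy can resume from it."""
    with open(path) as f:
        return json.load(f)


checkpoint_path = os.path.join(tempfile.mkdtemp(), "master.ckpt")
write_checkpoint(checkpoint_path, {"map-0": "completed", "map-1": "in-progress"})
restored = restore_checkpoint(checkpoint_path)
```

Writing to a temporary file and renaming it means a reader never sees a half-written checkpoint, which is one common way to make periodic checkpoints safe.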
Locality optimization
Network bandwidth is conserved by taking
advantage of GFS (the Google File System)
•  Input data (managed by GFS) is stored on the local disks of
the machines that make up a cluster
•  GFS divides each file into 64 MB blocks, and stores several
copies of each block (typically three) on different machines
•  The master attempts to schedule a map task on a machine
that contains a replica of the corresponding input data
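The block arithmetic and the scheduling preference can be sketched as follows. The helper names `num_blocks` and `pick_worker` and the machine names are hypothetical, and the fallback to an arbitrary idle worker is a simplification of what a real scheduler would do.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, as stated in the slide


def num_blocks(file_size_bytes):
    """Number of 64 MB blocks needed to hold a file."""
    return math.ceil(file_size_bytes / BLOCK_SIZE)


def pick_worker(block_replicas, idle_workers):
    """Prefer an idle worker that holds a replica; otherwise take any idle worker."""
    for worker in idle_workers:
        if worker in block_replicas:
            return worker  # input is read from local disk, no network transfer
    return idle_workers[0] if idle_workers else None


blocks = num_blocks(200 * 1024 * 1024)                        # a 200 MB file
chosen = pick_worker({"m2", "m5", "m7"}, ["m1", "m5", "m9"])  # m5 holds a replica
fallback = pick_worker({"m2"}, ["m1", "m9"])                  # no local replica
```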
Load balancing
Having each worker perform many different tasks improves
dynamic load balancing
It also speeds up recovery when a worker fails: the many map tasks
that worker has completed can be spread out across all the other machines
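The effect of task granularity can be shown with a small simulation in which each worker pulls the next task as soon as it is free. The function `makespan`, the worker speeds, and the task costs are made-up values for illustration; both runs contain the same total amount of work.

```python
import heapq


def makespan(task_costs, worker_speeds):
    """Finish time when each worker pulls the next task as soon as it is free."""
    free_at = [(0.0, i) for i in range(len(worker_speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for cost in task_costs:
        t, i = heapq.heappop(free_at)  # earliest available worker
        t += cost / worker_speeds[i]
        finish = max(finish, t)
        heapq.heappush(free_at, (t, i))
    return finish


speeds = [1.0, 0.5]                    # assumption: the second worker is half as fast
coarse = makespan([6.0, 6.0], speeds)  # one large task per worker
fine = makespan([1.0] * 12, speeds)    # same total work, many small tasks
```

With one large task each, the job waits for the slow worker; with many small tasks, the fast worker simply picks up more of them and the job finishes sooner.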
