MapReduce Rahul Agarwal irahul.com
Attributions: Dean and Ghemawat, http://labs.google.com/papers/mapreduce.html
Agenda
MapReduce
Programming model
Examples
Execution
Refinements
Q&A
Hadoop principles and MapReduce
HDFS: fault-tolerant, high-bandwidth clustered storage.
Automatically and transparently routes around failure.
Master (NameNode) – slave architecture.
Speculatively executes redundant tasks if certain nodes are detected to be slow.
Moves compute to the data: lower latency, lower bandwidth.
HDFS: Hadoop Distributed File System. Default block size = 64 MB, replication factor = 3.
MapReduce
Patented by Google: a "programming model… for processing and generating large data sets" that allows such programs to be "automatically parallelized and executed on a large cluster".
Works with structured and unstructured data.
The map function processes a key/value pair to generate a set of intermediate key/value pairs.
The reduce function merges all intermediate values associated with the same intermediate key.
MapReduce signatures:
map (in_key, in_value) -> list(intermediate_key, intermediate_value)
reduce (intermediate_key, list(intermediate_value)) -> list(out_value)
Example: count word occurrences
map (String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");
reduce (String key, Iterator values):
  // key: a word
  // values: a list of counts
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
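The word-count pseudocode above can be sketched in runnable Python. The `run_mapreduce` driver here is a toy in-memory stand-in for the real framework's shuffle phase, not part of the original model; all names are illustrative.

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Emit an intermediate ("word", "1") pair for every word in the document.
    for word in contents.split():
        yield word, "1"

def reduce_fn(word, counts):
    # Sum the partial counts collected for one word.
    yield str(sum(int(c) for c in counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Toy driver: the shuffle step groups intermediate pairs by key,
    # then each key's value list is handed to the reduce function.
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    return {k: list(reduce_fn(k, vs)) for k, vs in sorted(groups.items())}

docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# → {'brown': ['1'], 'dog': ['1'], 'fox': ['1'], 'lazy': ['1'], 'quick': ['1'], 'the': ['2']}
```

Counts travel as strings (as in the slide's pseudocode) purely to mirror the original; a real job would use typed values.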
Example: distributed grep
map (String key, String value):
  // key: document name
  // value: document contents
  for each line in value:
    if line.match(pattern)
      EmitIntermediate(key, line);
reduce (String key, Iterator values):
  // key: document name
  // values: a list of lines
  for each v in values:
    Emit(v);
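A minimal Python sketch of the distributed-grep example, again with a toy in-memory driver standing in for the framework. The concrete pattern `"error"` is an assumption; the slide leaves the pattern abstract.

```python
import re
from collections import defaultdict

PATTERN = re.compile(r"error")  # illustrative pattern, not from the slide

def map_fn(doc_name, contents):
    # Emit (document name, line) for every line matching the pattern.
    for line in contents.splitlines():
        if PATTERN.search(line):
            yield doc_name, line

def reduce_fn(doc_name, lines):
    # Identity reduce: just pass the matching lines through.
    for line in lines:
        yield line

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Toy shuffle: group intermediate pairs by key, then reduce each group.
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    return {k: list(reduce_fn(k, vs)) for k, vs in groups.items()}

logs = [("a.log", "ok\nerror: disk full\nok"), ("b.log", "error: timeout")]
print(run_mapreduce(logs, map_fn, reduce_fn))
# → {'a.log': ['error: disk full'], 'b.log': ['error: timeout']}
```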

Example: URL access frequency
map (String key, String value):
  // key: log name
  // value: log contents
  for each line in value:
    EmitIntermediate(URL(line), "1");
reduce (String key, Iterator values):
  // key: a URL
  // values: a list of counts
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
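A Python sketch of the URL-access-frequency example. The `url_of` parser is a hypothetical stand-in for the slide's `URL(line)` helper; here it assumes the URL is the first whitespace-separated field of a log line.

```python
from collections import defaultdict

def url_of(line):
    # Hypothetical parser: take the first field of the log line as the URL.
    return line.split()[0]

def map_fn(log_name, contents):
    # Emit (URL, "1") for every request line in the log.
    for line in contents.splitlines():
        yield url_of(line), "1"

def reduce_fn(url, counts):
    # Total the access count for one URL.
    yield str(sum(int(c) for c in counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Toy in-memory shuffle-and-reduce driver.
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    return {k: list(reduce_fn(k, vs)) for k, vs in sorted(groups.items())}

logs = [("access.log", "/index GET\n/index GET\n/about GET")]
print(run_mapreduce(logs, map_fn, reduce_fn))
# → {'/about': ['1'], '/index': ['2']}
```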
Example: reverse web-link graph
map (String key, String value):
  // key: source document name
  // value: document contents
  for each link in value:
    EmitIntermediate(link, key);
reduce (String key, Iterator values):
  // key: a target link
  // values: a list of sources
  for each v in values:
    source_list.add(v);
  Emit(AsPair(key, source_list));
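The reverse-link example inverts the graph: map emits (target, source) for every outgoing link, and reduce collects all sources pointing at each target. A minimal Python sketch, assuming a toy `links_of` extractor (real pages would need HTML parsing):

```python
from collections import defaultdict

def links_of(contents):
    # Hypothetical link extractor; here links are just whitespace-separated names.
    return contents.split()

def map_fn(source, contents):
    # Emit (target, source) for every outgoing link in the source page.
    for target in links_of(contents):
        yield target, source

def reduce_fn(target, sources):
    # Collect every source that links to this target.
    yield sorted(sources)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Toy in-memory shuffle-and-reduce driver.
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    return {k: list(reduce_fn(k, vs)) for k, vs in groups.items()}

pages = [("a.html", "b.html c.html"), ("b.html", "c.html")]
print(run_mapreduce(pages, map_fn, reduce_fn))
# → {'b.html': [['a.html']], 'c.html': [['a.html', 'b.html']]}
```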
Execution optimization
Locality: network bandwidth is scarce, so compute on local copies of the data, which are distributed by HDFS.
Task granularity: the ratio of map (M) to reduce (R) workers. Ideally M and R are much larger than the number of machines in the cluster. Typically M is chosen so there is one task per 64 MB block, and R is a small multiple of the machine count, e.g. 200,000 M tasks and 5,000 R tasks for 2,000 machines.
Backup tasks: speculatively re-execute "straggling" workers.
Refinements
Partitioning function: how should intermediate results be distributed to reduce workers? Default: hash(key) mod R; e.g. hash(Hostname(URL)) mod R.
Combiner function: partial merging of data before the reduce step, to save bandwidth; e.g. the many <the, 1> pairs in word counting.
Refinements (continued)
Ordering: process values and produce ordered results.
Strong typing: strongly type input and output values.
Skip bad records: skip records whose processing consistently fails.
Counters: shared counters that may be updated from any map or reduce worker.
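The skip-bad-records and counters refinements can be combined in one small sketch: a record that keeps failing is skipped and counted rather than crashing the whole job. This is a single-process toy, assuming a hypothetical `safe_map` wrapper; in the real system the skip decision and counters are coordinated by the master across workers.

```python
from collections import Counter

counters = Counter()  # stand-in for the framework's shared counters

def safe_map(map_fn, records, max_retries=1):
    # Apply map_fn to each record; after max_retries failures on the same
    # record, skip it and bump a counter instead of failing the job.
    out = []
    for rec in records:
        for attempt in range(max_retries + 1):
            try:
                out.extend(map_fn(rec))
                break
            except Exception:
                if attempt == max_retries:
                    counters["skipped_records"] += 1
    return out

def parse(rec):
    # Example map function that fails on malformed records.
    return [int(rec)]

print(safe_map(parse, ["1", "2", "oops", "3"]))  # → [1, 2, 3]
print(counters["skipped_records"])               # → 1
```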

Editor's notes

  1. HDFS takes care of the details of data partitioning, scheduling program execution, handling machine failures, and handling inter-machine communication and data transfers.
  2. Pool commodity servers in a single hierarchical namespace. Designed for large files that are written once and read many times. The example here shows what happens with a replication factor of 3: each data block is present on at least 3 separate data nodes. A typical Hadoop node has eight cores, 16 GB of RAM, and four 1 TB SATA disks. The default block size is 64 MB, though most folks now set it to 128 MB.