MapReduce Rahul Agarwal irahul.com
Attributions: Dean and Ghemawat, http://labs.google.com/papers/mapreduce.html
Agenda
MapReduce
Programming model
Examples
Execution
Refinements
Q&A
Hadoop principles and MapReduce
HDFS: fault-tolerant, high-bandwidth clustered storage.
Automatically and transparently routes around failure.
Master (NameNode) – slave architecture.
Speculatively executes redundant tasks if certain nodes are detected to be slow.
Moves compute to the data: lower latency, lower bandwidth.
HDFS: Hadoop Distributed File System. Default block size = 64 MB, replication factor = 3.
MapReduce
Patented by Google: a "programming model… for processing and generating large data sets" that allows such programs to be "automatically parallelized and executed on a large cluster".
Works with structured and unstructured data.
The map function processes a key/value pair to generate a set of intermediate key/value pairs.
The reduce function merges all intermediate values associated with the same intermediate key.
MapReduce signatures:
map (in_key, in_value) -> list(intermediate_key, intermediate_value)
reduce (intermediate_key, list(intermediate_value)) -> list(out_value)
Example: count word occurrences
map (String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");
reduce (String key, Iterator values):
  // key: a word
  // values: a list of counts
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
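The word-count pseudocode above can be sketched in runnable Python. The `run_mapreduce` driver here is a toy in-memory stand-in for the real framework's shuffle phase, not part of the original model; all names are illustrative.

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Emit an intermediate ("word", "1") pair for every word in the document.
    for word in contents.split():
        yield word, "1"

def reduce_fn(word, counts):
    # Sum the partial counts collected for one word.
    yield str(sum(int(c) for c in counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Toy driver: the shuffle step groups intermediate pairs by key,
    # then each key's value list is handed to the reduce function.
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    return {k: list(reduce_fn(k, vs)) for k, vs in sorted(groups.items())}

docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# → {'brown': ['1'], 'dog': ['1'], 'fox': ['1'], 'lazy': ['1'], 'quick': ['1'], 'the': ['2']}
```

Counts travel as strings (as in the slide's pseudocode) purely to mirror the original; a real job would use typed values.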
Example: distributed grep
map (String key, String value):
  // key: document name
  // value: document contents
  for each line in value:
    if line.match(pattern)
      EmitIntermediate(key, line);
reduce (String key, Iterator values):
  // key: document name
  // values: a list of lines
  for each v in values:
    Emit(v);
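A minimal Python sketch of the distributed-grep example, again with a toy in-memory driver standing in for the framework. The concrete pattern `"error"` is an assumption; the slide leaves the pattern abstract.

```python
import re
from collections import defaultdict

PATTERN = re.compile(r"error")  # illustrative pattern, not from the slide

def map_fn(doc_name, contents):
    # Emit (document name, line) for every line matching the pattern.
    for line in contents.splitlines():
        if PATTERN.search(line):
            yield doc_name, line

def reduce_fn(doc_name, lines):
    # Identity reduce: just pass the matching lines through.
    for line in lines:
        yield line

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Toy shuffle: group intermediate pairs by key, then reduce each group.
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    return {k: list(reduce_fn(k, vs)) for k, vs in groups.items()}

logs = [("a.log", "ok\nerror: disk full\nok"), ("b.log", "error: timeout")]
print(run_mapreduce(logs, map_fn, reduce_fn))
# → {'a.log': ['error: disk full'], 'b.log': ['error: timeout']}
```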

Example: URL access frequency
map (String key, String value):
  // key: log name
  // value: log contents
  for each line in value:
    EmitIntermediate(URL(line), "1");
reduce (String key, Iterator values):
  // key: a URL
  // values: a list of counts
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
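A Python sketch of the URL-access-frequency example. The `url_of` parser is a hypothetical stand-in for the slide's `URL(line)` helper; here it assumes the URL is the first whitespace-separated field of a log line.

```python
from collections import defaultdict

def url_of(line):
    # Hypothetical parser: take the first field of the log line as the URL.
    return line.split()[0]

def map_fn(log_name, contents):
    # Emit (URL, "1") for every request line in the log.
    for line in contents.splitlines():
        yield url_of(line), "1"

def reduce_fn(url, counts):
    # Total the access count for one URL.
    yield str(sum(int(c) for c in counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Toy in-memory shuffle-and-reduce driver.
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    return {k: list(reduce_fn(k, vs)) for k, vs in sorted(groups.items())}

logs = [("access.log", "/index GET\n/index GET\n/about GET")]
print(run_mapreduce(logs, map_fn, reduce_fn))
# → {'/about': ['1'], '/index': ['2']}
```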
Example: reverse web-link graph
map (String key, String value):
  // key: source document name
  // value: document contents
  for each link in value:
    EmitIntermediate(link, key);
reduce (String key, Iterator values):
  // key: a target link
  // values: a list of sources
  for each v in values:
    source_list.add(v);
  Emit(AsPair(key, source_list));
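The reverse-link example inverts the graph: map emits (target, source) for every outgoing link, and reduce collects all sources pointing at each target. A minimal Python sketch, assuming a toy `links_of` extractor (real pages would need HTML parsing):

```python
from collections import defaultdict

def links_of(contents):
    # Hypothetical link extractor; here links are just whitespace-separated names.
    return contents.split()

def map_fn(source, contents):
    # Emit (target, source) for every outgoing link in the source page.
    for target in links_of(contents):
        yield target, source

def reduce_fn(target, sources):
    # Collect every source that links to this target.
    yield sorted(sources)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Toy in-memory shuffle-and-reduce driver.
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    return {k: list(reduce_fn(k, vs)) for k, vs in groups.items()}

pages = [("a.html", "b.html c.html"), ("b.html", "c.html")]
print(run_mapreduce(pages, map_fn, reduce_fn))
# → {'b.html': [['a.html']], 'c.html': [['a.html', 'b.html']]}
```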
Execution optimization
Locality: network bandwidth is scarce, so compute on local copies of the data, which are distributed by HDFS.
Task granularity: the ratio of map (M) to reduce (R) workers. Ideally M and R are much larger than the number of machines in the cluster. Typically M is chosen so there is one task per 64 MB block, and R is a small multiple of the machine count, e.g. 200,000 M tasks and 5,000 R tasks for 2,000 machines.
Backup tasks: speculatively re-execute "straggling" workers.
Refinements
Partitioning function: how should intermediate results be distributed to reduce workers? Default: hash(key) mod R; e.g. hash(Hostname(URL)) mod R.
Combiner function: partial merging of data before the reduce step, to save bandwidth; e.g. the many <the, 1> pairs in word counting.
Refinements (continued)
Ordering: process values and produce ordered results.
Strong typing: strongly type input and output values.
Skip bad records: skip records whose processing consistently fails.
Counters: shared counters that may be updated from any map or reduce worker.
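The skip-bad-records and counters refinements can be combined in one small sketch: a record that keeps failing is skipped and counted rather than crashing the whole job. This is a single-process toy, assuming a hypothetical `safe_map` wrapper; in the real system the skip decision and counters are coordinated by the master across workers.

```python
from collections import Counter

counters = Counter()  # stand-in for the framework's shared counters

def safe_map(map_fn, records, max_retries=1):
    # Apply map_fn to each record; after max_retries failures on the same
    # record, skip it and bump a counter instead of failing the job.
    out = []
    for rec in records:
        for attempt in range(max_retries + 1):
            try:
                out.extend(map_fn(rec))
                break
            except Exception:
                if attempt == max_retries:
                    counters["skipped_records"] += 1
    return out

def parse(rec):
    # Example map function that fails on malformed records.
    return [int(rec)]

print(safe_map(parse, ["1", "2", "oops", "3"]))  # → [1, 2, 3]
print(counters["skipped_records"])               # → 1
```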

Editor's notes

  1. HDFS takes care of the details of data partitioning, scheduling program execution, handling machine failures, and handling inter-machine communication and data transfers.
  2. Pool commodity servers in a single hierarchical namespace. Designed for large files that are written once and read many times. The example here shows what happens with a replication factor of 3: each data block is present on at least 3 separate data nodes. A typical Hadoop node has eight cores, 16 GB of RAM, and four 1 TB SATA disks. The default block size is 64 MB, though most folks now set it to 128 MB.