MapReduce: Simplified Data
Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
Presented By
Ashraf Uddin
South Asian University
(http://ashrafsau.blogspot.in/)
11 February 2014
MapReduce
● A programming model & associated implementation
  – Processing & generating large datasets
● Programs written are automatically parallelized
● Takes care of
  – Partitioning the input data
  – Scheduling the program's execution
  – Handling machine failures
  – Managing inter-machine communication
MapReduce: Programming Model
● MapReduce expresses the computation as two functions: Map and Reduce
● Map: an input pair --> a set of intermediate key/value pairs
● Reduce: an intermediate key and its values --> output values
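The word-frequency example can be sketched in Python; the sequential driver below stands in for the distributed runtime, and the names map_fn, reduce_fn, and run_mapreduce are illustrative, not the paper's C++ interface:

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Map: one input pair -> a list of intermediate key/value pairs
    return [(word, 1) for word in contents.split()]

def reduce_fn(key, values):
    # Reduce: an intermediate key and all its values -> one output value
    return sum(values)

def run_mapreduce(inputs):
    # Sequential stand-in for the runtime: the "shuffle" step groups
    # intermediate pairs by key before handing them to reduce_fn.
    groups = defaultdict(list)
    for doc_name, contents in inputs:
        for key, value in map_fn(doc_name, contents):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

counts = run_mapreduce([("d1", "the quick fox"), ("d2", "the lazy dog")])
# counts == {"the": 2, "quick": 1, "fox": 1, "lazy": 1, "dog": 1}
```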
MapReduce: Examples
● Word Frequency in a large collection of documents
● Distributed Grep
● Count of URL Access Frequency
● Reverse Web-Link Graph
● Inverted Index
● Distributed Sort
● Term-Vector per Host
Implementation
● Many different implementation interfaces are possible
● The right choice depends on the environment:
  – A small shared-memory machine
  – A large NUMA multi-processor
  – A large collection of networked machines
Implementation: Execution Overview
● Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits.
● Reduce invocations are distributed by partitioning the intermediate key space into R pieces.
● There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
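The input-splitting step can be sketched as follows (split_input is a hypothetical helper, not the library's actual code): the input is cut into at most M roughly equal pieces, one per map task:

```python
def split_input(records, m):
    # Partition the input into at most M roughly equal splits,
    # one split per map task.
    size = max(1, -(-len(records) // m))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]

splits = split_input(list(range(10)), 3)   # 3 map tasks
```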
Implementation: Execution Overview

Fig: How MapReduce works & Data flow
Source: Guruzon.com
Implementation: Execution Overview

Fig: input data values in the MapReduce model
Source: Google Developers
Master Data Structure
● For each map task and reduce task, the master stores the state (idle, in-progress, or completed) and the identity of the worker machine (for non-idle tasks).
● For each completed map task, the master stores the locations and sizes of the R intermediate file regions produced by that map task.
● This information is pushed incrementally to workers that have in-progress reduce tasks.
Fault Tolerance: Worker Failure
● The master pings every worker periodically
● No response means the worker has failed
● All map tasks completed or in progress by the failed worker are reset to the idle state and re-executed on other machines.
● In-progress reduce tasks on the failed machine are rescheduled, but completed reduce tasks do not need to be re-executed.
Fault Tolerance: Worker Failure
● When a map task is executed first by worker A and later re-executed by worker B (because A failed), all workers executing reduce tasks are notified of the re-execution.
● Any reduce task that has not already read the data from worker A will read it from worker B.
Fault Tolerance: Semantics in the
Presence of Failures

● When the Map and Reduce operators are deterministic functions, this implementation produces the same output as would have been produced by a non-faulting sequential execution.
Implementation: Locality
● Network bandwidth is a relatively scarce resource in the computing environment.
● The input data, managed by GFS, is stored on the local disks of the machines.
● GFS divides each file into 64 MB blocks and stores several copies of each block.
Implementation: Locality
● The MapReduce master takes the location information of input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data.
● Failing that, it attempts to schedule a map task near a replica of that task's input data.
● When running large operations on a significant fraction of the workers in a cluster, most input is read locally and consumes no network bandwidth.
Implementation: Task Granularity
● M and R should be much larger than the number of worker machines.
● Having each worker perform many different tasks improves dynamic load balancing and also speeds up recovery after failures.
● The master makes O(M + R) scheduling decisions and keeps O(M * R) state in memory.
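With the paper's typical figures (M = 200,000 map tasks and R = 5,000 reduce tasks on 2,000 workers, at roughly one byte of state per map-task/reduce-task pair), the master's bookkeeping works out to:

```python
M, R, workers = 200_000, 5_000, 2_000

scheduling_decisions = M + R        # O(M + R): one decision per task
state_pairs = M * R                 # O(M * R): map-task/reduce-task pairs
state_bytes = state_pairs * 1       # ~1 byte per pair -> about 1 GB
tasks_per_worker = M / workers      # ~100 map tasks per machine
```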
Implementation: Task Granularity
● R is often constrained by users because the output of each reduce task ends up in a separate output file.
● Choose M so that each individual task has roughly 16 MB to 64 MB of input data, for the locality optimization.
Implementation: Backup Tasks
● A "straggler": a machine that takes an unusually long time to complete one of the last few map or reduce tasks.
● For example, a machine with a bad disk may experience frequent correctable errors that slow its read performance.
● When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks.
Refinement: Partitioning Function
● Data gets partitioned across the R reduce tasks using a partitioning function on the intermediate key (e.g. "hash(key) mod R").
● This tends to result in fairly well-balanced partitions.
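A sketch of the default partitioning function. CRC32 stands in for the unspecified hash; Python's built-in hash() is avoided here because it is salted per process:

```python
import zlib

def default_partition(key, r):
    # hash(key) mod R: every occurrence of a key, from any map task,
    # lands in the same one of the R reduce partitions.
    return zlib.crc32(key.encode("utf-8")) % r

R = 4
p1 = default_partition("apple", R)
p2 = default_partition("apple", R)  # same key -> same partition
```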
Refinement: Ordering Guarantees
● Within a given partition, the intermediate key/value pairs are processed in increasing key order.
● This ordering guarantee makes it easy to generate a sorted output file per partition.
Refinement: Combiner Function
● Each map task may produce hundreds or thousands of records with the same key.
● The combiner function does partial merging of this data before it is sent over the network.
● The combiner function is executed on each machine that performs a map task.
● It significantly speeds up certain classes of MapReduce operations (e.g. Word Frequency).
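The effect of a combiner for word frequency, as a sketch (map_fn and combine are illustrative names): partial merging on the map worker collapses repeated keys before they cross the network:

```python
from collections import Counter

def map_fn(contents):
    return [(word, 1) for word in contents.split()]

def combine(pairs):
    # Partial merge of one map task's output: repeated ("the", 1)
    # records become a single ("the", n) record before the shuffle.
    merged = Counter()
    for key, value in pairs:
        merged[key] += value
    return sorted(merged.items())

pairs = map_fn("the cat and the hat and the bat")
combined = combine(pairs)   # 8 records shrink to 5
```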
Refinement: Skipping Bad Records
● Sometimes there are bugs in user code that cause the Map or Reduce functions to crash deterministically on certain records.
● If the bugs are in a third-party library for which source code is not available, they cannot be fixed.
● Also, it is sometimes acceptable to ignore a few records (e.g. statistical analysis on a large dataset).
● The MapReduce library detects which records cause deterministic crashes and skips them.
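The skipping mechanism can be caricatured in a few lines (in the real library the worker reports the failing record to the master, which tells the next re-execution to skip it; here the retry is local and the names are illustrative):

```python
def map_with_skipping(map_fn, records, attempts=2):
    # A record that crashes map_fn on every attempt is treated as a
    # deterministic crasher and skipped instead of failing the task.
    output, skipped = [], []
    for index, record in enumerate(records):
        for attempt in range(attempts):
            try:
                output.extend(map_fn(record))
                break
            except Exception:
                if attempt == attempts - 1:
                    skipped.append(index)
    return output, skipped

def crashy_map(record):
    # Stand-in for buggy user code: crashes on certain records.
    if "corrupt" in record:
        raise ValueError("bad record")
    return [(record, 1)]

out, skipped = map_with_skipping(crashy_map, ["ok", "corrupt row", "fine"])
```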
Refinement: Status Information
● The master runs an internal HTTP server and exports a set of status pages for human consumption.
● The pages show the progress of the computation: how many tasks have been completed, how many are in progress, bytes of input data, bytes of intermediate data, processing rate, etc.
Refinement: Counters
● Counters count occurrences of various events.
● User code creates a named counter object and then increments the counter appropriately in the Map and/or Reduce function:

Counter* uppercase;
uppercase = GetCounter("uppercase");

map(String name, String contents):
  for each word w in contents:
    if (IsCapitalized(w)):
      uppercase->Increment();
    EmitIntermediate(w, "1");
Performance: Cluster Configuration
● Approximately 1800 machines
● 2 GHz Intel Xeon processors
● 4 GB of memory
● Two 160 GB IDE disks
● A gigabit Ethernet link per machine
● Switched network with 100-200 Gbps of aggregate bandwidth available at the root
● Round-trip time was less than a millisecond
Performance: Grep
● Scans through 10^10 100-byte records, searching for a relatively rare three-character pattern (which occurs in 92,337 records)
● The input is split into 64 MB pieces (M = 15000)
● The entire output is placed in one file (R = 1)
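The grep benchmark's sizing is easy to sanity-check: 10^10 records at 100 bytes each is about a terabyte, and dividing by M = 15,000 gives splits in the target 64 MB range:

```python
records, record_bytes = 10**10, 100
total_bytes = records * record_bytes      # 10^12 bytes, about 1 TB
M = 15_000
split_mib = total_bytes / M / 2**20       # ~64 MB of input per map task
```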
Performance: Grep
Performance: Grep
● 150 seconds from start to finish
● The overhead is due to:
  – propagation of the program to all worker machines
  – delays interacting with GFS to open the set of 1000 input files
  – getting the information needed for the locality optimization
Performance: Sort
● Scans through 10^10 100-byte records (approximately 1 terabyte of data)
● The sorting program consists of fewer than 50 lines of code
● A three-line Map function extracts a 10-byte sorting key from a text line and emits the key and the original text line
● M = 15000, R = 4000
● The final sorted output is written to a set of 2-way replicated GFS files (i.e., 2 terabytes of data)
Performance: Sort
Performance: Sort
● The input rate is higher than the shuffle rate and the output rate because of the locality optimization
● The shuffle rate is higher than the output rate because the output phase writes two copies of the sorted data
● With no backup tasks: an increase of 44% in completion time
● With worker processes killed mid-run: an increase of 5% over the normal execution time of 891 seconds
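The slide's percentages pin down the sort timings:

```python
normal_s = 891                      # normal sort execution time, seconds
no_backup_s = normal_s * 1.44       # +44% with backup tasks disabled
with_kills_s = normal_s * 1.05      # +5% with worker processes killed mid-run
```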
Experience
● Using MapReduce (instead of the ad-hoc distributed passes in the prior version of the indexing system) has provided several benefits:
  – The indexing code is simpler, smaller, and easier to understand (one phase shrank from 3800 lines to only 700 lines)
  – A change that took a few months in the old system took only a few days to implement in the new system
  – Machine failures, slow machines, and networking hiccups are dealt with automatically
THANK YOU

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentation
 
Database replication
Database replicationDatabase replication
Database replication
 
Google's Dremel
Google's DremelGoogle's Dremel
Google's Dremel
 
Google File System
Google File SystemGoogle File System
Google File System
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
GFS - Google File System
GFS - Google File SystemGFS - Google File System
GFS - Google File System
 
Web mining
Web miningWeb mining
Web mining
 
Hadoop Mapreduce Job Execution By Ravi Namboori Babson
Hadoop Mapreduce Job Execution By Ravi Namboori BabsonHadoop Mapreduce Job Execution By Ravi Namboori Babson
Hadoop Mapreduce Job Execution By Ravi Namboori Babson
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 
Hadoop combiner and partitioner
Hadoop combiner and partitionerHadoop combiner and partitioner
Hadoop combiner and partitioner
 
Cloud computing stack
Cloud computing stackCloud computing stack
Cloud computing stack
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
 
Hadoop Internals
Hadoop InternalsHadoop Internals
Hadoop Internals
 
Multivector and multiprocessor
Multivector and multiprocessorMultivector and multiprocessor
Multivector and multiprocessor
 
OpenMP Tutorial for Beginners
OpenMP Tutorial for BeginnersOpenMP Tutorial for Beginners
OpenMP Tutorial for Beginners
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streams
 
Parallel Database
Parallel DatabaseParallel Database
Parallel Database
 
Hadoop YARN
Hadoop YARNHadoop YARN
Hadoop YARN
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
 

Destacado

Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
NextGen Apache Hadoop MapReduce
NextGen Apache Hadoop MapReduceNextGen Apache Hadoop MapReduce
NextGen Apache Hadoop MapReduceHortonworks
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduceM Baddar
 
Big Data: Its Characteristics And Architecture Capabilities
Big Data: Its Characteristics And Architecture CapabilitiesBig Data: Its Characteristics And Architecture Capabilities
Big Data: Its Characteristics And Architecture CapabilitiesAshraf Uddin
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduceVarad Meru
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingMohammad Mustaqeem
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduceFrane Bandov
 
Cloud storage slides
Cloud storage slidesCloud storage slides
Cloud storage slidesEvan Powell
 
Hyper threading technology
Hyper threading technologyHyper threading technology
Hyper threading technologydeepakmarndi
 

Destacado (13)

Mapreduce Osdi04
Mapreduce Osdi04Mapreduce Osdi04
Mapreduce Osdi04
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
NextGen Apache Hadoop MapReduce
NextGen Apache Hadoop MapReduceNextGen Apache Hadoop MapReduce
NextGen Apache Hadoop MapReduce
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
Big Data: Its Characteristics And Architecture Capabilities
Big Data: Its Characteristics And Architecture CapabilitiesBig Data: Its Characteristics And Architecture Capabilities
Big Data: Its Characteristics And Architecture Capabilities
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud Computing
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
 
Cloud storage
Cloud storageCloud storage
Cloud storage
 
Cloud storage slides
Cloud storage slidesCloud storage slides
Cloud storage slides
 
MapReduce basic
MapReduce basicMapReduce basic
MapReduce basic
 
Hyper threading technology
Hyper threading technologyHyper threading technology
Hyper threading technology
 
MapReduce in Simple Terms
MapReduce in Simple TermsMapReduce in Simple Terms
MapReduce in Simple Terms
 

Similar a MapReduce: Simplified Data Processing on Large Clusters

MapReduce:Simplified Data Processing on Large Cluster Presented by Areej Qas...
MapReduce:Simplified Data Processing on Large Cluster  Presented by Areej Qas...MapReduce:Simplified Data Processing on Large Cluster  Presented by Areej Qas...
MapReduce:Simplified Data Processing on Large Cluster Presented by Areej Qas...areej qasrawi
 
MapReduce presentation
MapReduce presentationMapReduce presentation
MapReduce presentationVu Thi Trang
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profilepramodbiligiri
 
mapreduce.pptx
mapreduce.pptxmapreduce.pptx
mapreduce.pptxShimoFcis
 
Introduction of MapReduce
Introduction of MapReduceIntroduction of MapReduce
Introduction of MapReduceHC Lin
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelMartin Zapletal
 
Hadoop Map Reduce OS
Hadoop Map Reduce OSHadoop Map Reduce OS
Hadoop Map Reduce OSVedant Mane
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scalesamthemonad
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduceBhupesh Chawda
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce scriptHaripritha
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentationAhmad El Tawil
 
Cloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii VozniukCloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii VozniukAndrii Vozniuk
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabadsreehari orienit
 

Similar a MapReduce: Simplified Data Processing on Large Clusters (20)

MapReduce:Simplified Data Processing on Large Cluster Presented by Areej Qas...
MapReduce:Simplified Data Processing on Large Cluster  Presented by Areej Qas...MapReduce:Simplified Data Processing on Large Cluster  Presented by Areej Qas...
MapReduce:Simplified Data Processing on Large Cluster Presented by Areej Qas...
 
Map reduce
Map reduceMap reduce
Map reduce
 
MapReduce presentation
MapReduce presentationMapReduce presentation
MapReduce presentation
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profile
 
mapreduce.pptx
mapreduce.pptxmapreduce.pptx
mapreduce.pptx
 
Introduction of MapReduce
Introduction of MapReduceIntroduction of MapReduce
Introduction of MapReduce
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming model
 
MapReduce
MapReduceMapReduce
MapReduce
 
Hadoop Map Reduce OS
Hadoop Map Reduce OSHadoop Map Reduce OS
Hadoop Map Reduce OS
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce script
 
E031201032036
E031201032036E031201032036
E031201032036
 
Apache Giraph
Apache GiraphApache Giraph
Apache Giraph
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentation
 
Cloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii VozniukCloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabad
 
MapReduce
MapReduceMapReduce
MapReduce
 

Más de Ashraf Uddin

Más de Ashraf Uddin (6)

A short tutorial on r
A short tutorial on rA short tutorial on r
A short tutorial on r
 
Text Mining Infrastructure in R
Text Mining Infrastructure in RText Mining Infrastructure in R
Text Mining Infrastructure in R
 
Software piracy
Software piracySoftware piracy
Software piracy
 
Naive bayes
Naive bayesNaive bayes
Naive bayes
 
Freenet
FreenetFreenet
Freenet
 
Dynamic source routing
Dynamic source routingDynamic source routing
Dynamic source routing
 

Último

Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Shubhangi Sonawane
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docxPoojaSen20
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsMebane Rash
 
Role Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptxRole Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptxNikitaBankoti2
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...Nguyen Thanh Tu Collection
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxnegromaestrong
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.pptRamjanShidvankar
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxVishalSingh1417
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptxMaritesTamaniVerdade
 
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural ResourcesEnergy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural ResourcesShubhangi Sonawane
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhikauryashika82
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibitjbellavia9
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxVishalSingh1417
 
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-IIFood Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-IIShubhangi Sonawane
 

Último (20)

Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
Role Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptxRole Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptx
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural ResourcesEnergy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-IIFood Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
 

MapReduce: Simplified Data Processing on Large Clusters

  • 1. MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat Presented By Ashraf Uddin South Asian University (http://ashrafsau.blogspot.in/) 11 February 2014
  • 2. MapReduce ● A programming model & associated implementaion – Processing & generating large datasets ● Programs written are automatically parallelized ● Takes care of – Partitioning the input data – Scheduling the program's execution – Handling machine failures – Managing inter-machine communication
  • 3. MapReduce: Programming Model ● MapReduce expresses the computation as two functions: Map and Reduce ● Map: an input pair --> key/value pairs ● Reduce: Intermediate key/values --> output
  • 4. MapReduce: Examples ● Word Frequency in a large collection of documents ● Distributed grep ● Count of URL access Frequency ● Reverse Web-Link graph ● Inverted Index ● Distributed Sort ● Term-Vector per Host
  • 5. Implementation ● Many different implementaions interfaces ● Depends on the environment – A small memory shared memory – A large NUMA multi-processor – A large collection of networked machine
  • 6. Implementaion: Execution Overview ● ● ● Map invocations are distributed across multiple machines by automatically partitioning the input data in a set of M splits. Reduce invocations are distributed by partitioning the intermediate key space into R pieces. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
  • 7. Implementaion: Execution Overview Fig: How MapReduce works & Data flow Source: Guruzon.com
  • 8. Implementaion: Execution Overview Fig: input data values in the MapReduce model Source: Google Developers
  • 9. Master Data Structure ● ● ● For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine For each completed map task, the master stores the locations and sizes of R intermediate file regions produced by the map task. The information is pushed incrementally to workers that have in progress task.
  • 10. Fault Tolerance: Worker Failure ● The master pings every worker periodically ● No respose means the worker is failed ● ● All map tasks completed or in-progressby the worker are reset to idle state and reexecuted on other machines. For a failed machine, in-progress reduce tasks are rescheduled but completed reduce tasks do not need to be re-executed.
  • 11. Fault Tolerance: Worker Failure ● ● When a map task is executed first by worker A and later executed by worker B (because A failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read data from worker A will read data from B.
  • 12. Fault Tolerance: Semantics in the Presence of Failures ● When the Map and Reduce operators are deterministic functions, this implementation produces the same output as would have been produced by a non-faulting sequential execution.
  • 13. Implementaion: Locality ● ● ● Network bandwith is relatively scarce resource in the computing environment. The input data managed by GFS is stored on the local disk of the machines GFS divides each file into 64 MB blocks, and stores several copies of each block
  • 14. Implementaion: Locality ● ● ● The MapReduce master takes the location information of input files into acount and attempts to schedule a map task on a machine that contains a replica of the corresponding input data. Failing that, it attempts to schedule a map task near a replica of that task's input data. A significant fraction of the workers in a cluster, most input is read locally and consumes no network bandwidth.
  • 15. Implementaion: Task Granularity ● ● ● M and R should be much larger than the number of worker machines. Having each worker perform many different tasks improves dynamic load balancing and also speeds up recovery. The master makes O(M+R) scheduling decisions and keeps O(M*R) state in memory.
  • 16. Implementaion: Task Granularity ● ● R is often constrained by users because the outout of each reduce task ends up in a separate output file. Choose M such that individual task is roughly 16 MB to 64 MB input data for locality optimization.
• 17. Implementation: Backup Tasks ● A “straggler”: a machine that takes an unusually long time to complete one of the last few map or reduce tasks ● For example, a machine with a bad disk may experience frequent correctable errors that slow its read performance ● When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks
• 18. Refinement: Partitioning Function ● Data gets partitioned across the R reduce tasks using a partitioning function on the intermediate key (e.g., “hash(key) mod R”) ● This tends to result in fairly well-balanced partitions
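The default partitioning scheme is exactly the one-liner named above. In this sketch CRC32 stands in for the hash function (any deterministic hash works); the function name is illustrative.

```python
import zlib

def partition(key, R):
    """Route an intermediate key to one of R reduce tasks: hash(key) mod R."""
    return zlib.crc32(key.encode("utf-8")) % R
```

Because every map worker applies the same deterministic function, all pairs sharing a key end up in the same partition, which is what lets a single reduce task see all values for that key.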
• 19. Refinement: Ordering Guarantees ● Within a given partition, the intermediate key/value pairs are processed in increasing key order ● This ordering guarantee makes it easy to generate a sorted output file per partition
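The ordering guarantee can be seen in a sketch of what a reduce worker does with its partition: sort the intermediate pairs by key, then invoke the user's reduce function once per key group, so output naturally comes out in increasing key order. Names here are illustrative, not the library's API.

```python
from itertools import groupby
from operator import itemgetter

def reduce_partition(pairs, reduce_fn):
    """Sort one partition's intermediate pairs by key, then reduce each group.

    Yields (key, result) in increasing key order, i.e. a sorted output file.
    """
    output = []
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        values = [v for _, v in group]
        output.append((key, reduce_fn(key, values)))
    return output
```

Since each partition's output file is sorted and the partitioning function controls which keys go where, range lookups and merges over the output files stay cheap.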
• 20. Refinement: Combiner Function ● Each map task may produce hundreds or thousands of records with the same key ● A combiner function does partial merging of this data before it is sent over the network ● The combiner function is executed on each machine that performs a map task ● It significantly speeds up certain classes of MapReduce operations (e.g., word frequency)
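For the word-frequency example, the combiner collapses a map task's many ("the", 1) records into a single ("the", n) before anything crosses the network. A minimal sketch, with illustrative function names:

```python
from collections import Counter

def map_word_count(contents):
    """Map phase: emit (word, 1) for every word in the input."""
    return [(w, 1) for w in contents.split()]

def combine(pairs):
    """Combiner: partially merge counts on the map worker, pre-network."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return sorted(counts.items())
```

The combiner here is literally the same logic as the reduce function, which is typical: it is legal precisely when reduction is associative and commutative, so partial merges compose.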
• 21. Refinement: Skipping Bad Records ● Sometimes there are bugs in user code that cause the Map or Reduce functions to crash deterministically on certain records ● If the bug is in a third-party library for which source code is not available, it cannot be fixed ● Also, it is sometimes acceptable to ignore a few records (e.g., statistical analysis on a large dataset) ● The MapReduce library detects which records cause deterministic crashes and skips them on re-execution
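The detect-and-skip behavior can be sketched as a wrapper around the user's map function. This is a simplification under stated assumptions: the real library catches crashes with signal handlers and reports the offending record's sequence number to the master, whereas this sketch just retries in-process and skips a record after repeated failures.

```python
def run_map_with_skipping(map_fn, records, max_failures=2):
    """Run map_fn over records, skipping ones that fail deterministically.

    Returns (output_pairs, skipped_indices). A record is skipped only after
    it has crashed max_failures times, mimicking "seen more than one failure
    on this record" in the paper's mechanism.
    """
    output, skipped = [], []
    for i, record in enumerate(records):
        for attempt in range(max_failures):
            try:
                output.extend(map_fn(record))
                break  # record processed successfully
            except Exception:
                if attempt == max_failures - 1:
                    skipped.append(i)  # deterministic crash: give up
    return output, skipped
```

With this policy a transient, one-off failure still gets retried, while a record that crashes every time is sacrificed so the whole job can finish.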
• 22. Refinement: Status Information ● The master runs an internal HTTP server and exports a set of status pages for human consumption ● The pages show the progress of the computation: how many tasks have been completed, how many are in progress, bytes of input data, bytes of intermediate data, processing rate, etc.
• 23. Refinement: Counters ● To count occurrences of various events, user code creates a named counter object and then increments the counter appropriately in the Map and/or Reduce function:

Counter* uppercase;
uppercase = GetCounter("uppercase");

map(String name, String contents):
  for each word w in contents:
    if (IsCapitalized(w)):
      uppercase->Increment();
    EmitIntermediate(w, "1");
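The paper's pseudocode above can be mimicked in a runnable sketch. The class and function names here are illustrative stand-ins, not the C++ API; the real library also propagates counter values from workers to the master and de-duplicates counts from backup executions.

```python
class NamedCounter:
    """Minimal stand-in for the library's named counter object."""
    def __init__(self, name):
        self.name = name
        self.value = 0

    def increment(self):
        self.value += 1

def map_count_uppercase(contents, uppercase):
    """Emit (word, "1") pairs, bumping the counter for capitalized words."""
    pairs = []
    for w in contents.split():
        if w[:1].isupper():
            uppercase.increment()
        pairs.append((w, "1"))
    return pairs
```

Counters like this give a cheap sanity check on a run, e.g. verifying that the number of output pairs matches the number of input words processed.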
• 24. Performance: Cluster Configuration ● Approximately 1800 machines ● 2 GHz Intel Xeon processors ● 4 GB of memory ● Two 160 GB IDE disks ● A gigabit Ethernet link per machine ● Switched network with 100-200 Gbps of aggregate bandwidth available at the root ● Round-trip time was less than a millisecond
• 25. Performance: Grep ● Scans through 10^10 100-byte records, searching for a relatively rare three-character pattern (it occurs in 92,337 records) ● The input is split into 64 MB pieces (M = 15,000) ● The entire output is placed in one file (R = 1)
• 27. Performance: Grep ● 150 seconds from start to finish ● The overhead is due to – the propagation of the program to all worker machines – delays interacting with GFS to open the set of 1000 input files – the time needed to get the location information for the locality optimization
• 28. Performance: Sort ● Scans through 10^10 100-byte records (approximately 1 terabyte of data) ● The sorting program consists of fewer than 50 lines of code ● A three-line Map function extracts a 10-byte sorting key from a text line and emits the key and the original text line ● M = 15,000 and R = 4,000 ● The final sorted output is written to a set of 2-way replicated GFS files (i.e., 2 terabytes of output)
• 30. Performance: Sort ● The input rate is higher than the shuffle rate and the output rate because of the locality optimization ● The shuffle rate is higher than the output rate because the output phase writes two copies of the sorted data ● With backup tasks disabled, execution time increases by 44% ● With 200 worker processes intentionally killed, execution time increases by only 5% over the normal 891 seconds
• 31. Experience ● Using MapReduce (instead of the ad-hoc distributed passes in the prior version of the indexing system) has provided several benefits: – The indexing code is simpler, smaller, and easier to understand (one phase dropped from 3800 lines to only 700 lines) – A change that took a few months to make in the old indexing system took only a few days to implement in the new one – Machine failures, slow machines, and networking hiccups are dealt with automatically