Moore’s Law has finally hit the wall: single-core clock speeds have stalled, and in some cases even decreased, over the last few years. The industry is reacting with hardware that packs an ever-growing number of cores and with software that can leverage “grids” of distributed, often commodity, computing resources. But how is a traditional Java developer supposed to take advantage of this revolution? The answer is the Apache Hadoop family of projects. Hadoop is a suite of open source APIs at the forefront of this grid computing revolution and is widely considered the gold standard for the divide-and-conquer model of distributed problem crunching. The well-travelled Apache Hadoop framework is used in production by prominent names such as Yahoo, IBM, Amazon, Adobe, AOL, Facebook and Hulu.
In this session, you’ll start by learning the vocabulary unique to the distributed computing space. Next, we’ll discover how to shape a problem and its processing to fit the Hadoop MapReduce framework. We’ll then examine the auto-replicating, redundant and self-healing HDFS filesystem. Finally, we’ll fire up several Hadoop nodes and watch a live calculation get devoured by our Hadoop grid. By the talk’s conclusion, you’ll feel equipped to take on any massive data set and processing job your employer can throw at you.
4. MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
jeff@google.com, sanjay@google.com
Google, Inc.

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.

1 Introduction

Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical “record” in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.

The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs. Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis for a rewrite of our production indexing system.

To appear in OSDI 2004
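The map/shuffle/reduce flow the abstract describes can be sketched in plain Java as a single-process word count. This is a minimal sketch with illustrative class and method names, not Hadoop's actual API; a real framework performs the grouping step across machines and adds partitioning and fault tolerance.

```java
import java.util.*;

// A single-process sketch of the MapReduce model: map each input record to
// intermediate (key, value) pairs, group the pairs by key (the "shuffle"),
// then reduce each key's values to a final result.
public class WordCountSketch {

    // Map: one line of text in, a list of (word, 1) pairs out.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce: a word and all of its intermediate values in, a total out.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    // Driver: performs the shuffle locally, then reduces each group.
    public static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the quick brown fox", "the lazy dog");
        System.out.println(run(lines)); // {brown=1, dog=1, fox=1, lazy=1, quick=1, the=2}
    }
}
```

In Hadoop proper the same two functions become overrides of the framework's Mapper and Reducer classes, and Hadoop itself handles the shuffle, data distribution and re-execution on failure.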
119. MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat, Google, Inc.
Friday, January 15, 2010 4
Seminal paper on MapReduce
http://labs.google.com/papers/mapreduce.html
128. 1 TB for $74.85
129. 0
0
,0
$10
vs
0
0
0
$ 1,
Friday, January 15, 2010 14
130. Buy your way out of failure vs. Failure is inevitable, go cheap
This concept doesn’t work well at weddings or dinner parties
165. Amazon Elastic MapReduce
Launched in April 2009
Save results to S3 buckets
http://aws.amazon.com/elasticmapreduce/#functionality