Moore’s Law has finally hit the wall: single-core clock speeds have stalled, and in some cases even decreased, over the last few years. The industry is reacting with hardware that packs an ever-growing number of cores and with software that can leverage “grids” of distributed, often commodity, computing resources. But how is a traditional Java developer supposed to take advantage of this revolution? The answer is the Apache Hadoop family of projects. Hadoop is a suite of open source APIs at the forefront of this grid computing revolution and is widely considered the gold standard for the divide-and-conquer model of distributed problem crunching. The well-travelled Apache Hadoop framework is used in production by prominent names such as Yahoo, IBM, Amazon, Adobe, AOL, Facebook and Hulu.
In this session, you’ll start by learning the vocabulary unique to the distributed computing space. Next, we’ll discover how to shape a problem and its processing to fit the Hadoop MapReduce framework. We’ll then examine the auto-replicating, redundant and self-healing HDFS filesystem. Finally, we’ll fire up several Hadoop nodes and watch a live calculation get devoured by our Hadoop grid. By the talk’s conclusion, you’ll feel equipped to take on any massive data set and processing job your employer can throw at you.
4. MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
jeff@google.com, sanjay@google.com
Google, Inc.

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.

1 Introduction

Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical “record” in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.

The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs. Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis for a rewrite of our production indexing system.

To appear in OSDI 2004
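The map/shuffle/reduce flow the abstract describes can be sketched in plain Java as a single-process word count. This is a minimal sketch with illustrative class and method names, not Hadoop's actual API; a real framework performs the grouping step across machines and adds partitioning and fault tolerance.

```java
import java.util.*;

// A single-process sketch of the MapReduce model: map each input record to
// intermediate (key, value) pairs, group the pairs by key (the "shuffle"),
// then reduce each key's values to a final result.
public class WordCountSketch {

    // Map: one line of text in, a list of (word, 1) pairs out.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce: a word and all of its intermediate values in, a total out.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    // Driver: performs the shuffle locally, then reduces each group.
    public static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the quick brown fox", "the lazy dog");
        System.out.println(run(lines)); // {brown=1, dog=1, fox=1, lazy=1, quick=1, the=2}
    }
}
```

In Hadoop proper the same two functions become overrides of the framework's Mapper and Reducer classes, and Hadoop itself handles the shuffle, data distribution and re-execution on failure.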
119. MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat, Google, Inc.
Friday, January 15, 2010 4
Seminal paper on MapReduce
http://labs.google.com/papers/mapreduce.html
128. 1 TB for $74.85
129. 0
0
,0
$10
vs
0
0
0
$ 1,
Friday, January 15, 2010 14
130. Buy your way out of failure vs. Failure is inevitable, go cheap
This concept doesn’t work well at weddings or dinner parties
165. Amazon Elastic MapReduce
Launched in April 2009
Save results to S3 buckets
http://aws.amazon.com/elasticmapreduce/#functionality