Hadoop
Talk Metadata
MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat
jeff@google.com, sanjay@google.com

Google, Inc.

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

1 Introduction

Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.

The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.

Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis …

To appear in OSDI 2004
MapReduce history


Origins
Today
Why Hadoop?
$74.85 vs $10,000
"Buy your way out of Failure" vs "Failure is inevitable: Go Cheap"
Sproinnnng! Bzzzt! Crrrkt!
Structured
Unstructured
NOSQL
Applications
Particle Physics
Financial Trends
Contextual Ads
Hadoop Family
Hadoop Components
the Players
the PlayAs
MapReduce
The process
Start
Map
Grouping
Reduce
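The four stages above (Start, Map, Grouping, Reduce) can be sketched as a single-machine word count in plain Python. This is a minimal illustration of the model, not Hadoop's real API; the function names `map_fn` and `reduce_fn` are our own.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    """Map: emit an intermediate (word, 1) pair for each word in a record."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    """Reduce: merge all intermediate values that share the same key."""
    return (word, sum(counts))

def map_reduce(lines):
    # Start: feed every input record through the map function.
    pairs = [pair for line in lines for pair in map_fn(line)]
    # Grouping (the "shuffle"): collect intermediate values by key.
    pairs.sort(key=itemgetter(0))
    grouped = groupby(pairs, key=itemgetter(0))
    # Reduce: one call per distinct intermediate key.
    return [reduce_fn(word, (c for _, c in values)) for word, values in grouped]

print(map_reduce(["the quick brown fox", "the lazy dog"]))
```

In real Hadoop the map calls run in parallel across the cluster and the grouping step moves data between machines; the programmer still only writes the two small functions.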
MapReduce Demo
HDFS
HDFS Basics
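The core HDFS idea can be sketched in a few lines: split a file into fixed-size blocks and store each block on several datanodes. This is a toy simulation only; real HDFS of this era defaulted to 64 MB blocks and a replication factor of 3, and the 4-byte blocks and node names below are purely illustrative.

```python
BLOCK_SIZE = 4   # bytes; tiny so the demo output is readable
REPLICATION = 3  # HDFS's default replication factor

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a byte string into fixed-size blocks, as HDFS does with files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i, block in enumerate(blocks):
        nodes = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
        placement[i] = (block, nodes)
    return placement

blocks = split_into_blocks(b"hello hdfs!")
print(blocks)  # three blocks of at most 4 bytes each
print(place_replicas(blocks, ["node-a", "node-b", "node-c", "node-d"]))
```

Losing any one node leaves every block with two surviving copies, which is why cheap, failure-prone hardware is acceptable.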
Data Overload
HDFS Demo
Pig
Pig Basics
Pig Sample


-- Load the password file, splitting each line into fields on ':'
A = load 'passwd' using PigStorage(':');
-- Project out the first field (the user name)
B = foreach A generate $0 as id;
dump B;                 -- print the result to the console
store B into 'id.out';  -- and write it to the 'id.out' output
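For readers new to Pig Latin, the script above is roughly equivalent to this plain-Python sketch (the ':'-delimited passwd format is standard; the function name is ours):

```python
def extract_ids(lines):
    """Project the first ':'-delimited field from each non-empty line,
    like `foreach A generate $0` in the Pig script above."""
    return [line.split(":")[0] for line in lines if line.strip()]

passwd = [
    "root:x:0:0:root:/root:/bin/bash",
    "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin",
]
print(extract_ids(passwd))  # ['root', 'daemon']
```

The difference is that Pig compiles this into MapReduce jobs, so the same four lines scale from one file to terabytes.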
Pig Demo
Hive
Hive Basics
Sqoop
HBase
HBase Basics
HBase Demo
Amazon Elastic MapReduce
EMR Languages
EMR Pricing
EMR Functions
Final Thoughts
"Ha! Your Hadoop is slower than my Hadoop!" "Shut up! I'm reducing."
"Does this Hadoop make my list look ugly?"
Use Hadoop!
Thanks
Hadoop
Contact
Credits
Friday, January 15, 2010
                                                                             on Google’s clusters                                             fault tolerance.
                                     every day.                                                                The major contributions of this
                                                                                                                                                    work are a simple and
                                                                                                          powerful interface that enables
                                                                                                                                               automatic parallelization
                                                                                                          and distribution of large-scal
                                                                                                                                            e computations, combined
                                     1 Introduction                                                       with an implementation of this
                                                                                                                                                  interface that achieves
                                                                                                          high performance on large clusters
                                                                                                                                                    of commodity PCs.
                                     Over the past five years, the                                             Section 2 describes the basic
                                                                  authors and many others at                                                   programming model and
                                     Google have implemented hundreds                                    gives several examples. Section
                                                                           of special-purpose                                                      3 describes an imple-
                                     computations that process large                                     mentation of the MapReduce
                                                                       amounts of raw data,                                                 interface tailored towards
                                    such as crawled documents,                                           our cluster-based computing
                                                                   web request logs, etc., to                                             environment. Section 4 de-
                                    compute various kinds of derived                                     scribes several refinements of
                                                                        data, such as inverted                                               the programming model
                                    indices, various representations                                    that we have found useful. Section
                                                                      of the graph structure                                                           5 has performance
                                    of web documents, summarie                                          measurements of our implemen
                                                                   s of the number of pages                                                     tation for a variety of
                                    crawled per host, the set of                                        tasks. Section 6 explores the
                                                                  most frequent queries in a                                               use of MapReduce within
                                                                                                        Google including our experienc
                                                                                                                                             es in using it as the basis
                                    To appear in OSDI 2004
                                                                                                                                                                         1




               Friday, January 15, 2010                                                                                                                                             4

Seminal paper on MapReduce
http://labs.google.com/papers/mapreduce.html
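The model the paper describes (a user-supplied map function emitting intermediate key/value pairs, and a reduce function merging the values that share a key) can be sketched as a toy, single-process Python simulation; the function names and driver here are illustrative, not the paper's actual API:

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Map: emit an intermediate (word, 1) pair for every word.
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: merge all intermediate values that share a key.
    return word, sum(counts)

def map_reduce(inputs):
    # Group intermediate pairs by key, then reduce each group.
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog the end")]
print(map_reduce(docs))  # 'the' appears 3 times across both documents
```

In the real system the same two user functions run on thousands of machines; the library, not the programmer, handles the grouping and the failures.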
MapReduce history



http://www.cern.ch/
Origins





http://en.wikipedia.org/wiki/Mapreduce
Today





http://hadoop.apache.org
Today





http://wiki.apache.org/hadoop/PoweredBy
Today




Why Hadoop?



1 TB: $74.85
4 GB
$10,000
vs
$1,000
Buy your way out of Failure
vs
Failure is inevitable. Go Cheap.

This concept doesn’t work well at weddings or dinner parties
Sproinnnng! Bzzzt! Crrrkt!

http://www.robinmajumdar.com/2006/08/05/google-dalles-data-centre-has-serious-cooling-needs/
http://www.greenm3.com/2009/10/googles-secret-to-efficient-data-center-design-ability-to-predict-
performance.html
Unstructured
      Structured

NOSQL





http://nosql.mypopescu.com/
Applications




Applications




Particle Physics





http://upload.wikimedia.org/wikipedia/commons/f/fc/
CERN_LHC_Tunnel1.jpg
Financial Trends




Contextual Ads




Hadoop Family



Hadoop
       Components




the Players
                                        the PlayAs





http://www.flickr.com/photos/mandj98/3804322095/
http://www.flickr.com/photos/8583446@N05/3304141843/
http://www.flickr.com/photos/joits/219824254/
http://www.flickr.com/photos/streetfly_jz/2312194534/
http://www.flickr.com/photos/sybrenstuvel/2811467787/
http://www.flickr.com/photos/lacklusters/2080288154/
MapReduce




The process




Start
                            Map




Grouping




Reduce




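The Start → Map → Grouping → Reduce flow on these slides can be walked through step by step in Python (a toy in-memory simulation; real Hadoop performs the grouping as a distributed sort/shuffle across machines):

```python
from itertools import groupby

# Start: some input records. Here we count words by their first letter.
records = ["apple", "avocado", "banana", "cherry", "blueberry"]

# Map: each record becomes a (key, value) pair.
mapped = [(word[0], 1) for word in records]

# Grouping (shuffle/sort): collect all values that share a key.
mapped.sort(key=lambda kv: kv[0])
grouped = {k: [v for _, v in pairs]
           for k, pairs in groupby(mapped, key=lambda kv: kv[0])}

# Reduce: collapse each group to a single result.
reduced = {k: sum(vs) for k, vs in grouped.items()}
print(reduced)  # {'a': 2, 'b': 2, 'c': 1}
```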
MapReduce
                 Demo



HDFS



HDFS Basics




Data Overload




HDFS Demo



Pig



Pig Basics





http://hadoop.apache.org/pig/docs/r0.3.0/getstarted.html#Sample+Code
Pig Sample


      A = load 'passwd' using PigStorage(':');  -- split each line of the passwd file on ':'
      B = foreach A generate $0 as id;          -- keep only the first field (the username)
      dump B;                                   -- print the result to the console
      store B into 'id.out';                    -- also write it to the 'id.out' directory




Pig demo



Hive



Hive Basics




Sqoop





http://www.cloudera.com/hadoop-sqoop
HBase



HBase Basics




HBase Demo



Amazon
                    Elastic
                     MapReduce





Launched in April 2009
Save results to S3 buckets
http://aws.amazon.com/elasticmapreduce/#functionality
EMR Languages




EMR Pricing





Pay for both columns additively
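Paying "both columns additively" means the per-hour bill is the base EC2 instance rate plus the EMR surcharge, multiplied by the number of instances and hours. A quick sketch with hypothetical rates (illustrative numbers, not Amazon's actual pricing):

```python
# Hypothetical per-instance-hour rates: illustrative only.
ec2_rate = 0.085   # base EC2 charge for the instance itself
emr_rate = 0.015   # additional Elastic MapReduce charge

instances, hours = 10, 4
total = instances * hours * (ec2_rate + emr_rate)
print(f"${total:.2f}")  # $4.00
```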
EMR Functions




Final
                  Thoughts



“Ha! Your Hadoop is slower than my Hadoop!”
“Shut up! I’m reducing.”

http://www.flickr.com/photos/robryb/14826486/sizes/l/
Does this Hadoop make my list look ugly?

http://www.flickr.com/photos/mckaysavage/1037160492/sizes/l/
Use Hadoop!





http://www.flickr.com/photos/robryb/14826417/sizes/l/
Thanks



Hadoop




Contact




Credits




Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Sm...Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Sm...
Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Sm...Yahoo Developer Network
 
Self adjusting slot configurations for homogeneous and heterogeneous hadoop c...
Self adjusting slot configurations for homogeneous and heterogeneous hadoop c...Self adjusting slot configurations for homogeneous and heterogeneous hadoop c...
Self adjusting slot configurations for homogeneous and heterogeneous hadoop c...LeMeniz Infotech
 
E5 05 ijcite august 2014
E5 05 ijcite august 2014E5 05 ijcite august 2014
E5 05 ijcite august 2014ijcite
 
Parallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A SurveyParallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A SurveyKyong-Ha Lee
 

Similar to An Intro to Hadoop (20)

Mapreduce Osdi04
Mapreduce Osdi04Mapreduce Osdi04
Mapreduce Osdi04
 
Data Processing in the Work of NoSQL? An Introduction to Hadoop
Data Processing in the Work of NoSQL? An Introduction to HadoopData Processing in the Work of NoSQL? An Introduction to Hadoop
Data Processing in the Work of NoSQL? An Introduction to Hadoop
 
Mapreduce2008 cacm
Mapreduce2008 cacmMapreduce2008 cacm
Mapreduce2008 cacm
 
Map reduce
Map reduceMap reduce
Map reduce
 
Map reduceoriginalpaper mandatoryreading
Map reduceoriginalpaper mandatoryreadingMap reduceoriginalpaper mandatoryreading
Map reduceoriginalpaper mandatoryreading
 
International Journal of Engineering Inventions (IJEI)
International Journal of Engineering Inventions (IJEI)International Journal of Engineering Inventions (IJEI)
International Journal of Engineering Inventions (IJEI)
 
Map reduce
Map reduceMap reduce
Map reduce
 
2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)
 
2 mapreduce-model-principles
2 mapreduce-model-principles2 mapreduce-model-principles
2 mapreduce-model-principles
 
Lecture 1 mapreduce
Lecture 1  mapreduceLecture 1  mapreduce
Lecture 1 mapreduce
 
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENTLARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
 
Mochi: Visual Log-Analysis Based Tools for Debugging Hadoop
Mochi: Visual Log-Analysis Based Tools for Debugging HadoopMochi: Visual Log-Analysis Based Tools for Debugging Hadoop
Mochi: Visual Log-Analysis Based Tools for Debugging Hadoop
 
MAP REDUCE SLIDESHARE
MAP REDUCE SLIDESHAREMAP REDUCE SLIDESHARE
MAP REDUCE SLIDESHARE
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Parallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARNParallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARN
 
Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Sm...
Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Sm...Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Sm...
Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Sm...
 
Self adjusting slot configurations for homogeneous and heterogeneous hadoop c...
Self adjusting slot configurations for homogeneous and heterogeneous hadoop c...Self adjusting slot configurations for homogeneous and heterogeneous hadoop c...
Self adjusting slot configurations for homogeneous and heterogeneous hadoop c...
 
E5 05 ijcite august 2014
E5 05 ijcite august 2014E5 05 ijcite august 2014
E5 05 ijcite august 2014
 
CLOUD BIOINFORMATICS Part1
 CLOUD BIOINFORMATICS Part1 CLOUD BIOINFORMATICS Part1
CLOUD BIOINFORMATICS Part1
 
Parallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A SurveyParallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A Survey
 

More from Matthew McCullough

Using Git and GitHub Effectively at Emerge Interactive
Using Git and GitHub Effectively at Emerge InteractiveUsing Git and GitHub Effectively at Emerge Interactive
Using Git and GitHub Effectively at Emerge InteractiveMatthew McCullough
 
All About GitHub Pull Requests
All About GitHub Pull RequestsAll About GitHub Pull Requests
All About GitHub Pull RequestsMatthew McCullough
 
Git Graphs, Hashes, and Compression, Oh My
Git Graphs, Hashes, and Compression, Oh MyGit Graphs, Hashes, and Compression, Oh My
Git Graphs, Hashes, and Compression, Oh MyMatthew McCullough
 
Git and GitHub at the San Francisco JUG
 Git and GitHub at the San Francisco JUG Git and GitHub at the San Francisco JUG
Git and GitHub at the San Francisco JUGMatthew McCullough
 
Migrating from Subversion to Git and GitHub
Migrating from Subversion to Git and GitHubMigrating from Subversion to Git and GitHub
Migrating from Subversion to Git and GitHubMatthew McCullough
 
Build Lifecycle Craftsmanship for the Transylvania JUG
Build Lifecycle Craftsmanship for the Transylvania JUGBuild Lifecycle Craftsmanship for the Transylvania JUG
Build Lifecycle Craftsmanship for the Transylvania JUGMatthew McCullough
 
Transylvania JUG Pre-Meeting Announcements
Transylvania JUG Pre-Meeting AnnouncementsTransylvania JUG Pre-Meeting Announcements
Transylvania JUG Pre-Meeting AnnouncementsMatthew McCullough
 
Game Theory for Software Developers at the Boulder JUG
Game Theory for Software Developers at the Boulder JUGGame Theory for Software Developers at the Boulder JUG
Game Theory for Software Developers at the Boulder JUGMatthew McCullough
 
Cascading Through Hadoop for the Boulder JUG
Cascading Through Hadoop for the Boulder JUGCascading Through Hadoop for the Boulder JUG
Cascading Through Hadoop for the Boulder JUGMatthew McCullough
 

More from Matthew McCullough (20)

Using Git and GitHub Effectively at Emerge Interactive
Using Git and GitHub Effectively at Emerge InteractiveUsing Git and GitHub Effectively at Emerge Interactive
Using Git and GitHub Effectively at Emerge Interactive
 
All About GitHub Pull Requests
All About GitHub Pull RequestsAll About GitHub Pull Requests
All About GitHub Pull Requests
 
Adam Smith Builds an App
Adam Smith Builds an AppAdam Smith Builds an App
Adam Smith Builds an App
 
Git's Filter Branch Command
Git's Filter Branch CommandGit's Filter Branch Command
Git's Filter Branch Command
 
Git Graphs, Hashes, and Compression, Oh My
Git Graphs, Hashes, and Compression, Oh MyGit Graphs, Hashes, and Compression, Oh My
Git Graphs, Hashes, and Compression, Oh My
 
Git and GitHub at the San Francisco JUG
 Git and GitHub at the San Francisco JUG Git and GitHub at the San Francisco JUG
Git and GitHub at the San Francisco JUG
 
Finding Things in Git
Finding Things in GitFinding Things in Git
Finding Things in Git
 
Git and GitHub for RallyOn
Git and GitHub for RallyOnGit and GitHub for RallyOn
Git and GitHub for RallyOn
 
Migrating from Subversion to Git and GitHub
Migrating from Subversion to Git and GitHubMigrating from Subversion to Git and GitHub
Migrating from Subversion to Git and GitHub
 
Git Notes and GitHub
Git Notes and GitHubGit Notes and GitHub
Git Notes and GitHub
 
Intro to Git and GitHub
Intro to Git and GitHubIntro to Git and GitHub
Intro to Git and GitHub
 
Build Lifecycle Craftsmanship for the Transylvania JUG
Build Lifecycle Craftsmanship for the Transylvania JUGBuild Lifecycle Craftsmanship for the Transylvania JUG
Build Lifecycle Craftsmanship for the Transylvania JUG
 
Transylvania JUG Pre-Meeting Announcements
Transylvania JUG Pre-Meeting AnnouncementsTransylvania JUG Pre-Meeting Announcements
Transylvania JUG Pre-Meeting Announcements
 
Game Theory for Software Developers at the Boulder JUG
Game Theory for Software Developers at the Boulder JUGGame Theory for Software Developers at the Boulder JUG
Game Theory for Software Developers at the Boulder JUG
 
Cascading Through Hadoop for the Boulder JUG
Cascading Through Hadoop for the Boulder JUGCascading Through Hadoop for the Boulder JUG
Cascading Through Hadoop for the Boulder JUG
 
JQuery Mobile
JQuery MobileJQuery Mobile
JQuery Mobile
 
R Data Analysis Software
R Data Analysis SoftwareR Data Analysis Software
R Data Analysis Software
 
Please, Stop Using Git
Please, Stop Using GitPlease, Stop Using Git
Please, Stop Using Git
 
Dr. Strangedev
Dr. StrangedevDr. Strangedev
Dr. Strangedev
 
Jenkins for One
Jenkins for OneJenkins for One
Jenkins for One
 

Recently uploaded

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 

Recently uploaded (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 

An Intro to Hadoop

  • 3.
  • 4. MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat (jeff@google.com, sanjay@google.com), Google, Inc. To appear in OSDI 2004. Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day. (The slide shows the paper’s full first page, including the opening of the Introduction and the section-by-section outline.)
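The programming model the abstract describes takes only a few lines to sketch. Below is a minimal, single-process word count in plain Python (not the Hadoop or Google API; the function names are ours) showing the map, shuffle-by-key, and reduce steps:

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # map: emit an intermediate (word, 1) pair for each word in the document
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    # reduce: merge all intermediate values associated with the same key
    return word, sum(counts)

def run(docs):
    # shuffle: group intermediate pairs by key, then reduce each group
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for word, count in map_fn(doc_id, text):
            groups[word].append(count)
    return dict(reduce_fn(w, c) for w, c in groups.items())

print(run({"d1": "the quick fox", "d2": "the fox"}))
# {'the': 2, 'quick': 1, 'fox': 2}
```

A real MapReduce run performs the same grouping, but spreads the map and reduce calls across thousands of machines and re-executes failed tasks.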
  • 5. (the same paper page, zoomed)
  • 6. (the same paper page, zoomed: abstract, left column)
  • 7. (the same paper page, zoomed: abstract, right column)
  • 9.
  • 11. Today
  • 12. Today
  • 13. Today
  • 16. 4 gb: $74.85
  • 17. 1 tb vs 4 gb: $74.85
  • 18. vs
  • 19. $10,000 vs $1,000
  • 20. vs
  • 21. Buy your way out of Failure vs Go Cheap: Failure is inevitable
  • 25. NOSQL
  • 26. NOSQL
  • 27. NOSQL
  • 61. Start
  • 62. Map
  • 66. HDFS
  • 67.
  • 77.
  • 79. Pig
  • 81. Pig Sample
        A = load 'passwd' using PigStorage(':'); -- split each line of 'passwd' on ':'
        B = foreach A generate $0 as id;         -- keep only the first field, as 'id'
        dump B;                                  -- print the result to the console
        store B into 'id.out';                   -- and write it to 'id.out'
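For readers who don't know Pig Latin, the script on that slide amounts to a per-line field extraction. A rough plain-Python equivalent (hypothetical function name, no Hadoop involved) is:

```python
def extract_ids(lines, sep=":"):
    # like: load ... using PigStorage(':'); foreach A generate $0 as id
    return [line.split(sep)[0] for line in lines]

passwd = [
    "root:x:0:0:root:/root:/bin/bash",
    "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin",
]
print(extract_ids(passwd))  # ['root', 'daemon']
```

The difference, of course, is that Pig compiles this into MapReduce jobs that run over HDFS files rather than an in-memory list.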
  • 83. Hive
  • 85. Sqoop
  • 86. HBase
  • 89. on
  • 90.
• 91. Amazon Elastic MapReduce
• 103. “Ha! Your Hadoop is slower than my Hadoop!” “Shut up! I’m reducing.”
  • 105. Does this Hadoop make my list look ugly?
  • 111. Thanks
  • 112. Hadoop
• 119. MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
jeff@google.com, sanjay@google.com
Google, Inc.

Abstract
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.
Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.

1 Introduction
Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.
As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical “record” in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.
The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.
Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis
To appear in OSDI 2004
Friday, January 15, 2010 4
Seminal paper on MapReduce http://labs.google.com/papers/mapreduce.html
• 120. [Zoomed view of the same first page of the MapReduce paper shown on slide 119]
• 121. MapReduce history
• 122. http://www.cern.ch/
• 123. Origins http://en.wikipedia.org/wiki/Mapreduce
• 124. Today http://hadoop.apache.org
• 125. Today http://wiki.apache.org/hadoop/PoweredBy
• 128. 1 TB: $74.85
• 129. $100,000 vs $1,000
• 130. Buy your way out of Failure vs Go Cheap (Failure is inevitable). This concept doesn’t work well at weddings or dinner parties
• 131. Sproinnnng! Bzzzt! Crrrkt! http://www.robinmajumdar.com/2006/08/05/google-dalles-data-centre-has-serious-cooling-needs/ http://www.greenm3.com/2009/10/googles-secret-to-efficient-data-center-design-ability-to-predict-performance.html
• 132. Unstructured / Structured
• 133. NOSQL http://nosql.mypopescu.com/
• 136. Particle Physics http://upload.wikimedia.org/wikipedia/commons/f/fc/CERN_LHC_Tunnel1.jpg
• 140. Hadoop Components
• 141. the Players / the PlayAs http://www.flickr.com/photos/mandj98/3804322095/ http://www.flickr.com/photos/8583446@N05/3304141843/ http://www.flickr.com/photos/joits/219824254/ http://www.flickr.com/photos/streetfly_jz/2312194534/ http://www.flickr.com/photos/sybrenstuvel/2811467787/ http://www.flickr.com/photos/lacklusters/2080288154/
• 144. Start Map
• 147. MapReduce Demo
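The demo on slide 147 is the classic word-count job. As a hedged, single-process sketch of the map, shuffle, and reduce phases (plain Python, no Hadoop APIs; all function names here are illustrative, not part of any Hadoop interface):

```python
# Word count in the shape of MapReduce, run locally for illustration:
# the mapper emits (word, 1) pairs, a shuffle groups pairs by key,
# and the reducer sums each group's values.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit one (word, 1) pair per word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group intermediate pairs by key, as the framework would.
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reducer(key, values):
    # Merge all intermediate values for one key.
    return (key, sum(count for _, count in values))

def word_count(lines):
    pairs = [p for line in lines for p in mapper(line)]
    return dict(reducer(key, group) for key, group in shuffle(pairs))

print(word_count(["the quick brown fox", "the lazy dog"]))
# → {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

In a real Hadoop job the shuffle and the parallel execution are handled by the framework; only the mapper and reducer are user code.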
• 155. Pig Basics http://hadoop.apache.org/pig/docs/r0.3.0/getstarted.html#Sample+Code
• 156. Pig Sample
A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
dump B;
store B into 'id.out';
• 157. Pig demo
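For readers following the Pig sample on slide 156: a plain-Python sketch of what those four statements compute, i.e. split colon-delimited /etc/passwd-style records and project the first field ($0) as `id`. The helper names are invented for illustration; Pig actually compiles the script into MapReduce jobs over HDFS.

```python
# A = load 'passwd' using PigStorage(':');
def pig_storage(lines, delim=":"):
    # Split each input line into a tuple of fields on the delimiter.
    return [line.rstrip("\n").split(delim) for line in lines]

# B = foreach A generate $0 as id;
def project_first(records):
    # Keep only the first field ($0) of every record.
    return [row[0] for row in records]

passwd = [
    "root:x:0:0:root:/root:/bin/bash",
    "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin",
]
ids = project_first(pig_storage(passwd))
print(ids)  # dump B;  → ['root', 'daemon']
```

The final Pig statement, `store B into 'id.out';`, would write the projected ids back out as files, one part per reducer.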
• 160. Sqoop http://www.cloudera.com/hadoop-sqoop
• 165. Amazon Elastic MapReduce (launched April 2009; save results to S3 buckets) http://aws.amazon.com/elasticmapreduce/#functionality
• 167. EMR Pricing (pay for both columns additively)
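Since the pricing slide says to pay for both columns additively, the billed rate per instance-hour is the EC2 instance rate plus the EMR surcharge. A quick illustrative calculation (the rates below are placeholders, not actual AWS prices):

```python
# "Pay for both columns additively": every instance-hour of an EMR job
# is billed at the EC2 rate PLUS the EMR surcharge for that instance type.
def emr_job_cost(ec2_rate, emr_rate, instances, hours):
    # Total cost = combined hourly rate * number of instances * hours run.
    return (ec2_rate + emr_rate) * instances * hours

# Hypothetical example: 10 instances at $0.085/hr EC2 + $0.015/hr EMR, 4 hours.
cost = emr_job_cost(ec2_rate=0.085, emr_rate=0.015, instances=10, hours=4)
print(f"${cost:.2f}")  # → $4.00
```

Note that instance-hours are typically rounded up per instance, so short jobs on many instances can cost more than the raw arithmetic suggests.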
• 170. Final Thoughts
• 171. “Ha! Your Hadoop is slower than my Hadoop!” “Shut up! I’m reducing.” http://www.flickr.com/photos/robryb/14826486/sizes/l/
• 173. Does this Hadoop make my list look ugly? http://www.flickr.com/photos/mckaysavage/1037160492/sizes/l/
• 175. Use Hadoop! http://www.flickr.com/photos/robryb/14826417/sizes/l/