MapR, Implications for Integration

CMU – September 2011
  
Outline

•  MapR system overview
   •  Map-reduce review
   •  MapR architecture
   •  Performance results
   •  Map-reduce on MapR
•  Architectural implications
   •  Search indexing / deployment
   •  EM algorithm for machine learning
   •  … and more …
  
!"!#




Map-­‐Reduce	
  
                                    !"          @/-,9)                      !#
                                                A.0B




  Input	
                                                                                        Output	
  




                                    !"          @/-,9)                      !#
                                                A.0B



                                            Shuffle	
  
                   $%&'()"*"   +,&)!'%-(./%0)            12'!!3)"*4       536'-3) 8'(&'()"930)
              10/11/11	
           "*#     ©	
  MapR	
  Confiden0al	
     !'%-(./%0)  "*:                      3	
  
                                                                            "*7
Bottlenecks and Issues

•  Read-only files
•  Many copies in I/O path
•  Shuffle based on HTTP
   •  Can't use new technologies
   •  Eats file descriptors
•  Spills go to local file space
   •  Bad for skewed distribution of sizes
  
MapR Areas of Development

[Diagram of five areas: Map-Reduce, HBase, Ecosystem, Storage Services, and
Management.]
  
MapR Improvements

•  Faster file system
   •  Fewer copies
   •  Multiple NICs
   •  No file descriptor or page-buf competition
•  Faster map-reduce
   •  Uses distributed file system
   •  Direct RPC to receiver
   •  Very wide merges
  
MapR Innovations

•  Volumes
   •  Distributed management
   •  Data placement
•  Read/write random access file system
   •  Allows distributed meta-data
   •  Improved scaling
   •  Enables NFS access
•  Application-level NIC bonding
•  Transactionally correct snapshots and mirrors
  
MapR's Containers

Files/directories are sharded into blocks, which are placed into mini NNs
(containers) on disks.

•  Each container contains
   •  Directories & files
   •  Data blocks
•  Replicated on servers
•  No need to manage directly

Containers are 16-32 GB segments of disk, placed on nodes.
  
MapR's Containers

•  Each container has a replication chain
•  Updates are transactional
•  Failures are handled by rearranging replication
  
Container locations and replication

[Diagram: nodes N1, N2, and N3 each host several containers; the CLDB maps
each container to its replica nodes, e.g. (N1, N2), (N3, N2), (N1, N3).]

The container location database (CLDB) keeps track of the nodes hosting
each container and the replication chain order.
  
MapR Scaling

Containers represent 16-32 GB of data
•  Each can hold up to 1 billion files and directories
•  100M containers = ~2 exabytes (a very large cluster)

250 bytes of DRAM to cache a container
•  25 GB to cache all containers for a 2 EB cluster
   •  But not necessary; can page to disk
•  A typical large 10 PB cluster needs 2 GB

Container reports are 100x-1000x smaller than HDFS block reports
•  Serve 100x more data nodes
•  Increase container size to 64 GB to serve a 4 EB cluster
   •  Map/reduce not affected

(The arithmetic is checked in the sketch below.)
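A quick back-of-the-envelope check of the container arithmetic above; a
minimal sketch, assuming a 20 GB average container size (the midpoint of
the 16-32 GB range):

      # Sanity check of the container scaling numbers above.
      containers = 100_000_000          # 100M containers
      avg_container_bytes = 20e9        # assumed average: midpoint of 16-32 GB
      dram_per_container = 250          # bytes of DRAM to cache one container

      total_data = containers * avg_container_bytes  # ~2e18 bytes = ~2 EB
      cache_dram = containers * dram_per_container   # 2.5e10 bytes = 25 GB

      print(f"{total_data / 1e18:.1f} EB of data; "
            f"{cache_dram / 1e9:.0f} GB DRAM to cache")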
  
MapR's Streaming Performance

[Bar charts: streaming throughput in MB per sec (0-2250) for Read and
Write, on 11 x 7200 rpm SATA and 11 x 15K rpm SAS disks, comparing raw
hardware, MapR, and Hadoop. Higher is better.]

Tests:  i. 16 streams x 120 GB    ii. 2000 streams x 1 GB
  
Terasort on MapR

10+1 nodes: 8 core, 24 GB DRAM, 11 x 1 TB SATA 7200 rpm

[Bar charts: elapsed time in minutes for 1.0 TB and 3.5 TB sorts, MapR vs.
Hadoop. Lower is better.]
  
HBase on MapR

YCSB random read with 1 billion 1K records
10+1 node cluster: 8 core, 24 GB DRAM, 11 x 1 TB 7200 rpm

[Bar chart: records per second (0-25,000) under Zipfian and uniform key
distributions, MapR vs. Apache. Higher is better.]
  
Small Files (Apache Hadoop, 10 nodes)

[Chart: file-create rate (files/sec) vs. number of files (millions), out of
the box and tuned.]

Op:  - create file
     - write 100 bytes
     - close

Notes:
- NN not replicated
- NN uses 20 GB DRAM
- DN uses 2 GB DRAM
  
MUCH faster for some operations

Same 10 nodes …

[Chart: create rate vs. number of files (millions).]
  
What MapR is not

•  Volumes != federation
   •  MapR supports > 10,000 volumes, all with independent placement
      and defaults
   •  Volumes support snapshots and mirroring
•  NFS != FUSE
   •  Checksum and compress at gateway
   •  IP fail-over
   •  Read/write/update semantics at full speed
•  MapR != maprfs
  
New Capabilities
  
Alternative NFS mounting models

•  Export to the world
   •  NFS gateway runs on selected gateway hosts
•  Local server
   •  NFS gateway runs on local host
   •  Enables local compression and checksumming
•  Export to self
   •  NFS gateway runs on all data nodes, mounted from localhost
      (see the sketch below)
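To make the "export to self" model concrete: with the cluster mounted from
localhost, a task uses ordinary POSIX I/O. A minimal hypothetical sketch;
the /mapr/my.cluster path layout is borrowed from the R example near the
end of the deck, and the file names are invented:

      # Task on a cluster node; the cluster is NFS-mounted from localhost
      # under /mapr/my.cluster, so plain file operations reach cluster data.
      in_path = "/mapr/my.cluster/data/input.txt"     # hypothetical paths
      out_path = "/mapr/my.cluster/data/output.txt"

      with open(in_path) as src, open(out_path, "w") as dst:
          for line in src:
              dst.write(line.upper())  # any ordinary POSIX I/O works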
  
Export to the world

[Diagram: a remote NFS client mounts the cluster through a bank of NFS
server gateways.]
  
Local server

[Diagram: the application and NFS server run together on the client host,
which talks to the cluster nodes.]
  
Universal export to self

[Diagram: every cluster node runs a task alongside a local NFS server,
mounted from localhost.]
  
Nodes are identical

[Diagram: three cluster nodes, each running the same task plus local NFS
server pairing.]
  
Application architecture

•  High performance map-reduce is nice

•  But algorithmic flexibility is even nicer
  
Sharded text indexing

[Diagram: a map step assigns input documents to shards; reducers index text
to local disk and then copy the index to the clustered index storage. A
copy to local disk is typically required before the search engine can load
an index.]
  
Sharded text indexing

•  Mapper assigns document to shard
   •  Shard is usually hash of document id (see the sketch below)
•  Reducer indexes all documents for a shard
   •  Indexes created on local disk
   •  On success, copy index to DFS
   •  On failure, delete local files
•  Must avoid directory collisions
   •  can't use shard id!
•  Must manage and reclaim local disk space
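A minimal Hadoop-streaming-style sketch of the shard-assignment mapper
described above; hypothetical code, assuming tab-separated doc_id/text
input and an invented shard count:

      import sys
      import hashlib

      N_SHARDS = 16  # assumed shard count

      for line in sys.stdin:
          doc_id, text = line.rstrip("\n").split("\t", 1)
          # A stable hash of the document id picks the shard; all documents
          # of a shard then reach the same reducer.
          shard = int(hashlib.md5(doc_id.encode()).hexdigest(), 16) % N_SHARDS
          print(f"{shard}\t{doc_id}\t{text}")

Note that the reducer's local scratch directory needs a unique name, such
as one derived from the task attempt id; as the slide warns, the shard id
alone collides when a failed reducer is retried.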
  
Conventional data flow

[Diagram: the same indexing pipeline, annotated with its failure modes.]

Failure of a reducer causes garbage to accumulate on the local disk.
Failure of a search engine requires another download of the index from
clustered storage.
  
Simplified NFS data flows

[Diagram: reducers index into the task work directory via NFS; the search
engine reads the mirrored index directly from clustered index storage.]

Failure of a reducer is cleaned up by the map-reduce framework.
  
Simplified NFS data flows

[Diagram: the clustered index is mirrored out to the search engines.]

Mirroring allows exact placement of index data; arbitrary levels of
replication are also possible.
  
How about another one?
  
K-means

•  Classic E-M based algorithm
•  Given cluster centroids,
   •  Assign each data point to nearest centroid
   •  Accumulate new centroids
   •  Rinse, lather, repeat (one iteration is sketched below)
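A minimal single-machine sketch of one E-M iteration, assuming numpy; the
map-reduce version on the following slides distributes the assignment step
across mappers:

      import numpy as np

      def kmeans_step(points, centroids):
          # E-step: distance from every point to every centroid,
          # then assign each point to its nearest centroid.
          d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
          nearest = d.argmin(axis=1)
          # M-step: each new centroid is the mean of its assigned points
          # (keep the old centroid if a cluster received no points).
          new_centroids = np.array([
              points[nearest == k].mean(axis=0) if np.any(nearest == k)
              else centroids[k]
              for k in range(len(centroids))
          ])
          return new_centroids, nearest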
  
K-means, the movie

[Diagram: input flows through "assign to nearest centroid" and "aggregate
new centroids"; the new centroids loop back for the next pass.]
  
But …
  
Parallel Stochastic Gradient Descent

[Diagram: the input is split so that sub-models are trained in parallel;
the sub-models are then averaged into a single model. The averaging step is
sketched below.]
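The averaging step can be as simple as a (weighted) mean of the sub-model
parameters. A hypothetical sketch, assuming each sub-model is a numpy
parameter vector:

      import numpy as np

      def average_models(sub_models, weights=None):
          # sub_models: list of 1-D parameter vectors trained on disjoint
          # input splits; weights: optional per-split example counts.
          return np.average(np.stack(sub_models), axis=0, weights=weights)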
  
Variational Dirichlet Assignment

[Diagram: the input is split so that sufficient statistics are gathered in
parallel and then combined to update the model.]
  
Old tricks, new dogs

•  Mapper
   •  Assign point to cluster
   •  Emit cluster id, (1, point)
•  Combiner and reducer
   •  Sum counts, weighted sum of points
   •  Emit cluster id, (n, sum/n)
•  Output to HDFS

[Annotations: the centroids are read from local disk, having been copied
there from HDFS by the distributed cache; the output is written by
map-reduce.]

(The merge step shared by combiner and reducer is sketched below.)
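The (n, mean) pairs emitted above merge associatively, which is why the
same code can serve as both combiner and reducer. A hypothetical sketch of
the merge, with points represented as lists of floats:

      def merge_means(values):
          # values: iterable of (count, mean_vector) pairs for one cluster id.
          # The weighted mean of weighted means is the overall mean, so this
          # works identically as combiner and as reducer.
          n_total, mean = 0, None
          for n, m in values:
              if mean is None:
                  n_total, mean = n, list(m)
              else:
                  new_total = n_total + n
                  mean = [(a * n_total + b * n) / new_total
                          for a, b in zip(mean, m)]
                  n_total = new_total
          return n_total, mean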
  
Old tricks, new dogs

•  Mapper
   •  Assign point to cluster
   •  Emit cluster id, (1, point)
•  Combiner and reducer
   •  Sum counts, weighted sum of points
   •  Emit cluster id, (n, sum/n)
•  Output to HDFS

[Annotations: on MapR FS, the centroids are read directly from NFS; the
output is still written by map-reduce.]
  
Poor man's Pregel

•  Mapper

      while not done:
          read and accumulate input models
          for each input:
              accumulate model
          write model
          synchronize
          reset input format
      emit summary

•  The I/O lines (shown in bold on the original slide) can use conventional
   I/O via NFS
  
Click modeling architecture

[Diagram: a map-reduce stage performs feature extraction, down sampling,
and a data join against side-data (now read via NFS); a sequential SGD
learning step consumes the result.]
  
Click modeling architecture

[Diagram: the same pipeline, but map-reduce cooperates with NFS so that
several sequential SGD learning processes run in parallel on the joined
data.]
  
And another…
  
Hybrid model flow

[Diagram: one map-reduce stage performs feature extraction and down
sampling; a second computes SVD (PageRank, spectral methods). The results
feed downstream modeling and a deployed model; the hand-off between the
map-reduce world and the deployed model is marked "??".]
  
  
Hybrid model flow

[Diagram: the same pipeline, with the "??" hand-off resolved by a mix of
sequential processing and map-reduce feeding downstream modeling and the
deployed model.]
  
And visualization…
  
Trivial visualization interface

•  Map-reduce output is visible via NFS

      $ R
      > x <- read.csv("/mapr/my.cluster/home/ted/data/foo.out")
      > plot(error ~ t, x)
      > q(save='n')

•  Legacy visualization just works
  
Conclusions

•  We used to know all this
•  Tab completion used to work
•  5 years of work-arounds have clouded our memories

•  We just have to remember the future
  

  • 47. Conclusions   •  We  used  to  know  all  this   •  Tab  comple0on  used  to  work   •  5  years  of  work-­‐arounds  have  clouded  our   memories   •  We  just  have  to  remember  the  future   10/11/11   ©  MapR  Confiden0al   47