SlideShare a Scribd company logo
1 of 81
Download to read offline
TechMesh	
  London	
  2012
        December	
  5,	
  2012
        dean.wampler@thinkbiganalyCcs.com
        @deanwampler	
  
        polyglotprogramming.com/talks




                                                           MapReduce	
  
                                                             and	
  Its	
  
Tuesday, December 4, 12
                                                           Discontents
MR has been a useful technology, but it has a “first generation” feel. What’s next?

(All photos are © Dean Wampler, 2011-2012, All Rights Reserved. Otherwise, the
presentation is free to use.)
About	
  Me...
        dean.wampler@thinkbiganalyCcs.com
        @deanwampler




                                                        Programming



                                                        Hive
                          Functional
                          Programming
                               for Java Developers
                                                                          Dean Wampler,
                                        Dean Wampler                  Jason Rutherglen &
                                                                        Edward Capriolo




Tuesday, December 4, 12

My books and contact information.
Big	
  Data
          Data	
  so	
  big	
  that	
  
      tradiConal	
  soluCons	
  are	
  
       too	
  slow,	
  too	
  small,	
  or	
  
        too	
  expensive	
  to	
  use.

                                            3               Hat tip: Bob Korbus
Tuesday, December 4, 12

It’s a buzz word, but generally associated with the problem of data sets too big to manage
with traditional SQL databases. A parallel development has been the NoSQL movement that is
good at handling semistructured data, scaling, etc.
3	
  Trends
                                                4
Tuesday, December 4, 12
Three trends influence my thinking...
Data	
  Size	
  




                                                                      5
Tuesday, December 4, 12
Data volumes are obviously growing… rapidly.
Facebook now has over 600PB (Petabytes) of data in Hadoop clusters!
Formal	
  Schemas	
  ➡




                                                                        6
Tuesday, December 4, 12
There is less emphasis on “formal” schemas and domain models, i.e., both relational models of data and OO models, because data schemas and
sources change rapidly, and we need to integrate so many disparate sources of data. So, using relatively-agnostic software, e.g., collections of
things where the software is more agnostic about the structure of the data and the domain, tends to be faster to develop, test, and deploy. Put
another way, we find it more useful to build somewhat agnostic applications and drive their behavior through data...
Data-­‐Driven	
  Programs	
  




                                                                         7
Tuesday, December 4, 12
This is the 2nd generation “Stanley”, the most successful self-driving car ever built (by a Google-Stanford) team. Machine learning is growing in
importance. Here, generic algorithms and data structures are trained to represent the “world” using data, rather than encoding a model of the
world in the software itself. It’s another example of generic algorithms that produce the desired behavior by being application agnostic and data
driven, rather than hard-coding a model of the world. (In practice, however, a balance is struck between completely agnostic apps and some
engineering towards for the specific problem, as you might expect...)
Big	
  Data
                                                                               Architecture
                                                                           8
Tuesday, December 4, 12
What should software architectures look like for these kinds of systems?
Other, Object-
                                                                                                                                    1
                                                                                          Oriented
                                                           Objects                      Domain Logic

                                                               5

                                      Object Model
                                                                                                                                                          Query
                                        ParentB1                                                                                                     SQL
                                        toJSON
                                                                                       4
                                                                                                                                                                  2
                            ChildB1                    ChildB2                                           Object-
                           toJSON                     toJSON                                            Relational
                                                                                                        Mapping


                                                                                                        Result Set

                                                                                                                       3



                                                                                                                              Database




                                                                                                 9
Tuesday, December 4, 12
Traditionally, we’ve kept a rich, in-memory domain model requiring an ORM to convert persistent data into the model. This is resource overhead and complexity we can’t afford in big data
systems. Rather, we should treat the result set as it is, a particular kind of collection, do the minimal transformation required to exploit our collections libraries and classes representing some
domain concepts (e.g., Address, StockOption, etc.), then write functional code to implement business logic (or drive emergent behavior with machine learning algorithms…)

The toJSON methods are there because we often convert these object graphs back into fundamental structures, such as the maps and arrays of JSON so we can send them to the browser!
Relational/
                                                                                                                        Functional
                                                                                                                       Domain Logic


                                                                                                                                              1
                                                                                   Functional
                                                                                  Abstractions
                                                                                                                                                        SQL
                                                                                                           4

                                                                                                         Functional                                           Query
                                                                                                        Wrapper for
                                        Other, Object-
                                          Oriented
                                                                1                                      Relational Data
                           Objects      Domain Logic

                             5                                                                                                                                  2
                 Object Model
                                                                         Query
                                                                        SQL
                  ParentB1
                  toJSON
                                                                                                         Result Set
                                        4
                                                                              2
            ChildB1      ChildB2                 Object-
           toJSON       toJSON                  Relational
                                                Mapping                                                                   3
                                                Result Set

                                                         3
                                                                                                                                  Database
                                                             Database




                                                                                             10
Tuesday, December 4, 12
But the traditional systems are a poor fit for this new world: 1) they add too much overhead in computation (the ORM layer, etc.) and memory (to store the objects). Most of what we do with
data is mathematical transformation, so we’re far more productive (and runtime efficient) if we embrace fundamental data structures used throughout (lists, sets, maps, trees) and build rich
transformations into those libraries, transformations that are composable to implement business logic.
Relational/
                                                                                                                        Functional
                                                                                                                       Domain Logic

        • Focus on:                                                                                                                           1
                                                                                  Functional

                 • Lists
                                                                                 Abstractions
                                                                                                                                                        SQL
                                                                                                           4



                 • Maps
                                                                                                         Functional                                           Query
                                                                                                        Wrapper for
                                                                                                       Relational Data


                 • Sets
                                                                                                                                                                2

                                                                                                         Result Set


                 • Trees                                                                                                  3




                 • ...
                                                                                                                                 Database




                                                                                             11
Tuesday, December 4, 12
But the traditional systems are a poor fit for this new world: 1) they add too much overhead in computation (the ORM layer, etc.) and memory (to store the objects). Most of what we do with
data is mathematical transformation, so we’re far more productive (and runtime efficient) if we embrace fundamental data structures used throughout (lists, sets, maps, trees) and build rich
transformations into those libraries, transformations that are composable to implement business logic.
Web Client 1                       Web Client 2                        Web Client 3




                                                                                   ParentB1
                                                                                  methods



                                                                      ChildB1                   ChildB2
                                                                     methods                   methods




                                                                         Database                    Files




                                                                                          12
Tuesday, December 4, 12
In a broader view, object models tend to push us towards centralized, complex systems that don’t decompose well and stifle reuse and optimal deployment scenarios. FP code makes it
easier to write smaller, focused services that we compose and deploy as appropriate.
Web Client 1                     Web Client 2                      Web Client 3




                                                                                               Process 1                            Process 2                      Process 3


            Web Client 1           Web Client 2           Web Client 3




                                    ParentB1
                                   methods



                            ChildB1        ChildB2
                           methods        methods




                                                                                                                      Database                  Files




                             Database             Files




                                                                                                13
Tuesday, December 4, 12
In a broader view, object models tend to push us towards centralized, complex systems that don’t decompose well and stifle reuse and optimal deployment scenarios. FP code makes it
easier to write smaller, focused services that we compose and deploy as appropriate. Each “ProcessN” could be a parallel copy of another process, for horizontal, “shared-nothing”
scalability, or some of these processes could be other services…
Smaller, focused services scale better, especially horizontally. They also don’t encapsulate more business logic than is required, and this (informal) architecture is also suitable for scaling
ML and related algorithms.
Web Client 1         Web Client 2             Web Client 3



         • Data Size

         • Formal
           Schema ➡
                                                                                     Process 1                Process 2             Process 3




         • Data-Driven
           Programs                                                                                Database               Files




                                                                                     14
Tuesday, December 4, 12
And this structure better fits the trends I outlined at the beginning of the talk.
Web Client 1         Web Client 2             Web Client 3




   • MapReduce                                                                       Process 1                Process 2             Process 3




   • Distributed FS                                                                                Database               Files




                                                                                      15
Tuesday, December 4, 12
And MapReduce + a distributed file system, like Hadoop’s MapReduce and HDFS, fit this model.
What	
  Is
                               MapReduce?
                          16
Tuesday, December 4, 12
MapReduce	
  in	
  Hadoop
                      Let’s	
  look	
  at	
  a	
  
                   MapReduce	
  algorithm:	
  
                       WordCount.
         (The	
  Hello	
  World	
  of	
  MapReduce…)

Tuesday, December 4, 12
Let’s walk through the “Hello World” of MapReduce, the Word Count algorithm, at a conceptual level. We’ll see actual code shortly!
Input           Mappers           Sort,            Reducers                                                Output
                                           Shuffle

      Hadoop uses
      MapReduce
                                                                                                               a2
                                                                                                               hadoop 1
                                                                                                               is 2


        There is a
        Map phase
                          We	
  need	
  to	
  convert	
  
                                 the	
  Input	
                                                                map 1
                                                                                                               mapreduce 1
                                                                                                               phase 2

                           into	
  the	
  Output.
                                                                                                               reduce 1
                                                                                                               there 2
        There is a
      Reduce phase                                                                                             uses 1



                                                                   Copyright	
  ©	
  2011-­‐2012,	
  Think	
  Big	
  AnalyCcs,	
  All	
  Rights	
  Reserved
Tuesday, December 4, 12

Four input documents, one left empty, the others with small phrases (for simplicity…). The word count
output is on the right (we’ll see why there are three output “documents”). We need to get from the input
on the left-hand side to the output on the right-hand side.
Input           Mappers          Sort,             Reducers                                               Output
                                          Shuffle

      Hadoop uses
      MapReduce
                                                                                                              a2
                                                                                                              hadoop 1
                                                                                                              is 2


        There is a
        Map phase
                                                                                                              map 1
                                                                                                              mapreduce 1
                                                                                                              phase 2




                                                                                                              reduce 1
                                                                                                              there 2
        There is a
      Reduce phase                                                                                            uses 1



                                                                  Copyright	
  ©	
  2011-­‐2012,	
  Think	
  Big	
  AnalyCcs,	
  All	
  Rights	
  Reserved
Tuesday, December 4, 12

Here is a schematic view of the steps in Hadoop MapReduce. Each Input file is read by a single
Mapper process (default: can be many-to-many, as we’ll see later).
The Mappers emit key-value pairs that will be sorted, then partitioned and “shuffled” to the reducers,
where each Reducer will get all instances of a given key (for 1 or more values).
Each Reducer generates the final key-value pairs and writes them to one or more files (based on the
size of the output).
Input           Mappers


      Hadoop uses
      MapReduce
                           (n, "…")




        There is a
        Map phase
                           (n, "…")




                            (n, "")




        There is a
      Reduce phase
                           (n, "…")




                                                           Copyright	
  ©	
  2011-­‐2012,	
  Think	
  Big	
  AnalyCcs,	
  All	
  Rights	
  Reserved
Tuesday, December 4, 12

Each document gets a mapper. All data is organized into key-value pairs; each line will be a
value and the offset position into the file will be the key, which we don’t care about. I’m
showing each document’s contents in a box and 1 mapper task (JVM process) per document.
Large documents might get split to several mapper tasks.
The mappers tokenize each line, one at a time, converting all words to lower case and
counting them...
Input           Mappers

                                      (hadoop, 1)
      Hadoop uses
      MapReduce
                           (n, "…")   (uses, 1)
                                      (mapreduce, 1)

                                      (there, 1)
                                      (is, 1)
        There is a
        Map phase
                           (n, "…")   (a, 1)
                                      (map, 1)
                                      (phase, 1)


                            (n, "")


                                      (there, 1)
                                      (is, 1)
        There is a
      Reduce phase
                           (n, "…")   (a, 1)
                                      (reduce, 1)
                                      (phase, 1)

                                                           Copyright	
  ©	
  2011-­‐2012,	
  Think	
  Big	
  AnalyCcs,	
  All	
  Rights	
  Reserved
Tuesday, December 4, 12

The mappers emit key-value pairs, where each key is one of the words, and the value is the
count. In the most naive (but also most memory efficient) implementation, each mapper
simply emits (word, 1) each time “word” is seen. However, this is IO inefficient!
Note that the mapper for the empty doc. emits no pairs, as you would expect.
Input           Mappers               Sort,           Reducers
                                               Shuffle
                                                                  0-9, a-l
      Hadoop uses                            (hadoop, 1)
      MapReduce
                           (n, "…")
                                         (mapreduce, 1)
                                        (uses, 1)
                                           (is, 1), (a, 1)
        There is a
        Map phase
                           (n, "…")                                m-q
                                           (map, 1),(phase,1)
                                        (there, 1)


                            (n, "")
                                                (phase,1)          r-z
                                      (is, 1), (a, 1)
                                            (there, 1),
        There is a
      Reduce phase
                           (n, "…")        (reduce, 1)



                                                                   Copyright	
  ©	
  2011-­‐2012,	
  Think	
  Big	
  AnalyCcs,	
  All	
  Rights	
  Reserved
Tuesday, December 4, 12

The mappers themselves don’t decide to which reducer each pair should be sent. Rather, the
job setup configures what to do and the Hadoop runtime enforces it during the Sort/Shuffle
phase, where the key-value pairs in each mapper are sorted by key (that is locally, not
globally) and then the pairs are routed to the correct reducer, on the current machine or
other machines.
Note how we partitioned the reducers, by first letter of the keys. (By default, MR just hashes
the keys and distributes them modulo # of reducers.)
Input           Mappers               Sort,         Reducers
                                               Shuffle
                                                                  0-9, a-l
      Hadoop uses                            (hadoop, 1)
      MapReduce
                           (n, "…")                            (a, [1,1]),
                                         (mapreduce, 1)      (hadoop, [1]),
                                                               (is, [1,1])
                                        (uses, 1)
                                           (is, 1), (a, 1)
        There is a
        Map phase
                           (n, "…")                               m-q
                                           (map, 1),(phase,1)
                                        (there, 1)              (map, [1]),
                                                            (mapreduce, [1]),
                                                              (phase, [1,1])
                            (n, "")
                                                (phase,1)          r-z
                                      (is, 1), (a, 1)        (reduce, [1]),
                                            (there, 1),      (there, [1,1]),
        There is a
      Reduce phase
                           (n, "…")        (reduce, 1)          (uses, 1)




                                                                   Copyright	
  ©	
  2011-­‐2012,	
  Think	
  Big	
  AnalyCcs,	
  All	
  Rights	
  Reserved
Tuesday, December 4, 12

The reducers are passed each key (word) and a collection of all the values for that key (the
individual counts emitted by the mapper tasks). The MR framework creates these collections
for us.
Input           Mappers               Sort,         Reducers                                              Output
                                               Shuffle
                                                                  0-9, a-l
      Hadoop uses                            (hadoop, 1)
      MapReduce
                           (n, "…")                            (a, [1,1]),                                    a2
                                         (mapreduce, 1)      (hadoop, [1]),                                   hadoop 1
                                                               (is, [1,1])                                    is 2
                                        (uses, 1)
                                           (is, 1), (a, 1)
        There is a
        Map phase
                           (n, "…")                               m-q
                                           (map, 1),(phase,1)
                                        (there, 1)              (map, [1]),                                   map 1
                                                            (mapreduce, [1]),                                 mapreduce 1
                                                              (phase, [1,1])                                  phase 2
                            (n, "")
                                                (phase,1)          r-z
                                      (is, 1), (a, 1)        (reduce, [1]),                                   reduce 1
                                            (there, 1),      (there, [1,1]),                                  there 2
        There is a
      Reduce phase
                           (n, "…")        (reduce, 1)          (uses, 1)                                     uses 1



                                                                   Copyright	
  ©	
  2011-­‐2012,	
  Think	
  Big	
  AnalyCcs,	
  All	
  Rights	
  Reserved
Tuesday, December 4, 12

The final view of the WordCount process flow. The reducer just sums the counts and writes the output.
The output files contain one line for each key (the word) and value (the count), assuming we’re using
text output. The choice of delimiter between key and value is up to you, but tab is common.
Input            Mappers               Sort,           Reducers                                             Output
                                                Shuffle
                                                                     0-9, a-l
      Hadoop uses                             (hadoop, 1)
      MapReduce
                            (n, "…")                              (a, [1,1]),                                   a2
                                          (mapreduce, 1)        (hadoop, [1]),                                  hadoop 1
                                                                  (is, [1,1])                                   is 2
                                         (uses, 1)
                                            (is, 1), (a, 1)
        There is a
        Map phase
                            (n, "…")                                 m-q
                                            (map, 1),(phase,1)
                                         (there, 1)              (map, [1]),                                    map 1
                                                             (mapreduce, [1]),                                  mapreduce 1
                                                               (phase, [1,1])                                   phase 2
                             (n, "")
            Map:                                 (phase,1)    Reduce:
                                                                      r-z

     • Transform	
  one	
  input	
  to	
  0-­‐N	
   •
                                       (is, 1), (a, 1)
                                             (there, 1),
                                                              Collect multiple
                                                                (reduce, [1]),
                                                                (there, [1,1]),
                                                              one output.
                                                                                                      inputs into
                                                                                                         reduce 1
                                                                                                                there 2
            outputs.	
  
        There is a
      Reduce phase
                            (n, "…")        (reduce, 1)            (uses, 1)                                    uses 1



                                                                     Copyright	
  ©	
  2011-­‐2012,	
  Think	
  Big	
  AnalyCcs,	
  All	
  Rights	
  Reserved
Tuesday, December 4, 12

To recap, a “map” transforms one input to one output, but this is generalized in MapReduce to be one
to 0-N. The output key-value pairs are distributed to reducers. The “reduce” collects together multiple
inputs with the same key into
History	
  of	
  
                                                  MapReduce
Tuesday, December 4, 12

Let’s review where MapReduce came from and its best-known, open-source incarnation,
Hadoop.
How	
  would	
  you	
  
                          index	
  the	
  web?




Tuesday, December 4, 12
How	
  would	
  you	
  
                          index	
  the	
  web?




Tuesday, December 4, 12

Did Google search the entire web in 0.26 seconds to find these ~49M results?
You	
  ask	
  a	
  phrase	
  and	
  
                  the	
  search	
  engine	
  finds	
  
                     the	
  best	
  match	
  in	
  
                   billions	
  of	
  web	
  pages.

Tuesday, December 4, 12
Actually,	
  Google	
  
                          computes	
  the	
  index	
  
                          that	
  maps	
  terms	
  to	
  
                           pages	
  in	
  advance.
                      Google’s famous Page Rank algorithm.
Tuesday, December 4, 12
In	
  the	
  early	
  2000s,	
  
              Google	
  invented	
  server	
  
             infrastructure	
  to	
  support	
  
                  PageRank,	
  etc...

Tuesday, December 4, 12
Google	
  File	
  System
                             for	
  Storage
                                          2003




Tuesday, December 4, 12

A distributed file system provides horizontal scalability and resiliency when file blocks are
duplicated around the cluster.
MapReduce
                          for	
  ComputaCon
                                          2004




Tuesday, December 4, 12

The compute model for processing all that data is MapReduce. It handles lots of boilerplate,
like breaking down jobs into tasks, distributing the tasks around the cluster, monitoring the
tasks, etc. You write your algorithm to the MR programming model.
About	
  this	
  Cme,	
  Doug	
  
                  CuGng,	
  the	
  creator	
  of	
  	
  
                  Lucene	
  was	
  working	
  on	
  
                            Nutch...

Tuesday, December 4, 12

Lucene is an open-source text search engine. Nutch is an open source web crawler.
He	
  implemented	
  clean-­‐
                       room	
  versions	
  of	
  
                 MapReduce	
  and	
  GFS...


Tuesday, December 4, 12
By	
  2006	
  ,	
  they	
  became	
  
                     part	
  of	
  a	
  separate	
  	
  
                         Apache	
  project,	
  
                          called	
  Hadoop.

Tuesday, December 4, 12

The name comes from a toy, stuffed elephant that Cutting’s son owned at the time.
Benefits	
  of	
  
  MapReduce




Tuesday, December 4, 12
The	
  best	
  way	
  to	
  
                 approach	
  Big	
  Data	
  is	
  to	
  
                   scale	
  horizontally.


Tuesday, December 4, 12

We can’t build vertical systems big enough and if we could, they would cost a fortune!
Hadoop	
  Design	
  Goals
                                              I /   O
                                        e !!
                                      iz d
                              in rh im ea
                            M ve
                               Ond	
  run	
  on	
  server-­‐class,	
  
                             Oh,	
  a
                                     commodity	
  	
  hardware.
Tuesday, December 4, 12

True of Google’s GFS and MapReduce, too. Minimizing disk and network I/O latency/
overhead is critical, because it’s the largest throughput bottleneck. So, optimization is a core
design goal of Hadoop (both MR and HDFS). It affects the features and performance of
everything in the stack above it, including high-level programming tools!
By	
  design,	
  Hadoop	
  is	
  	
  
                          great	
  for	
  batch	
  mode	
  
                             data	
  crunching.


Tuesday, December 4, 12

… but less so for “real-time” event handling, as we’ll discuss...
Hadoop	
  also	
  has	
  a	
  
                          vibrant	
  community	
  
                          that’s	
  evolving	
  the	
  
                               pla_orm.

Tuesday, December 4, 12

For the IT manager, especially at large, cautious organizations.
Commercial	
  support	
  is	
  
                     available	
  from	
  many	
  
                          companies.


Tuesday, December 4, 12

From Hadoop-oriented companies like Cloudera, MapR, and HortonWorks, to integrators like
IBM and Greenplum.
MapReduce	
  
                                                                 and	
  its	
  
                                                              Discontents




Tuesday, December 4, 12

Is MapReduce the end of the story? Does it meet all our needs? Let’s look at a few problems...
#1
                    It’s	
  hard	
  to	
  implement	
  
                        many	
  Algorithms	
  
                            in	
  MapReduce.


Tuesday, December 4, 12

Even word count is not “obvious”. When you get to fancier stuff like joins, group-bys, etc.,
the mapping from the algorithm to the implementation is not trivial at all. In fact,
implementing algorithms in MR is now a specialized body of knowledge...
#2
                          For	
  Hadoop	
  in	
  
                           parCcularly,	
  
                          the	
  Java	
  API	
  is
                           hard	
  to	
  use.

Tuesday, December 4, 12

The Hadoop Java API is even more verbose and tedious to use than it should be.
import org.apache.hadoop.io.*;
      import org.apache.hadoop.mapred.*;
      import java.util.StringTokenizer;

      class WCMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {

          static final IntWritable one = new IntWritable(1);
          static final Text word = new Text;  // Value will be set in a non-thread-safe way!

          @Override
          public void map(LongWritable key, Text valueDocContents,
                  OutputCollector<Text, IntWritable> output, Reporter reporter) {
              String[] tokens = valueDocContents.toString.split("s+");
              for (String wordString: tokens) {
                if (wordString.length > 0) {
                  word.set(wordString.toLowerCase);
                  output.collect(word, one);
                }
              }
            }
      }

      class WCReduce extends MapReduceBase
          implements Reducer[Text, IntWritable, Text, IntWritable] {

          public void reduce(Text keyWord, java.util.Iterator<IntWritable> valuesCounts,
                     OutputCollector<Text, IntWritable> output, Reporter reporter) {
            int totalCount = 0;
            while (valuesCounts.hasNext) {
              totalCount += valuesCounts.next.get;
            }
            output.collect(keyWord, new IntWritable(totalCount));
          }
      }
                                                                                   46
Tuesday, December 4, 12
This is intentionally too small to read and we’re not showing the main routine, which doubles the code size. The algorithm is simple, but the framework is in your
face. In the next several slides, notice which colors dominate. In this slide, it’s green for types (classes), with relatively few yellow functions that implement actual
operations.
The main routine I’ve omitted contains additional boilerplate for configuring and running the job. This is just the “core” MapReduce code. In fact, Word Count is not
too bad, but when you get to more complex algorithms, even conceptually simple ideas like relational-style joins and group-bys, the corresponding MapReduce
code in this API gets complex and tedious very fast!
Use	
  SQL!
                                                                   (SoluCon	
  #1)




                                                                       47
Tuesday, December 4, 12
I claimed we should be using FP for big data, so why not use SQL when you can! Here are 3 options.
SQL!

     CREATE TABLE docs (line STRING);
     LOAD DATA INPATH '/path/to/docs' INTO TABLE docs;

     CREATE TABLE word_counts AS
     SELECT word, count(1) AS count FROM
     (SELECT explode(split(line, 'W+')) AS word FROM docs) w
     GROUP BY word
     ORDER BY word;




                                               Word Count, in HiveQL

                                                                               48
Tuesday, December 4, 12
This is how you could implement word count in Hive. We’re using some Hive built-in functions for tokenizing words in each “line”, the one “column” in the docs
table, etc., etc.
 Use	
  SQL	
  when	
  you	
  can!

         • Hive:	
  SQL	
  on	
  top	
  of	
  MapReduce.
         • Shark:	
  Hive	
  ported	
  to	
  Spark.
         • Impala:	
  HiveQL	
  with	
  new,	
  faster	
  
           back	
  end.
                                                            Copyright	
  ©	
  2011-­‐2012,	
  Think	
  Big	
  AnalyCcs,	
  All	
  Rights	
  Reserved
Tuesday, December 4, 12

See http://hive.apache.org/ or my book for Hive, http://shark.cs.berkeley.edu/ for shark,
and http://www.cloudera.com/content/cloudera/en/products/cloudera-enterprise-core/
cloudera-enterprise-RTQ.html for Impala. Impala is very new. It doesn’t yet support all Hive
features.
 Impala
         • Copied	
  from	
  Google’s	
  Dremel.
         • C++	
  and	
  Java	
  back	
  end.
         • Provides	
  up	
  to	
  100x	
  performance	
  
           improvement!
         • Developed	
  by	
  Cloudera.
                                                        Copyright	
  ©	
  2011-­‐2012,	
  Think	
  Big	
  AnalyCcs,	
  All	
  Rights	
  Reserved
Tuesday, December 4, 12

See http://www.cloudera.com/content/cloudera/en/products/cloudera-enterprise-core/
cloudera-enterprise-RTQ.html. However, this was just announced recently, so it’s not
production ready quite yet...
Use Cascading (Java)
                                                                (SoluCon	
  #2a)




                                                                         51
Tuesday, December 4, 12
Cascading is a Java library that provides higher-level abstractions for building data processing pipelines with concepts familiar from SQL such as a
joins, group-bys, etc. It works on top of Hadoop’s MapReduce and hides most of the boilerplate from you.
See http://cascading.org.
Cascading	
  Concepts
                          Data	
  flows	
  consist	
  of	
  	
  
                          source	
  and	
  sink	
  Taps	
  
                          connected	
  by	
  Pipes.


Tuesday, December 4, 12
Word	
  Count
    Flow
      Sequence of Pipes ("word count assembly")
       line                      words               word                                                           count
                   Each(Regex)           GroupBy                Every(Count)


                      Tap                                                                   Tap
                                           HDFS
                    (source)                                                               (sink)




                                                            Copyright	
  ©	
  2011-­‐2012,	
  Think	
  Big	
  AnalyCcs,	
  All	
  Rights	
  Reserved
Tuesday, December 4, 12

Schematically, here is what Word Count looks like in Cascading. See http://
docs.cascading.org/cascading/1.2/userguide/html/ch02.html for details.
import org.cascading.*;
     ...
     public class WordCount {
       public static void main(String[] args) {
         String inputPath = args[0];
         String outputPath = args[1];
         Properties properties = new Properties();
         FlowConnector.setApplicationJarClass( properties, Main.class );

             Scheme sourceScheme = new TextLine( new Fields( "line" ) );
             Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) );
             Tap source = new Hfs( sourceScheme, inputPath );
             Tap sink   = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );

             Pipe assembly = new Pipe( "wordcount" );

             String regex = "(?<!pL)(?=pL)[^ ]*(?<=pL)(?!pL)";
             Function function = new RegexGenerator( new Fields( "word" ), regex );
             assembly = new Each( assembly, new Fields( "line" ), function );
             assembly = new GroupBy( assembly, new Fields( "word" ) );
             Aggregator count = new Count( new Fields( "count" ) );
             assembly = new Every( assembly, count );

             FlowConnector flowConnector = new FlowConnector( properties );
             Flow flow = flowConnector.connect( "word-count", source, sink, assembly);
             flow.complete();
         }
     }
                                                                         54
Tuesday, December 4, 12
Here is the Cascading Java code. It’s cleaner than the MapReduce API, because the code is more focused on the algorithm with less boilerplate,
although it looks like it’s not that much shorter. HOWEVER, this is all the code, where as previously I omitted the setup (main) code. See http://
docs.cascading.org/cascading/1.2/userguide/html/ch02.html for details of the API features used here; we won’t discuss them here, but just
mention some highlights.
Note that there is still a lot of green for types, but at least the API emphasizes composing behaviors together.
Use	
  Scalding (Scala)
                                                            (SoluCon	
  #2b)




                                                                      55
Tuesday, December 4, 12
Scalding is a Scala “DSL” (domain-specific language) that wraps Cascading providing an even more intuitive and more boilerplate-free API for
writing MapReduce jobs. https://github.com/twitter/scalding
Scala is a new JVM language that modernizes Java’s object-oriented (OO) features and adds support for functional programming, as we discussed
previously and we’ll revisit shortly.
import com.twitter.scalding._

     class WordCountJob(args: Args) extends Job(args) {
       TextLine( args("input") )
         .read
         .flatMap('line -> 'word) {
           line: String => line.trim.toLowerCase.split("W+")
         }
         .groupBy('word) { group => group.size('count) }
       }
       .write(Tsv(args("output")))
     }



                                                                     That’s It!!

                                                                                 56
Tuesday, December 4, 12
This Scala code is almost pure domain logic with very little boilerplate. There are a few minor differences in the implementation. You don’t explicitly specify the
“Hfs” (Hadoop Distributed File System) taps. That’s handled by Scalding implicitly when you run in “non-local” model. Also, I’m using a simpler tokenization
approach here, where I split on anything that isn’t a “word character” [0-9a-zA-Z_].
There is little green, in part because Scala infers type in many cases. There is a lot more yellow for the functions that do real work!
What if MapReduce, and hence Cascading and Scalding, went obsolete tomorrow? This code is so short, I wouldn’t care about throwing it away! I invested little
time writing it, testing it, etc.
Other	
  Improved	
  APIs:

                      • Crunch	
  (Java)	
  &
                        Scrunch	
  (Scala)
                      • Scoobi	
  (Scala)
                      • ...
                                                           Copyright	
  ©	
  2011-­‐2012,	
  Think	
  Big	
  AnalyCcs,	
  All	
  Rights	
  Reserved
Tuesday, December 4, 12

See https://github.com/cloudera/crunch.
Others include Scoobi (http://nicta.github.com/scoobi/) and Spark, which we’ll discuss next.
Use	
  Spark (Scala)
                              (SoluCon	
  #3)




                               58
Tuesday, December 4, 12
Spark	
  is	
  an	
  AlternaCve	
  to	
  
              Hadoop	
  MapReduce:

                      • Distributed	
  compuCng	
  with	
  
                        in-­‐memory	
  caching.
                      • Up	
  to	
  30x	
  faster	
  than	
  
                        MapReduce.
                                                            Copyright	
  ©	
  2011-­‐2012,	
  Think	
  Big	
  AnalyCcs,	
  All	
  Rights	
  Reserved
Tuesday, December 4, 12

See http://www.spark-project.org/
Why isn’t it more widely used? 1) lack of commercial support, 2) only recently emerged out of
academia.
Spark	
  is	
  an	
  AlternaCve	
  to	
  
              Hadoop	
  MapReduce:

                      • Originally	
  designed	
  for	
  
                        machine	
  learning	
  
                        applicaCons.
                                                 Copyright	
  ©	
  2011-­‐2012,	
  Think	
  Big	
  AnalyCcs,	
  All	
  Rights	
  Reserved
Tuesday, December 4, 12
object WordCountSpark {
       def main(args: Array[String]) {
          val file = spark.textFile(args(0))
          val counts = file.flatMap(line => line.split("W+"))
                           .map(word => (word, 1))
                           .reduceByKey(_ + _)
          counts.saveAsTextFile(args(1))
       }
     }



                          Even more succinct.
            Note that there are only 3 explicit types required.
                                                                                 61
Tuesday, December 4, 12
This spark example is actually closer in a few details, i.e., function names used, to the original Hadoop Java API example, but it cuts down boilerplate to the bare
minimum.
#3
                          It’s	
  not	
  suitable	
  for	
  
                                  “real-­‐Sme”
                           event	
  processing.


Tuesday, December 4, 12

For typical web/enterprise systems, “real-time” is up to 100s of milliseconds, so I’m using
the term broadly (but following common practice in this industry). True real-time systems,
such as avionics, have much tighter constraints.
Storm!
Tuesday, December 4, 12
Storm	
  implements	
  
                          reliable,	
  distributed	
  
                               “real-­‐Sme”
                            event	
  processing.

Tuesday, December 4, 12

http://storm-project.net/ Created by Nathan Marz, now at Twitter, who also created
Cascalog, the Clojure wrapper around Cascading with added Datalog (logic programming)
features.
Bolt


                                                                         Bolt



                          Spout                   Bolt




                          Spout                   Bolt




Tuesday, December 4, 12

In Storm terminology, Spouts are data sources and bolts are the event processors. There are
facilities to support reliable message handling, various sources encapsulated in Sprouts and
various targets of output. Distributed processing is baked in from the start.
Databases?




Tuesday, December 4, 12
 SQL	
  	
  or	
  NoSQL	
  
                               Databases?

         • Since	
  databases	
  are	
  designed	
  for	
  
           fast,	
  transacConal	
  updates,	
  
           consider	
  a	
  database	
  for	
  	
  	
  	
  	
  	
  	
  	
  	
  
           event	
  processing.
                                                         Copyright	
  ©	
  2011-­‐2012,	
  Think	
  Big	
  AnalyCcs,	
  All	
  Rights	
  Reserved
Tuesday, December 4, 12

Use a SQL database unless you need the scale and looser schema of a NoSQL database!
#4

                           It’s	
  not	
  ideal	
  for	
  
                          graph	
  processing.


Tuesday, December 4, 12
 Google’s	
  Page	
  Rank

         • Google	
  invented	
  MapReduce,
         • …	
  but	
  MapReduce	
  is	
  not	
  ideal	
  for	
  
           Page	
  Rank	
  and	
  other	
  graph	
  
           algorithms.	
  
                                                           Copyright	
  ©	
  2011-­‐2012,	
  Think	
  Big	
  AnalyCcs,	
  All	
  Rights	
  Reserved
Tuesday, December 4, 12

Recall that PageRank is the famous algorithm invented by Sergey Brin and Larry Page to index
the web. It’s the foundation of Google’s search engine.
Why	
  not	
  MapReduce?
         • 1	
  MR	
  job	
  for	
  each	
  
                                                                                                                   B
                                                                           A


           iteraCon	
  that	
  updates	
                                                                                     D

           all	
  n	
  nodes/edges.                                                        C


                                                                 E

         • Graph	
  saved	
  to	
  disk	
                                                                        F


           ajer	
  each	
  iteraCon.
         • ...                                          Copyright	
  ©	
  2011-­‐2012,	
  Think	
  Big	
  AnalyCcs,	
  All	
  Rights	
  Reserved
Tuesday, December 4, 12

The presentation http://www.slideshare.net/shatteredNirvana/pregel-a-system-for-
largescale-graph-processing
itemizes all the major issues with using MR to implement graph algorithms.
In a nutshell, a job with a map and reduce phase is waaay to course-grained...
Use	
  Graph	
  Processing
                                                               (SoluCon	
  #4)




                                                                   71
Tuesday, December 4, 12
A good summary presentation: http://www.slideshare.net/shatteredNirvana/pregel-a-system-for-largescale-graph-processing
 Google’s	
  Pregel
         • Pregel:	
  New	
  graph	
  framework	
  for	
  
           Page	
  Rank.
               • Bulk,	
  Synchronous	
  Parallel	
  (BSP).
                    • Graphs	
  are	
  first-­‐class	
  ciCzens.
                    • Efficiently	
  processes	
  updates...	
  
                                                               Copyright	
  ©	
  2011-­‐2012,	
  Think	
  Big	
  AnalyCcs,	
  All	
  Rights	
  Reserved
Tuesday, December 4, 12

Pregel is intended to replace MR for PageRank, but I’ve heard they haven’t actually been able
to switch over to it yet. “Pregel” is the river that runs through the city of Königsberg, Prussia
(now called Kaliningrad, Ukraine). 7 bridges crossed the river in the city (including to 5 to 2
islands between river branches). Leonhard Euler invented graph theory when he analyzed the
question of whether or not you can cross all 7 bridges without retracing your steps (you
can’t).
 Open-­‐source	
  
                              AlternaCves

         • Apache	
  Giraph.
                                                                               All are
         • Apache	
  Hama.                                                   somewhat
                                                                             immature.
         • Aurelius	
  Titan.
                                                            Copyright	
  ©	
  2011-­‐2012,	
  Think	
  Big	
  AnalyCcs,	
  All	
  Rights	
  Reserved
Tuesday, December 4, 12

http://incubator.apache.org/giraph/
http://hama.apache.org/
http://thinkaurelius.github.com/titan/
None is very mature nor has extensive commercial support.
A Manifesto...




                                                                          74
Tuesday, December 4, 12
To bring this altogether, I think we have opportunities for a better way...
Hadoop is the
                                             Enterprise Java Beans
                                                 of our time.

Tuesday, December 4, 12
I worked with EJBs a decade ago. The framework was completely invasive into your business logic. There were too many configuration options in
XML files. The framework “paradigm” was a poor fit for most problems (like soft real time systems and most algorithms beyond Word Count).
Internally, EJB implementations were inefficient and hard to optimize, because they relied on poorly considered object boundaries that muddled
more natural boundaries. (I’ve argued in other presentations and my “FP for Java Devs” book that OOP is a poor modularity tool…)
The fact is, Hadoop reminds me of EJBs in almost every way. It’s a 1st generation solution that mostly works okay and people do get work done
with it, but just as the Spring Framework brought an essential rethinking to Enterprise Java, I think there is an essential rethink that needs to
happen in Big Data, specifically around Hadoop. The functional programming community, is well positioned to create it...
Stop using Java!




Tuesday, December 4, 12
Java has taken us a long way and made many organizations successful over the years. Also, the JVM remains one of our most valuable tools in all of
IT. But the language is really wrong for data purposes and its continued use by Big Data vendors is slowing down overall progress, as well as
application developer productivity, IMHO. Java emphasizes the wrong abstractions, objects instead of mathematically-inspired functional
programming constructs, and Java encourages inflexible bloat because it’s verbose compared to more modern alternatives and objects (at least
class-based ones…) are far less reusable and flexible than people realize. They also contribute to focusing on the wrong concepts, structure
instead of behavior.
Use Functional
          Languages.
                                                                         77
Tuesday, December 4, 12
Why is Functional Programming better for Big Data? The work we do with data is inherently mathematical transformations and FP is inspired by
math. Hence, it’s naturally a better fit, much more so than object-oriented programming. And, modern languages like Scala, Clojure, Erlang, F#,
OCaml, and Haskell are more concise and better at eliminating boilerplate, while still providing excellent performance.

Note that one reason SQL has succeeded all these years is because it is also inspired by math, e.g., set theory.
Understand
                      Functional Collections.
Tuesday, December 4, 12
We already have the right model in the collection APIs that come with functional languages. They are far better engineered for intuitive data
transformations. They provide the right abstractions and hide boilerplate. In fact, they make it relatively easy to optimize implementations for
parallelization. The Scala collections offer parallelization with a tiny API call. Spark and Cascading transparently distribute collections across a
cluster.
Erlang, Akka:
                                                                         Actor-based,
                                                                           Distributed
                                                                         Computation

                                    Fine Grain
                                  Compute Models.
Tuesday, December 4, 12
We can start using new, more efficient compute models, like Spark, Pregel, and Impala today. Of course, you have to consider maturity, viability,
and support issues in large organizations. So if you want to wait until these alternatives are more mature, then at least use better APIs for Hadoop!
For example, Erlang is a very mature language with the Actor model backed in. Akka is a Scala distributed computing model based on the Actor
model of concurrency. It exposes clean, low-level primitives for robust, distributed services (e.g., Actors), upon which we can build flexible big data
systems that can handle soft real time and batch processing efficiently and with great scalability.
Final Thought:




                                   80
Tuesday, December 4, 12
A final thought about Big Data...
QuesCons?




  TechMesh	
  London	
  2012
  December	
  5,	
  2012
  @deanwampler
  dean.wampler@thinkbiganalyCcs.com
  polyglotprogramming.com/talks
                                          81
Tuesday, December 4, 12

All pictures © Dean Wampler, 2011-2012.

More Related Content

What's hot

Hadoop hdfs interview questions
Hadoop hdfs interview questionsHadoop hdfs interview questions
Hadoop hdfs interview questionsKalyan Hadoop
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduceNewvewm
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview questionpappupassindia
 
10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about HadoopDonald Miner
 
Getting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceGetting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceobdit
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionCloudera, Inc.
 
Python Powered Data Science at Pivotal (PyData 2013)
Python Powered Data Science at Pivotal (PyData 2013)Python Powered Data Science at Pivotal (PyData 2013)
Python Powered Data Science at Pivotal (PyData 2013)Srivatsan Ramanujam
 
Hadoop interview quations1
Hadoop interview quations1Hadoop interview quations1
Hadoop interview quations1Vemula Ravi
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questionsKalyan Hadoop
 
Neural Networks, Spark MLlib, Deep Learning
Neural Networks, Spark MLlib, Deep LearningNeural Networks, Spark MLlib, Deep Learning
Neural Networks, Spark MLlib, Deep LearningAsim Jalis
 
Spark autotuning talk final
Spark autotuning talk finalSpark autotuning talk final
Spark autotuning talk finalRachel Warren
 
Understanding Spark Tuning: Strata New York
Understanding Spark Tuning: Strata New YorkUnderstanding Spark Tuning: Strata New York
Understanding Spark Tuning: Strata New YorkRachel Warren
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoopjeffturner
 
Python in big data world
Python in big data worldPython in big data world
Python in big data worldRohit
 
Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To HadoopAdeel Ahmad
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo pptPhil Young
 
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...Databricks
 
Spark Application Development Made Easy
Spark Application Development Made EasySpark Application Development Made Easy
Spark Application Development Made EasyDataWorks Summit
 

What's hot (20)

Hadoop hdfs interview questions
Hadoop hdfs interview questionsHadoop hdfs interview questions
Hadoop hdfs interview questions
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview question
 
10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop
 
Spark and shark
Spark and sharkSpark and shark
Spark and shark
 
Getting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceGetting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduce
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
 
Python Powered Data Science at Pivotal (PyData 2013)
Python Powered Data Science at Pivotal (PyData 2013)Python Powered Data Science at Pivotal (PyData 2013)
Python Powered Data Science at Pivotal (PyData 2013)
 
Hadoop Interview Questions and Answers
Hadoop Interview Questions and AnswersHadoop Interview Questions and Answers
Hadoop Interview Questions and Answers
 
Hadoop interview quations1
Hadoop interview quations1Hadoop interview quations1
Hadoop interview quations1
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questions
 
Neural Networks, Spark MLlib, Deep Learning
Neural Networks, Spark MLlib, Deep LearningNeural Networks, Spark MLlib, Deep Learning
Neural Networks, Spark MLlib, Deep Learning
 
Spark autotuning talk final
Spark autotuning talk finalSpark autotuning talk final
Spark autotuning talk final
 
Understanding Spark Tuning: Strata New York
Understanding Spark Tuning: Strata New YorkUnderstanding Spark Tuning: Strata New York
Understanding Spark Tuning: Strata New York
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoop
 
Python in big data world
Python in big data worldPython in big data world
Python in big data world
 
Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To Hadoop
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
 
Spark Application Development Made Easy
Spark Application Development Made EasySpark Application Development Made Easy
Spark Application Development Made Easy
 

Similar to MapReduce and Its Discontents

Django and Neo4j - Domain modeling that kicks ass
Django and Neo4j - Domain modeling that kicks assDjango and Neo4j - Domain modeling that kicks ass
Django and Neo4j - Domain modeling that kicks assTobias Lindaaker
 
CloudFoundry and MongoDb, a marriage made in heaven
CloudFoundry and MongoDb, a marriage made in heavenCloudFoundry and MongoDb, a marriage made in heaven
CloudFoundry and MongoDb, a marriage made in heavenPatrick Chanezon
 
A Dynamic Topic Model of Learning Analytics Research
A Dynamic Topic Model of Learning Analytics ResearchA Dynamic Topic Model of Learning Analytics Research
A Dynamic Topic Model of Learning Analytics ResearchMichael Derntl
 
Pal gov.tutorial2.session13 1.data schema integration
Pal gov.tutorial2.session13 1.data schema integrationPal gov.tutorial2.session13 1.data schema integration
Pal gov.tutorial2.session13 1.data schema integrationMustafa Jarrar
 
OUTCOME ANALYSIS IN ACADEMIC INSTITUTIONS USING NEO4J
OUTCOME ANALYSIS IN ACADEMIC INSTITUTIONS USING NEO4JOUTCOME ANALYSIS IN ACADEMIC INSTITUTIONS USING NEO4J
OUTCOME ANALYSIS IN ACADEMIC INSTITUTIONS USING NEO4Jijcsity
 
8 douetteau - dataiku - data tuesday open source 26 fev 2013
8   douetteau - dataiku - data tuesday open source 26 fev 2013 8   douetteau - dataiku - data tuesday open source 26 fev 2013
8 douetteau - dataiku - data tuesday open source 26 fev 2013 Data Tuesday
 
Scalable and privacy-preserving data integration - part 1
Scalable and privacy-preserving data integration - part 1Scalable and privacy-preserving data integration - part 1
Scalable and privacy-preserving data integration - part 1ErhardRahm
 
Silicon valley nosql meetup april 2012
Silicon valley nosql meetup  april 2012Silicon valley nosql meetup  april 2012
Silicon valley nosql meetup april 2012InfiniteGraph
 
Implementation of nosql for robotics
Implementation of nosql for roboticsImplementation of nosql for robotics
Implementation of nosql for roboticsJoão Gabriel Lima
 
data science chapter-4,5,6
data science chapter-4,5,6data science chapter-4,5,6
data science chapter-4,5,6varshakumar21
 
Document Based Data Modeling Technique
Document Based Data Modeling TechniqueDocument Based Data Modeling Technique
Document Based Data Modeling TechniqueCarmen Sanborn
 
Teradata Aster Discovery Platform
Teradata Aster Discovery PlatformTeradata Aster Discovery Platform
Teradata Aster Discovery PlatformScott Antony
 
Big Data Basic Concepts | Presented in 2014
Big Data Basic Concepts  | Presented in 2014Big Data Basic Concepts  | Presented in 2014
Big Data Basic Concepts | Presented in 2014Kenneth Igiri
 
Mapping objects to_relational_databases
Mapping objects to_relational_databasesMapping objects to_relational_databases
Mapping objects to_relational_databasesIvan Paredes
 
Ml pluss ejan2013
Ml pluss ejan2013Ml pluss ejan2013
Ml pluss ejan2013CS, NcState
 
Pal gov.tutorial2.session12 2.architectural solutions for the integration issues
Pal gov.tutorial2.session12 2.architectural solutions for the integration issuesPal gov.tutorial2.session12 2.architectural solutions for the integration issues
Pal gov.tutorial2.session12 2.architectural solutions for the integration issuesMustafa Jarrar
 
Performance Improvement of Heterogeneous Hadoop Cluster using Ranking Algorithm
Performance Improvement of Heterogeneous Hadoop Cluster using Ranking AlgorithmPerformance Improvement of Heterogeneous Hadoop Cluster using Ranking Algorithm
Performance Improvement of Heterogeneous Hadoop Cluster using Ranking AlgorithmIRJET Journal
 

Similar to MapReduce and Its Discontents (20)

Django and Neo4j - Domain modeling that kicks ass
Django and Neo4j - Domain modeling that kicks assDjango and Neo4j - Domain modeling that kicks ass
Django and Neo4j - Domain modeling that kicks ass
 
CloudFoundry and MongoDb, a marriage made in heaven
CloudFoundry and MongoDb, a marriage made in heavenCloudFoundry and MongoDb, a marriage made in heaven
CloudFoundry and MongoDb, a marriage made in heaven
 
A Dynamic Topic Model of Learning Analytics Research
A Dynamic Topic Model of Learning Analytics ResearchA Dynamic Topic Model of Learning Analytics Research
A Dynamic Topic Model of Learning Analytics Research
 
Pal gov.tutorial2.session13 1.data schema integration
Pal gov.tutorial2.session13 1.data schema integrationPal gov.tutorial2.session13 1.data schema integration
Pal gov.tutorial2.session13 1.data schema integration
 
OUTCOME ANALYSIS IN ACADEMIC INSTITUTIONS USING NEO4J
OUTCOME ANALYSIS IN ACADEMIC INSTITUTIONS USING NEO4JOUTCOME ANALYSIS IN ACADEMIC INSTITUTIONS USING NEO4J
OUTCOME ANALYSIS IN ACADEMIC INSTITUTIONS USING NEO4J
 
8 douetteau - dataiku - data tuesday open source 26 fev 2013
8   douetteau - dataiku - data tuesday open source 26 fev 2013 8   douetteau - dataiku - data tuesday open source 26 fev 2013
8 douetteau - dataiku - data tuesday open source 26 fev 2013
 
Scalable and privacy-preserving data integration - part 1
Scalable and privacy-preserving data integration - part 1Scalable and privacy-preserving data integration - part 1
Scalable and privacy-preserving data integration - part 1
 
Silicon valley nosql meetup april 2012
Silicon valley nosql meetup  april 2012Silicon valley nosql meetup  april 2012
Silicon valley nosql meetup april 2012
 
Implementation of nosql for robotics
Implementation of nosql for roboticsImplementation of nosql for robotics
Implementation of nosql for robotics
 
Data science unit2
Data science unit2Data science unit2
Data science unit2
 
data science chapter-4,5,6
data science chapter-4,5,6data science chapter-4,5,6
data science chapter-4,5,6
 
Document Based Data Modeling Technique
Document Based Data Modeling TechniqueDocument Based Data Modeling Technique
Document Based Data Modeling Technique
 
Teradata Aster Discovery Platform
Teradata Aster Discovery PlatformTeradata Aster Discovery Platform
Teradata Aster Discovery Platform
 
On no sql.partiii
On no sql.partiiiOn no sql.partiii
On no sql.partiii
 
Big Data Basic Concepts | Presented in 2014
Big Data Basic Concepts  | Presented in 2014Big Data Basic Concepts  | Presented in 2014
Big Data Basic Concepts | Presented in 2014
 
Mapping objects to_relational_databases
Mapping objects to_relational_databasesMapping objects to_relational_databases
Mapping objects to_relational_databases
 
Ml pluss ejan2013
Ml pluss ejan2013Ml pluss ejan2013
Ml pluss ejan2013
 
Pal gov.tutorial2.session12 2.architectural solutions for the integration issues
Pal gov.tutorial2.session12 2.architectural solutions for the integration issuesPal gov.tutorial2.session12 2.architectural solutions for the integration issues
Pal gov.tutorial2.session12 2.architectural solutions for the integration issues
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
 
Performance Improvement of Heterogeneous Hadoop Cluster using Ranking Algorithm
Performance Improvement of Heterogeneous Hadoop Cluster using Ranking AlgorithmPerformance Improvement of Heterogeneous Hadoop Cluster using Ranking Algorithm
Performance Improvement of Heterogeneous Hadoop Cluster using Ranking Algorithm
 

Recently uploaded

The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 

Recently uploaded (20)

The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 

MapReduce and Its Discontents

  • 1. TechMesh  London  2012 December  5,  2012 dean.wampler@thinkbiganalyCcs.com @deanwampler   polyglotprogramming.com/talks MapReduce   and  Its   Tuesday, December 4, 12 Discontents MR has been a useful technology, but it has a “first generation” feel. What’s next? (All photos are © Dean Wampler, 2011-2012, All Rights Reserved. Otherwise, the presentation is free to use.)
  • 2. About  Me... dean.wampler@thinkbiganalyCcs.com @deanwampler Programming Hive Functional Programming for Java Developers Dean Wampler, Dean Wampler Jason Rutherglen & Edward Capriolo Tuesday, December 4, 12 My books and contact information.
  • 3. Big  Data Data  so  big  that   tradiConal  soluCons  are   too  slow,  too  small,  or   too  expensive  to  use. 3 Hat tip: Bob Korbus Tuesday, December 4, 12 It’s a buzz word, but generally associated with the problem of data sets too big to manage with traditional SQL databases. A parallel development has been the NoSQL movement that is good at handling semistructured data, scaling, etc.
  • 4. 3  Trends 4 Tuesday, December 4, 12 Three trends influence my thinking...
  • 5. Data  Size   5 Tuesday, December 4, 12 Data volumes are obviously growing… rapidly. Facebook now has over 600PB (Petabytes) of data in Hadoop clusters!
  • 6. Formal  Schemas  ➡ 6 Tuesday, December 4, 12 There is less emphasis on “formal” schemas and domain models, i.e., both relational models of data and OO models, because data schemas and sources change rapidly, and we need to integrate so many disparate sources of data. So, using relatively-agnostic software, e.g., collections of things where the software is more agnostic about the structure of the data and the domain, tends to be faster to develop, test, and deploy. Put another way, we find it more useful to build somewhat agnostic applications and drive their behavior through data...
  • 7. Data-­‐Driven  Programs   7 Tuesday, December 4, 12 This is the 2nd generation “Stanley”, the most successful self-driving car ever built (by a Google-Stanford) team. Machine learning is growing in importance. Here, generic algorithms and data structures are trained to represent the “world” using data, rather than encoding a model of the world in the software itself. It’s another example of generic algorithms that produce the desired behavior by being application agnostic and data driven, rather than hard-coding a model of the world. (In practice, however, a balance is struck between completely agnostic apps and some engineering towards for the specific problem, as you might expect...)
  • 8. Big  Data Architecture 8 Tuesday, December 4, 12 What should software architectures look like for these kinds of systems?
  • 9. Other, Object- 1 Oriented Objects Domain Logic 5 Object Model Query ParentB1 SQL toJSON 4 2 ChildB1 ChildB2 Object- toJSON toJSON Relational Mapping Result Set 3 Database 9 Tuesday, December 4, 12 Traditionally, we’ve kept a rich, in-memory domain model requiring an ORM to convert persistent data into the model. This is resource overhead and complexity we can’t afford in big data systems. Rather, we should treat the result set as it is, a particular kind of collection, do the minimal transformation required to exploit our collections libraries and classes representing some domain concepts (e.g., Address, StockOption, etc.), then write functional code to implement business logic (or drive emergent behavior with machine learning algorithms…) The toJSON methods are there because we often convert these object graphs back into fundamental structures, such as the maps and arrays of JSON so we can send them to the browser!
  • 10. Relational/ Functional Domain Logic 1 Functional Abstractions SQL 4 Functional Query Wrapper for Other, Object- Oriented 1 Relational Data Objects Domain Logic 5 2 Object Model Query SQL ParentB1 toJSON Result Set 4 2 ChildB1 ChildB2 Object- toJSON toJSON Relational Mapping 3 Result Set 3 Database Database 10 Tuesday, December 4, 12 But the traditional systems are a poor fit for this new world: 1) they add too much overhead in computation (the ORM layer, etc.) and memory (to store the objects). Most of what we do with data is mathematical transformation, so we’re far more productive (and runtime efficient) if we embrace fundamental data structures used throughout (lists, sets, maps, trees) and build rich transformations into those libraries, transformations that are composable to implement business logic.
  • 11. Relational/ Functional Domain Logic • Focus on: 1 Functional • Lists Abstractions SQL 4 • Maps Functional Query Wrapper for Relational Data • Sets 2 Result Set • Trees 3 • ... Database 11 Tuesday, December 4, 12 But the traditional systems are a poor fit for this new world: 1) they add too much overhead in computation (the ORM layer, etc.) and memory (to store the objects). Most of what we do with data is mathematical transformation, so we’re far more productive (and runtime efficient) if we embrace fundamental data structures used throughout (lists, sets, maps, trees) and build rich transformations into those libraries, transformations that are composable to implement business logic.
  • 12. Web Client 1 Web Client 2 Web Client 3 ParentB1 methods ChildB1 ChildB2 methods methods Database Files 12 Tuesday, December 4, 12 In a broader view, object models tend to push us towards centralized, complex systems that don’t decompose well and stifle reuse and optimal deployment scenarios. FP code makes it easier to write smaller, focused services that we compose and deploy as appropriate.
  • 13. Web Client 1 Web Client 2 Web Client 3 Process 1 Process 2 Process 3 Web Client 1 Web Client 2 Web Client 3 ParentB1 methods ChildB1 ChildB2 methods methods Database Files Database Files 13 Tuesday, December 4, 12 In a broader view, object models tend to push us towards centralized, complex systems that don’t decompose well and stifle reuse and optimal deployment scenarios. FP code makes it easier to write smaller, focused services that we compose and deploy as appropriate. Each “ProcessN” could be a parallel copy of another process, for horizontal, “shared-nothing” scalability, or some of these processes could be other services… Smaller, focused services scale better, especially horizontally. They also don’t encapsulate more business logic than is required, and this (informal) architecture is also suitable for scaling ML and related algorithms.
  • 14. Web Client 1 Web Client 2 Web Client 3 • Data Size • Formal Schema ➡ Process 1 Process 2 Process 3 • Data-Driven Programs Database Files 14 Tuesday, December 4, 12 And this structure better fits the trends I outlined at the beginning of the talk.
  • 15. Web Client 1 Web Client 2 Web Client 3 • MapReduce Process 1 Process 2 Process 3 • Distributed FS Database Files 15 Tuesday, December 4, 12 And MapReduce + a distributed file system, like Hadoop’s MapReduce and HDFS, fit this model.
  • 16. What  Is MapReduce? 16 Tuesday, December 4, 12
  • 17. MapReduce  in  Hadoop Let’s  look  at  a   MapReduce  algorithm:   WordCount. (The  Hello  World  of  MapReduce…) Tuesday, December 4, 12 Let’s walk through the “Hello World” of MapReduce, the Word Count algorithm, at a conceptual level. We’ll see actual code shortly!
  • 18. Input Mappers Sort, Reducers Output Shuffle Hadoop uses MapReduce a2 hadoop 1 is 2 There is a Map phase We  need  to  convert   the  Input   map 1 mapreduce 1 phase 2 into  the  Output. reduce 1 there 2 There is a Reduce phase uses 1 Copyright  ©  2011-­‐2012,  Think  Big  AnalyCcs,  All  Rights  Reserved Tuesday, December 4, 12 Four input documents, one left empty, the others with small phrases (for simplicity…). The word count output is on the right (we’ll see why there are three output “documents”). We need to get from the input on the left-hand side to the output on the right-hand side.
  • 19. Input Mappers Sort, Reducers Output Shuffle Hadoop uses MapReduce a2 hadoop 1 is 2 There is a Map phase map 1 mapreduce 1 phase 2 reduce 1 there 2 There is a Reduce phase uses 1 Copyright  ©  2011-­‐2012,  Think  Big  AnalyCcs,  All  Rights  Reserved Tuesday, December 4, 12 Here is a schematic view of the steps in Hadoop MapReduce. Each Input file is read by a single Mapper process (default: can be many-to-many, as we’ll see later). The Mappers emit key-value pairs that will be sorted, then partitioned and “shuffled” to the reducers, where each Reducer will get all instances of a given key (for 1 or more values). Each Reducer generates the final key-value pairs and writes them to one or more files (based on the size of the output).
  • 20. Input Mappers Hadoop uses MapReduce (n, "…") There is a Map phase (n, "…") (n, "") There is a Reduce phase (n, "…") Copyright  ©  2011-­‐2012,  Think  Big  AnalyCcs,  All  Rights  Reserved Tuesday, December 4, 12 Each document gets a mapper. All data is organized into key-value pairs; each line will be a value and the offset position into the file will be the key, which we don’t care about. I’m showing each document’s contents in a box and 1 mapper task (JVM process) per document. Large documents might get split to several mapper tasks. The mappers tokenize each line, one at a time, converting all words to lower case and counting them...
  • 21. Input Mappers (hadoop, 1) Hadoop uses MapReduce (n, "…") (uses, 1) (mapreduce, 1) (there, 1) (is, 1) There is a Map phase (n, "…") (a, 1) (map, 1) (phase, 1) (n, "") (there, 1) (is, 1) There is a Reduce phase (n, "…") (a, 1) (reduce, 1) (phase, 1) Copyright  ©  2011-­‐2012,  Think  Big  AnalyCcs,  All  Rights  Reserved Tuesday, December 4, 12 The mappers emit key-value pairs, where each key is one of the words, and the value is the count. In the most naive (but also most memory efficient) implementation, each mapper simply emits (word, 1) each time “word” is seen. However, this is IO inefficient! Note that the mapper for the empty doc. emits no pairs, as you would expect.
  • 22. Input Mappers Sort, Reducers Shuffle 0-9, a-l Hadoop uses (hadoop, 1) MapReduce (n, "…") (mapreduce, 1) (uses, 1) (is, 1), (a, 1) There is a Map phase (n, "…") m-q (map, 1),(phase,1) (there, 1) (n, "") (phase,1) r-z (is, 1), (a, 1) (there, 1), There is a Reduce phase (n, "…") (reduce, 1) Copyright  ©  2011-­‐2012,  Think  Big  AnalyCcs,  All  Rights  Reserved Tuesday, December 4, 12 The mappers themselves don’t decide to which reducer each pair should be sent. Rather, the job setup configures what to do and the Hadoop runtime enforces it during the Sort/Shuffle phase, where the key-value pairs in each mapper are sorted by key (that is locally, not globally) and then the pairs are routed to the correct reducer, on the current machine or other machines. Note how we partitioned the reducers, by first letter of the keys. (By default, MR just hashes the keys and distributes them modulo # of reducers.)
  • 23. Input Mappers Sort, Reducers Shuffle 0-9, a-l Hadoop uses (hadoop, 1) MapReduce (n, "…") (a, [1,1]), (mapreduce, 1) (hadoop, [1]), (is, [1,1]) (uses, 1) (is, 1), (a, 1) There is a Map phase (n, "…") m-q (map, 1),(phase,1) (there, 1) (map, [1]), (mapreduce, [1]), (phase, [1,1]) (n, "") (phase,1) r-z (is, 1), (a, 1) (reduce, [1]), (there, 1), (there, [1,1]), There is a Reduce phase (n, "…") (reduce, 1) (uses, 1) Copyright  ©  2011-­‐2012,  Think  Big  AnalyCcs,  All  Rights  Reserved Tuesday, December 4, 12 The reducers are passed each key (word) and a collection of all the values for that key (the individual counts emitted by the mapper tasks). The MR framework creates these collections for us.
  • 24. Input Mappers Sort, Reducers Output Shuffle 0-9, a-l Hadoop uses (hadoop, 1) MapReduce (n, "…") (a, [1,1]), a2 (mapreduce, 1) (hadoop, [1]), hadoop 1 (is, [1,1]) is 2 (uses, 1) (is, 1), (a, 1) There is a Map phase (n, "…") m-q (map, 1),(phase,1) (there, 1) (map, [1]), map 1 (mapreduce, [1]), mapreduce 1 (phase, [1,1]) phase 2 (n, "") (phase,1) r-z (is, 1), (a, 1) (reduce, [1]), reduce 1 (there, 1), (there, [1,1]), there 2 There is a Reduce phase (n, "…") (reduce, 1) (uses, 1) uses 1 Copyright  ©  2011-­‐2012,  Think  Big  AnalyCcs,  All  Rights  Reserved Tuesday, December 4, 12 The final view of the WordCount process flow. The reducer just sums the counts and writes the output. The output files contain one line for each key (the word) and value (the count), assuming we’re using text output. The choice of delimiter between key and value is up to you, but tab is common.
  • 25. Input Mappers Sort, Reducers Output Shuffle 0-9, a-l Hadoop uses (hadoop, 1) MapReduce (n, "…") (a, [1,1]), a2 (mapreduce, 1) (hadoop, [1]), hadoop 1 (is, [1,1]) is 2 (uses, 1) (is, 1), (a, 1) There is a Map phase (n, "…") m-q (map, 1),(phase,1) (there, 1) (map, [1]), map 1 (mapreduce, [1]), mapreduce 1 (phase, [1,1]) phase 2 (n, "") Map: (phase,1) Reduce: r-z • Transform  one  input  to  0-­‐N   • (is, 1), (a, 1) (there, 1), Collect multiple (reduce, [1]), (there, [1,1]), one output. inputs into reduce 1 there 2 outputs.   There is a Reduce phase (n, "…") (reduce, 1) (uses, 1) uses 1 Copyright  ©  2011-­‐2012,  Think  Big  AnalyCcs,  All  Rights  Reserved Tuesday, December 4, 12 To recap, a “map” transforms one input to one output, but this is generalized in MapReduce to be one to 0-N. The output key-value pairs are distributed to reducers. The “reduce” collects together multiple inputs with the same key into
  • 26. History  of   MapReduce Tuesday, December 4, 12 Let’s review where MapReduce came from and its best-known, open-source incarnation, Hadoop.
  • 27. How  would  you   index  the  web? Tuesday, December 4, 12
  • 28. How  would  you   index  the  web? Tuesday, December 4, 12 Did Google search the entire web in 0.26 seconds to find these ~49M results?
  • 29. You  ask  a  phrase  and   the  search  engine  finds   the  best  match  in   billions  of  web  pages. Tuesday, December 4, 12
  • 30. Actually,  Google   computes  the  index   that  maps  terms  to   pages  in  advance. Google’s famous Page Rank algorithm. Tuesday, December 4, 12
  • 31. In  the  early  2000s,   Google  invented  server   infrastructure  to  support   PageRank,  etc... Tuesday, December 4, 12
  • 32. Google  File  System for  Storage 2003 Tuesday, December 4, 12 A distributed file system provides horizontal scalability and resiliency when file blocks are duplicated around the cluster.
  • 33. MapReduce for  ComputaCon 2004 Tuesday, December 4, 12 The compute model for processing all that data is MapReduce. It handles lots of boilerplate, like breaking down jobs into tasks, distributing the tasks around the cluster, monitoring the tasks, etc. You write your algorithm to the MR programming model.
  • 34. About  this  Cme,  Doug   CuGng,  the  creator  of     Lucene  was  working  on   Nutch... Tuesday, December 4, 12 Lucene is an open-source text search engine. Nutch is an open source web crawler.
  • 35. He  implemented  clean-­‐ room  versions  of   MapReduce  and  GFS... Tuesday, December 4, 12
  • 36. By  2006  ,  they  became   part  of  a  separate     Apache  project,   called  Hadoop. Tuesday, December 4, 12 The name comes from a toy, stuffed elephant that Cutting’s son owned at the time.
  • 37. Benefits  of   MapReduce Tuesday, December 4, 12
  • 38. The  best  way  to   approach  Big  Data  is  to   scale  horizontally. Tuesday, December 4, 12 We can’t build vertical systems big enough and if we could, they would cost a fortune!
  • 39. Hadoop  Design  Goals I / O e !! iz d in rh im ea M ve Ond  run  on  server-­‐class,   Oh,  a commodity    hardware. Tuesday, December 4, 12 True of Google’s GFS and MapReduce, too. Minimizing disk and network I/O latency/ overhead is critical, because it’s the largest throughput bottleneck. So, optimization is a core design goal of Hadoop (both MR and HDFS). It affects the features and performance of everything in the stack above it, including high-level programming tools!
  • 40. By  design,  Hadoop  is     great  for  batch  mode   data  crunching. Tuesday, December 4, 12 … but less so for “real-time” event handling, as we’ll discuss...
  • 41. Hadoop  also  has  a   vibrant  community   that’s  evolving  the   pla_orm. Tuesday, December 4, 12 For the IT manager, especially at large, cautious organizations.
  • 42. Commercial  support  is   available  from  many   companies. Tuesday, December 4, 12 From Hadoop-oriented companies like Cloudera, MapR, and HortonWorks, to integrators like IBM and Greenplum.
  • 43. MapReduce   and  its   Discontents Tuesday, December 4, 12 Is MapReduce the end of the story? Does it meet all our needs? Let’s look at a few problems...
  • 44. #1 It’s  hard  to  implement   many  Algorithms   in  MapReduce. Tuesday, December 4, 12 Even word count is not “obvious”. When you get to fancier stuff like joins, group-bys, etc., the mapping from the algorithm to the implementation is not trivial at all. In fact, implementing algorithms in MR is now a specialized body of knowledge...
  • 45. #2 For  Hadoop  in   parCcularly,   the  Java  API  is hard  to  use. Tuesday, December 4, 12 The Hadoop Java API is even more verbose and tedious to use than it should be.
  • 46. import org.apache.hadoop.io.*; import org.apache.hadoop.mapred.*; import java.util.StringTokenizer; class WCMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { static final IntWritable one = new IntWritable(1); static final Text word = new Text; // Value will be set in a non-thread-safe way! @Override public void map(LongWritable key, Text valueDocContents, OutputCollector<Text, IntWritable> output, Reporter reporter) { String[] tokens = valueDocContents.toString.split("s+"); for (String wordString: tokens) { if (wordString.length > 0) { word.set(wordString.toLowerCase); output.collect(word, one); } } } } class WCReduce extends MapReduceBase implements Reducer[Text, IntWritable, Text, IntWritable] { public void reduce(Text keyWord, java.util.Iterator<IntWritable> valuesCounts, OutputCollector<Text, IntWritable> output, Reporter reporter) { int totalCount = 0; while (valuesCounts.hasNext) { totalCount += valuesCounts.next.get; } output.collect(keyWord, new IntWritable(totalCount)); } } 46 Tuesday, December 4, 12 This is intentionally too small to read and we’re not showing the main routine, which doubles the code size. The algorithm is simple, but the framework is in your face. In the next several slides, notice which colors dominate. In this slide, it’s green for types (classes), with relatively few yellow functions that implement actual operations. The main routine I’ve omitted contains additional boilerplate for configuring and running the job. This is just the “core” MapReduce code. In fact, Word Count is not too bad, but when you get to more complex algorithms, even conceptually simple ideas like relational-style joins and group-bys, the corresponding MapReduce code in this API gets complex and tedious very fast!
  • 47. Use  SQL! (SoluCon  #1) 47 Tuesday, December 4, 12 I claimed we should be using FP for big data, so why not use SQL when you can! Here are 3 options.
  • 48. SQL! CREATE TABLE docs (line STRING); LOAD DATA INPATH '/path/to/docs' INTO TABLE docs; CREATE TABLE word_counts AS SELECT word, count(1) AS count FROM (SELECT explode(split(line, 'W+')) AS word FROM docs) w GROUP BY word ORDER BY word; Word Count, in HiveQL 48 Tuesday, December 4, 12 This is how you could implement word count in Hive. We’re using some Hive built-in functions for tokenizing words in each “line”, the one “column” in the docs table, etc., etc.
  • 49.  Use  SQL  when  you  can! • Hive:  SQL  on  top  of  MapReduce. • Shark:  Hive  ported  to  Spark. • Impala:  HiveQL  with  new,  faster   back  end. Copyright  ©  2011-­‐2012,  Think  Big  AnalyCcs,  All  Rights  Reserved Tuesday, December 4, 12 See http://hive.apache.org/ or my book for Hive, http://shark.cs.berkeley.edu/ for shark, and http://www.cloudera.com/content/cloudera/en/products/cloudera-enterprise-core/ cloudera-enterprise-RTQ.html for Impala. Impala is very new. It doesn’t yet support all Hive features.
  • 50.  Impala • Copied  from  Google’s  Dremel. • C++  and  Java  back  end. • Provides  up  to  100x  performance   improvement! • Developed  by  Cloudera. Copyright  ©  2011-­‐2012,  Think  Big  AnalyCcs,  All  Rights  Reserved Tuesday, December 4, 12 See http://www.cloudera.com/content/cloudera/en/products/cloudera-enterprise-core/ cloudera-enterprise-RTQ.html. However, this was just announced recently, so it’s not production ready quite yet...
  • 51. Use Cascading (Java) (SoluCon  #2a) 51 Tuesday, December 4, 12 Cascading is a Java library that provides higher-level abstractions for building data processing pipelines with concepts familiar from SQL such as a joins, group-bys, etc. It works on top of Hadoop’s MapReduce and hides most of the boilerplate from you. See http://cascading.org.
  • 52. Cascading  Concepts Data  flows  consist  of     source  and  sink  Taps   connected  by  Pipes. Tuesday, December 4, 12
  • 53. Word  Count Flow Sequence of Pipes ("word count assembly") line words word count Each(Regex) GroupBy Every(Count) Tap Tap HDFS (source) (sink) Copyright  ©  2011-­‐2012,  Think  Big  AnalyCcs,  All  Rights  Reserved Tuesday, December 4, 12 Schematically, here is what Word Count looks like in Cascading. See http:// docs.cascading.org/cascading/1.2/userguide/html/ch02.html for details.
  • 54. import org.cascading.*; ... public class WordCount { public static void main(String[] args) { String inputPath = args[0]; String outputPath = args[1]; Properties properties = new Properties(); FlowConnector.setApplicationJarClass( properties, Main.class ); Scheme sourceScheme = new TextLine( new Fields( "line" ) ); Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) ); Tap source = new Hfs( sourceScheme, inputPath ); Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE ); Pipe assembly = new Pipe( "wordcount" ); String regex = "(?<!pL)(?=pL)[^ ]*(?<=pL)(?!pL)"; Function function = new RegexGenerator( new Fields( "word" ), regex ); assembly = new Each( assembly, new Fields( "line" ), function ); assembly = new GroupBy( assembly, new Fields( "word" ) ); Aggregator count = new Count( new Fields( "count" ) ); assembly = new Every( assembly, count ); FlowConnector flowConnector = new FlowConnector( properties ); Flow flow = flowConnector.connect( "word-count", source, sink, assembly); flow.complete(); } } 54 Tuesday, December 4, 12 Here is the Cascading Java code. It’s cleaner than the MapReduce API, because the code is more focused on the algorithm with less boilerplate, although it looks like it’s not that much shorter. HOWEVER, this is all the code, where as previously I omitted the setup (main) code. See http:// docs.cascading.org/cascading/1.2/userguide/html/ch02.html for details of the API features used here; we won’t discuss them here, but just mention some highlights. Note that there is still a lot of green for types, but at least the API emphasizes composing behaviors together.
  • 55. Use  Scalding (Scala) (SoluCon  #2b) 55 Tuesday, December 4, 12 Scalding is a Scala “DSL” (domain-specific language) that wraps Cascading providing an even more intuitive and more boilerplate-free API for writing MapReduce jobs. https://github.com/twitter/scalding Scala is a new JVM language that modernizes Java’s object-oriented (OO) features and adds support for functional programming, as we discussed previously and we’ll revisit shortly.
  • 56. import com.twitter.scalding._ class WordCountJob(args: Args) extends Job(args) { TextLine( args("input") ) .read .flatMap('line -> 'word) { line: String => line.trim.toLowerCase.split("W+") } .groupBy('word) { group => group.size('count) } } .write(Tsv(args("output"))) } That’s It!! 56 Tuesday, December 4, 12 This Scala code is almost pure domain logic with very little boilerplate. There are a few minor differences in the implementation. You don’t explicitly specify the “Hfs” (Hadoop Distributed File System) taps. That’s handled by Scalding implicitly when you run in “non-local” model. Also, I’m using a simpler tokenization approach here, where I split on anything that isn’t a “word character” [0-9a-zA-Z_]. There is little green, in part because Scala infers type in many cases. There is a lot more yellow for the functions that do real work! What if MapReduce, and hence Cascading and Scalding, went obsolete tomorrow? This code is so short, I wouldn’t care about throwing it away! I invested little time writing it, testing it, etc.
  • 57. Other  Improved  APIs: • Crunch  (Java)  & Scrunch  (Scala) • Scoobi  (Scala) • ... Copyright  ©  2011-­‐2012,  Think  Big  AnalyCcs,  All  Rights  Reserved Tuesday, December 4, 12 See https://github.com/cloudera/crunch. Others include Scoobi (http://nicta.github.com/scoobi/) and Spark, which we’ll discuss next.
  • 58. Use  Spark (Scala) (SoluCon  #3) 58 Tuesday, December 4, 12
  • 59. Spark  is  an  AlternaCve  to   Hadoop  MapReduce: • Distributed  compuCng  with   in-­‐memory  caching. • Up  to  30x  faster  than   MapReduce. Copyright  ©  2011-­‐2012,  Think  Big  AnalyCcs,  All  Rights  Reserved Tuesday, December 4, 12 See http://www.spark-project.org/ Why isn’t it more widely used? 1) lack of commercial support, 2) only recently emerged out of academia.
  • 60. Spark  is  an  AlternaCve  to   Hadoop  MapReduce: • Originally  designed  for   machine  learning   applicaCons. Copyright  ©  2011-­‐2012,  Think  Big  AnalyCcs,  All  Rights  Reserved Tuesday, December 4, 12
  • 61. object WordCountSpark { def main(args: Array[String]) { val file = spark.textFile(args(0)) val counts = file.flatMap(line => line.split("W+"))                    .map(word => (word, 1))                   .reduceByKey(_ + _) counts.saveAsTextFile(args(1)) } } Even more succinct. Note that there are only 3 explicit types required. 61 Tuesday, December 4, 12 This spark example is actually closer in a few details, i.e., function names used, to the original Hadoop Java API example, but it cuts down boilerplate to the bare minimum.
  • 62. #3 It’s  not  suitable  for   “real-­‐Sme” event  processing. Tuesday, December 4, 12 For typical web/enterprise systems, “real-time” is up to 100s of milliseconds, so I’m using the term broadly (but following common practice in this industry). True real-time systems, such as avionics, have much tighter constraints.
  • 64. Storm  implements   reliable,  distributed   “real-­‐Sme” event  processing. Tuesday, December 4, 12 http://storm-project.net/ Created by Nathan Marz, now at Twitter, who also created Cascalog, the Clojure wrapper around Cascading with added Datalog (logic programming) features.
  • 65. Bolt Bolt Spout Bolt Spout Bolt Tuesday, December 4, 12 In Storm terminology, Spouts are data sources and bolts are the event processors. There are facilities to support reliable message handling, various sources encapsulated in Sprouts and various targets of output. Distributed processing is baked in from the start.
  • 67.  SQL    or  NoSQL   Databases? • Since  databases  are  designed  for   fast,  transacConal  updates,   consider  a  database  for                   event  processing. Copyright  ©  2011-­‐2012,  Think  Big  AnalyCcs,  All  Rights  Reserved Tuesday, December 4, 12 Use a SQL database unless you need the scale and looser schema of a NoSQL database!
  • 68. #4 It’s  not  ideal  for   graph  processing. Tuesday, December 4, 12
  • 69.  Google’s  Page  Rank • Google  invented  MapReduce, • …  but  MapReduce  is  not  ideal  for   Page  Rank  and  other  graph   algorithms.   Copyright  ©  2011-­‐2012,  Think  Big  AnalyCcs,  All  Rights  Reserved Tuesday, December 4, 12 Recall that PageRank is the famous algorithm invented by Sergey Brin and Larry Page to index the web. It’s the foundation of Google’s search engine.
  • 70. Why  not  MapReduce? • 1  MR  job  for  each   B A iteraCon  that  updates   D all  n  nodes/edges. C E • Graph  saved  to  disk   F ajer  each  iteraCon. • ... Copyright  ©  2011-­‐2012,  Think  Big  AnalyCcs,  All  Rights  Reserved Tuesday, December 4, 12 The presentation http://www.slideshare.net/shatteredNirvana/pregel-a-system-for- largescale-graph-processing itemizes all the major issues with using MR to implement graph algorithms. In a nutshell, a job with a map and reduce phase is waaay to course-grained...
  • 71. Use  Graph  Processing (SoluCon  #4) 71 Tuesday, December 4, 12 A good summary presentation: http://www.slideshare.net/shatteredNirvana/pregel-a-system-for-largescale-graph-processing
  • 72.  Google’s  Pregel • Pregel:  New  graph  framework  for   Page  Rank. • Bulk,  Synchronous  Parallel  (BSP). • Graphs  are  first-­‐class  ciCzens. • Efficiently  processes  updates...   Copyright  ©  2011-­‐2012,  Think  Big  AnalyCcs,  All  Rights  Reserved Tuesday, December 4, 12 Pregel is intended to replace MR for PageRank, but I’ve heard they haven’t actually been able to switch over to it yet. “Pregel” is the river that runs through the city of Königsberg, Prussia (now called Kaliningrad, Ukraine). 7 bridges crossed the river in the city (including to 5 to 2 islands between river branches). Leonhard Euler invented graph theory when he analyzed the question of whether or not you can cross all 7 bridges without retracing your steps (you can’t).
  • 73.  Open-­‐source   AlternaCves • Apache  Giraph. All are • Apache  Hama. somewhat immature. • Aurelius  Titan. Copyright  ©  2011-­‐2012,  Think  Big  AnalyCcs,  All  Rights  Reserved Tuesday, December 4, 12 http://incubator.apache.org/giraph/ http://hama.apache.org/ http://thinkaurelius.github.com/titan/ None is very mature nor has extensive commercial support.
  • 74. A Manifesto... 74 Tuesday, December 4, 12 To bring this altogether, I think we have opportunities for a better way...
  • 75. Hadoop is the Enterprise Java Beans of our time. Tuesday, December 4, 12 I worked with EJBs a decade ago. The framework was completely invasive into your business logic. There were too many configuration options in XML files. The framework “paradigm” was a poor fit for most problems (like soft real time systems and most algorithms beyond Word Count). Internally, EJB implementations were inefficient and hard to optimize, because they relied on poorly considered object boundaries that muddled more natural boundaries. (I’ve argued in other presentations and my “FP for Java Devs” book that OOP is a poor modularity tool…) The fact is, Hadoop reminds me of EJBs in almost every way. It’s a 1st generation solution that mostly works okay and people do get work done with it, but just as the Spring Framework brought an essential rethinking to Enterprise Java, I think there is an essential rethink that needs to happen in Big Data, specifically around Hadoop. The functional programming community, is well positioned to create it...
  • 76. Stop using Java! Tuesday, December 4, 12 Java has taken us a long way and made many organizations successful over the years. Also, the JVM remains one of our most valuable tools in all of IT. But the language is really wrong for data purposes and its continued use by Big Data vendors is slowing down overall progress, as well as application developer productivity, IMHO. Java emphasizes the wrong abstractions, objects instead of mathematically-inspired functional programming constructs, and Java encourages inflexible bloat because it’s verbose compared to more modern alternatives and objects (at least class-based ones…) are far less reusable and flexible than people realize. They also contribute to focusing on the wrong concepts, structure instead of behavior.
  • 77. Use Functional Languages. 77 Tuesday, December 4, 12 Why is Functional Programming better for Big Data? The work we do with data is inherently mathematical transformations and FP is inspired by math. Hence, it’s naturally a better fit, much more so than object-oriented programming. And, modern languages like Scala, Clojure, Erlang, F#, OCaml, and Haskell are more concise and better at eliminating boilerplate, while still providing excellent performance. Note that one reason SQL has succeeded all these years is because it is also inspired by math, e.g., set theory.
  • 78. Understand Functional Collections. Tuesday, December 4, 12 We already have the right model in the collection APIs that come with functional languages. They are far better engineered for intuitive data transformations. They provide the right abstractions and hide boilerplate. In fact, they make it relatively easy to optimize implementations for parallelization. The Scala collections offer parallelization with a tiny API call. Spark and Cascading transparently distribute collections across a cluster.
  • 79. Erlang, Akka: Actor-based, Distributed Computation Fine Grain Compute Models. Tuesday, December 4, 12 We can start using new, more efficient compute models, like Spark, Pregel, and Impala today. Of course, you have to consider maturity, viability, and support issues in large organizations. So if you want to wait until these alternatives are more mature, then at least use better APIs for Hadoop! For example, Erlang is a very mature language with the Actor model backed in. Akka is a Scala distributed computing model based on the Actor model of concurrency. It exposes clean, low-level primitives for robust, distributed services (e.g., Actors), upon which we can build flexible big data systems that can handle soft real time and batch processing efficiently and with great scalability.
  • 80. Final Thought: 80 Tuesday, December 4, 12 A final thought about Big Data...
  • 81. QuesCons? TechMesh  London  2012 December  5,  2012 @deanwampler dean.wampler@thinkbiganalyCcs.com polyglotprogramming.com/talks 81 Tuesday, December 4, 12 All pictures © Dean Wampler, 2011-2012.