MapReduce Intro




                  The MapReduce Programming Model

                         Introduction and Examples

                       Dr. Jose María Alvarez-Rodríguez

            “Quality Management in Service-based Systems and Cloud
                                Applications”

                               FP7 RELATE-ITN

                         South East European Research Center


                        Thessaloniki, 10th of April, 2013





      1   MapReduce in a nutshell

      2   Thinking in MapReduce

      3   Applying MapReduce

      4   Success Stories with MapReduce

      5   Summary and Conclusions




  MapReduce in a nutshell



 Features




      A programming model...
         1   Large-scale distributed data processing
         2   Simple but restricted
         3   Parallel programming
         4   Extensible







 Antecedents

      Functional programming
         1   MapReduce is inspired by it
         2   ...but is not equivalent to it

      Example in Python
      “Given a list of numbers between 1 and 50, print only the even
      numbers”

                  # Python 3: filter() is lazy, so list() materializes the result;
                  # range(1, 51) includes 50.
                  print(list(filter(lambda x: x % 2 == 0, range(1, 51))))

             A list of numbers (data)
             A condition (even numbers)
             A function, filter, applied to each element of the list (the “map” step)





 ...Other examples...

      Example in Python
      “Return the sum of the squares of a list of numbers between 1 and
      50”

                  import operator
                  from functools import reduce  # reduce is not a builtin in Python 3

                  # map squares each number; reduce folds the squares with +, starting at 0
                  reduce(operator.add, map(lambda x: x ** 2, range(1, 51)), 0)

             “reduce” is equivalent to “foldl” in other functional languages such as
             Haskell
             The mathematical properties of the operator (e.g. associativity) must
             also be taken into account...
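
The point about the kind of operator can be made concrete with a small sketch. It assumes a hypothetical helper, `chunked_reduce`, that reduces each chunk of the list independently (as parallel workers would) and then reduces the partial results. With an associative operator such as addition the chunked and the sequential results agree; with subtraction they do not.

```python
from functools import reduce
import operator


def chunked_reduce(op, values, initial, chunk_size):
    """Reduce each chunk independently, then reduce the partial results."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    partials = [reduce(op, chunk, initial) for chunk in chunks]
    return reduce(op, partials, initial)


squares = [x ** 2 for x in range(1, 51)]

# Addition is associative: splitting the work does not change the answer.
sequential_sum = reduce(operator.add, squares, 0)
chunked_sum = chunked_reduce(operator.add, squares, 0, chunk_size=10)

# Subtraction is not associative: the chunked result differs.
sequential_diff = reduce(operator.sub, squares, 0)
chunked_diff = chunked_reduce(operator.sub, squares, 0, chunk_size=10)
```

This is only an illustration of why the operator matters; it says nothing about how a real framework schedules the chunks.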





 Some interesting points...



      The Map Reduce framework...
         1   Inspired by functional programming concepts (but not
             equivalent)
         2   Problems that can be parallelized
         3   Sometimes recursive solutions
         4   ...







 Basic Model




      “MapReduce: The Programming Model and Practice”, SIGMETRICS, Tutorials 2009, Google.







 Map Function




      Figure: Mapping creates a new output list by applying a function to
      individual elements of an input list.


      “Module 4: MapReduce”, Hadoop Tutorial, Yahoo!.







 Reduce Function




      Figure: Reducing a list iterates over the input values to produce an
      aggregate value as output.


      “Module 4: MapReduce”, Hadoop Tutorial, Yahoo!.






 MapReduce Flow




                              Figure: High-level MapReduce pipeline.


      “Module 4: MapReduce”, Hadoop Tutorial, Yahoo!.
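
The pipeline can be imitated in a single process to see how the phases fit together. The sketch below is an assumption-laden toy, not Hadoop: `run_mapreduce`, `letter_mapper` and `count_reducer` are hypothetical names, and everything runs sequentially in memory.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle and reduce sequentially in one process."""
    # Map: each input record yields zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))
    # Shuffle: group all values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce: collapse each key's values into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


def letter_mapper(line):
    """Emit (letter, 1) for every alphabetic character in the line."""
    return [(ch, 1) for ch in line.lower() if ch.isalpha()]


def count_reducer(key, values):
    """Sum the counts collected for one letter."""
    return sum(values)


letter_counts = run_mapreduce(["Map", "Reduce"], letter_mapper, count_reducer)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; only the shape of the data flow is the same.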




 MapReduce Flow




                       Figure: Detailed Hadoop MapReduce data flow.



 Tip




      What is MapReduce?
      It is a framework inspired by functional programming for tackling
      problems whose steps can be parallelized by applying a
      divide-and-conquer approach.




  Thinking in MapReduce



 When should I use MapReduce?
      Query
              Index and Search: inverted index
              Filtering
              Classification
              Recommendations: clustering or collaborative filtering


      Analytics
              Summarization and statistics
              Sorting and merging
              Frequency distribution
              SQL-based queries: group-by, having, etc.
              Generation of graphics: histograms, scatter plots.


      Others
      Message passing such as Breadth-First Search or PageRank algorithms.




 How Google uses MapReduce (80% of data processing)



             Large-scale web search indexing
             Clustering problems for Google News
             Produce reports for popular queries, e.g. Google Trends
             Processing of satellite imagery data
             Language model processing for statistical machine translation
             Large-scale machine learning problems
             ...







 Comparison of MapReduce and other approaches




      “MapReduce: The Programming Model and Practice”, SIGMETRICS, Tutorials 2009, Google.





 Evaluation of MapReduce and other approaches




      “MapReduce: The Programming Model and Practice”, SIGMETRICS, Tutorials 2009, Google.







 Apache Hadoop



   Definition
   The Apache Hadoop software library is a framework that allows for the
   distributed processing of large data sets across clusters of computers
   using simple programming models.

   Figure: Apache Hadoop Logo.







 Tip



      What can I do in MapReduce?
      Three main functions:
         1   Querying
         2   Summarizing
         3   Analyzing
      . . . large datasets in off-line mode for boosting other on-line
      processes.




  Applying MapReduce



 MapReduce in Action

      MapReduce Patterns
         1   Summarization
         2   Filtering
         3   Data Organization (sort, merging, etc.)
         4   Relational-based (join, selection, projection, etc.)
         5   Iterative Message Passing (graph processing)
         6   Others (depending on the implementation):
                   Simulation of distributed systems
                   Cross-correlation
                   Metapatterns
                   Input-output
                   ...




 Overview (stages)-Counting Letters







 Summarization




      Types
         1   Numerical summarizations
         2   Inverted index
         3   Counting and counters







 Numerical Summarization-I



      Description
      A general pattern for calculating aggregate statistical values over
      your data.

      Intent
      Group records together by a key field and calculate a numerical
      aggregate per group to get a top-level view of the larger data set.







 Numerical Summarization-II


      Applicability
          To deal with numerical data or counting
          To group data by specific fields

      Examples

          1   Word count
          2   Record count
          3   Min/Max/Count
          4   Average/Median/Standard deviation
          5   ...







 Numerical Summarization-Pseudocode


      class Mapper
         method Map(recordid id, record r)
            for all term t in record r do
                Emit(term t, count 1)

      class Reducer
         method Reduce(term t, counts [c1, c2,...])
            sum = 0
            for all count c in [c1, c2,...] do
                sum = sum + c
            Emit(term t, count sum)
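
The pseudocode above can be turned into a runnable sketch. This is a single-process approximation under stated assumptions: `map_record`, `reduce_term` and `word_count` are hypothetical names, terms are split on whitespace, and sorting the pairs by key stands in for the framework's shuffle and sort.

```python
from itertools import groupby
from operator import itemgetter


def map_record(record_id, record):
    """Mapper: emit (term, 1) for every term in the record."""
    for term in record.split():
        yield term, 1


def reduce_term(term, counts):
    """Reducer: sum the counts emitted for one term."""
    total = 0
    for count in counts:
        total += count
    yield term, total


def word_count(records):
    # Map phase over (record id, record) pairs.
    pairs = [kv for rid, rec in enumerate(records) for kv in map_record(rid, rec)]
    # Sorting by key brings equal terms together, mimicking the shuffle/sort.
    pairs.sort(key=itemgetter(0))
    output = []
    for term, group in groupby(pairs, key=itemgetter(0)):
        output.extend(reduce_term(term, (count for _, count in group)))
    return output


result = word_count(["to be or not", "to be"])
# result == [("be", 2), ("not", 1), ("or", 1), ("to", 2)]
```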





 Overview-Word Counter







 Numerical Summarization-Word Counter

                  // Fields assumed to be declared in the enclosing classes:
                  // Text word, and IntWritable one = new IntWritable(1).
                  public void map(LongWritable key, Text value, Context context)
                        throws IOException, InterruptedException {
                      String line = value.toString();
                      StringTokenizer tokenizer = new StringTokenizer(line);
                      // Emit (word, 1) for every whitespace-separated token.
                      while (tokenizer.hasMoreTokens()) {
                          word.set(tokenizer.nextToken());
                          context.write(word, one);
                      }
                  }

                  public void reduce(Text key, Iterable<IntWritable> values,
                        Context context)
                        throws IOException, InterruptedException {
                      // Sum all the counts emitted for this word.
                      int sum = 0;
                      for (IntWritable val : values) {
                          sum += val.get();
                      }
                      context.write(key, new IntWritable(sum));
                  }






 Example-II




      Min/Max
      Given a list of tweets (username, date, text), determine the first and
      last time a user commented and the number of comments.

      Implementation

      See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro







 Overview - Min/Max




      ∗ Min and max creation date are the same in the map phase.



 Example II-Min/Max, function Map


                  public void map(Object key, Text value, Context context)
                        throws IOException, InterruptedException {
                      Map<String, String> parsed = MRDPUtils.parse(value.toString());
                      String strDate = parsed.get(MRDPUtils.CREATION_DATE);
                      String userId = parsed.get(MRDPUtils.USER_ID);
                      if (strDate == null || userId == null) {
                          return; // skip records with missing fields
                      }
                      // map() may only throw IOException and InterruptedException,
                      // so a ParseException has to be handled here.
                      Date creationDate;
                      try {
                          creationDate = MRDPUtils.frmt.parse(strDate);
                      } catch (ParseException e) {
                          return; // skip records with a malformed date
                      }
                      // A single record is both the earliest and the latest seen so far.
                      outTuple.setMin(creationDate);
                      outTuple.setMax(creationDate);
                      outTuple.setCount(1);
                      outUserId.set(userId);
                      context.write(outUserId, outTuple);
                  }







 Example II-Min/Max, function Reduce

                  public void reduce(Text key, Iterable<MinMaxCountTuple> values,
                        Context context) throws IOException, InterruptedException {
                      result.setMin(null);
                      result.setMax(null);
                      int sum = 0;
                      for (MinMaxCountTuple val : values) {
                          // Keep the earliest date seen so far.
                          if (result.getMin() == null
                                || val.getMin().compareTo(result.getMin()) < 0) {
                              result.setMin(val.getMin());
                          }
                          // Keep the latest date seen so far.
                          if (result.getMax() == null
                                || val.getMax().compareTo(result.getMax()) > 0) {
                              result.setMax(val.getMax());
                          }
                          sum += val.getCount();
                      }
                      result.setCount(sum);
                      context.write(key, result);
                  }
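
The same min/max/count logic can be sketched compactly outside Hadoop. The following is an illustrative approximation only: the tweets are made-up sample data, the date format `%Y-%m-%d` is an assumption, and `map_tweet`, `merge` and `min_max_count` are hypothetical names.

```python
from datetime import datetime


def map_tweet(tweet):
    """Emit (user, (min_date, max_date, count)) for one tweet."""
    user, date_text, _text = tweet
    created = datetime.strptime(date_text, "%Y-%m-%d")
    # In the map phase, min and max are both the tweet's own date.
    return user, (created, created, 1)


def merge(a, b):
    """Combine two (min, max, count) tuples into one."""
    return min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2]


def min_max_count(tweets):
    """Group by user and fold each user's tuples with merge."""
    results = {}
    for user, summary in map(map_tweet, tweets):
        results[user] = merge(results[user], summary) if user in results else summary
    return results


tweets = [
    ("alice", "2013-04-10", "hello"),
    ("bob", "2013-04-08", "hi"),
    ("alice", "2013-04-01", "first"),
]
summary = min_max_count(tweets)
```

Because `merge` is associative and commutative, partial results can be merged in any order, which is what makes this pattern easy to distribute.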






 Example-III




      Average
      Given a list of tweets (username, date, text) determine the average
      comment length per hour of day.

      Implementation

      See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro







 Overview - Average







 Example III-Average, function Map


                  public void map(Object key, Text value, Context context)
                        throws IOException, InterruptedException {
                      Map<String, String> parsed =
                            MRDPUtils.parse(value.toString());
                      String strDate = parsed.get(MRDPUtils.CREATION_DATE);
                      String text = parsed.get(MRDPUtils.TEXT);
                      if (strDate == null || text == null) {
                          return; // skip records with missing fields
                      }
                      // map() may only throw IOException and InterruptedException,
                      // so a ParseException has to be handled here.
                      Date creationDate;
                      try {
                          creationDate = MRDPUtils.frmt.parse(strDate);
                      } catch (ParseException e) {
                          return; // skip records with a malformed date
                      }
                      // Key: hour of day. Value: one comment whose "average" is its own length.
                      outHour.set(creationDate.getHours());
                      outCountAverage.setCount(1);
                      outCountAverage.setAverage(text.length());
                      context.write(outHour, outCountAverage);
                  }







 Example III-Average, function Reduce


                  public void reduce(IntWritable key, Iterable<CountAverageTuple> values,
                        Context context) throws IOException, InterruptedException {
                      float sum = 0;
                      float count = 0;
                      for (CountAverageTuple val : values) {
                          // Weight each partial average by the number of comments behind it.
                          sum += val.getCount() * val.getAverage();
                          count += val.getCount();
                      }
                      result.setCount(count);
                      result.setAverage(sum / count);
                      context.write(key, result);
                  }
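
The reducer carries a count next to each average because averages cannot be combined on their own. A minimal sketch, with made-up numbers and a hypothetical helper `merge_averages`, shows the difference between the weighted merge and a naive average of averages.

```python
def merge_averages(partials):
    """Combine (count, average) pairs into one (count, average) pair."""
    total = 0.0
    count = 0
    for part_count, part_average in partials:
        # Recover each partial sum as count * average before adding.
        total += part_count * part_average
        count += part_count
    return count, total / count


# One partial result covers 1 comment, the other covers 3 comments.
partials = [(1, 10.0), (3, 20.0)]
count, average = merge_averages(partials)

# Ignoring the counts treats both partials as equally important.
naive = sum(avg for _, avg in partials) / len(partials)
```

Here the weighted merge gives 17.5 over 4 comments, while the naive version gives 15.0.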







 Numerical Summarization-Other approaches

      Relation to SQL
                  SELECT MIN(numcol1), MAX(numcol1),
                  COUNT(*) FROM table GROUP BY groupcol2;

      Implementation in PIG

                  b = GROUP a BY groupcol2;
                  c = FOREACH b GENERATE group, MIN(a.numcol1),
                  MAX(a.numcol1), COUNT_STAR(a);
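
To try the SQL form of the pattern without a cluster, the query can be run against an in-memory SQLite database. This is a sketch under assumptions: the table name `records` and the sample rows are hypothetical (the slide's `table` is a placeholder), and the grouping column is added to the SELECT list so each row is labelled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (groupcol2 TEXT, numcol1 INTEGER)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [("a", 3), ("a", 7), ("b", 5)],
)
# One output row per group: the numerical summarization pattern in SQL.
rows = conn.execute(
    "SELECT groupcol2, MIN(numcol1), MAX(numcol1), COUNT(*) "
    "FROM records GROUP BY groupcol2 ORDER BY groupcol2"
).fetchall()
conn.close()
```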







 Filtering




      Types
         1   Filtering
         2   Top N records
         3   Bloom filtering
         4   Distinct







 Filtering-I



      Description
      It evaluates each record separately and decides, based on some
      condition, whether it should stay or go.

      Intent
      Filter out records that are not of interest and keep ones that are.







 Filtering-II


      Applicability
      To collate data

      Examples

          1   Closer view of dataset
          2   Data cleansing
          3   Tracking a thread of events
          4   Simple random sampling
          5   Distributed Grep
          6   Removing low-scoring data
          7   Log Analysis
          8   Data Querying
          9   Data Validation
         10 . . .







 Filtering-Pseudocode


      class Mapper
         method Map(recordid id, record r)
            field f = extract(r)
            if predicate (f)
               Emit(recordid id, value(r))

      class Reducer
         method Reduce(recordid id, values [r1, r2,...])
            // Optional: aggregate the records that passed the filter
            Emit(recordid id, aggregate (values))
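
The mapper half of this pseudocode is enough for plain filtering. A minimal single-process sketch follows, assuming hypothetical names `filter_mapper` and `run_filter`, with the record's position in the input used as its id.

```python
def filter_mapper(record_id, record, predicate):
    """Emit the record unchanged when the predicate accepts it."""
    if predicate(record):
        yield record_id, record


def run_filter(records, predicate):
    """Apply the mapper to every record and collect what it emits."""
    kept = []
    for record_id, record in enumerate(records):
        kept.extend(filter_mapper(record_id, record, predicate))
    return kept


kept = run_filter([4, 11, 8, 15], lambda value: value > 9)
# kept == [(1, 11), (3, 15)]
```

Each record is judged on its own, so the work can be split across mappers without any coordination.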






 Example-IV




      Distributed Grep
      Given a list of tweets (username, date, text) determine the tweets
      that contain a word.

      Implementation

      See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro







 Overview - Distributed Grep







 Example IV-Distributed Grep, function Map


                  public void map(Object key, Text value, Context context)
                        throws IOException, InterruptedException {
                      Map<String, String> parsed =
                            MRDPUtils.parse(value.toString());
                      String txt = parsed.get(MRDPUtils.TEXT);
                      if (txt == null) {
                          return; // nothing to match against
                      }
                      // Wrap the configured word in word boundaries (\b).
                      String mapRegex = ".*\\b" + context.getConfiguration()
                            .get("mapregex") + "(.)*\\b.*";
                      if (txt.matches(mapRegex)) {
                          context.write(NullWritable.get(), value);
                      }
                  }


      ...and the Reduce function?

      In this case no reducer is needed: the values emitted by the mapper are written directly to the output.
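
A map-only grep can be sketched in a few lines. Note the assumptions: `grep_mapper` is a hypothetical name, the tweets are made-up sample data, and the search word is matched as a whole word with `re.escape`, which is stricter than the regular expression built in the Java mapper.

```python
import re


def grep_mapper(tweet, word):
    """Emit the tweet only if its text contains `word` as a whole word."""
    _user, _date, text = tweet
    pattern = r"\b" + re.escape(word) + r"\b"
    if text is not None and re.search(pattern, text):
        yield tweet


tweets = [
    ("alice", "2013-04-10", "learning hadoop today"),
    ("bob", "2013-04-10", "hadooping is not a word"),
    ("carol", "2013-04-11", None),
]
# No reduce step: the mapper output is the final output.
matches = [hit for tweet in tweets for hit in grep_mapper(tweet, "hadoop")]
```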







 Example-V




      Top 5
      Given a list of tweets (username, date, text), determine the 5 users
      who wrote the longest tweets.

      Implementation

      See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro







 Overview - Top 5







 Example V-Top 5, function Map

                  private TreeMap<Integer, Text> repToRecordMap =
                        new TreeMap<Integer, Text>();

                  public void map(Object key, Text value, Context context)
                        throws IOException, InterruptedException {
                      Map<String, String> parsed =
                            MRDPUtils.parse(value.toString());
                      if (parsed == null) { return; }
                      String userId = parsed.get(MRDPUtils.USER_ID);
                      String text = parsed.get(MRDPUtils.TEXT);
                      if (userId == null || text == null) { return; }
                      // "Reputation" is the tweet length: longer tweets rank higher.
                      repToRecordMap.put(text.length(), new Text(value));
                      // Keep only the MAX_TOP longest tweets seen by this mapper.
                      if (repToRecordMap.size() > MAX_TOP) {
                          repToRecordMap.remove(repToRecordMap.firstKey());
                      }
                      // Nothing is written here: the local top records are emitted
                      // after all input is processed (e.g. in cleanup(), not shown).
                  }






 Example V-Top 5, function Reduce


                  public void reduce(NullWritable key, Iterable<Text> values,
                        Context context) throws IOException, InterruptedException {
                      for (Text value : values) {
                          Map<String, String> parsed =
                                MRDPUtils.parse(value.toString());
                          repToRecordMap.put(parsed.get(MRDPUtils.TEXT).length(),
                                new Text(value));
                          // Drop the shortest tweet once more than MAX_TOP are held.
                          if (repToRecordMap.size() > MAX_TOP) {
                              repToRecordMap.remove(repToRecordMap.firstKey());
                          }
                      }
                      // Write the survivors from longest to shortest.
                      for (Text t : repToRecordMap.descendingMap().values()) {
                          context.write(NullWritable.get(), t);
                      }
                  }
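
The two-stage idea (a local top N per mapper, then a global top N in a single reducer) can be sketched as follows. The sample tweets and the names `local_top` and `global_top` are hypothetical, and `heapq.nlargest` replaces the `TreeMap` used in the Java code, so tweets of equal length do not overwrite each other here.

```python
import heapq


def local_top(tweets, n):
    """Mapper side: keep only the n longest tweets of one input split."""
    return heapq.nlargest(n, tweets, key=lambda tweet: len(tweet[2]))


def global_top(partial_tops, n):
    """Reducer side: merge the local winners and keep the n longest overall."""
    candidates = [tweet for top in partial_tops for tweet in top]
    return heapq.nlargest(n, candidates, key=lambda tweet: len(tweet[2]))


split_a = [("alice", "2013-04-10", "aaaa"), ("bob", "2013-04-10", "bb")]
split_b = [("carol", "2013-04-11", "cccccc"), ("dave", "2013-04-11", "d")]
top2 = global_top([local_top(split_a, 2), local_top(split_b, 2)], 2)
```

Each mapper sends at most n records to the reducer, so the reducer's input stays small however large the data set is.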







 Filtering-Other approaches


      Relation to SQL
                  SELECT * FROM table WHERE colvalue < VALUE;

      Implementation in PIG

                  b = FILTER a BY colvalue < VALUE;







 Tip




      How can I use and run a MapReduce framework?
      You should identify what kind of problem you are addressing and
      apply a design pattern to be implemented in a framework such
      as Apache Hadoop.




  Success Stories with MapReduce



 Tip



      Who is using MapReduce?
      Companies that deal with Big Data analytics problems, such as:
             Cloudera
             Datasalt
             Elasticsearch
             ...







 Apache Hadoop-Related Projects







 More tips


      FAQ
             MapReduce is a framework based on a simple programming
             model
             ...to deal with large datasets in a distributed fashion
             ...scalability, replication, fault tolerance, etc.
             Apache Hadoop is not a database
             New frameworks on top of Hadoop for specific tasks:
             querying, analysis, etc.
             Other similar frameworks: Storm, Signal/Collect, etc.
             ...


  Summary and Conclusions



 Summary







 Conclusions


      What is MapReduce?

      It is a framework inspired by functional programming for tackling problems whose steps can be parallelized
      by applying a divide-and-conquer approach.


      What can I do in MapReduce?

      Three main functions:
          1   Querying
          2   Summarizing
          3   Analyzing
      . . . large datasets in off-line mode for boosting other on-line processes.


      How can I use and run a MapReduce framework?

      You should identify what kind of problem you are addressing and apply a design pattern to be implemented in a
      framework such as Apache Hadoop.
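
      As an illustration of matching a problem to a design pattern, the
      sketch below applies a summarization pattern: computing the
      minimum, maximum and count of a value per key. The record layout
      (key, value) and the helper run_job are assumptions made for this
      example; run_job is a small in-memory stand-in for the framework,
      not a Hadoop API.

```python
from collections import defaultdict


def summarize_map(record):
    """Map: emit (key, (min, max, count)) for one (key, value) record."""
    key, value = record
    return key, (value, value, 1)


def summarize_reduce(key, partials):
    """Reduce: merge the partial (min, max, count) tuples of one key."""
    lo = min(p[0] for p in partials)
    hi = max(p[1] for p in partials)
    n = sum(p[2] for p in partials)
    return key, (lo, hi, n)


def run_job(records, mapper, reducer):
    """In-memory stand-in for a framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        key, value = mapper(record)
        groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in groups.items())


if __name__ == "__main__":
    # Hypothetical input: (user, response time) pairs.
    records = [("alice", 3), ("bob", 7), ("alice", 9), ("bob", 1), ("alice", 5)]
    print(run_job(records, summarize_map, summarize_reduce))
```

      Because the partial tuples can be merged in any order, the same
      reduce function could in principle also pre-aggregate data close
      to each mapper.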







 What’s next?


      ...
             Chaining MapReduce jobs
             Optimization using combiners and parameter tuning (partition
             size, etc.)
             Pipelining with other languages such as Python
             Hadoop in Action: more examples, etc.
             New trending problems (image/video processing)
             Real-time processing
             ...
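
      Two of the items above, combiners and pipelining with Python, can
      be sketched together. Hadoop Streaming passes records to external
      programs as lines of text, by default with a tab between key and
      value, and delivers the reducer input sorted by key. The sketch
      below mimics that contract with plain generators; the simulate
      helper and the choice to reuse the reducer as combiner are
      assumptions made for this example, and in a real job the mapper
      and reducer would be separate scripts reading standard input.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, in the streaming text format."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts per key; input lines must arrive sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Summing is associative and commutative, so the reducer can also act as
# a combiner that pre-aggregates the output of each mapper locally.
combiner = reducer


def simulate(splits):
    """Local stand-in for a cluster: map and combine per split, then reduce."""
    combined = []
    for split in splits:
        combined.extend(combiner(sorted(mapper(split))))
    return list(reducer(sorted(combined)))


if __name__ == "__main__":
    for line in simulate([["big data big"], ["data clusters"]]):
        print(line)
```

      With the combiner, the first split forwards two lines instead of
      three to the reduce step; on large inputs this kind of local
      aggregation reduces the data moved across the network.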



  References



               J. Dean and S. Ghemawat.
               MapReduce: simplified data processing on large clusters.
               Commun. ACM, 51(1):107–113, Jan. 2008.
               J. R. Owens, J. Lentz, and B. Femiano.
               Hadoop Real-World Solutions Cookbook.
               Packt Publishing Ltd, 2013.
               C. Lam.
               Hadoop in Action.
               Manning Publications Co., Greenwich, CT, USA, 1st edition,
               2010.
               J. Lin and C. Dyer.
               Data-intensive text processing with MapReduce.
               In Proceedings of Human Language Technologies: The 2009
               Annual Conference of the North American Chapter of the
               Association for Computational Linguistics, Companion
               Volume: Tutorial Abstracts, NAACL-Tutorials ’09, pages 1–2,
               Stroudsburg, PA, USA, 2009. Association for Computational
               Linguistics.
               D. Miner and A. Shook.
               MapReduce Design Patterns.
               O’Reilly Media, Inc., 2012.
               S. Perera and T. Gunarathne.
               Hadoop MapReduce Cookbook.
               Packt Publishing Ltd, 2013.
               T. White.
               Hadoop: The Definitive Guide.
               O’Reilly Media, Inc., 1st edition, 2009.
               I. H. Witten and E. Frank.
               Data Mining: Practical Machine Learning Tools and Techniques.
               Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
               2005.




Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 

Map/Reduce intro

  • 1. MapReduce Intro. The MapReduce Programming Model: Introduction and Examples. Dr. Jose María Alvarez-Rodríguez. “Quality Management in Service-based Systems and Cloud Applications”, FP7 RELATE-ITN, South East European Research Center. Thessaloniki, 10th of April, 2013.
  • 2. Contents: (1) MapReduce in a nutshell; (2) Thinking in MapReduce; (3) Applying MapReduce; (4) Success Stories with MapReduce; (5) Summary and Conclusions.
  • 3. MapReduce in a nutshell: Features. A programming model... (1) for large-scale distributed data processing; (2) simple but restricted; (3) for parallel programming; (4) extensible.
  • 4. MapReduce in a nutshell: Antecedents. Functional programming: (1) an inspiration, (2) ...but not equivalent. Example in Python: “Given a list of numbers between 1 and 50, print only the even numbers.”
        # Python 3: filter() is lazy, so wrap it in list() before printing.
        print(list(filter(lambda x: x % 2 == 0, range(1, 51))))
    The example combines a list of numbers (data), a condition (even numbers), and a function filter that is applied to the list (map).
  • 6. MapReduce in a nutshell: ...Other examples... Example in Python: “Return the sum of the squares of a list of numbers between 1 and 50.”
        # Python 3: reduce() lives in functools.
        from functools import reduce
        import operator
        reduce(operator.add, map(lambda x: x ** 2, range(1, 51)), 0)
    “reduce” is equivalent to “foldl” in other functional languages such as Haskell. Other mathematical considerations should be taken into account, such as the kind of operator (for example, whether it is associative)...
  • 7. MapReduce in a nutshell: Some interesting points... The MapReduce framework... (1) is inspired by functional programming concepts (but is not equivalent); (2) targets problems that can be parallelized; (3) sometimes leads to recursive solutions; (4) ...
  • 8. MapReduce in a nutshell: Basic Model. Source: “MapReduce: The Programming Model and Practice”, SIGMETRICS Tutorials 2009, Google.
  • 9. MapReduce in a nutshell: Map Function. Figure: Mapping creates a new output list by applying a function to individual elements of an input list. Source: “Module 4: MapReduce”, Hadoop Tutorial, Yahoo!.
  • 10. MapReduce in a nutshell: Reduce Function. Figure: Reducing a list iterates over the input values to produce an aggregate value as output. Source: “Module 4: MapReduce”, Hadoop Tutorial, Yahoo!.
  • 11. MapReduce in a nutshell: MapReduce Flow. Figure: High-level MapReduce pipeline. Source: “Module 4: MapReduce”, Hadoop Tutorial, Yahoo!.
  • 12. MapReduce in a nutshell: MapReduce Flow. Figure: Detailed Hadoop MapReduce data flow.
  • 13. MapReduce in a nutshell: Tip. What is MapReduce? It is a framework inspired by functional programming to tackle problems in which steps can be parallelized by applying a divide-and-conquer approach.
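The map, shuffle and reduce stages described in the slides above can be imitated in a single Python process. The following is only a minimal sketch for building intuition: `run_mapreduce`, `letter_mapper` and `count_reducer` are hypothetical names chosen for this illustration, and nothing here is distributed or tied to the Hadoop API.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Single-process imitation of the map, shuffle and reduce stages."""
    # Map stage: every input record may emit any number of (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))

    # Shuffle stage: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce stage: collapse each group of values into one result per key.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def letter_mapper(line):
    # Emit (letter, 1) for every alphabetic character in the line.
    return [(ch.lower(), 1) for ch in line if ch.isalpha()]


def count_reducer(key, values):
    # Sum the partial counts that were emitted for one letter.
    return sum(values)


letter_counts = run_mapreduce(["Map", "Reduce"], letter_mapper, count_reducer)
print(letter_counts)
```

In a real framework the mapper calls run on different machines and the shuffle moves data over the network; the programmer still writes only the two small functions.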
  • 14. Thinking in MapReduce: When should I use MapReduce? Query: index and search (inverted index); filtering; classification; recommendations (clustering or collaborative filtering). Analytics: summarization and statistics; sorting and merging; frequency distribution; SQL-based queries (group-by, having, etc.); generation of graphics (histograms, scatter plots). Others: message passing, such as Breadth-First Search or PageRank algorithms.
  • 17. Thinking in MapReduce: How Google uses MapReduce (80% of data processing). Large-scale web search indexing; clustering problems for Google News; reports for popular queries, e.g. Google Trends; processing of satellite imagery data; language model processing for statistical machine translation; large-scale machine learning problems; ...
  • 18. Thinking in MapReduce: Comparison of MapReduce and other approaches. Source: “MapReduce: The Programming Model and Practice”, SIGMETRICS Tutorials 2009, Google.
  • 19. Thinking in MapReduce: Evaluation of MapReduce and other approaches. Source: “MapReduce: The Programming Model and Practice”, SIGMETRICS Tutorials 2009, Google.
  • 20. Thinking in MapReduce: Apache Hadoop. MapReduce definition: The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Figure: Apache Hadoop logo.
  • 21. Thinking in MapReduce: Tip. What can I do in MapReduce? Three main functions: (1) querying, (2) summarizing, (3) analyzing... large datasets in off-line mode for boosting other on-line processes.
  • 22. Applying MapReduce: MapReduce in Action. MapReduce patterns: (1) summarization; (2) filtering; (3) data organization (sort, merging, etc.); (4) relational-based (join, selection, projection, etc.); (5) iterative message passing (graph processing); (6) others, depending on the implementation: simulation of distributed systems, cross-correlation, metapatterns, input-output, ...
  • 23. Applying MapReduce: Overview (stages)-Counting Letters.
  • 24. Applying MapReduce: Summarization. Types: (1) numerical summarizations; (2) inverted index; (3) counting and counters.
  • 25. Applying MapReduce: Numerical Summarization-I. Description: a general pattern for calculating aggregate statistical values over your data. Intent: group records together by a key field and calculate a numerical aggregate per group to get a top-level view of the larger data set.
  • 26. Applying MapReduce: Numerical Summarization-II. Applicability: to deal with numerical data or counting; to group data by specific fields. Examples: (1) word count; (2) record count; (3) min/max/count; (4) average/median/standard deviation; (5) ...
  • 27. Applying MapReduce: Numerical Summarization-Pseudocode.
        class Mapper
            method Map(recordid id, record r)
                for all term t in record r do
                    Emit(term t, count 1)
        class Reducer
            method Reduce(term t, counts [c1, c2,...])
                sum = 0
                for all count c in [c1, c2,...] do
                    sum = sum + c
                Emit(term t, count sum)
  • 28. Applying MapReduce: Overview-Word Counter.
  • 29. Applying MapReduce: Numerical Summarization-Word Counter.
        // Map: emit (word, 1) for every token in the input line.
        public void map(LongWritable key, Text value, Context context) throws Exception {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }

        // Reduce: sum all the counts received for one word.
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
  • 30. Applying MapReduce: Example II-Min/Max. Given a list of tweets (username, date, text), determine the first and last time a user commented and the number of comments. Implementation: see https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro
  • 31. Applying MapReduce: Overview - Min/Max. Note (∗): the min and max creation dates are the same in the map phase.
  • 32. Applying MapReduce: Example II-Min/Max, function Map.
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException, ParseException {
            Map<String, String> parsed = MRDPUtils.parse(value.toString());
            String strDate = parsed.get(MRDPUtils.CREATION_DATE);
            String userId = parsed.get(MRDPUtils.USER_ID);
            // Skip records that lack a date or a user.
            if (strDate == null || userId == null) {
                return;
            }
            Date creationDate = MRDPUtils.frmt.parse(strDate);
            // For a single record, min and max are both its creation date.
            outTuple.setMin(creationDate);
            outTuple.setMax(creationDate);
            outTuple.setCount(1);
            outUserId.set(userId);
            context.write(outUserId, outTuple);
        }
  • 33. Applying MapReduce: Example II-Min/Max, function Reduce.
        public void reduce(Text key, Iterable<MinMaxCountTuple> values, Context context)
                throws IOException, InterruptedException {
            result.setMin(null);
            result.setMax(null);
            int sum = 0;
            for (MinMaxCountTuple val : values) {
                // Keep the earliest date seen so far.
                if (result.getMin() == null || val.getMin().compareTo(result.getMin()) < 0) {
                    result.setMin(val.getMin());
                }
                // Keep the latest date seen so far.
                if (result.getMax() == null || val.getMax().compareTo(result.getMax()) > 0) {
                    result.setMax(val.getMax());
                }
                sum += val.getCount();
            }
            result.setCount(sum);
            context.write(key, result);
        }
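The Min/Max example above boils down to merging (min, max, count) tuples. The sketch below restates that idea in plain Python; `map_tweet`, `merge` and the sample tweets are made up for illustration and are not taken from the linked repository. Because `merge` is associative and commutative, the same function could in principle also act as a combiner on partial results.

```python
from datetime import datetime
from functools import reduce


def map_tweet(user, created_at):
    # In the map phase, min and max are both the tweet's own creation date.
    return user, (created_at, created_at, 1)


def merge(a, b):
    # Combine two (min, max, count) tuples into one.
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2])


# Hypothetical sample data: (username, creation date).
tweets = [
    ("ana", datetime(2013, 4, 10, 9, 0)),
    ("ana", datetime(2013, 4, 8, 18, 30)),
    ("bob", datetime(2013, 4, 9, 12, 0)),
    ("ana", datetime(2013, 4, 11, 7, 15)),
]

# Shuffle: group the mapped tuples by user.
grouped = {}
for user, created_at in tweets:
    key, value = map_tweet(user, created_at)
    grouped.setdefault(key, []).append(value)

# Reduce: fold each user's tuples into a single (first, last, count) summary.
summary = {user: reduce(merge, values) for user, values in grouped.items()}
print(summary)
```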
  • 34. Applying MapReduce: Example III-Average. Given a list of tweets (username, date, text), determine the average comment length per hour of day. Implementation: see https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro
  • 35. Applying MapReduce: Overview - Average.
  • 36. Applying MapReduce: Example III-Average, function Map.
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException, ParseException {
            Map<String, String> parsed = MRDPUtils.parse(value.toString());
            String strDate = parsed.get(MRDPUtils.CREATION_DATE);
            String text = parsed.get(MRDPUtils.TEXT);
            // Skip records that lack a date or a text.
            if (strDate == null || text == null) {
                return;
            }
            Date creationDate = MRDPUtils.frmt.parse(strDate);
            // Key: hour of day. Value: (count = 1, average = length of this tweet).
            outHour.set(creationDate.getHours());
            outCountAverage.setCount(1);
            outCountAverage.setAverage(text.length());
            context.write(outHour, outCountAverage);
        }
  • 37. Applying MapReduce: Example III-Average, function Reduce.
        public void reduce(IntWritable key, Iterable<CountAverageTuple> values, Context context)
                throws IOException, InterruptedException {
            float sum = 0;
            float count = 0;
            for (CountAverageTuple val : values) {
                // Weight each partial average by the number of records behind it.
                sum += val.getCount() * val.getAverage();
                count += val.getCount();
            }
            result.setCount(count);
            result.setAverage(sum / count);
            context.write(key, result);
        }
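The reducer above carries a count next to each average, which is what lets partial results be merged correctly. A small Python sketch, with hypothetical names and made-up tweet texts, shows the difference between merging weighted (count, average) tuples and naively averaging the averages.

```python
def map_tweet(hour, text):
    # Emit (hour, (count, average)); for one tweet the average is its own length.
    return hour, (1, float(len(text)))


def merge_averages(tuples):
    # Weight every partial average by its count, as the Java reducer does.
    total = 0.0
    count = 0
    for c, avg in tuples:
        total += c * avg
        count += c
    return (count, total / count)


# Two hypothetical partial groups for the same hour (9 o'clock).
values_a = [map_tweet(9, "a" * 10)[1], map_tweet(9, "b" * 20)[1]]
values_b = [map_tweet(9, "c" * 60)[1]]

partial_a = merge_averages(values_a)  # two tweets
partial_b = merge_averages(values_b)  # one tweet

# Merging the partial tuples gives the same result as one pass over all tweets.
combined = merge_averages([partial_a, partial_b])
single_pass = merge_averages(values_a + values_b)

# Averaging the two averages directly ignores the counts and is biased.
naive = (partial_a[1] + partial_b[1]) / 2
print(combined, single_pass, naive)
```

Keeping the count in the value is a design choice that makes the merge step safe to apply to already-merged partial results.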
  • 38. Applying MapReduce: Numerical Summarization-Other approaches. Relation to SQL:
        SELECT MIN(numcol1), MAX(numcol1), COUNT(*) FROM table GROUP BY groupcol2;
    Implementation in Pig:
        b = GROUP a BY groupcol2;
        c = FOREACH b GENERATE group, MIN(a.numcol1), MAX(a.numcol1), COUNT_STAR(a);
  • 40. Applying MapReduce: Filtering. Types: (1) filtering; (2) top N records; (3) Bloom filtering; (4) distinct.
  • 41. Applying MapReduce: Filtering-I. Description: it evaluates each record separately and decides, based on some condition, whether it should stay or go. Intent: filter out records that are not of interest and keep the ones that are.
  • 42. Applying MapReduce: Filtering-II. Applicability: to collate data. Examples: (1) closer view of a dataset; (2) data cleansing; (3) tracking a thread of events; (4) simple random sampling; (5) distributed grep; (6) removing low-scoring data; (7) log analysis; (8) data querying; (9) data validation; (10) ...
  • 43. Applying MapReduce: Filtering-Pseudocode.
        class Mapper
            method Map(recordid id, record r)
                field f = extract(r)
                if predicate(f)
                    Emit(recordid id, value(r))
        class Reducer
            method Reduce(recordid id, values [r1, r2,...])
                // Whatever
                Emit(recordid id, aggregate(values))
  • 44. Applying MapReduce: Example IV-Distributed Grep. Given a list of tweets (username, date, text), determine the tweets that contain a word. Implementation: see https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro
  • 45. Applying MapReduce: Overview - Distributed Grep.
  • 46. Applying MapReduce: Example IV-Distributed Grep, function Map.
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            Map<String, String> parsed = MRDPUtils.parse(value.toString());
            String txt = parsed.get(MRDPUtils.TEXT);
            // Build the pattern from the "mapregex" job configuration property.
            String mapRegex = ".*\\b" + context.getConfiguration().get("mapregex") + "(.)*\\b.*";
            if (txt.matches(mapRegex)) {
                context.write(NullWritable.get(), value);
            }
        }
    ...and the Reduce function? In this case it is not necessary, and output values are written directly to the output.
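A map-only filter such as the distributed grep above can be sketched in Python as a function that either emits a record or emits nothing. This is an illustrative simplification with hypothetical names and sample tweets: it matches the search term as a whole word and ignores case, which is stricter than the prefix-style pattern built in the Java mapper.

```python
import re


def make_grep_mapper(word):
    # Whole-word, case-insensitive match (an assumption of this sketch).
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)

    def mapper(line):
        # Emit the record unchanged when it matches; otherwise emit nothing.
        if pattern.search(line):
            yield line

    return mapper


# Hypothetical tweet texts.
tweets = [
    "Learning Hadoop today",
    "MapReduce is fun",
    "hadoop, pig and hive",
    "I keep hadooping along",
]

grep = make_grep_mapper("hadoop")
# No reduce step: the mapper output is the final output.
matches = [out for line in tweets for out in grep(line)]
print(matches)
```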
  • 47. Applying MapReduce: Example V-Top 5. Given a list of tweets (username, date, text), determine the 5 users that wrote the longest tweets. Implementation: see https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro
  • 48. Applying MapReduce: Overview - Top 5.
  • 49. Applying MapReduce: Example V-Top 5, function Map.
        private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            Map<String, String> parsed = MRDPUtils.parse(value.toString());
            if (parsed == null) {
                return;
            }
            String userId = parsed.get(MRDPUtils.USER_ID);
            // Max reputation if you write longer tweets.
            String reputation = String.valueOf(parsed.get(MRDPUtils.TEXT).length());
            if (userId == null || reputation == null) {
                return;
            }
            repToRecordMap.put(Integer.parseInt(reputation), new Text(value));
            // Keep only the MAX_TOP largest keys by dropping the smallest one.
            if (repToRecordMap.size() > MAX_TOP) {
                repToRecordMap.remove(repToRecordMap.firstKey());
            }
        }
  • 50. Applying MapReduce: Example V-Top 5, function Reduce.
        public void reduce(NullWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                Map<String, String> parsed = MRDPUtils.parse(value.toString());
                repToRecordMap.put(parsed.get(MRDPUtils.TEXT).length(), new Text(value));
                // Keep only the MAX_TOP largest keys by dropping the smallest one.
                if (repToRecordMap.size() > MAX_TOP) {
                    repToRecordMap.remove(repToRecordMap.firstKey());
                }
            }
            // Write the records from longest to shortest.
            for (Text t : repToRecordMap.descendingMap().values()) {
                context.write(NullWritable.get(), t);
            }
        }
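The top-N pattern above works in two steps: each mapper keeps only its local top N, and a single reducer merges those short lists. A minimal Python sketch of that idea follows; the function names and input splits are hypothetical, and `heapq.nlargest` stands in for the `TreeMap` bookkeeping. Unlike a `TreeMap` keyed by length, this version does not overwrite records that share the same length.

```python
import heapq


def top_n_mapper(records, n):
    # Each mapper keeps only its local top n records, ranked by text length.
    return heapq.nlargest(n, records, key=len)


def top_n_reducer(local_tops, n):
    # The single reducer merges the local winners and keeps the global top n.
    candidates = [record for top in local_tops for record in top]
    return heapq.nlargest(n, candidates, key=len)


# Two hypothetical input splits, each handled by a different mapper.
split_a = ["aa", "a", "aaaaa"]
split_b = ["bbbb", "bbb", "bbbbbb"]

local_tops = [top_n_mapper(split_a, 2), top_n_mapper(split_b, 2)]
top2 = top_n_reducer(local_tops, 2)
print(top2)
```

Sending only N records per mapper keeps the data that reaches the reducer small, regardless of the size of the input.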
  • 51. Applying MapReduce: Filtering-Other approaches. Relation to SQL:
        SELECT * FROM table WHERE colvalue < VALUE;
    Implementation in Pig:
        b = FILTER a BY colvalue < VALUE;
  • 53. Applying MapReduce: Tip. How can I use and run a MapReduce framework? You should identify what kind of problem you are addressing and apply a design pattern to be implemented in a framework such as Apache Hadoop.
  • 54. Success Stories with MapReduce: Tip. Who is using MapReduce? All companies that are dealing with Big Data problems for analytics, such as Cloudera, Datasalt, Elasticsearch, ...
  • 55. Success Stories with MapReduce: Apache Hadoop-Related Projects.
  • 56. Success Stories with MapReduce: More tips (FAQ). MapReduce is a framework based on a simple programming model... to deal with large datasets in a distributed fashion... with scalability, replication, fault tolerance, etc. Apache Hadoop is not a database. New frameworks run on top of Hadoop for specific tasks: querying, analysis, etc. Other similar frameworks: Storm, Signal/Collect, etc. ...
  • 57. Summary and Conclusions: Summary.
  • 58. Summary and Conclusions: Conclusions. What is MapReduce? It is a framework inspired by functional programming to tackle problems in which steps can be parallelized by applying a divide-and-conquer approach. What can I do in MapReduce? Three main functions: (1) querying, (2) summarizing, (3) analyzing... large datasets in off-line mode for boosting other on-line processes. How can I use and run a MapReduce framework? You should identify what kind of problem you are addressing and apply a design pattern to be implemented in a framework such as Apache Hadoop.
  • 61. Summary and Conclusions: What’s next? ... Concatenating MapReduce jobs; optimization using combiners and parameter settings (size of partition, etc.); pipelining with other languages such as Python; Hadoop in Action: more examples, etc.; new trending problems (image/video processing); real-time processing; ...
  • 62. References.
        J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107–113, Jan. 2008.
        J. R. Owens, J. Lentz, and B. Femiano. Hadoop Real-World Solutions Cookbook. Packt Publishing Ltd, 2013.
        C. Lam. Hadoop in Action. Manning Publications Co., Greenwich, CT, USA, 1st edition, 2010.
        J. Lin and C. Dyer. Data-intensive text processing with MapReduce. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Tutorial Abstracts, NAACL-Tutorials ’09, pages 1–2, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.
        D. Miner and A. Shook. MapReduce Design Patterns. O’Reilly Media, Inc., 2012.
        S. Perera and T. Gunarathne. Hadoop MapReduce Cookbook. Packt Publishing Ltd, 2013.
        T. White. Hadoop: The Definitive Guide. O’Reilly Media, Inc., 1st edition, 2009.
        I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.