Building Scale-Free Applications with Hadoop and Cascading

Chris K Wensel
Concurrent, Inc.

Thursday, May 28, 2009
Introduction

Chris K Wensel
chris@wensel.net

• Cascading, lead developer
  • http://cascading.org/
• Concurrent, Inc., founder
  • Hadoop/Cascading support and tools
  • http://concurrentinc.com
• Scale Unlimited, principal instructor
  • Hadoop-related training and consulting
  • http://scaleunlimited.com
Topics

• Cascading
• MapReduce
• Hadoop

A rapid introduction to Hadoop, MapReduce patterns, and best practices with Cascading.
What is Hadoop?




Conceptually

(Diagram: a cluster presenting a single Execution layer on top of a single Storage layer.)

• Single FileSystem namespace
• Single execution space
CPU, Disk, Rack, and Node

(Diagram: a cluster is made of racks, each rack holds nodes, and each node contributes CPU to the Execution layer and disk to the Storage layer.)

• Virtualization of storage and execution resources
• Normalizes disks and CPU
• Rack and node aware
• Data locality and fault tolerance
Storage

(Diagram: one file, two blocks large, with each block replicated on 3 nodes across racks.)

• Files are chunked into blocks
• Blocks are independently replicated
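To make the block abstraction concrete, here is a minimal sketch (not from the deck) that asks HDFS where a file's blocks and their replicas live, using the stock FileSystem API of this era; the path is illustrative:

  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.JobConf;

  public class ShowBlocks
    {
    public static void main( String[] args ) throws Exception
      {
      // connect to the default (cluster) filesystem
      FileSystem fs = FileSystem.get( new JobConf() );
      FileStatus status = fs.getFileStatus( new Path( "/data/some.log" ) ); // illustrative path

      // one BlockLocation per block, each listing the hosts holding a replica
      for( BlockLocation block : fs.getFileBlockLocations( status, 0, status.getLen() ) )
        System.out.println( block.getOffset() + " -> " + java.util.Arrays.toString( block.getHosts() ) );
      }
    }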
Execution

(Diagram: map and reduce tasks spread over the nodes in each rack.)

• Map and Reduce tasks move to the data
• Or to available processing slots
Clients and Jobs

(Diagram: users run clients outside the cluster; each client launches jobs into the cluster’s Execution layer.)

• Clients launch jobs into the cluster
• One at a time if there are dependencies
• Clients are not managed by Hadoop
Physical Architecture

(Diagram: a Client submits jobs to the JobTracker, which hands tasks to TaskTrackers; each TaskTracker spawns child JVMs running mappers and reducers. Namespace operations go to the NameNode, backed by a Secondary NameNode, while reads and writes go directly to DataNodes, which manage the data blocks.)

• Solid boxes are unique applications
• Dashed boxes are child JVM instances on the same node as their parent
• Dotted boxes are blocks of managed files on the same node as their parent
Deployment

(Diagram: the Client, JobTracker, NameNode, and Secondary NameNode each run on their own node, though it is not uncommon for the NameNode and Secondary to share one. Every worker node runs a TaskTracker and a DataNode.)

• TaskTracker and DataNode are collocated
Why Hadoop?




Why It’s Adopted

• Big Data: a hard problem
• PaaS: Platform as a Service, wide virtualization
How It’s Used

• Ad hoc queries: querying and sampling
• Processing and integration
Cascading and ...?

                              Big Data     PaaS
  Ad hoc queries              Pig/Hive     ?
  Processing / integration    Cascading    Cascading

• Designed for processing / integration
• OK for ad hoc queries (an API, not a syntax)
• Should you use Hadoop for small queries?
ShareThis.com

• Primarily integration and processing (the Big Data quadrant)
• Running on AWS (the PaaS quadrant)
• Ad hoc queries (mining) performed on Aster Data nCluster, not Hadoop
ShareThis in AWS

(Diagram: inside Amazon Web Services, fleets of loggers feed a set of Hadoop/Cascading clusters: a logprocessor, the bixo web crawler, a content indexer, and more, coordinated through SQS. Data is kept in S3 and flows out to HyperTable, Katta, and AsterData.)
Staged Pipeline

(Diagram: the logprocessor runs every X hours, the bixo crawler every Y hours, and the indexer on bixo completion.)

• Each cluster
  • is “on demand”
  • is “right sized” to its load
  • is tuned to boot at the optimum frequency
ShareThis: Best Practices

• Authoritative/original data is kept in S3 in its native format
• Derived data is pipelined into relevant systems when possible
• Clusters are tuned to their load for best utilization
• Loose coupling keeps change cases isolated
• GC’ing Hadoop clusters (booting them on demand and discarding them) is not a bad idea
Thinking In MapReduce




It’s really Map-Group-Reduce

• Map and Reduce are user-defined functions
• Map translates input Keys and Values to new Keys and Values:

      [K1,V1] -> Map -> [K2,V2]

• The system sorts the Keys and groups each unique Key with all of its Values:

      [K2,V2] -> Group -> [K2,{V2,V2,...}]

• Reduce translates the Values of each unique Key to new Keys and Values:

      [K2,{V2,V2,...}] -> Reduce -> [K3,V3]
Word Count

• Read a line of text, output every word:

      [0, "when in the course of human events"] -> Map -> ["when",1] ["in",1] ["the",1] [...,1]

• Group all the values with each unique word:

      ["when",1] ["when",1] ["when",1] ["when",1] ["when",1] -> Group -> ["when",{1,1,1,1,1}]

• Add up the values associated with each unique word:

      ["when",{1,1,1,1,1}] -> Reduce -> ["when",5]
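For reference, a minimal sketch of Word Count against the same org.apache.hadoop.mapred API used later in this deck; the class names and the whitespace tokenizer are illustrative, not from the slides:

  import java.io.IOException;
  import java.util.Iterator;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  // Map: [offset, line] -> [word, 1] for every word in the line
  public class WordCountMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
    {
    private static final IntWritable ONE = new IntWritable( 1 );
    private Text word = new Text();

    public void map( LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter ) throws IOException
      {
      for( String token : value.toString().split( "\\s+" ) )
        {
        if( token.length() == 0 )
          continue;

        word.set( token );
        output.collect( word, ONE ); // emit [K2,V2] = [word, 1]
        }
      }
    }

  // Reduce: [word, {1,1,...}] -> [word, sum]
  class WordCountReduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
    {
    public void reduce( Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter ) throws IOException
      {
      int sum = 0;

      while( values.hasNext() )
        sum += values.next().get();

      output.collect( key, new IntWritable( sum ) ); // emit [K3,V3] = [word, count]
      }
    }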
Know That...

• 1 Job = 1 Map impl + 1 Reduce impl
• Map:
  • is required in a Hadoop Job
  • may emit 0 or more Key/Value pairs [K2,V2]
• Reduce:
  • is optional in a Hadoop Job
  • sees Keys in sort order
  • but Keys will be randomly distributed across Reducers
  • the collection of Values per Key is unordered [K2,{V2,...}]
  • may emit 0 or more Key/Value pairs [K3,V3]
Runtime Distribution

      [K1,V1] -> Map -> [K2,V2] -> Group -> [K2,{V2,...}] -> Reduce -> [K3,V3]

(Diagram: each block of the input file becomes a split feeding one Mapper task; the Shuffle phase routes every Mapper’s output to the Reducer tasks, and each Reducer writes one part-0000N file into the output directory.)

• Mappers must complete before Reducers can begin
Common Patterns

• Filtering or “Grepping”
• Parsing, Conversion
• Counting, Summing
• Binning, Collating
• Distributed Tasks
• Simple Total Sorting
• Chained Jobs
Advanced Patterns

• Group By
• Distinct
• Secondary Sort
• CoGrouping / Joining
• Distributed Total Sort
For Example: Secondary Sort

  Mapper:  [K1,V1] -> <A1,B1,C1,D1> -> Map -> <A2,B2><C2> -> K2, <D2> -> V2 -> [K2,V2]
  Reducer: [K2,{V2,V2,...}] -> <A2,B2,{<C2,D2>,...}> -> Reduce -> <A3,B3> -> K3, <C3,D3> -> V3 -> [K3,V3]

• The K2 key becomes a composite Key
  • Key: [grouping, secondary], Value: [remaining values]
• Shuffle phase (sketched below)
  • Custom Partitioner, must only partition on the grouping fields
  • Standard ‘output key’ Comparator, must compare on all fields
  • Custom ‘value grouping’ Comparator, must compare only the grouping fields
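A minimal sketch of the two custom shuffle-phase pieces, assuming a Text composite key laid out as "group:secondary"; the key format and class names are illustrative, and the standard Text comparator already serves as the ‘output key’ comparator since it sorts on the whole key:

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.io.WritableComparator;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  // partition on the grouping half only, so every secondary key of a group
  // lands on the same Reducer
  public class GroupPartitioner implements Partitioner<Text, Text>
    {
    public void configure( JobConf job ) { }

    public int getPartition( Text key, Text value, int numPartitions )
      {
      String group = key.toString().split( ":" )[ 0 ];
      return ( group.hashCode() & Integer.MAX_VALUE ) % numPartitions;
      }
    }

  // group values by the grouping half only; the full composite key still
  // drives the sort, so values arrive in secondary-key order
  class GroupingComparator extends WritableComparator
    {
    protected GroupingComparator() { super( Text.class, true ); }

    public int compare( WritableComparable a, WritableComparable b )
      {
      String groupA = a.toString().split( ":" )[ 0 ];
      String groupB = b.toString().split( ":" )[ 0 ];
      return groupA.compareTo( groupB );
      }
    }

These would be wired in with jobConf.setPartitionerClass( GroupPartitioner.class ) and jobConf.setOutputValueGroupingComparator( GroupingComparator.class ).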
Very Advanced

• Classification
• Clustering
• Regression
• Dimension Reduction
• Evolutionary Algorithms
Thinking in MapReduce

(Diagram: a chain of two jobs: [k,v]1 -> Map -> [k,[v]]2 -> Reduce -> [k,v]3 -> File -> Map -> [k,[v]]4 -> Reduce -> [k,v]5 -> File.)

  [ k, v ]   = key and value pair
  [ k, [v] ] = key and associated values collection

• It’s not just about Keys and Values
• And not just about Map and Reduce
Real World Apps

(Diagram: the dependency graph of one real application: 1 app, 75 chained MapReduce jobs.)

  green  = map + reduce
  purple = map
  blue   = join/merge
  orange = map split
Not Thinking in MapReduce

(Diagram: a Source tap feeds a chain of Pipes, each passing tuples [f1,f2,...]1 through [f1,f2,...]5 into a Sink tap.)

  [ f1, f2, ... ] = tuples with field names

• It’s easier with Fields and Tuples
• And Sources/Sinks and Pipes
(Diagram: the same kind of application expressed in Cascading: a graph of Dfs taps, Each/RegexSplitter and Each/Identity operations, CoGroups, GroupBys, and Every/Aggregator steps: 1 app, 12 jobs.)

  green = source or sink
  blue  = pipe + operation
Cascading




Cascading

• A simpler alternative API to MapReduce
  • Fields and Tuples
  • Standard ‘relational’ operations
• Simplifies development of reusable
  • data operations,
  • logical process definitions, and
  • integration end-points
• Can be used from any JVM-based language
Some Sample Code

http://github.com/cwensel/cascading.samples

• hadoop
• logparser
• loganalysis

(bit.ly links on the next pages)
Raw MapReduce: Parsing a Log File

> read one key/value at a time from inputPath
> using some regular expression...
> parse value into regular expression groups
> save results as a TAB-delimited row to the outputPath

http://bit.ly/RegexMap
http://bit.ly/RegexMain
http://bit.ly/RegexMap

public class RegexParserMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text>
  {
  private Pattern pattern;
  private Matcher matcher;

  @Override
  public void configure( JobConf job )
    {
    pattern = Pattern.compile( job.get( "logparser.regex" ) );
    matcher = pattern.matcher( "" ); // lets re-use the matcher
    }

  @Override
  public void map( LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter ) throws IOException
    {
    matcher.reset( value.toString() );

    if( !matcher.find() )
      throw new RuntimeException( "could not match pattern: [" + pattern + "] with value: [" + value + "]" );

    StringBuffer buffer = new StringBuffer();

    for( int i = 0; i < matcher.groupCount(); i++ )
      {
      if( i != 0 )
        buffer.append( "\t" );

      buffer.append( matcher.group( i + 1 ) ); // skip group 0
      }

    // pass null so a TAB is not prepended, not all OutputFormats accept null
    output.collect( null, new Text( buffer.toString() ) );
    }
  }
http://bit.ly/RegexMain

public class Main
  {
  public static void main( String[] args ) throws IOException
    {
    // create Hadoop path instances
    Path inputPath = new Path( args[ 0 ] );
    Path outputPath = new Path( args[ 1 ] );

    // get the FileSystem instance for the output path
    FileSystem outputFS = outputPath.getFileSystem( new JobConf() );

    // if output path exists, delete recursively
    if( outputFS.exists( outputPath ) )
      outputFS.delete( outputPath, true );

    // initialize Hadoop job configuration
    JobConf jobConf = new JobConf();
    jobConf.setJobName( "logparser" );

    // set the current job jar
    jobConf.setJarByClass( Main.class );

    // set the input path and input format
    TextInputFormat.setInputPaths( jobConf, inputPath );
    jobConf.setInputFormat( TextInputFormat.class );

    // set the output path and output format
    TextOutputFormat.setOutputPath( jobConf, outputPath );
    jobConf.setOutputFormat( TextOutputFormat.class );
    jobConf.setOutputKeyClass( Text.class );
    jobConf.setOutputValueClass( Text.class );

    // must set to zero since we have no reduce function
    jobConf.setNumReduceTasks( 0 );

    // configure our parsing map class
    jobConf.setMapperClass( RegexParserMap.class );
    String apacheRegex = "^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +\"([^ ]*) ([^ ]*) [^ ]*\" ([^ ]*) ([^ ]*).*$";
    jobConf.set( "logparser.regex", apacheRegex );

    // create Hadoop client, must pass in this JobConf for some reason
    JobClient jobClient = new JobClient( jobConf );

    // submit job
    RunningJob runningJob = jobClient.submitJob( jobConf );

    // block until job completes
    runningJob.waitForCompletion();
    }
  }
Cascading: Parsing a Log File

> read one “line” at a time from “source”
> using some regular expression...
> parse “line” into “ip, time, method, event, status, size”
> save as a TAB-delimited row to the “sink”

http://bit.ly/LPMain
http://bit.ly/LPMain

public class Main
  {
  public static void main( String[] args )
    {
    String inputPath = args[ 0 ];
    String outputPath = args[ 1 ];

    // define what the input file looks like, "offset" is bytes from beginning
    TextLine scheme = new TextLine( new Fields( "offset", "line" ) );

    // create SOURCE tap to read a resource from the local file system, if input is not a URL
    Tap logTap = inputPath.matches( "^[^:]+://.*" ) ? new Hfs( scheme, inputPath ) : new Lfs( scheme, inputPath );

    // create an assembly to parse an Apache log file and store on an HDFS cluster

    // declare the field names we will parse out of the log file
    Fields apacheFields = new Fields( "ip", "time", "method", "event", "status", "size" );

    // define the regular expression to parse the log file with
    String apacheRegex = "^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +\"([^ ]*) ([^ ]*) [^ ]*\" ([^ ]*) ([^ ]*).*$";

    // declare the groups from the above regex we want to keep. each regex group will be given
    // a field name from 'apacheFields', above, respectively
    int[] allGroups = {1, 2, 3, 4, 5, 6};

    // create the parser
    RegexParser parser = new RegexParser( apacheFields, apacheRegex, allGroups );

    // create the import pipe element, with the name 'import', and with the input argument named "line"
    // replace the incoming tuple with the parser results
    Pipe importPipe = new Each( "import", new Fields( "line" ), parser, Fields.RESULTS );

    // create a SINK tap to write to the default filesystem
    // by default, TextLine writes all fields out
    Tap remoteLogTap = new Hfs( new TextLine(), outputPath, SinkMode.REPLACE );

    // set the current job jar
    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass( properties, Main.class );

    // connect the assembly to the SOURCE and SINK taps
    Flow parsedLogFlow = new FlowConnector( properties ).connect( logTap, remoteLogTap, importPipe );

    // optionally print out the parsedLogFlow to a DOT file for import into a graphics package
    // parsedLogFlow.writeDOT( "logparser.dot" );

    // start execution of the flow (either locally or on the cluster)
    parsedLogFlow.start();

    // block until the flow completes
    parsedLogFlow.complete();
    }
  }
What Changes When...

• We want to sort by the “status” field?
• We want to filter out some “urls”?
• We want to sum the “size”?
• Do all of the above?

• We are thinking in ‘fields’, not Key/Values (see the sketch below)
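For instance, summing “size” is just extra pipe elements appended to the logparser assembly above; a minimal sketch against the Cascading 1.x API used in this deck (the grouping choice and field names are illustrative):

  // continuing from the logparser assembly's importPipe, above:
  // group the parsed tuples by "status", then sum "size" within each group,
  // declaring the result as a "total" field
  Pipe sumPipe = new Pipe( "sum", importPipe );
  sumPipe = new GroupBy( sumPipe, new Fields( "status" ) );
  sumPipe = new Every( sumPipe, new Fields( "size" ), new Sum( new Fields( "total" ) ) );

Since GroupBy also sorts on its grouping fields, the same element covers the sort case, and an Each with a Filter would handle dropping unwanted “urls”.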
Cascading: Arrival Rate

> ..parse data as before..
> read parsed logs
> calculate “urls” per second
> calculate “urls” per minute
> save “sec interval” & “count” as a TAB’d row to “sink1”
> save “min interval” & “count” as a TAB’d row to “sink2”

http://bit.ly/LAMain
http://bit.ly/LAMain

Arrival Rate Flow

(Flow diagram: an Hfs SequenceFile tap of the parsed logs ['ip', 'time', 'method', 'event', 'status', 'size'] feeds Each('arrival rate') with a DateParser declaring 'ts'; the stream then splits into GroupBy('tsCount') on 'ts' with Every[Count -> 'count'] written to output/arrivalrate/sec, and Each with an ExpressionFunction declaring 'tm' followed by GroupBy('tmCount') on 'tm' with Every[Count -> 'count'] written to output/arrivalrate/min.)
// set the current job jar
                   Properties properties = new Properties ();


                                                                                                                              http://bit.ly/LAMain
                   FlowConnector .setApplicationJarClass ( properties , Main .class );

FlowConnector flowConnector = new FlowConnector( properties );
CascadeConnector cascadeConnector = new CascadeConnector();

String inputPath = args[ 0 ];
String logsPath = args[ 1 ] + "/logs/";
String arrivalRatePath = args[ 1 ] + "/arrivalrate/";
String arrivalRateSecPath = arrivalRatePath + "sec";
String arrivalRateMinPath = arrivalRatePath + "min";

// create an assembly to import an Apache log file and store on DFS
// declares: "ip", "time", "method", "event", "status", "size"
Fields apacheFields = new Fields( "ip", "time", "method", "event", "status", "size" );
String apacheRegex = "^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +\"([^ ]*) ([^ ]*) [^ ]*\" ([^ ]*) ([^ ]*).*$";
int[] apacheGroups = {1, 2, 3, 4, 5, 6};
RegexParser parser = new RegexParser( apacheFields, apacheRegex, apacheGroups );
Pipe importPipe = new Each( "import", new Fields( "line" ), parser );

// create tap to read a resource from the local file system, if not a URL for an external resource
// Lfs allows for relative paths
Tap logTap =
  inputPath.matches( "^[^:]+://.*" ) ? new Hfs( new TextLine(), inputPath ) : new Lfs( new TextLine(), inputPath );
// create a tap to read/write from the default filesystem
Tap parsedLogTap = new Hfs( apacheFields, logsPath );

// connect the assembly to source and sink taps
Flow importLogFlow = flowConnector.connect( logTap, parsedLogTap, importPipe );

// create an assembly to parse out the time field into a timestamp
// then count the number of requests per second and per minute

// apply a text parser to create a timestamp with 'second' granularity
// declares field "ts"
DateParser dateParser = new DateParser( new Fields( "ts" ), "dd/MMM/yyyy:HH:mm:ss Z" );
Pipe tsPipe = new Each( "arrival rate", new Fields( "time" ), dateParser, Fields.RESULTS );

// name the per second assembly and split on tsPipe
Pipe tsCountPipe = new Pipe( "tsCount", tsPipe );
tsCountPipe = new GroupBy( tsCountPipe, new Fields( "ts" ) );
tsCountPipe = new Every( tsCountPipe, Fields.GROUP, new Count() );

// apply an expression to create a timestamp with 'minute' granularity
// declares field "tm"
Pipe tmPipe = new Each( tsPipe, new ExpressionFunction( new Fields( "tm" ), "ts - (ts % (60 * 1000))", long.class ) );

// name the per minute assembly and split on tmPipe
Pipe tmCountPipe = new Pipe( "tmCount", tmPipe );
tmCountPipe = new GroupBy( tmCountPipe, new Fields( "tm" ) );
tmCountPipe = new Every( tmCountPipe, Fields.GROUP, new Count() );

// create taps to write the results to the default filesystem, using the given fields
Tap tsSinkTap = new Hfs( new TextLine(), arrivalRateSecPath );
Tap tmSinkTap = new Hfs( new TextLine(), arrivalRateMinPath );

// a convenience method for binding taps and pipes, order is significant
Map<String, Tap> sinks = Cascades.tapsMap( Pipe.pipes( tsCountPipe, tmCountPipe ), Tap.taps( tsSinkTap, tmSinkTap ) );

// connect the assembly to the source and sink taps
Flow arrivalRateFlow = flowConnector.connect( parsedLogTap, sinks, tsCountPipe, tmCountPipe );

// optionally print out the arrivalRateFlow to a graph file for import into a graphics package
//arrivalRateFlow.writeDOT( "arrivalrate.dot" );

// connect the flows by their dependencies, order is not significant
Cascade cascade = cascadeConnector.connect( importLogFlow, arrivalRateFlow );

// execute the cascade, which in turn executes each flow in dependency order
cascade.complete();
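
For reference, this is the 'loganalysis' sample; it would be launched with Hadoop's standard jar runner, taking the input path and output directory as its two arguments. The jar name and paths here are illustrative, not taken from the slide:

  hadoop jar loganalysis.jar data/apache.200.txt output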
                                                                                                                                                           Copyright Concurrent, Inc. 2009. All rights reserved
Thursday, May 28, 2009
Cascading Model
                  [diagram: the Cascading object model]

          • Pipe types: Each, GroupBy, CoGroup, Every, and SubAssembly
          • CoGroup accepts a Joiner: InnerJoin, OuterJoin, LeftJoin,
            RightJoin, MixedJoin, or a custom implementation
          • Operation types: Function, Filter, Aggregator, Buffer, and Assertion
          • Each applies Functions, Filters, and value Assertions;
            Every applies Aggregators, Buffers, and group Assertions
          • Tap implementations: Hdfs, JDBC, HBase, or custom
          • A TupleEntry binds Fields (names) to a Tuple (values)
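
To make the Operation side concrete, here is a minimal sketch, not from the original slides, of a custom Filter against the Cascading 1.x Operation API; the class name and the meaning of the "status" field are invented for illustration:

  import cascading.flow.FlowProcess;
  import cascading.operation.BaseOperation;
  import cascading.operation.Filter;
  import cascading.operation.FilterCall;

  // discard tuples whose "status" field is below 400, keeping only error responses
  public class ErrorStatusFilter extends BaseOperation implements Filter
    {
    public boolean isRemove( FlowProcess flowProcess, FilterCall filterCall )
      {
      int status = Integer.parseInt( filterCall.getArguments().getString( "status" ) );
      return status < 400; // returning true removes the tuple from the stream
      }
    }

It would be applied with an Each pipe, for example:
pipe = new Each( pipe, new Fields( "status" ), new ErrorStatusFilter() );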


                                                                                                                                                                  Copyright Concurrent, Inc. 2009. All rights reserved
Thursday, May 28, 2009
Assembling Applications
          [diagram: a source Tap feeding an Each (function or filter),
           a Group, and an Every (aggregator or buffer) into a sink Tap,
           with a trap Tap attached to catch failures; a second variant
           shows a buffer in place of the aggregator]

          • Pipes can be arranged into arbitrary assemblies (a sketch follows below)
          • Some limitations are imposed by the Operations themselves
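
As a minimal sketch of such an assembly, not on the original slide, using the Cascading 1.x API; the pipe name, field names, and pattern are invented:

  import cascading.operation.aggregator.Count;
  import cascading.operation.regex.RegexFilter;
  import cascading.pipe.Each;
  import cascading.pipe.Every;
  import cascading.pipe.GroupBy;
  import cascading.pipe.Pipe;
  import cascading.tuple.Fields;

  // head of the assembly; the name binds it to a source Tap at connect time
  Pipe assembly = new Pipe( "events" );

  // filter: keep only lines containing "GET"
  assembly = new Each( assembly, new Fields( "line" ), new RegexFilter( ".*GET.*" ) );

  // group on "line", then count the occurrences within each group
  assembly = new GroupBy( assembly, new Fields( "line" ) );
  assembly = new Every( assembly, Fields.GROUP, new Count() );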
                                                                                Copyright Concurrent, Inc. 2009. All rights reserved
Thursday, May 28, 2009
Assembly -> Flow
          [diagram: a pipe assembly of functions, a CoGroup, and a GroupBy
           is bound to source and sink Taps, producing a Flow]
                    • Assemblies are planned into Flows
                                                                                   Copyright Concurrent, Inc. 2009. All rights reserved
Thursday, May 28, 2009
Flow Planner
          [diagram: the planned Flow split at the CoGroup and GroupBy into
           two MapReduce jobs, with an intermediate temp file between them]
                    • The Planner handles how operations
                      are partitioned into MapReduce jobs

                                                                                  Copyright Concurrent, Inc. 2009. All rights reserved
Thursday, May 28, 2009
Assembly -> Cluster

          [diagram: on the client, an Assembly is planned into a Flow,
           which submits a sequence of Jobs to the cluster]




                                                Copyright Concurrent, Inc. 2009. All rights reserved
Thursday, May 28, 2009
Topological Scheduler
          [diagram: a Cascade managing four dependent Flows]

                  • Cascade will manage many Flows
                  • Scheduled in dependency order
                  • Will run only ‘stale’ Flows (pluggable behavior)
                  • Custom MapReduce jobs can participate in Cascade

                                                                Copyright Concurrent, Inc. 2009. All rights reserved
Thursday, May 28, 2009
Heterogeneous Inputs
          [diagram: a single map stage reading from both a TextLine (Text)
           source and a SequenceFile (Seq) source]

               • A single mapper can read data from multiple
                 InputFormats
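
A minimal sketch of this, not on the original slide, using the Cascading 1.x API: two Taps with different schemes feed branches that merge in a single GroupBy. The paths and field name are invented, and sinkTap is assumed defined elsewhere:

  import java.util.HashMap;
  import java.util.Map;

  import cascading.flow.Flow;
  import cascading.flow.FlowConnector;
  import cascading.pipe.GroupBy;
  import cascading.pipe.Pipe;
  import cascading.scheme.SequenceFile;
  import cascading.scheme.TextLine;
  import cascading.tap.Hfs;
  import cascading.tap.Tap;
  import cascading.tuple.Fields;

  // two sources with different storage formats
  Tap textTap = new Hfs( new TextLine( new Fields( "line" ) ), "input/raw-text" );
  Tap seqTap = new Hfs( new SequenceFile( new Fields( "line" ) ), "input/parsed-seq" );

  Pipe textPipe = new Pipe( "text" );
  Pipe seqPipe = new Pipe( "seq" );

  // merge both branches on the shared "line" field
  Pipe merged = new GroupBy( Pipe.pipes( textPipe, seqPipe ), new Fields( "line" ) );

  // bind each named branch to its source tap
  Map<String, Tap> sources = new HashMap<String, Tap>();
  sources.put( "text", textTap );
  sources.put( "seq", seqTap );

  Flow flow = new FlowConnector().connect( sources, sinkTap, merged );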

                                                     Copyright Concurrent, Inc. 2009. All rights reserved
Thursday, May 28, 2009
Dynamic Outputs

          [diagram: a reduce whose Sink writes to ./2008/december/part-0000x,
           ./2008/november/part-0000x, ./2008/october/part-0000x, ...]

                   • Sink to output folders named from Tuple values
                    • Works Map side too (take care with the number of files created)
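
A sketch of one way to do this, assuming the TemplateTap that ships with Cascading 1.x; the exact constructor and path-template semantics are my reading of that API, and the paths are invented:

  import cascading.scheme.TextLine;
  import cascading.tap.Hfs;
  import cascading.tap.Tap;
  import cascading.tap.TemplateTap;

  // parent location for all partitioned output
  Hfs parent = new Hfs( new TextLine(), "output/by-month" );

  // the tuple's values fill the %s/%s template, yielding paths such as
  // output/by-month/2008/december/part-0000x
  Tap sink = new TemplateTap( parent, "%s/%s" );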


                                                               Copyright Concurrent, Inc. 2009. All rights reserved
Thursday, May 28, 2009
Stream Assertions
          [diagram: 'assert' operations inlined at several points of a
           pipe assembly, between the source Taps and the sink Tap]
           • Unit and Regression tests for Flows
           • Planner removes ‘strict’, ‘validating’, or all assertions
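
A sketch of inlining a strict assertion and then planning assertions out at run-time; the setAssertionLevel property call reflects my reading of the Cascading 1.x API and should be treated as an assumption:

  import java.util.Properties;

  import cascading.flow.FlowConnector;
  import cascading.operation.AssertionLevel;
  import cascading.operation.assertion.AssertNotNull;
  import cascading.pipe.Each;
  import cascading.pipe.Pipe;

  Pipe pipe = new Pipe( "events" );

  // fail fast if any argument value is null; tagged STRICT so the
  // planner may remove it outside of testing
  pipe = new Each( pipe, AssertionLevel.STRICT, new AssertNotNull() );

  // for a production run, plan all assertions out of the Flow
  Properties properties = new Properties();
  FlowConnector.setAssertionLevel( properties, AssertionLevel.NONE );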
                                                                                   Copyright Concurrent, Inc. 2009. All rights reserved
Thursday, May 28, 2009
Failure Traps
          [diagram: trap Taps bound to branches of the pipe assembly;
           tuples that cause an Operation or Assertion to fail are
           diverted to the traps while the Flow continues]

              • Catch data causing Operations or Assertions to fail
              • Allows processes to continue without data loss
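
A sketch of binding a trap, not on the original slide; the trap path and branch name are invented, and the trap-aware connect overload is my reading of the Cascading 1.x API. The sources, sinks, properties, and importPipe are assumed defined as in the earlier example:

  import java.util.HashMap;
  import java.util.Map;

  import cascading.flow.Flow;
  import cascading.flow.FlowConnector;
  import cascading.scheme.TextLine;
  import cascading.tap.Hfs;
  import cascading.tap.Tap;

  // failed tuples from the pipe branch named "import" land here
  Tap trapTap = new Hfs( new TextLine(), "output/traps/import" );

  Map<String, Tap> traps = new HashMap<String, Tap>();
  traps.put( "import", trapTap );

  Flow flow = new FlowConnector( properties ).connect( sources, sinks, traps, importPipe );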
                                                                                    Copyright Concurrent, Inc. 2009. All rights reserved
Thursday, May 28, 2009
Event Notification


                    • Flow Listeners for common events
                         •   start
                         •   complete
                         •   failed
                         •   stopped
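
A minimal listener sketch, not on the original slide, following the Cascading 1.x FlowListener interface, where the 'failed' event corresponds to onThrowable:

  import cascading.flow.Flow;
  import cascading.flow.FlowListener;

  public class LoggingFlowListener implements FlowListener
    {
    public void onStarting( Flow flow ) { System.out.println( "starting: " + flow.getName() ); }
    public void onStopping( Flow flow ) { System.out.println( "stopping: " + flow.getName() ); }
    public void onCompleted( Flow flow ) { System.out.println( "completed: " + flow.getName() ); }

    public boolean onThrowable( Flow flow, Throwable throwable )
      {
      throwable.printStackTrace();
      return false; // false means the exception is rethrown by the Flow
      }
    }

It is registered with flow.addListener( new LoggingFlowListener() );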



                                                     Copyright Concurrent, Inc. 2009. All rights reserved
Thursday, May 28, 2009
Best Practices:
                               Developing
              • Organize Flows as medium/small workloads
                    • Improves reusability and work-sharing

              • Small/Medium-grained Operations
                    • Improves reusability

              • Coarse-grained Operations
                    • Better performance vs flexibility

              • SubAssemblies encapsulate responsibilities (see the sketch below)
                    • Separate “rules” from “integration”

              • Solve the problem first, optimize second
                    • Can replace Flows with raw MapReduce
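
A sketch of such a SubAssembly, not on the original slide, using the Cascading 1.x SubAssembly base class; the class name is invented:

  import cascading.operation.aggregator.Count;
  import cascading.pipe.Every;
  import cascading.pipe.GroupBy;
  import cascading.pipe.Pipe;
  import cascading.pipe.SubAssembly;
  import cascading.tuple.Fields;

  // a reusable "count by fields" rule, kept separate from integration code
  public class CountBy extends SubAssembly
    {
    public CountBy( Pipe pipe, Fields groupFields )
      {
      pipe = new GroupBy( pipe, groupFields );
      pipe = new Every( pipe, Fields.GROUP, new Count() );

      setTails( pipe ); // register the tail of this assembly
      }
    }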

                                                              Copyright Concurrent, Inc. 2009. All rights reserved
Thursday, May 28, 2009
Best Practices:
                                  Testing

               • Unit tests
                      • Operations tested independently of Hadoop or Map/Reduce (see the sketch below)
               • Regression / Integration Testing
                     • Use built-in testing cluster for small data
                     • Use EC2 for large regression data sets
               • Inline “strict” Assertions into Assemblies
                     • Planner can remove them at run-time
                     • SubAssemblies tested independently of other logic
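
A unit-test sketch, not on the original slide, assuming the CascadingTestCase helper and its invokeFunction method from the Cascading test utilities; treat both as assumptions about that API:

  import cascading.CascadingTestCase;
  import cascading.operation.text.DateParser;
  import cascading.tuple.Fields;
  import cascading.tuple.Tuple;
  import cascading.tuple.TupleEntry;
  import cascading.tuple.TupleListCollector;

  public class DateParserTest extends CascadingTestCase
    {
    public void testParse()
      {
      DateParser parser = new DateParser( new Fields( "ts" ), "dd/MMM/yyyy:HH:mm:ss Z" );
      TupleEntry arguments =
        new TupleEntry( new Fields( "time" ), new Tuple( "01/Sep/2007:00:01:03 +0000" ) );

      // runs the Operation directly; no cluster or MapReduce involved
      TupleListCollector collector = invokeFunction( parser, arguments, new Fields( "ts" ) );

      int count = 0;
      for( Tuple tuple : collector )
        count++;

      assertEquals( 1, count );
      }
    }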



                                                                     Copyright Concurrent, Inc. 2009. All rights reserved
Thursday, May 28, 2009
Best Practices:
                            Data Processing
             • Account for good/bad data
                   • Split good/bad data by rules
                   • Capture rule reasons with “bad” data

             • Inline “validating” Assertions
                   • For staging or untrusted data-sources
                   • Again, can be “planned” out for performance at run-time

             • Capture really bad data through “traps”
                    • Depending on the fidelity required by the app

             • Use s3n:// before s3://
                    • s3:// stores data in a proprietary block format; never use it as the default FS
                   • Store compressed, in chunks (1 file per Mapper)
                                                                    Copyright Concurrent, Inc. 2009. All rights reserved
Thursday, May 28, 2009
Cascading


                 • Open-Source / GPLv3
                 • 1.0 since January
                 • Growing community
                 • User contributed modules
                 • Elastic MapReduce




                                                   Copyright Concurrent, Inc. 2009. All rights reserved
Thursday, May 28, 2009
1.1

                    • Flow level counters
                    • Fields.SWAP and Fields.REPLACE
                    • “Safe” Operations
                    • Hadoop “globbing” support
                    • Detailed job statistics


                                                  Copyright Concurrent, Inc. 2009. All rights reserved
Thursday, May 28, 2009
Upcoming

                    • ACM Cloud PDS - June 6th
                         • http://bit.ly/U9x2q

                    • ScaleCamp - June 9th
                         • http://scalecamp.eventbrite.com/

                    • Hadoop Summit - June 10th
                         • http://hadoopsummit09.eventbrite.com/


                                                         Copyright Concurrent, Inc. 2009. All rights reserved
Thursday, May 28, 2009
Resources
                    • Cascading
                     • http://cascading.org


                    • Enterprise tools and support
                     • http://concurrentinc.com


                    • Professional Services and Training
                     • http://scaleunlimited.com
                                                     Copyright Concurrent, Inc. 2009. All rights reserved
Thursday, May 28, 2009


SAM SIG: Hadoop architecture, MapReduce patterns, and best practices with Cascading

  • 1. Building Scale-Free Applications with Hadoop and Cascading Chris K Wensel Concurrent, Inc. Thursday, May 28, 2009
  • 2. Introduction Chris K Wensel chris@wensel.net • Cascading, Lead developer • http://cascading.org/ • Concurrent, Inc., Founder • Hadoop/Cascading support and tools • http://concurrentinc.com • Scale Unlimited, Principal Instructor • Hadoop related Training and Consulting • http://scaleunlimited.com Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 3. Topics Cascading MapReduce Hadoop A rapid introduction to Hadoop, MapReduce patterns, and best practices with Cascading. Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 4. What is Hadoop? Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 5. Conceptually Cluster Execution Storage • Single FileSystem Namespace • Single Execution-Space Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 6. CPU, Disk, Rack, and Node Cluster Rack Rack Rack Node Node Node Node ... cpu Execution disk Storage • Virtualization of storage and execution resources • Normalizes disks and CPU • Rack and node aware • Data locality and fault tolerance Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 7. Storage Rack Rack Rack Node Node Node Node ... block1 block1 block1 block2 block2 block2 One file, two blocks large, 3 nodes replicated • Files chunked into blocks • Blocks independently replicated Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 8. Execution Rack Rack Rack Node Node Node Node ... map map map map map reduce reduce reduce • Map and Reduce tasks move to data • Or available processing slots Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 9. Clients and Jobs Users Cluster Client Client Execution job job job Storage Client • Clients launch jobs into the cluster • One at a time if there are dependencies • Clients are not managed by Hadoop Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 10. Physical Architecture NameNode DataNode DataNode DataNode DataNode data block ns read/write operations Secondary ns operations read/write ns operations read/write mapper mapper child jvm mapper child jvm jobs tasks child jvm Client JobTracker TaskTracker reducer reducer child jvm reducer child jvm child jvm • Solid boxes are unique applications • Dashed boxes are child JVM instances on same node as parent • Dotted boxes are blocks of managed files on same node as parent Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 11. Deployment Node Node Node jobs tasks Client JobTracker TaskTracker DataNode Node NameNode Not uncommon to Node be same node Secondary • TaskTracker and DataNode collocated Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 12. Why Hadoop? Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 13. Why It’s Adopted Big Data PaaS Big Data Platform as a Service Hard Problem Wide Virtualization Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 14. How It’s Used Ad hoc Queries Querying / Sampling Processing Integration Integration / Processing Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 15. Cascading and ...? Big Data PaaS Ad hoc Pig/Hive ? Queries Processing Cascading Cascading Integration • Designed for Processing / Integration • OK for ad-hoc queries (API, not Syntax) • Should you use Hadoop for small queries? Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 16. ShareThis.com Processing Big Data Integration • Primarily integration and processing • Running on AWS Ad hoc PaaS Queries • Ad-hoc queries (mining) performed on Aster Data nCluster, not Hadoop Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 17. ShareThis in AWS bixo web content logprocessor ... crawler indexer Amazon Web Services SQS AsterData loggers loggers loggers Hadoop loggers Hadoop Hadoop Cascading Cascading Cascading Katta Hyper S3 Table Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 18. Staged Pipeline every Y hrs on bixo completion every X hrs logprocessor bixo indexer crawler ... ... • Each cluster • is “on demand” • is “right sized” to load • is tuned to boot at optimum frequency Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 19. ShareThis: Best Practices • Authoritative/Original data kept in S3 in its native format • Derived data pipelined into relevant systems, when possible • Clusters tuned to load for best utilization • Loose coupling keeps change cases isolated • GC’ing Hadoop not a bad idea Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 20. Thinking In MapReduce Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 21. It’s really Map-Group-Reduce • Map and Reduce are user defined functions • Map translates input Keys and Values to new Keys and Values [K1,V1] Map [K2,V2] • System sorts the keys, and groups each unique Key with all its Values [K2,V2] Group [K2,{V2,V2,....}] • Reduce translates the Values of each unique Key to new Keys and Values [K2,{V2,V2,....}] Reduce [K3,V3] Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 22. Word Count • Read a line of text, output every word [0, "when in the course of human events"] Map ["when",1] ["in",1] ["the",1] [...,1] • Group all the values with each unique word ["when",1] ["when",1] ["when",1] ["when",1] Group ["when",{1,1,1,1,1}] ["when",1] • Add up the values associated with each unique word ["when",{1,1,1,1,1}] Reduce ["when",5] Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 23. Know That... • 1 Job = 1 Map impl + 1 Reduce impl • Map: • is required in a Hadoop Job • may emit 0 or more Key Value pairs [K2,V2] • Reduce: • is optional in a Hadoop Job • sees Keys in sort order • but Keys will be randomly distributed across Reducers • the collection of Values per Key are unordered [K2,{V2..}] • may emit 0 or more Key Value pairs [K3,V3] Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 24. Runtime Distribution [K1,V1] Map [K2,V2] Group [K2,{V2,...}] Reduce [K3,V3] Mapper Task Mapper Reducer Shuffle Task Task Mapper Reducer Shuffle Task Task Mapper Reducer Shuffle Task Task Mapper Task Mappers must complete before Reducers can begin split1 split2 split3 split4 ... part-00000 part-00001 part-000N file directory block1 block2 blockN Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 25. Common Patterns • Filtering or “Grepping” • Distributed Tasks • Parsing, Conversion • Simple Total Sorting • Counting, Summing • Chained Jobs • Binning, Collating Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 26. Advanced Patterns • Group By • CoGrouping / Joining • Distinct • Distributed Total Sort • Secondary Sort Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 27. For Example: Secondary Sort Mapper Reducer [K1,V1] [K2,{V2,V2,....}] [K1,V1] -> <A1,B1,C1,D1> [K2,V2] -> <A2,B2,{<C2,D2>,...}> Map Reduce <A2,B2><C2> -> K2, <D2> -> V2 <A3,B3> -> K3, <C3,D3> -> V3 [K2,V2] [K3,V3] • The K2 key becomes a composite Key • Key: [grouping, secondary], Value: [remaining values] • Shuffle phase • Custom Partitioner, must only partition on grouping Fields • Standard ‘output key’ Comparator, must compare on all Fields • Custom ‘value grouping’ Comparator, must compare grouping Fields Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 28. Very Advanced • Classification • Dimension Reduction • Clustering • Evolutionary • Regression Algorithms Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 29. Thinking in MapReduce [ k, [v] ]2 [ k, [v] ]4 Map Reduce Map Reduce [ k, v ]1 [ k, v ]3 [ k, v ]3 [ k, v ]5 File File File [ k, v ] = key and value pair [ k, [v] ] = key and associated values collection • It’s not just about Keys and Values • And not just about Map and Reduce Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 30. Real World Apps [37/75] map+reduce [54/75] map+reduce [41/75] map+reduce [43/75] map+reduce [42/75] map+reduce [45/75] map+reduce [44/75] map+reduce [39/75] map+reduce [36/75] map+reduce [46/75] map+reduce [40/75] map+reduce [50/75] map+reduce [38/75] map+reduce [49/75] map+reduce [51/75] map+reduce [47/75] map+reduce [52/75] map+reduce [53/75] map+reduce [48/75] map+reduce [23/75] map+reduce [25/75] map+reduce [24/75] map+reduce [27/75] map+reduce [26/75] map+reduce [21/75] map+reduce [19/75] map+reduce [28/75] map+reduce [22/75] map+reduce [32/75] map+reduce [20/75] map+reduce [31/75] map+reduce [33/75] map+reduce [29/75] map+reduce [34/75] map+reduce [35/75] map+reduce [30/75] map+reduce [7/75] map+reduce [2/75] map+reduce [8/75] map+reduce [10/75] map+reduce [9/75] map+reduce [5/75] map+reduce [3/75] map+reduce [11/75] map+reduce [6/75] map+reduce [13/75] map+reduce [4/75] map+reduce [16/75] map+reduce [14/75] map+reduce [15/75] map+reduce [17/75] map+reduce [18/75] map+reduce [12/75] map+reduce [60/75] map [62/75] map [61/75] map [58/75] map [55/75] map [56/75] map+reduce [57/75] map [71/75] map [72/75] map [59/75] map [64/75] map+reduce [63/75] map+reduce [65/75] map+reduce [68/75] map+reduce [67/75] map+reduce [70/75] map+reduce [69/75] map+reduce [73/75] map+reduce [66/75] map+reduce [74/75] map+reduce [75/75] map+reduce [1/75] map+reduce 1 app, 75 jobs green = map + reduce purple = map blue = join/merge orange = map split Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 31. Not Thinking in MapReduce [ f1,f2,.. ]2 [ f1,f2,.. ]3 [ f1,f2,.. ]4 Pipe Pipe Pipe Pipe [ f1,f2,.. ]5 [ f1,f2,.. ]1 Source Sink [ f1, f2,... ] = tuples with field names • Its easier with Fields and Tuples • And Source/Sinks and Pipes Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 32. Dfs['TextLine[['offset', 'line']->[ALL]]']['/data/']'] Dfs['TextLine[['offset', 'line']->[ALL]]']['/data/']'] ['', ''] ['', ''] Dfs['TextLine[['offset', 'line']->[ALL]]']['/data/']'] ['', ''] ['', ''] ['', ''] Each('')[RegexSplitter[decl:'', ''][args:1]] Each('')[RegexSplitter[decl:'', '', '', ''][args:1]] ['', ''] ['', ''] ['', ''] Each('')[RegexSplitter[decl:'', '', ''][args:1]] ['', ''] ['', ''] ['', ''] ['', ''] ['', ''] ['', ''] ['', ''] CoGroup('')[by:a:['']b:['']] ['', ''] a[''],b[''] ['', ''] ['', '', '', ''] ['', ''] Each('')[Identity[decl:'', '', '']] ['', ''] ['', '', ''] ['', ''] ['', '', ''] ['', '', ''] ['', ''] ['', '', ''] ['', ''] CoGroup('')[by:a:['']b:['']] ['', '', ''] ['', '', ''] ['', ''] ['', '', ''] a[''],b[''] ['', ''] ['', '', ''] ['', '', '', '', ''] Each('')[Identity[decl:'', '', '', '']] ['', '', '', ''] ['', '', '', ''] ['', '', '', ''] ['', '', '', ''] ['', '', '', ''] ['', '', '', ''] CoGroup('')[by:a:['']b:['']] CoGroup('')[by:a:['']b:['']] a[''],b[''] a[''],b[''] ['', '', '', '', '', '', ''] ['', '', '', '', '', ''] Each('')[Identity[decl:'', '', '', '']] ['', '', '', '', '', '', ''] ['', '', '', ''] ['', '', '', '', '', '', ''] ['', '', '', ''] CoGroup('')[by:a:['']b:['']] a[''],b[''] ['', '', '', ''] ['', '', '', '', '', '', '', '', ''] ['', '', '', ''] Each('')[Identity[decl:'', '']] CoGroup('')[by:a:['']b:['']] ['', ''] a[''],b[''] ['', ''] ['', '', '', '', '', '', ''] Each('')[Identity[decl:'', '', '']] ['', ''] ['', '', ''] ['', ''] ['', '', ''] GroupBy('')[by:['']] a[''] ['', '', ''] ['', ''] ['', '', ''] Every('')[Agregator[decl:'']] CoGroup('')[by:a:['']b:['']] 1 app, 12 jobs a[''],b[''] ['', '', '', '', ''] ['', ''] ['', ''] Each('')[Identity[decl:'', '', '']] ['', '', ''] Dfs['TextLine[['offset', 'line']->[ALL]]']['/data/']'] ['', '', ''] ['', '', ''] ['', '', ''] GroupBy('')[by:['']] green = source or sink a[''] ['', '', ''] Every('')[Agregator[decl:'', '']] blue = pipe+operation ['', '', ''] ['', '', ''] Dfs['TextLine[['offset', 'line']->[ALL]]']['/data/']'] Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 33. Cascading Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 34. Cascading • A simpler alternative API to MapReduce • Fields and Tuples • Standard ‘relational’ operations • Simplifies development of reusable • data operations, • logical process definitions • integration end-points, and • Can be used by any JVM based language Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 35. Some Sample Code http://github.com/cwensel/cascading.samples • hadoop • logparser • loganalysis bit.ly links on next pages Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 36. Raw MapReduce: Parsing a Log File > read one key/value at a time from inputPath > using some regular expression... > parse value into regular expression groups > save results as a TAB delimited row to the outputPath http://bit.ly/RegexMap http://bit.ly/RegexMain Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 37. http://bit.ly/RegexMap File - /Users/cwensel/sandbox/calku/cascading.samples/hadoop/src/java/hadoop/RegexParserMap.java public class RegexParserMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> { private Pattern pattern; private Matcher matcher; @Override public void configure( JobConf job ) { pattern = Pattern.compile( job.get( "logparser.regex" ) ); matcher = pattern.matcher( "" ); // lets re-use the matcher } @Override public void map( LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter ) { matcher.reset( value.toString() ); if( !matcher.find() ) throw new RuntimeException( "could not match pattern: [" + pattern + "] with value: [" + value + "]" ); StringBuffer buffer = new StringBuffer(); for( int i = 0; i < matcher.groupCount(); i++ ) { if( i != 0 ) buffer.append( "t" ); buffer.append( matcher.group( i + 1 ) ); // skip group 0 } // pass null so a TAB is not prepended, not all OutputFormats accept null output.collect( null, new Text( buffer.toString() ) ); } } Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 38. { public static void main( String[] args ) throws IOException { // create Hadoop path instances Path inputPath = new Path( args[ 0 ] ); http://bit.ly/RegexMain Path outputPath = new Path( args[ 1 ] ); // get the FileSystem instances for the input path FileSystem outputFS = outputPath.getFileSystem( new JobConf() ); // if output path exists, delete recursively if( outputFS.exists( outputPath ) ) outputFS.delete( outputPath, true ); // initialize Hadoop job configuration JobConf jobConf = new JobConf(); jobConf.setJobName( "logparser" ); // set the current job jar jobConf.setJarByClass( Main.class ); // set the input path and input format TextInputFormat.setInputPaths( jobConf, inputPath ); jobConf.setInputFormat( TextInputFormat.class ); // set the output path and output format TextOutputFormat.setOutputPath( jobConf, outputPath ); jobConf.setOutputFormat( TextOutputFormat.class ); jobConf.setOutputKeyClass( Text.class ); jobConf.setOutputValueClass( Text.class ); // must set to zero since we have no redcue function jobConf.setNumReduceTasks( 0 ); // configure our parsing map classs jobConf.setMapperClass( RegexParserMap.class ); String apacheRegex = "^([^ ]*) +[^ ]* +[^ ]* + [([^]]*) ] + " ([^ ]*) ([^ ]*) [^ ]* " ([^ ]*) ([^ ]*).*$" ; jobConf.set( "logparser.regex" , apacheRegex ); // create Hadoop client, must pass in this JobConf for some reason JobClient jobClient = new JobClient( jobConf ); // submit job RunningJob runningJob = jobClient.submitJob( jobConf ); // block until job completes runningJob.waitForCompletion(); } Copyright Concurrent, Inc. 2009. All rights reserved } Thursday, May 28, 2009
  • 39. Cascading: Parsing a Log File > read one “line” at a time from “source” > using some regular expression... > parse “line” into “ip, time, method, event, status, size” > save as a TAB delimited row to the “sink” http://bit.ly/LPMain Copyright Concurrent, Inc. 2009. All rights reserved Thursday, May 28, 2009
  • 40. public static void main( String[] args ) http://bit.ly/LPMain { String inputPath = args[ 0 ]; String outputPath = args[ 1 ]; // define what the input file looks like, "offset" is bytes from beginning TextLine scheme = new TextLine( new Fields( "offset", "line" ) ); // create SOURCE tap to read a resource from the local file system, if input is not an URL Tap logTap = inputPath.matches( "^[^:]+://.*" ) ? new Hfs( scheme, inputPath ) : new Lfs( scheme, inputPath ); // create an assembly to parse an Apache log file and store on an HDFS cluster // declare the field names we will parse out of the log file Fields apacheFields = new Fields( "ip" , "time" , "method", "event" , "status" , "size" ); // define the regular expression to parse the log file with String apacheRegex = "^([^ ]*) +[^ ]* +[^ ]* + [([^]]*) ] + " ([^ ]*) ([^ ]*) [^ ]* " ([^ ]*) ([^ ]*).*$" ; // declare the groups from the above regex we want to keep. each regex group will be given // a field name from 'apacheFields', above, respectively int[] allGroups = {1, 2, 3, 4, 5, 6}; // create the parser RegexParser parser = new RegexParser( apacheFields, apacheRegex, allGroups ); // create the import pipe element, with the name 'import', and with the input argument named "line" // replace the incoming tuple with the parser results // "line" -> parser -> "ts" Pipe importPipe = new Each( "import" , new Fields( "line" ), parser, Fields.RESULTS ); // create a SINK tap to write to the default filesystem // by default, TextLine writes all fields out Tap remoteLogTap = new Hfs( new TextLine(), outputPath, SinkMode.REPLACE ); // set the current job jar Properties properties = new Properties(); FlowConnector.setApplicationJarClass( properties, Main.class ); // connect the assembly to the SOURCE and SINK taps Flow parsedLogFlow = new FlowConnector( properties ).connect( logTap, remoteLogTap, importPipe ); // optionally print out the parsedLogFlow to a DOT file for import into a graphics package // parsedLogFlow.writeDOT( "logparser.dot" ); // start execution of the flow (either locally or on the cluster parsedLogFlow.start(); // block until the flow completes parsedLogFlow.complete(); Copyright Concurrent, Inc. 2009. All rights reserved } Thursday, May 28, 2009 }
• 41. What Changes When...
• We want to sort by the “status” field?
• We want to filter out some “urls”?
• We want to sum the “size”?
• Do all the above? (a sketch of all three follows)
• We are thinking in ‘fields’, not Key/Values
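To make the field-oriented answers concrete, a minimal sketch using the stock RegexFilter, Sum, and GroupBy against the fields parsed on the previous slides (the url pattern is hypothetical). No mapper or reducer signature changes; only the pipe assembly does:

  // filter out some "urls": keep only "event" values matching a pattern
  Pipe pipe = new Each( importPipe, new Fields( "event" ), new RegexFilter( "^/product/.*" ) );

  // sort by "status": grouping on a field also sorts on it
  pipe = new GroupBy( pipe, new Fields( "status" ) );

  // sum the "size" field within each "status" group, declaring the result field "total"
  pipe = new Every( pipe, new Fields( "size" ), new Sum( new Fields( "total" ) ) );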
• 42. Cascading: Arrival Rate
> ..parse data as before..
> read parsed logs
> calculate “urls” per second
> calculate “urls” per minute
> save “sec interval” & “count” as a TAB’d row to “sink1”
> save “min interval” & “count” as a TAB’d row to “sink2”
http://bit.ly/LAMain
• 43. Arrival Rate Flow http://bit.ly/LAMain
(flow plan diagram: parsed logs are read from an Hfs SequenceFile tap declaring 'ip', 'time', 'method', 'event', 'status', 'size' under output/logs/; Each('arrival rate') applies DateParser to declare 'ts'; the stream then splits: GroupBy('tsCount') on 'ts' feeds an Every Count, while an Each ExpressionFunction declares 'tm' and feeds GroupBy('tmCount') and another Every Count; the branches sink to TextLine taps at output/arrivalrate/sec and output/arrivalrate/min respectively)
• 44. http://bit.ly/LAMain

// set the current job jar
Properties properties = new Properties();
FlowConnector.setApplicationJarClass( properties, Main.class );

FlowConnector flowConnector = new FlowConnector( properties );
CascadeConnector cascadeConnector = new CascadeConnector();

String inputPath = args[ 0 ];
String logsPath = args[ 1 ] + "/logs/";
String arrivalRatePath = args[ 1 ] + "/arrivalrate/";
String arrivalRateSecPath = arrivalRatePath + "sec";
String arrivalRateMinPath = arrivalRatePath + "min";

// create an assembly to import an Apache log file and store on DFS
// declares: "ip", "time", "method", "event", "status", "size"
Fields apacheFields = new Fields( "ip", "time", "method", "event", "status", "size" );
String apacheRegex = "^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +\"([^ ]*) ([^ ]*) [^ ]*\" ([^ ]*) ([^ ]*).*$";
int[] apacheGroups = {1, 2, 3, 4, 5, 6};
RegexParser parser = new RegexParser( apacheFields, apacheRegex, apacheGroups );
Pipe importPipe = new Each( "import", new Fields( "line" ), parser );

// create tap to read a resource from the local file system, if not a URL for an external resource
// Lfs allows for relative paths
Tap logTap = inputPath.matches( "^[^:]+://.*" ) ? new Hfs( new TextLine(), inputPath ) : new Lfs( new TextLine(), inputPath );

// create a tap to read/write from the default filesystem
Tap parsedLogTap = new Hfs( apacheFields, logsPath );

// connect the assembly to source and sink taps
Flow importLogFlow = flowConnector.connect( logTap, parsedLogTap, importPipe );

// create an assembly to parse out the time field into a timestamp
// then count the number of requests per second and per minute

// apply a text parser to create a timestamp with 'second' granularity
// declares field "ts"
DateParser dateParser = new DateParser( new Fields( "ts" ), "dd/MMM/yyyy:HH:mm:ss Z" );
Pipe tsPipe = new Each( "arrival rate", new Fields( "time" ), dateParser, Fields.RESULTS );

// name the per-second assembly and split on tsPipe
Pipe tsCountPipe = new Pipe( "tsCount", tsPipe );
tsCountPipe = new GroupBy( tsCountPipe, new Fields( "ts" ) );
tsCountPipe = new Every( tsCountPipe, Fields.GROUP, new Count() );

// apply an expression to create a timestamp with 'minute' granularity
// declares field "tm"
Pipe tmPipe = new Each( tsPipe, new ExpressionFunction( new Fields( "tm" ), "ts - (ts % (60 * 1000))", long.class ) );

// name the per-minute assembly and split on tmPipe
Pipe tmCountPipe = new Pipe( "tmCount", tmPipe );
tmCountPipe = new GroupBy( tmCountPipe, new Fields( "tm" ) );
tmCountPipe = new Every( tmCountPipe, Fields.GROUP, new Count() );

// create taps to write the results to the default filesystem, using the given fields
Tap tsSinkTap = new Hfs( new TextLine(), arrivalRateSecPath );
Tap tmSinkTap = new Hfs( new TextLine(), arrivalRateMinPath );

// a convenience method for binding taps and pipes, order is significant
Map<String, Tap> sinks = Cascades.tapsMap( Pipe.pipes( tsCountPipe, tmCountPipe ), Tap.taps( tsSinkTap, tmSinkTap ) );

// connect the assembly to the source and sink taps
Flow arrivalRateFlow = flowConnector.connect( parsedLogTap, sinks, tsCountPipe, tmCountPipe );

// optionally print out the arrivalRateFlow to a graph file for import into a graphics package
// arrivalRateFlow.writeDOT( "arrivalrate.dot" );

// connect the flows by their dependencies, order is not significant
Cascade cascade = cascadeConnector.connect( importLogFlow, arrivalRateFlow );

// execute the cascade, which in turn executes each flow in dependency order
cascade.complete();
• 45. Cascading Model
Pipe: Each, GroupBy, CoGroup, Every, SubAssembly
Operation: Function, Filter, Aggregator, Buffer, Assertion
Joiner: InnerJoin, OuterJoin, LeftJoin, RightJoin, MixedJoin
Tap: Hfs, JDBC, HBase, ... (the slide’s ‘??’ slots mark room for further implementations)
An Each applies a Function, Filter, or Assertion to one TupleEntry (Fields plus Tuple) of values at a time; an Every applies an Aggregator, Buffer, or Assertion to a Group of Tuples.
• 46. Assembling Applications
(diagram: two example assemblies, each flowing from a source Tap through Each (func, filter), Group, and Every (aggr) or buffer stages into a sink Tap, with a trap Tap attached)
• Pipes can be arranged into arbitrary assemblies (splits and merges; see the sketch below)
• Some limitations imposed by Operations
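A minimal sketch of an arbitrary arrangement, splitting one parsed stream into two named branches (importPipe, statusTap, and methodTap stand in for the taps and pipe from the earlier examples):

  // split: two named branches off the same parsed stream
  Pipe perStatus = new Pipe( "perStatus", importPipe );
  perStatus = new GroupBy( perStatus, new Fields( "status" ) );
  perStatus = new Every( perStatus, Fields.GROUP, new Count() );

  Pipe perMethod = new Pipe( "perMethod", importPipe );
  perMethod = new GroupBy( perMethod, new Fields( "method" ) );
  perMethod = new Every( perMethod, Fields.GROUP, new Count() );

  // each tail is bound to its own sink Tap when the Flow is connected
  Map<String, Tap> sinks = Cascades.tapsMap( Pipe.pipes( perStatus, perMethod ), Tap.taps( statusTap, methodTap ) );

One such Operation-imposed limitation: an Every may only follow a GroupBy, CoGroup, or another Every; an Each may not sit between a grouping and its aggregators.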
• 47. Assembly -> Flow
(diagram: a pipe assembly of funcs, a CoGroup with its aggr, and a GroupBy with its aggr is bound to source and sink Taps, becoming a Flow)
• Assemblies are planned into Flows
• 48. Flow Planner
(diagram: the same assembly partitioned into two MapReduce jobs, a Map (func, CoGroup) and Reduce (aggr) writing a temp file, then a Map and Reduce (GroupBy, aggr, func), between source and sink)
• The Planner handles how operations are partitioned into MapReduce jobs
• 49. Assembly -> Cluster
(diagram: on the client, an Assembly is planned into a Flow; the Flow submits its MapReduce Jobs to the cluster in order)
• 50. Topological Scheduler
(diagram: a Cascade managing a graph of dependent Flows)
• Cascade will manage many Flows
• Scheduled in dependency order
• Will run only ‘stale’ Flows (pluggable behavior)
• Custom MapReduce jobs can participate in a Cascade (sketched below)
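A sketch of the last point, assuming cascading.flow.MapReduceFlow, which wraps a pre-configured JobConf so the scheduler can order it by its input and output paths:

  // an existing raw Hadoop job, configured as usual
  JobConf jobConf = new JobConf();
  jobConf.setJarByClass( Main.class );
  // ... set mapper, input and output paths ...

  // wrap it so it participates in the Cascade; true = delete the sink on init
  Flow customFlow = new MapReduceFlow( "custom mr job", jobConf, true );

  // argument order to connect() is not significant; dependencies come from the paths
  Cascade cascade = new CascadeConnector().connect( importLogFlow, customFlow, arrivalRateFlow );
  cascade.complete();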
• 51. Heterogeneous Inputs
(diagram: a single Map phase reading a Text source and a Seq(uenceFile) source, each through its own funcs, into one Group)
• A single mapper can read data from multiple InputFormats (see the sketch below)
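A sketch with hypothetical paths: two head pipes bound to taps with different schemes (TextLine and SequenceFile) merged by a single GroupBy; both branches must declare the same fields before the merge:

  Fields common = new Fields( "ip", "time" );

  // one source is raw text needing a parse, the other an already field-typed sequence file
  Tap textTap = new Hfs( new TextLine( new Fields( "offset", "line" ) ), "input/raw" );
  Tap seqTap = new Hfs( new SequenceFile( common ), "output/logs" );
  Tap sinkTap = new Hfs( new TextLine(), "output/merged" );

  Pipe textPipe = new Each( "text", new Fields( "line" ),
    new RegexParser( common, "^([^ ]*) ([^ ]*).*$", new int[]{1, 2} ), Fields.RESULTS );
  Pipe seqPipe = new Pipe( "seq" );

  // merge the two streams on a common field
  Pipe merged = new GroupBy( Pipe.pipes( textPipe, seqPipe ), new Fields( "ip" ) );

  // bind each named head pipe to its tap
  Map<String, Tap> sources = Cascades.tapsMap( Pipe.pipes( textPipe, seqPipe ), Tap.taps( textTap, seqTap ) );
  Flow flow = flowConnector.connect( sources, sinkTap, merged );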
• 52. Dynamic Outputs
(diagram: a Reduce phase (Group, aggr, func) sinking to ./2008/december/part-0000x, ./2008/november/part-0000x, ./2008/october/part-0000x, ...)
• Sink to output folders named from Tuple values (sketch below)
• Works Map side (must take care on the number of files)
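A minimal sketch, assuming cascading.tap.TemplateTap, which wraps a parent Hfs tap and formats a path template from each output tuple's values (the year/month layout is hypothetical):

  // the parent tap supplies the scheme and the base output path
  Hfs parentTap = new Hfs( new TextLine(), "output/bymonth" );

  // each tuple lands in a folder derived from its own values,
  // e.g. ./2008/december/part-0000x
  Tap byMonthTap = new TemplateTap( parentTap, "%s/%s" );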
• 53. Stream Assertions
(diagram: assert elements inlined among the funcs and groups of a pipe assembly between source and sink)
• Unit and Regression tests for Flows
• Planner removes ‘strict’, ‘validating’, or all assertions (sketch below)
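A sketch of inlining assertions, assuming the AssertionLevel enum and the stock assertions named here; planning with a lower level strips the stricter assertions at run-time:

  // a 'strict' assertion, intended for tests; removable in production plans
  pipe = new Each( pipe, AssertionLevel.STRICT, new AssertNotNull() );

  // a 'validating' assertion for untrusted data-sources
  pipe = new Each( pipe, new Fields( "status" ), AssertionLevel.VALID, new AssertMatches( "^\\d+$" ) );

  // plan every assertion out of the Flow for performance
  FlowConnector.setAssertionLevel( properties, AssertionLevel.NONE );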
• 54. Failure Traps
(diagram: the same assembly with trap Taps bound to individual branches alongside the asserts)
• Catch data causing Operations or Assertions to fail (sketch below)
• Allows processes to continue without data loss
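A sketch of binding a trap, assuming the FlowConnector.connect overload that accepts a map of trap Taps keyed by pipe (branch) name; logTap, remoteLogTap, and importPipe are from the log-parser example:

  // tuples that cause an Operation or Assertion to fail in the "import" branch
  // are written here instead of failing the job
  Tap trapTap = new Hfs( new TextLine(), "output/traps", SinkMode.REPLACE );

  Map<String, Tap> sources = new HashMap<String, Tap>();
  sources.put( "import", logTap );
  Map<String, Tap> sinks = new HashMap<String, Tap>();
  sinks.put( "import", remoteLogTap );
  Map<String, Tap> traps = new HashMap<String, Tap>();
  traps.put( "import", trapTap );

  Flow flow = new FlowConnector( properties ).connect( "logparser", sources, sinks, traps, importPipe );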
• 55. Event Notification
• Flow Listeners for common events (sketch below)
  • start
  • complete
  • failed
  • stopped
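A sketch of wiring the events, assuming the FlowListener interface mirrors the four events above, with onThrowable returning whether the Throwable was handled:

  parsedLogFlow.addListener( new FlowListener()
    {
    public void onStarting( Flow flow )  { System.out.println( "started: " + flow.getName() ); }
    public void onStopping( Flow flow )  { System.out.println( "stopped: " + flow.getName() ); }
    public void onCompleted( Flow flow ) { System.out.println( "completed: " + flow.getName() ); }

    public boolean onThrowable( Flow flow, Throwable t )
      {
      t.printStackTrace();
      return false; // false: the exception is rethrown as a failure
      }
    } );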
• 56. Best Practices: Developing
• Organize Flows as med/small work-loads
  • Improves reusability and work-sharing
• Small/Medium-grained Operations
  • Improves reusability
• Coarse-grained Operations
  • Better performance vs flexibility
• SubAssemblies encapsulate responsibilities (sketch below)
  • Separate “rules” from “integration”
• Solve the problem first, optimize second
  • Can replace Flows with raw MapReduce
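For the “rules” vs “integration” split, a sketch assuming the stock SubAssembly base class; the fields and regex are those from the log-parser slides:

  public class ApacheLogParseAssembly extends SubAssembly
    {
    // encapsulate the parsing "rules"; taps and flows stay in the "integration" code
    public ApacheLogParseAssembly( Pipe pipe )
      {
      Fields apacheFields = new Fields( "ip", "time", "method", "event", "status", "size" );
      String apacheRegex = "^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +\"([^ ]*) ([^ ]*) [^ ]*\" ([^ ]*) ([^ ]*).*$";

      pipe = new Each( pipe, new Fields( "line" ),
        new RegexParser( apacheFields, apacheRegex, new int[]{1, 2, 3, 4, 5, 6} ), Fields.RESULTS );

      // register the tail so the assembly chains like any other Pipe
      setTails( pipe );
      }
    }

It then drops into any assembly: Pipe importPipe = new ApacheLogParseAssembly( new Pipe( "import" ) );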
• 57. Best Practices: Testing
• Unit tests
  • Operations tested independently of Hadoop or Map/Reduce
• Regression / Integration testing
  • Use the built-in testing cluster for small data
  • Use EC2 for large regression data sets
• Inline “strict” Assertions into Assemblies
  • Planner can remove them at run-time
• SubAssemblies tested independently of other logic
• 58. Best Practices: Data Processing
• Account for good/bad data
  • Split good/bad data by rules
  • Capture rule reasons with the “bad” data
• Inline “validating” Assertions
  • For staging or untrusted data-sources
  • Again, can be “planned” out for performance at run-time
• Capture really bad data through “traps”
  • Depending on the fidelity of the app
• Use s3n:// before s3://
  • s3:// stores data in a proprietary format; never use it as the default FS
  • Store compressed, in chunks (1 file per Mapper)
• 59. Cascading
• Open-Source/GPLv3
• User contributed modules
• 1.0 since January
• Elastic MapReduce
• Growing community
• 60. 1.1
• Flow level counters
• Fields.SWAP and Fields.REPLACE (sketch below)
• “Safe” Operations
• Hadoop “globbing” support
• Detailed job statistics
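For instance, a sketch of Fields.REPLACE, which writes an operation's results over its own argument fields in place (the 'ts' rounding expression is hypothetical):

  // overwrite "ts" with the rounded value; all other fields pass through untouched
  pipe = new Each( pipe, new Fields( "ts" ),
    new ExpressionFunction( new Fields( "ts" ), "ts - (ts % 1000)", long.class ), Fields.REPLACE );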
• 61. Upcoming
• ACM Cloud PDS - June 6th
  • http://bit.ly/U9x2q
• ScaleCamp - June 9th
  • http://scalecamp.eventbrite.com/
• Hadoop Summit - June 10th
  • http://hadoopsummit09.eventbrite.com/
• 62. Resources
• Cascading
  • http://cascading.org
• Enterprise tools and support
  • http://concurrentinc.com
• Professional Services and Training
  • http://scaleunlimited.com