Enterprise Data Workflows with Cascading

Paco Nathan
Concurrent, Inc.

Copyright @2012, Concurrent, Inc.

Monday, 17 December 12
Unstructured Data
        Enterprise Scale

         1. Cascading API: a few facts & quotes
         2. Example #1: distributed file copy
         3. Example #2: word count
         4. Pattern Language: workflow abstraction
         5. Compare: Scalding, Cascalog, Hive, Pig

Intro to Cascading

                         Cascading API:
                         a few facts & quotes

Enterprise apps, pre-Hadoop

               [Diagram labels: analyst, insights, dashboards, analysis, Analytics, data sets, Warehouse, data sources, data, Apps, ops]

Enterprise apps, pre-Hadoop
               the devil you know:

                 ‣ “scale up” as needed – larger proprietary hardware
                 ‣ data warehouse: e.g., Oracle, Teradata, etc. – expensive
                 ‣ analytics: e.g., SAS, MicroStrategy, etc. – expensive
                 ‣ highly trained staff in specific roles – lots of “silos”

               however, to be competitive now, the data rates must scale
               by orders of magnitude...

               ( alternatively, can we get hired onto the SAS sales team? )

Enterprise apps, with Hadoop
               Apache Hadoop offers an attractive migration path:

                 ‣ open source software – less expensive
                 ‣ commodity hardware – less expensive
                 ‣ fault tolerance for large-scale parallel workloads
                 ‣ great use cases: Yahoo!, Facebook, Twitter, Amazon, Apple, etc.
                 ‣ offload workflows from licensed platforms, based on “scale-out”

Enterprise apps, with Hadoop

               [Diagram labels: analyst (queries, models), developer (Java apps), job tracker, name node, Hadoop Cluster]


Enterprise apps, with Hadoop
               anything odd about that diagram?

               [Diagram labels: analyst (queries), developer, ops, job tracker, name node, Hadoop Cluster]

                 ‣ demands expert Hadoop developers
                 ‣ experts are hard to find, expensive
                 ‣ even harder to train from among existing staff
                 ‣ early adopter abstractions are not suitable for Enterprise IT
                 ‣ importantly: Hadoop is almost never used in isolation

Cascading API: purpose
                ‣ simplify data processing development and deployment

                ‣ improve application developer productivity

                ‣ enable data processing application manageability

Cascading API: a few facts
                Java open source project (ASL 2) using Git, Gradle, Maven, JUnit, etc.

                in production (~5 yrs) at hundreds of enterprise Hadoop deployments:
                Finance, Health Care, Transportation, other verticals

                studies published about large use cases: Twitter, Etsy, eBay, Airbnb, Square,
                Climate Corp, FlightCaster, Williams-Sonoma, Trulia, TeleNav

                partnerships and distribution with SpringSource, Amazon AWS,
                Microsoft Azure, Hortonworks, MapR, EMC

                several open source projects built atop, managed by Twitter, Etsy, eBay, etc.,
                which provide substantial Machine Learning libraries

                DSLs available in Scala, Clojure, Python (Jython), Ruby (JRuby), Groovy

                data “taps” integrate popular data frameworks via JDBC, Memcached, HBase,
                plus serialization in Apache Thrift, Avro, Kryo, etc.

                entire app compiles into a single JAR: fully connected for compiler optimization,
                exception handling, debug, config, scheduling, notifications, provenance, etc.

Cascading API: a few quotes
              “Cascading gives Java developers the ability to build Big Data applications
               on Hadoop using their existing skillset … Management can really go out
               and build a team around folks that are already very experienced with Java.
               Switching over to this is really a very short exercise.”
                  CIO, Thor Olavsrud, 2012-06-06

              “Masks the complexity of MapReduce, simplifies the programming, and
               speeds you on your journey toward actionable analytics … A vast
               improvement over native MapReduce functions or Pig UDFs.”
                  2012 BOSSIE Awards, James Borck, 2012-09-18

              “Company’s promise to application developers is an opportunity to build
               and test applications on their desktops in the language of choice with
               familiar constructs and reusable components”
                  Dr. Dobb’s, Adrian Bridgwater, 2012-06-08

Enterprise concerns
              “Notes from the Mystery Machine Bus”
               by Steve Yegge, Google
                          “conservative”                             “liberal”
                            (mostly) Enterprise                   (mostly) Start-Up

                             risk management                    customer experiments

                                 assurance                            flexibility

                           well-defined schema                   schema follows code
                           explicit configuration                     convention

                          type-checking compiler                 interpreted scripts

                            wants no surprises                  wants no impediments

                          Java, Scala, Clojure, etc.            PHP, Ruby, Python, etc.

                   Cascading, Scalding, Cascalog, etc.   Hive, Pig, Hadoop Streaming, etc.

Enterprise adoption

                         As Enterprise apps move into
                         Hadoop and related BigData
                         frameworks, risk profiles shift
                         toward more conservative
                         programming practices

                         Cascading provides a popular
                         API – formally speaking, as a
                         pattern language – for defining
                         and managing Enterprise data
                         workflows

Migration of batch toolsets

                                       Enterprise   Migration    Start-Ups
                   define pipelines    J2EE         Cascading    Pig
                         query data    SQL          Lingual      Hive
                  predictive models    SAS          Pattern      Mahout

               Cascading API benefits:

                 ‣ addresses staffing bottlenecks due to Hadoop adoption
                 ‣ reduces costs, while servicing risk concerns and “conservatism”
                 ‣ manages complexity as the data continues to scale massively
                 ‣ provides a pattern language for system integration
                 ‣ leverages a workflow abstraction for Enterprise apps
                 ‣ utilizes existing practices for JVM-based clusters

Intro to Cascading

                         Code Example #1:
                         distributed file copy

1: distributed file copy

    import java.util.Properties;

    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.property.AppProps;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;

    public class Main {
      public static void main( String[] args ) {
        String inPath = args[ 0 ];
        String outPath = args[ 1 ];

        Properties props = new Properties();
        AppProps.setApplicationJarClass( props, Main.class );
        HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

        // create the source tap (tab-delimited text with a header row)
        Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

        // create the sink tap
        Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

        // specify a pipe to connect the taps
        Pipe copyPipe = new Pipe( "copy" );

        // connect the taps, pipes, etc., into a flow
        FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
         .addSource( copyPipe, inTap )
         .addTailSink( copyPipe, outTap );

        // run the flow
        flowConnector.connect( flowDef ).complete();
      }
    }

          1 mapper
          0 reducers
         10 lines code

1: distributed file copy
                 ‣ a source tap – input data
                 ‣ a sink tap – output data
                 ‣ a pipe connecting a source to a sink
                 ‣ simplest possible Cascading app

               not shown:
                 ‣ what kind of taps? and what size of input data set?
                   could be: JDBC, HBase, Cassandra, XML, flat files, etc.
                 ‣ what kind of topology? and what size of cluster?
                   could be: Hadoop, in-memory, etc.

               as system architects, we leverage pattern
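
The idea that the pipe stays the same while the taps vary can be sketched in plain Java. This is not the Cascading API: `TapSketch`, `copy`, and the use of `Supplier` and `Consumer` as stand-ins for a source tap and a sink tap are assumptions made purely for illustration, and the sample records are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;
import java.util.stream.Stream;

// Hypothetical stand-in for the tap/pipe idea; none of these names come from Cascading.
class TapSketch {

    // The "pipe": moves every record from a source to a sink, knowing nothing
    // about where the records come from or where they end up.
    static <T> long copy(Supplier<Stream<T>> source, Consumer<T> sink) {
        long[] copied = {0};
        try (Stream<T> records = source.get()) {
            records.forEach(record -> {
                sink.accept(record);
                copied[0]++;
            });
        }
        return copied[0];
    }

    public static void main(String[] args) {
        // An in-memory "source tap" and "sink tap"; a file-backed or JDBC-backed
        // pair could be swapped in without touching copy().
        List<String> input = Arrays.asList("doc_id\ttext", "doc01\thello world");
        List<String> output = new ArrayList<>();

        long n = TapSketch.<String>copy(input::stream, output::add);
        System.out.println(n + " records copied, identical: " + output.equals(input));
    }
}
```

Only the two arguments to `copy` change when the data moves from a laptop file to a cluster store, which is the property the bullets above attribute to taps.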

principle: same JAR, any scale
               MegaCorp Enterprise IT:
               PBs of data
               1000+ node private cluster
               EVP calls you when app fails
               runtime: days+

               Production Cluster:
               TBs of data
               EMR w/ 50 HPC Instances
               Ops monitors results
               runtime: hours – days

               Staging Cluster:
               GBs of data
               EMR + 4 Spot Instances
               CI shows red or green lights
               runtime: minutes – hours

               Your Laptop:
               MBs of data
               Hadoop standalone mode
               passes unit tests, or not
               runtime: seconds – minutes

principle: fail the same way twice
               troubleshooting at scale:

                 ‣ physical plan for a query provides a deterministic strategy
                 ‣ avoid non-deterministic behavior – expensive when troubleshooting
                 ‣ otherwise, edge cases become nightmares on large clusters
                 ‣ again, addresses “conservative” need for predictability
                 ‣ a core value which is unique to Cascading

principle: plan ahead
               flow planner per topology:

                 ‣ leverage the flow graph (DAG)
                 ‣ catch as many errors as possible before an app gets submitted
                 ‣ potential problems caught at compile time or at flow planner stage
                 ‣ …long before large, expensive resources start getting consumed
                 ‣ …or worse, before the wrong results get propagated downstream
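
One way to picture planning ahead is a check that walks the flow before anything runs and rejects a step whose input fields are not produced upstream. The sketch below is a simplified illustration in plain Java, reduced to a linear chain of steps rather than a general DAG. It is not Cascading's flow planner; the `Step` class, the `validate` method, and the field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified illustration of "plan ahead"; not Cascading's planner.
class PlanAheadSketch {

    // One step in a linear flow: the fields it reads and the fields it adds.
    static final class Step {
        final String name;
        final List<String> requires;
        final List<String> produces;

        Step(String name, List<String> requires, List<String> produces) {
            this.name = name;
            this.requires = requires;
            this.produces = produces;
        }
    }

    // Walks the steps in order and reports every field that a step needs
    // but that neither the source nor an earlier step provides.
    static List<String> validate(List<String> sourceFields, List<Step> steps) {
        Set<String> available = new HashSet<>(sourceFields);
        List<String> errors = new ArrayList<>();
        for (Step step : steps) {
            for (String field : step.requires) {
                if (!available.contains(field)) {
                    errors.add(step.name + ": missing field '" + field + "'");
                }
            }
            available.addAll(step.produces);
        }
        return errors;
    }

    public static void main(String[] args) {
        List<Step> steps = Arrays.asList(
            new Step("tokenize", Arrays.asList("text"), Arrays.asList("token")),
            new Step("count", Arrays.asList("tokn"), Arrays.asList("count"))); // typo on purpose

        List<String> errors = validate(Arrays.asList("doc_id", "text"), steps);
        if (!errors.isEmpty()) {
            // Fail here, before any cluster resources would be requested.
            System.out.println("flow rejected: " + errors);
        }
    }
}
```

The misspelled field name is caught by a cheap pass over the plan, rather than surfacing hours into a cluster job.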

Intro to Cascading

                         Code Example #2:
                         word count

2: word count
               defined: count how often each word appears in a collection of text documents

               a simple program provides a great test case for parallel processing,
               since it:
                 ‣ requires a minimal amount of code
                 ‣ demonstrates use of both symbolic and numeric values
                 ‣ shows a dependency graph of tuples as an abstraction
                 ‣ is not many steps away from useful search indexing
                 ‣ serves as a “Hello World” for Hadoop apps

               any distributed computing framework which runs Word Count
               efficiently in parallel at scale,
               can handle much larger, more interesting compute problems

2: word count

               [Flow diagram labels: M, token, Count, R, Word]

          1 mapper
          1 reducer
         18 lines code

2: word count

    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.property.AppProps;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class Main {
      public static void main( String[] args ) {
        String docPath = args[ 0 ];
        String wcPath = args[ 1 ];

        Properties properties = new Properties();
        AppProps.setApplicationJarClass( properties, Main.class );
        HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

        // create source and sink taps
        Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
        Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

        // specify a regex to split "document" text lines into token stream
        Fields token = new Fields( "token" );
        Fields text = new Fields( "text" );
        RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );

        // only returns "token"
        Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

        // determine the word counts
        Pipe wcPipe = new Pipe( "wc", docPipe );
        wcPipe = new GroupBy( wcPipe, token );
        wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

        // connect the taps, pipes, etc., into a flow
        FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
         .addSource( docPipe, docTap )
         .addTailSink( wcPipe, wcTap );

        // write a DOT file and run the flow
        Flow wcFlow = flowConnector.connect( flowDef );
        wcFlow.writeDOT( "dot/" );
        wcFlow.complete();
      }
    }
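
To see what a split pattern of this kind does to a line of text, here is a plain `java.util.regex` sketch. It assumes the character class is meant to cover space, square brackets, parentheses, comma, and period. The `TokenSketch` class and the sample sentence are illustrative, and this is ordinary JDK splitting, not Cascading's `RegexSplitGenerator`, whose handling of empty tokens may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: shows the effect of the delimiter pattern using the JDK,
// independent of Cascading.
class TokenSketch {

    // Delimiters: space, '[', ']', '(', ')', ',' and '.'
    static final Pattern DELIMITERS = Pattern.compile("[ \\[\\]\\(\\),.]");

    // Splits a line on any single delimiter character and drops the empty
    // strings that appear when two delimiters are adjacent.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String piece : DELIMITERS.split(line)) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("A rain shadow is a dry area (on the lee side)."));
    }
}
```

Dropping empty pieces is a choice made in this sketch so that adjacent delimiters such as ", " do not produce blank tokens.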

2: word count

               [Flow diagram, from the generated DOT file:]

               source tap:  Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']
                            [{2}:'doc_id', 'text']

               sink tap:    Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']
                            [{2}:'token', 'count']

               [tail]

          1 mapper
          1 reducer
         18 lines code

2: word count
               deltas between Example #1 and Example #2:

                 ‣ defines source tap as a collection of text documents
                 ‣ defines sink tap to produce word count tuples (desired end result)
                 ‣ uses named fields, applying structure to unstructured data
                 ‣ adds semantics to the workflow, specifying business logic
                 ‣ inserts operations into the pipe: Tokenize, GroupBy, Count
                 ‣ shows function and aggregation applied to data tuples in parallel



Intro to Cascading

                         Pattern Language:
                         the workflow abstraction

enterprise data workflows
               Tuples, Pipelines, Taps, Operations, Joins, Assertions, Traps, etc.
               …in other words, “plumbing” as a pattern language
               for handling Big Data in Enterprise IT







pattern language
              defined: a structured method for solving large, complex
              design problems, where the syntax of the language
              promotes the use of best practices

              “plumbing” metaphor of pipes and operators in
              Cascading helps indicate: algorithms to be used at
              particular points, appropriate architectural trade-offs,
              frameworks which must be integrated, etc.

              design patterns: originated in consensus negotiation
              for architecture, later used in software engineering


data workflows: team
                ‣ Business Stakeholder POV:
                   business process management for workflow orchestration (think BPM/BPEL)

                ‣ Systems Integrator POV:
                   system integration of heterogeneous data sources and compute platforms

                ‣ Data Scientist POV:
                   a directed, acyclic graph (DAG) on which we can apply Amdahl's Law, etc.

                ‣ Data Architect POV:
                   a physical plan for large-scale data flow management

                ‣ Software Architect POV:
                   a pattern language, similar to plumbing or circuit design

                ‣ App Developer POV:
                   API bindings for Java, Scala, Clojure, Jython, JRuby, etc.

                ‣ Systems Engineer POV:
                   a JAR file, has passed CI, available in a Maven repo
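
The Data Scientist view mentions Amdahl's Law. As a reminder of the arithmetic: if a fraction p of a job can run in parallel across n workers, the best-case speedup is 1 / ((1 - p) + p / n). The class and method names below are illustrative, and the 90% figure is an assumed value, not a measurement of any Cascading flow.

```java
// Illustrative calculator for Amdahl's Law; the input values are assumptions.
class AmdahlSketch {

    // Best-case speedup when a fraction `parallel` (0..1) of the work is spread
    // over `workers` machines and the rest stays serial.
    static double speedup(double parallel, int workers) {
        if (parallel < 0.0 || parallel > 1.0 || workers < 1) {
            throw new IllegalArgumentException("need 0 <= parallel <= 1 and workers >= 1");
        }
        return 1.0 / ((1.0 - parallel) + parallel / workers);
    }

    public static void main(String[] args) {
        // With 90% of the work parallelizable, extra nodes help less and less:
        // the speedup stays below 1 / (1 - 0.9), i.e. about 10x.
        for (int workers : new int[] {1, 10, 100, 1000}) {
            System.out.printf("%4d workers -> %.2fx%n", workers, speedup(0.9, workers));
        }
    }
}
```

Reading a flow as a DAG makes the serial fraction visible: steps that must wait on a single upstream branch bound the benefit of a larger cluster.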

data workflows: layers
                   business   domain expertise, business trade-offs,
                              operating parameters, market position, etc.

                      API     Java, Scala, Clojure, Jython, JRuby, Groovy, etc.
                              …envision whatever runs in a JVM

                 optimize /
                              major changes in technology now

                   physical

                              Apache Hadoop, in-memory local mode
                              …envision GPUs, streaming, etc.

                              Splunk, New Relic, Typesafe, Nagios, etc.

data workflows: example

               [Diagram labels: web logs, customer profile, source tap (x2), Cascading app, System, sink tap, trap tap, Memcached cluster, web API, Support review, Hadoop cluster]

data workflows: SQL vs. JVM

               abstraction     SQL                          JVM
               parser          SQL parser                   SQL-92 compliant parser (in progress)
               optimizer       logical plan,                logical plan,
                               optimized based on stats     optimized based on stats
               planner         physical plan                API “plumbing”
               machine data    query history, table stats   app history, tuple stats
               topology        b-trees, etc.                heterogeneous, distributed:
                                                            Hadoop, in-memory, etc.
               visualization   ERD                          flow diagram
               schema          table schema                 tuple schema
               catalog         relational catalog           tap usage DB

Cascading taxonomy

               [Taxonomy diagram labels: scheduler, app, Maven, owner, flow, step (kind: mapper | reducer), tap, trap tap, topology (hadoop | local)]

MapReduce architecture
               ‣ name node / data node
               ‣ job tracker / task tracker
               ‣ submit queue
               ‣ task slots
               ‣ HDFS
               ‣ distributed cache



               If you were leading a team responsible for Enterprise apps:

                 ‣ which of the previous two slides seems easier to understand?
                 ‣ which is simpler to use for training and managing a team?
                 ‣ which costs the most in the long run?

Intro to Cascading

                         Compare & Contrast:
                         other approaches

wc: pseudocode

       void map (String doc_id, String text):
         for each word w in segment(text):
           emit(w, "1");

       void reduce (String word, Iterator partial_counts):
         int count = 0;

         for each pc in partial_counts:
           count += Int(pc);

         emit(word, String(count));
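
The pseudocode can be turned into a small single-process Java program to make the three phases concrete: map emits (word, 1) pairs, a shuffle groups them by word, and reduce sums each group. This is a sketch for illustration only, not Hadoop or Cascading code. `segment` is assumed here to mean splitting on whitespace, and the class and method names are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process imitation of the map / shuffle / reduce phases for word count.
class WordCountSketch {

    // map: emit a (word, 1) pair for each word in the text.
    static List<Map.Entry<String, Integer>> map(String docId, String text) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) { // assumed meaning of segment(text)
            if (!word.isEmpty()) {
                emitted.add(new SimpleEntry<>(word, 1));
            }
        }
        return emitted;
    }

    // shuffle: group the emitted counts by word (a TreeMap keeps keys sorted
    // so the output order is stable).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // reduce: sum the partial counts for one word.
    static int reduce(String word, Iterable<Integer> partialCounts) {
        int count = 0;
        for (int pc : partialCounts) {
            count += pc;
        }
        return count;
    }

    // Runs all three phases over a list of documents.
    static Map<String, Integer> wordCount(List<String> docs) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            pairs.addAll(map("doc" + i, docs.get(i)));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            counts.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("rain rain shadow", "shadow of rain")));
    }
}
```

In a real cluster the map calls run on many machines and the shuffle moves data over the network; here everything happens in one loop, which is enough to check the logic.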

Scalding / Scala

       // Sujit Pal

       package com.mycompany.impatient

       import com.twitter.scalding._

       class Part2(args : Args) extends Job(args) {
         val input = Tsv(args("input"), ('docId, 'text))
         val output = Tsv(args("output"))
         input.read.
           flatMap('text -> 'word) {
             text : String => text.split("""\s+""")
           }.
           groupBy('word) { group => group.size }.
           write(output)
       }

Scalding / Scala

        ‣ code is compact, easy to understand

          ‣ functional programming is great for expressing
             complex workflows in MapReduce, etc.
          ‣ very large-scale, complex problems can be handled
             in just a few lines of code
          ‣ many large-scale apps in production deployments

          ‣ significant investments by Twitter, Etsy, eBay, etc.,
             in this open source project
          ‣ extensive libraries are available for linear algebra,
             machine learning – e.g., “Matrix API”

Cascalog / Clojure

       ; Paul Lam

       (ns impatient.core
         (:use [cascalog.api]
               [cascalog.more-taps :only (hfs-delimited)])
         (:require [clojure.string :as s]
                   [cascalog.ops :as c])
         (:gen-class))

       (defmapcatop split [line]
         "reads in a line of string and splits it by regex"
         (s/split line #"[\[\]\\\(\),.)\s]+"))

       (defn -main [in out & args]
         (?<- (hfs-delimited out)
              [?word ?count]
              ((hfs-delimited in :skip-header? true) _ ?line)
              (split ?line :> ?word)
              (c/count ?count)))

Cascalog / Clojure

        ‣ code is compact, easy to understand

          ‣ functional programming is great for expressing
             complex workflows in MapReduce, etc.
          ‣ significant investments by Twitter, Climate Corp, etc.,
             in this open source project
          ‣ can run queries from the Clojure REPL

          ‣ compelling for very large-scale use cases where code
             correctness can be verified before deployment

Apache Hive

       -- Steve Severance

       CREATE TABLE input (line STRING);

       LOAD DATA LOCAL INPATH 'input.tsv'
       OVERWRITE INTO TABLE input;

       SELECT
        word, COUNT(*)
       FROM input
        LATERAL VIEW explode(split(line, ' ')) lTable AS word
       GROUP BY word
       ;

Apache Hive

        ‣ most popular abstraction atop Apache Hadoop

          ‣ SQL-like language is syntactically familiar to most analysts

          ‣ simple to load large-scale unstructured data and run ad-hoc queries

        ‣ not a relational engine, many surprises at scale

          ‣ difficult to represent complex workflows, ML algorithms, etc.

          ‣ one poorly-trained analyst can bottleneck an entire cluster

          ‣ app-level integration requires other coding, outside of script language

          ‣ logical planner mixed with physical planner; cannot collect app stats

          ‣ non-deterministic exec: number of mappers+reducers changes unexpectedly

          ‣ business logic must cross multiple language boundaries: difficult to
             troubleshoot, optimize, audit, handle exceptions, set notifications, etc.

Apache Pig

       -- kudos to Dmitriy Ryaboy

       docPipe = LOAD '$docPath' USING PigStorage('\t', 'tagsource')
         AS (doc_id, text);
       docPipe = FILTER docPipe BY doc_id != 'doc_id';

       -- specify regex to split "document" text lines into token stream
       tokenPipe = FOREACH docPipe
         GENERATE doc_id, FLATTEN(TOKENIZE(text, ' [](),.')) AS token;
       tokenPipe = FILTER tokenPipe BY token MATCHES '\\w.*';

       -- determine the word counts
       tokenGroups = GROUP tokenPipe BY token;
       wcPipe = FOREACH tokenGroups
         GENERATE group AS token, COUNT(tokenPipe) AS count;

       -- output
       STORE wcPipe INTO '$wcPath' USING PigStorage('\t', 'tagsource');
       EXPLAIN -out dot/ -dot wcPipe;

Apache Pig

        ‣ easy to learn data manipulation language (DML)

          ‣ interactive prompt (Grunt) makes it simple to prototype apps

          ‣ extensibility through UDFs

        ‣ not a full programming language; must extend via UDFs outside of language

          ‣ app-level integration requires other coding, outside of script language

          ‣ simple problems are simple to do; hard problems become quite complex

          ‣ difficult to parameterize scripts externally; must rewrite to change taps!

          ‣ logical planner mixed with physical planner; cannot collect app stats

          ‣ non-deterministic exec: number of mappers+reducers changes unexpectedly

          ‣ business logic must cross multiple language boundaries: difficult to
             troubleshoot, optimize, audit, handle exceptions, set notifications, etc.

Intro to Cascading

                         Code Example #N:
                         city of palo alto, etc.

extend: wc + scrub + stop words

       [flow diagram: Stop Word, HashJoin Left, Regex token, GroupBy token, Word Count]

          1 mapper
          1 reducer
         28+10 lines code

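
The extended word count (scrub, tokenize, then a left join against a stop-word list) can be sketched in a few lines of plain Python. This is an illustration of the data flow only, not the Cascading code: `scrub`, `tokenize` and `word_count` are hypothetical names, the scrub step is assumed to be trim plus lowercase, and the tokenizer regex is an assumption modeled on the punctuation set used in the other examples.

```python
import re
from collections import Counter


def scrub(token):
    # Assumed scrub step: trim whitespace and lowercase the token.
    return token.strip().lower()


def tokenize(text):
    # Split on spaces and common punctuation; drop empty strings.
    return [t for t in re.split(r"[ \[\]\(\),.]+", text) if t]


def word_count(docs, stop_words):
    counts = Counter()
    for doc_id, text in docs:
        for raw in tokenize(text):
            token = scrub(raw)
            # Equivalent of a left join against the stop-word list that keeps
            # only rows with no match on the right-hand side.
            if token and token not in stop_words:
                counts[token] += 1
    return counts


docs = [
    ("doc01", "A rain shadow is a dry area."),
    ("doc02", "The Rain, the shadow."),
]
stop_words = {"a", "is", "the"}
result = word_count(docs, stop_words)
```

Holding the stop-word list in memory as a set mirrors the idea behind a hash join, where the smaller right-hand side is kept in memory on each task.
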
extend: a simple search engine

       [flow diagram: Stop Word, HashJoin Left, Regex token, Unique doc_id, Insert 1, SumBy doc_id, Unique token, CountBy token, ExprFunc tf-idf, CountBy token, Sort count, Count]

         10 mappers
          8 reducers
         68+14 lines code

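
The branches named in the diagram (a document count, a per-token document frequency, a per-document term count, and a final tf-idf expression) can be sketched in Python as follows. This is a sketch under stated assumptions: `tf_idf` is a hypothetical name, and the weighting formula, raw term count times `log(N / (1 + df))`, is one common variant chosen for illustration, not necessarily the expression used in the original app.

```python
import math
from collections import Counter


def tf_idf(docs):
    # docs maps doc_id -> list of tokens (already scrubbed and filtered).
    n_docs = len(docs)

    # Document frequency: in how many documents each token appears.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        # Term frequency: raw count of each token within this document.
        tf = Counter(tokens)
        for token, count in tf.items():
            # Assumed variant: tf * log(N / (1 + df)).
            scores[(doc_id, token)] = count * math.log(n_docs / (1.0 + df[token]))
    return scores


docs = {
    "d1": ["rain", "rain", "shadow"],
    "d2": ["shadow", "lee"],
    "d3": ["lee", "side"],
}
scores = tf_idf(docs)
```

With this variant, a token that appears in all but one document scores zero, and a token that appears in every document scores below zero.
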
City of Palo Alto open data

       [flow diagram: GIS export, Regex parser, Checkpoint tsv, Regex filter, Regex parser, Tree Metadata, Road Metadata, HashJoin Left, Estimate Albedo, Geohash, CoGroup, Tree Distance, Filter tree_dist, GroupBy tree_name, Checkpoint shade, GPS logs, reco, Failure Traps]

           ‣ GIS export for parks, roads, trees (unstructured / open data)
           ‣ log files of personalized/frequented locations in Palo Alto via iPhone GPS tracks
           ‣ curated metadata, used to enrich the dataset
           ‣ could extend via mash-up with many available public data APIs

         Enterprise-scale app: road albedo + tree species metadata + geospatial indexing
         “Find a shady spot on a summer day to walk near downtown and take a call…”

CoPA: log events

CoPA: results

       [plot: Estimated Tree Height (meters)]


            ‣   addr: 115 HAWTHORNE AVE
            ‣   lat/lng: 37.446, -122.168
            ‣   geohash: 9q9jh0
            ‣   tree: 413 site 2
            ‣   species: Liquidambar styraciflua
            ‣   avg height 23 m
            ‣   road albedo: 0.12
            ‣   distance: 10 m
            ‣   a short walk from my train stop ✔
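
The geohash in the result above is what lets trees, roads and GPS log events be joined on a shared key. A minimal sketch of the standard geohash encoding follows; `geohash_encode` is a hypothetical helper name, not a function from the app. Because the coordinates on the slide are rounded, this sketch is only expected to agree with the slide's geohash on its leading characters; the trailing characters depend on the unrounded position.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lng, precision=6):
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    chars = []
    bits, value = 0, 0
    use_lng = True  # bits alternate between axes, starting with longitude

    while len(chars) < precision:
        if use_lng:
            mid = (lng_lo + lng_hi) / 2
            if lng >= mid:
                value = (value << 1) | 1
                lng_lo = mid
            else:
                value <<= 1
                lng_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = (value << 1) | 1
                lat_lo = mid
            else:
                value <<= 1
                lat_hi = mid
        use_lng = not use_lng
        bits += 1
        if bits == 5:
            # Every 5 bits become one base-32 character.
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


cell = geohash_encode(37.446, -122.168, precision=4)
```

Nearby points usually share a geohash prefix, which is why a prefix can serve as a join key, although points on either side of a cell boundary can differ in their trailing characters.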

Intro to Cascading

                         predictive modeling

PMML model

                  1. use customer order history as the training data set
                  2. train a risk classifier for orders, using Random Forest
                  3. export model from R to PMML
                  4. build a Cascading app to execute the PMML model
                         4.1. generate a pipeline from PMML description
                         4.2. planner builds the flow for a topology (Hadoop)
                         4.3. compile app to a JAR file
                  5. deploy the app at scale to calculate scores

       [architecture diagram: Cascading apps]

       risk classifier, dimension: customer 360 (Hadoop, batch workloads)
         ‣ data prep
         ‣ predict model costs
         ‣ detect fraudsters
         ‣ segment customers

       risk classifier, dimension: per-order (IMDG, real-time workloads)
         ‣ score new orders
         ‣ anomaly detection
         ‣ velocity metrics

       other elements: training data sets, analyst's laptop, customer transactions, Customer, DW, chargebacks etc., partner data

                         “orders” data set...
                         train/test in R...
                         exported as PMML

R modeling
       ## load the packages used below (saveXML comes from the XML package)

       library(randomForest)
       library(pmml)
       library(XML)

       ## train a RandomForest model

       f <- as.formula("as.factor(label) ~ .")
       fit <- randomForest(f, data_train, ntree=50)

       ## test the model on the holdout test set

       predicted <- predict(fit, data)
       data$predicted <- predicted
       confuse <- table(pred = predicted, true = data[,1])

       ## export predicted labels to TSV

       write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
         quote=FALSE, sep="\t", row.names=FALSE)

       ## export RF model to PMML

       saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))

R output
       var0        0.6591701
       var1       33.8625179
       var2        8.0290020

               OOB estimate of  error rate: 13.83%
       Confusion matrix:
          0  1 class.error
       0 28  5   0.1515152
       1  8 53   0.1311475

       [1] "./data/sample.rf.xml"
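
The error figures in this output follow directly from the confusion matrix. As a worked check in Python, assuming rows are the true classes and columns the predicted classes (the usual layout for this kind of report), with hypothetical variable names throughout:

```python
# Confusion matrix from the R output: confusion[true_label][predicted_label]
confusion = {
    0: {0: 28, 1: 5},
    1: {0: 8, 1: 53},
}


def class_error(row, true_label):
    # Fraction of this class's rows that were assigned to another class.
    total = sum(row.values())
    return (total - row[true_label]) / total


errors = {label: class_error(row, label) for label, row in confusion.items()}

total = sum(sum(row.values()) for row in confusion.values())
misclassified = sum(
    n
    for label, row in confusion.items()
    for predicted, n in row.items()
    if predicted != label
)
overall_error = misclassified / total
```

Class 0 gives 5/33, class 1 gives 8/61, and the overall rate is 13/94, which rounds to the reported 13.83%.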

                         Cascading app
                         takes PMML as
                         a parameter...

PMML model
       <?xml version="1.0"?>
       <PMML version="4.0" xmlns="">
        <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
         <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
         <Application name="Rattle/PMML" version="1.2.30"/>
         <Timestamp>2012-10-22 19:39:28</Timestamp>
        </Header>
        <DataDictionary numberOfFields="4">
         <DataField name="label" optype="categorical" dataType="string">
          <Value value="0"/>
          <Value value="1"/>
         </DataField>
         <DataField name="var0" optype="continuous" dataType="double"/>
         <DataField name="var1" optype="continuous" dataType="double"/>
         <DataField name="var2" optype="continuous" dataType="double"/>
        </DataDictionary>
        <MiningModel modelName="randomForest_Model" functionName="classification">
         <MiningSchema>
          <MiningField name="label" usageType="predicted"/>
          <MiningField name="var0" usageType="active"/>
          <MiningField name="var1" usageType="active"/>
          <MiningField name="var2" usageType="active"/>
         </MiningSchema>
         <Segmentation multipleModelMethod="majorityVote">
          <Segment id="1">
           <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest">
            <MiningSchema>
             <MiningField name="label" usageType="predicted"/>
             <MiningField name="var0" usageType="active"/>
             <MiningField name="var1" usageType="active"/>
             <MiningField name="var2" usageType="active"/>
            </MiningSchema>
            ...
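
The `multipleModelMethod="majorityVote"` attribute means each tree in the segmentation votes for a label and the most frequent label wins. A minimal Python sketch of that scoring rule follows. The three tree functions are hypothetical stand-ins with made-up split rules; the real splits live in each `TreeModel` element and are not shown on the slide.

```python
from collections import Counter


# Hypothetical trees: each maps a row of input fields to a class label.
def tree_1(row):
    return "1" if row["var1"] > 0.5 else "0"


def tree_2(row):
    return "1" if row["var0"] <= 0.5 else "0"


def tree_3(row):
    return "0" if row["var2"] > 0.5 else "1"


def majority_vote(trees, row):
    # Collect one vote per tree and return the most common label.
    votes = Counter(tree(row) for tree in trees)
    label, _ = votes.most_common(1)[0]
    return label


forest = [tree_1, tree_2, tree_3]
score_a = majority_vote(forest, {"var0": 0, "var1": 1, "var2": 0})
score_b = majority_vote(forest, {"var0": 1, "var1": 0, "var2": 0})
```

An odd number of trees avoids ties in a two-class problem; how ties are broken otherwise is left open in this sketch.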

Cascading app
       import java.util.Properties;

       import cascading.flow.Flow;
       import cascading.flow.FlowDef;
       import cascading.flow.hadoop.HadoopFlowConnector;
       import cascading.pipe.Each;
       import cascading.pipe.Pipe;
       import cascading.property.AppProps;
       import cascading.scheme.hadoop.TextDelimited;
       import cascading.tap.Tap;
       import cascading.tap.hadoop.Hfs;
       import cascading.tuple.Fields;

       // Classifier and ClassifierFunction are assumed to be in the same package as Main
       public class Main {
         public static void main( String[] args ) {
           String pmmlPath = args[ 0 ];
           String ordersPath = args[ 1 ];
           String classifyPath = args[ 2 ];
           String trapPath = args[ 3 ];

           Properties properties = new Properties();
           AppProps.setApplicationJarClass( properties, Main.class );
           HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

           // create source and sink taps
           Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
           Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
           Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );

           // define a "Classifier" model from PMML to evaluate the orders
           Classifier classifier = new Classifier( pmmlPath );
           Pipe classifyPipe = new Each( new Pipe( "classify" ), classifier.getFields(),
             new ClassifierFunction( new Fields( "score" ), classifier ), Fields.ALL );

           // connect the taps, pipes, etc., into a flow
           FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
            .addSource( classifyPipe, ordersTap )
            .addTrap( classifyPipe, trapTap )
            .addSink( classifyPipe, classifyTap );

           // write a DOT file and run the flow
           Flow classifyFlow = flowConnector.connect( flowDef );
           classifyFlow.writeDOT( "dot/" );
           classifyFlow.complete();
         }
       }

                         app deployed on
                         a cluster to score
                         customers at scale...

deploy to cloud
       elastic-mapreduce --create --name "RF" \
         --jar s3n:// \
         --arg s3n:// \
         --arg s3n:// \
         --arg s3n:// \
         --arg s3n://

       bash-3.2$ head output/classify/part-00000
       label  var0  var1  var2  order_id  predicted  score
       1      0     1     0     6f8e1014  1          1
       0      0     0     1     6f8ea22e  0          0
       1      0     1     0     6f8ea435  1          1
       0      0     0     1     6f8ea5e1  0          0
       1      0     1     0     6f8ea785  1          1
       1      0     1     0     6f8ea91e  1          1
       0      1     0     0     6f8eaaba  0          0
       1      0     1     0     6f8eac54  1          1
       0      1     1     0     6f8eade3  1          1


                blog, code/wiki/gists, JARs, community, DevOps products:

                                                        Copyright @2012, Concurrent, Inc.

Monday, 17 December 12                                                                      68

Más contenido relacionado

La actualidad más candente

Isis Papyrus Ti Billing Energy E
Isis Papyrus Ti Billing Energy EIsis Papyrus Ti Billing Energy E
Isis Papyrus Ti Billing Energy E
Friso de Jong
Introduction to Business Intelligence in Microsoft SQL Server 2008 R2
Introduction to Business Intelligence in Microsoft SQL Server 2008 R2Introduction to Business Intelligence in Microsoft SQL Server 2008 R2
Introduction to Business Intelligence in Microsoft SQL Server 2008 R2
Quang Nguyễn Bá
adrian coyler open tour keynote
adrian coyler open tour keynoteadrian coyler open tour keynote
adrian coyler open tour keynote
Asadullah Khan
Samuel Bayeta
Samuel BayetaSamuel Bayeta
Samuel Bayeta
Sam B

La actualidad más candente (18)

Isis Papyrus Ti Billing Energy E
Isis Papyrus Ti Billing Energy EIsis Papyrus Ti Billing Energy E
Isis Papyrus Ti Billing Energy E
Introduction to Business Intelligence in Microsoft SQL Server 2008 R2
Introduction to Business Intelligence in Microsoft SQL Server 2008 R2Introduction to Business Intelligence in Microsoft SQL Server 2008 R2
Introduction to Business Intelligence in Microsoft SQL Server 2008 R2
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scalaSunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
BI Dashboards with SQL Server
BI Dashboards with SQL ServerBI Dashboards with SQL Server
BI Dashboards with SQL Server
adrian coyler open tour keynote
adrian coyler open tour keynoteadrian coyler open tour keynote
adrian coyler open tour keynote
Implementing a QbD program to make Process Validation a Lifestyle
Implementing a QbD program to make Process Validation a LifestyleImplementing a QbD program to make Process Validation a Lifestyle
Implementing a QbD program to make Process Validation a Lifestyle
SSRS integration with share point
SSRS integration with share pointSSRS integration with share point
SSRS integration with share point
Denny Lee\'s Data Camp v1.0 talk on SSRS Best Practices for IT
Denny Lee\'s Data Camp v1.0 talk on SSRS Best Practices for ITDenny Lee\'s Data Camp v1.0 talk on SSRS Best Practices for IT
Denny Lee\'s Data Camp v1.0 talk on SSRS Best Practices for IT
101 ab 1600-1630
101 ab 1600-1630101 ab 1600-1630
101 ab 1600-1630
GlassFish Mobility Platform - Hans Hrasna
GlassFish Mobility Platform - Hans HrasnaGlassFish Mobility Platform - Hans Hrasna
GlassFish Mobility Platform - Hans Hrasna
Sap and alfresco integrations with ctac connector 19 april2011
Sap and alfresco integrations with ctac connector 19 april2011Sap and alfresco integrations with ctac connector 19 april2011
Sap and alfresco integrations with ctac connector 19 april2011
Hadoop - Now, Next and Beyond
Hadoop - Now, Next and BeyondHadoop - Now, Next and Beyond
Hadoop - Now, Next and Beyond
Os Lonergan
Os LonerganOs Lonergan
Os Lonergan
SAP Alfresco Integration For The Public Sector With Ctac
SAP Alfresco Integration For The Public Sector With CtacSAP Alfresco Integration For The Public Sector With Ctac
SAP Alfresco Integration For The Public Sector With Ctac
Samuel Bayeta
Samuel BayetaSamuel Bayeta
Samuel Bayeta
SharePoint 2010: ECM-ready?
SharePoint 2010: ECM-ready?SharePoint 2010: ECM-ready?
SharePoint 2010: ECM-ready?

Similar a Enterprise Data Workflows with Cascading

Similar a Enterprise Data Workflows with Cascading (10)

Cascading for the Impatient
Cascading for the ImpatientCascading for the Impatient
Cascading for the Impatient
Intro to Cascading (SpringOne2GX)
Intro to Cascading (SpringOne2GX)Intro to Cascading (SpringOne2GX)
Intro to Cascading (SpringOne2GX)
Chicago Hadoop Users Group: Enterprise Data Workflows
Chicago Hadoop Users Group: Enterprise Data WorkflowsChicago Hadoop Users Group: Enterprise Data Workflows
Chicago Hadoop Users Group: Enterprise Data Workflows
Intro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataIntro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big Data
Cascading meetup #4 @ BlueKai
Cascading meetup #4 @ BlueKaiCascading meetup #4 @ BlueKai
Cascading meetup #4 @ BlueKai
Using Cascalog to build
 an app based on City of Palo Alto Open Data
Using Cascalog to build
 an app based on City of Palo Alto Open DataUsing Cascalog to build
 an app based on City of Palo Alto Open Data
Using Cascalog to build
 an app based on City of Palo Alto Open Data
Functional programming for optimization problems in Big Data
Functional programming for optimization problems in Big DataFunctional programming for optimization problems in Big Data
Functional programming for optimization problems in Big Data
The Workflow Abstraction
The Workflow AbstractionThe Workflow Abstraction
The Workflow Abstraction
The Workflow Abstraction
The Workflow AbstractionThe Workflow Abstraction
The Workflow Abstraction
Advanced analytics with sap hana and r
Advanced analytics with sap hana and rAdvanced analytics with sap hana and r
Advanced analytics with sap hana and r

Más de Paco Nathan

Human-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLHuman-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage ML
Paco Nathan
Human-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLHuman-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage ML
Paco Nathan
Humans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIHumans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AI
Paco Nathan
Humans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryHumans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industry
Paco Nathan
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
Paco Nathan
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
Paco Nathan

Más de Paco Nathan (20)

Human in the loop: a design pattern for managing teams working with ML
Human in the loop: a design pattern for managing  teams working with MLHuman in the loop: a design pattern for managing  teams working with ML
Human in the loop: a design pattern for managing teams working with ML
Human-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLHuman-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage ML

Enterprise Data Workflows with Cascading

  • 1. Enterprise Data Workflows with Cascading. Paco Nathan (@pacoid), Concurrent, Inc. Copyright @2012, Concurrent, Inc. Monday, 17 December 12
  • 2. Unstructured Data meets Enterprise Scale
      1. Cascading API: a few facts & quotes
      2. Example #1: distributed file copy
      3. Example #2: word count
      4. Pattern Language: workflow abstraction
      5. Compare: Scalding, Cascalog, Hive, Pig
  • 3. Intro to Cascading. Cascading API: a few facts & quotes
  • 4. Enterprise apps, pre-Hadoop [diagram: analyst, developer, and ops roles around a Data Warehouse, with ETL from data sources, SQL queries, Analytics Tools, Apps, modeling, dashboards, and ad-hoc analysis queries]
  • 5. Enterprise apps, pre-Hadoop
      the devil you know:
      ‣ “scale up” as needed – larger proprietary hardware
      ‣ data warehouse: e.g., Oracle, Teradata, etc. – expensive
      ‣ analytics: e.g., SAS, MicroStrategy, etc. – expensive
      ‣ highly trained staff in specific roles – lots of “silos”
      however, to be competitive now, the data rates must scale by orders of magnitude...
      ( alternatively, can we get hired onto the SAS sales team? )
  • 6. Enterprise apps, with Hadoop
      Apache Hadoop offers an attractive migration path:
      ‣ open source software – less expensive
      ‣ commodity hardware – less expensive
      ‣ fault tolerance for large-scale parallel workloads
      ‣ great use cases: Yahoo!, Facebook, Twitter, Amazon, Apple, etc.
      ‣ offload workflows from licensed platforms, based on “scale-out”
  • 7. Enterprise apps, with Hadoop [diagram: analyst (queries, models), developer (Java apps), and ops (ETL needs) all working against a Hadoop cluster with its job tracker and name node]
  • 8. Enterprise apps, with Hadoop
      anything odd about that diagram?
      ‣ demands expert Hadoop developers
      ‣ experts are hard to find, expensive
      ‣ even harder to train from among existing staff
      ‣ early adopter abstractions are not suitable for Enterprise IT
      ‣ importantly: Hadoop is almost never used in isolation
  • 9. Cascading API: purpose
      ‣ simplify data processing development and deployment
      ‣ improve application developer productivity
      ‣ enable data processing application manageability
  • 10. Cascading API: a few facts
      Java open source project (ASL 2) using Git, Gradle, Maven, JUnit, etc.
      in production (~5 yrs) at hundreds of enterprise Hadoop deployments: Finance, Health Care, Transportation, other verticals
      studies published about large use cases: Twitter, Etsy, eBay, Airbnb, Square, Climate Corp, FlightCaster, Williams-Sonoma, Trulia, TeleNav
      partnerships and distribution with SpringSource, Amazon AWS, Microsoft Azure, Hortonworks, MapR, EMC
      several open source projects built atop, managed by Twitter, Etsy, eBay, etc., which provide substantial Machine Learning libraries
      DSLs available in Scala, Clojure, Python (Jython), Ruby (JRuby), Groovy
      data “taps” integrate popular data frameworks via JDBC, Memcached, HBase, plus serialization in Apache Thrift, Avro, Kryo, etc.
      entire app compiles into a single JAR: fully connected for compiler optimization, exception handling, debug, config, scheduling, notifications, provenance, etc.
  • 11. Cascading API: a few quotes
      “Cascading gives Java developers the ability to build Big Data applications on Hadoop using their existing skillset … Management can really go out and build a team around folks that are already very experienced with Java. Switching over to this is really a very short exercise.” (CIO, Thor Olavsrud, 2012-06-06)
      “Masks the complexity of MapReduce, simplifies the programming, and speeds you on your journey toward actionable analytics … A vast improvement over native MapReduce functions or Pig UDFs.” (2012 BOSSIE Awards, James Borck, 2012-09-18)
      “Company’s promise to application developers is an opportunity to build and test applications on their desktops in the language of choice with familiar constructs and reusable components” (Dr. Dobb’s, Adrian Bridgwater, 2012-06-08)
  • 12. Enterprise concerns
      “Notes from the Mystery Machine Bus” by Steve Yegge, Google

      “conservative”                         | “liberal”
      (mostly) Enterprise                    | (mostly) Start-Up
      risk management                        | customer experiments
      assurance                              | flexibility
      well-defined schema                    | schema follows code
      explicit configuration                 | convention
      type-checking compiler                 | interpreted scripts
      wants no surprises                     | wants no impediments
      Java, Scala, Clojure, etc.             | PHP, Ruby, Python, etc.
      Cascading, Scalding, Cascalog, etc.    | Hive, Pig, Hadoop Streaming, etc.
  • 13. Enterprise adoption
      As Enterprise apps move into Hadoop and related Big Data frameworks, risk profiles shift toward more conservative programming practices.
      Cascading provides a popular API (formally speaking, a pattern language) for defining and managing Enterprise data workflows.
  • 14. Migration of batch toolsets

                          | Enterprise | Migration | Start-Ups
      define pipelines    | J2EE       | Cascading | Pig
      query data          | SQL        | Lingual   | Hive
      predictive models   | SAS        | Pattern   | Mahout
  • 15. Summary
      Cascading API benefits:
      ‣ addresses staffing bottlenecks due to Hadoop adoption
      ‣ reduces costs, while servicing risk concerns and “conservatism”
      ‣ manages complexity as the data continues to scale massively
      ‣ provides a pattern language for system integration
      ‣ leverages a workflow abstraction for Enterprise apps
      ‣ utilizes existing practices for JVM-based clusters
  • 16. Intro to Cascading. Code Example #1: distributed file copy
  • 17. 1: distributed file copy (1 mapper, 0 reducers, 10 lines of code)

      import java.util.Properties;

      import cascading.flow.FlowDef;
      import cascading.flow.hadoop.HadoopFlowConnector;
      import cascading.pipe.Pipe;
      import cascading.property.AppProps;
      import cascading.scheme.hadoop.TextDelimited;
      import cascading.tap.Tap;
      import cascading.tap.hadoop.Hfs;

      public class Main
        {
        public static void main( String[] args )
          {
          String inPath = args[ 0 ];
          String outPath = args[ 1 ];

          Properties props = new Properties();
          AppProps.setApplicationJarClass( props, Main.class );
          HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

          // create the source tap (tab-delimited text with a header line)
          Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

          // create the sink tap
          Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

          // specify a pipe to connect the taps
          Pipe copyPipe = new Pipe( "copy" );

          // connect the taps, pipes, etc., into a flow
          FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
           .addSource( copyPipe, inTap )
           .addTailSink( copyPipe, outTap );

          // run the flow
          flowConnector.connect( flowDef ).complete();
          }
        }
  • 18. 1: distributed file copy
      shown:
      ‣ a source tap – input data
      ‣ a sink tap – output data
      ‣ a pipe connecting a source to a sink
      ‣ simplest possible Cascading app
      not shown:
      ‣ what kind of taps? and what size of input data set?
      ‣ could be: JDBC, HBase, Cassandra, XML, flat files, etc.
      ‣ what kind of topology? and what size of cluster?
      ‣ could be: Hadoop, in-memory, etc.
      as system architects, we leverage pattern
  • 19. principle: same JAR, any scale

      MegaCorp Enterprise IT | PBs of data | 1000+ node private cluster | EVP calls you when app fails  | runtime: days+
      Production Cluster     | TBs of data | EMR w/ 50 HPC Instances    | Ops monitors results          | runtime: hours – days
      Staging Cluster        | GBs of data | EMR + 4 Spot Instances     | CI shows red or green lights  | runtime: minutes – hours
      Your Laptop            | MBs of data | Hadoop standalone mode     | passes unit tests, or not     | runtime: seconds – minutes
  • 20. principle: fail the same way twice
      troubleshooting at scale:
      ‣ physical plan for a query provides a deterministic strategy
      ‣ avoid non-deterministic behavior – expensive when troubleshooting
      ‣ otherwise, edge cases become nightmares on large clusters
      ‣ again, addresses “conservative” need for predictability
      ‣ a core value which is unique to Cascading
  • 21. principle: plan ahead
      flow planner per topology:
      ‣ leverage the flow graph (DAG)
      ‣ catch as many errors as possible before an app gets submitted
      ‣ potential problems caught at compile time or at flow planner stage
      ‣ …long before large, expensive resources start getting consumed
      ‣ …or worse, before the wrong results get propagated downstream
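The "plan ahead" idea can be illustrated with a small pre-flight check, assuming a purely linear pipe where each step declares the fields it reads and the fields it passes on. `PlanCheck`, `Step`, and `validate` are hypothetical names for this sketch; it is not the Cascading flow planner, only the same principle of rejecting a bad flow before any cluster resources are used.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical pre-flight check for a linear pipe; not the Cascading planner.
class PlanCheck {
    /** One operation: the fields it reads and the fields it passes downstream. */
    static final class Step {
        final String name;
        final Set<String> needs;
        final Set<String> emits;

        Step(String name, Set<String> needs, Set<String> emits) {
            this.name = name;
            this.needs = needs;
            this.emits = emits;
        }
    }

    /** Returns one message per field that a step reads but nothing upstream provides. */
    static List<String> validate(Set<String> sourceFields, List<Step> steps) {
        List<String> errors = new ArrayList<>();
        Set<String> available = new LinkedHashSet<>(sourceFields);
        for (Step step : steps) {
            for (String field : step.needs) {
                if (!available.contains(field)) {
                    errors.add(step.name + " reads missing field '" + field + "'");
                }
            }
            // Only the emitted fields continue downstream.
            available = new LinkedHashSet<>(step.emits);
        }
        return errors;
    }

    public static void main(String[] args) {
        List<Step> wordCount = List.of(
            new Step("tokenize", Set.of("text"), Set.of("token")),
            new Step("count", Set.of("token"), Set.of("token", "count")));
        // A source with doc_id and text satisfies every step.
        System.out.println(validate(Set.of("doc_id", "text"), wordCount));
        // A source without a text field is rejected before anything runs.
        System.out.println(validate(Set.of("doc_id", "body"), wordCount));
    }
}
```

The check runs in milliseconds on a laptop, which is the point of the principle: the cheapest place to find a wiring error is before submission.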
  • 22. Intro to Cascading. Code Example #2: word count
  • 23. 2: word count
      defined: count how often each word appears in a collection of text documents
      a simple program provides a great test case for parallel processing, since it:
      ‣ requires a minimal amount of code
      ‣ demonstrates use of both symbolic and numeric values
      ‣ shows a dependency graph of tuples as an abstraction
      ‣ is not many steps away from useful search indexing
      ‣ serves as a “Hello World” for Hadoop apps
      any distributed computing framework which runs Word Count efficiently in parallel at scale can handle much larger, more interesting compute problems
  • 24. 2: word count [diagram: Document Collection, Tokenize (M), GroupBy token, Count (R), Word Count] 1 mapper, 1 reducer, 18 lines of code
  • 25. 2: word count

      // fragment from the body of main()
      String docPath = args[ 0 ];
      String wcPath = args[ 1 ];
      Properties properties = new Properties();
      AppProps.setApplicationJarClass( properties, Main.class );
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

      // create source and sink taps
      Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
      Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

      // specify a regex to split "document" text lines into token stream
      Fields token = new Fields( "token" );
      Fields text = new Fields( "text" );
      RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
      // only returns "token"
      Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

      // determine the word counts
      Pipe wcPipe = new Pipe( "wc", docPipe );
      wcPipe = new GroupBy( wcPipe, token );
      wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

      // connect the taps, pipes, etc., into a flow
      FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
       .addSource( docPipe, docTap )
       .addTailSink( wcPipe, wcTap );

      // write a DOT file and run the flow
      Flow wcFlow = flowConnector.connect( flowDef );
      wcFlow.writeDOT( "dot/" );
      wcFlow.complete();
  • 26. 2: word count (flow diagram written by writeDOT; 1 mapper, 1 reducer, 18 lines of code)
      [head]
      Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']        emits [{2}:'doc_id', 'text']
      map:    Each('token')[RegexSplitGenerator[decl:'token'][args:1]]          emits [{1}:'token']
              GroupBy('wc')[by:['token']]                                       emits wc[{1}:'token']
      reduce: Every('wc')[Count[decl:'count']]                                  emits [{2}:'token', 'count']
      Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']
      [tail]
  • 27. 2: word count
      deltas between Example #1 and Example #2:
      ‣ defines source tap as a collection of text documents
      ‣ defines sink tap to produce word count tuples (desired end result)
      ‣ uses named fields, applying structure to unstructured data
      ‣ adds semantics to the workflow, specifying business logic
      ‣ inserts operations into the pipe: Tokenize, GroupBy, Count
      ‣ shows function and aggregation applied to data tuples in parallel
  • 28. Intro to Cascading. Pattern Language: the workflow abstraction
  • 29. enterprise data workflows
      Tuples, Pipelines, Taps, Operations, Joins, Assertions, Traps, etc.
      …in other words, “plumbing” as a pattern language for handling Big Data in Enterprise IT
  • 30. pattern language
      defined: a structured method for solving large, complex design problems, where the syntax of the language promotes the use of best practices
      “plumbing” metaphor of pipes and operators in Cascading helps indicate: algorithms to be used at particular points, appropriate architectural trade-offs, frameworks which must be integrated, etc.
      design patterns: originated in consensus negotiation for architecture, later used in software engineering
  • 31. data workflows: team
      ‣ Business Stakeholder POV: business process management for workflow orchestration (think BPM/BPEL)
      ‣ Systems Integrator POV: system integration of heterogeneous data sources and compute platforms
      ‣ Data Scientist POV: a directed, acyclic graph (DAG) on which we can apply Amdahl's Law, etc.
      ‣ Data Architect POV: a physical plan for large-scale data flow management
      ‣ Software Architect POV: a pattern language, similar to plumbing or circuit design
      ‣ App Developer POV: API bindings for Java, Scala, Clojure, Jython, JRuby, etc.
      ‣ Systems Engineer POV: a JAR file, has passed CI, available in a Maven repo
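For the Data Scientist point of view, Amdahl's Law gives the upper bound on speedup when only part of a workflow runs in parallel: speedup = 1 / ((1 - p) + p / n), where p is the parallelizable fraction and n the number of workers. A minimal sketch follows; the class and method names are hypothetical, and estimating p for a real flow is left out.

```java
// Hypothetical helper illustrating Amdahl's Law; not part of any Cascading API.
class Amdahl {
    /**
     * Upper bound on speedup when a fraction of the work is spread over
     * the given number of workers and the rest stays serial.
     */
    static double speedup(double parallelFraction, int workers) {
        if (parallelFraction < 0.0 || parallelFraction > 1.0 || workers < 1) {
            throw new IllegalArgumentException("need 0 <= fraction <= 1 and workers >= 1");
        }
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / workers);
    }

    public static void main(String[] args) {
        // A flow that is 90% parallel gains little beyond a few dozen workers.
        for (int n : new int[] {1, 10, 100, 1000}) {
            System.out.printf("workers=%d speedup=%.2f%n", n, speedup(0.9, n));
        }
    }
}
```

The serial fraction caps the result: with p = 0.9 the speedup can never exceed 10, however large the cluster.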
  • 32. data workflows: layers
      business process: domain expertise, business trade-offs, operating parameters, market position, etc.
      API language: Java, Scala, Clojure, Jython, JRuby, Groovy, etc. …envision whatever runs in a JVM
      optimize / schedule: major changes in technology now
      physical plan: “assembler” code
      topology: Apache Hadoop, in-memory local mode …envision GPUs, streaming, etc.
      machine data: Splunk, New Relic, Typesafe, Nagios, etc.
  • 33. data workflows: example [diagram: a Cascading app on a Hadoop cluster, connected through source, sink, and trap taps to web logs, a Memcached cluster, a Recommender System, Customer Support review, and Customer Profile DBs]
  • 34. data workflows: SQL vs. JVM

      abstraction   | SQL
      parser        | SQL parser
      optimizer     | logical plan, optimized based on stats
      planner       | physical plan
      machine data  | query history, table stats
      topology      | b-trees, etc.
      visualization | ERD
      schema        | table schema
      catalog       | relational catalog
  • 35. data workflows: SQL vs. JVM

      abstraction   | SQL                                     | JVM
      parser        | SQL parser                              | SQL-92 compliant parser (in progress)
      optimizer     | logical plan, optimized based on stats  | logical plan, optimized based on stats
      planner       | physical plan                           | API “plumbing”
      machine data  | query history, table stats              | app history, tuple stats
      topology      | b-trees, etc.                           | heterogeneous, distributed: Hadoop, in-memory, etc.
      visualization | ERD                                     | flow diagram
      schema        | table schema                            | tuple schema
      catalog       | relational catalog                      | tap usage DB
  • 36. Cascading taxonomy [diagram: Cascading scheduler, app, app instance, flow, step, slice; source, sink, and trap taps; Maven repo; owner; kind: mapper | reducer; topology: hadoop | local]
  • 37. MapReduce architecture
      ‣ name node / data node
      ‣ job tracker / task tracker
      ‣ submit queue
      ‣ task slots
      ‣ HDFS
      ‣ distributed cache
      [images: Wikipedia, Apache]
  • 38. Summary
      If you were leading a team responsible for Enterprise apps:
      ‣ which of the previous two slides seems easier to understand?
      ‣ which is simpler to use for training and managing a team?
      ‣ which costs the most in the long run?
  • 39. Intro to Cascading. Compare & Contrast: other approaches
  • 40. wc: pseudocode

      void map (String doc_id, String text):
        for each word w in segment(text):
          emit(w, "1");

      void reduce (String word, Iterator partial_counts):
        int count = 0;
        for each pc in partial_counts:
          count += Int(pc);
        emit(word, String(count));
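The pseudocode above can be run on a single machine as plain Java, as a rough sketch. The assumptions here are that `segment` means splitting on whitespace and that an in-memory grouping stands in for the shuffle between map and reduce; `WordCountSketch` is a hypothetical name and no Hadoop or Cascading classes are involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process illustration of the map/reduce word count; not a Hadoop job.
class WordCountSketch {
    /** map: emit (word, 1) for each whitespace-separated token in the text. */
    static List<Map.Entry<String, Integer>> map(String docId, String text) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : text.split("\\s+")) {
            if (!w.isEmpty()) {
                out.add(Map.entry(w, 1));
            }
        }
        return out;
    }

    /** reduce: sum the partial counts collected for one word. */
    static int reduce(String word, Iterable<Integer> partialCounts) {
        int count = 0;
        for (int pc : partialCounts) {
            count += pc;
        }
        return count;
    }

    /** Groups map output by word (the "shuffle"), then reduces each group. */
    static Map<String, Integer> run(Map<String, String> docs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        docs.forEach((docId, text) -> {
            for (Map.Entry<String, Integer> kv : map(docId, text)) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        });
        Map<String, Integer> counts = new TreeMap<>();
        groups.forEach((word, pcs) -> counts.put(word, reduce(word, pcs)));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(Map.of("doc01", "rain shadow rain", "doc02", "a rain")));
    }
}
```

In a real cluster the grouping step is what moves data between mappers and reducers, which is why the slide diagrams mark Tokenize with M and Count with R.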
  • 41. Scalding / Scala

      // Sujit Pal
      package com.mycompany.impatient

      import com.twitter.scalding._

      class Part2(args : Args) extends Job(args) {
        val input = Tsv(args("input"), ('docId, 'text))
        val output = Tsv(args("output"))
        input.read.
          flatMap('text -> 'word) { text : String => text.split("""\s+""") }.
          groupBy('word) { group => group.size }.
          write(output)
      }
  • 42. Scalding / Scala
      notes:
      ‣ code is compact, easy to understand
      ‣ functional programming is great for expressing complex workflows in MapReduce, etc.
      ‣ very large-scale, complex problems can be handled in just a few lines of code
      ‣ many large-scale apps in production deployments
      ‣ significant investments by Twitter, Etsy, eBay, etc., in this open source project
      ‣ extensive libraries are available for linear algebra, machine learning – e.g., “Matrix API”
  • 43. Cascalog / Clojure

      ; Paul Lam
      (ns impatient.core
        (:use [cascalog.api]
              [cascalog.more-taps :only (hfs-delimited)])
        (:require [clojure.string :as s]
                  [cascalog.ops :as c])
        (:gen-class))

      (defmapcatop split [line]
        "reads in a line of string and splits it by regex"
        (s/split line #"[\[\]\\\(\),.)\s]+"))

      (defn -main [in out & args]
        (?<- (hfs-delimited out)
             [?word ?count]
             ((hfs-delimited in :skip-header? true) _ ?line)
             (split ?line :> ?word)
             (c/count ?count)))
  • 44. Cascalog / Clojure
      notes:
      ‣ code is compact, easy to understand
      ‣ functional programming is great for expressing complex workflows in MapReduce, etc.
      ‣ significant investments by Twitter, Climate Corp, etc., in this open source project
      ‣ can run queries from the Clojure REPL
      ‣ compelling for very large-scale use cases where code correctness can be verified before deployment
  • 45. Apache Hive

      -- Steve Severance
      CREATE TABLE input (line STRING);

      LOAD DATA LOCAL INPATH 'input.tsv'
      OVERWRITE INTO TABLE input;

      -- split on the table's "line" column (the slide referenced an undefined "text" column)
      SELECT word, COUNT(*)
      FROM input
       LATERAL VIEW explode(split(line, ' ')) lTable AS word
      GROUP BY word
      ;
  • 46. Apache Hive
      pro:
      ‣ most popular abstraction atop Apache Hadoop
      ‣ SQL-like language is syntactically familiar to most analysts
      ‣ simple to load large-scale unstructured data and run ad-hoc queries
      con:
      ‣ not a relational engine, many surprises at scale
      ‣ difficult to represent complex workflows, ML algorithms, etc.
      ‣ one poorly-trained analyst can bottleneck an entire cluster
      ‣ app-level integration requires other coding, outside of script language
      ‣ logical planner mixed with physical planner; cannot collect app stats
      ‣ non-deterministic exec: number of mappers+reducers changes unexpectedly
      ‣ business logic must cross multiple language boundaries: difficult to troubleshoot, optimize, audit, handle exceptions, set notifications, etc.
  • 47. Apache Pig

      -- kudos to Dmitriy Ryaboy
      docPipe = LOAD '$docPath' USING PigStorage('\t', 'tagsource') AS (doc_id, text);
      docPipe = FILTER docPipe BY doc_id != 'doc_id';

      -- specify regex to split "document" text lines into token stream
      tokenPipe = FOREACH docPipe GENERATE doc_id, FLATTEN(TOKENIZE(text, ' [](),.')) AS token;
      tokenPipe = FILTER tokenPipe BY token MATCHES '\\w.*';

      -- determine the word counts
      tokenGroups = GROUP tokenPipe BY token;
      wcPipe = FOREACH tokenGroups GENERATE group AS token, COUNT(tokenPipe) AS count;

      -- output
      STORE wcPipe INTO '$wcPath' USING PigStorage('\t', 'tagsource');
      EXPLAIN -out dot/ -dot wcPipe;
  • 48. Apache Pig
      pro:
      ‣ easy to learn data manipulation language (DML)
      ‣ interactive prompt (Grunt) makes it simple to prototype apps
      ‣ extensibility through UDFs
      con:
      ‣ not a full programming language; must extend via UDFs outside of language
      ‣ app-level integration requires other coding, outside of script language
      ‣ simple problems are simple to do; hard problems become quite complex
      ‣ difficult to parameterize scripts externally; must rewrite to change taps!
      ‣ logical planner mixed with physical planner; cannot collect app stats
      ‣ non-deterministic exec: number of mappers+reducers changes unexpectedly
      ‣ business logic must cross multiple language boundaries: difficult to troubleshoot, optimize, audit, handle exceptions, set notifications, etc.
  • 49. Intro to Cascading. Code Example #N: City of Palo Alto, etc.
  • 50. extend: wc + scrub + stop words [diagram: Document Collection, Tokenize, Scrub token, HashJoin Left against a Stop Word List (RHS), Regex token, GroupBy token, Count, Word Count] 1 mapper, 1 reducer, 28+10 lines of code
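The extended pipeline can be sketched in plain Java under two assumptions that the slide does not spell out: "scrub" means trimming and lower-casing each token, and the left join against the stop word list keeps only tokens with no match on the right-hand side. `ScrubSketch` is a hypothetical name, and a set lookup stands in for the HashJoin.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Single-process illustration of wc + scrub + stop words; not Cascading code.
class ScrubSketch {
    /** Assumed scrub rule: trim surrounding whitespace and lower-case the token. */
    static String scrub(String token) {
        return token.trim().toLowerCase(Locale.ROOT);
    }

    /** Scrubs tokens, drops empties and stop words, then counts what remains. */
    static Map<String, Long> count(List<String> tokens, Set<String> stopWords) {
        return tokens.stream()
            .map(ScrubSketch::scrub)
            .filter(t -> !t.isEmpty())
            // Equivalent to a left join with the stop list, keeping unmatched rows.
            .filter(t -> !stopWords.contains(t))
            .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("Rain", " the", "rain ", "A", "shadow");
        System.out.println(count(tokens, Set.of("a", "the")));
    }
}
```

Holding the stop list in memory mirrors why a hash join suits this step: the right-hand side is small enough to replicate to every mapper.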
  • 51. extend: a simple search engine [diagram: the scrub and stop-word pipeline extended with three counting branches: D (Unique doc_id, Insert 1, SumBy), DF (Unique token, CountBy token), and TF (CountBy doc_id, token), joined by CoGroup and HashJoin into an ExprFunc that computes tf-idf, alongside the sorted Word Count] 10 mappers, 8 reducers, 68+14 lines of code
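The three branches in the diagram (TF, DF, D) feed a single expression. A minimal in-memory sketch is given below, assuming the smoothed weighting tf * ln(D / (1 + df)); the slide does not show the formula, so that choice is an assumption, and `TfIdfSketch` with its key format is hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// In-memory illustration of TF, DF, and D feeding a tf-idf expression.
class TfIdfSketch {
    /** Assumed weighting: term count times the log of a smoothed inverse document frequency. */
    static double weight(long tf, long df, long nDocs) {
        return tf * Math.log((double) nDocs / (1.0 + df));
    }

    /** Input: doc_id to scrubbed tokens. Output: "doc_id<TAB>token" to tf-idf weight. */
    static Map<String, Double> score(Map<String, List<String>> docs) {
        Map<String, Long> tf = new HashMap<>(); // counts per (doc_id, token)
        Map<String, Long> df = new HashMap<>(); // number of documents containing each token
        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            Set<String> seen = new HashSet<>();
            for (String token : doc.getValue()) {
                tf.merge(doc.getKey() + "\t" + token, 1L, Long::sum);
                if (seen.add(token)) {
                    df.merge(token, 1L, Long::sum);
                }
            }
        }
        long nDocs = docs.size(); // D: total number of documents
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Long> e : tf.entrySet()) {
            String token = e.getKey().substring(e.getKey().indexOf('\t') + 1);
            out.put(e.getKey(), weight(e.getValue(), df.get(token), nDocs));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> docs = Map.of(
            "d1", List.of("rain", "rain", "shadow"),
            "d2", List.of("rain"),
            "d3", List.of("dry"),
            "d4", List.of("dry"));
        score(docs).forEach((key, w) -> System.out.println(key + "\t" + w));
    }
}
```

On a cluster each of the three maps becomes its own grouped count, which is roughly where the extra mappers and reducers in the slide come from.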
  • 52. City of Palo Alto open data [diagram: a flow of Regex parsers and filters, Scrub, Geohash, HashJoin, CoGroup, GroupBy, Checkpoints, and Failure Traps over the CoPA GIS export, Tree Metadata, Road Metadata, and GPS logs, producing tree, road, park, shade, and reco outputs]
      ‣ GIS export for parks, roads, trees (unstructured / open data)
      ‣ log files of personalized/frequented locations in Palo Alto via iPhone GPS tracks
      ‣ curated metadata, used to enrich the dataset
      ‣ could extend via mash-up with many available public data APIs
      Enterprise-scale app: road albedo + tree species metadata + geospatial indexing
      “Find a shady spot on a summer day to walk near downtown and take a call…”
  • 53. CoPA: log events
  • 54. CoPA: results [chart: density of Estimated Tree Height (meters) against avg_height from 0 to 50, shaded by count]
      ‣ addr: 115 HAWTHORNE AVE
      ‣ lat/lng: 37.446, -122.168
      ‣ geohash: 9q9jh0
      ‣ tree: 413 site 2
      ‣ species: Liquidambar styraciflua
      ‣ avg height 23 m
      ‣ road albedo: 0.12
      ‣ distance: 10 m
      ‣ a short walk from my train stop ✔
  • 55. Intro to Cascading. PMML: predictive modeling
  • 56. PMML model
  • 57. cascading.pattern
      example:
      1. use customer order history as the training data set
      2. train a risk classifier for orders, using Random Forest
      3. export model from R to PMML
      4. build a Cascading app to execute the PMML model
         4.1. generate a pipeline from PMML description
         4.2. planner builds the flow for a topology (Hadoop)
         4.3. compile app to a JAR file
      5. deploy the app at scale to calculate scores
  • 58. cascading.pattern [diagram: two risk classifiers (dimension: customer 360 and dimension: per-order) built as Cascading apps; training data sets and data prep on the analyst's laptop produce a PMML model, which scores new orders, predicts model costs, detects fraudsters, and segments customers; Hadoop batch workloads and IMDG real-time workloads draw on customer transactions, the Customer DB, ETL, the DW, partner data, chargebacks, anomaly detection, and velocity metrics]
  • 59. 1: “orders” data set... train/test in R... exported as PMML
  • 60. R modeling

      ## data_train, data, and dat_folder are defined earlier in the script (not shown)
      library(randomForest)
      library(pmml)
      library(XML)

      ## train a RandomForest model
      f <- as.formula("as.factor(label) ~ .")
      fit <- randomForest(f, data_train, ntree=50)

      ## test the model on the holdout test set
      print(fit$importance)
      print(fit)

      predicted <- predict(fit, data)
      data$predicted <- predicted
      confuse <- table(pred = predicted, true = data[,1])
      print(confuse)

      ## export predicted labels to TSV
      write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)

      ## export RF model to PMML
      saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
  • 61. R output

           MeanDecreaseGini
      var0        0.6591701
      var1       33.8625179
      var2        8.0290020

      OOB estimate of error rate: 13.83%
      Confusion matrix:
         0  1 class.error
      0 28  5   0.1515152
      1  8 53   0.1311475

      [1] "./data/sample.rf.xml"
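The error figures in the R output follow directly from the confusion matrix. As a small worked check, assuming rows are the actual class and columns the predicted class (which is how the class.error column lines up), each class error is the off-diagonal count divided by the row total, and the overall rate is all misclassifications divided by all cases. `ConfusionSketch` is a hypothetical helper.

```java
// Worked arithmetic for the confusion matrix above; rows = actual, columns = predicted.
class ConfusionSketch {
    /** Per-class error: misclassified cases in a row divided by the row total. */
    static double[] classError(long[][] matrix) {
        double[] errors = new double[matrix.length];
        for (int actual = 0; actual < matrix.length; actual++) {
            long total = 0;
            for (long cell : matrix[actual]) {
                total += cell;
            }
            errors[actual] = (double) (total - matrix[actual][actual]) / total;
        }
        return errors;
    }

    /** Overall error: every off-diagonal case divided by the number of cases. */
    static double overallError(long[][] matrix) {
        long total = 0;
        long correct = 0;
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < matrix[i].length; j++) {
                total += matrix[i][j];
                if (i == j) {
                    correct += matrix[i][j];
                }
            }
        }
        return (double) (total - correct) / total;
    }

    public static void main(String[] args) {
        long[][] matrix = {{28, 5}, {8, 53}};
        double[] perClass = classError(matrix);
        System.out.printf("class 0: %.7f%n", perClass[0]);               // 5 / 33
        System.out.printf("class 1: %.7f%n", perClass[1]);               // 8 / 61
        System.out.printf("overall: %.2f%%%n", 100 * overallError(matrix)); // 13 / 94
    }
}
```

With these counts the ratios are 5/33, 8/61, and 13/94, which agree with the 0.1515152, 0.1311475, and 13.83% reported by R.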
  • 62. 2: Cascading app takes PMML as a parameter...
  • 63. PMML model

      <?xml version="1.0"?>
      <PMML version="4.0" xmlns=""
       xmlns:xsi=""
       xsi:schemaLocation="">
       <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
        <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
        <Application name="Rattle/PMML" version="1.2.30"/>
        <Timestamp>2012-10-22 19:39:28</Timestamp>
       </Header>
       <DataDictionary numberOfFields="4">
        <DataField name="label" optype="categorical" dataType="string">
         <Value value="0"/>
         <Value value="1"/>
        </DataField>
        <DataField name="var0" optype="continuous" dataType="double"/>
        <DataField name="var1" optype="continuous" dataType="double"/>
        <DataField name="var2" optype="continuous" dataType="double"/>
       </DataDictionary>
       <MiningModel modelName="randomForest_Model" functionName="classification">
        <MiningSchema>
         <MiningField name="label" usageType="predicted"/>
         <MiningField name="var0" usageType="active"/>
         <MiningField name="var1" usageType="active"/>
         <MiningField name="var2" usageType="active"/>
        </MiningSchema>
        <Segmentation multipleModelMethod="majorityVote">
         <Segment id="1">
          <True/>
          <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">
           <MiningSchema>
            <MiningField name="label" usageType="predicted"/>
            <MiningField name="var0" usageType="active"/>
            <MiningField name="var1" usageType="active"/>
            <MiningField name="var2" usageType="active"/>
           </MiningSchema>
      ...
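The `multipleModelMethod="majorityVote"` attribute means each tree segment predicts a label and the most frequent label wins. A minimal sketch of that combination step is shown below; `MajorityVoteSketch` is a hypothetical name, and breaking ties in favor of the lexicographically smallest label is an assumption made only to keep the result deterministic.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration of combining per-tree predictions by majority vote.
class MajorityVoteSketch {
    /** Returns the label predicted by the most trees; ties go to the smallest label. */
    static String majorityVote(List<String> treePredictions) {
        if (treePredictions.isEmpty()) {
            throw new IllegalArgumentException("at least one prediction is required");
        }
        // TreeMap keeps labels sorted, which makes tie-breaking deterministic.
        Map<String, Integer> votes = new TreeMap<>();
        for (String label : treePredictions) {
            votes.merge(label, 1, Integer::sum);
        }
        String best = null;
        int bestVotes = -1;
        for (Map.Entry<String, Integer> entry : votes.entrySet()) {
            if (entry.getValue() > bestVotes) {
                best = entry.getKey();
                bestVotes = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Three hypothetical trees voting on one order.
        System.out.println(majorityVote(List.of("1", "0", "1")));
    }
}
```

Using an odd number of trees avoids ties for a two-class label such as the one in this model.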
  • 64. Cascading app

      public class Main
        {
        public static void main( String[] args )
          {
          String pmmlPath = args[ 0 ];
          String ordersPath = args[ 1 ];
          String classifyPath = args[ 2 ];
          String trapPath = args[ 3 ];

          Properties properties = new Properties();
          AppProps.setApplicationJarClass( properties, Main.class );
          HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

          // create source and sink taps
          Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
          Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
          Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );

          // define a "Classifier" model from PMML to evaluate the orders
          Classifier classifier = new Classifier( pmmlPath );
          Pipe classifyPipe = new Each( new Pipe( "classify" ), classifier.getFields(),
            new ClassifierFunction( new Fields( "score" ), classifier ), Fields.ALL );

          // connect the taps, pipes, etc., into a flow
          FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
           .addSource( classifyPipe, ordersTap )
           .addTrap( classifyPipe, trapTap )
           .addSink( classifyPipe, classifyTap );

          // write a DOT file and run the flow
          Flow classifyFlow = flowConnector.connect( flowDef );
          classifyFlow.writeDOT( "dot/" );
          classifyFlow.complete();
          }
        }
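The `addTrap` call in the app above routes tuples that fail inside the pipe to a separate tap instead of failing the whole flow. The behaviour can be illustrated in plain Java as a rough sketch; `TrapSketch` and `applyWithTrap` are hypothetical names and this is not how Cascading implements traps internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Illustration of trap semantics: bad tuples are set aside, good ones flow on.
class TrapSketch {
    /** Applies the operation to each tuple; tuples that throw are added to the trap. */
    static <T, R> List<R> applyWithTrap(List<T> tuples, Function<T, R> operation, List<T> trap) {
        List<R> results = new ArrayList<>();
        for (T tuple : tuples) {
            try {
                results.add(operation.apply(tuple));
            } catch (RuntimeException e) {
                // The failing input is preserved for later inspection.
                trap.add(tuple);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> trap = new ArrayList<>();
        List<Integer> parsed = applyWithTrap(List.of("1", "x", "3"), s -> Integer.parseInt(s), trap);
        System.out.println("results: " + parsed);
        System.out.println("trapped: " + trap);
    }
}
```

Keeping the rejected inputs, rather than only logging the exception, is what lets ops replay or audit them after a long-running job finishes.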
  • 65. 3: app deployed on a cluster to score customers at scale...
  • 66. deploy to cloud

      elastic-mapreduce --create --name "RF" \
        --jar s3n:// \
        --arg s3n:// \
        --arg s3n:// \
        --arg s3n:// \
        --arg s3n://
  • 67. results

      bash-3.2$ head output/classify/part-00000
      label  var0  var1  var2  order_id  predicted  score
      1      0     1     0     6f8e1014  1          1
      0      0     0     1     6f8ea22e  0          0
      1      0     1     0     6f8ea435  1          1
      0      0     0     1     6f8ea5e1  0          0
      1      0     1     0     6f8ea785  1          1
      1      0     1     0     6f8ea91e  1          1
      0      1     0     0     6f8eaaba  0          0
      1      0     1     0     6f8eac54  1          1
      0      1     1     0     6f8eade3  1          1
  • 68. drill-down: blog, code/wiki/gists, JARs, community, DevOps products. @pacoid. Copyright @2012, Concurrent, Inc.