Intro to Cascading

Paco Nathan
Concurrent, Inc.
pnathan@concurrentinc.com
@pacoid

[flow diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (RHS: Stop Word List) → Regex token → GroupBy token → Count → Word Count; stages carry the labels M and R]

Copyright @2012, Concurrent, Inc.
Enterprise Apps
 for Big Data
with Cascading

  1. intro: Cascading API
  2. backstory: Big Data origins
  3. context: Hadoop cliff notes
  4. theory: Data Science teams
  5. tutorial: for the impatient
  6. code: sample apps
1. intro:
Cascading API
Cascading API: purpose
  ‣ simplify data processing development and deployment

  ‣ improve application developer productivity

  ‣ enable data processing application manageability
Cascading API: a few facts
  Java open source project (ASL 2) using Git, Gradle, Maven, JUnit, etc.

  in production (~5 yrs) at hundreds of enterprise Hadoop deployments:
  Finance, Health Care, Transportation, other verticals

  studies published about large use cases: Twitter, Etsy, Airbnb, Square,
  Climate Corporation, FlightCaster, Williams-Sonoma

  partnerships and distribution with SpringSource, Amazon AWS,
  Microsoft Azure, Hortonworks, MapR, EMC

  several open source projects built atop, contribs by Twitter, Etsy, etc.,
  which provide substantial Machine Learning libraries

  DSLs available in Scala, Clojure, Python (Jython), Ruby (JRuby), Groovy

  data “taps” integrate popular data frameworks via JDBC, Memcached, HBase,
  plus serialization in Apache Thrift, Avro, Kryo, etc.

  entire app compiles into a single JAR: fully connected for compiler optimization, exception
  handling, debugging, config, scheduling, etc.
Cascading API: a few quotes
 “Cascading gives Java developers the ability to build Big Data applications
  on Hadoop using their existing skillset … Management can really go out
  and build a team around folks that are already very experienced with Java.
  Switching over to this is really a very short exercise.”
   CIO, Thor Olavsrud, 2012-06-06
   cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading


 “Masks the complexity of MapReduce, simplifies the programming, and
  speeds you on your journey toward actionable analytics … A vast
  improvement over native MapReduce functions or Pig UDFs.”
   2012 BOSSIE Awards, James Borck, 2012-09-18
   infoworld.com/slideshow/65089


 “Company’s promise to application developers is an opportunity to build
  and test applications on their desktops in the language of choice with
  familiar constructs and reusable components”
   Dr. Dobb’s, Adrian Bridgwater, 2012-06-08
   drdobbs.com/jvm/where-does-big-data-go-to-get-data-inten/240001759
data+code “political spectrum”
 “Notes from the Mystery Machine Bus”
 by Steve Yegge, Google
 goo.gl/SeRZa
          “conservative”                            “liberal”
            (mostly) Enterprise                   (mostly) Start-Up

             risk management                    customer experiments

                 assurance                            flexibility

           well-defined schema                   schema follows code

           explicit configuration                     convention

          type-checking compiler                 interpreted scripts

            wants no surprises                  wants no impediments

          Java, Scala, Clojure, etc.            PHP, Ruby, Python, etc.

   Cascading, Scalding, Cascalog, etc.   Hive, Pig, Hadoop Streaming, etc.
Cascading API: adoption

    As Enterprise apps move into
    Hadoop and related BigData
    frameworks, risk profiles shift
    toward more conservative
    programming practices

    Cascading provides a popular API
    for defining and managing
    Enterprise data workflows
enterprise data workflows
 Tuples, Pipelines, Endpoints, Operations, Joins, Assertions, Traps, etc.
 …in other words, “plumbing”

  [flow diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (RHS: Stop Word List) → Regex token → GroupBy token → Count → Word Count]
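
As a rough illustration of the "plumbing" idea, the Word Count flow from the diagram can be sketched in plain Python, with one generator per pipe stage. This is not the Cascading API. The function names, the token-splitting regex, and the sample data are all assumptions made for the sketch, and the "Regex" stage is modeled as a filter that keeps only rows with no stop-word match.

```python
import re
from collections import Counter


def tokenize(docs):
    """Split each (doc_id, text) tuple into (doc_id, token) tuples."""
    for doc_id, text in docs:
        for token in re.split(r"[ \[\]\(\),.]", text):
            yield doc_id, token


def scrub(tuples):
    """Trim and lowercase each token, dropping empty ones."""
    for doc_id, token in tuples:
        token = token.strip().lower()
        if token:
            yield doc_id, token


def left_join_stop_words(tuples, stop_words):
    """Left join against the stop-word list (the RHS of the join).

    Rows without a match keep None in the joined field.
    """
    for doc_id, token in tuples:
        yield doc_id, token, (token if token in stop_words else None)


def keep_unmatched(tuples):
    """Filter stage: pass only tokens that did not match a stop word."""
    for _doc_id, token, stop in tuples:
        if stop is None:
            yield token


def word_count(docs, stop_words):
    """Connect the stages end to end, then group by token and count."""
    joined = left_join_stop_words(scrub(tokenize(docs)), stop_words)
    return Counter(keep_unmatched(joined))


# Hypothetical sample input, not taken from the slides.
docs = [("doc01", "A rain shadow is a dry area"), ("doc02", "the rain")]
counts = word_count(docs, stop_words={"a", "is", "the"})
```

Each stage only consumes and produces tuples, so stages can be rearranged or reused, which is the point of treating the workflow as plumbing.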
data workflows: team
  ‣ Business Stakeholder POV:
    business process management for workflow orchestration (think BPM/BPEL)

  ‣ Systems Integrator POV:
    system integration of heterogeneous data sources and compute platforms

  ‣ Data Scientist POV:
    a directed, acyclic graph (DAG) on which we can apply Amdahl's Law, etc.

  ‣ Data Architect POV:
    a physical plan for large-scale data flow management

  ‣ Software Architect POV:
    a pattern language, similar to plumbing or circuit design

  ‣ App Developer POV:
    API bindings for Java, Scala, Clojure, Jython, JRuby, etc.

  ‣ Systems Engineer POV:
    a JAR file, has passed CI, available in a Maven repo
data workflows: layers
    business process      domain expertise, business trade-offs,
                          operating parameters, market position, etc.

    API language          Java, Scala, Clojure, Jython, JRuby, Groovy, etc.
                          …envision whatever runs in a JVM

    optimize / schedule   major changes in technology now

    physical plan         [flow diagram: Document Collection → … → Word Count]

    compute substrate     Apache Hadoop, in-memory local mode
                          …envision GPUs, streaming, etc.
                          (“assembler” code)

    machine data          Splunk, Nagios, Collectd, New Relic, etc.
data workflows: SQL
  Relational
  SQL parser
  logical plan, optimized based on stats
  physical plan
  query history, table stats
  b-trees, etc.
  ERD
  table schema
  catalog
data workflows: SQL vs. JVM
  Relational                               Cascading + Driven
  SQL parser                               SQL-92 compliant parser (in progress)
  logical plan, optimized based on stats   TODO: logical plan, optimized based on stats
  physical plan                            API “plumbing”
  query history, table stats               app history, tuple stats
  b-trees, etc.                            distributed compute substrate: Hadoop, in-memory, etc.
  ERD                                      flow diagram
  table schema                             tuple schema
  catalog                                  endpoint usage DB
2. backstory:
Big Data origins
inflection point
 huge Internet successes after 1997 holiday season…
 AMZN, EBAY, Inktomi (YHOO Search), then GOOG

 consider this metric:
   annual revenue per customer / amount of data stored
 which dropped 100x within a few years after 1997

 storage and processing costs plummeted, now we must
 work much smarter to extract ROI from Big Data…
 our methods must adapt

 “conventional wisdom” of RDBMS and BI tools became
 less viable; however, business cadre was still focused on
 pivot tables and pie charts… which tends toward inertia!

 MapReduce and the Hadoop open source stack grew
 directly out of that contention… however, that effort
 only solves parts of the puzzle

 [figure: year markers 1997, 1998, 2004]
inflection point: consequences
 Geoffrey Moore (Mohr Davidow Ventures, author of Crossing The Chasm)
 Hadoop Summit, 2012:

 “All of Fortune 500 is now on notice over the next 10-year period.”
 Amazon and Google as exemplars of massive disruption in retail, advertising,
 etc.
 data as the major force displacing Global 1000 over the next decade, mostly
 through apps — verticals, leveraging domain expertise


 Michael Stonebraker (INGRES, PostgreSQL, Vertica, VoltDB, etc.)
 XLDB, 2012:

 “Complex analytics workloads are now displacing SQL as the basis
  for Enterprise apps.”
primary sources
 Amazon
 “Early Amazon: Splitting the website” – Greg Linden
 glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

 eBay
 “The eBay Architecture” – Randy Shoup, Dan Pritchett
 addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
 addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf

 Inktomi (YHOO Search)
 “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
 youtube.com/watch?v=E91oEn1bnXM

 Google
 “The Birth of Google” – John Battelle
 wired.com/wired/archive/13.08/battelle.html
 “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
 youtube.com/watch?v=qsan-GQaeyk
 perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
the world before…

BI, SQL, and highly
optimized code
data innovation: circa 1996
  [diagram, labels only; layout lost in extraction]
  roles: Stakeholder, Product, Engineering, BI Analysts, Customers
  artifacts: Excel pivot tables, PowerPoint slide decks, strategy, requirements, optimized code
  data: SQL Query result sets, Web App, transactions, RDBMS
the world after…

machine learning,
leveraging log files
data innovation: circa 2001
  [diagram, labels only; layout lost in extraction]
  roles: Stakeholder, Product, Engineering, Algorithmic Modeling, Customers
  artifacts: dashboards, UX, models, servlets, recommenders + classifiers, aggregation, event history
  data: SQL Query result sets, customer transactions, Logs, DW, ETL, Web Apps, Middleware, RDBMS
the world ahead…

what our customers
are doing now
data innovation: circa 2013
  [diagram, labels only; layout lost in extraction]
  left column (introduced capability): Domain Expert, Data Scientist, App Dev, Ops
  right column (existing SDLC): Prod, s/w dev, Eng, Ops
  other labels: Customers; Data Apps; business process; Workflow; dashboard metrics; History;
    services; Web Apps, Mobile, etc.; data science; Planner; social interactions;
    discovery + modeling; optimized capacity; transactions, content; endpoints;
    Data Access Patterns; Hadoop, etc.; Log Events; In-Memory Data Grid; DW;
    batch; "real time"; Cluster Scheduler; RDBMS
a key difference…
statistical thinking


       Process               Variation                 Data              Tools



  employing a mode of thought which includes both logical and analytical reasoning:
  evaluating the whole of a problem, as well as its component parts; attempting
  to assess the effects of changing one or more variables

  this approach attempts to understand not just problems and solutions,
  but also the processes involved and their variances

  particularly valuable in Big Data work when combined with hands-on experience in
  physics – roughly 50% of my peers come from physics or physical engineering…

  programmers typically don’t think this way…
  however, both systems engineers and data scientists must!
reference

  by Leo Breiman
  Statistical Modeling:
  The Two Cultures
  Statistical Science, 2001
  bit.ly/eUTh9L

  also check out RStudio:
  rstudio.org/
  rpubs.com/
3. context:
Hadoop cliff notes
MapReduce architecture
 ‣ name node + data nodes
 ‣ job tracker + task trackers
 ‣ submit queue
 ‣ task slots
 ‣ HDFS
 ‣ distributed cache

 [image labels: Wikipedia, Apache]
MapReduce: how it works

   map(k1, v1) → list(k2, v2)
   reduce(k2, list(v2)) → list(k3, v3)

 the property of data independence among tasks allows for parallel processing …
 maybe, if the stars are all aligned :)

 MapReduce is mostly about fault tolerance, and how to leverage “commodity
 hardware” to replace “big iron” solutions… where “big iron”
 might apply to Oracle + NetApp, or perhaps an IBM zSeries mainframe…
 or something else that’s expensive, undoubtedly.

 bonus for math geeks: see any concerns about O(n) complexity, given
 Amdahl’s Law plus the functional definitions listed above?

 keep in mind that each phase cannot conclude and progress to the next
 phase until after each of its tasks has successfully completed.
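
The two signatures above can be made concrete with a minimal, single-process sketch. This is not Hadoop code: `run_mapreduce` is a hypothetical driver written for illustration, and word counting is just an assumed example job. It shows the data independence within each phase and the barrier between phases.

```python
from collections import defaultdict
from itertools import chain


def map_fn(k1, v1):
    """map(k1, v1) -> list(k2, v2): emit (word, 1) for each word in a line."""
    return [(word, 1) for word in v1.split()]


def reduce_fn(k2, values):
    """reduce(k2, list(v2)) -> list(k3, v3): sum the counts for one word."""
    return [(k2, sum(values))]


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical single-process driver: map, group by key, then reduce."""
    # Map phase: each record is handled independently of the others,
    # which is what would allow the calls to run in parallel.
    mapped = chain.from_iterable(map_fn(k1, v1) for k1, v1 in records)

    # Barrier: all map output is grouped by key before any reduce call starts.
    groups = defaultdict(list)
    for k2, v2 in mapped:
        groups[k2].append(v2)

    # Reduce phase: each key group is again independent of the others.
    reduced = chain.from_iterable(
        reduce_fn(k2, v2s) for k2, v2s in sorted(groups.items())
    )
    return dict(reduced)


# Keys here are line offsets; the input lines are made up for the example.
result = run_mapreduce([(0, "a b a"), (1, "b c")], map_fn, reduce_fn)
```

The grouping step is where the serial cost hides: no reduce call can begin until the slowest map task has finished.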
a brief history…
 circa 1979 – Stanford, MIT, CMU, etc.
  set/list operations in LISP, Prolog, etc., for parallel processing
  www-formal.stanford.edu/jmc/history/lisp/lisp.htm

 circa 2004 – Google
  MapReduce: Simplified Data Processing on Large Clusters
  Jeffrey Dean and Sanjay Ghemawat
  labs.google.com/papers/mapreduce.html

 circa 2006 – Apache
  Hadoop, originating from the Nutch Project
  Doug Cutting
  research.yahoo.com/files/cutting.pdf

 circa 2008 – Yahoo
  web scale search indexing
  Hadoop Summit, HUG, etc.
  developer.yahoo.com/hadoop/

 circa 2009 – Amazon AWS
  Elastic MapReduce
  Hadoop modified for EC2/S3, plus support for Hive, Pig, Cascading, etc.
  aws.amazon.com/elasticmapreduce/
CAP theorem
 purpose: theoretical limits for data access patterns
 essence:
    ‣ consistency
    ‣ availability
    ‣ partition tolerance




 best case scenario: you may pick two … or spend billions
 struggling to obtain all three at scale (GOOG)
 translated: cost of doing business

   www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf

   julianbrowne.com/article/viewer/brewers-cap-theorem
data access patterns
 because the world is not made of data warehouses…

 a handful of common data access patterns are prevalent

 learn to recognize these for any given problem

 typically expressed in terms of trade-offs:

      ‣ speed & volume (latency and throughput)

      ‣ reads & writes (access and storage)

      ‣ consistency / availability / partition tolerance
access → frameworks → forfeits
  financial transactions               general ledger in RDBMS            CAx
  ad-hoc queries                      RDS (hosted MySQL)                 CAx
  reporting, dashboards               like Pentaho                       CAx
  log rotation/persistence            like Riak                          xxP
  search indexes                      like Lucene/Solr                   xAP
  static content, archives            S3 (durable storage)               xAP
  customer facts                      like Redis, Membase                xAP
  distributed counters, locks, sets   like Redis                         xAP*
  data objects CRUD                   key/value – like, NoSQL on MySQL   CxP
  authoritative metadata              like Zookeeper                     CxP
  data prep, modeling at scale        like Hadoop/Cascading + R          CxP
  graph analysis                      like Hadoop + Redis + Gephi        CxP
  data marts                          like Hadoop/HBase                  CxP
parallel computation
 parallelism allows for horizontal scale-out, which creates
 business “levers” in cost/performance at scale

 NB: MapReduce provides a compute framework which
 is part-parallel and part-serial… which tends to
 complicate app development

 most hard problems in industry have portions which do not
 allow data independence, or which require iteration

 current efforts in massively parallel algorithms research may
 help to parallelize problems and reduce iteration – estimates
 are 3-5 years out for industry use

 GPUs and other hardware architecture advancements
 will likely make Hadoop unrecognizable 3-5 years out
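
One way to see why a part-parallel, part-serial framework limits scale-out is Amdahl's Law: with a parallelizable fraction p of the work and n workers, the speedup is 1 / ((1 - p) + p / n). The sketch below simply evaluates that formula. The 0.9 fraction is an illustrative assumption, not a measurement of any Hadoop job.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Speedup predicted by Amdahl's Law for a given parallel fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Assumed example: 90% of the job parallelizes, 10% stays serial.
for n in (10, 100, 1000):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

With a 10% serial portion the speedup can never reach 10x, however many nodes are added, which is why the serial parts of a workflow deserve the attention.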
reference

  by Tom White
  Hadoop: The Definitive Guide
  O’Reilly, 2009
  amazon.com/dp/1449311520

  see also:
  Cluster Computing and MapReduce Lectures
  code.google.com/edu/submissions/mapreduce-minilecture/listing.html
4. theory:
Data Science teams
core values

  Data Science teams develop actionable insights,
  building confidence for decisions

  that work may influence a few decisions worth
  billions (e.g., M&A) or billions of small decisions (e.g.,
  AdWords)

  probably somewhere in-between…
  solving for pattern, at scale.

  an interdisciplinary pursuit which
  requires teams, not sole players
most valuable skills
 approximately 80% of the costs for data-related projects
 get spent on data preparation – mostly on cleaning up
 data quality issues: ETL, log file analysis, etc.

 unfortunately, data-related budgets for many companies tend
 to go into frameworks which can only be used after clean up

 most valuable skills:
   ‣ learn to use programmable tools that prepare data

   ‣ learn to generate compelling data visualizations

   ‣ learn to estimate the confidence for reported results

   ‣ learn to automate work, making analysis repeatable

 the rest of the skills – modeling,
 algorithms, etc. – those are secondary

 [image label: D3]
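
As one possible reading of "estimate the confidence for reported results", here is a minimal sketch of a percentile bootstrap confidence interval for a mean. The slides do not name a method, so the bootstrap is an assumption, as are the function name, its parameters, and the sample values.

```python
import random
import statistics


def bootstrap_mean_ci(samples, confidence=0.95, resamples=2000, seed=0):
    """Percentile bootstrap interval for the mean of `samples`."""
    rng = random.Random(seed)  # fixed seed keeps the analysis repeatable
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lower = means[int(tail * resamples)]
    upper = means[min(resamples - 1, int((1.0 - tail) * resamples))]
    return lower, upper


# Made-up measurements, for illustration only.
data = [12, 15, 11, 14, 13, 16, 12, 15]
low, high = bootstrap_mean_ci(data)
```

Reporting the interval next to the mean, rather than the mean alone, is the habit the slide is pointing at. The fixed seed also serves the last bullet by making the analysis repeatable.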
the science in data science?
                                                         edoMpUsserD:IUN
                                     tcudorP ylppA lenaP yrotnevnI tneilC
                                  tcudorP evomeR lenaP yrotnevnI tneilC




  in a nutshell, what we do…
                                                         edoMmooRyM:IUN
                                                     edoMmooRcilbuP:IUN
                                                                  ydduB ddA
                                                               nigoL etisbeW
                                                                           vd
                                                          edoMsdneirF:IUN
                                                              edoMtahC:IUN
                                                          egasseM a evaeL
                                             G1 :gniniamer ecaps sserddA
                                                      dekcilCeliforPyM:IUN
                                                       edoMstiderCyuB:IUN
                                                           tohspanS a ekaT
                                                       egapemoH nwO tisiV
                                                               elbbuB a epyT
                                                                taeS egnahC
                                                          wodniW D3 nepO
                                                                  dneirF ddA
                                 revO tcudorP pilF lenaP yrotnevnI tneilC
                                                                   lenaP tidE




  ‣ estimate probability
                                                                    woN tahC
                                                                     teP yalP
                                                                    teP deeF
                             2 petS egaP traC esahcruP edaM remotsuC
                                          M215 :gniniamer ecaps sserddA
                                                              gnihtolC no tuP
                                                           bew :metI na yuB
                                                             edoMeivoM:IUN
                                    ytinummoc ,tneilc :detratS weiV eivoM
                                                             teP weN etaerC
                                        detrats etius tset :tseTytivitcennoC
                                                   emag pazyeh dehcnuaL
                                                    eciov mooRcilbuP tahC
                                                          egasseM yadhtriB
                                                          edoMlairotuT:IUN
                                                    ybbol semag dehcnuaL
                                                        noitartsigeR euqinU




  ‣ calculate analytic variance




  [figure: chart labeled with logged client event names, e.g. "Website Login",
  "Add Buddy", "Chat Now", "Feed Pet", "Take a Snapshot", "Open 3D Window"]
  ‣ manipulate order complexity

  ‣ make use of learning theory

  +   collab with DevOps, Stakeholders

  +   reduce our work to cron entries
team process = needs

  discovery     help people ask the right questions
  modeling      allow automation to place informed bets
  integration   deliver products at scale to customers
  apps          build smarts into product features
  systems       keep infrastructure running, cost-effective

  (image: Gephi)
team composition = roles

  Domain Expert    business process, stakeholder
  Data Scientist   data prep, discovery, modeling, etc.
  App Dev          software engineering, automation
  Ops              systems engineering, access

  (diagram labels: "data science", "introduced capability")
matrix = needs × roles

                discovery   modeling   integration   apps   systems
  stakeholder
  scientist
  developer
  ops
matrix: example team

                discovery   modeling   integration   apps   systems
  stakeholder
  scientist
  developer
  ops


 summary: this team seems heavy on systems, may need more overlap
 between modeling and integration, particularly among team leads
typical hand-offs

  [diagram labels]
  stages (left to right):  integrity, availability, discovery, communications
  sources:                 vendor data sources, production cluster, customer interactions
  storage and query:       data warehouse, query hosts
  reporting:               BI & reporting, dashboards, presentations, decision support
  modeling:                classifiers, recommenders, predictive analytics; analyze, visualize
  automation:              internal API, crons, etc.
  people:                  engineers, analysts; business stakeholders
use case: marketing funnel
  • must optimize a very large ad spend
  • different vendors report different metrics
  • seasonal variation distorts performance
  • some campaigns are much smaller than others
  • hard to predict ROI for incremental spend

  (image: Wikipedia)

  approach:
  • log aggregation, followed by cohort analysis
  • Bayesian point estimates compare different-sized ad tests
  • customer lifetime value quantifies ROI of new leads
  • time series analysis normalizes for seasonal variation
  • geolocation adjusts for regional cost/benefit
  • linear programming models estimate elasticity of demand
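
One way to read the "Bayesian point estimates" bullet is a Beta-Binomial model: with a Beta(a, b) prior on the conversion rate, the posterior mean after k conversions in n impressions is (a + k) / (a + b + n). The sketch below is only an illustration under that assumption. The class name, the uniform prior, and the campaign numbers are hypothetical and do not come from the deck.

```java
/** Sketch: Beta-Binomial posterior mean as a point estimate of a conversion rate. */
class AdTestEstimate {
    // Prior pseudo-counts. Beta(1, 1) is a uniform prior, assumed here for illustration.
    static final double PRIOR_A = 1.0;
    static final double PRIOR_B = 1.0;

    /** Posterior mean (a + k) / (a + b + n) after k conversions in n impressions. */
    static double posteriorMean(long conversions, long impressions) {
        if (conversions < 0 || conversions > impressions) {
            throw new IllegalArgumentException("need 0 <= conversions <= impressions");
        }
        return (PRIOR_A + conversions) / (PRIOR_A + PRIOR_B + impressions);
    }

    public static void main(String[] args) {
        // Hypothetical ad tests of very different sizes.
        System.out.println("small test: " + posteriorMean(2, 2));      // raw rate 2/2
        System.out.println("large test: " + posteriorMean(300, 1000)); // raw rate 300/1000
    }
}
```

The small test's raw rate of 1.0 is pulled back to 3/4, while the large test stays close to its raw rate of 0.30 (301/1002). That shrinkage is what lets tests of different sizes be compared on one scale.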
use case: ecommerce fraud
  • sparse data means lots of missing values
  • “needle in a haystack” lack of training cases
  • answers are available in large-scale batch, but results
    are needed in real-time event processing
  • not just one pattern to detect – many, ever-changing

  (image: stat.berkeley.edu)

  approach:
  • random forest (RF) classifiers predict likely fraud
  • subsample data to re-balance training sets
  • impute missing values based on density functions
  • train on massive log files, run on in-memory grid
  • adjust metrics to minimize customer support costs
  • detect novelty – report anomalies via notifications
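
The "subsample data to re-balance training sets" step can be sketched as random downsampling of the larger class. This is a minimal illustration, not the deck's implementation: the `Example` shape, the fixed seed, and the 1:1 target ratio are all assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch: downsample the larger class so both classes contribute equally. */
class Rebalance {
    /** A labeled training case (hypothetical shape). */
    static final class Example {
        final String id;
        final boolean fraud;

        Example(String id, boolean fraud) {
            this.id = id;
            this.fraud = fraud;
        }
    }

    static List<Example> downsample(List<Example> data, long seed) {
        List<Example> positives = new ArrayList<>();
        List<Example> negatives = new ArrayList<>();
        for (Example e : data) {
            (e.fraud ? positives : negatives).add(e);
        }
        List<Example> small = positives.size() <= negatives.size() ? positives : negatives;
        List<Example> large = positives.size() <= negatives.size() ? negatives : positives;

        // Shuffle with a fixed seed so the subsample is reproducible.
        Collections.shuffle(large, new Random(seed));

        // Keep every case from the smaller class and an equal number from the larger one.
        List<Example> balanced = new ArrayList<>(small);
        balanced.addAll(large.subList(0, small.size()));
        return balanced;
    }

    public static void main(String[] args) {
        List<Example> data = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            data.add(new Example("case" + i, i % 100 == 0)); // 10 fraud cases, 990 legitimate
        }
        System.out.println("balanced size: " + downsample(data, 42L).size());
    }
}
```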
use case: customer segmentation
  • many millions of customers, hard to determine
    which features resonate
  • multi-modal distributions get obscured by the
    practice of calculating an “average”
  • not much is known about individual customers

  (image: Mathworks)

  approach:
  • connected components for sessionization, determining
    uniques from logs
  • estimates for age, gender, income, geo, etc.
  • clustering algorithms to group into market segments
  • social graph infers “unknown” relationships
  • covariance/heat maps visualize segments vs. feature sets
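
The "connected components for sessionization" bullet can be sketched with a union-find structure: identifiers that appear together in a log record are linked, and each resulting component is treated as one unique customer. The identifier formats and the class name below are hypothetical. At the scale the deck describes this would run as a distributed job, not in one in-memory map.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: union-find over identifiers seen together in log records. */
class Uniques {
    private final Map<String, String> parent = new HashMap<>();

    /** Returns the representative of x's component, registering x if it is new. */
    String find(String x) {
        parent.putIfAbsent(x, x);
        String root = x;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        // Path compression: point every node on the walk directly at the root.
        String cur = x;
        while (!cur.equals(root)) {
            String next = parent.get(cur);
            parent.put(cur, root);
            cur = next;
        }
        return root;
    }

    /** Records that two identifiers belong to the same customer. */
    void link(String a, String b) {
        String ra = find(a);
        String rb = find(b);
        if (!ra.equals(rb)) {
            parent.put(ra, rb);
        }
    }

    /** Number of components, i.e. the estimated number of unique customers. */
    int componentCount() {
        int roots = 0;
        for (Map.Entry<String, String> e : parent.entrySet()) {
            if (e.getKey().equals(e.getValue())) {
                roots++;
            }
        }
        return roots;
    }

    public static void main(String[] args) {
        Uniques u = new Uniques();
        u.link("cookie:a", "user:1");   // hypothetical log records
        u.link("user:1", "device:x");
        u.link("cookie:b", "user:2");
        System.out.println("uniques: " + u.componentCount());
    }
}
```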
use case: monetizing content
  • need to suggest relevant content which would
    otherwise get buried in the back catalog
  • big disconnect between inventory and limited
    performance ad market
  • enormous amounts of text, hard to categorize

  (image: Digital Humanities)

  approach:
  • text analytics glean key phrases from documents
  • hierarchical clustering of character frequencies detects language
  • latent Dirichlet allocation (LDA) reduces dimension to
    topic models
  • recommenders suggest similar topics to customers
  • collaborative filters connect known users with less known
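
The character-frequency idea can be illustrated with its first step only: build a letter-frequency profile per text and compare profiles by cosine similarity. A hierarchical clustering step would then consume those pairwise similarities, but it is not implemented here. Restricting profiles to the letters a-z and the class name are simplifying assumptions.

```java
import java.util.Locale;

/** Sketch: letter-frequency profiles compared by cosine similarity. */
class CharProfile {
    /** Relative frequency of the letters a-z in the text; other characters are ignored. */
    static double[] profile(String text) {
        double[] freq = new double[26];
        int total = 0;
        for (char c : text.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                freq[c - 'a']++;
                total++;
            }
        }
        if (total > 0) {
            for (int i = 0; i < freq.length; i++) {
                freq[i] /= total;
            }
        }
        return freq;
    }

    /** Cosine similarity of two profiles; returns 0 when either profile is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        double[] p = profile("a rain shadow is a dry area");
        double[] q = profile("una sombra de lluvia es un area seca");
        System.out.println("similarity: " + cosine(p, q));
    }
}
```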
reference

  by DJ Patil

  Data Jujitsu
  O’Reilly, 2012
  amazon.com/dp/B008HMN5BE

  Building Data Science Teams
  O’Reilly, 2011
  amazon.com/dp/B005O4U3ZE
5. tutorial: for the impatient
“Cascading for the Impatient”
  cascading.org/category/impatient/
  ‣ a series of introductory tutorials and code samples

  ‣ 1:1 code comparisons in Scalding, Cascalog, Pig, Hive

1: copy

  Source → M → Sink

  import java.util.Properties;

  import cascading.flow.FlowDef;
  import cascading.flow.hadoop.HadoopFlowConnector;
  import cascading.pipe.Pipe;
  import cascading.property.AppProps;
  import cascading.scheme.hadoop.TextDelimited;
  import cascading.tap.Tap;
  import cascading.tap.hadoop.Hfs;

  public class
    Main
    {
    public static void
    main( String[] args )
      {
      String inPath = args[ 0 ];
      String outPath = args[ 1 ];

      Properties props = new Properties();
      AppProps.setApplicationJarClass( props, Main.class );
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

      // create the source tap (tab-delimited text with a header line)
      Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

      // create the sink tap
      Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

      // specify a pipe to connect the taps
      Pipe copyPipe = new Pipe( "copy" );

      // connect the taps, pipes, etc., into a flow
      FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
       .addSource( copyPipe, inTap )
       .addTailSink( copyPipe, outTap );

      // run the flow
      flowConnector.connect( flowDef ).complete();
      }
    }

  1 mapper
  0 reducers
  10 lines code
wait!



  ten lines of code
  for a file copy…
  seems like a lot.
same JAR, any scale…

  MegaCorp Enterprise IT:
  PBs of data
  1000+ node private cluster
  EVP calls you when app fails
  runtime: days+

  Production Cluster:
  TBs of data
  EMR w/ 50 HPC Instances
  Ops monitors results
  runtime: hours – days

  Staging Cluster:
  GBs of data
  EMR + 4 Spot Instances
  CI shows red or green lights
  runtime: minutes – hours

  Your Laptop:
  MBs of data
  Hadoop standalone mode
  passes unit tests, or not
  runtime: seconds – minutes
2: word count

  Document Collection → Tokenize → GroupBy token → Count → Word Count
  (M: Tokenize · R: GroupBy token, Count)

  1 mapper
  1 reducer
  18 lines code                        gist.github.com/3900702
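
The word count dataflow (tokenize, group by token, count) can be traced in memory with plain Java before reading the Cascading version. This sketch uses no Cascading classes. It assumes the same delimiter character class as the Cascading slide, and it drops empty tokens, which is a choice made here and not something the deck specifies.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: tokenize, group by token, count, all in memory. */
class WordCountSketch {
    // Split on space, brackets, parentheses, comma, and period.
    static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    static Map<String, Long> count(Iterable<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            // "map" side: split each text line into a stream of tokens
            for (String token : line.split(DELIMITERS)) {
                if (token.isEmpty()) {
                    continue; // assumption: empty tokens are not counted
                }
                // "reduce" side: group by token and count occurrences
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("rain shadow", "rain (dry) area.")));
    }
}
```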
Cascading / Java

String docPath = args[ 0 ];
String wcPath = args[ 1 ];

Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
 .addSource( docPipe, docTap )
 .addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
Scalding / Scala

// Sujit Pal
// sujitpal.blogspot.com/2012/08/scalding-for-impatient.html

package com.mycompany.impatient

import com.twitter.scalding._

class Part2(args : Args) extends Job(args) {
  val input = Tsv(args("input"), ('docId, 'text))
  val output = Tsv(args("output"))
  input.read.
    flatMap('text -> 'word) {
       text : String => text.split("""\s+""")
    }.
    groupBy('word) { group => group.size }.
    write(output)
}
Cascalog / Clojure

; Paul Lam
; github.com/Quantisan/Impatient

(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\(\),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))
Hive

-- Steve Severance
-- stackoverflow.com/questions/10039949/word-count-program-in-hive

CREATE TABLE input (line STRING);

LOAD DATA LOCAL INPATH 'input.tsv'
OVERWRITE INTO TABLE input;

SELECT
 word, COUNT(*)
FROM input
 LATERAL VIEW explode(split(line, ' ')) lTable AS word
GROUP BY word
;
Pig

-- kudos to Dmitriy Ryaboy

docPipe = LOAD '$docPath' USING PigStorage('\t', 'tagsource')
  AS (doc_id, text);
docPipe = FILTER docPipe BY doc_id != 'doc_id';

-- specify regex to split "document" text lines into token stream
tokenPipe = FOREACH docPipe
  GENERATE doc_id, FLATTEN(TOKENIZE(text, ' [](),.')) AS token;
tokenPipe = FILTER tokenPipe BY token MATCHES '\\w.*';

-- determine the word counts
tokenGroups = GROUP tokenPipe BY token;
wcPipe = FOREACH tokenGroups
  GENERATE group AS token, COUNT(tokenPipe) AS count;

-- output
STORE wcPipe INTO '$wcPath' USING PigStorage('\t', 'tagsource');
EXPLAIN -out dot/wc_pig.dot -dot wcPipe;
3: wc + scrub

  Document Collection → Tokenize → Scrub token → GroupBy token → Count → Word Count
  (M: Tokenize, Scrub token · R: GroupBy token, Count)

  1 mapper
  1 reducer
  22+10 lines code
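
The deck does not show the body of the "Scrub token" step. A minimal sketch, assuming scrubbing means trimming whitespace, lowercasing, and dropping tokens that end up empty, could look like the following in plain Java (no Cascading classes; the names are hypothetical).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Sketch: normalize tokens before they are grouped and counted. */
class ScrubSketch {
    /** Trims surrounding whitespace and lowercases the token. */
    static String scrub(String token) {
        return token.trim().toLowerCase(Locale.ROOT);
    }

    /** Scrubs every token and drops those that are empty afterwards. */
    static List<String> scrubAll(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String clean = scrub(token);
            if (!clean.isEmpty()) {
                out.add(clean);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scrubAll(Arrays.asList(" Rain", "", "SHADOW ")));
    }
}
```

Without this step, "Rain" and "rain" would land in different groups and be counted separately.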
4: wc + scrub + stop words

  Document Collection → Tokenize → Scrub token → HashJoin Left → Regex token → GroupBy token → Count → Word Count
  Stop Word List → HashJoin (RHS)
  (M: Tokenize through Regex token · R: GroupBy token, Count)

  1 mapper
  1 reducer
  28+10 lines code
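
The "HashJoin Left" plus "Regex token" pair can be read as a left join against the stop word list followed by a filter that keeps only unmatched tokens. The sketch below mimics that in memory with a hash set standing in for the small right-hand side. It is an interpretation of the diagram, not Cascading code, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch: left join tokens against a stop word list, keep the unmatched ones. */
class StopWordSketch {
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            // Left join: every token survives the join; the right-hand value
            // is null when the stop word list has no matching entry.
            String match = stopWords.contains(token) ? token : null;
            // Filter: keep only rows whose right-hand side is empty.
            if (match == null) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("a", "the", "is"));
        System.out.println(removeStopWords(Arrays.asList("the", "rain", "is", "a", "shadow"), stop));
    }
}
```

Holding the stop word list in memory is what allows the join to stay on the map side, which matches the slide's count of one mapper and one reducer.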
5: tf-idf

  [flow diagram]
  Document Collection → Tokenize → Scrub token → HashJoin Left → Regex token
  Stop Word List → HashJoin (RHS)
  D:   Unique doc_id → Insert 1 → SumBy doc_id
  DF:  Unique token → GroupBy token → Count
  TF:  GroupBy doc_id, token → Count
  HashJoin and CoGroup combine the D, DF, and TF branches → ExprFunc tf-idf → TF-IDF
  GroupBy token → Count → Word Count

  11 mappers
   9 reducers
  65+10 lines code
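
The three branches in the diagram (D for the number of documents, DF for document frequency, TF for term frequency) feed one expression. The sketch below assumes the common form tf × ln(n_docs / (1 + df)); the deck does not show the expression itself, so the smoothing term and all names here are assumptions. Under that formula, a token that appears once in one of five documents scores ln(5/2) ≈ 0.9163, which is consistent with the values on the results slide.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: TF-IDF from tokenized documents, computed in memory. */
class TfIdfSketch {
    /** tf * ln(nDocs / (1 + df)); the "1 +" smoothing is an assumption. */
    static double tfIdf(long tf, long df, long nDocs) {
        return tf * Math.log((double) nDocs / (1.0 + df));
    }

    /** Returns docId -> token -> tf-idf score. */
    static Map<String, Map<String, Double>> score(Map<String, List<String>> docs) {
        Map<String, Map<String, Long>> tf = new TreeMap<>(); // TF branch
        Map<String, Long> df = new TreeMap<>();              // DF branch
        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            Map<String, Long> counts = new TreeMap<>();
            for (String token : doc.getValue()) {
                counts.merge(token, 1L, Long::sum);
            }
            tf.put(doc.getKey(), counts);
            // Each distinct token in a document adds one to its document frequency.
            for (String token : counts.keySet()) {
                df.merge(token, 1L, Long::sum);
            }
        }
        long nDocs = docs.size();                            // D branch

        Map<String, Map<String, Double>> scores = new TreeMap<>();
        for (Map.Entry<String, Map<String, Long>> doc : tf.entrySet()) {
            Map<String, Double> row = new TreeMap<>();
            for (Map.Entry<String, Long> e : doc.getValue().entrySet()) {
                row.put(e.getKey(), tfIdf(e.getValue(), df.get(e.getKey()), nDocs));
            }
            scores.put(doc.getKey(), row);
        }
        return scores;
    }

    public static void main(String[] args) {
        Map<String, List<String>> docs = new TreeMap<>();
        docs.put("doc01", Arrays.asList("rain", "shadow", "rain")); // hypothetical input
        docs.put("doc02", Arrays.asList("rain", "air"));
        System.out.println(score(docs));
    }
}
```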
6: tf-idf + tdd

  [flow diagram]
  Document Collection → Assert → Tokenize → Scrub token → HashJoin Left → Regex token
  Stop Word List → HashJoin (RHS)
  Failure Traps
  D:   Unique doc_id → Insert 1 → SumBy doc_id
  DF:  Unique token → GroupBy token → Count
  TF:  GroupBy doc_id, token → Count
  HashJoin → Checkpoint → CoGroup → ExprFunc tf-idf → TF-IDF
  GroupBy token → Count → Word Count

  12 mappers
   9 reducers
  76+14 lines code
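
The "Assert" and "Failure Traps" boxes suggest a pattern where records that fail a check are diverted to a side output instead of stopping the whole job. The sketch below shows that pattern in plain Java. The record format (a `doc` plus digits id, a tab, then text) is a hypothetical assertion chosen for illustration, and none of this is Cascading API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Sketch: route records that fail an assertion into a trap. */
class TrapSketch {
    // Hypothetical assertion: "doc" + digits, a tab, then some text.
    static final Pattern VALID = Pattern.compile("doc\\d+\\t.+");

    /** Returns the records that pass; failing records are appended to trap. */
    static List<String> process(List<String> lines, List<String> trap) {
        List<String> passed = new ArrayList<>();
        for (String line : lines) {
            if (VALID.matcher(line).matches()) {
                passed.add(line);
            } else {
                trap.add(line); // set aside for inspection, processing continues
            }
        }
        return passed;
    }

    public static void main(String[] args) {
        List<String> trap = new ArrayList<>();
        List<String> ok = process(
            Arrays.asList("doc01\tA rain shadow is a dry area", "zoink", "doc02\tdry air"), trap);
        System.out.println("passed: " + ok.size() + ", trapped: " + trap);
    }
}
```

Keeping bad records in a separate output makes them available for a test to inspect, which fits the "tdd" in the slide title.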
deployed on AWS…



 elastic-mapreduce --create --name "TF-IDF" \
   --jar s3n://temp.cascading.org/impatient/part6.jar \
   --arg s3n://temp.cascading.org/impatient/rain.txt \
   --arg s3n://temp.cascading.org/impatient/out/wc \
   --arg s3n://temp.cascading.org/impatient/en.stop \
   --arg s3n://temp.cascading.org/impatient/out/tfidf \
   --arg s3n://temp.cascading.org/impatient/out/trap \
   --arg s3n://temp.cascading.org/impatient/out/check




 aws.amazon.com/elasticmapreduce/
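
The six --arg values are passed to the jar in order. As a rough sketch, the same command line can be assembled from named paths. The role labels below (docs, wc, stop, tfidf, trap, check) are guesses based on the path names, not names taken from the slide, and the helper function is hypothetical.

```python
import shlex

BUCKET = "s3n://temp.cascading.org/impatient"

# Role labels are assumptions inferred from the path names on the slide.
ARGS = [
    ("docs", f"{BUCKET}/rain.txt"),
    ("wc", f"{BUCKET}/out/wc"),
    ("stop", f"{BUCKET}/en.stop"),
    ("tfidf", f"{BUCKET}/out/tfidf"),
    ("trap", f"{BUCKET}/out/trap"),
    ("check", f"{BUCKET}/out/check"),
]


def emr_command(name, jar, args):
    """Build the argv list, keeping the --arg values in positional order."""
    argv = ["elastic-mapreduce", "--create", "--name", name, "--jar", jar]
    for _role, path in args:
        argv += ["--arg", path]
    return argv


argv = emr_command("TF-IDF", f"{BUCKET}/part6.jar", ARGS)
print(shlex.join(argv))
```

Only the flags that appear on the slide are used; the order of the list is what matters, since the jar receives the paths positionally.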
results?

doc_id   text
doc01    A rain shadow is a dry area on the lee back side of a mountainous area.
doc02    This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03    A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04    This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05    Two Women. Secrets. A Broken Land. [DVD Australia]
zoink    null

doc_id   tf-idf   token
doc02    0.9163   air
doc05    0.9163   australia
doc05    0.9163   broken
doc04    0.9163   california's
doc04    0.9163   cause
doc02    0.9163   cloudcover
doc04    0.9163   death
doc04    0.9163   deserts
doc03    0.9163   downwind
…
doc02    0.9163   sinking
doc04    0.9163   such
doc04    0.9163   valley
doc05    0.9163   women
doc03    0.5108   land
doc05    0.5108   land
doc01    0.5108   lee
doc02    0.5108   lee
doc03    0.5108   leeward
doc04    0.5108   leeward
doc01    0.4463   area
doc02    0.2231   area
doc03    0.2231   area
doc01    0.2231   dry
doc02    0.2231   dry
doc03    0.2231   dry
doc02    0.2231   mountain
doc03    0.2231   mountain
doc04    0.2231   mountain
doc01    0.0000   rain
doc02    0.0000   rain
doc03    0.0000   rain
doc04    0.0000   rain
doc01    0.0000   shadow
doc02    0.0000   shadow
doc03    0.0000   shadow
doc04    0.0000   shadow

[flow diagram: Document Collection, Assert, Tokenize, Scrub token, HashJoin Left with Stop Word List (RHS), Regex token; D branch (Unique doc_id, Insert 1, SumBy doc_id); DF branch (Unique token, GroupBy token, Count); TF branch (GroupBy doc_id, token; Count); HashJoin (RHS), Checkpoint, CoGroup (RHS), ExprFunc tf-idf, TF-IDF sink; GroupBy token, Count, Word Count sink; Failure Traps]
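
The weights in the table are consistent with tf-idf = tf × ln(n_docs / (1 + df)), taking n_docs = 5 (doc01 to doc05). That formula is inferred here from the numbers, not quoted from the slide, so the sketch below is only a check under that assumption. The function and argument names are illustrative.

```python
import math


def tf_idf(tf, df, n_docs):
    """Assumed weighting: term count times ln(n_docs / (1 + document frequency))."""
    return tf * math.log(n_docs / (1.0 + df))


N_DOCS = 5  # assumption: the five docNN rows; the "zoink null" row is not counted

# (token, doc_id): (tf, df), counted by hand from the five input sentences
samples = {
    ("air", "doc02"): (1, 1),   # table: 0.9163
    ("land", "doc03"): (1, 2),  # table: 0.5108
    ("area", "doc01"): (2, 3),  # table: 0.4463 ("area" occurs twice in doc01)
    ("area", "doc02"): (1, 3),  # table: 0.2231
    ("rain", "doc01"): (1, 4),  # table: 0.0000
}

weights = {key: round(tf_idf(tf, df, N_DOCS), 4) for key, (tf, df) in samples.items()}
for (token, doc_id), weight in weights.items():
    print(doc_id, f"{weight:.4f}", token)
```

Under this weighting a token that appears in four of the five documents ("rain", "shadow") scores zero, which is why those rows sink to the bottom of the table.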
comparisons?


 compare similar code in Scalding (Scala) and Cascalog (Clojure):

 sujitpal.blogspot.com/2012/08/scalding-for-impatient.html
 based on: github.com/twitter/scalding/wiki


 github.com/Quantisan/Impatient
 based on: github.com/nathanmarz/cascalog/wiki
6. code: sample apps
Social Recommender

[diagram: Twitter, filter tweets, stop words, calculate similarity, QA, threshold (min, max), Neo4j, LDA, Redis]

github.com/Cascading/SampleRecommender
 ‣   social recommender based on Twitter: suggest users who tweet about similar stocks
 ‣   instead of a cross-product (potential bottleneck) this runs in parallel on Hadoop
 ‣   uses a stop word list to remove common words, offensive phrases, etc.
 ‣   one tap measures token frequency: for QA, adjust stop words, improve filter, etc.
 ‣   adapted in Spring by Costin Leau
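
One common way to avoid a full user-by-user cross-product is to group users by token first and only pair users inside each group. The sketch below illustrates that general idea in plain Python. It is not the SampleRecommender code: the function name, the sample users, and the use of a shared-token count as the score are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def shared_token_pairs(user_tokens):
    """Count shared tokens per user pair without comparing every pair of users."""
    # Step 1: invert the data so each token lists the users who tweeted it
    # (the role a group-by on token would play in a flow).
    users_by_token = defaultdict(set)
    for uid, tokens in user_tokens.items():
        for token in tokens:
            users_by_token[token].add(uid)

    # Step 2: emit pairs only within each token's group.
    shared = defaultdict(int)
    for uids in users_by_token.values():
        for a, b in combinations(sorted(uids), 2):
            shared[(a, b)] += 1
    return dict(shared)


# Made-up sample data, not taken from the talk.
tweets = {
    "alice": {"aapl", "goog"},
    "bob": {"aapl", "msft"},
    "carol": {"ibm"},
}
pairs = shared_token_pairs(tweets)
print(pairs)
```

Users with no token in common (carol here) never meet in any group, so no work is spent on them. Each token group can also be processed independently, which is what lets this kind of job spread across reducers.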
SocRec: architecture

[architecture diagram: source: Twitter firehose (uid, tweet, t); filter tweets; stop words (low-freq batch updates); checkpoint: tokenized tweets; calculate similarity; checkpoint: token frequency; QA (analysis + curation); checkpoint: similar users; similarity thresholds; threshold (min, max); sinks: Neo4j (social graph), LDA (topic trending), Redis (results: uid: uidx, rank)]
SocRec: results

                  uid               recommend         weight
                  carbonfiberxrm    ClosingBellNews   0.1459
                  carbonfiberxrm    DJFunkyGrrL       0.0870
                  ClosingBellNews   DJFunkyGrrL       0.1491
                  CloudStocks       DJFunkyGrrL       0.1206
                  ElmoreNicole      DJFunkyGrrL       0.1798
                  EsNeey            alexiolo_         0.8603
                  ...
City of Palo Alto open data
[flow diagram: CoPA GIS export, Regex parser, tsv, Checkpoint, Failure Traps; tree branch (Regex filter, Regex parser, Scrub species, HashJoin Left with Tree Metadata (RHS), Geohash); road branch (Regex filter, Regex parser, HashJoin Left with Road Metadata (RHS), Estimate Albedo, Road Segments, Geohash); park branch (Regex filter); CoGroup, Tree Distance, Filter tree_dist, GroupBy tree_name, Checkpoint shade; GPS logs, Geohash, CoGroup, reco]

github.com/Cascading/CoPA/wiki
        ‣    GIS export for parks, roads, trees (unstructured / open data)
        ‣    log files of personalized/frequented locations in Palo Alto via iPhone GPS tracks
        ‣    curated metadata, used to enrich the dataset
        ‣    could extend via mash-up with many available public data APIs

Enterprise-scale app: road albedo + tree species metadata + geospatial indexing
“Find a shady spot on a summer day to walk near downtown and take a call…”
CoPA: log events
CoPA: results
[plot: Estimated Tree Height (meters); density (0.00 to 0.12) vs. avg_height (0 to 50), shaded by count (0 to 300)]

 ‣   addr: 115 HAWTHORNE AVE
 ‣   lat/lng: 37.446, -122.168
 ‣   geohash: 9q9jh0
 ‣   tree: 413 site 2
 ‣   species: Liquidambar styraciflua
 ‣   avg height: 23 m
 ‣   road albedo: 0.12
 ‣   distance: 10 m
 ‣   a short walk from my train stop ✔
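
To show what a six-character geohash such as 9q9jh0 covers, here is a minimal decoder for the standard geohash scheme (base-32 characters, bits alternating longitude then latitude). It is a generic sketch, not the CoPA app's own geohash code, and the function name is an assumption.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_bounds(code):
    """Return ((lat_min, lat_max), (lon_min, lon_max)) for a geohash cell."""
    lat = [-90.0, 90.0]
    lon = [-180.0, 180.0]
    is_lon = True  # bits alternate, starting with longitude
    for ch in code:
        idx = BASE32.index(ch)
        for shift in range(4, -1, -1):  # five bits per character, high bit first
            rng = lon if is_lon else lat
            mid = (rng[0] + rng[1]) / 2
            if (idx >> shift) & 1:
                rng[0] = mid  # 1 keeps the upper half
            else:
                rng[1] = mid  # 0 keeps the lower half
            is_lon = not is_lon
    return (lat[0], lat[1]), (lon[0], lon[1])


(lat_min, lat_max), (lon_min, lon_max) = geohash_bounds("9q9jh0")
print(lat_min, lat_max)
print(lon_min, lon_max)
```

The cell is about 0.0055° of latitude by 0.011° of longitude. The latitude 37.446 falls inside it, and the longitude -122.168 sits at the cell's western edge once rounded to three decimals. Rows that share a geohash can then be joined on that key instead of comparing raw coordinates.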
drill-down


  blog, code/wiki/gists, jars, list, DevOps products:
  cascading.org/
  github.com/Cascading/
  conjars.org/
  goo.gl/KQtUL
  concurrentinc.com/
                                      pnathan@concurrentinc.com
                                      @pacoid

Hoja informativa CGT SEAT160920Hoja informativa CGT SEAT160920
Hoja informativa CGT SEAT160920
 
Capítulo 9 libro zacatollan una hist... copia
Capítulo 9 libro zacatollan una hist...   copiaCapítulo 9 libro zacatollan una hist...   copia
Capítulo 9 libro zacatollan una hist... copia
 
Manual primeros viernes
Manual primeros viernesManual primeros viernes
Manual primeros viernes
 
Indicaciones para la pau 2014 2015
Indicaciones para la pau 2014 2015Indicaciones para la pau 2014 2015
Indicaciones para la pau 2014 2015
 
Wells cathedral2
Wells cathedral2Wells cathedral2
Wells cathedral2
 
Toeic report on business english
Toeic report on business englishToeic report on business english
Toeic report on business english
 
1 1 presentacion
1 1  presentacion1 1  presentacion
1 1 presentacion
 
INFORMÁTICA - HALLOWEEN
INFORMÁTICA - HALLOWEEN INFORMÁTICA - HALLOWEEN
INFORMÁTICA - HALLOWEEN
 
Emprendimiento
EmprendimientoEmprendimiento
Emprendimiento
 
Historia de la salud
Historia de la saludHistoria de la salud
Historia de la salud
 
Proyecto empren
Proyecto emprenProyecto empren
Proyecto empren
 
Taller de negociacion
Taller de negociacionTaller de negociacion
Taller de negociacion
 
Higiene
HigieneHigiene
Higiene
 

Similar a Cascading API: An Introduction to Data Workflows

A Data Scientist And A Log File Walk Into A Bar...
A Data Scientist And A Log File Walk Into A Bar...A Data Scientist And A Log File Walk Into A Bar...
A Data Scientist And A Log File Walk Into A Bar...Paco Nathan
 
Enterprise Data Workflows with Cascading
Enterprise Data Workflows with CascadingEnterprise Data Workflows with Cascading
Enterprise Data Workflows with CascadingPaco Nathan
 
Cascading for the Impatient
Cascading for the ImpatientCascading for the Impatient
Cascading for the ImpatientPaco Nathan
 
Chicago Hadoop Users Group: Enterprise Data Workflows
Chicago Hadoop Users Group: Enterprise Data WorkflowsChicago Hadoop Users Group: Enterprise Data Workflows
Chicago Hadoop Users Group: Enterprise Data WorkflowsPaco Nathan
 
Intro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataIntro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataPaco Nathan
 
Pattern: an open source project for migrating predictive models onto Apache H...
Pattern: an open source project for migrating predictive models onto Apache H...Pattern: an open source project for migrating predictive models onto Apache H...
Pattern: an open source project for migrating predictive models onto Apache H...Paco Nathan
 
Cascading meetup #4 @ BlueKai
Cascading meetup #4 @ BlueKaiCascading meetup #4 @ BlueKai
Cascading meetup #4 @ BlueKaiPaco Nathan
 
Cascading: Enterprise Data Workflows based on Functional Programming
Cascading: Enterprise Data Workflows based on Functional ProgrammingCascading: Enterprise Data Workflows based on Functional Programming
Cascading: Enterprise Data Workflows based on Functional ProgrammingPaco Nathan
 
Using Cascalog to build
 an app based on City of Palo Alto Open Data
Using Cascalog to build
 an app based on City of Palo Alto Open DataUsing Cascalog to build
 an app based on City of Palo Alto Open Data
Using Cascalog to build
 an app based on City of Palo Alto Open DataPaco Nathan
 
Web standards, why care?
Web standards, why care?Web standards, why care?
Web standards, why care?Thomas Roessler
 
Functional programming
 for optimization problems 
in Big Data
Functional programming
  for optimization problems 
in Big DataFunctional programming
  for optimization problems 
in Big Data
Functional programming
 for optimization problems 
in Big DataPaco Nathan
 
Keyword Services Platform (KSP) from Microsoft adCenter
Keyword Services Platform (KSP) from Microsoft adCenterKeyword Services Platform (KSP) from Microsoft adCenter
Keyword Services Platform (KSP) from Microsoft adCentergoodfriday
 
PDX Hadoop: Enterprise Data Workflows with Cascading and Mesos
PDX Hadoop: Enterprise Data Workflows with Cascading and MesosPDX Hadoop: Enterprise Data Workflows with Cascading and Mesos
PDX Hadoop: Enterprise Data Workflows with Cascading and MesosPaco Nathan
 
Testing Rich Domain Models
Testing Rich Domain ModelsTesting Rich Domain Models
Testing Rich Domain ModelsChris Richardson
 
Top 10 Things We Learned Implementing OpenStack
Top 10 Things We Learned Implementing OpenStackTop 10 Things We Learned Implementing OpenStack
Top 10 Things We Learned Implementing OpenStackOpenStack Foundation
 
Bercovici top 10 things net app learned 0416133
Bercovici top 10 things net app learned 0416133Bercovici top 10 things net app learned 0416133
Bercovici top 10 things net app learned 0416133OpenStack Foundation
 
FinCap Solutions Brochure
FinCap  Solutions BrochureFinCap  Solutions Brochure
FinCap Solutions BrochureCFPuser
 
Boulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
Boulder/Denver BigData: Cluster Computing with Apache Mesos and CascadingBoulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
Boulder/Denver BigData: Cluster Computing with Apache Mesos and CascadingPaco Nathan
 

Similar a Cascading API: An Introduction to Data Workflows (20)

A Data Scientist And A Log File Walk Into A Bar...
A Data Scientist And A Log File Walk Into A Bar...A Data Scientist And A Log File Walk Into A Bar...
A Data Scientist And A Log File Walk Into A Bar...
 
Enterprise Data Workflows with Cascading
Enterprise Data Workflows with CascadingEnterprise Data Workflows with Cascading
Enterprise Data Workflows with Cascading
 
Cascading for the Impatient
Cascading for the ImpatientCascading for the Impatient
Cascading for the Impatient
 
Chicago Hadoop Users Group: Enterprise Data Workflows
Chicago Hadoop Users Group: Enterprise Data WorkflowsChicago Hadoop Users Group: Enterprise Data Workflows
Chicago Hadoop Users Group: Enterprise Data Workflows
 
Intro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataIntro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big Data
 
Pattern: an open source project for migrating predictive models onto Apache H...
Pattern: an open source project for migrating predictive models onto Apache H...Pattern: an open source project for migrating predictive models onto Apache H...
Pattern: an open source project for migrating predictive models onto Apache H...
 
Cascading meetup #4 @ BlueKai
Cascading meetup #4 @ BlueKaiCascading meetup #4 @ BlueKai
Cascading meetup #4 @ BlueKai
 
Cascading: Enterprise Data Workflows based on Functional Programming
Cascading: Enterprise Data Workflows based on Functional ProgrammingCascading: Enterprise Data Workflows based on Functional Programming
Cascading: Enterprise Data Workflows based on Functional Programming
 
Using Cascalog to build
 an app based on City of Palo Alto Open Data
Using Cascalog to build
 an app based on City of Palo Alto Open DataUsing Cascalog to build
 an app based on City of Palo Alto Open Data
Using Cascalog to build
 an app based on City of Palo Alto Open Data
 
Web standards, why care?
Web standards, why care?Web standards, why care?
Web standards, why care?
 
Functional programming
 for optimization problems 
in Big Data
Functional programming
  for optimization problems 
in Big DataFunctional programming
  for optimization problems 
in Big Data
Functional programming
 for optimization problems 
in Big Data
 
Keyword Services Platform (KSP) from Microsoft adCenter
Keyword Services Platform (KSP) from Microsoft adCenterKeyword Services Platform (KSP) from Microsoft adCenter
Keyword Services Platform (KSP) from Microsoft adCenter
 
Using R with Hadoop
Using R with HadoopUsing R with Hadoop
Using R with Hadoop
 
PDX Hadoop: Enterprise Data Workflows with Cascading and Mesos
PDX Hadoop: Enterprise Data Workflows with Cascading and MesosPDX Hadoop: Enterprise Data Workflows with Cascading and Mesos
PDX Hadoop: Enterprise Data Workflows with Cascading and Mesos
 
Testing Rich Domain Models
Testing Rich Domain ModelsTesting Rich Domain Models
Testing Rich Domain Models
 
Top 10 Things We Learned Implementing OpenStack
Top 10 Things We Learned Implementing OpenStackTop 10 Things We Learned Implementing OpenStack
Top 10 Things We Learned Implementing OpenStack
 
Bercovici top 10 things net app learned 0416133
Bercovici top 10 things net app learned 0416133Bercovici top 10 things net app learned 0416133
Bercovici top 10 things net app learned 0416133
 
FinCap Solutions Brochure
FinCap  Solutions BrochureFinCap  Solutions Brochure
FinCap Solutions Brochure
 
Ruby at UW C4C
Ruby at UW C4CRuby at UW C4C
Ruby at UW C4C
 
Boulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
Boulder/Denver BigData: Cluster Computing with Apache Mesos and CascadingBoulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
Boulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
 

Más de Paco Nathan

Human in the loop: a design pattern for managing teams working with ML
Human in the loop: a design pattern for managing  teams working with MLHuman in the loop: a design pattern for managing  teams working with ML
Human in the loop: a design pattern for managing teams working with MLPaco Nathan
 
Human-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLHuman-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLPaco Nathan
 
Human-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLHuman-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLPaco Nathan
 
Humans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIHumans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIPaco Nathan
 
Humans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryHumans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryPaco Nathan
 
Computable Content
Computable ContentComputable Content
Computable ContentPaco Nathan
 
Computable Content: Lessons Learned
Computable Content: Lessons LearnedComputable Content: Lessons Learned
Computable Content: Lessons LearnedPaco Nathan
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonPaco Nathan
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsPaco Nathan
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving UpPaco Nathan
 
Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?Paco Nathan
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusPaco Nathan
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataPaco Nathan
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learningPaco Nathan
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in SparkPaco Nathan
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataPaco Nathan
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingPaco Nathan
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MorePaco Nathan
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedPaco Nathan
 

Más de Paco Nathan (20)

Human in the loop: a design pattern for managing teams working with ML
Human in the loop: a design pattern for managing  teams working with MLHuman in the loop: a design pattern for managing  teams working with ML
Human in the loop: a design pattern for managing teams working with ML
 
Human-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLHuman-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage ML
 
Human-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLHuman-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage ML
 
Humans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIHumans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AI
 
Humans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryHumans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industry
 
Computable Content
Computable ContentComputable Content
Computable Content
 
Computable Content: Lessons Learned
Computable Content: Lessons LearnedComputable Content: Lessons Learned
Computable Content: Lessons Learned
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in Python
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analytics
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving Up
 
Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and Erasmus
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About Data
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark Streaming
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML Unpaused
 

Último

A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 

Último (20)

A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 

Cascading API: An Introduction to Data Workflows

  • 1. Intro to Cascading. Paco Nathan, Concurrent, Inc. pnathan@concurrentinc.com, @pacoid. Copyright @2012, Concurrent, Inc. [Figure: word-count workflow diagram with M and R markers: Document Collection, Tokenize, Scrub token, HashJoin Left (RHS: Stop Word List), Regex token, GroupBy token, Count, Word Count]
  • 2. Enterprise Apps for Big Data with Cascading 1. intro: Cascading API 2. backstory: Big Data origins 3. context: Hadoop cliff notes 4. theory: Data Science teams 5. tutorial: for the impatient 6. code: sample apps
  • 3. Intro to Cascading: 1. intro: Cascading API [Figure: word-count workflow diagram]
  • 4. Cascading API: purpose ‣ simplify data processing development and deployment ‣ improve application developer productivity ‣ enable data processing application manageability
  • 5. Cascading API: a few facts ‣ Java open source project (ASL 2) using Git, Gradle, Maven, JUnit, etc. ‣ in production (~5 yrs) at hundreds of enterprise Hadoop deployments: Finance, Health Care, Transportation, other verticals ‣ studies published about large use cases: Twitter, Etsy, Airbnb, Square, Climate Corporation, FlightCaster, Williams-Sonoma ‣ partnerships and distribution with SpringSource, Amazon AWS, Microsoft Azure, Hortonworks, MapR, EMC ‣ several open source projects built atop, contribs by Twitter, Etsy, etc., which provide substantial Machine Learning libraries ‣ DSLs available in Scala, Clojure, Python (Jython), Ruby (JRuby), Groovy ‣ data “taps” integrate popular data frameworks via JDBC, Memcached, HBase, plus serialization in Apache Thrift, Avro, Kryo, etc. ‣ entire app compiles into a single JAR: fully connected for compiler optimization, exception handling, debugging, config, scheduling, etc.
  • 6. Cascading API: a few quotes ‣ “Cascading gives Java developers the ability to build Big Data applications on Hadoop using their existing skillset … Management can really go out and build a team around folks that are already very experienced with Java. Switching over to this is really a very short exercise.” (CIO, Thor Olavsrud, 2012-06-06, cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading) ‣ “Masks the complexity of MapReduce, simplifies the programming, and speeds you on your journey toward actionable analytics … A vast improvement over native MapReduce functions or Pig UDFs.” (2012 BOSSIE Awards, James Borck, 2012-09-18, infoworld.com/slideshow/65089) ‣ “Company’s promise to application developers is an opportunity to build and test applications on their desktops in the language of choice with familiar constructs and reusable components” (Dr. Dobb’s, Adrian Bridgwater, 2012-06-08, drdobbs.com/jvm/where-does-big-data-go-to-get-data-inten/240001759)
  • 7. data+code “political spectrum” (“Notes from the Mystery Machine Bus” by Steve Yegge, Google, goo.gl/SeRZa) ‣ “conservative” (mostly Enterprise): risk management; assurance; well-defined schema; explicit configuration; type-checking compiler; wants no surprises; Java, Scala, Clojure, etc.; Cascading, Scalding, Cascalog, etc. ‣ “liberal” (mostly Start-Up): customer experiments; flexibility; schema follows code; convention; interpreted scripts; wants no impediments; PHP, Ruby, Python, etc.; Hive, Pig, Hadoop Streaming, etc.
  • 8. Cascading API: adoption ‣ as Enterprise apps move into Hadoop and related Big Data frameworks, risk profiles shift toward more conservative programming practices ‣ Cascading provides a popular API for defining and managing Enterprise data workflows
  • 9. enterprise data workflows: Tuples, Pipelines, Endpoints, Operations, Joins, Assertions, Traps, etc. …in other words, “plumbing” [diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (RHS: Stop Word List) → Regex token → GroupBy token → Count → Word Count]
  • 10. data workflows: team ‣ Business Stakeholder POV: business process management for workflow orchestration (think BPM/BPEL) ‣ Systems Integrator POV: system integration of heterogeneous data sources and compute platforms ‣ Data Scientist POV: a directed, acyclic graph (DAG) on which we can apply Amdahl's Law, etc. ‣ Data Architect POV: a physical plan for large-scale data flow management ‣ Software Architect POV: a pattern language, similar to plumbing or circuit design ‣ App Developer POV: API bindings for Java, Scala, Clojure, Jython, JRuby, etc. ‣ Systems Engineer POV: a JAR file, has passed CI, available in a Maven repo
  • 11. data workflows: layers ‣ business process: domain expertise, business trade-offs, operating parameters, market position, etc. ‣ API language: Java, Scala, Clojure, Jython, JRuby, Groovy, etc. …envision whatever runs in a JVM ‣ optimize / schedule: major changes in technology now ‣ physical plan: “assembler” code [workflow diagram] ‣ compute substrate: Apache Hadoop, in-memory local mode …envision GPUs, streaming, etc. ‣ machine data: Splunk, Nagios, Collectd, New Relic, etc.
  • 12. data workflows: SQL ‣ Relational: SQL parser; logical plan, optimized based on stats; physical plan; query history, table stats; b-trees, etc.; ERD; table schema; catalog
  • 13. data workflows: SQL vs. JVM (Relational → Cascading + Driven) ‣ SQL parser → SQL-92 compliant parser (in progress) ‣ logical plan, optimized based on stats → TODO: logical plan, optimized based on stats ‣ physical plan → API “plumbing” ‣ query history, table stats → app history, tuple stats ‣ b-trees, etc. → distributed compute substrate: Hadoop, in-memory, etc. ‣ ERD → flow diagram ‣ table schema → tuple schema ‣ catalog → endpoint usage DB
  • 14. Intro to Cascading: 2. backstory: Big Data origins
  • 15. inflection point [timeline: 1997, 1998, 2004] ‣ huge Internet successes after the 1997 holiday season: AMZN, EBAY, Inktomi (YHOO Search), then GOOG ‣ consider this metric: annual revenue per customer / amount of data stored, which dropped 100x within a few years after 1997 ‣ storage and processing costs plummeted, now we must work much smarter to extract ROI from Big Data… our methods must adapt ‣ “conventional wisdom” of RDBMS and BI tools became less viable; however, the business cadre was still focused on pivot tables and pie charts… which tends toward inertia! ‣ MapReduce and the Hadoop open source stack grew directly out of that contention… however, that effort only solves parts of the puzzle
  • 16. inflection point: consequences ‣ Geoffrey Moore (Mohr Davidow Ventures, author of Crossing The Chasm), Hadoop Summit, 2012: “All of Fortune 500 is now on notice over the next 10-year period.” ‣ Amazon and Google as exemplars of massive disruption in retail, advertising, etc. ‣ data as the major force displacing Global 1000 over the next decade, mostly through apps: verticals, leveraging domain expertise ‣ Michael Stonebraker (INGRES, PostgreSQL, Vertica, VoltDB, etc.), XLDB, 2012: “Complex analytics workloads are now displacing SQL as the basis for Enterprise apps.”
  • 17. primary sources ‣ Amazon: “Early Amazon: Splitting the website” – Greg Linden, glinden.blogspot.com/2006/02/early-amazon-splitting-website.html ‣ eBay: “The eBay Architecture” – Randy Shoup, Dan Pritchett, addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf ‣ Inktomi (YHOO Search): “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff), youtube.com/watch?v=E91oEn1bnXM ‣ Google: “The Birth of Google” – John Battelle, wired.com/wired/archive/13.08/battelle.html; “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff), youtube.com/watch?v=qsan-GQaeyk perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
  • 18. the world before… BI, SQL, and highly optimized code
  • 19. data innovation: circa 1996 Stakeholder Customers Excel pivot tables PowerPoint slide decks strategy BI Product Analysts requirements SQL Query optimized Engineering code Web App result sets transactions RDBMS
  • 20. the world after… machine learning, leveraging log files
  • 21. data innovation: circa 2001 Stakeholder Product Customers dashboards UX Engineering models servlets recommenders Algorithmic + Web Apps Modeling classifiers Middleware aggregation event SQL Query history result sets customer transactions Logs DW ETL RDBMS
  • 22. the world ahead… what our customers are doing now
  • 23. data innovation: circa 2013 Customers Data Apps business Domain process Workflow Prod Expert dashboard Web Apps, metrics History services Mobile, data etc. s/w science dev Data Planner Scientist social discovery optimized interactions + capacity transactions, Eng endpoints modeling content App Dev Data Access Patterns Hadoop, Log In-Memory etc. Events Data Grid Ops DW Ops batch "real time" Cluster Scheduler introduced existing capability SDLC RDBMS RDBMS
  • 25. statistical thinking [figure labels: Process, Variation, Data, Tools] ‣ employing a mode of thought which includes both logical and analytical reasoning: evaluating the whole of a problem, as well as its component parts; attempting to assess the effects of changing one or more variables ‣ this approach attempts to understand not just problems and solutions, but also the processes involved and their variances ‣ particularly valuable in Big Data work when combined with hands-on experience in physics: roughly 50% of my peers come from physics or physical engineering… ‣ programmers typically don’t think this way… however, both systems engineers and data scientists must!
  • 26. reference: Leo Breiman, “Statistical Modeling: The Two Cultures”, Statistical Science, 2001, bit.ly/eUTh9L ‣ also check out RStudio: rstudio.org/ rpubs.com/
  • 27. Intro to Cascading: 3. context: Hadoop cliff notes
  • 28. MapReduce architecture ‣ name node + data nodes ‣ job tracker + task trackers ‣ submit queue ‣ task slots ‣ HDFS ‣ distributed cache (images: Wikipedia, Apache)
  • 29. MapReduce: how it works ‣ map(k1, v1) → list(k2, v2) ‣ reduce(k2, list(v2)) → list(k3, v3) ‣ the property of data independence among tasks allows for parallel processing … maybe, if the stars are all aligned :) ‣ MapReduce is mostly about fault tolerance, and how to leverage “commodity hardware” to replace “big iron” solutions… where “big iron” might apply to Oracle + NetApp, or perhaps an IBM zSeries mainframe… or something else that’s expensive, undoubtedly ‣ bonus for math geeks: see any concerns about O(n) complexity, given Amdahl’s Law plus the functional definitions listed above? keep in mind that each phase cannot conclude and progress to the next phase until after each of its tasks has successfully completed
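The two signatures on this slide can be illustrated with a minimal single-JVM sketch. It is an assumption-laden illustration, not Hadoop's API: `MiniMapReduce`, `map`, `reduce`, and `run` are hypothetical names, the "shuffle" is just an in-memory grouping, and word counting is chosen only because the tutorial later uses it.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of the map -> shuffle -> reduce contract; no Hadoop classes involved.
class MiniMapReduce {

    // map(k1, v1) -> list(k2, v2): emit (word, 1) for each word in one document
    static List<Map.Entry<String, Integer>> map(String docId, String text) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<String, Integer>(word, 1));
            }
        }
        return out;
    }

    // reduce(k2, list(v2)) -> list(k3, v3): sum the partial counts for one word
    static List<Map.Entry<String, Integer>> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        out.add(new SimpleEntry<String, Integer>(word, sum));
        return out;
    }

    static Map<String, Integer> run(Map<String, String> docs) {
        // map phase, followed by a "shuffle" that groups emitted values by key
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (Map.Entry<String, Integer> kv : map(doc.getKey(), doc.getValue())) {
                shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        // reduce phase: starts only after every map call above has completed
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffled.entrySet()) {
            for (Map.Entry<String, Integer> kv : reduce(group.getKey(), group.getValue())) {
                result.put(kv.getKey(), kv.getValue());
            }
        }
        return result;
    }
}
```

Each `map` call touches only its own document, which is the data independence the slide mentions; the barrier between the two loops in `run` is the serial portion that the Amdahl's Law question points at.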
  • 30. a brief history… ‣ circa 1979 – Stanford, MIT, CMU, etc.: set/list operations in LISP, Prolog, etc., for parallel processing, www-formal.stanford.edu/jmc/history/lisp/lisp.htm ‣ circa 2004 – Google: “MapReduce: Simplified Data Processing on Large Clusters”, Jeffrey Dean and Sanjay Ghemawat, labs.google.com/papers/mapreduce.html ‣ circa 2006 – Apache: Hadoop, originating from the Nutch Project, Doug Cutting, research.yahoo.com/files/cutting.pdf ‣ circa 2008 – Yahoo: web scale search indexing, Hadoop Summit, HUG, etc., developer.yahoo.com/hadoop/ ‣ circa 2009 – Amazon AWS: Elastic MapReduce, Hadoop modified for EC2/S3, plus support for Hive, Pig, Cascading, etc., aws.amazon.com/elasticmapreduce/
  • 31. CAP theorem purpose: theoretical limits for data access patterns essence: ‣ consistency ‣ availability ‣ partition tolerance best case scenario: you may pick two … or spend billions struggling to obtain all three at scale (GOOG) translated: cost of doing business www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf julianbrowne.com/article/viewer/brewers-cap-theorem
  • 32. data access patterns because the world is not made of data warehouses… a handful of common data access patterns are prevalent learn to recognize these for any given problem typically expressed in terms of trade-offs: ‣ speed & volume (latency and throughput) ‣ reads & writes (access and storage) ‣ consistency / availability / partition tolerance
  • 33. access → frameworks → forfeits ‣ financial transactions → general ledger in RDBMS → CAx ‣ ad-hoc queries → RDS (hosted MySQL) → CAx ‣ reporting, dashboards → like Pentaho → CAx ‣ log rotation/persistence → like Riak → xxP ‣ search indexes → like Lucene/Solr → xAP ‣ static content, archives → S3 (durable storage) → xAP ‣ customer facts → like Redis, Membase → xAP ‣ distributed counters, locks, sets → like Redis → xAP* ‣ data objects CRUD → key/value – like, NoSQL on MySQL → CxP ‣ authoritative metadata → like Zookeeper → CxP ‣ data prep, modeling at scale → like Hadoop/Cascading + R → CxP ‣ graph analysis → like Hadoop + Redis + Gephi → CxP ‣ data marts → like Hadoop/HBase → CxP
  • 35. parallel computation ‣ parallelism allows for horizontal scale-out, which creates business “levers” in cost/performance at scale ‣ NB: MapReduce provides a compute framework which is part-parallel and part-serial… which tends to complicate app development ‣ most hard problems in industry have portions which do not allow data independence, or which require iteration ‣ current efforts in massively parallel algorithms research may help to parallelize problems and reduce iteration; estimates are 3-5 years out for industry use ‣ GPUs and other hardware architecture advancements will likely make Hadoop unrecognizable 3-5 years out
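The "part-parallel and part-serial" remark is what Amdahl's Law quantifies: if a fraction p of the work parallelizes across n workers, the speedup is 1 / ((1 − p) + p / n). A minimal sketch follows; the class and method names are hypothetical, and the 90% figure in the usage note is an illustrative assumption rather than a measurement from the talk.

```java
// Hypothetical helper illustrating Amdahl's Law for a part-serial, part-parallel job.
class Amdahl {

    // p = fraction of the work that can run in parallel (0..1), n = number of workers
    static double speedup(double p, int n) {
        if (p < 0.0 || p > 1.0 || n < 1) {
            throw new IllegalArgumentException("need 0 <= p <= 1 and n >= 1");
        }
        return 1.0 / ((1.0 - p) + p / n);
    }

    // upper bound on speedup as n grows without limit (requires p < 1)
    static double limit(double p) {
        if (p < 0.0 || p >= 1.0) {
            throw new IllegalArgumentException("need 0 <= p < 1");
        }
        return 1.0 / (1.0 - p);
    }
}
```

For example, under the assumption that 90% of a job parallelizes, `Amdahl.speedup(0.9, 10)` is roughly 5.3 and `Amdahl.limit(0.9)` is 10, so adding nodes beyond a certain point buys very little.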
  • 36. reference: Tom White, Hadoop: The Definitive Guide, O’Reilly, 2009, amazon.com/dp/1449311520 ‣ see also: Cluster Computing and MapReduce Lectures, code.google.com/edu/submissions/mapreduce-minilecture/listing.html
  • 37. Intro to Cascading: 4. theory: Data Science teams
  • 38. core values ‣ Data Science teams develop actionable insights, building confidence for decisions ‣ that work may influence a few decisions worth billions (e.g., M&A) or billions of small decisions (e.g., AdWords) ‣ probably somewhere in-between… solving for pattern, at scale ‣ an interdisciplinary pursuit which requires teams, not sole players
  • 39. most valuable skills approximately 80% of the costs for data-related projects get spent on data preparation – mostly on cleaning up data quality issues: ETL, log file analysis, etc. unfortunately, data-related budgets for many companies tend to go into frameworks which can only be used after clean up most valuable skills: ‣ learn to use programmable tools that prepare data ‣ learn to generate compelling data visualizations ‣ learn to estimate the confidence for reported results ‣ learn to automate work, making analysis repeatable the rest of the skills – modeling, D3 algorithms, etc. – those are secondary
  • 40. the science in data science? in a nutshell, what we do… ‣ estimate probability ‣ calculate analytic variance ‣ manipulate order complexity ‣ make use of learning theory + collab with DevOps, Stakeholders + reduce our work to cron entries
  • 41. team process = needs ‣ discovery: help people ask the right questions ‣ modeling: allow automation to place informed bets ‣ integration: deliver products at scale to customers ‣ apps: build smarts into product features ‣ systems: keep infrastructure running, cost-effective (image: Gephi)
  • 42. team composition = roles ‣ Domain Expert: business process, stakeholder ‣ Data Scientist: data science: data prep, discovery, modeling, etc. ‣ App Dev: software engineering, automation ‣ Ops: systems engineering, access (legend: introduced capability)
  • 43. matrix = needs × roles ‣ columns (needs): discovery, modeling, integration, apps, systems ‣ rows (roles): stakeholder, scientist, developer, ops
  • 44. matrix: example team ‣ columns (needs): discovery, modeling, integration, apps, systems ‣ rows (roles): stakeholder, scientist, developer, ops ‣ summary: this team seems heavy on systems, may need more overlap between modeling and integration, particularly among team leads
  • 45. typical hand-offs integrity availability discovery communications people vendor data sources Query data Query Hosts query BI & dashboards warehouse Hosts hosts reporting production cluster presentations decision support classifiers predictive analyze, customer analytics visualize business interactions recommenders stakeholders internal API, crons, etc. modeling engineers, automation analysts
  • 46. use case: marketing funnel (image: Wikipedia) • must optimize a very large ad spend • different vendors report different metrics • seasonal variation distorts performance • some campaigns are much smaller than others • hard to predict ROI for incremental spend approach: • log aggregation, followed with cohort analysis • Bayesian point estimates compare different-sized ad tests • customer lifetime value quantifies ROI of new leads • time series analysis normalizes for seasonal variation • geolocation adjusts for regional cost/benefit • linear programming models estimate elasticity of demand
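One common way to compare ad tests of very different sizes with a Bayesian point estimate is the posterior mean of a Beta-Binomial model, which shrinks small samples toward a prior. The sketch below assumes that model; the slide does not say which estimator was used, and the class name, prior parameters, and sample counts are all hypothetical.

```java
// Hypothetical sketch: posterior mean of a conversion rate under a Beta(alpha, beta) prior.
class AdTestEstimate {

    // conversions out of impressions, combined with prior pseudo-counts alpha (successes) and beta (failures)
    static double posteriorMean(int conversions, int impressions, double alpha, double beta) {
        if (conversions < 0 || impressions < conversions || alpha <= 0.0 || beta <= 0.0) {
            throw new IllegalArgumentException("need 0 <= conversions <= impressions and positive prior");
        }
        return (alpha + conversions) / (alpha + beta + impressions);
    }
}
```

With an assumed prior of `alpha = 1, beta = 9` (a 10% baseline worth ten observations), a small test with 2 conversions in 10 impressions has a raw rate of 0.20 but a posterior mean of 3/20 = 0.15, while a large test with 150 in 1000 moves only from 0.150 to 151/1010; the two campaigns become comparable instead of the small one looking artificially strong.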
  • 47. use case: ecommerce fraud (image: stat.berkeley.edu) • sparse data means lots of missing values • “needle in a haystack” lack of training cases • answers are available in large-scale batch, results are needed in real-time event processing • not just one pattern to detect – many, ever-changing approach: • random forest (RF) classifiers predict likely fraud • subsampled data to re-balance training sets • impute missing values based on density functions • train on massive log files, run on in-memory grid • adjust metrics to minimize customer support costs • detect novelty – report anomalies via notifications
  • 48. use case: customer segmentation (image: Mathworks) • many millions of customers, hard to determine which features resonate • multi-modal distributions get obscured by the practice of calculating an “average” • not much is known about individual customers approach: • connected components for sessionization, determining uniques from logs • estimates for age, gender, income, geo, etc. • clustering algorithms to group into market segments • social graph infers “unknown” relationships • covariance/heat maps visualize segments vs. feature sets
  • 49. use case: monetizing content (image: Digital Humanities) • need to suggest relevant content which would otherwise get buried in the back catalog • big disconnect between inventory and limited performance ad market • enormous amounts of text, hard to categorize approach: • text analytics glean key phrases from documents • hierarchical clustering of char frequencies detects lang • latent Dirichlet allocation (LDA) reduces dimension to topic models • recommenders suggest similar topics to customers • collaborative filters connect known users with less known
  • 50. reference: DJ Patil, Data Jujitsu, O’Reilly, 2012, amazon.com/dp/B008HMN5BE ‣ Building Data Science Teams, O’Reilly, 2011, amazon.com/dp/B005O4U3ZE
  • 51. Intro to Cascading: 5. tutorial: for the impatient
  • 52. “Cascading for the Impatient” cascading.org/category/impatient/ ‣ a series of introductory tutorials and code samples ‣ 1:1 code comparisons in Scalding, Cascalog, Pig, Hive
  • 53. 1: copy (Source → Sink; 1 mapper, 0 reducers, 10 lines code)
      import java.util.Properties;

      import cascading.flow.FlowDef;
      import cascading.flow.hadoop.HadoopFlowConnector;
      import cascading.pipe.Pipe;
      import cascading.property.AppProps;
      import cascading.scheme.hadoop.TextDelimited;
      import cascading.tap.Tap;
      import cascading.tap.hadoop.Hfs;

      public class Main
        {
        public static void main( String[] args )
          {
          String inPath = args[ 0 ];
          String outPath = args[ 1 ];

          Properties props = new Properties();
          AppProps.setApplicationJarClass( props, Main.class );
          HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

          // create the source tap (tab-delimited text with a header line)
          Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

          // create the sink tap
          Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

          // specify a pipe to connect the taps
          Pipe copyPipe = new Pipe( "copy" );

          // connect the taps, pipes, etc., into a flow
          FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
           .addSource( copyPipe, inTap )
           .addTailSink( copyPipe, outTap );

          // run the flow
          flowConnector.connect( flowDef ).complete();
          }
        }
  • 54. wait! ten lines of code for a file copy… seems like a lot.
  • 55. same JAR, any scale… ‣ MegaCorp Enterprise IT: PBs of data; 1000+ node private cluster; EVP calls you when app fails; runtime: days+ ‣ Production Cluster: TBs of data; EMR w/ 50 HPC Instances; Ops monitors results; runtime: hours – days ‣ Staging Cluster: GBs of data; EMR + 4 Spot Instances; CI shows red or green lights; runtime: minutes – hours ‣ Your Laptop: MBs of data; Hadoop standalone mode; passes unit tests, or not; runtime: seconds – minutes
  • 56. 2: word count [diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count] 1 mapper, 1 reducer, 18 lines code gist.github.com/3900702
  • 57. Cascading / Java
      String docPath = args[ 0 ];
      String wcPath = args[ 1 ];

      Properties properties = new Properties();
      AppProps.setApplicationJarClass( properties, Main.class );
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

      // create source and sink taps
      Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
      Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

      // specify a regex to split "document" text lines into token stream
      Fields token = new Fields( "token" );
      Fields text = new Fields( "text" );
      RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
      // only returns "token"
      Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

      // determine the word counts
      Pipe wcPipe = new Pipe( "wc", docPipe );
      wcPipe = new GroupBy( wcPipe, token );
      wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

      // connect the taps, pipes, etc., into a flow
      FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
       .addSource( docPipe, docTap )
       .addTailSink( wcPipe, wcTap );

      // write a DOT file and run the flow
      Flow wcFlow = flowConnector.connect( flowDef );
      wcFlow.writeDOT( "dot/wc.dot" );
      wcFlow.complete();
  • 58. Scalding / Scala
      // Sujit Pal
      // sujitpal.blogspot.com/2012/08/scalding-for-impatient.html
      package com.mycompany.impatient

      import com.twitter.scalding._

      class Part2(args : Args) extends Job(args) {
        val input = Tsv(args("input"), ('docId, 'text))
        val output = Tsv(args("output"))
        input.read.
          flatMap('text -> 'word) { text : String => text.split("""\s+""") }.
          groupBy('word) { group => group.size }.
          write(output)
      }
  • 59. Cascalog / Clojure
      ; Paul Lam
      ; github.com/Quantisan/Impatient
      (ns impatient.core
        (:use [cascalog.api]
              [cascalog.more-taps :only (hfs-delimited)])
        (:require [clojure.string :as s]
                  [cascalog.ops :as c])
        (:gen-class))

      (defmapcatop split [line]
        "reads in a line of string and splits it by regex"
        (s/split line #"[\[\]\(\),.)\s]+"))

      (defn -main [in out & args]
        (?<- (hfs-delimited out)
             [?word ?count]
             ((hfs-delimited in :skip-header? true) _ ?line)
             (split ?line :> ?word)
             (c/count ?count)))
  • 60. Hive
      -- Steve Severance
      -- stackoverflow.com/questions/10039949/word-count-program-in-hive
      CREATE TABLE input (line STRING);

      LOAD DATA LOCAL INPATH 'input.tsv'
      OVERWRITE INTO TABLE input;

      -- the table's only column is "line", so split that column into words
      SELECT word, COUNT(*)
      FROM input
        LATERAL VIEW explode(split(line, ' ')) lTable AS word
      GROUP BY word
      ;
  • 61. Pig
      -- kudos to Dmitriy Ryaboy
      docPipe = LOAD '$docPath' USING PigStorage('\t', 'tagsource') AS (doc_id, text);
      docPipe = FILTER docPipe BY doc_id != 'doc_id';

      -- specify regex to split "document" text lines into token stream
      tokenPipe = FOREACH docPipe GENERATE doc_id, FLATTEN(TOKENIZE(text, ' [](),.')) AS token;
      tokenPipe = FILTER tokenPipe BY token MATCHES '\\w.*';

      -- determine the word counts
      tokenGroups = GROUP tokenPipe BY token;
      wcPipe = FOREACH tokenGroups GENERATE group AS token, COUNT(tokenPipe) AS count;

      -- output
      STORE wcPipe INTO '$wcPath' USING PigStorage('\t', 'tagsource');
      EXPLAIN -out dot/wc_pig.dot -dot wcPipe;
  • 62. 3: wc + scrub [diagram: Document Collection → Tokenize → Scrub token → GroupBy token → Count → Word Count] 1 mapper, 1 reducer, 22+10 lines code
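The "Scrub token" step in this diagram normalizes tokens before they are grouped and counted. A minimal plain-Java sketch of that idea follows, assuming the scrub means trimming whitespace, lowercasing, and dropping empty tokens; the slide gives no code for this step, and `ScrubSketch` and its methods are hypothetical names rather than the tutorial's actual Cascading function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the "Scrub token" step: normalize tokens, drop empties.
class ScrubSketch {

    // assumed normalization: trim surrounding whitespace and lowercase
    static String scrubText(String token) {
        return token.trim().toLowerCase(Locale.ROOT);
    }

    // apply the scrub to every token and discard any that end up empty
    static List<String> scrubAll(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String scrubbed = scrubText(token);
            if (!scrubbed.isEmpty()) {
                out.add(scrubbed);
            }
        }
        return out;
    }
}
```

Without a step like this, "Rain" and "rain" would be counted as separate words by the downstream GroupBy.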
  • 63. 4: wc + scrub + stop words [diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (RHS: Stop Word List) → Regex token → GroupBy token → Count → Word Count] 1 mapper, 1 reducer, 28+10 lines code
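The diagram removes stop words with a left join against the stop word list followed by a filter. The same logic can be sketched in plain Java under the assumption that the filter keeps only tokens for which the join found no match on the right-hand side; `StopWordCount` is a hypothetical name and this is an in-memory illustration, not the Cascading pipe assembly.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical in-memory version of: tokens LEFT JOIN stop words -> keep unmatched -> group -> count.
class StopWordCount {

    static Map<String, Integer> count(List<String> tokens, Set<String> stopWords) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : tokens) {
            // left join: every token survives the join; the right side is null when nothing matches
            String stop = stopWords.contains(token) ? token : null;
            // filter: discard rows where the join found a stop word
            if (stop != null) {
                continue;
            }
            // group by token and count
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }
}
```

Keeping the small stop word list on the right-hand side is what lets the join be done in memory on the map side, which is why the slide still reports a single mapper and a single reducer.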
  • 64. 5: tf-idf [diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (RHS: Stop Word List) → Regex token, then branches: D (Unique doc_id → Insert 1 → SumBy doc_id), DF (Unique token → GroupBy token → Count), TF (GroupBy doc_id, token → Count); branches combined via HashJoin and CoGroup → ExprFunc tf-idf → TF-IDF; also GroupBy token → Count → Word Count] 11 mappers, 9 reducers, 65+10 lines code
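The three branches in this diagram compute D (number of documents), DF (documents containing a token), and TF (occurrences of a token in one document), which are then combined into a score. A minimal in-memory sketch follows; it assumes the weighting tf × ln(D / (1 + df)), which is consistent with the scores on the later results slide but is not spelled out on this one, and `TfIdfSketch` is a hypothetical name.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory TF-IDF: score = tf * ln(D / (1 + df)), an assumed weighting.
class TfIdfSketch {

    // docs maps doc_id -> already scrubbed, stop-word-filtered tokens
    static Map<String, Map<String, Double>> compute(Map<String, List<String>> docs) {
        int d = docs.size();                                        // D branch
        Map<String, Map<String, Integer>> tf = new TreeMap<>();    // TF branch
        Map<String, Integer> df = new TreeMap<>();                  // DF branch
        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            Map<String, Integer> counts = new TreeMap<>();
            for (String token : doc.getValue()) {
                counts.merge(token, 1, Integer::sum);
            }
            tf.put(doc.getKey(), counts);
            for (String token : counts.keySet()) {
                df.merge(token, 1, Integer::sum);                   // each doc counts once per token
            }
        }
        // join the three branches and apply the scoring expression
        Map<String, Map<String, Double>> scores = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> doc : tf.entrySet()) {
            Map<String, Double> row = new TreeMap<>();
            for (Map.Entry<String, Integer> t : doc.getValue().entrySet()) {
                double idf = Math.log(d / (1.0 + df.get(t.getKey())));
                row.put(t.getKey(), t.getValue() * idf);
            }
            scores.put(doc.getKey(), row);
        }
        return scores;
    }
}
```

In the distributed version each of these maps becomes its own grouped stream, which is why the job count grows from one mapper and one reducer to many.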
  • 65. 6: tf-idf + tdd [diagram: the tf-idf workflow from part 5, with an Assert on the Document Collection stream, a Checkpoint after the stop word HashJoin, and Failure Traps as an additional sink] 12 mappers, 9 reducers, 76+14 lines code
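The Assert and Failure Traps boxes in this diagram express a simple idea: validate each record, and route the ones that fail to a side output instead of failing the whole flow. A minimal sketch of that pattern in plain Java follows; the validation rule (a tab-separated pair whose id looks like `doc` followed by digits) and all names are assumptions for illustration, not the tutorial's Cascading code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "assert + trap": bad records go to a side list, good ones continue.
class TrapSketch {
    final List<String[]> passed = new ArrayList<>();   // records that satisfied the assertion
    final List<String> trapped = new ArrayList<>();    // raw lines diverted to the trap

    static TrapSketch process(List<String> lines) {
        TrapSketch out = new TrapSketch();
        for (String line : lines) {
            String[] fields = line.split("\t", -1);
            // assumed assertion: exactly two fields, and the id matches doc<digits>
            boolean valid = fields.length == 2 && fields[0].matches("doc\\d+");
            if (valid) {
                out.passed.add(fields);
            } else {
                out.trapped.add(line);
            }
        }
        return out;
    }
}
```

A unit test can then assert on both lists, which is the test-driven angle of this part: a malformed line such as the `zoink null` record in the sample input ends up in the trap output rather than in the results.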
  • 66. deployed on AWS… aws.amazon.com/elasticmapreduce/
      elastic-mapreduce --create --name "TF-IDF" \
        --jar s3n://temp.cascading.org/impatient/part6.jar \
        --arg s3n://temp.cascading.org/impatient/rain.txt \
        --arg s3n://temp.cascading.org/impatient/out/wc \
        --arg s3n://temp.cascading.org/impatient/en.stop \
        --arg s3n://temp.cascading.org/impatient/out/tfidf \
        --arg s3n://temp.cascading.org/impatient/out/trap \
        --arg s3n://temp.cascading.org/impatient/out/check
  • 67. results?
      input:
      doc_id  text
      doc01   A rain shadow is a dry area on the lee back side of a mountainous area.
      doc02   This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
      doc03   A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
      doc04   This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
      doc05   Two Women. Secrets. A Broken Land. [DVD Australia]
      zoink   null

      output:
      doc_id  tf-idf  token
      doc02   0.9163  air
      doc05   0.9163  australia
      doc05   0.9163  broken
      doc04   0.9163  california's
      doc04   0.9163  cause
      doc02   0.9163  cloudcover
      doc04   0.9163  death
      doc04   0.9163  deserts
      doc03   0.9163  downwind
      …
      doc02   0.9163  sinking
      doc04   0.9163  such
      doc04   0.9163  valley
      doc05   0.9163  women
      doc03   0.5108  land
      doc05   0.5108  land
      doc01   0.5108  lee
      doc02   0.5108  lee
      doc03   0.5108  leeward
      doc04   0.5108  leeward
      doc01   0.4463  area
      doc02   0.2231  area
      doc03   0.2231  area
      doc01   0.2231  dry
      doc02   0.2231  dry
      doc03   0.2231  dry
      doc02   0.2231  mountain
      doc03   0.2231  mountain
      doc04   0.2231  mountain
      doc01   0.0000  rain
      doc02   0.0000  rain
      doc03   0.0000  rain
      doc04   0.0000  rain
      doc01   0.0000  shadow
      doc02   0.0000  shadow
      doc03   0.0000  shadow
      doc04   0.0000  shadow
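The scores on this slide can be checked by hand. Assuming the weighting is term frequency times the natural log of D / (1 + df), with D = 5 documents (the malformed `zoink` line is not counted), the four distinct idf values and the one score with tf = 2 work out as follows; the formula is inferred from the numbers rather than stated on the slide.

```latex
\[
\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \ln\frac{D}{1 + \text{df}(t)}, \qquad D = 5
\]
\begin{align*}
\text{df} = 1 \ (\text{e.g. air}):&\quad \ln(5/2) \approx 0.9163 \\
\text{df} = 2 \ (\text{e.g. lee}):&\quad \ln(5/3) \approx 0.5108 \\
\text{df} = 3 \ (\text{e.g. dry}):&\quad \ln(5/4) \approx 0.2231 \\
\text{df} = 4 \ (\text{rain, shadow}):&\quad \ln(5/5) = 0 \\
\text{area in doc01 } (\text{tf} = 2,\ \text{df} = 3):&\quad 2 \ln(5/4) \approx 0.4463
\end{align*}
```

Tokens that appear in four of the five documents score zero, which is the intended effect: "rain" and "shadow" describe the whole collection, not any single document.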
  • 68. comparisons? compare similar code in Scalding (Scala) and Cascalog (Clojure): ‣ sujitpal.blogspot.com/2012/08/scalding-for-impatient.html, based on: github.com/twitter/scalding/wiki ‣ github.com/Quantisan/Impatient, based on: github.com/nathanmarz/cascalog/wiki
  • 69. Intro to Cascading: 6. code: sample apps
  • 70. Social Recommender github.com/Cascading/SampleRecommender [diagram: Twitter tweets → filter stop words → calculate similarity → threshold min, max → Neo4j, LDA, Redis; QA tap] ‣ social recommender based on Twitter: suggest users who tweet about similar stocks ‣ instead of a cross-product (potential bottleneck) this runs in parallel on Hadoop ‣ uses a stop word list to remove common words, offensive phrases, etc. ‣ one tap measures token frequency: for QA, adjust stop words, improve filter, etc. ‣ adapted in Spring by Costin Leau
  • 71. SocRec: architecture Twitter filter low-freq firehose source stop words tweets batch updates ( uid, tweet, t ) checkpoint: tokenized tweets calculate checkpoint: analysis + QA similarity token frequency curation checkpoint: similarity similar users thresholds threshold min, max sink sink sink Neo4j: social Redis graph LDA: topic results (uid: uidx, rank) trending
  • 72. SocRec: results
      uid              recommend        weight
      carbonfiberxrm   ClosingBellNews  0.1459
      carbonfiberxrm   DJFunkyGrrL      0.0870
      ClosingBellNews  DJFunkyGrrL      0.1491
      CloudStocks      DJFunkyGrrL      0.1206
      ElmoreNicole     DJFunkyGrrL      0.1798
      EsNeey           alexiolo_        0.8603
      ...
  • 73. City of Palo Alto open data github.com/Cascading/CoPA/wiki [diagram: CoPA GIS export → Regex parser and Regex filter branches for tree, road, and park records; tree records scrubbed, joined with Tree Metadata (HashJoin Left), and geohashed; road records joined with Road Metadata (HashJoin Left), with Estimate Albedo and Geohash; GPS logs geohashed; CoGroup, Tree Distance Filter, and GroupBy steps with tsv Checkpoints and Failure Traps; outputs: shade, reco] ‣ GIS export for parks, roads, trees (unstructured / open data) ‣ log files of personalized/frequented locations in Palo Alto via iPhone GPS tracks ‣ curated metadata, used to enrich the dataset ‣ could extend via mash-up with many available public data APIs ‣ Enterprise-scale app: road albedo + tree species metadata + geospatial indexing ‣ “Find a shady spot on a summer day to walk near downtown and take a call…”
  • 75. CoPA: results [plot: Estimated Tree Height (meters); x-axis avg_height 0 to 50, y-axis density 0.00 to 0.12, legend count 0 to 300] ‣ addr: 115 HAWTHORNE AVE ‣ lat/lng: 37.446, -122.168 ‣ geohash: 9q9jh0 ‣ tree: 413 site 2 ‣ species: Liquidambar styraciflua ‣ avg height 23 m ‣ road albedo: 0.12 ‣ distance: 10 m ‣ a short walk from my train stop ✔
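The geohash on this slide is what lets trees, roads, and GPS tracks be joined by location: nearby points share a string prefix, so an equality join on the hash approximates a spatial join. Below is a minimal sketch of the standard geohash encoding (interleaved longitude/latitude bisection bits, five bits per base-32 character); `GeohashSketch` is a hypothetical name, and the CoPA app's own implementation may differ.

```java
// Hypothetical sketch of standard geohash encoding; not taken from the CoPA source.
class GeohashSketch {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    static String encode(double lat, double lon, int length) {
        double latLo = -90.0, latHi = 90.0, lonLo = -180.0, lonHi = 180.0;
        StringBuilder hash = new StringBuilder();
        boolean useLon = true;   // bits alternate, starting with longitude
        int bits = 0;
        int value = 0;
        while (hash.length() < length) {
            if (useLon) {
                double mid = (lonLo + lonHi) / 2.0;
                if (lon >= mid) { value = (value << 1) | 1; lonLo = mid; }
                else            { value = value << 1;       lonHi = mid; }
            } else {
                double mid = (latLo + latHi) / 2.0;
                if (lat >= mid) { value = (value << 1) | 1; latLo = mid; }
                else            { value = value << 1;       latHi = mid; }
            }
            useLon = !useLon;
            if (++bits == 5) {   // every five bits become one base-32 character
                hash.append(BASE32.charAt(value));
                bits = 0;
                value = 0;
            }
        }
        return hash.toString();
    }
}
```

Calling `GeohashSketch.encode(37.446, -122.168, 6)` yields a hash in the same `9q9j` cell as the slide. Because the coordinates shown are rounded to three decimals, the trailing characters may differ from the slide's `9q9jh0`; only the shared prefix should be relied on for the join.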
  • 76. drill-down ‣ blog: cascading.org/ ‣ code/wiki/gists: github.org/Cascading/ ‣ jars: conjars.org/ ‣ list: goo.gl/KQtUL ‣ DevOps products: concurrentinc.com/ ‣ pnathan@concurrentinc.com @pacoid
