Cascading API: An Introduction to Data Workflows

Intro to Cascading

Paco Nathan
Concurrent, Inc.
pnathan@concurrentinc.com
@pacoid

[Figure: workflow diagram with nodes Document Collection, Tokenize, Scrub, HashJoin (Left; RHS: Stop Word List), Regex, GroupBy token, Count, Word Count; phase markers M and R]

Copyright ©2012, Concurrent, Inc.

Enterprise Apps
for Big Data
with Cascading

1. intro: Cascading API
2. backstory: Big Data origins
3. context: Hadoop cliff notes
4. theory: Data Science teams
5. tutorial: for the impatient
6. code: sample apps


1. intro:
Cascading API

Cascading API: purpose
‣ simplify data processing development and deployment

‣ improve application developer productivity

‣ enable data processing application manageability

Cascading API: a few facts
Java open source project (ASL 2) using Git, Gradle, Maven, JUnit, etc.

in production (~5 yrs) at hundreds of enterprise Hadoop deployments:
Finance, Health Care, Transportation, other verticals

studies published about large use cases: Twitter, Etsy, Airbnb, Square,
Climate Corporation, FlightCaster, Williams-Sonoma

partnerships and distribution with SpringSource, Amazon AWS,
Microsoft Azure, Hortonworks, MapR, EMC

several open source projects built atop, contribs by Twitter, Etsy, etc.,
which provide substantial Machine Learning libraries

DSLs available in Scala, Clojure, Python (Jython), Ruby (JRuby), Groovy

data “taps” integrate popular data frameworks via JDBC, Memcached, HBase,
plus serialization in Apache Thrift, Avro, Kryo, etc.

entire app compiles into a single JAR: fully connected for compiler optimization, exception
handling, debugging, config, scheduling, etc.

Cascading API: a few quotes
“Cascading gives Java developers the ability to build Big Data applications
on Hadoop using their existing skillset … Management can really go out
and build a team around folks that are already very experienced with Java.
Switching over to this is really a very short exercise.”
CIO, Thor Olavsrud, 2012-06-06
cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading

“Masks the complexity of MapReduce, simplifies the programming, and
speeds you on your journey toward actionable analytics … A vast
improvement over native MapReduce functions or Pig UDFs.”
2012 BOSSIE Awards, James Borck, 2012-09-18
infoworld.com/slideshow/65089

“Company’s promise to application developers is an opportunity to build
and test applications on their desktops in the language of choice with
familiar constructs and reusable components”
Dr. Dobb’s, Adrian Bridgwater, 2012-06-08
drdobbs.com/jvm/where-does-big-data-go-to-get-data-inten/240001759

data+code “political spectrum”
“Notes from the Mystery Machine Bus”
by Steve Yegge, Google
goo.gl/SeRZa

“conservative”                          “liberal”
(mostly) Enterprise                     (mostly) Start-Up
risk management                         customer experiments
assurance                               flexibility
well-defined schema                     schema follows code
explicit configuration                  convention
type-checking compiler                  interpreted scripts
wants no surprises                      wants no impediments
Java, Scala, Clojure, etc.              PHP, Ruby, Python, etc.
Cascading, Scalding, Cascalog, etc.     Hive, Pig, Hadoop Streaming, etc.

Cascading API: adoption

As Enterprise apps move into
Hadoop and related BigData
frameworks, risk profiles shift
toward more conservative
programming practices

Cascading provides a popular API
for defining and managing
Enterprise data workflows

enterprise data workflows
Tuples, Pipelines, Endpoints, Operations, Joins, Assertions, Traps, etc.
…in other words, “plumbing”


data workflows: team
‣ Business Stakeholder POV:
business process management for workflow orchestration (think BPM/BPEL)

‣ Systems Integrator POV:
system integration of heterogeneous data sources and compute platforms

‣ Data Scientist POV:
a directed, acyclic graph (DAG) on which we can apply Amdahl's Law, etc.

‣ Data Architect POV:
a physical plan for large-scale data flow management

‣ Software Architect POV:
a pattern language, similar to plumbing or circuit design

‣ App Developer POV:
API bindings for Java, Scala, Clojure, Jython, JRuby, etc.

‣ Systems Engineer POV:
a JAR file, has passed CI, available in a Maven repo

data workflows: layers

business process      domain expertise, business trade-offs,
                      operating parameters, market position, etc.

API language          Java, Scala, Clojure, Jython, JRuby, Groovy, etc.
                      …envision whatever runs in a JVM

optimize / schedule   major changes in technology now

physical plan         “assembler” code
                      [Figure: workflow diagram, as on the title slide]

compute substrate     Apache Hadoop, in-memory local mode
                      …envision GPUs, streaming, etc.

machine data          Splunk, Nagios, Collectd, New Relic, etc.

data workflows: SQL
Relational
SQL parser

logical plan,
optimized based on stats

physical plan

query history,
table stats

b-trees, etc.

ERD

table schema

catalog

data workflows: SQL vs. JVM

Relational                   Cascading + Driven
SQL parser                   SQL-92 compliant parser (in progress)
logical plan,                TODO: logical plan,
optimized based on stats     optimized based on stats
physical plan                API “plumbing”
query history, table stats   app history, tuple stats
b-trees, etc.                distributed compute substrate: Hadoop, in-memory, etc.
ERD                          flow diagram
table schema                 tuple schema
catalog                      endpoint usage DB


2. backstory:
Big Data origins

inflection point
[Figure: timeline marked 1997, 1998, 2004]

huge Internet successes after 1997 holiday season…
AMZN, EBAY, Inktomi (YHOO Search), then GOOG

consider this metric:
annual revenue per customer / amount of data stored
which dropped 100x within a few years after 1997

storage and processing costs plummeted, now we must
work much smarter to extract ROI from Big Data…
our methods must adapt

“conventional wisdom” of RDBMS and BI tools became
less viable; however, business cadre was still focused on
pivot tables and pie charts… which tends toward inertia!

MapReduce and the Hadoop open source stack grew
directly out of that contention… however, that effort
only solves parts of the puzzle

inflection point: consequences
Geoffrey Moore (Mohr Davidow Ventures, author of Crossing The Chasm)
Hadoop Summit, 2012:

“All of Fortune 500 is now on notice over the next 10-year period.”
Amazon and Google as exemplars of massive disruption in retail, advertising,
etc.
data as the major force displacing Global 1000 over the next decade, mostly
through apps — verticals, leveraging domain expertise

Michael Stonebraker (INGRES, PostgreSQL, Vertica, VoltDB, etc.)
XLDB, 2012:

“Complex analytics workloads are now displacing SQL as the basis
for Enterprise apps.”

primary sources
Amazon
“Early Amazon: Splitting the website” – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

eBay
“The eBay Architecture” – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf

Inktomi (YHOO Search)
“Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
youtube.com/watch?v=E91oEn1bnXM

Google
“The Birth of Google” – John Battelle
wired.com/wired/archive/13.08/battelle.html
“Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
youtube.com/watch?v=qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx

the world before…

BI, SQL, and highly
optimized code

data innovation: circa 1996
[Figure: architecture diagram. Labels: Stakeholder, Product, Customers, Engineering, BI Analysts, Web App, RDBMS; flows: strategy, requirements, optimized code, transactions, SQL Query result sets, Excel pivot tables, PowerPoint slide decks]

the world after…

machine learning,
leveraging log files

[Figure: architecture diagram. Labels: Stakeholder, Product, Customers, Engineering, Algorithmic Modeling, Web Apps, Middleware, Logs, DW, ETL, RDBMS; flows: dashboards, UX, models, servlets, recommenders + classifiers, aggregation, event history, SQL Query result sets, customer transactions]

the world ahead…

what our customers
are doing now

[Figure: architecture diagram. Labels: Customers, Data Apps, Domain Expert, Data Scientist, App Dev, Ops, Planner, Workflow, Prod, Eng, Web Apps / services / Mobile etc., History, Data Access Patterns, Hadoop etc., Log Events, In-Memory Data Grid, DW, Cluster Scheduler, RDBMS; flows: business process, dashboard metrics, data science, s/w dev, discovery + modeling, optimized capacity, endpoints, social interactions / transactions / content, batch, "real time", introduced capability, existing SDLC]

statistical thinking

Process Variation Data Tools

employing a mode of thought which includes both logical and analytical reasoning:
evaluating the whole of a problem, as well as its component parts; attempting
to assess the effects of changing one or more variables

this approach attempts to understand not just problems and solutions,
but also the processes involved and their variances

particularly valuable in Big Data work when combined with hands-on experience in
physics – roughly 50% of my peers come from physics or physical engineering…

programmers typically don’t think this way…
however, both systems engineers and data scientists must!

reference

by Leo Breiman
Statistical Modeling:
The Two Cultures
Statistical Science, 2001
bit.ly/eUTh9L

also check out RStudio:
rstudio.org/
rpubs.com/


3. context:
Hadoop cliff notes

MapReduce architecture
‣ name node + data nodes
‣ job tracker + task trackers
‣ submit queue
‣ task slots
‣ HDFS
‣ distributed cache

Wikipedia

Apache

MapReduce: how it works

map(k1, v1) → list(k2, v2)
reduce(k2, list(v2)) → list(k3, v3)

the property of data independence among tasks allows for parallel processing …
maybe, if the stars are all aligned :)
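
A minimal sketch of the two signatures above, in plain Python rather than Hadoop. The names map_reduce, wc_mapper, and wc_reducer are illustrative assumptions, not part of any Hadoop or Cascading API, and the in-memory dictionary only stands in for the shuffle between phases.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over (k1, v1) records, group by k2, then run reducer."""
    groups = defaultdict(list)
    # map phase: each (k1, v1) yields zero or more (k2, v2) pairs
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)  # stand-in for the shuffle
    # reduce phase: starts only after every map call has finished
    results = []
    for k2 in sorted(groups):
        results.extend(reducer(k2, groups[k2]))
    return results


def wc_mapper(doc_id, text):
    # map(k1, v1) -> list(k2, v2): emit (word, 1) for each word
    for word in text.lower().split():
        yield word, 1


def wc_reducer(word, counts):
    # reduce(k2, list(v2)) -> list(k3, v3): sum the counts per word
    yield word, sum(counts)


# hypothetical two-document input
docs = [("doc01", "a rain shadow"), ("doc02", "rain and more rain")]
counts = dict(map_reduce(docs, wc_mapper, wc_reducer))
```

Each mapper call touches only its own record, which is the data independence that allows parallel execution; the grouping step is where the phases must synchronize.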

MapReduce is mostly about fault tolerance, and how to leverage “commodity
hardware” to replace “big iron” solutions… where “big iron”
might apply to Oracle + NetApp, or perhaps an IBM zSeries mainframe…
or something else that’s expensive, undoubtedly.

bonus for math geeks: see any concerns about O(n) complexity, given
Amdahl’s Law plus the functional definitions listed above?

keep in mind that each phase cannot conclude and progress to the next
phase until after each of its tasks has successfully completed.
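
One way to put numbers on the Amdahl's Law question: a sketch of the standard speedup formula, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that parallelizes and n is the number of workers. The function name and the sample values are illustrative assumptions.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only part of a job runs in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# a job that is 90% parallel, on 10 workers and on a very large cluster
ten_workers = amdahl_speedup(0.9, 10)
huge_cluster = amdahl_speedup(0.9, 10**9)
```

With 90% of the work parallel, ten workers give a speedup a little above 5x, and no cluster size reaches 10x, since the serial 10% (such as waiting for every task in a phase to complete) sets the ceiling.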

a brief history…
circa 1979 – Stanford, MIT, CMU, etc.
set/list operations in LISP, Prolog, etc., for parallel processing
www-formal.stanford.edu/jmc/history/lisp/lisp.htm

circa 2004 – Google
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
labs.google.com/papers/mapreduce.html

circa 2006 – Apache
Hadoop, originating from the Nutch Project
Doug Cutting
research.yahoo.com/files/cutting.pdf

circa 2008 – Yahoo
web scale search indexing
Hadoop Summit, HUG, etc.
developer.yahoo.com/hadoop/

circa 2009 – Amazon AWS
Elastic MapReduce
Hadoop modified for EC2/S3, plus support for Hive, Pig, Cascading, etc.
aws.amazon.com/elasticmapreduce/

CAP theorem
purpose: theoretical limits for data access patterns
essence:
‣ consistency
‣ availability
‣ partition tolerance

best case scenario: you may pick two … or spend billions
struggling to obtain all three at scale (GOOG)
translated: cost of doing business

www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf

julianbrowne.com/article/viewer/brewers-cap-theorem

data access patterns
because the world is not made of data warehouses…

a handful of common data access patterns are prevalent

learn to recognize these for any given problem

typically expressed in terms of trade-offs:

‣ speed & volume (latency and throughput)

‣ reads & writes (access and storage)

‣ consistency / availability / partition tolerance

access                               → frameworks                        → forfeits
financial transactions               general ledger in RDBMS             CAx
ad-hoc queries                       RDS (hosted MySQL)                  CAx
reporting, dashboards                like Pentaho                        CAx
log rotation/persistence             like Riak                           xxP
search indexes                       like Lucene/Solr                    xAP
static content, archives             S3 (durable storage)                xAP
customer facts                       like Redis, Membase                 xAP
distributed counters, locks, sets    like Redis                          xAP*
data objects CRUD                    key/value, like NoSQL on MySQL      CxP
authoritative metadata               like Zookeeper                      CxP
data prep, modeling at scale         like Hadoop/Cascading + R           CxP
graph analysis                       like Hadoop + Redis + Gephi         CxP
data marts                           like Hadoop/HBase                   CxP

parallel computation
parallelism allows for horizontal scale-out, which creates
business “levers” in cost/performance at scale

NB: MapReduce provides a compute framework which
is part-parallel and part-serial… which tends to
complicate app development

most hard problems in industry have portions which do not
allow data independence, or which require iteration

current efforts in massively parallel algorithms research may
help to parallelize problems and reduce iteration – estimates
are 3-5 years out for industry use

GPUs and other hardware architecture advancements
will likely make Hadoop unrecognizable 3-5 years out

reference

by Tom White
Hadoop: The Definitive Guide
O’Reilly, 2009
amazon.com/dp/1449311520

see also:
Cluster Computing and MapReduce Lectures
code.google.com/edu/submissions/mapreduce-minilecture/listing.html


4. theory:
Data Science teams

core values

Data Science teams develop actionable insights,
building confidence for decisions

that work may influence a few decisions worth
billions (e.g., M&A) or billions of small decisions (e.g.,
AdWords)

probably somewhere in-between…
solving for pattern, at scale.

an interdisciplinary pursuit which
requires teams, not sole players

most valuable skills
approximately 80% of the costs for data-related projects
get spent on data preparation – mostly on cleaning up
data quality issues: ETL, log file analysis, etc.

unfortunately, data-related budgets for many companies tend
to go into frameworks which can only be used after clean up

most valuable skills:
‣ learn to use programmable tools that prepare data

‣ learn to generate compelling data visualizations

‣ learn to estimate the confidence for reported results

‣ learn to automate work, making analysis repeatable

the rest of the skills – modeling,
algorithms, etc. – those are secondary

D3

the science in data science?

in a nutshell, what we do…

‣ estimate probability

‣ calculate analytic variance

‣ manipulate order complexity

‣ make use of learning theory

+ collab with DevOps, Stakeholders

+ reduce our work to cron entries

[Figure: two charts with rotated labels naming product events, e.g. “NUI:DressUpMode”, “Client Inventory Panel Apply Product”, “Website Login”, “Add Buddy”, “Chat Now”, “Unique Registration”]

team process = needs

discovery      help people ask the right questions
modeling       allow automation to place informed bets
integration    deliver products at scale to customers
apps           build smarts into product features
systems        keep infrastructure running, cost-effective

Gephi

team composition = roles

Domain Expert     business process, stakeholder
Data Scientist    data science: data prep, discovery, modeling, etc.
App Dev           software engineering, automation
Ops               systems engineering, access

introduced capability

matrix = needs × roles

[Figure: matrix with columns discovery, modeling, integration, apps, systems and rows stakeholder, scientist, developer, ops]

matrix: example team

[Figure: matrix with columns discovery, modeling, integration, apps, systems and rows stakeholder, scientist, developer, ops, filled in for an example team]

summary: this team seems heavy on systems, may need more overlap
between modeling and integration, particularly among team leads

typical hand-offs

[Figure: hand-off diagram in four columns: integrity, availability, discovery, communications. Labels: people, vendor data sources, query hosts, data warehouse, production cluster, BI & dashboards, reporting, presentations, decision support, classifiers, predictive analytics, recommenders, customer interactions, analyze / visualize, business stakeholders, internal API / crons etc., modeling, engineers, automation, analysts]

use case: marketing funnel
• must optimize a very large ad spend
• different vendors report different metrics
• seasonal variation distorts performance
• some campaigns are much smaller than others
• hard to predict ROI for incremental spend

Wikipedia

approach:
• log aggregation, followed with cohort analysis
• Bayesian point estimates compare different-sized ad tests
• customer lifetime value quantifies ROI of new leads
• time series analysis normalizes for seasonal variation
• geolocation adjusts for regional cost/benefit
• linear programming models estimate elasticity of demand
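
To make the point-estimate idea concrete, here is one possible sketch: the posterior mean of a conversion rate under a Beta prior. The deck does not say which estimator was used, so the Beta-binomial model, the function name, the prior values, and the campaign numbers are all assumptions for illustration.

```python
def posterior_mean(successes, trials, prior_alpha=1.0, prior_beta=1.0):
    """Posterior mean of a rate under a Beta(prior_alpha, prior_beta) prior."""
    if successes < 0 or trials < successes:
        raise ValueError("need 0 <= successes <= trials")
    return (successes + prior_alpha) / (trials + prior_alpha + prior_beta)


# hypothetical campaigns: a small test and a large one, with an assumed
# prior centered on a 20% conversion rate
small = posterior_mean(3, 10, prior_alpha=2.0, prior_beta=8.0)       # 5 / 20
large = posterior_mean(200, 1000, prior_alpha=2.0, prior_beta=8.0)   # 202 / 1010
```

The raw rates are 30% and 20%; the estimate pulls the small test toward the prior (25%) while barely moving the large one, which is what makes different-sized tests comparable.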

use case: ecommerce fraud
• sparse data means lots of missing values
• “needle in a haystack” lack of training cases
• answers are available in large-scale batch, results
are needed in real-time event processing
• not just one pattern to detect – many, ever-changing

stat.berkeley.edu

approach:
• random forest (RF) classifiers predict likely fraud
• subsampled data to re-balance training sets
• impute missing values based on density functions
• train on massive log files, run on in-memory grid
• adjust metrics to minimize customer support costs
• detect novelty – report anomalies via notifications

use case: customer segmentation
• many millions of customers, hard to determine
which features resonate
• multi-modal distributions get obscured by the
practice of calculating an “average”
• not much is known about individual customers

Mathworks

approach:
• connected components for sessionization, determining
uniques from logs
• estimates for age, gender, income, geo, etc.
• clustering algorithms to group into market segments
• social graph infers “unknown” relationships
• covariance/heat maps visualize segments vs. feature sets

use case: monetizing content
• need to suggest relevant content which would
otherwise get buried in the back catalog
• big disconnect between inventory and limited
performance ad market
• enormous amounts of text, hard to categorize

Digital Humanities

approach:
• text analytics glean key phrases from documents
• hierarchical clustering of character frequencies detects language
• latent Dirichlet allocation (LDA) reduces dimension to
topic models
• recommenders suggest similar topics to customers
• collaborative filters connect known users with less known

reference

by DJ Patil

Data Jujitsu
O’Reilly, 2012
amazon.com/dp/B008HMN5BE

Building Data Science Teams
O’Reilly, 2011
amazon.com/dp/B005O4U3ZE


5. tutorial:
for the impatient

“Cascading for the Impatient”
cascading.org/category/impatient/
‣ a series of introductory tutorials and code samples

‣ 1:1 code comparisons in Scalding, Cascalog, Pig, Hive


1: copy

import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class
  Main
  {
  public static void
  main( String[] args )
    {
    String inPath = args[ 0 ];
    String outPath = args[ 1 ];

    Properties props = new Properties();
    AppProps.setApplicationJarClass( props, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

    // create the source tap (tab-delimited text with a header line)
    Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

    // create the sink tap
    Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

    // specify a pipe to connect the taps
    Pipe copyPipe = new Pipe( "copy" );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
     .addSource( copyPipe, inTap )
     .addTailSink( copyPipe, outTap );

    // run the flow
    flowConnector.connect( flowDef ).complete();
    }
  }

[Figure: workflow diagram with labels Source, M, Sink]

1 mapper
0 reducers
10 lines code

wait!

ten lines of code
for a file copy…
seems like a lot.

same JAR, any scale…
MegaCorp Enterprise IT:
Pb’s data
1000+ node private cluster
EVP calls you when app fails
runtime: days+

Production Cluster:
Tb’s data
EMR w/ 50 HPC Instances
Ops monitors results
runtime: hours – days

Staging Cluster:
Gb’s data
EMR + 4 Spot Instances
CI shows red or green lights
runtime: minutes – hours

Your Laptop:
Mb’s data
Hadoop standalone mode
passes unit tests, or not
runtime: seconds – minutes

2: word count

[Figure: workflow diagram with nodes Document Collection, Tokenize, GroupBy token, Count, Word Count; phase markers M and R]

1 mapper
1 reducer
18 lines code gist.github.com/3900702

Cascading / Java

String docPath = args[ 0 ];
String wcPath = args[ 1 ];

Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
 .addSource( docPipe, docTap )
 .addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();

Scalding / Scala

// Sujit Pal
// sujitpal.blogspot.com/2012/08/scalding-for-impatient.html

package com.mycompany.impatient

import com.twitter.scalding._

class Part2(args : Args) extends Job(args) {
  val input = Tsv(args("input"), ('docId, 'text))
  val output = Tsv(args("output"))
  input.read.
    flatMap('text -> 'word) {
      // split each line of text on runs of whitespace
      text : String => text.split("""\s+""")
    }.
    groupBy('word) { group => group.size }.
    write(output)
}

Cascalog / Clojure

; Paul Lam
; github.com/Quantisan/Impatient

(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\\\(\),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))

Hive

-- Steve Severance
-- stackoverflow.com/questions/10039949/word-count-program-in-hive

CREATE TABLE input (line STRING);

LOAD DATA LOCAL INPATH 'input.tsv'
OVERWRITE INTO TABLE input;

-- split the "line" column declared above into one row per word
SELECT
word, COUNT(*)
FROM input
LATERAL VIEW explode(split(line, ' ')) lTable AS word
GROUP BY word
;

Pig

-- kudos to Dmitriy Ryaboy

docPipe = LOAD '$docPath' USING PigStorage('\t', 'tagsource')
AS (doc_id, text);
docPipe = FILTER docPipe BY doc_id != 'doc_id';

-- specify regex to split "document" text lines into token stream
tokenPipe = FOREACH docPipe
GENERATE doc_id, FLATTEN(TOKENIZE(text, ' [](),.')) AS token;
tokenPipe = FILTER tokenPipe BY token MATCHES '\\w.*';

-- determine the word counts
tokenGroups = GROUP tokenPipe BY token;
wcPipe = FOREACH tokenGroups
GENERATE group AS token, COUNT(tokenPipe) AS count;

-- output
STORE wcPipe INTO '$wcPath' USING PigStorage('\t', 'tagsource');
EXPLAIN -out dot/wc_pig.dot -dot wcPipe;

3: wc + scrub

[Figure: workflow diagram with nodes Document Collection, Tokenize, Scrub token, GroupBy token, Count, Word Count; phase markers M and R]

1 mapper
1 reducer
22+10 lines code

4: wc + scrub + stop words

[Figure: workflow diagram with nodes Document Collection, Tokenize, Scrub token, HashJoin (Left; RHS: Stop Word List), Regex token, GroupBy token, Count, Word Count; phase markers M and R]

1 mapper
1 reducer
28+10 lines code
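
A rough sketch of what this workflow computes, in plain Python rather than the Cascading API: tokenize, scrub, drop tokens found in the stop word list, then count. The function names, the sample sentences (shortened from the documents shown later in this deck), and the two-word stop list are illustrative assumptions; the split pattern mirrors the regex in the Java word count example.

```python
import re
from collections import Counter

# same character class as the Java example: space, brackets, parens, comma, period
TOKEN_SPLIT = re.compile(r"[ \[\]\(\),.]")


def scrub(token):
    """Normalize a raw token: trim whitespace and lowercase."""
    return token.strip().lower()


def word_count(docs, stop_words):
    """Count scrubbed tokens across docs, skipping stop words."""
    counts = Counter()
    for _doc_id, text in docs:
        for raw in TOKEN_SPLIT.split(text):
            token = scrub(raw)
            if not token:
                continue  # empty strings left over from the split
            if token in stop_words:
                continue  # plays the role of the left join plus filter
            counts[token] += 1
    return counts


# hypothetical, shortened input
docs = [
    ("doc01", "A rain shadow is a dry area."),
    ("doc02", "Dry air produces a rain shadow."),
]
counts = word_count(docs, stop_words={"a", "is"})
```

In the diagram the stop word list sits on the right-hand side of a HashJoin; the set lookup here is the in-memory analogue of joining a small list against a large token stream.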

5: tf-idf

[Figure: workflow diagram. Branches: D (Unique doc_id, Insert 1, SumBy doc_id), DF (Unique token, GroupBy token, Count), TF (GroupBy doc_id, token; Count), joined via HashJoin and CoGroup into ExprFunc tf-idf, with sinks TF-IDF and Word Count. Upstream nodes: Document Collection, Tokenize, Scrub token, HashJoin (Left; RHS: Stop Word List), Regex token; phase markers M and R]

11 mappers
9 reducers
65+10 lines code

6: tf-idf + tdd

[Figure: the tf-idf workflow diagram from part 5, with added nodes Assert, Checkpoint, and Failure Traps]

12 mappers
9 reducers
76+14 lines code
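
The Assert and Failure Traps nodes can be pictured with a small sketch in plain Python: rows that fail a check are diverted to a trap output instead of failing the whole run. The function name and the doc_id pattern are assumptions for illustration; the malformed "zoink" row is taken from the sample input shown in the results slide.

```python
import re

# assumed shape of a valid doc_id, e.g. "doc05"
DOC_ID = re.compile(r"^doc\d+$")


def run_with_trap(rows):
    """Split rows into those that pass the assertion and those trapped."""
    passed, trapped = [], []
    for doc_id, text in rows:
        if text is None or not DOC_ID.match(doc_id):
            trapped.append((doc_id, text))  # divert, keep processing
        else:
            passed.append((doc_id, text))
    return passed, trapped


rows = [
    ("doc05", "Two Women. Secrets. A Broken Land. [DVD Australia]"),
    ("zoink", None),
]
passed, trapped = run_with_trap(rows)
```

The trapped rows would be written to their own output for later inspection, so one bad record does not stop a long-running job.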

deployed on AWS…

elastic-mapreduce --create --name "TF-IDF" \
--jar s3n://temp.cascading.org/impatient/part6.jar \
--arg s3n://temp.cascading.org/impatient/rain.txt \
--arg s3n://temp.cascading.org/impatient/out/wc \
--arg s3n://temp.cascading.org/impatient/en.stop \
--arg s3n://temp.cascading.org/impatient/out/tfidf \
--arg s3n://temp.cascading.org/impatient/out/trap \
--arg s3n://temp.cascading.org/impatient/out/check

aws.amazon.com/elasticmapreduce/

results?

input:

doc_id  text
doc01   A rain shadow is a dry area on the lee back side of a mountainous area. …
doc02   This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03   A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04   This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05   Two Women. Secrets. A Broken Land. [DVD Australia]
zoink   null

output:

doc_id  tf-idf  token
doc02   0.9163  air
doc05   0.9163  australia
doc05   0.9163  broken
doc04   0.9163  california's
doc04   0.9163  cause
doc02   0.9163  cloudcover
doc04   0.9163  death
doc04   0.9163  deserts
doc03   0.9163  downwind
doc02   0.9163  sinking
doc04   0.9163  such
doc04   0.9163  valley
doc05   0.9163  women
doc03   0.5108  land
doc05   0.5108  land
doc01   0.5108  lee
doc02   0.5108  lee
doc03   0.5108  leeward
doc04   0.5108  leeward
doc01   0.4463  area
doc02   0.2231  area
doc03   0.2231  area
doc01   0.2231  dry
doc02   0.2231  dry
doc03   0.2231  dry
doc02   0.2231  mountain
doc01   0.0000  rain
doc02   0.0000  rain
doc03   0.0000  rain
doc04   0.0000  rain
doc01   0.0000  shadow
doc02   0.0000  shadow
doc03   0.0000  shadow
doc04   0.0000  shadow
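
The scores above are consistent with tf-idf = tf × ln(D / (1 + df)), where tf is the raw count of a token in a document, df is the number of documents containing it, and D = 5 documents: for example ln(5/2) ≈ 0.9163 for a token found in one document, and ln(5/5) = 0 for "rain", found in four. The deck does not state the formula, so this form is inferred from the numbers. The sketch below is plain Python, and the toy token lists are an assumption: they are cut down to a few tokens chosen to keep the same document frequencies.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs maps doc_id -> list of tokens; returns {(doc_id, token): score}."""
    n_docs = len(docs)
    # term frequency: raw count of each token per document
    tf = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    # document frequency: number of documents containing each token
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())
    return {
        (doc_id, token): count * math.log(n_docs / (1.0 + df[token]))
        for doc_id, counts in tf.items()
        for token, count in counts.items()
    }


# reduced, hypothetical token lists with the same document frequencies
docs = {
    "doc01": ["rain", "area", "area"],
    "doc02": ["rain", "area", "air"],
    "doc03": ["rain", "area"],
    "doc04": ["rain"],
    "doc05": ["women"],
}
scores = tf_idf(docs)
```

With the 1 + df smoothing, a token that appears in all but one document scores exactly zero, which is why "rain" and "shadow" drop to the bottom of the list.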

comparisons?

compare similar code in Scalding (Scala) and Cascalog (Clojure):

sujitpal.blogspot.com/2012/08/scalding-for-impatient.html
based on: github.com/twitter/scalding/wiki

github.com/Quantisan/Impatient
based on: github.com/nathanmarz/cascalog/wiki


6. code:
sample apps

Social Recommender

[Figure: workflow diagram with labels Twitter tweets, filter stop words, calculate similarity, QA, threshold (min, max), Neo4j, LDA, Redis]

github.com/Cascading/SampleRecommender
‣ social recommender based on Twitter: suggest users who tweet about similar stocks
‣ instead of a cross-product (potential bottleneck) this runs in parallel on Hadoop
‣ uses a stop word list to remove common words, offensive phrases, etc.
‣ one tap measures token frequency: for QA, adjust stop words, improve filter, etc.
‣ adapted in Spring by Costin Leau

SocRec: architecture

[Figure: architecture diagram. Source: Twitter firehose tweets ( uid, tweet, t ); steps: filter stop words (with low-freq batch updates), calculate similarity, threshold (min, max); checkpoints: tokenized tweets, token frequency, similar users; QA: analysis + curation, similarity thresholds; sinks: Neo4j (social graph), LDA (topic trending), Redis (results: uid: uidx, rank)]

SocRec: results

uid               recommend         weight
carbonfiberxrm    ClosingBellNews   0.1459
carbonfiberxrm    DJFunkyGrrL       0.0870
ClosingBellNews   DJFunkyGrrL       0.1491
CloudStocks       DJFunkyGrrL       0.1206
ElmoreNicole      DJFunkyGrrL       0.1798
EsNeey            alexiolo_         0.8603
...

City of Palo Alto open data

[Figure: workflow diagram. Source: CoPA GIS export, parsed by Regex into tree, road, and park branches. Tree branch: Scrub species, HashJoin with Tree Metadata, Geohash, Checkpoint. Road branch: HashJoin with Road Metadata, Estimate Albedo, Geohash, Road Segments, Checkpoint. Tree and road join via CoGroup, Tree Distance filter, GroupBy into shade, then CoGroup with GPS logs into reco. Failure Traps attached; phase markers M and R]

github.com/Cascading/CoPA/wiki
‣ GIS export for parks, roads, trees (unstructured / open data)
‣ log files of personalized/frequented locations in Palo Alto via iPhone GPS tracks
‣ curated metadata, used to enrich the dataset
‣ could extend via mash-up with many available public data APIs

Enterprise-scale app: road albedo + tree species metadata + geospatial indexing
“Find a shady spot on a summer day to walk near downtown and take a call…”

CoPA: results

[Figure: density plot titled “Estimated Tree Height (meters)”; x-axis avg_height, 0 to 50; y-axis density, 0.00 to 0.12; legend count, 0 to 300]

‣ addr: 115 HAWTHORNE AVE
‣ lat/lng: 37.446, -122.168
‣ geohash: 9q9jh0
‣ tree: 413 site 2
‣ species: Liquidambar styraciflua
‣ avg height 23 m
‣ road albedo: 0.12
‣ distance: 10 m
‣ a short walk from my train stop ✔
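
For readers new to the geohash field above, here is a sketch of the standard geohash encoding in plain Python: bits alternate between longitude and latitude, each bit halves the remaining range, and every five bits become one base-32 character. The function name is illustrative and this is not the code used in the CoPA app.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lng, precision=6):
    """Encode a latitude/longitude pair as a geohash of the given length."""
    lat_range = [-90.0, 90.0]
    lng_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lng = True  # the first bit refines longitude, then they alternate
    while len(chars) < precision:
        rng, value = (lng_range, lng) if use_lng else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2.0
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half
        use_lng = not use_lng
        bit_count += 1
        if bit_count == 5:
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


# coordinates from the slide, rounded to three decimals there
cell = geohash_encode(37.446, -122.168, precision=4)
```

Nearby points share a prefix, which is what lets trees and road segments be joined on a geohash key. The rounded coordinates on the slide sit very close to a cell boundary, so hashes longer than four characters computed from them may differ from the slide's 9q9jh0.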

drill-down

blog, code/wiki/gists, jars, list, DevOps products:
cascading.org/
github.com/Cascading/
conjars.org/
goo.gl/KQtUL
concurrentinc.com/
pnathan@concurrentinc.com
@pacoid
