SlideShare una empresa de Scribd logo
1 de 67
Descargar para leer sin conexión
Paco Nathan
Concurrent, Inc.
San Francisco, CA
@pacoid
“Pattern – an open source project
for migrating predictive models
from SAS, etc., onto Hadoop”
Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads
Cascading: background
The Workflow Abstraction
PMML: Predictive Model Markup
Pattern: PMML in Cascading
PMML for Customer Experiments
Ensemble Models with Pattern
Workflow Design Pattern
Cascading – origins
API author Chris Wensel worked as a system architect
at an Enterprise firm well-known for many popular data
products.
Wensel was following the Nutch open source project –
where Hadoop started.
Observation: would be difficult to find Java developers
to write complex Enterprise apps in MapReduce –
potential blocker for leveraging new open source
technology.
Cascading – functional programming
Key insight: MapReduce is based on functional programming
– back to LISP in 1970s. Apache Hadoop use cases are
mostly about data pipelines, which are functional in nature.
To ease staffing problems as “Main Street” Enterprise firms
began to embrace Hadoop, Cascading was introduced
in late 2007, as a new Java API to implement functional
programming for large-scale data workflows:
leverages JVM and Java-based tools without any
need to create new languages
allows programmers who have J2EE expertise
to leverage the economics of Hadoop clusters
•
•
Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Cascading – definitions
a pattern language for Enterprise Data
Workflows
simple to build, easy to test, robust in
production
design principles ⟹ ensure best practices at
scale
•
•
•
Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Cascading – usage
Java API, DSLs in Scala,
Clojure,
Jython, JRuby, Groovy, ANSI
SQL
ASL 2 license, GitHub src,
http://conjars.org
5+ yrs production use,
multiple Enterprise verticals
•
•
•
Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Cascading – integrations
partners: Microsoft Azure, Hortonworks,
Amazon AWS, MapR, EMC, SpringSource,
Cloudera
taps: Memcached, Cassandra, MongoDB,
HBase, JDBC, Parquet, etc.
serialization: Avro, Thrift, Kryo,
JSON, etc.
topologies: Apache Hadoop,
tuple spaces, local mode
•
•
•
•
Cascading – deployments
case studies: Climate Corp, Twitter, Etsy,
Williams-Sonoma, uSwitch, Airbnb, Nokia,
YieldBot, Square, Harvard, Factual, etc.
use cases: ETL, marketing funnel, anti-fraud,
social media, retail pricing, search analytics,
recommenders, eCRM, utility grids, telecom,
genomics, climatology, agronomics, etc.
•
•
Cascading – deployments
case studies: Climate Corp, Twitter, Etsy,
Williams-Sonoma, uSwitch, Airbnb, Nokia,
YieldBot, Square, Harvard, Factual, etc.
use cases: ETL, marketing funnel, anti-fraud,
social media, retail pricing, search analytics,
recommenders, eCRM, utility grids, telecom,
genomics, climatology, agronomics, etc.
•
•
workflow abstraction addresses:
• staffing bottleneck;
• system integration;
• operational complexity;
• test-driven development
Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads
Cascading: background
The Workflow Abstraction
PMML: Predictive Model Markup
Pattern: PMML in Cascading
PMML for Customer Experiments
Ensemble Models with Pattern
Workflow Design Pattern
Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Enterprise Data Workflows
Let’s consider a “strawman” architecture
for an example app… at the front end
LOB use cases drive demand for apps
Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Enterprise Data Workflows
Same example… in the back office
Organizations have substantial investments
in people, infrastructure, process
Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Enterprise Data
Workflows
Same example… the heavy lifting!
“Main Street” firms are migrating
workflows to Hadoop, for cost
savings and scale-out
Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Cascading workflows – taps
taps integrate other data frameworks, as tuple streams
these are “plumbing” endpoints in the pattern language
sources (inputs), sinks (outputs), traps (exceptions)
text delimited, JDBC, Memcached,
HBase, Cassandra, MongoDB, etc.
data serialization: Avro, Thrift,
Kryo, JSON, etc.
extend a new kind of tap in just
a few lines of Java
schema and provenance get
derived from analysis of the taps
•
•
•
•
•
•
Cascading workflows – taps
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "t" ), wcPath );
// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ []
(),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
.addSource( docPipe, docTap )
.addTailSink( wcPipe, wcTap );
// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
source and sink taps
for TSV data in HDFS
Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Cascading workflows – topologies
topologies execute workflows on clusters
flow planner is like a compiler for queries
Hadoop (MapReduce jobs)
local mode (dev/test or special config)
in-memory data grids (real-time)
flow planner can be extended
to support other topologies
blend flows in different topologies
into the same app – for example,
batch (Hadoop) + transactions (IMDG)
•
•
-
-
-
•
Cascading workflows – topologies
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "t" ), wcPath );
// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ []
(),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
.addSource( docPipe, docTap )
.addTailSink( wcPipe, wcTap );
// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
flow planner for
Apache Hadoop
topology
Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Cascading workflows – test-driven development
assert patterns (regex) on the tuple streams
adjust assert levels, like log4j levels
trap edge cases as “data exceptions”
TDD at scale:
start from raw inputs in the flow graph
define stream assertions for each stage
of transforms
verify exceptions, code to remove them
when impl is complete, app has full
test coverage
redirect traps in production
to Ops, QA, Support, Audit, etc.
•
•
•
•
1.
2.
3.
4.
Workflow Abstraction – pattern language
Cascading uses a “plumbing” metaphor in the Java API,
to define workflows out of familiar elements: Pipes, Taps,
Tuple Flows, Filters, Joins, Traps, etc.
Scrub
token
Document
Collection
Tokenize
Word
Count
GroupBy
token
Count
Stop Word
List
Regex
token
HashJoin
Left
RHS
M
R
Data is represented as flows of tuples. Operations within
the flows bring functional programming aspects into Java
In formal terms, this provides a pattern language
Pattern Language
structured method for solving large, complex design
problems, where the syntax of the language ensures
the use of best practices – i.e., conveying expertise
Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads
A Pattern Language
Christopher Alexander, et al.
amazon.com/dp/0195019199
Workflow Abstraction – literate programming
Cascading workflows generate their own visual
documentation: flow diagrams
in formal terms, flow diagrams leverage a methodology
called literate programming
provides intuitive, visual representations for apps –
great for cross-team collaboration
Scrub
token
Document
Collection
Tokenize
Word
Count
GroupBy
token
Count
Stop Word
List
Regex
token
HashJoin
Left
RHS
M
R
Literate
Programming
by Don Knuth
Literate Programming
Univ of Chicago Press, 1992
literateprogramming.com/
“Instead of imagining that our main task is
to instruct a computer what to do, let us
concentrate rather on explaining to human
beings what we want a computer to do.”
Workflow Abstraction – business process
following the essence of literate programming, Cascading
workflows provide statements of business process
this recalls a sense of business process management
for Enterprise apps (think BPM/BPEL for Big Data)
Cascading creates a separation of concerns between
business process and implementation details (Hadoop, etc.)
this is especially apparent in large-scale Cascalog apps:
“Specify what you require, not how to achieve it.”
by virtue of the pattern language, the flow planner then
determines how to translate business process into efficient,
parallel jobs at scale
Business
Process
by Edgar Codd
“A relational model of data for large shared data banks”
Communications of the ACM, 1970
dl.acm.org/citation.cfm?id=362685
rather than arguing between SQL vs. NoSQL…
structured vs. unstructured data frameworks…
this approach focuses on what apps do:
the process of structuring data
Cascading – functional
programming
Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc.,
have invested in open source projects atop Cascading
– used for their large-scale production deployments
new case studies for Cascading apps are mostly
based on domain-specific languages (DSLs) in JVM
languages which emphasize functional programming:
Cascalog in Clojure (2010)
Scalding in Scala (2012)
github.com/nathanmarz/cascalog/wiki
github.com/twitter/scalding/wiki
•
•
Why Adopting the Declarative Programming Practices Will Improve Your Return from
Technology
Dan Woods, 2013-04-17 Forbes
forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-
practices-will-improve-your-return-from-technology/
Functional Programming for Big Data
WordCount with token scrubbing…
Apache Hive: 52 lines HQL + 8 lines Python (UDF)
compared to
Scalding: 18 lines Scala/Cascading
functional programming languages help reduce
software engineering costs at scale, over time
Two Avenues to the App Layer…
scale ➞
complexity➞
Enterprise: must contend with
complexity at scale everyday…
incumbents extend current practices and
infrastructure investments – using J2EE,
ANSI SQL, SAS, etc. – to migrate
workflows onto Apache Hadoop while
leveraging existing staff
Start-ups: crave complexity and
scale to become viable…
new ventures move into Enterprise space
to compete using relatively lean staff,
while leveraging sophisticated engineering
practices, e.g., Cascalog and Scalding
Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads
Cascading: background
The Workflow Abstraction
PMML: Predictive Model Markup
Pattern: PMML in Cascading
PMML for Customer Experiments
Ensemble Models with Pattern
Workflow Design Pattern
established XML standard for predictive model markup
organized by Data Mining Group (DMG), since 1997
http://dmg.org/
members: IBM, SAS, Visa, NASA, Equifax, Microstrategy,
Microsoft, etc.
PMML concepts for metadata, ensembles, etc., translate
directly into Cascading tuple flows
“PMML is the leading standard for statistical and data mining models and
supported by over 20 vendors and organizations. With PMML, it is easy
to develop a model on one system using one application and deploy the
model on another system using another application.”
•
•
•
•
PMML – standard
wikipedia.org/wiki/Predictive_Model_Markup_Language
Association Rules: AssociationModel element
Cluster Models: ClusteringModel element
Decision Trees: TreeModel element
Naïve Bayes Classifiers: NaiveBayesModel element
Neural Networks: NeuralNetwork element
Regression: RegressionModel and GeneralRegressionModel elements
Rulesets: RuleSetModel element
Sequences: SequenceModel element
Support Vector Machines: SupportVectorMachineModel element
Text Models: TextModel element
Time Series: TimeSeriesModel element
•
•
•
•
•
•
•
•
•
•
•
PMML – model coverage
ibm.com/developerworks/industry/library/ind-PMML2/
PMML – vendor coverage
Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads
Cascading: background
The Workflow Abstraction
PMML: Predictive Model Markup
Pattern: PMML in Cascading
PMML for Customer Experiments
Ensemble Models with Pattern
Workflow Design Pattern
Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Pattern – model scoring
migrate workloads: SAS,Teradata, etc.,
exporting predictive models as PMML
great open source tools – R, Weka,
KNIME, Matlab, RapidMiner, etc.
integrate with other libraries –
Matrix API, etc.
leverage PMML as another kind
of DSL
•
•
•
•
cascading.org/pattern
## train a RandomForest model
f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=50)
## test the model on the holdout test set
print(fit$importance)
print(fit)
predicted <- predict(fit, data)
data$predicted <- predicted
confuse <- table(pred = predicted, true = data[,1])
print(confuse)
## export predicted labels to TSV
write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
quote=FALSE, sep="t", row.names=FALSE)
## export RF model to PMML
saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
Pattern – create a model in R
<?xml version="1.0"?>
<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.dmg.org/PMML-4_0
http://www.dmg.org/v4-0/pmml-4-0.xsd">
<Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
<Extension name="user" value="ceteri" extender="Rattle/PMML"/>
<Application name="Rattle/PMML" version="1.2.30"/>
<Timestamp>2012-10-22 19:39:28</Timestamp>
</Header>
<DataDictionary numberOfFields="4">
<DataField name="label" optype="categorical" dataType="string">
<Value value="0"/>
<Value value="1"/>
</DataField>
<DataField name="var0" optype="continuous" dataType="double"/>
<DataField name="var1" optype="continuous" dataType="double"/>
<DataField name="var2" optype="continuous" dataType="double"/>
</DataDictionary>
<MiningModel modelName="randomForest_Model" functionName="classification">
<MiningSchema>
<MiningField name="label" usageType="predicted"/>
<MiningField name="var0" usageType="active"/>
<MiningField name="var1" usageType="active"/>
<MiningField name="var2" usageType="active"/>
</MiningSchema>
<Segmentation multipleModelMethod="majorityVote">
<Segment id="1">
<True/>
<TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">
<MiningSchema>
<MiningField name="label" usageType="predicted"/>
<MiningField name="var0" usageType="active"/>
<MiningField name="var1" usageType="active"/>
<MiningField name="var2" usageType="active"/>
</MiningSchema>
...
Pattern – capture model parameters as PMML
public static void main( String[] args ) throws RuntimeException {
String inputPath = args[ 0 ];
String classifyPath = args[ 1 ];
// set up the config properties
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
// create source and sink taps
Tap inputTap = new Hfs( new TextDelimited( true, "t" ), inputPath );
Tap classifyTap = new Hfs( new TextDelimited( true, "t" ), classifyPath );
// handle command line options
OptionParser optParser = new OptionParser();
optParser.accepts( "pmml" ).withRequiredArg();
OptionSet options = optParser.parse( args );
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
.addSource( "input", inputTap )
.addSink( "classify", classifyTap );
if( options.hasArgument( "pmml" ) ) {
String pmmlPath = (String) options.valuesOf( "pmml" ).get( 0 );
PMMLPlanner pmmlPlanner = new PMMLPlanner()
.setPMMLInput( new File( pmmlPath ) )
.retainOnlyActiveIncomingFields()
.setDefaultPredictedField( new Fields( "predict", Double.class ) ); // default value if missing from the model
flowDef.addAssemblyPlanner( pmmlPlanner );
}
// write a DOT file and run the flow
Flow classifyFlow = flowConnector.connect( flowDef );
classifyFlow.writeDOT( "dot/classify.dot" );
classifyFlow.complete();
}
Pattern – score a model, within an app
Customer
Orders
Classify
Scored
Orders
GroupBy
token
Count
PMML
Model
M R
Failure
Traps
Assert
Confusion
Matrix
Pattern – score a model, using pre-defined Cascading app
cascading.org/pattern
## run an RF classifier at scale
hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap 
--pmml data/sample.rf.xml
## run an RF classifier at scale, assert regression test, measure confusion matrix
hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap 
--pmml data/sample.rf.xml --assert --measure out/measure
## run a predictive model at scale, measure RMSE
hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap 
--pmml data/iris.lm_p.xml --rmse out/measure
Pattern – score a model, using pre-defined Cascading app
Roadmap – existing algorithms for scoring
Random Forest
Decision Trees
Linear Regression
GLM
Logistic Regression
K-Means Clustering
Hierarchical Clustering
Multinomial
Support Vector Machines (prepared for release)
also, model chaining and general support for ensembles
•
•
•
•
•
•
•
•
•
cascading.org/pattern
Roadmap – next priorities for scoring
Time Series (ARIMA forecast)
Association Rules (basket analysis)
Naïve Bayes
Neural Networks
algorithms extended based on customer use cases –
contact groups.google.com/forum/?fromgroups#!forum/pattern-user
•
•
•
•
cascading.org/pattern
Roadmap – top priorities for creating models at scale
Random Forest
Logistic Regression
K-Means Clustering
Association Rules
…plus all models which can be trained via sparse matrix factorization
(TQSR => PCA, SVD least squares, etc.)
a wealth of recent research indicates many opportunities
to parallelize popular algorithms for training models at scale
on Apache Hadoop…
•
•
•
•
cascading.org/pattern
Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads
Cascading: background
The Workflow Abstraction
PMML: Predictive Model Markup
Pattern: PMML in Cascading
PMML for Customer Experiments
Ensemble Models with Pattern
Workflow Design Pattern
Experiments – comparing models
much customer interest in leveraging Cascading and
Apache Hadoop to run customer experiments at scale
run multiple variants, then measure relative “lift”
Concurrent runtime – tag and track models
the following example compares two models trained
with different machine learning algorithms
this is exaggerated, one has an important variable
intentionally omitted to help illustrate the experiment
•
•
•
## train a Random Forest model
## example: http://mkseo.pe.kr/stats/?p=220
f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
print(fit)
saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/"))
Experiments – Random Forest model
OOB estimate of error rate: 14%
Confusion matrix:
0 1 class.error
0 69 16 0.1882353
1 12 103 0.1043478
## train a Logistic Regression model (special case of GLM)
## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r
f <- as.formula("as.factor(label) ~ var0 + var2")
fit <- glm(f, family=binomial, data=data)
print(summary(fit))
saveXML(pmml(fit), file=paste(out_folder, "sample.lr.xml", sep="/"))
Experiments – Logistic Regression model
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.8524 0.3803 4.871 1.11e-06 ***
var0 -1.3755 0.4355 -3.159 0.00159 **
var2 -3.7742 0.5794 -6.514 7.30e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01
‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
NB: this model has “var1” intentionally omitted
Experiments – comparing results
use a confusion matrix to compare results for the classifiers
Logistic Regression has a lower “false negative” rate (5% vs. 11%)
however it has a much higher “false positive” rate (52% vs. 14%)
assign a cost model to select a winner –
for example, in an ecommerce anti-fraud classifier:
FN ∼ chargeback risk
FP ∼ customer support costs
•
•
•
Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads
Cascading: background
The Workflow Abstraction
PMML: Predictive Model Markup
Pattern: PMML in Cascading
PMML for Customer Experiments
Ensemble Models with Pattern
Workflow Design Pattern
Two Cultures
“A new research community using these tools sprang up. Their goal
was predictive accuracy. The community consisted of young computer
scientists, physicists and engineers plus a few aging statisticians.
They began using the new tools in working on complex prediction
problems where it was obvious that data models were not applicable:
speech recognition, image recognition, nonlinear time series prediction,
handwriting recognition, prediction in financial markets.”
Statistical Modeling: The Two Cultures
Leo Breiman, 2001
bit.ly/eUTh9L
in other words, seeing the forest for the trees…
this paper chronicled a sea change from data modeling practices
(silos, manual process) to the rising use of algorithmic modeling
(machine data for automation/optimization)
Why Do Ensembles Matter?
The World…
per Data Modeling
The World…
Algorithmic Modeling
“The trick to being a scientist is to be open to using
a wide variety of tools.” – Breiman
circa 2001: Random Forest, bootstrap aggregation, etc.,
yield dramatic increases in predictive power over earlier
modeling such as Logistic Regression
major learnings from the Netflix Prize: the power of
ensembles, model chaining, etc.
the problems at hand have become simply too big and too
complex for ONE distribution, ONE model, ONE team…
Ensemble Models
Breiman: “a multiplicity of data models”
BellKor team: 100+ individual models in 2007 Progress Prize
while the process of combining models adds complexity
(making it more difficult to anticipate or explain predictions)
accuracy may increase substantially
Ensemble Learning: Better Predictions Through Diversity
Todd Holloway
ETech (2008)
abeautifulwww.com/EnsembleLearningETech.pdf
The Story of the Netflix Prize: An Ensemblers Tale
Lester Mackey
National Academies Seminar, Washington, DC (2011)
stanford.edu/~lmackey/papers/
KDD 2013 PMML Workshop
Pattern: PMML for Cascading and Hadoop
Paco Nathan, Girish Kathalagiri
Chicago, 2013-08-11 (accepted)
19th ACM SIGKDD
Conference on Knowledge Discovery
and Data Mining
kdd13pmml.wordpress.com
Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads
Cascading: background
The Workflow Abstraction
PMML: Predictive Model Markup
Pattern: PMML in Cascading
PMML for Customer Experiments
Ensemble Models with Pattern
Workflow Design Pattern
Anatomy of an Enterprise app
Definition a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
uses
Anatomy of an Enterprise app
Definition a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
uses
ANSI SQL for ETL
Anatomy of an Enterprise app
Definition a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
usesJ2EE for business logic
Anatomy of an Enterprise app
Definition a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
uses
SAS for predictive models
Anatomy of an Enterprise app
Definition a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
uses
SAS for predictive modelsANSI SQL for ETL most of the licensing costs…
Anatomy of an Enterprise app
Definition a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
usesJ2EE for business logic
most of the project costs…
ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
SAS, R, etc. → PMML
business logic in Java,
Clojure, Scala, etc.
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
a compiler sees it all…
cascading.org
a compiler sees it all…
ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
SAS, R, etc. → PMML
business logic in Java,
Clojure, Scala, etc.
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
FlowDef flowDef = FlowDef.flowDef()
.setName( "etl" )
.addSource( "example.employee", emplTap )
.addSource( "example.sales", salesTap )
.addSink( "results", resultsTap );
SQLPlanner sqlPlanner = new SQLPlanner()
.setSql( sqlStatement );
flowDef.addAssemblyPlanner( sqlPlanner );
cascading.org
a compiler sees it all…
ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
SAS, R, etc. → PMML
business logic in Java,
Clojure, Scala, etc.
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
FlowDef flowDef = FlowDef.flowDef()
.setName( "classifier" )
.addSource( "input", inputTap )
.addSink( "classify", classifyTap );
PMMLPlanner pmmlPlanner = new PMMLPlanner()
.setPMMLInput( new File( pmmlModel ) )
.retainOnlyActiveIncomingFields();
flowDef.addAssemblyPlanner( pmmlPlanner );
cascading.org
ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
SAS, R, etc. → PMML
business logic in Java,
Clojure, Scala, etc.
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
visual collaboration for the business logic is a great way
to improve how teams work together
Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads
ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
SAS, R, etc. → PMML
business logic in Java,
Clojure, Scala, etc.
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads
multiple departments, working in their respective
frameworks, integrate results into a combined app,
which runs at scale on a cluster… business process
combined in a common space (DAG) for flow
planners, compiler, optimization, troubleshooting,
exception handling, notifications, security audit,
performance monitoring, etc.
cascading.org
Enterprise Data Workflows
with Cascading
O’Reilly, 2013
amazon.com/dp/1449358721
references…
newsletter updates:
liber118.com/pxn/
Many thanks to others who have contributed code,
ideas, suggestions, etc., to Pattern:
Chris Wensel @ Concurrent
Girish Kathalagiri @ AgilOne
Vijay Srinivas Agneeswaran @ Impetus
Chris Severs @ eBay
Ofer Mendelevitch @ Hortonworks
Sergey Boldyrev @ Nokia
Quinton Anderson @ IZAZI Solutions
Chris Gutierrez @ Airbnb
Villu Ruusmann @ JPMML project
•
•
•
•
•
•
•
•
•
acknowledgements…
blog, developer community, code/wiki/gists, maven repo,
commercial products, etc.:
cascading.org
zest.to/group11
github.com/Cascading
conjars.org
goo.gl/KQtUL
concurrentinc.com
drill-down…

Más contenido relacionado

Destacado

sas-capital-requirements-for-market-risk-108541
sas-capital-requirements-for-market-risk-108541sas-capital-requirements-for-market-risk-108541
sas-capital-requirements-for-market-risk-108541
Robert Markham
 
Build a Better Service-Level Agreement
Build a Better Service-Level AgreementBuild a Better Service-Level Agreement
Build a Better Service-Level Agreement
Act-On Software
 

Destacado (20)

20160512 predictive and adaptive approach
20160512   predictive and adaptive approach20160512   predictive and adaptive approach
20160512 predictive and adaptive approach
 
Bio variance j_scheiber_bioit_repurposingworkshop2013_draft
Bio variance j_scheiber_bioit_repurposingworkshop2013_draftBio variance j_scheiber_bioit_repurposingworkshop2013_draft
Bio variance j_scheiber_bioit_repurposingworkshop2013_draft
 
Predictive Modeling in Underwriting
Predictive Modeling in UnderwritingPredictive Modeling in Underwriting
Predictive Modeling in Underwriting
 
Bayesian models in r
Bayesian models in rBayesian models in r
Bayesian models in r
 
Data Management as a Strategic Initiative for Government
Data Management as a Strategic Initiative for GovernmentData Management as a Strategic Initiative for Government
Data Management as a Strategic Initiative for Government
 
TATA Teleservices - SAS Forum India: Enhancing Marketing Performance to drive...
TATA Teleservices - SAS Forum India: Enhancing Marketing Performance to drive...TATA Teleservices - SAS Forum India: Enhancing Marketing Performance to drive...
TATA Teleservices - SAS Forum India: Enhancing Marketing Performance to drive...
 
QlikTalk: QlikView in Legal
QlikTalk: QlikView in LegalQlikTalk: QlikView in Legal
QlikTalk: QlikView in Legal
 
Business Discovery and QlikView 11
Business Discovery and QlikView 11Business Discovery and QlikView 11
Business Discovery and QlikView 11
 
Customer Management - A Practioners Perspective
Customer Management - A Practioners PerspectiveCustomer Management - A Practioners Perspective
Customer Management - A Practioners Perspective
 
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nycData Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
 
Business analytics !!
Business analytics !!Business analytics !!
Business analytics !!
 
sas-capital-requirements-for-market-risk-108541
sas-capital-requirements-for-market-risk-108541sas-capital-requirements-for-market-risk-108541
sas-capital-requirements-for-market-risk-108541
 
Risk Management
Risk ManagementRisk Management
Risk Management
 
SAS Forum India: Building for Success: The Foundation for Achievable Master D...
SAS Forum India: Building for Success: The Foundation for Achievable Master D...SAS Forum India: Building for Success: The Foundation for Achievable Master D...
SAS Forum India: Building for Success: The Foundation for Achievable Master D...
 
Build a Better Service-Level Agreement
Build a Better Service-Level AgreementBuild a Better Service-Level Agreement
Build a Better Service-Level Agreement
 
SAS Operations Risk Solutions
SAS Operations Risk SolutionsSAS Operations Risk Solutions
SAS Operations Risk Solutions
 
Streamlining ITSM Operations Between ServiceNow and SAP Solution Manager
Streamlining ITSM Operations Between ServiceNow and SAP Solution ManagerStreamlining ITSM Operations Between ServiceNow and SAP Solution Manager
Streamlining ITSM Operations Between ServiceNow and SAP Solution Manager
 
Analytics for High Performance Organizations
Analytics for High Performance OrganizationsAnalytics for High Performance Organizations
Analytics for High Performance Organizations
 
Anticipate & Manage Change with Business Analytics
Anticipate & Manage Change with Business AnalyticsAnticipate & Manage Change with Business Analytics
Anticipate & Manage Change with Business Analytics
 
Protect Your Revenue Streams: Big Data & Analytics in Tax
Protect Your Revenue Streams: Big Data & Analytics in TaxProtect Your Revenue Streams: Big Data & Analytics in Tax
Protect Your Revenue Streams: Big Data & Analytics in Tax
 

Similar a Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop

Similar a Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop (20)

The Cascading (big) data application framework
The Cascading (big) data application frameworkThe Cascading (big) data application framework
The Cascading (big) data application framework
 
The Cascading (big) data application framework - André Keple, Sr. Engineer, C...
The Cascading (big) data application framework - André Keple, Sr. Engineer, C...The Cascading (big) data application framework - André Keple, Sr. Engineer, C...
The Cascading (big) data application framework - André Keple, Sr. Engineer, C...
 
PDX Hadoop: Enterprise Data Workflows with Cascading and Mesos
PDX Hadoop: Enterprise Data Workflows with Cascading and MesosPDX Hadoop: Enterprise Data Workflows with Cascading and Mesos
PDX Hadoop: Enterprise Data Workflows with Cascading and Mesos
 
OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data
OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open DataOSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data
OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data
 
Using Cascalog to build an app with City of Palo Alto Open Data
Using Cascalog to build an app with City of Palo Alto Open DataUsing Cascalog to build an app with City of Palo Alto Open Data
Using Cascalog to build an app with City of Palo Alto Open Data
 
July Clojure Users Group Meeting: "Using Cascalog with Palo Alto Open Data"
July Clojure Users Group Meeting: "Using Cascalog with Palo Alto Open Data"July Clojure Users Group Meeting: "Using Cascalog with Palo Alto Open Data"
July Clojure Users Group Meeting: "Using Cascalog with Palo Alto Open Data"
 
Intro To Cascading
Intro To CascadingIntro To Cascading
Intro To Cascading
 
Boulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
Boulder/Denver BigData: Cluster Computing with Apache Mesos and CascadingBoulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
Boulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
 
Functional programming
 for optimization problems 
in Big Data
Functional programming
  for optimization problems 
in Big DataFunctional programming
  for optimization problems 
in Big Data
Functional programming
 for optimization problems 
in Big Data
 
Windows Phone app development overview
Windows Phone app development overviewWindows Phone app development overview
Windows Phone app development overview
 
Getting Started on Hadoop
Getting Started on HadoopGetting Started on Hadoop
Getting Started on Hadoop
 
Apache Flink Stream Processing
Apache Flink Stream ProcessingApache Flink Stream Processing
Apache Flink Stream Processing
 
Reducing Development Time for Production-Grade Hadoop Applications
Reducing Development Time for Production-Grade Hadoop ApplicationsReducing Development Time for Production-Grade Hadoop Applications
Reducing Development Time for Production-Grade Hadoop Applications
 
Hadoop and Cascading At AJUG July 2009
Hadoop and Cascading At AJUG July 2009Hadoop and Cascading At AJUG July 2009
Hadoop and Cascading At AJUG July 2009
 
Scalding Big (Ad)ta
Scalding Big (Ad)taScalding Big (Ad)ta
Scalding Big (Ad)ta
 
Towards sql for streams
Towards sql for streamsTowards sql for streams
Towards sql for streams
 
Developing real-time data pipelines with Spring and Kafka
Developing real-time data pipelines with Spring and KafkaDeveloping real-time data pipelines with Spring and Kafka
Developing real-time data pipelines with Spring and Kafka
 
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
 
Google cloud Dataflow & Apache Flink
Google cloud Dataflow & Apache FlinkGoogle cloud Dataflow & Apache Flink
Google cloud Dataflow & Apache Flink
 
Apache Kafka vs. Cloud-native iPaaS Integration Platform Middleware
Apache Kafka vs. Cloud-native iPaaS Integration Platform MiddlewareApache Kafka vs. Cloud-native iPaaS Integration Platform Middleware
Apache Kafka vs. Cloud-native iPaaS Integration Platform Middleware
 

Más de DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

Más de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 

Último (20)

MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 

Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop

  • 1. Paco Nathan Concurrent, Inc. San Francisco, CA @pacoid “Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop”
  • 2. Failure Traps bonus allocation employee PMML classifier quarterly sales Join Count leads Cascading: background The Workflow Abstraction PMML: Predictive Model Markup Pattern: PMML in Cascading PMML for Customer Experiments Ensemble Models with Pattern Workflow Design Pattern
  • 3. Cascading – origins API author Chris Wensel worked as a system architect at an Enterprise firm well-known for many popular data products. Wensel was following the Nutch open source project – where Hadoop started. Observation: would be difficult to find Java developers to write complex Enterprise apps in MapReduce – potential blocker for leveraging new open source technology.
  • 4. Cascading – functional programming Key insight: MapReduce is based on functional programming – back to LISP in 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature. To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007, as a new Java API to implement functional programming for large-scale data workflows: leverages JVM and Java-based tools without any need to create new languages allows programmers who have J2EE expertise to leverage the economics of Hadoop clusters • •
  • 5. Hadoop Cluster source tap source tap sink tap trap tap customer profile DBsCustomer Prefs logs logs Logs Data Workflow Cache Customers Support Web App Reporting Analytics Cubes sink tap Modeling PMML Cascading – definitions a pattern language for Enterprise Data Workflows simple to build, easy to test, robust in production design principles ⟹ ensure best practices at scale • • •
  • 6. Hadoop Cluster source tap source tap sink tap trap tap customer profile DBsCustomer Prefs logs logs Logs Data Workflow Cache Customers Support Web App Reporting Analytics Cubes sink tap Modeling PMML Cascading – usage Java API, DSLs in Scala, Clojure, Jython, JRuby, Groovy, ANSI SQL ASL 2 license, GitHub src, http://conjars.org 5+ yrs production use, multiple Enterprise verticals • • •
  • 7. Hadoop Cluster source tap source tap sink tap trap tap customer profile DBsCustomer Prefs logs logs Logs Data Workflow Cache Customers Support Web App Reporting Analytics Cubes sink tap Modeling PMML Cascading – integrations partners: Microsoft Azure, Hortonworks, Amazon AWS, MapR, EMC, SpringSource, Cloudera taps: Memcached, Cassandra, MongoDB, HBase, JDBC, Parquet, etc. serialization: Avro, Thrift, Kryo, JSON, etc. topologies: Apache Hadoop, tuple spaces, local mode • • • •
  • 8. Cascading – deployments case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, Factual, etc. use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc. • •
  • 9. Cascading – deployments case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, Factual, etc. use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc. • • workflow abstraction addresses: • staffing bottleneck; • system integration; • operational complexity; • test-driven development
  • 10. Failure Traps bonus allocation employee PMML classifier quarterly sales Join Count leads Cascading: background The Workflow Abstraction PMML: Predictive Model Markup Pattern: PMML in Cascading PMML for Customer Experiments Ensemble Models with Pattern Workflow Design Pattern
  • 11. Hadoop Cluster source tap source tap sink tap trap tap customer profile DBsCustomer Prefs logs logs Logs Data Workflow Cache Customers Support Web App Reporting Analytics Cubes sink tap Modeling PMML Enterprise Data Workflows Let’s consider a “strawman” architecture for an example app… at the front end LOB use cases drive demand for apps
  • 12. Hadoop Cluster source tap source tap sink tap trap tap customer profile DBsCustomer Prefs logs logs Logs Data Workflow Cache Customers Support Web App Reporting Analytics Cubes sink tap Modeling PMML Enterprise Data Workflows Same example… in the back office Organizations have substantial investments in people, infrastructure, process
  • 13. Hadoop Cluster source tap source tap sink tap trap tap customer profile DBsCustomer Prefs logs logs Logs Data Workflow Cache Customers Support Web App Reporting Analytics Cubes sink tap Modeling PMML Enterprise Data Workflows Same example… the heavy lifting! “Main Street” firms are migrating workflows to Hadoop, for cost savings and scale-out
  • 14. Hadoop Cluster source tap source tap sink tap trap tap customer profile DBsCustomer Prefs logs logs Logs Data Workflow Cache Customers Support Web App Reporting Analytics Cubes sink tap Modeling PMML Cascading workflows – taps taps integrate other data frameworks, as tuple streams these are “plumbing” endpoints in the pattern language sources (inputs), sinks (outputs), traps (exceptions) text delimited, JDBC, Memcached, HBase, Cassandra, MongoDB, etc. data serialization: Avro, Thrift, Kryo, JSON, etc. extend a new kind of tap in just a few lines of Java schema and provenance get derived from analysis of the taps • • • • • •
  • 15. Cascading workflows – taps String docPath = args[ 0 ]; String wcPath = args[ 1 ]; Properties properties = new Properties(); AppProps.setApplicationJarClass( properties, Main.class ); HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties ); // create source and sink taps Tap docTap = new Hfs( new TextDelimited( true, "t" ), docPath ); Tap wcTap = new Hfs( new TextDelimited( true, "t" ), wcPath ); // specify a regex to split "document" text lines into token stream Fields token = new Fields( "token" ); Fields text = new Fields( "text" ); RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ [] (),.]" ); // only returns "token" Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS ); // determine the word counts Pipe wcPipe = new Pipe( "wc", docPipe ); wcPipe = new GroupBy( wcPipe, token ); wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL ); // connect the taps, pipes, etc., into a flow FlowDef flowDef = FlowDef.flowDef().setName( "wc" ) .addSource( docPipe, docTap ) .addTailSink( wcPipe, wcTap ); // write a DOT file and run the flow Flow wcFlow = flowConnector.connect( flowDef ); wcFlow.writeDOT( "dot/wc.dot" ); wcFlow.complete(); source and sink taps for TSV data in HDFS
  • 16. Hadoop Cluster source tap source tap sink tap trap tap customer profile DBsCustomer Prefs logs logs Logs Data Workflow Cache Customers Support Web App Reporting Analytics Cubes sink tap Modeling PMML Cascading workflows – topologies topologies execute workflows on clusters flow planner is like a compiler for queries Hadoop (MapReduce jobs) local mode (dev/test or special config) in-memory data grids (real-time) flow planner can be extended to support other topologies blend flows in different topologies into the same app – for example, batch (Hadoop) + transactions (IMDG) • • - - - •
  • 17. Cascading workflows – topologies String docPath = args[ 0 ]; String wcPath = args[ 1 ]; Properties properties = new Properties(); AppProps.setApplicationJarClass( properties, Main.class ); HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties ); // create source and sink taps Tap docTap = new Hfs( new TextDelimited( true, "t" ), docPath ); Tap wcTap = new Hfs( new TextDelimited( true, "t" ), wcPath ); // specify a regex to split "document" text lines into token stream Fields token = new Fields( "token" ); Fields text = new Fields( "text" ); RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ [] (),.]" ); // only returns "token" Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS ); // determine the word counts Pipe wcPipe = new Pipe( "wc", docPipe ); wcPipe = new GroupBy( wcPipe, token ); wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL ); // connect the taps, pipes, etc., into a flow FlowDef flowDef = FlowDef.flowDef().setName( "wc" ) .addSource( docPipe, docTap ) .addTailSink( wcPipe, wcTap ); // write a DOT file and run the flow Flow wcFlow = flowConnector.connect( flowDef ); wcFlow.writeDOT( "dot/wc.dot" ); wcFlow.complete(); flow planner for Apache Hadoop topology
  • 18. Hadoop Cluster source tap source tap sink tap trap tap customer profile DBsCustomer Prefs logs logs Logs Data Workflow Cache Customers Support Web App Reporting Analytics Cubes sink tap Modeling PMML Cascading workflows – test-driven development assert patterns (regex) on the tuple streams adjust assert levels, like log4j levels trap edge cases as “data exceptions” TDD at scale: start from raw inputs in the flow graph define stream assertions for each stage of transforms verify exceptions, code to remove them when impl is complete, app has full test coverage redirect traps in production to Ops, QA, Support, Audit, etc. • • • • 1. 2. 3. 4.
  • 19. Workflow Abstraction – pattern language Cascading uses a “plumbing” metaphor in the Java API, to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc. Scrub token Document Collection Tokenize Word Count GroupBy token Count Stop Word List Regex token HashJoin Left RHS M R Data is represented as flows of tuples. Operations within the flows bring functional programming aspects into Java In formal terms, this provides a pattern language
  • 20. Pattern Language structured method for solving large, complex design problems, where the syntax of the language ensures the use of best practices – i.e., conveying expertise Failure Traps bonus allocation employee PMML classifier quarterly sales Join Count leads A Pattern Language Christopher Alexander, et al. amazon.com/dp/0195019199
  • 21. Workflow Abstraction – literate programming Cascading workflows generate their own visual documentation: flow diagrams in formal terms, flow diagrams leverage a methodology called literate programming provides intuitive, visual representations for apps – great for cross-team collaboration Scrub token Document Collection Tokenize Word Count GroupBy token Count Stop Word List Regex token HashJoin Left RHS M R
  • 22. Literate Programming by Don Knuth Literate Programming Univ of Chicago Press, 1992 literateprogramming.com/ “Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”
  • 23. Workflow Abstraction – business process following the essence of literate programming, Cascading workflows provide statements of business process this recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data) Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.) this is especially apparent in large-scale Cascalog apps: “Specify what you require, not how to achieve it.” by virtue of the pattern language, the flow planner then determines how to translate business process into efficient, parallel jobs at scale
  • 24. Business Process by Edgar Codd “A relational model of data for large shared data banks” Communications of the ACM, 1970 dl.acm.org/citation.cfm?id=362685 rather than arguing between SQL vs. NoSQL… structured vs. unstructured data frameworks… this approach focuses on what apps do: the process of structuring data
  • 25. Cascading – functional programming Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc., have invested in open source projects atop Cascading – used for their large-scale production deployments new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming: Cascalog in Clojure (2010) Scalding in Scala (2012) github.com/nathanmarz/cascalog/wiki github.com/twitter/scalding/wiki • • Why Adopting the Declarative Programming Practices Will Improve Your Return from Technology Dan Woods, 2013-04-17 Forbes forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming- practices-will-improve-your-return-from-technology/
  • 26. Functional Programming for Big Data WordCount with token scrubbing… Apache Hive: 52 lines HQL + 8 lines Python (UDF) compared to Scalding: 18 lines Scala/Cascading functional programming languages help reduce software engineering costs at scale, over time
  • 27. Two Avenues to the App Layer… scale ➞ complexity➞ Enterprise: must contend with complexity at scale everyday… incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff Start-ups: crave complexity and scale to become viable… new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding
  • 28. Failure Traps bonus allocation employee PMML classifier quarterly sales Join Count leads Cascading: background The Workflow Abstraction PMML: Predictive Model Markup Pattern: PMML in Cascading PMML for Customer Experiments Ensemble Models with Pattern Workflow Design Pattern
  • 29. established XML standard for predictive model markup organized by Data Mining Group (DMG), since 1997 http://dmg.org/ members: IBM, SAS, Visa, NASA, Equifax, Microstrategy, Microsoft, etc. PMML concepts for metadata, ensembles, etc., translate directly into Cascading tuple flows “PMML is the leading standard for statistical and data mining models and supported by over 20 vendors and organizations. With PMML, it is easy to develop a model on one system using one application and deploy the model on another system using another application.” • • • • PMML – standard wikipedia.org/wiki/Predictive_Model_Markup_Language
  • 30. Association Rules: AssociationModel element Cluster Models: ClusteringModel element Decision Trees: TreeModel element Naïve Bayes Classifiers: NaiveBayesModel element Neural Networks: NeuralNetwork element Regression: RegressionModel and GeneralRegressionModel elements Rulesets: RuleSetModel element Sequences: SequenceModel element Support Vector Machines: SupportVectorMachineModel element Text Models: TextModel element Time Series: TimeSeriesModel element • • • • • • • • • • • PMML – model coverage ibm.com/developerworks/industry/library/ind-PMML2/
  • 31. PMML – vendor coverage
  • 32. Failure Traps bonus allocation employee PMML classifier quarterly sales Join Count leads Cascading: background The Workflow Abstraction PMML: Predictive Model Markup Pattern: PMML in Cascading PMML for Customer Experiments Ensemble Models with Pattern Workflow Design Pattern
  • 33. Hadoop Cluster source tap source tap sink tap trap tap customer profile DBsCustomer Prefs logs logs Logs Data Workflow Cache Customers Support Web App Reporting Analytics Cubes sink tap Modeling PMML Pattern – model scoring migrate workloads: SAS,Teradata, etc., exporting predictive models as PMML great open source tools – R, Weka, KNIME, Matlab, RapidMiner, etc. integrate with other libraries – Matrix API, etc. leverage PMML as another kind of DSL • • • • cascading.org/pattern
  • 34. ## train a RandomForest model f <- as.formula("as.factor(label) ~ .") fit <- randomForest(f, data_train, ntree=50) ## test the model on the holdout test set print(fit$importance) print(fit) predicted <- predict(fit, data) data$predicted <- predicted confuse <- table(pred = predicted, true = data[,1]) print(confuse) ## export predicted labels to TSV write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"), quote=FALSE, sep="t", row.names=FALSE) ## export RF model to PMML saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/")) Pattern – create a model in R
  • 35. <?xml version="1.0"?> <PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.dmg.org/PMML-4_0 http://www.dmg.org/v4-0/pmml-4-0.xsd"> <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model"> <Extension name="user" value="ceteri" extender="Rattle/PMML"/> <Application name="Rattle/PMML" version="1.2.30"/> <Timestamp>2012-10-22 19:39:28</Timestamp> </Header> <DataDictionary numberOfFields="4"> <DataField name="label" optype="categorical" dataType="string"> <Value value="0"/> <Value value="1"/> </DataField> <DataField name="var0" optype="continuous" dataType="double"/> <DataField name="var1" optype="continuous" dataType="double"/> <DataField name="var2" optype="continuous" dataType="double"/> </DataDictionary> <MiningModel modelName="randomForest_Model" functionName="classification"> <MiningSchema> <MiningField name="label" usageType="predicted"/> <MiningField name="var0" usageType="active"/> <MiningField name="var1" usageType="active"/> <MiningField name="var2" usageType="active"/> </MiningSchema> <Segmentation multipleModelMethod="majorityVote"> <Segment id="1"> <True/> <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit"> <MiningSchema> <MiningField name="label" usageType="predicted"/> <MiningField name="var0" usageType="active"/> <MiningField name="var1" usageType="active"/> <MiningField name="var2" usageType="active"/> </MiningSchema> ... Pattern – capture model parameters as PMML
  • 36. public static void main( String[] args ) throws RuntimeException { String inputPath = args[ 0 ]; String classifyPath = args[ 1 ]; // set up the config properties Properties properties = new Properties(); AppProps.setApplicationJarClass( properties, Main.class ); HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties ); // create source and sink taps Tap inputTap = new Hfs( new TextDelimited( true, "t" ), inputPath ); Tap classifyTap = new Hfs( new TextDelimited( true, "t" ), classifyPath ); // handle command line options OptionParser optParser = new OptionParser(); optParser.accepts( "pmml" ).withRequiredArg(); OptionSet options = optParser.parse( args ); // connect the taps, pipes, etc., into a flow FlowDef flowDef = FlowDef.flowDef().setName( "classify" ) .addSource( "input", inputTap ) .addSink( "classify", classifyTap ); if( options.hasArgument( "pmml" ) ) { String pmmlPath = (String) options.valuesOf( "pmml" ).get( 0 ); PMMLPlanner pmmlPlanner = new PMMLPlanner() .setPMMLInput( new File( pmmlPath ) ) .retainOnlyActiveIncomingFields() .setDefaultPredictedField( new Fields( "predict", Double.class ) ); // default value if missing from the model flowDef.addAssemblyPlanner( pmmlPlanner ); } // write a DOT file and run the flow Flow classifyFlow = flowConnector.connect( flowDef ); classifyFlow.writeDOT( "dot/classify.dot" ); classifyFlow.complete(); } Pattern – score a model, within an app
  • 38. ## run an RF classifier at scale hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap --pmml data/sample.rf.xml ## run an RF classifier at scale, assert regression test, measure confusion matrix hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap --pmml data/sample.rf.xml --assert --measure out/measure ## run a predictive model at scale, measure RMSE hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap --pmml data/iris.lm_p.xml --rmse out/measure Pattern – score a model, using pre-defined Cascading app
  • 39. Roadmap – existing algorithms for scoring Random Forest Decision Trees Linear Regression GLM Logistic Regression K-Means Clustering Hierarchical Clustering Multinomial Support Vector Machines (prepared for release) also, model chaining and general support for ensembles • • • • • • • • • cascading.org/pattern
  • 40. Roadmap – next priorities for scoring Time Series (ARIMA forecast) Association Rules (basket analysis) Naïve Bayes Neural Networks algorithms extended based on customer use cases – contact groups.google.com/forum/?fromgroups#!forum/pattern-user • • • • cascading.org/pattern
  • 41. Roadmap – top priorities for creating models at scale Random Forest Logistic Regression K-Means Clustering Association Rules …plus all models which can be trained via sparse matrix factorization (TQSR => PCA, SVD least squares, etc.) a wealth of recent research indicates many opportunities to parallelize popular algorithms for training models at scale on Apache Hadoop… • • • • cascading.org/pattern
  • 42. Failure Traps bonus allocation employee PMML classifier quarterly sales Join Count leads Cascading: background The Workflow Abstraction PMML: Predictive Model Markup Pattern: PMML in Cascading PMML for Customer Experiments Ensemble Models with Pattern Workflow Design Pattern
  • 43. Experiments – comparing models much customer interest in leveraging Cascading and Apache Hadoop to run customer experiments at scale run multiple variants, then measure relative “lift” Concurrent runtime – tag and track models the following example compares two models trained with different machine learning algorithms this is exaggerated, one has an important variable intentionally omitted to help illustrate the experiment • • •
  • 44. ## train a Random Forest model ## example: http://mkseo.pe.kr/stats/?p=220 f <- as.formula("as.factor(label) ~ var0 + var1 + var2") fit <- randomForest(f, data=data, proximity=TRUE, ntree=25) print(fit) saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/")) Experiments – Random Forest model OOB estimate of error rate: 14% Confusion matrix: 0 1 class.error 0 69 16 0.1882353 1 12 103 0.1043478
  • 45. ## train a Logistic Regression model (special case of GLM) ## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r f <- as.formula("as.factor(label) ~ var0 + var2") fit <- glm(f, family=binomial, data=data) print(summary(fit)) saveXML(pmml(fit), file=paste(out_folder, "sample.lr.xml", sep="/")) Experiments – Logistic Regression model Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 1.8524 0.3803 4.871 1.11e-06 *** var0 -1.3755 0.4355 -3.159 0.00159 ** var2 -3.7742 0.5794 -6.514 7.30e-11 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 NB: this model has “var1” intentionally omitted
  • 46. Experiments – comparing results use a confusion matrix to compare results for the classifiers Logistic Regression has a lower “false negative” rate (5% vs. 11%) however it has a much higher “false positive” rate (52% vs. 14%) assign a cost model to select a winner – for example, in an ecommerce anti-fraud classifier: FN ∼ chargeback risk FP ∼ customer support costs • • •
  • 47. Failure Traps bonus allocation employee PMML classifier quarterly sales Join Count leads Cascading: background The Workflow Abstraction PMML: Predictive Model Markup Pattern: PMML in Cascading PMML for Customer Experiments Ensemble Models with Pattern Workflow Design Pattern
  • 48. Two Cultures “A new research community using these tools sprang up. Their goal was predictive accuracy. The community consisted of young computer scientists, physicists and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.” Statistical Modeling: The Two Cultures Leo Breiman, 2001 bit.ly/eUTh9L in other words, seeing the forest for the trees… this paper chronicled a sea change from data modeling practices (silos, manual process) to the rising use of algorithmic modeling (machine data for automation/optimization)
  • 49. Why Do Ensembles Matter? The World… per Data Modeling The World…
  • 50. Algorithmic Modeling “The trick to being a scientist is to be open to using a wide variety of tools.” – Breiman circa 2001: Random Forest, bootstrap aggregation, etc., yield dramatic increases in predictive power over earlier modeling such as Logistic Regression major learnings from the Netflix Prize: the power of ensembles, model chaining, etc. the problems at hand have become simply too big and too complex for ONE distribution, ONE model, ONE team…
  • 51. Ensemble Models Breiman: “a multiplicity of data models” BellKor team: 100+ individual models in 2007 Progress Prize while the process of combining models adds complexity (making it more difficult to anticipate or explain predictions) accuracy may increase substantially Ensemble Learning: Better Predictions Through Diversity Todd Holloway ETech (2008) abeautifulwww.com/EnsembleLearningETech.pdf The Story of the Netflix Prize: An Ensemblers Tale Lester Mackey National Academies Seminar, Washington, DC (2011) stanford.edu/~lmackey/papers/
  • 52. KDD 2013 PMML Workshop Pattern: PMML for Cascading and Hadoop Paco Nathan, Girish Kathalagiri Chicago, 2013-08-11 (accepted) 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining kdd13pmml.wordpress.com
  • 53. Failure Traps bonus allocation employee PMML classifier quarterly sales Join Count leads Cascading: background The Workflow Abstraction PMML: Predictive Model Markup Pattern: PMML in Cascading PMML for Customer Experiments Ensemble Models with Pattern Workflow Design Pattern
  • 54. Anatomy of an Enterprise app Definition a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end uses
  • 55. Anatomy of an Enterprise app Definition a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end uses ANSI SQL for ETL
  • 56. Anatomy of an Enterprise app Definition a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end usesJ2EE for business logic
  • 57. Anatomy of an Enterprise app Definition a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end uses SAS for predictive models
  • 58. Anatomy of an Enterprise app Definition a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end uses SAS for predictive modelsANSI SQL for ETL most of the licensing costs…
  • 59. Anatomy of an Enterprise app Definition a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end usesJ2EE for business logic most of the project costs…
  • 60. ETL data prep predictive model data sources end uses Lingual: DW → ANSI SQL Pattern: SAS, R, etc. → PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source a compiler sees it all… cascading.org
  • 61. a compiler sees it all… ETL data prep predictive model data sources end uses Lingual: DW → ANSI SQL Pattern: SAS, R, etc. → PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source FlowDef flowDef = FlowDef.flowDef() .setName( "etl" ) .addSource( "example.employee", emplTap ) .addSource( "example.sales", salesTap ) .addSink( "results", resultsTap ); SQLPlanner sqlPlanner = new SQLPlanner() .setSql( sqlStatement ); flowDef.addAssemblyPlanner( sqlPlanner ); cascading.org
  • 62. a compiler sees it all… ETL data prep predictive model data sources end uses Lingual: DW → ANSI SQL Pattern: SAS, R, etc. → PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source FlowDef flowDef = FlowDef.flowDef() .setName( "classifier" ) .addSource( "input", inputTap ) .addSink( "classify", classifyTap ); PMMLPlanner pmmlPlanner = new PMMLPlanner() .setPMMLInput( new File( pmmlModel ) ) .retainOnlyActiveIncomingFields(); flowDef.addAssemblyPlanner( pmmlPlanner );
  • 63. cascading.org ETL data prep predictive model data sources end uses Lingual: DW → ANSI SQL Pattern: SAS, R, etc. → PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source visual collaboration for the business logic is a great way to improve how teams work together Failure Traps bonus allocation employee PMML classifier quarterly sales Join Count leads
  • 64. ETL data prep predictive model data sources end uses Lingual: DW → ANSI SQL Pattern: SAS, R, etc. → PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source Failure Traps bonus allocation employee PMML classifier quarterly sales Join Count leads multiple departments, working in their respective frameworks, integrate results into a combined app, which runs at scale on a cluster… business process combined in a common space (DAG) for flow planners, compiler, optimization, troubleshooting, exception handling, notifications, security audit, performance monitoring, etc. cascading.org
  • 65. Enterprise Data Workflows with Cascading O’Reilly, 2013 amazon.com/dp/1449358721 references… newsletter updates: liber118.com/pxn/
  • 66. Many thanks to others who have contributed code, ideas, suggestions, etc., to Pattern: Chris Wensel @ Concurrent Girish Kathalagiri @ AgilOne Vijay Srinivas Agneeswaran @ Impetus Chris Severs @ eBay Ofer Mendelevitch @ Hortonworks Sergey Boldyrev @ Nokia Quinton Anderson @ IZAZI Solutions Chris Gutierrez @ Airbnb Villu Ruusmann @ JPMML project • • • • • • • • • acknowledgements…
  • 67. blog, developer community, code/wiki/gists, maven repo, commercial products, etc.: cascading.org zest.to/group11 github.com/Cascading conjars.org goo.gl/KQtUL concurrentinc.com drill-down…