ACM: Hands-On Workshop for Predictive Modeling and Enterprise Data Workflows with PMML and Cascading
2013-10-12
http://www.sfbayacm.org/event/hands-workshop-predictive-modeling-and-enterprise-data-workflows-pmml-and-cascading
ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop
1. ACM Big Data Mining Camp,
2013-10-12:
Cascading, Pattern, and PMML
Paco Nathan @pacoid
Chief Scientist, Mesosphere
2. Cascading, Pattern, and PMML:
1. PMML and R (30 min lab)
2. Cascading Overview (15 min)
3. Model Scoring (30 min lab)
4. < break/ >
5. Ensembles, Experiments, etc. (15 min)
6. Industry Practices (20 min)
7. Q & A
ACM, 2013-10-12
3. PMML – an industry standard
• established XML standard for predictive model markup
• organized by Data Mining Group (DMG), since 1997
  http://dmg.org/
• members: IBM, SAS, Visa, NASA, Equifax, MicroStrategy,
  Microsoft, etc.
• PMML concepts for metadata, ensembles, etc., translate
  directly into Cascading tuple flows
“PMML is the leading standard for statistical and data mining models and
supported by over 20 vendors and organizations. With PMML, it is easy
to develop a model on one system using one application and deploy the
model on another system using another application.”
wikipedia.org/wiki/Predictive_Model_Markup_Language
5. PMML – model coverage
• Association Rules: AssociationModel element
• Cluster Models: ClusteringModel element
• Decision Trees: TreeModel element
• Naïve Bayes Classifiers: NaiveBayesModel element
• Neural Networks: NeuralNetwork element
• Regression: RegressionModel and GeneralRegressionModel elements
• Rulesets: RuleSetModel element
• Sequences: SequenceModel element
• Support Vector Machines: SupportVectorMachineModel element
• Text Models: TextModel element
• Time Series: TimeSeriesModel element
ibm.com/developerworks/industry/library/ind-PMML2/
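Because each model family in the list above maps to a named XML element, a consumer can tell which kind of model a PMML file carries just by inspecting the children of the root element. The sketch below is a rough illustration using only the JDK's XML parser. The class name PmmlModelSniffer and the skeleton document in main are hypothetical; the skeleton is not a schema-valid PMML file and is not how Pattern itself parses models.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper: reports which PMML model elements appear in a document.
final class PmmlModelSniffer {
    // element names taken from the model coverage list above
    static final List<String> MODEL_ELEMENTS = Arrays.asList(
        "AssociationModel", "ClusteringModel", "TreeModel", "NaiveBayesModel",
        "NeuralNetwork", "RegressionModel", "GeneralRegressionModel",
        "RuleSetModel", "SequenceModel", "SupportVectorMachineModel",
        "TextModel", "TimeSeriesModel");

    // returns the model element names found directly under the root element
    static List<String> modelElements(String pmmlXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(pmmlXml.getBytes(StandardCharsets.UTF_8)));
            List<String> found = new ArrayList<>();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.ELEMENT_NODE
                        && MODEL_ELEMENTS.contains(child.getNodeName())) {
                    found.add(child.getNodeName());
                }
            }
            return found;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse PMML text", e);
        }
    }

    public static void main(String[] args) {
        // made-up skeleton for illustration only; real PMML needs more content
        String sample = "<PMML version=\"4.1\"><Header/><DataDictionary/>"
            + "<TreeModel functionName=\"classification\"/></PMML>";
        System.out.println(modelElements(sample));
    }
}
```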
6. PMML – create a model in R
## train a RandomForest model
f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=50)
## test the model on the holdout test set
print(fit$importance)
print(fit)
predicted <- predict(fit, data)
data$predicted <- predicted
confuse <- table(pred = predicted, true = data[,1])
print(confuse)
## export predicted labels to TSV
write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
  quote=FALSE, sep="\t", row.names=FALSE)
## export RF model to PMML
saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
8. PMML – further study
PMML in Action
Alex Guazzelli, Wen-Ching Lin, Tridivesh Jena
amazon.com/dp/1470003244
See also excellent resources at:
zementis.com/pmml.htm
9. Lab: RStudio and PMML in R
set up RStudio…
rstudio.com/ide/
use the Iris data to build predictive models…
• github.com/Cascading/pattern
  pattern-examples/examples/r/rattle_pmml.R
• test/train hold-outs
• evaluating predictive power
• export as PMML
10. Model: data prep based on “Iris”
library(pmml)
library(randomForest)
library(nnet)
library(XML)
library(kernlab)
## split data into test and train sets
data(iris)
iris_full <- iris
colnames(iris_full) <- c("sepal_length", "sepal_width", "petal_length", "petal_width", "species")
idx <- sample(150, 100)
iris_train <- iris_full[idx,]
iris_test <- iris_full[-idx,]
18. Anatomy of an Enterprise app
definition of a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
[workflow diagram: data sources → ETL → data prep → predictive model → end uses]
19. Anatomy of an Enterprise app
definition of a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ANSI SQL for ETL
[workflow diagram: data sources → ETL → data prep → predictive model → end uses]
20. Anatomy of an Enterprise app
definition of a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
Java, Pig for business logic
[workflow diagram: data sources → ETL → data prep → predictive model → end uses]
21. Anatomy of an Enterprise app
definition of a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
SAS for predictive models
[workflow diagram: data sources → ETL → data prep → predictive model → end uses]
22. Anatomy of an Enterprise app
definition of a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ANSI SQL for ETL, SAS for predictive models: most of the licensing costs…
[workflow diagram: data sources → ETL → data prep → predictive model → end uses]
23. Anatomy of an Enterprise app
definition of a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
Java, Pig for business logic: most of the project costs…
[workflow diagram: data sources → ETL → data prep → predictive model → end uses]
24. Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
cascading.org
• Lingual: DW → ANSI SQL
• business logic in Java, Clojure, Scala, etc.
• Pattern: SAS, R, etc. → PMML
• source taps for Cassandra, JDBC, Splunk, etc.
• sink taps for Memcached, HBase, MongoDB, etc.
a compiler sees it all… one connected DAG:
• optimization
• troubleshooting
• exception handling
• notifications
[workflow diagram: data sources → ETL → data prep → predictive model → end uses]
25. Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
cascading.org
• Lingual: DW → ANSI SQL
• business logic in Java, Clojure, Scala, etc.
• Pattern: SAS, R, etc. → PMML
• source taps for Cassandra, JDBC, Splunk, etc.
• sink taps for Memcached, HBase, MongoDB, etc.
a compiler sees it all…
[workflow diagram: data sources → ETL → data prep → predictive model → end uses]
FlowDef flowDef = FlowDef.flowDef()
  .setName( "etl" )
  .addSource( "example.employee", emplTap )
  .addSource( "example.sales", salesTap )
  .addSink( "results", resultsTap );
SQLPlanner sqlPlanner = new SQLPlanner()
  .setSql( sqlStatement );
flowDef.addAssemblyPlanner( sqlPlanner );
26. Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
• Lingual: DW → ANSI SQL
• business logic in Java, Clojure, Scala, etc.
• Pattern: SAS, R, etc. → PMML
• source taps for Cassandra, JDBC, Splunk, etc.
• sink taps for Memcached, HBase, MongoDB, etc.
a compiler sees it all…
[workflow diagram: data sources → ETL → data prep → predictive model → end uses]
FlowDef flowDef = FlowDef.flowDef()
  .setName( "classifier" )
  .addSource( "input", inputTap )
  .addSink( "classify", classifyTap );
PMMLPlanner pmmlPlanner = new PMMLPlanner()
  .setPMMLInput( new File( pmmlModel ) )
  .retainOnlyActiveIncomingFields();
flowDef.addAssemblyPlanner( pmmlPlanner );
27. Cascading – functional programming
Key insight: MapReduce is based on functional programming
– back to LISP in 1970s. Apache Hadoop use cases are
mostly about data pipelines, which are functional in nature.
to ease staffing problems as “Main Street” Enterprise firms
began to embrace Hadoop, Cascading was introduced
in late 2007, as a new Java API to implement functional
programming for large-scale data workflows:
• leverages JVM and Java-based tools without any
  need to create new languages
• allows programmers who have Java expertise
  to leverage the economics of Hadoop clusters
Edgar Codd alluded to this (DSLs for structuring data)
in his original paper about relational model
28. Cascading – functional programming
• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc.,
  have invested in open source projects atop Cascading –
  used for their large-scale production deployments
• new case studies for Cascading apps are mostly based on
  domain-specific languages (DSLs) in JVM languages which
  emphasize functional programming:
Cascalog in Clojure (2010)
Scalding in Scala (2012)
github.com/nathanmarz/cascalog/wiki
github.com/twitter/scalding/wiki
Why Adopting the Declarative Programming Practices Will Improve Your Return from Technology
Dan Woods, 2013-04-17 Forbes
forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-practices-will-improve-your-return-from-technology/
29. Functional Programming for Big Data
WordCount with token scrubbing…
Apache Hive: 52 lines HQL + 8 lines Python (UDF)
compared to
Scalding: 18 lines Scala/Cascading
functional programming languages help reduce
software engineering costs at scale, over time
30. Cascading – deployments
• case studies: Climate Corp, Twitter, Etsy,
  Williams-Sonoma, uSwitch, Airbnb, Nokia,
  YieldBot, Square, Harvard, Factual, etc.
• use cases: ETL, marketing funnel, anti-fraud,
  social media, retail pricing, search analytics,
  recommenders, eCRM, utility grids, telecom,
  genomics, climatology, agronomics, etc.
31. Workflow Abstraction – pattern language
Cascading uses a “plumbing” metaphor in Java
to define workflows out of familiar elements:
Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.
[flow diagram: Document Collection, Tokenize, Scrub token, HashJoin Left (RHS: Stop Word List), Regex token, GroupBy token, Count, Word Count; map (M) and reduce (R) phases marked]
data is represented as flows of tuples
operations in the flows bring functional
programming aspects into Java
A Pattern Language
Christopher Alexander, et al.
amazon.com/dp/0195019199
32. Workflow Abstraction – literate programming
Cascading workflows generate their own visual
documentation: flow diagrams
[flow diagram: Document Collection, Tokenize, Scrub token, HashJoin Left (RHS: Stop Word List), Regex token, GroupBy token, Count, Word Count; map (M) and reduce (R) phases marked]
in formal terms, flow diagrams leverage a methodology
called literate programming
provides intuitive, visual representations for apps –
great for cross-team collaboration
Literate Programming
Don Knuth
literateprogramming.com
33. Workflow Abstraction – business process
following the essence of literate programming, Cascading
workflows provide statements of business process
this recalls a sense of business process management
for Enterprise apps (think BPM/BPEL for Big Data)
Cascading creates a separation of concerns between
business process and implementation details (Hadoop, etc.)
this is especially apparent in large-scale Cascalog apps:
“Specify what you require, not how to achieve it.”
by virtue of the pattern language, the flow planner then
determines how to translate business process into efficient,
parallel jobs at scale
34. The Ubiquitous Word Count
Definition:
count how often each word appears
in a collection of text documents
this simple program provides an excellent test case
for parallel processing:
• requires a minimal amount of code
• demonstrates use of both symbolic and numeric values
• shows a dependency graph of tuples as an abstraction
• is not many steps away from useful search indexing
• serves as a “Hello World” for Hadoop apps
void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");

void reduce (String word, Iterator group):
  int count = 0;
  for each pc in group:
    count += Int(pc);
  emit(word, String(count));
a distributed computing framework that runs Word Count
efficiently in parallel at scale can handle much larger
and more interesting compute problems
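The map and reduce pseudocode above can be traced on a single machine. The following is a minimal sketch in plain Java, with no Hadoop or Cascading involved. The class name WordCountSketch, the split pattern, and the in-memory grouping that stands in for the shuffle are all assumptions made for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-JVM walk-through of the map/reduce word count.
final class WordCountSketch {
    // map phase: emit a (word, 1) pair for every token in the text
    static List<Map.Entry<String, Integer>> map(String docId, String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : text.split("[ \\[\\](),.]+")) {
            if (!w.isEmpty()) {
                pairs.add(new SimpleEntry<>(w, 1));
            }
        }
        return pairs;
    }

    // reduce phase: sum the partial counts grouped under one word
    static int reduce(String word, List<Integer> group) {
        int count = 0;
        for (int pc : group) {
            count += pc;
        }
        return count;
    }

    // driver: the grouping map plays the role of the shuffle between phases
    static Map<String, Integer> run(Map<String, String> docs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (Map.Entry<String, Integer> pair : map(doc.getKey(), doc.getValue())) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            counts.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("doc01", "a rain shadow is a dry area");
        docs.put("doc02", "rain shadow (dry)");
        System.out.println(run(docs));
    }
}
```

In a real cluster the grouping step is what the framework distributes; the map and reduce functions stay this small.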
35. WordCount – conceptual flow diagram
[conceptual flow diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count; map (M) and reduce (R) phases marked]
1 map, 1 reduce, 18 lines of code
cascading.org/category/impatient
gist.github.com/3900702
36. WordCount – Cascading app in Java
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
.addSource( docPipe, docTap )
.addTailSink( wcPipe, wcTap );
// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
38. WordCount – Cascalog / Clojure
(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\\\(\),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))
; Paul Lam
; github.com/Quantisan/Impatient
39. WordCount – Cascalog / Clojure
github.com/nathanmarz/cascalog/wiki
• implements Datalog in Clojure, with predicates backed
by Cascading – for a highly declarative language
• run ad-hoc queries from the Clojure REPL –
approx. 10:1 code reduction compared with SQL
• composable subqueries, used for test-driven development
(TDD) practices at scale
• Leiningen build: simple, no surprises, in Clojure itself
• more new deployments than other Cascading DSLs –
Climate Corp is largest use case: 90% Clojure/Cascalog
• has a learning curve, limited number of Clojure developers
• aggregators are the magic, and those take effort to learn
41. WordCount – Scalding / Scala
github.com/twitter/scalding/wiki
• extends the Scala collections API so that distributed lists
become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram
and function calls
• extensive libraries are available for linear algebra, abstract
algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale
• less learning curve than Cascalog
42. WordCount – Apache Hive
CREATE TABLE text_docs (line STRING);

LOAD DATA LOCAL INPATH 'data/rain.txt'
OVERWRITE INTO TABLE text_docs
;

SELECT
  word, COUNT(*)
FROM
  (SELECT
    split(line, '\t')[1] AS text
  FROM text_docs
  ) t
LATERAL VIEW explode(split(text, '[ ,.()]')) lTable AS word
GROUP BY word
;
43. WordCount – Apache Hive
hive.apache.org
pro:
‣ most popular abstraction atop Apache Hadoop
‣ SQL-like language is syntactically familiar to most analysts
‣ simple to load large-scale unstructured data and run ad-hoc queries
con:
‣ not a relational engine, many surprises at scale
‣ difficult to represent complex workflows, ML algorithms, etc.
‣ one poorly-trained analyst can bottleneck an entire cluster
‣ app-level integration requires other coding, outside of script language
‣ logical planner mixed with physical planner; cannot collect app stats
‣ non-deterministic exec: number of maps+reduces may change unexpectedly
‣ business logic must cross multiple language boundaries: difficult to
troubleshoot, optimize, audit, handle exceptions, set notifications, etc.
44. WordCount – Apache Pig
docPipe = LOAD '$docPath' USING PigStorage('\t', 'tagsource')
  AS (doc_id, text);
docPipe = FILTER docPipe BY doc_id != 'doc_id';

-- specify regex to split "document" text lines into token stream
tokenPipe = FOREACH docPipe
  GENERATE doc_id, FLATTEN(TOKENIZE(text, ' [](),.')) AS token;
tokenPipe = FILTER tokenPipe BY token MATCHES '\\w.*';

-- determine the word counts
tokenGroups = GROUP tokenPipe BY token;
wcPipe = FOREACH tokenGroups
  GENERATE group AS token, COUNT(tokenPipe) AS count;

-- output
STORE wcPipe INTO '$wcPath' USING PigStorage('\t', 'tagsource');
EXPLAIN -out dot/wc_pig.dot -dot wcPipe;
45. WordCount – Apache Pig
pig.apache.org
pro:
‣ easy to learn data manipulation language (DML)
‣ interactive prompt (Grunt) makes it simple to prototype apps
‣ extensibility through UDFs
con:
‣ not a full programming language; must extend via UDFs outside of language
‣ app-level integration requires other coding, outside of script language
‣ simple problems are simple to do; hard problems become quite complex
‣ difficult to parameterize scripts externally; must rewrite to change taps!
‣ logical planner mixed with physical planner; cannot collect app stats
‣ non-deterministic exec: number of maps+reduces may change unexpectedly
‣ business logic must cross multiple language boundaries: difficult to
troubleshoot, optimize, audit, handle exceptions, set notifications, etc.
46. Two Avenues to the App Layer…
Enterprise: must contend with
complexity at scale every day…
incumbents extend current practices and
infrastructure investments – using JVM,
ANSI SQL, SAS, etc. – to migrate
workflows onto Apache Hadoop while
leveraging existing staff
Start-ups: crave complexity and
scale to become viable…
new ventures move into Enterprise space
to compete using relatively lean staff,
while leveraging sophisticated engineering
practices, e.g., Cascalog and Scalding
[chart axes: complexity ➞, scale ➞]
48. Pattern – model scoring
• migrate workloads: SAS, Teradata, etc.,
  exporting predictive models as PMML
• great open source tools – R, Weka,
  KNIME, Matlab, RapidMiner, etc.
• integrate with other libraries –
  Matrix API, etc.
• leverage PMML as another kind
  of DSL
cascading.org/pattern
[architecture diagram: Customers, Web App, logs, Support, Modeling, PMML, source taps, trap tap, Data Workflow on a Hadoop Cluster, sink taps, Analytics Cubes, Reporting, Cache, customer profile DBs, Customer Prefs]
49. Pattern – score a model, using pre-defined Cascading app
[flow diagram: Customer Orders, Classify (PMML Model), Scored Orders, Assert, GroupBy token, Count, Failure Traps, Confusion Matrix; map (M) and reduce (R) phases marked]
cascading.org/pattern
50. Pattern – score a model, within an app
public static void main( String[] args ) throws RuntimeException {
String inputPath = args[ 0 ];
String classifyPath = args[ 1 ];
// set up the config properties
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
// create source and sink taps
Tap inputTap = new Hfs( new TextDelimited( true, "\t" ), inputPath );
Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
// handle command line options
OptionParser optParser = new OptionParser();
optParser.accepts( "pmml" ).withRequiredArg();
OptionSet options = optParser.parse( args );
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
.addSource( "input", inputTap )
.addSink( "classify", classifyTap );
if( options.hasArgument( "pmml" ) ) {
String pmmlPath = (String) options.valuesOf( "pmml" ).get( 0 );
PMMLPlanner pmmlPlanner = new PMMLPlanner()
.setPMMLInput( new File( pmmlPath ) )
.retainOnlyActiveIncomingFields()
.setDefaultPredictedField( new Fields( "predict", Double.class ) ); // default value if missing from the model
flowDef.addAssemblyPlanner( pmmlPlanner );
}
// write a DOT file and run the flow
Flow classifyFlow = flowConnector.connect( flowDef );
classifyFlow.writeDOT( "dot/classify.dot" );
classifyFlow.complete();
}
51. Approach 1: Vagrant Cluster for Cascading and Hadoop
set up Vagrant (use v1.3.3 only!) and VirtualBox to run Cascading…
PS: we can share USB thumb drives to speed up box downloads!
github.com/Cascading/vagrant-cascading-hadoop-cluster
NB: when running Gradle builds, you must run as “root”…
then when running Hadoop, you must run as “mapred”
and use HDFS commands.
52. Approach 2: Laptop Setup for Java, Hadoop, Gradle, Cascading
set up a build environment locally and run Apache Hadoop
in “standalone” mode… works fine for Linux or MacOSX;
however, please no “cdh”, “hdp”, “homebrew”, or “cygwin”
liber118.com/pxn/course/itds/install.html
download as a ZIP file, or use Git to clone the repo…
github.com/Cascading/Impatient
NB: when running Hadoop, you will run in local mode –
no HDFS
53. Approach 3: Login to a pre-configured EC2 Node
assuming you are familiar with using SSH on Linux or MacOSX,
or using Putty on Windows…
we will give instructions during the workshop
NB: when running Hadoop, you will run in local mode –
no HDFS
56. Experiments – comparing models
• much customer interest in leveraging Cascading and
  Apache Hadoop to run customer experiments at scale
• run multiple variants, then measure relative “lift”
• Concurrent runtime – tag and track models
the following example compares two models trained
with different machine learning algorithms
this is exaggerated: one has an important variable
intentionally omitted to help illustrate the experiment
57. Experiments – Random Forest model
## train a Random Forest model
## example: http://mkseo.pe.kr/stats/?p=220
f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
print(fit)
saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/"))
        OOB estimate of  error rate: 14%
Confusion matrix:
   0   1 class.error
0 69  16   0.1882353
1 12 103   0.1043478
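The error figures in the printout above follow directly from the four cell counts. As a quick check, here is a small sketch that recomputes them, assuming (as randomForest prints it) that rows are the true classes and columns are the predicted classes. The class name ConfusionMatrixSketch is hypothetical.

```java
// Hypothetical helper: recomputes error rates from a confusion matrix
// whose rows are true classes and whose columns are predicted classes.
final class ConfusionMatrixSketch {
    // fraction of one true class that was predicted as some other class
    static double classError(int[][] m, int cls) {
        int rowTotal = 0;
        for (int v : m[cls]) {
            rowTotal += v;
        }
        return (rowTotal - m[cls][cls]) / (double) rowTotal;
    }

    // fraction of all cases that fall off the diagonal
    static double overallError(int[][] m) {
        int total = 0;
        int correct = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                total += m[i][j];
                if (i == j) {
                    correct += m[i][j];
                }
            }
        }
        return (total - correct) / (double) total;
    }

    public static void main(String[] args) {
        int[][] m = { { 69, 16 }, { 12, 103 } };  // counts from the printout
        System.out.printf("class 0 error: %.7f%n", classError(m, 0));  // 16 / 85
        System.out.printf("class 1 error: %.7f%n", classError(m, 1));  // 12 / 115
        System.out.printf("overall error: %.2f%n", overallError(m));   // 28 / 200
    }
}
```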
58. Experiments – Logistic Regression model
## train a Logistic Regression model (special case of GLM)
## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r
f <- as.formula("as.factor(label) ~ var0 + var2")
fit <- glm(f, family=binomial, data=data)
print(summary(fit))
saveXML(pmml(fit), file=paste(out_folder, "sample.lr.xml", sep="/"))
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.8524     0.3803   4.871 1.11e-06 ***
var0         -1.3755     0.4355  -3.159  0.00159 **
var2         -3.7742     0.5794  -6.514 7.30e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
NB: this model has “var1” intentionally omitted
59. Experiments – comparing results
• use a confusion matrix to compare results for the classifiers
• assign a cost model to select a winner –
  for example, in an ecommerce anti-fraud classifier:
  FN ∼ chargeback risk
  FP ∼ customer support costs
• Logistic Regression has a lower “false negative” rate (5% vs. 11%),
  however it has a much higher “false positive” rate (52% vs. 14%)
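One way to turn those rates into a decision is to price each kind of error and compare expected totals. The sketch below is only an illustration. The class name CostModelSketch, the case counts, and the per-error costs are invented, and the rates are assumed to be fractions of actual positives (FN) and actual negatives (FP), which the slides do not state.

```java
// Hypothetical cost model: expected cost of a classifier's mistakes.
final class CostModelSketch {
    // cost = FN rate * positives * cost per FN + FP rate * negatives * cost per FP
    static double expectedCost(double fnRate, double fpRate,
                               int positives, int negatives,
                               double costPerFn, double costPerFp) {
        return fnRate * positives * costPerFn + fpRate * negatives * costPerFp;
    }

    public static void main(String[] args) {
        // all four values below are made up for illustration
        int fraudOrders = 100;        // actual positives
        int legitOrders = 9900;       // actual negatives
        double chargeback = 50.0;     // cost of a missed fraud (FN)
        double supportCall = 5.0;     // cost of a wrongly flagged order (FP)

        // FN and FP rates as quoted for the two models
        double lr = expectedCost(0.05, 0.52, fraudOrders, legitOrders, chargeback, supportCall);
        double rf = expectedCost(0.11, 0.14, fraudOrders, legitOrders, chargeback, supportCall);

        System.out.println("Logistic Regression expected cost: " + lr);
        System.out.println("Random Forest expected cost: " + rf);
        System.out.println("winner: " + (lr < rf ? "Logistic Regression" : "Random Forest"));
    }
}
```

With legitimate orders far outnumbering fraud, the false-positive term tends to dominate, so the choice of winner depends heavily on the assumed costs and class balance.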
60. Why Do Ensembles Matter?
The World…
The World…
per Data Modeling
61. Two Cultures
“A new research community using these tools sprang up. Their goal
was predictive accuracy. The community consisted of young computer
scientists, physicists and engineers plus a few aging statisticians.
They began using the new tools in working on complex prediction
problems where it was obvious that data models were not applicable:
speech recognition, image recognition, nonlinear time series prediction,
handwriting recognition, prediction in financial markets.”
Statistical Modeling: The Two Cultures
Leo Breiman, 2001
bit.ly/eUTh9L
chronicled a sea change from data modeling (silos, manual
process) to the rising use of algorithmic modeling (machine
data for automation/optimization) which led in turn to the
practice of leveraging inter-disciplinary teams
62. Ensemble Models
Breiman: “a multiplicity of data models”
BellKor team: 100+ individual models in 2007 Progress Prize
while the process of combining models adds complexity
(making it more difficult to anticipate or explain predictions)
accuracy may increase substantially
Ensemble Learning: Better Predictions Through Diversity
Todd Holloway
ETech (2008)
abeautifulwww.com/EnsembleLearningETech.pdf
The Story of the Netflix Prize: An Ensembler’s Tale
Lester Mackey
National Academies Seminar, Washington, DC (2011)
stanford.edu/~lmackey/papers/
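As a toy illustration of combining models, the sketch below takes a simple majority vote over several binary classifiers. This is only one of many ensemble techniques and is not the blending method used by the BellKor team. The class name MajorityVoteSketch and the threshold models in main are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

// Hypothetical ensemble: majority vote over binary (0/1) classifiers.
final class MajorityVoteSketch {
    // returns 1 when more than half of the models predict 1; ties resolve to 0
    static int vote(List<ToIntFunction<double[]>> models, double[] features) {
        int ones = 0;
        for (ToIntFunction<double[]> model : models) {
            ones += model.applyAsInt(features);
        }
        return (2 * ones > models.size()) ? 1 : 0;
    }

    public static void main(String[] args) {
        // three made-up threshold models, each looking at a different variable
        List<ToIntFunction<double[]>> models = new ArrayList<>();
        models.add(x -> x[0] > 0.5 ? 1 : 0);
        models.add(x -> x[1] > 0.5 ? 1 : 0);
        models.add(x -> x[2] > 0.5 ? 1 : 0);

        double[] sample = { 0.9, 0.2, 0.7 };
        System.out.println("ensemble label: " + vote(models, sample));
    }
}
```

Each individual model stays simple, yet the combined prediction no longer maps to any single rule, which is the explainability trade-off noted above.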
63. KDD 2013 PMML Workshop
Pattern: PMML for Cascading and Hadoop
Paco Nathan, Girish Kathalagiri
Chicago (2013-08-11)
19th ACM SIGKDD
Conference on Knowledge Discovery
and Data Mining
kdd13pmml.wordpress.com
64. Pattern: Example App
• example integration of PMML and Cascading, using a sample app
  based on the crime dataset from the City of Chicago Open Data
• sample app implements a predictive model for expected crime
  rates based on location, hour of day, and month
• modeling performed in R, using the pmml package
• multiple models are captured as PMML, then integrated via
  Pattern to implement the entire workflow as a single app
• PMML provides a vector for migrating workloads off of SAS,
  SPSS, etc., onto Hadoop clusters for more cost-effective scaling
65. Pattern: Example App
City of Chicago Open Data portal
cityofchicago.org/city/en/narr/foia/CityData.html
Pattern open source project
github.com/Cascading/pattern
Observed benefits include greatly reduced development costs
and fewer licensing issues at scale, while leveraging the scalability
of Apache Hadoop clusters, existing intellectual property in
predictive models, and the core competencies of analytics staff.
Analysts can train predictive models in popular analytics
frameworks, such as SAS, Microstrategy, R, Weka, SQL Server,
etc., then run those models at scale on Apache Hadoop with
little or no coding required.
69. Statistical Thinking
[diagram labels: Process, Variation, Data, Tools]
employing a mode of thought which includes both logical and analytical reasoning:
evaluating the whole of a problem, as well as its component parts; attempting
to assess the effects of changing one or more variables
this approach attempts to understand not just problems and solutions,
but also the processes involved and their variances
particularly valuable in Big Data work when combined with hands-on experience in
physics – roughly 50% of my peers come from physics or physical engineering…
programmers typically don’t think this way…
however, both systems engineers and data scientists must
70. What is needed most?
approximately 80% of the costs for data-related projects
get spent on data preparation – mostly on cleaning up
data quality issues: ETL, log files, etc., generally by socializing
the problem
unfortunately, data-related budgets tend to go into
frameworks which can only be used after clean up
most valuable skills:
‣ learn to use programmable tools that prepare data
‣ learn to understand the audience and their priorities
‣ learn to generate compelling data visualizations
‣ learn to estimate the confidence for reported results
‣ learn to automate work, making analysis repeatable
d3js.org
71. Team Process = Needs
• discovery: help people ask the right questions
• modeling: allow automation to place informed bets
• integration: deliver data products at scale to LOB end uses
• apps: build smarts into product features
• systems: keep infrastructure running, cost-effective
[diagram labels: analysts, engineers, inter-disciplinary leadership]
72. Team Composition = Roles
• Domain Expert: business process, stakeholder
• Data Scientist: data prep, discovery, modeling, etc.
• App Dev: software engineering, automation
• Ops: systems engineering, availability
[diagram labels: data science, introduced capability]
leverage non-traditional pairing among roles, to
complement skills and tear down silos
73. Team Composition = Needs × Roles
[matrix diagram: needs (discovery, modeling, integration, apps, systems) × roles]
• Domain Expert: business process, stakeholder
• Data Scientist: data prep, discovery, modeling, etc.
• App Dev: software engineering, automation
• Ops: systems engineering, availability
[diagram label: data science]
74. Alternatively, Data Roles × Skill Sets
Analyzing the Analyzers
Harlan Harris, Sean Murphy,
Marck Vaisman
O’Reilly, 2013
amazon.com/dp/B00DBHTE56
Harlan Harris, et al.
datacommunitydc.org/blog/wp-content/uploads/2012/08/SkillsSelfIDMosaic-edit-500px.png
75. Cluster Computing’s Dirty Little Secret
many of us make a good living by leveraging high ROI
apps based on clusters, and so execs agree to build
out more data centers…
clusters for Hadoop/HBase, for Storm, for MySQL,
for Memcached, for Cassandra, for Nginx, etc.
this becomes expensive!
a single class of workloads on a given cluster is simpler
to manage, but terrible for utilization… various notions
of “cloud” help…
Cloudera, Hortonworks, probably EMC soon: sell a notion
of “Hadoop as OS”
All your workloads are belong to us
Google Data Center, Fox News
~2002
76. Beyond Hadoop
Hadoop – an open source solution for fault-tolerant parallel
processing of batch jobs at scale, based on commodity
hardware… however, other priorities have emerged for the
analytics lifecycle:
• apps require integration beyond Hadoop
• multiple topologies, mixed workloads, multi-tenancy
• higher utilization
• lower latency
• highly-available, long running services
• more than “Just JVM” – e.g., Python growth
keep in mind the priority for multi-disciplinary efforts,
to break down even more silos – well beyond the
de facto “priesthood” of data engineering
77. Beyond Hadoop
Google has been doing data center computing for years,
to address the complexities of large-scale data workflows:
• leveraging the modern kernel: isolation in lieu of VMs
• “most (>80%) jobs are batch jobs, but the majority
  of resources (55–80%) are allocated to service jobs”
• mixed workloads, multi-tenancy
• relatively high utilization rates
• JVM? not so much…
• reality: scheduling batch is simple;
  scheduling services is hard/expensive
78. “Return of the Borg”
Return of the Borg: How Twitter Rebuilt Google’s
Secret Weapon
Cade Metz
wired.com/wiredenterprise/2013/03/google-borg-twitter-mesos
The Datacenter as a Computer: An Introduction
to the Design of Warehouse-Scale Machines
Luiz André Barroso, Urs Hölzle
research.google.com/pubs/pub35290.html
2011 GAFS Omega
John Wilkes, et al.
youtu.be/0ZFMlO98Jkc
79. “Return of the Borg”
Omega: flexible, scalable schedulers for large compute clusters
Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, John Wilkes
eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schwarzkopf.pdf
80. Mesos – definitions
a common substrate for cluster computing
http://mesos.apache.org/
heterogeneous assets in your data center or cloud
made available as a homogeneous set of resources
• top-level Apache project
• scalability to 10,000s of nodes
• obviates the need for virtual machines
• isolation (pluggable) for CPU, RAM, I/O, FS, etc.
• fault-tolerant replicated master using ZooKeeper
• multi-resource scheduling (memory and CPU aware)
• APIs in C++, Java, Python
• web UI for inspecting cluster state
• available for Linux, OpenSolaris, Mac OSX
81. Mesos – architecture
given the use of Mesos as a Data Center OS kernel…
• Chronos provides complex scheduling capabilities,
  much like a distributed Unix “cron”
• Marathon provides highly-available long-running
  services, much like a distributed Unix “init.d”
• next time you need to build a distributed app,
  consider using these as building blocks
a major lesson learned from Spark:
• leveraging these kinds of building blocks,
  one can rebuild Hadoop 100x faster,
  in much less code
84. Case Study: Twitter (bare metal / on premise)
“Mesos is the cornerstone of our elastic compute infrastructure –
it’s how we build all our new services and is critical for Twitter’s
continued success at scale. It’s one of the primary keys to our
data center efficiency.”
Chris Fry, SVP Engineering
blog.twitter.com/2013/mesos-graduates-from-apache-incubation
• key services run in production: analytics, typeahead, ads
• Twitter engineers rely on Mesos to build all new services
• instead of thinking about static machines, engineers think
  about resources like CPU, memory and disk
• allows services to scale and leverage a shared pool of
  servers across data centers efficiently
• reduces the time between prototyping and launching
85. Case Study: Airbnb (fungible cloud infrastructure)
“We think we might be pushing data science in the field of travel
more so than anyone has ever done before… a smaller number
of engineers can have higher impact through automation on
Mesos.”
Mike Curtis, VP Engineering
gigaom.com/2013/07/29/airbnb-is-engineering-itself-into-a-data-driven...
• improves resource management and efficiency
• helps advance engineering strategy of building small teams
  that can move fast
• key to letting engineers make the most of AWS-based
  infrastructure beyond just Hadoop
• allowed company to migrate off Elastic MapReduce
• enables use of Hadoop along with Chronos, Spark, Storm, etc.
86. Arguments for Data Center Computing
rather than running several specialized clusters, each
at relatively low utilization rates, instead run many
mixed workloads
obvious benefits are realized in terms of:
• scalability, elasticity, fault tolerance, performance, utilization
• reduced equipment capex, Ops overhead, etc.
• reduced licensing, eliminating need for VMs or
  potential vendor lock-in
subtle benefits – arguably, more important for Enterprise IT:
• reduced time for engineers to ramp up new services at scale
• reduced latency between batch and services, enabling new
  high-ROI use cases
• enables Dev/Test apps to run safely on a Production cluster
87. Media Coverage
Mesosphere Adds Docker Support To Its Mesos-Based Operating System For The Data Center
Frederic Lardinois
TechCrunch (2013-09-26)
techcrunch.com/2013/09/26/mesosphere...
Play Framework Grid Deployment with Mesos
James Ward, Flo Leibert, et al.
Typesafe blog (2013-09-19)
typesafe.com/blog/play-framework-grid...
Mesosphere Launches Marathon Framework
Adrian Bridgwater
Dr. Dobbs (2013-09-18)
drdobbs.com/open-source/mesosphere...
New open source tech Marathon wants to make your data center run like Google’s
Derrick Harris
GigaOM (2013-09-04)
gigaom.com/2013/09/04/new-open-source...
Running batch and long-running, highly available service jobs on the same cluster
Ben Lorica
O’Reilly (2013-09-01)
strata.oreilly.com/2013/09/running-batch...
90. Enterprise Data Workflows with Cascading
O’Reilly, 2013
shop.oreilly.com/product/0636920028536.do
monthly newsletter for updates, events,
conference summaries, etc.:
liber118.com/pxn/