OSCON 2014: Data Workflows for Machine Learning

Paco Nathan
Paco NathanEvil Mad Scientist
Data Workflows for 

Machine Learning
!
Paco Nathan

@pacoid
http://liber118.com/pxn/
Why is this talk here?	

Machine Learning in production apps is less and less about
algorithms (even though that work is quite fun and vital)	

!
Performing real work is more about:	

! socializing a problem within an organization	

! feature engineering (“Beyond Product Managers”)	

! tournaments in CI/CD environments	

! operationalizing high-ROI apps at scale	

! etc.	

!
So I’ll just crawl out on a limb and state that leveraging great
frameworks to build data workflows is more important than 

chasing after diminishing returns on highly nuanced algorithms.	

!
Because Interwebs!
Data Workflows for Machine Learning	

Middleware has been evolving for Big Data, and there are some
great examples — we’ll review several. Process has been evolving
too, right along with the use cases.	

!
Popular frameworks typically provide some Machine Learning
capabilities within their core components, or at least among their
major use cases.	

!
Let’s consider features from Enterprise DataWorkflows as a basis 

for what’s needed in Data Workflows for Machine Learning.	

!
Their requirements for scale, robustness, cost trade-offs,
interdisciplinary teams, etc., serve as guides in general.
Caveat Auditor	

I won’t claim to be expert with each of the frameworks and
environments described in this talk. Expert with a few of them
perhaps, but more to the point: embroiled in many use cases.	

!
This talk attempts to define a “scorecard” for evaluating
important ML data workflow features: what’s needed for 

use cases, compare and contrast of what’s available, plus 

some indication of which frameworks are likely to be best 

for a given scenario.	

!
Seriously, this is a work in progress.
Outline	

• Definition: Machine Learning	

• Definition: Data Workflows	

• A whole bunch o’ examples across several platforms	

• Nine points to discuss, leading up to a scorecard	

• Because Notebooks	

• Questions, comments, flying tomatoes…
Data Workflows for 

Machine Learning:


Frame the question…
!
“A Basis for What’s Needed”


“Machine learning algorithms can figure out how to perform

important tasks by generalizing from examples.This is often

feasible and cost-effective where manual programming is not. 

As more data becomes available, more ambitious problems 

can be tackled. As a result, machine learning is widely used 

in computer science and other fields.”	

Pedro Domingos, U Washington

A Few Useful Things to Know about Machine Learning
Definition: Machine Learning
• overfitting (variance) 	

• underfitting (bias)	

• “perfect classifier” (no free lunch)
[learning as generalization] =

[representation] + [evaluation] + [optimization]
Definition: Machine Learning … Conceptually

!
!
1. real-world data 	

2. representation – as graphs, 

time series, geo, etc. 	

3. convert to sparse matrix for production work 

leveraging abstract algebra + func programming 	

4. cost-effective parallel processing for ML apps 

at scale, live use cases
discovery
discovery
modeling
modeling
integration
integration
appsapps
systems
systems
business process,	

stakeholder
data prep, discovery,
modeling, etc.
software engineering,
automation
systems engineering,
availability
data
science
Data
Scientist
App Dev
Ops
Domain
Expert
Definition: Machine Learning … Interdisciplinary Teams, Needs × Roles
Definition: Machine Learning … Process, Feature Engineering
evaluationrepresentation optimization use cases
feature engineering
data
sources
data
sources
data prep
pipeline
train
test
hold-outs
learners
classifiers
classifiers
classifiers
data
sources
scoring
Definition: Machine Learning … Process, Tournaments
evaluationrepresentation optimization use cases
feature engineering tournaments
data
sources
data
sources
data prep
pipeline
train
test
hold-outs
learners
classifiers
classifiers
classifiers
data
sources
scoring
quantify and measure:
• benefit?
• risk?
• operational costs?
• obtain other data?
• improve metadata?
• refine representation?
• improve optimization?
• iterate with stakeholders?
• can models (inference) inform
first principles approaches?
monoid (an alien life form, courtesy of abstract algebra):	

! binary associative operator	

! closure within a set of objects	

! unique identity element	

what are those good for?	

! composable functions in workflows	

! compiler “hints” on steroids, to parallelize	

! reassemble results minimizing bottlenecks at scale	

! reusable components, like Lego blocks	

! think: Docker for business logic	

check out http://justenoughmath.com/	

kudos:Avi Bryant, Oscar Boykin, Sam Ritchie, Jimmy Lin, et al.
Definition: Machine Learning … Invasion of the Monoids
Definition: Machine Learning
evaluationrepresentation optimization use cases
feature engineering tournaments
data
sources
data
sources
data prep
pipeline
train
test
hold-outs
learners
classifiers
classifiers
classifiers
data
sources
scoring
quantify and measure:
• benefit?
• risk?
• operational costs?
• obtain other data?
• improve metadata?
• refine representation?
• improve optimization?
• iterate with stakeholders?
• can models (inference) inform
first principles approaches?
discovery
discovery
modeling
modeling
integration
integration
appsapps
systems
systems
data
science
Data
Scientist
App Dev
Ops
Domain
Expert
introduced
capability
Definition
evaluationrepresentation optimization use cases
feature engineering tournaments
data
sources
data
sources
data prep
pipeline
train
test
hold-outs
learners
classifiers
classifiers
classifiers
data
sources
scoring
quantify and measure:
• benefit?
• risk?
• operational costs?
• obtain other data?
• improve metadata?
• refine representation?
• improve optimization?
• iterate with stakeholders?
• can models (inference) inform
first principles approaches?
discovery
discovery
modeling
modeling
integration
integration
appsapps
systems
systems
data
science
Data
Scientist
App Dev
Ops
Domain
Expert
introduced
capability
Can we fit the required 

process into generalized 

workflow definitions?
Definition: Data Workflows	

Middleware, effectively, is evolving for Big Data and Machine Learning…

The following design pattern — a DAG — shows up in many places, 

via many frameworks:
ETL
data
prep
predictive
model
data
sources
end
uses
Definition: Data Workflows	

definition of a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
uses
ANSI SQL for ETL
Definition: Data Workflows	

definition of a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
usesJava, Pig, etc., for business logic
Definition: Data Workflows	

definition of a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
uses
SAS for predictive models

Definition: Data Workflows
definition of a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
uses
SAS for predictive models
ANSI SQL for ETL most of the licensing costs…
Definition: Data Workflows
definition of a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
usesJava, Pig, etc.,
most of the project costs…
Definition: Data Workflows
definition of a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
usesJava, Pig, etc.,
most of the project costs…
Something emerges 

to fill these needs, 

for instance…
ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
SAS, R, etc. → PMML
business logic in Java,
Clojure, Scala, etc.
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
Definition: Data Workflows	

For example, Cascading and related projects implement the following
components, based on 100% open source: cascading.org
a compiler sees it all…

one connected DAG:
• troubleshooting
• exception handling
• notifications
• some optimizations
ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
SAS, R, etc. → PMML
business logic in Java,
Clojure, Scala, etc.
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
Definition: Data Workflows
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
!
!
FlowDef flowDef = FlowDef.flowDef()!
.setName( "etl" )!
.addSource( "example.employee", emplTap )!
.addSource( "example.sales", salesTap )!
.addSink( "results", resultsTap );!
 !
SQLPlanner sqlPlanner = new SQLPlanner()!
.setSql( sqlStatement );!
 !
flowDef.addAssemblyPlanner( sqlPlanner );!
!
!
cascading.org
ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
SAS, R, etc. → PMML
business logic in Java,
Clojure, Scala, etc.
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
Definition: Data Workflows
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
!
!
FlowDef flowDef = FlowDef.flowDef()!
.setName( "classifier" )!
.addSource( "input", inputTap )!
.addSink( "classify", classifyTap );!
 !
PMMLPlanner pmmlPlanner = new PMMLPlanner()!
.setPMMLInput( new File( pmmlModel ) )!
.retainOnlyActiveIncomingFields();!
 !
flowDef.addAssemblyPlanner( pmmlPlanner );!
!
!
Enterprise DataWorkflows with Cascading	

O’Reilly, 2013	

shop.oreilly.com/product/
0636920028536.do
ETL
data
prep
predictive
model
data
sources
end
uses
Enterprise DataWorkflows with Cascading	

O’Reilly, 2013	

shop.oreilly.com/product/
0636920028536.do
ETL
data
prep
predictive
model
data
sources
end
uses
This begs an update…
Enterprise DataWorkflows with Cascading	

O’Reilly, 2013	

shop.oreilly.com/product/
0636920028536.do
ETL
data
prep
predictive
model
data
sources
end
uses
Because…
OSCON 2014: Data Workflows for Machine Learning
Data Workflows for 

Machine Learning:


Examples…
!
“Neque per exemplum”


Example: KNIME	

“a user-friendly graphical workbench for the entire analysis

process: data access, data transformation, initial investigation,

powerful predictive analytics, visualisation and reporting.”	

• large number of integrations (over 1000 modules)	

• ranked #1 in customer satisfaction among 

open source analytics frameworks	

• visual editing of reusable modules	

• leverage prior work in R, Perl, etc.	

• Eclipse integration	

• easily extended for new integrations
knime.org
Example: Python stack	

Python has much to offer – ranging across an 

organization, not just for the analytics staff	

• ipython.org
• pandas.pydata.org
• scikit-learn.org
• numpy.org
• scipy.org
• code.google.com/p/augustus
• continuum.io
• nltk.org
• matplotlib.org
Example: Julia	

“a high-level, high-performance dynamic programming language

for technical computing, with syntax that is familiar to users of

other technical computing environments”	

• significantly faster than most alternatives	

• built to leverage parallelism, cloud computing	

• still relatively new — one to watch!
importall Base!
!
type BubbleSort <: Sort.Algorithm end!
!
function sort!(v::AbstractVector, lo::Int, hi::Int, ::BubbleSort, o::Sort.Ordering)!
    while true!
        clean = true!
        for i = lo:hi-1!
            if Sort.lt(o, v[i+1], v[i])!
                v[i+1], v[i] = v[i], v[i+1]!
                clean = false!
            end!
        end!
        clean && break!
    end!
    return v!
end!
julialang.org
Example: Summingbird	

“a library that lets you write streaming MapReduce programs

that look like native Scala or Java collection transformations 

and execute them on a number of well-known distributed

MapReduce platforms like Storm and Scalding.”	

• switch between Storm, Scalding (Hadoop)	

• Spark support is in progress	

• leverage Algebird, Storehaus, Matrix API, etc.
github.com/twitter/summingbird
def wordCount[P <: Platform[P]]!
(source: Producer[P, String], store: P#Store[String, Long]) =!
source.flatMap { sentence => !
toWords(sentence).map(_ -> 1L)!
}.sumByKey(store)
Example: Scalding	

import com.twitter.scalding._!
 !
class WordCount(args : Args) extends Job(args) {!
Tsv(args("doc"),!
('doc_id, 'text),!
skipHeader = true)!
.read!
.flatMap('text -> 'token) {!
text : String => text.split("[ [](),.]")!
}!
.groupBy('token) { _.size('count) }!
.write(Tsv(args("wc"), writeHeader = true))!
}
github.com/twitter/scalding
Example: Scalding	

• extends the Scala collections API so that distributed lists
become “pipes” backed by Cascading	

• code is compact, easy to understand	

• nearly 1:1 between elements of conceptual flow diagram 

and function calls	

• extensive libraries are available for linear algebra, abstract
algebra, machine learning – e.g., Matrix API, Algebird, etc.	

• significant investments by Twitter, Etsy, eBay, etc.	

• less learning curve than Cascalog	

• build scripts… those take a while to run :
github.com/twitter/scalding
Example: Cascalog	

(ns impatient.core!
  (:use [cascalog.api]!
        [cascalog.more-taps :only (hfs-delimited)])!
  (:require [clojure.string :as s]!
            [cascalog.ops :as c])!
  (:gen-class))!
!
(defmapcatop split [line]!
  "reads in a line of string and splits it by regex"!
  (s/split line #"[[](),.)s]+"))!
!
(defn -main [in out & args]!
  (?<- (hfs-delimited out)!
       [?word ?count]!
       ((hfs-delimited in :skip-header? true) _ ?line)!
       (split ?line :> ?word)!
       (c/count ?count)))!
!
; Paul Lam!
; github.com/Quantisan/Impatient
cascalog.org
Example: Cascalog	

• implements Datalog in Clojure, with predicates backed 

by Cascading – for a highly declarative language	

• run ad-hoc queries from the Clojure REPL –

approx. 10:1 code reduction compared with SQL	

• composable subqueries, used for test-driven
development (TDD) practices at scale	

• Leiningen build: simple, no surprises, in Clojure itself	

• more new deployments than other Cascading DSLs – 

Climate Corp is largest use case: 90% Clojure/Cascalog	

• has a learning curve, limited number of Clojure
developers	

• aggregators are the magic, and those take effort to learn
cascalog.org
Example: Titan	

distributed graph database, 

by Matthias Broecheler, Marko Rodriguez, et al.	

• scalable graph database optimized for storing and
querying graphs	

• supports hundreds of billions of vertices and edges	

• transactional database that can support thousands of
concurrent users executing complex graph traversals	

• supports search through Lucene, ElasticSearch	

• can be backed by HBase, Cassandra, BerkeleyDB	

• TinkerPop native graph stack with Gremlin, etc.
// who is hercules' paternal grandfather?!
g.V('name','hercules').out('father').out('father').name
thinkaurelius.github.io/titan
Example: MBrace	

“a .NET based software stack that enables easy large-scale

distributed computation and big data analysis.”	

• declarative and concise distributed algorithms in 

F# asynchronous workflows	

• scalable for apps in private datacenters or public 

clouds, e.g.,Windows Azure or Amazon EC2	

• tooling for interactive, REPL-style deployment,
monitoring, management, debugging inVisual Studio	

• leverages monads (similar to Summingbird)	

• MapReduce as a library, but many patterns beyond
m-brace.net
let rec mapReduce (map: 'T -> Cloud<'R>)!
(reduce: 'R -> 'R -> Cloud<'R>)!
(identity: 'R)!
(input: 'T list) =!
cloud {!
match input with!
| [] -> return identity!
| [value] -> return! map value!
| _ ->!
let left, right = List.split input!
let! r1, r2 =!
(mapReduce map reduce identity left)!
<||>!
(mapReduce map reduce identity right)!
return! reduce r1 r2!
}
Example: Apache Spark	

in-memory cluster computing, 

by Matei Zaharia, et al.	

• intended to make data analytics fast to write and to run	

• easy to use, at a higher level of abstraction	

• load data into memory and query it repeatedly, much
more quickly than with Hadoop	

• APIs in Scala, Java and Python, shells in Scala and Python	

• integrations: Spark Streaming, MLlib, GraphX, Spark SQL,
Tachyon, etc.	

// word count!
val file = spark.textFile(“hdfs://…”)!
val counts = file.flatMap(line => line.split(” “))!
.map(word => (word, 1))!
.reduceByKey(_ + _)!
counts.saveAsTextFile(“hdfs://…”)
spark-project.org
The State of Spark, and
WhereWe're Going Next	

Matei Zaharia
Spark Summit (2013)	

youtu.be/nU6vO2EJAb4
Example: Spark SQL	

“blurs the lines between RDDs and relational tables”	

intermix SQL commands to query external data,
along with complex analytics, in a single app:	

• allows SQL extensions based on MLlib	

• Shark is being migrated to Spark SQL	

Spark SQL: Manipulating Structured Data Using Spark

Michael Armbrust, Reynold Xin (2014-03-24)

databricks.com/blog/2014/03/26/Spark-
SQL-manipulating-structured-data-using-
Spark.html

people.apache.org/~pwendell/catalyst-
docs/sql-programming-guide.html	

!
// data can be extracted from existing sources
// e.g., Apache Hive
val trainingDataTable = sql("""
SELECT e.action
u.age,
u.latitude,
u.logitude
FROM Users u
JOIN Events e
ON u.userId = e.userId""")
!
// since `sql` returns an RDD, the results above
// can be used in MLlib
val trainingData = trainingDataTable.map { row =>
val features = Array[Double](row(1), row(2), row(3))
LabeledPoint(row(0), features)
}
!
val model =
new LogisticRegressionWithSGD().run(trainingData)
spark-project.org
MapReduce
General Batch Processing
Pregel Giraph
Dremel Drill Tez
Impala GraphLab
Storm S4
Specialized Systems:
iterative, interactive, streaming, graph, etc.
Example: Apache Spark spark-project.org
2002
2002
MapReduce @ Google
2004
MapReduce paper
2006
Hadoop @Yahoo!
2004 2006 2008 2010 2012 2014
2014
Apache Spark top-level
2010
Spark paper
2008
Hadoop Summit
action value
RDD
RDD
RDD
transformations RDD
How about a generalized engine for distributed,
applicative systems – apps sharing code across
multiple use cases: batch, iterative, streaming, etc.
Data Workflows for
Machine Learning:


Update the definitions…
!
“Nine Points to Discuss”

Data Workflows	

Formally speaking, workflows include people (cross-dept) 

and define oversight for exceptional data	

…otherwise, data flow or pipeline would be more apt	

…examples include Cascading traps, with exceptional tuples 

routed to a review organization, e.g., Customer Support
includes people, defines
oversight for exceptional data
Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
1
Data Workflows	

Footnote –	

see also an excellent article related to this point 

and the next:	

!
Therbligs for data science: 

A nuts and bolts framework for accelerating data work	

Abe Gong @Jawbone	

blog.abegong.com/2014/03/therbligs-for-data-science.html	

strataconf.com/strata2014/public/schedule/detail/32291	

brighttalk.com/webcast/9059/105119	

!
!
!
includes people, defines
oversight for exceptional data
1+
Data Workflows	

Workflows impose a separation of concerns, allowing 

for multiple abstraction layers in the tech stack	

…specify what is required, not how to accomplish it	

…articulating the business logic	

… Cascading leverages pattern language	

…related notions from Knuth, literate programming	

…not unlike BPM/BPEL for Big Data	

…examples: IPython, R Markdown, etc.	

separation of concerns, allows
for literate programming
2
Data Workflows	

Footnote –	

while discussing separation of concerns and a general
design pattern, let’s consider data workflows in a
broader context of probabilistic programming, since
business data is heterogeneous and structured… 

the data for every domain is heterogeneous. 	

Paraphrasing Ginsberg:	

I’ve seen the best minds of my generation destroyed by
madness, dragging themselves through quagmires of large
LCD screens filled with Intellij debuggers and database
command lines, yearning to fit real-world data into their
preferred “deterministic” tools…
separation of concerns, allows
for literate programming
2+
Probabilistic Programming: 

Why,What, How,When

Beau Cronin

Strata SC (2014)

speakerdeck.com/beaucronin/
probabilistic-programming-
strata-santa-clara-2014
Why Probabilistic Programming 

Matters

Rob Zinkov

Convex Optimized (2012-06-27)

zinkov.com/posts/
2012-06-27-why-prob-
programming-matters/
Data Workflows	

Multiple abstraction layers in the tech stack are 

needed, but still emerging	

…feedback loops based on machine data	

…optimizers feed on abstraction	

…metadata accounting:	

• track data lineage	

• propagate schema	

• model / feature usage	

• ensemble performance	

…app history accounting:	

• util stats, mixing workloads	

• heavy-hitters, bottlenecks	

• throughput, latency
multiple abstraction layers 

for metadata, feedback, 

and optimization
Cluster
Scheduler
Planners/
Optimizers
Clusters
DSLs
Machine
Data
• app history
• util stats
• bottlenecksMixed
Topologies
Business
Process
Reusable
Components
Portable
Models
• data lineage
• schema propagation
• feature selection
• tournaments
3
Data Workflows
Multiple
needed, but still emerging	

…feedback loops based on
…optimizers feed on abstraction
…metadata accounting:	

• track data lineage
• propagate
• model / feature usage	

• ensemble performance	

…app history accounting:	

• util stats, mixing workloads	

• heavy-hitters, bottlenecks	

• throughput, latency
multiple abstraction layers 

for metadata, feedback, 

and optimization
Cluster
Scheduler
Planners/
Optimizers
Clusters
DSLs
Machine
Data
• app history
• util stats
• bottlenecksMixed
Topologies
Business
Process
Reusable
Components
Portable
Models
• data lineage
• schema propagation
• feature selection
• tournaments
3
Um, will compilers 

ever look like this??
Data Workflows	

Workflows must provide for test, which is no simple 

matter	

…testing is required on three abstraction layers:	

• model portability/evaluation, ensembles, tournaments	

• TDD in reusable components and business process	

• continuous integration / continuous deployment	

…examples: Cascalog for TDD, PMML for model eval	

…still so much more to improve	

…keep in mind that workflows involve people, too	

testing: model evaluation,
TDD, app deployment
4
Cluster
Scheduler
Planners/
Optimizers
Clusters
DSLs
Machine
Data
• app history
• util stats
• bottlenecksMixed
Topologies
Business
Process
Reusable
Components
Portable
Models
• data lineage
• schema propagation
• feature selection
• tournaments
Data Workflows	

Workflows enable system integration	

…future-proof integrations and scale-out	

…build components and apps, not jobs and command lines	

…allow compiler optimizations across the DAG,

i.e., cross-dept contributions	

…minimize operationalization costs:	

• troubleshooting, debugging at scale	

• exception handling	

• notifications, instrumentation	

…examples: KNIME, Cascading, etc.	

!
future-proof system
integration, scale-out, ops
Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
5
Data Workflows	

Visualizing workflows, what a great idea.	

…the practical basis for:	

• collaboration	

• rapid prototyping	

• component reuse	

…examples: KNIME wins best in category	

…Cascading generates flow diagrams, 

which are a nice start	

visualizing allows people 

to collaborate through code
6
Data Workflows	

Abstract algebra, containerizing workflow metadata
…monoids, semigroups, etc., allow for reusable components 

with well-defined properties for running in parallel at scale	

…let business process be agnostic about underlying topologies, 

with analogy to Linux containers (Docker, etc.)	

…compose functions, take advantage of sparsity (monoids)	

…because “data is fully functional” – FP for s/w eng benefits	

…aggregators are almost always “magic”, now we can solve 

for common use cases	

…read Monoidify! Monoids as a Design Principle for Efficient
MapReduce Algorithms by Jimmy Lin	

…Cascading introduced some notions of this circa 2007, 

but didn’t make it explicit	

…examples: Algebird in Summingbird, Simmer, Spark, etc.
abstract algebra and
functional programming
containerize business process
7
Data Workflows	

Footnote –	

see also two excellent talks related to this point:	

!
Algebra for Analytics

speakerdeck.com/johnynek/algebra-for-analytics

Oscar Boykin, Strata SC (2014)	

Add ALL theThings:

Abstract Algebra Meets Analytics

infoq.com/presentations/abstract-algebra-analytics

Avi Bryant, Strange Loop (2013)
abstract algebra and
functional programming
containerize business process
7+
Data Workflows	

Workflow needs vary in time, and need to blend time	

…something something batch, something something low-latency	

…scheduling batch isn’t hard; scheduling low-latency is 

computationally brutal (see the Omega paper)	

…because people like saying “Lambda Architecture”, 

it gives them goosebumps, or something	

…because real-time means so many different things	

…because batch windows are so 1964	

…examples: Summingbird, Oryx
blend results from different
time scales: batch plus low-
latency
8
Big Data

Nathan Marz, James Warren

manning.com/marz
Data Workflows	

Workflows may define contexts in which model selection 

possibly becomes a compiler problem	

…for example, see Boyd, Parikh, et al.	

…ultimately optimizing for loss function + regularization term	

…perhaps not ready for prime-time immediately	

…examples: MLbase
optimize learners in context,
to make model selection
potentially a compiler problem
9
f(x): loss function
g(z): regularization term
Data Workflows	

For one of the best papers about what workflows really

truly require, see Out of the Tar Pit by Moseley and Marks
bonus
Nine Points for Data Workflows — a wish list	

1. includes people, defines oversight for exceptional data
2. separation of concerns, allows for literate programming
3. multiple abstraction layers for metadata, feedback, and
optimization
4. testing: model evaluation, TDD, app deployment
5. future-proof system integration, scale-out, ops
6. visualizing workflows allows people to collaborate
through code
7. abstract algebra and functional programming
containerize business process
8. blend results from different time scales: 

batch plus low-latency
9. optimize learners in context, to make model 

selection potentially a compiler problem
Cluster
Scheduler
Planners/
Optimizers
Clusters
DSLs
Machine
Data
• app history
• util stats
• bottlenecksMixed
Topologies
Business
Process
Reusable
Components
Portable
Models
• data lineage
• schema propagation
• feature selection
• tournaments
Data Workflows for
Machine Learning:


Summary…
!
“feedback, discussion, outlook”


?
OSCON 2014: Data Workflows for Machine Learning
Nine Points for Data Workflows — a scorecard	

Spark Oryx Summing!
bird
Cascalog Cascading
!
KNIME Py Data R
Markdown
MBrace
includes people,
exceptional data
separation of
concerns
multiple 

abstraction layers
testing in depth
future-proof 

system
integration
visualize to collab
can haz monoids
blends batch + 

“real-time”
optimize learners 

in context
can haz PMML ? ✔ ✔ ✔ ✔ ✔
Haskell Curry

haskell.org
Alonso Church

wikipedia.org
The General Case:	

Theory, Eight Decades Ago:	

Haskell Curry, known for seminal
work on combinatory logic (1927)	

Alonzo Church, known for lambda
calculus (1936) and much more! 	

!
Both sought formal answers to the
question, “What can be computed?”
The General Case:	

Praxis, Four Decades Ago:	

Leveraging lambda calculus, combinators,
etc., to increase the parallelism of apps as
applicative systems
John Backus

acm.org
David Turner

wikipedia.org
“Can Programming Be Liberated from the von Neumann

Style? A Functional Style and Its Algebra of Programs”

ACMTuring Award (1977)

stanford.edu/class/cs242/readings/backus.pdf
“A new implementation technique for applicative languages”

Turner, D.A. (1979)

Softw: Pract. Exper., 9: 31–49. doi: 10.1002/spe.4380090105
Notebooks, cloud-based and otherwise…	

For now, the best practices appear to be heading toward 

a world of notebooks…	

• IPython => Jupyter	

• can be containerized to run in a cloud	

• great for team collaboration	

• Don Knuth sez: try literate programming	

• great for dashboards, as end-use data products	

• also good as endpoints for service-oriented architectures

(which are what ML use cases should be thinking about)
Ali Ghodsi, Databricks Cloud	

youtu.be/lO7LhVZrNwA?t=23m27s
Notebooks, cloud-based and otherwise…
follow-ups
monthly newsletter for updates, 

events, conf summaries, etc.:
liber118.com/pxn/
Enterprise Data Workflows with Cascading
O’Reilly, 2013
shop.oreilly.com/product/0636920028536.do
Just Enough Math
O’Reilly, 2014
oreilly.com/go/enough_math/

preview: youtu.be/TQ58cWgdCpA
Scala by the Bay

SF, Aug 8

scalabythebay.org
#MesosCon

Chicago, Aug 21

events.linuxfoundation.org/events/mesoscon
Cassandra Summit

SF, Sep 10

cvent.com/events/cassandra-summit-2014
Strata NYC + Hadoop World

NYC, Oct 15

strataconf.com/stratany2014
Strata EU

Barcelona, Nov 20

strataconf.com/strataeu2014
Data Day Texas

Austin, Jan 10

datadaytexas.com
calendar:
1 de 68

Recomendados

Data Workflows for Machine Learning - SF Bay Area ML por
Data Workflows for Machine Learning - SF Bay Area MLData Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area MLPaco Nathan
8.9K vistas79 diapositivas
Square's Machine Learning Infrastructure and Applications - Rong Yan por
Square's Machine Learning Infrastructure and Applications - Rong YanSquare's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong YanHakka Labs
4.1K vistas43 diapositivas
Data Workflows for Machine Learning - Seattle DAML por
Data Workflows for Machine Learning - Seattle DAMLData Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAMLPaco Nathan
31.6K vistas74 diapositivas
Azure Machine Learning por
Azure Machine LearningAzure Machine Learning
Azure Machine LearningMostafa Elzoghbi
3.7K vistas50 diapositivas
Azure Machine Learning and ML on Premises por
Azure Machine Learning and ML on PremisesAzure Machine Learning and ML on Premises
Azure Machine Learning and ML on PremisesIvo Andreev
3.9K vistas30 diapositivas
Target Leakage in Machine Learning por
Target Leakage in Machine LearningTarget Leakage in Machine Learning
Target Leakage in Machine LearningYuriy Guts
2.6K vistas45 diapositivas

Más contenido relacionado

La actualidad más candente

Machine Learning Goes Production por
Machine Learning Goes ProductionMachine Learning Goes Production
Machine Learning Goes ProductionMichał Łopuszyński
1.3K vistas28 diapositivas
From Labelling Open data images to building a private recommender system por
From Labelling Open data images to building a private recommender systemFrom Labelling Open data images to building a private recommender system
From Labelling Open data images to building a private recommender systemPierre Gutierrez
1.9K vistas47 diapositivas
Machine learning the high interest credit card of technical debt [PWL] por
Machine learning the high interest credit card of technical debt [PWL]Machine learning the high interest credit card of technical debt [PWL]
Machine learning the high interest credit card of technical debt [PWL]Jenia Gorokhovsky
993 vistas40 diapositivas
Data Science Reinvents Learning? por
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?Paco Nathan
6.1K vistas62 diapositivas
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF... por
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...Bill Liu
393 vistas29 diapositivas
End-to-End Machine Learning Project por
End-to-End Machine Learning ProjectEnd-to-End Machine Learning Project
End-to-End Machine Learning ProjectEng Teong Cheah
2.5K vistas46 diapositivas

La actualidad más candente(20)

From Labelling Open data images to building a private recommender system por Pierre Gutierrez
From Labelling Open data images to building a private recommender systemFrom Labelling Open data images to building a private recommender system
From Labelling Open data images to building a private recommender system
Pierre Gutierrez1.9K vistas
Machine learning the high interest credit card of technical debt [PWL] por Jenia Gorokhovsky
Machine learning the high interest credit card of technical debt [PWL]Machine learning the high interest credit card of technical debt [PWL]
Machine learning the high interest credit card of technical debt [PWL]
Jenia Gorokhovsky993 vistas
Data Science Reinvents Learning? por Paco Nathan
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?
Paco Nathan6.1K vistas
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF... por Bill Liu
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
Bill Liu393 vistas
End-to-End Machine Learning Project por Eng Teong Cheah
End-to-End Machine Learning ProjectEnd-to-End Machine Learning Project
End-to-End Machine Learning Project
Eng Teong Cheah2.5K vistas
DutchMLSchool. Logistic Regression, Deepnets, Time Series por BigML, Inc
DutchMLSchool. Logistic Regression, Deepnets, Time SeriesDutchMLSchool. Logistic Regression, Deepnets, Time Series
DutchMLSchool. Logistic Regression, Deepnets, Time Series
BigML, Inc231 vistas
Interpretable Machine Learning por Sri Ambati
Interpretable Machine LearningInterpretable Machine Learning
Interpretable Machine Learning
Sri Ambati1.4K vistas
HR Analytics: Using Machine Learning to Predict Employee Turnover - Matt Danc... por Sri Ambati
HR Analytics: Using Machine Learning to Predict Employee Turnover - Matt Danc...HR Analytics: Using Machine Learning to Predict Employee Turnover - Matt Danc...
HR Analytics: Using Machine Learning to Predict Employee Turnover - Matt Danc...
Sri Ambati2.6K vistas
DutchMLSchool. Machine Learning End-to-End por BigML, Inc
DutchMLSchool. Machine Learning End-to-EndDutchMLSchool. Machine Learning End-to-End
DutchMLSchool. Machine Learning End-to-End
BigML, Inc217 vistas
Recommendations for Building Machine Learning Software por Justin Basilico
Recommendations for Building Machine Learning SoftwareRecommendations for Building Machine Learning Software
Recommendations for Building Machine Learning Software
Justin Basilico6.2K vistas
Starting data science with kaggle.com por Nathaniel Shimoni
Starting data science with kaggle.comStarting data science with kaggle.com
Starting data science with kaggle.com
Nathaniel Shimoni2.4K vistas
Machine learning 101 sit hvr por fredverheul
Machine learning 101 sit hvrMachine learning 101 sit hvr
Machine learning 101 sit hvr
fredverheul939 vistas
Automated Machine Learning por safa cimenli
Automated Machine LearningAutomated Machine Learning
Automated Machine Learning
safa cimenli1.1K vistas
The Power of Auto ML and How Does it Work por Ivo Andreev
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it Work
Ivo Andreev2.8K vistas
UXPA 2013 CARD: A Collaborative Tool for Rapid Task Analysis and Design por Len Conte
UXPA 2013 CARD: A Collaborative Tool for Rapid Task Analysis and DesignUXPA 2013 CARD: A Collaborative Tool for Rapid Task Analysis and Design
UXPA 2013 CARD: A Collaborative Tool for Rapid Task Analysis and Design
Len Conte2.3K vistas
Automatic machine learning (AutoML) 101 por QuantUniversity
Automatic machine learning (AutoML) 101Automatic machine learning (AutoML) 101
Automatic machine learning (AutoML) 101
QuantUniversity2K vistas
Setting up Machine Learning Projects - Full Stack Deep Learning por Sergey Karayev
Setting up Machine Learning Projects - Full Stack Deep LearningSetting up Machine Learning Projects - Full Stack Deep Learning
Setting up Machine Learning Projects - Full Stack Deep Learning
Sergey Karayev104.2K vistas

Destacado

Data Science in Future Tense por
Data Science in Future TenseData Science in Future Tense
Data Science in Future TensePaco Nathan
7.8K vistas46 diapositivas
#MesosCon 2014: Spark on Mesos por
#MesosCon 2014: Spark on Mesos#MesosCon 2014: Spark on Mesos
#MesosCon 2014: Spark on MesosPaco Nathan
5.4K vistas45 diapositivas
How Apache Spark fits into the Big Data landscape por
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapePaco Nathan
4.7K vistas77 diapositivas
Apache Spark and the Emerging Technology Landscape for Big Data por
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataPaco Nathan
3.3K vistas53 diapositivas
Big Data is changing abruptly, and where it is likely heading por
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingPaco Nathan
6.1K vistas60 diapositivas
Future of data science as a profession por
Future of data science as a professionFuture of data science as a profession
Future of data science as a professionJose Quesada
1.7K vistas46 diapositivas

Destacado(20)

Data Science in Future Tense por Paco Nathan
Data Science in Future TenseData Science in Future Tense
Data Science in Future Tense
Paco Nathan7.8K vistas
#MesosCon 2014: Spark on Mesos por Paco Nathan
#MesosCon 2014: Spark on Mesos#MesosCon 2014: Spark on Mesos
#MesosCon 2014: Spark on Mesos
Paco Nathan5.4K vistas
How Apache Spark fits into the Big Data landscape por Paco Nathan
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
Paco Nathan4.7K vistas
Apache Spark and the Emerging Technology Landscape for Big Data por Paco Nathan
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
Paco Nathan3.3K vistas
Big Data is changing abruptly, and where it is likely heading por Paco Nathan
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely heading
Paco Nathan6.1K vistas
Future of data science as a profession por Jose Quesada
Future of data science as a professionFuture of data science as a profession
Future of data science as a profession
Jose Quesada1.7K vistas
Big data & data science challenges and opportunities por Jose Quesada
Big data & data science   challenges and opportunitiesBig data & data science   challenges and opportunities
Big data & data science challenges and opportunities
Jose Quesada1.2K vistas
Data Science in 2016: Moving Up por Paco Nathan
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving Up
Paco Nathan10.5K vistas
GalvanizeU Seattle: Eleven Almost-Truisms About Data por Paco Nathan
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About Data
Paco Nathan9.3K vistas
How Apache Spark fits in the Big Data landscape por Paco Nathan
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscape
Paco Nathan6.9K vistas
Use of standards and related issues in predictive analytics por Paco Nathan
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analytics
Paco Nathan2K vistas
Databricks Meetup @ Los Angeles Apache Spark User Group por Paco Nathan
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
Paco Nathan2.5K vistas
How Apache Spark fits into the Big Data landscape por Paco Nathan
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
Paco Nathan7.6K vistas
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More por Paco Nathan
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Paco Nathan3.7K vistas
Microservices, Containers, and Machine Learning por Paco Nathan
Microservices, Containers, and Machine LearningMicroservices, Containers, and Machine Learning
Microservices, Containers, and Machine Learning
Paco Nathan8.1K vistas
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming por Paco Nathan
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Paco Nathan7.6K vistas
GraphX: Graph analytics for insights about developer communities por Paco Nathan
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
Paco Nathan11.8K vistas
A New Year in Data Science: ML Unpaused por Paco Nathan
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML Unpaused
Paco Nathan20.9K vistas
QCon São Paulo: Real-Time Analytics with Spark Streaming por Paco Nathan
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark Streaming
Paco Nathan19.7K vistas

Similar a OSCON 2014: Data Workflows for Machine Learning

How to build your own Delve: combining machine learning, big data and SharePoint por
How to build your own Delve: combining machine learning, big data and SharePointHow to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePointJoris Poelmans
6.1K vistas47 diapositivas
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02 por
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02BIWUG
1.1K vistas47 diapositivas
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis por
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - TrivadisTechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - TrivadisTrivadis
449 vistas41 diapositivas
Data science presentation por
Data science presentationData science presentation
Data science presentationMSDEVMTL
38.2K vistas25 diapositivas
Architecting for Data Science por
Architecting for Data ScienceArchitecting for Data Science
Architecting for Data ScienceJohann Schleier-Smith
1.3K vistas114 diapositivas
No more Three Tier - A path to a better code for Cloud and Azure por
No more Three Tier - A path to a better code for Cloud and AzureNo more Three Tier - A path to a better code for Cloud and Azure
No more Three Tier - A path to a better code for Cloud and AzureMarco Parenzan
419 vistas70 diapositivas

Similar a OSCON 2014: Data Workflows for Machine Learning(20)

How to build your own Delve: combining machine learning, big data and SharePoint por Joris Poelmans
How to build your own Delve: combining machine learning, big data and SharePointHow to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePoint
Joris Poelmans6.1K vistas
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02 por BIWUG
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
BIWUG1.1K vistas
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis por Trivadis
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - TrivadisTechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
Trivadis449 vistas
Data science presentation por MSDEVMTL
Data science presentationData science presentation
Data science presentation
MSDEVMTL38.2K vistas
No more Three Tier - A path to a better code for Cloud and Azure por Marco Parenzan
No more Three Tier - A path to a better code for Cloud and AzureNo more Three Tier - A path to a better code for Cloud and Azure
No more Three Tier - A path to a better code for Cloud and Azure
Marco Parenzan419 vistas
Big Data Meetup #7 por Paul Lo
Big Data Meetup #7Big Data Meetup #7
Big Data Meetup #7
Paul Lo441 vistas
Webinar: Scaling MongoDB por MongoDB
Webinar: Scaling MongoDBWebinar: Scaling MongoDB
Webinar: Scaling MongoDB
MongoDB9.1K vistas
Maintainable Machine Learning Products por Andrew Musselman
Maintainable Machine Learning ProductsMaintainable Machine Learning Products
Maintainable Machine Learning Products
Andrew Musselman975 vistas
Azure Machine Learning 101 por Renato Jovic
Azure Machine Learning 101Azure Machine Learning 101
Azure Machine Learning 101
Renato Jovic10.2K vistas
Keynote at-icpc-2020 por Ralf Laemmel
Keynote at-icpc-2020Keynote at-icpc-2020
Keynote at-icpc-2020
Ralf Laemmel239 vistas
SPSChicagoBurbs 2019 - What is CDM and CDS? por Nicolas Georgeault
SPSChicagoBurbs 2019 - What is CDM and CDS?SPSChicagoBurbs 2019 - What is CDM and CDS?
SPSChicagoBurbs 2019 - What is CDM and CDS?
Nicolas Georgeault361 vistas
Brochure data science learning path board-infinity (1) por NirupamNishant2
Brochure   data science learning path board-infinity (1)Brochure   data science learning path board-infinity (1)
Brochure data science learning path board-infinity (1)
NirupamNishant289 vistas
Machine learning model to production por Georg Heiler
Machine learning model to productionMachine learning model to production
Machine learning model to production
Georg Heiler4.4K vistas
The Challenges of Bringing Machine Learning to the Masses por Alice Zheng
The Challenges of Bringing Machine Learning to the MassesThe Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the Masses
Alice Zheng3.3K vistas
Best Practices for Building and Deploying Data Pipelines in Apache Spark por Databricks
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Databricks1.9K vistas
EVAIN Artificial intelligence and semantic annotation: are you serious about it? por FIAT/IFTA
EVAIN Artificial intelligence and semantic annotation: are you serious about it?EVAIN Artificial intelligence and semantic annotation: are you serious about it?
EVAIN Artificial intelligence and semantic annotation: are you serious about it?
FIAT/IFTA145 vistas
Machine Learning Models in Production por DataWorks Summit
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in Production
DataWorks Summit2.2K vistas
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdf por Altinity Ltd
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdfOSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdf
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdf
Altinity Ltd19 vistas

Más de Paco Nathan

Human in the loop: a design pattern for managing teams working with ML por
Human in the loop: a design pattern for managing  teams working with MLHuman in the loop: a design pattern for managing  teams working with ML
Human in the loop: a design pattern for managing teams working with MLPaco Nathan
2.9K vistas62 diapositivas
Human-in-the-loop: a design pattern for managing teams that leverage ML por
Human-in-the-loop: a design pattern for managing teams that leverage MLHuman-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLPaco Nathan
2.7K vistas46 diapositivas
Human-in-a-loop: a design pattern for managing teams which leverage ML por
Human-in-a-loop: a design pattern for managing teams which leverage MLHuman-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLPaco Nathan
1.4K vistas50 diapositivas
Humans in a loop: Jupyter notebooks as a front-end for AI por
Humans in a loop: Jupyter notebooks as a front-end for AIHumans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIPaco Nathan
1.8K vistas40 diapositivas
Humans in the loop: AI in open source and industry por
Humans in the loop: AI in open source and industryHumans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryPaco Nathan
1.5K vistas71 diapositivas
Computable Content por
Computable ContentComputable Content
Computable ContentPaco Nathan
674 vistas35 diapositivas

Más de Paco Nathan(14)

Human in the loop: a design pattern for managing teams working with ML por Paco Nathan
Human in the loop: a design pattern for managing  teams working with MLHuman in the loop: a design pattern for managing  teams working with ML
Human in the loop: a design pattern for managing teams working with ML
Paco Nathan2.9K vistas
Human-in-the-loop: a design pattern for managing teams that leverage ML por Paco Nathan
Human-in-the-loop: a design pattern for managing teams that leverage MLHuman-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage ML
Paco Nathan2.7K vistas
Human-in-a-loop: a design pattern for managing teams which leverage ML por Paco Nathan
Human-in-a-loop: a design pattern for managing teams which leverage MLHuman-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage ML
Paco Nathan1.4K vistas
Humans in a loop: Jupyter notebooks as a front-end for AI por Paco Nathan
Humans in a loop: Jupyter notebooks as a front-end for AIHumans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AI
Paco Nathan1.8K vistas
Humans in the loop: AI in open source and industry por Paco Nathan
Humans in the loop: AI in open source and industryHumans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industry
Paco Nathan1.5K vistas
Computable Content por Paco Nathan
Computable ContentComputable Content
Computable Content
Paco Nathan674 vistas
Computable Content: Lessons Learned por Paco Nathan
Computable Content: Lessons LearnedComputable Content: Lessons Learned
Computable Content: Lessons Learned
Paco Nathan675 vistas
SF Python Meetup: TextRank in Python por Paco Nathan
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in Python
Paco Nathan5.7K vistas
Jupyter for Education: Beyond Gutenberg and Erasmus por Paco Nathan
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and Erasmus
Paco Nathan10.5K vistas
Microservices, containers, and machine learning por Paco Nathan
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
Paco Nathan18.5K vistas
Graph Analytics in Spark por Paco Nathan
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
Paco Nathan21.8K vistas
What's new with Apache Spark? por Paco Nathan
What's new with Apache Spark?What's new with Apache Spark?
What's new with Apache Spark?
Paco Nathan6.9K vistas
Strata EU 2014: Spark Streaming Case Studies por Paco Nathan
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case Studies
Paco Nathan12.8K vistas
Brief Intro to Apache Spark @ Stanford ICME por Paco Nathan
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICME
Paco Nathan1.8K vistas

Último

Attacking IoT Devices from a Web Perspective - Linux Day por
Attacking IoT Devices from a Web Perspective - Linux Day Attacking IoT Devices from a Web Perspective - Linux Day
Attacking IoT Devices from a Web Perspective - Linux Day Simone Onofri
15 vistas68 diapositivas
Top 10 Strategic Technologies in 2024: AI and Automation por
Top 10 Strategic Technologies in 2024: AI and AutomationTop 10 Strategic Technologies in 2024: AI and Automation
Top 10 Strategic Technologies in 2024: AI and AutomationAutomationEdge Technologies
18 vistas14 diapositivas
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院 por
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院IttrainingIttraining
41 vistas8 diapositivas
virtual reality.pptx por
virtual reality.pptxvirtual reality.pptx
virtual reality.pptxG036GaikwadSnehal
11 vistas15 diapositivas
Five Things You SHOULD Know About Postman por
Five Things You SHOULD Know About PostmanFive Things You SHOULD Know About Postman
Five Things You SHOULD Know About PostmanPostman
30 vistas43 diapositivas
Lilypad @ Labweek, Istanbul, 2023.pdf por
Lilypad @ Labweek, Istanbul, 2023.pdfLilypad @ Labweek, Istanbul, 2023.pdf
Lilypad @ Labweek, Istanbul, 2023.pdfAlly339821
9 vistas45 diapositivas

Último(20)

Attacking IoT Devices from a Web Perspective - Linux Day por Simone Onofri
Attacking IoT Devices from a Web Perspective - Linux Day Attacking IoT Devices from a Web Perspective - Linux Day
Attacking IoT Devices from a Web Perspective - Linux Day
Simone Onofri15 vistas
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院 por IttrainingIttraining
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院
Five Things You SHOULD Know About Postman por Postman
Five Things You SHOULD Know About PostmanFive Things You SHOULD Know About Postman
Five Things You SHOULD Know About Postman
Postman30 vistas
Lilypad @ Labweek, Istanbul, 2023.pdf por Ally339821
Lilypad @ Labweek, Istanbul, 2023.pdfLilypad @ Labweek, Istanbul, 2023.pdf
Lilypad @ Labweek, Istanbul, 2023.pdf
Ally3398219 vistas
Case Study Copenhagen Energy and Business Central.pdf por Aitana
Case Study Copenhagen Energy and Business Central.pdfCase Study Copenhagen Energy and Business Central.pdf
Case Study Copenhagen Energy and Business Central.pdf
Aitana16 vistas
Piloting & Scaling Successfully With Microsoft Viva por Richard Harbridge
Piloting & Scaling Successfully With Microsoft VivaPiloting & Scaling Successfully With Microsoft Viva
Piloting & Scaling Successfully With Microsoft Viva
Richard Harbridge12 vistas
6g - REPORT.pdf por Liveplex
6g - REPORT.pdf6g - REPORT.pdf
6g - REPORT.pdf
Liveplex10 vistas
Transcript: The Details of Description Techniques tips and tangents on altern... por BookNet Canada
Transcript: The Details of Description Techniques tips and tangents on altern...Transcript: The Details of Description Techniques tips and tangents on altern...
Transcript: The Details of Description Techniques tips and tangents on altern...
BookNet Canada135 vistas
Data-centric AI and the convergence of data and model engineering: opportunit... por Paolo Missier
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
Paolo Missier39 vistas
DALI Basics Course 2023 por Ivory Egg
DALI Basics Course  2023DALI Basics Course  2023
DALI Basics Course 2023
Ivory Egg16 vistas
Web Dev - 1 PPT.pdf por gdsczhcet
Web Dev - 1 PPT.pdfWeb Dev - 1 PPT.pdf
Web Dev - 1 PPT.pdf
gdsczhcet60 vistas
Igniting Next Level Productivity with AI-Infused Data Integration Workflows por Safe Software
Igniting Next Level Productivity with AI-Infused Data Integration Workflows Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Safe Software257 vistas
Special_edition_innovator_2023.pdf por WillDavies22
Special_edition_innovator_2023.pdfSpecial_edition_innovator_2023.pdf
Special_edition_innovator_2023.pdf
WillDavies2217 vistas
Voice Logger - Telephony Integration Solution at Aegis por Nirmal Sharma
Voice Logger - Telephony Integration Solution at AegisVoice Logger - Telephony Integration Solution at Aegis
Voice Logger - Telephony Integration Solution at Aegis
Nirmal Sharma31 vistas

OSCON 2014: Data Workflows for Machine Learning

  • 1. Data Workflows for 
 Machine Learning ! Paco Nathan
 @pacoid http://liber118.com/pxn/
  • 2. Why is this talk here? Machine Learning in production apps is less and less about algorithms (even though that work is quite fun and vital) ! Performing real work is more about: ! socializing a problem within an organization ! feature engineering (“Beyond Product Managers”) ! tournaments in CI/CD environments ! operationalizing high-ROI apps at scale ! etc. ! So I’ll just crawl out on a limb and state that leveraging great frameworks to build data workflows is more important than 
 chasing after diminishing returns on highly nuanced algorithms. ! Because Interwebs!
  • 3. Data Workflows for Machine Learning Middleware has been evolving for Big Data, and there are some great examples — we’ll review several. Process has been evolving too, right along with the use cases. ! Popular frameworks typically provide some Machine Learning capabilities within their core components, or at least among their major use cases. ! Let’s consider features from Enterprise DataWorkflows as a basis 
 for what’s needed in Data Workflows for Machine Learning. ! Their requirements for scale, robustness, cost trade-offs, interdisciplinary teams, etc., serve as guides in general.
  • 4. Caveat Auditor I won’t claim to be expert with each of the frameworks and environments described in this talk. Expert with a few of them perhaps, but more to the point: embroiled in many use cases. ! This talk attempts to define a “scorecard” for evaluating important ML data workflow features: what’s needed for 
 use cases, compare and contrast of what’s available, plus 
 some indication of which frameworks are likely to be best 
 for a given scenario. ! Seriously, this is a work in progress.
  • 5. Outline • Definition: Machine Learning • Definition: Data Workflows • A whole bunch o’ examples across several platforms • Nine points to discuss, leading up to a scorecard • Because Notebooks • Questions, comments, flying tomatoes…
  • 6. Data Workflows for 
 Machine Learning: 
 Frame the question… ! “A Basis for What’s Needed” 

  • 7. “Machine learning algorithms can figure out how to perform
 important tasks by generalizing from examples.This is often
 feasible and cost-effective where manual programming is not. 
 As more data becomes available, more ambitious problems 
 can be tackled. As a result, machine learning is widely used 
 in computer science and other fields.” Pedro Domingos, U Washington
 A Few Useful Things to Know about Machine Learning Definition: Machine Learning • overfitting (variance) • underfitting (bias) • “perfect classifier” (no free lunch) [learning as generalization] =
 [representation] + [evaluation] + [optimization]
  • 8. Definition: Machine Learning … Conceptually
 ! ! 1. real-world data 2. representation – as graphs, 
 time series, geo, etc. 3. convert to sparse matrix for production work 
 leveraging abstract algebra + func programming 4. cost-effective parallel processing for ML apps 
 at scale, live use cases
  • 9. discovery discovery modeling modeling integration integration appsapps systems systems business process, stakeholder data prep, discovery, modeling, etc. software engineering, automation systems engineering, availability data science Data Scientist App Dev Ops Domain Expert Definition: Machine Learning … Interdisciplinary Teams, Needs × Roles
  • 10. Definition: Machine Learning … Process, Feature Engineering evaluationrepresentation optimization use cases feature engineering data sources data sources data prep pipeline train test hold-outs learners classifiers classifiers classifiers data sources scoring
  • 11. Definition: Machine Learning … Process, Tournaments evaluationrepresentation optimization use cases feature engineering tournaments data sources data sources data prep pipeline train test hold-outs learners classifiers classifiers classifiers data sources scoring quantify and measure: • benefit? • risk? • operational costs? • obtain other data? • improve metadata? • refine representation? • improve optimization? • iterate with stakeholders? • can models (inference) inform first principles approaches?
  • 12. monoid (an alien life form, courtesy of abstract algebra): ! binary associative operator ! closure within a set of objects ! unique identity element what are those good for? ! composable functions in workflows ! compiler “hints” on steroids, to parallelize ! reassemble results minimizing bottlenecks at scale ! reusable components, like Lego blocks ! think: Docker for business logic check out http://justenoughmath.com/ kudos:Avi Bryant, Oscar Boykin, Sam Ritchie, Jimmy Lin, et al. Definition: Machine Learning … Invasion of the Monoids
  • 13. Definition: Machine Learning evaluationrepresentation optimization use cases feature engineering tournaments data sources data sources data prep pipeline train test hold-outs learners classifiers classifiers classifiers data sources scoring quantify and measure: • benefit? • risk? • operational costs? • obtain other data? • improve metadata? • refine representation? • improve optimization? • iterate with stakeholders? • can models (inference) inform first principles approaches? discovery discovery modeling modeling integration integration appsapps systems systems data science Data Scientist App Dev Ops Domain Expert introduced capability
  • 14. Definition evaluationrepresentation optimization use cases feature engineering tournaments data sources data sources data prep pipeline train test hold-outs learners classifiers classifiers classifiers data sources scoring quantify and measure: • benefit? • risk? • operational costs? • obtain other data? • improve metadata? • refine representation? • improve optimization? • iterate with stakeholders? • can models (inference) inform first principles approaches? discovery discovery modeling modeling integration integration appsapps systems systems data science Data Scientist App Dev Ops Domain Expert introduced capability Can we fit the required 
 process into generalized 
 workflow definitions?
  • 15. Definition: Data Workflows Middleware, effectively, is evolving for Big Data and Machine Learning…
 The following design pattern — a DAG — shows up in many places, 
 via many frameworks: ETL data prep predictive model data sources end uses
  • 16. Definition: Data Workflows definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end uses ANSI SQL for ETL
  • 17. Definition: Data Workflows definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end usesJava, Pig, etc., for business logic
  • 18. Definition: Data Workflows definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end uses SAS for predictive models

  • 19. Definition: Data Workflows definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end uses SAS for predictive models ANSI SQL for ETL most of the licensing costs…
  • 20. Definition: Data Workflows definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end usesJava, Pig, etc., most of the project costs…
  • 21. Definition: Data Workflows definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end usesJava, Pig, etc., most of the project costs… Something emerges 
 to fill these needs, 
 for instance…
  • 22. ETL data prep predictive model data sources end uses Lingual: DW → ANSI SQL Pattern: SAS, R, etc. → PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Definition: Data Workflows For example, Cascading and related projects implement the following components, based on 100% open source: cascading.org a compiler sees it all…
 one connected DAG: • troubleshooting • exception handling • notifications • some optimizations
  • 23. ETL data prep predictive model data sources end uses Lingual: DW → ANSI SQL Pattern: SAS, R, etc. → PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Definition: Data Workflows Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source ! ! FlowDef flowDef = FlowDef.flowDef()! .setName( "etl" )! .addSource( "example.employee", emplTap )! .addSource( "example.sales", salesTap )! .addSink( "results", resultsTap );!  ! SQLPlanner sqlPlanner = new SQLPlanner()! .setSql( sqlStatement );!  ! flowDef.addAssemblyPlanner( sqlPlanner );! ! ! cascading.org
  • 24. ETL data prep predictive model data sources end uses Lingual: DW → ANSI SQL Pattern: SAS, R, etc. → PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Definition: Data Workflows Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source ! ! FlowDef flowDef = FlowDef.flowDef()! .setName( "classifier" )! .addSource( "input", inputTap )! .addSink( "classify", classifyTap );!  ! PMMLPlanner pmmlPlanner = new PMMLPlanner()! .setPMMLInput( new File( pmmlModel ) )! .retainOnlyActiveIncomingFields();!  ! flowDef.addAssemblyPlanner( pmmlPlanner );! ! !
  • 25. Enterprise DataWorkflows with Cascading O’Reilly, 2013 shop.oreilly.com/product/ 0636920028536.do ETL data prep predictive model data sources end uses
  • 26. Enterprise DataWorkflows with Cascading O’Reilly, 2013 shop.oreilly.com/product/ 0636920028536.do ETL data prep predictive model data sources end uses This begs an update…
  • 27. Enterprise DataWorkflows with Cascading O’Reilly, 2013 shop.oreilly.com/product/ 0636920028536.do ETL data prep predictive model data sources end uses Because…
  • 29. Data Workflows for 
 Machine Learning: 
 Examples… ! “Neque per exemplum” 

  • 30. Example: KNIME “a user-friendly graphical workbench for the entire analysis
 process: data access, data transformation, initial investigation,
 powerful predictive analytics, visualisation and reporting.” • large number of integrations (over 1000 modules) • ranked #1 in customer satisfaction among 
 open source analytics frameworks • visual editing of reusable modules • leverage prior work in R, Perl, etc. • Eclipse integration • easily extended for new integrations knime.org
  • 31. Example: Python stack Python has much to offer – ranging across an 
 organization, not just for the analytics staff • ipython.org • pandas.pydata.org • scikit-learn.org • numpy.org • scipy.org • code.google.com/p/augustus • continuum.io • nltk.org • matplotlib.org
  • 32. Example: Julia “a high-level, high-performance dynamic programming language
 for technical computing, with syntax that is familiar to users of
 other technical computing environments” • significantly faster than most alternatives • built to leverage parallelism, cloud computing • still relatively new — one to watch! importall Base! ! type BubbleSort <: Sort.Algorithm end! ! function sort!(v::AbstractVector, lo::Int, hi::Int, ::BubbleSort, o::Sort.Ordering)!     while true!         clean = true!         for i = lo:hi-1!             if Sort.lt(o, v[i+1], v[i])!                 v[i+1], v[i] = v[i], v[i+1]!                 clean = false!             end!         end!         clean && break!     end!     return v! end! julialang.org
  • 33. Example: Summingbird “a library that lets you write streaming MapReduce programs
 that look like native Scala or Java collection transformations 
 and execute them on a number of well-known distributed
 MapReduce platforms like Storm and Scalding.” • switch between Storm, Scalding (Hadoop) • Spark support is in progress • leverage Algebird, Storehaus, Matrix API, etc. github.com/twitter/summingbird def wordCount[P <: Platform[P]]! (source: Producer[P, String], store: P#Store[String, Long]) =! source.flatMap { sentence => ! toWords(sentence).map(_ -> 1L)! }.sumByKey(store)
  • 34. Example: Scalding import com.twitter.scalding._!  ! class WordCount(args : Args) extends Job(args) {! Tsv(args("doc"),! ('doc_id, 'text),! skipHeader = true)! .read! .flatMap('text -> 'token) {! text : String => text.split("[ [](),.]")! }! .groupBy('token) { _.size('count) }! .write(Tsv(args("wc"), writeHeader = true))! } github.com/twitter/scalding
  • 35. Example: Scalding • extends the Scala collections API so that distributed lists become “pipes” backed by Cascading • code is compact, easy to understand • nearly 1:1 between elements of conceptual flow diagram 
 and function calls • extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc. • significant investments by Twitter, Etsy, eBay, etc. • less learning curve than Cascalog • build scripts… those take a while to run : github.com/twitter/scalding
  • 36. Example: Cascalog (ns impatient.core!   (:use [cascalog.api]!         [cascalog.more-taps :only (hfs-delimited)])!   (:require [clojure.string :as s]!             [cascalog.ops :as c])!   (:gen-class))! ! (defmapcatop split [line]!   "reads in a line of string and splits it by regex"!   (s/split line #"[[](),.)s]+"))! ! (defn -main [in out & args]!   (?<- (hfs-delimited out)!        [?word ?count]!        ((hfs-delimited in :skip-header? true) _ ?line)!        (split ?line :> ?word)!        (c/count ?count)))! ! ; Paul Lam! ; github.com/Quantisan/Impatient cascalog.org
  • 37. Example: Cascalog • implements Datalog in Clojure, with predicates backed 
 by Cascading – for a highly declarative language • run ad-hoc queries from the Clojure REPL –
 approx. 10:1 code reduction compared with SQL • composable subqueries, used for test-driven development (TDD) practices at scale • Leiningen build: simple, no surprises, in Clojure itself • more new deployments than other Cascading DSLs – 
 Climate Corp is largest use case: 90% Clojure/Cascalog • has a learning curve, limited number of Clojure developers • aggregators are the magic, and those take effort to learn cascalog.org
  • 38. Example: Titan distributed graph database, 
 by Matthias Broecheler, Marko Rodriguez, et al. • scalable graph database optimized for storing and querying graphs • supports hundreds of billions of vertices and edges • transactional database that can support thousands of concurrent users executing complex graph traversals • supports search through Lucene, ElasticSearch • can be backed by HBase, Cassandra, BerkeleyDB • TinkerPop native graph stack with Gremlin, etc. // who is hercules' paternal grandfather?! g.V('name','hercules').out('father').out('father').name thinkaurelius.github.io/titan
  • 39. Example: MBrace “a .NET based software stack that enables easy large-scale
 distributed computation and big data analysis.” • declarative and concise distributed algorithms in 
 F# asynchronous workflows • scalable for apps in private datacenters or public 
 clouds, e.g.,Windows Azure or Amazon EC2 • tooling for interactive, REPL-style deployment, monitoring, management, debugging inVisual Studio • leverages monads (similar to Summingbird) • MapReduce as a library, but many patterns beyond m-brace.net let rec mapReduce (map: 'T -> Cloud<'R>)! (reduce: 'R -> 'R -> Cloud<'R>)! (identity: 'R)! (input: 'T list) =! cloud {! match input with! | [] -> return identity! | [value] -> return! map value! | _ ->! let left, right = List.split input! let! r1, r2 =! (mapReduce map reduce identity left)! <||>! (mapReduce map reduce identity right)! return! reduce r1 r2! }
  • 40. Example: Apache Spark in-memory cluster computing, 
 by Matei Zaharia, et al. • intended to make data analytics fast to write and to run • easy to use, at a higher level of abstraction • load data into memory and query it repeatedly, much more quickly than with Hadoop • APIs in Scala, Java and Python, shells in Scala and Python • integrations: Spark Streaming, MLlib, GraphX, Spark SQL, Tachyon, etc. // word count! val file = spark.textFile(“hdfs://…”)! val counts = file.flatMap(line => line.split(” “))! .map(word => (word, 1))! .reduceByKey(_ + _)! counts.saveAsTextFile(“hdfs://…”) spark-project.org The State of Spark, and WhereWe're Going Next Matei Zaharia Spark Summit (2013) youtu.be/nU6vO2EJAb4
  • 41. Example: Spark SQL “blurs the lines between RDDs and relational tables” intermix SQL commands to query external data, along with complex analytics, in a single app: • allows SQL extensions based on MLlib • Shark is being migrated to Spark SQL Spark SQL: Manipulating Structured Data Using Spark
 Michael Armbrust, Reynold Xin (2014-03-24)
 databricks.com/blog/2014/03/26/Spark- SQL-manipulating-structured-data-using- Spark.html
 people.apache.org/~pwendell/catalyst- docs/sql-programming-guide.html ! // data can be extracted from existing sources // e.g., Apache Hive val trainingDataTable = sql(""" SELECT e.action u.age, u.latitude, u.logitude FROM Users u JOIN Events e ON u.userId = e.userId""") ! // since `sql` returns an RDD, the results above // can be used in MLlib val trainingData = trainingDataTable.map { row => val features = Array[Double](row(1), row(2), row(3)) LabeledPoint(row(0), features) } ! val model = new LogisticRegressionWithSGD().run(trainingData) spark-project.org
  • 42. MapReduce General Batch Processing Pregel Giraph Dremel Drill Tez Impala GraphLab Storm S4 Specialized Systems: iterative, interactive, streaming, graph, etc. Example: Apache Spark spark-project.org 2002 2002 MapReduce @ Google 2004 MapReduce paper 2006 Hadoop @Yahoo! 2004 2006 2008 2010 2012 2014 2014 Apache Spark top-level 2010 Spark paper 2008 Hadoop Summit action value RDD RDD RDD transformations RDD How about a generalized engine for distributed, applicative systems – apps sharing code across multiple use cases: batch, iterative, streaming, etc.
  • 43. Data Workflows for Machine Learning: 
 Update the definitions… ! “Nine Points to Discuss”

  • 44. Data Workflows Formally speaking, workflows include people (cross-dept) 
 and define oversight for exceptional data …otherwise, data flow or pipeline would be more apt …examples include Cascading traps, with exceptional tuples 
 routed to a review organization, e.g., Customer Support includes people, defines oversight for exceptional data Hadoop Cluster source tap source tap sink tap trap tap customer profile DBsCustomer Prefs logs logs Logs Data Workflow Cache Customers Support Web App Reporting Analytics Cubes sink tap Modeling PMML 1
  • 45. Data Workflows Footnote – see also an excellent article related to this point 
 and the next: ! Therbligs for data science: 
 A nuts and bolts framework for accelerating data work Abe Gong @Jawbone blog.abegong.com/2014/03/therbligs-for-data-science.html strataconf.com/strata2014/public/schedule/detail/32291 brighttalk.com/webcast/9059/105119 ! ! ! includes people, defines oversight for exceptional data 1+
  • 46. Data Workflows Workflows impose a separation of concerns, allowing 
 for multiple abstraction layers in the tech stack …specify what is required, not how to accomplish it …articulating the business logic … Cascading leverages pattern language …related notions from Knuth, literate programming …not unlike BPM/BPEL for Big Data …examples: IPython, R Markdown, etc. separation of concerns, allows for literate programming 2
  • 47. Data Workflows Footnote – while discussing separation of concerns and a general design pattern, let’s consider data workflows in a broader context of probabilistic programming, since business data is heterogeneous and structured… 
 the data for every domain is heterogeneous. Paraphrasing Ginsberg: I’ve seen the best minds of my generation destroyed by madness, dragging themselves through quagmires of large LCD screens filled with Intellij debuggers and database command lines, yearning to fit real-world data into their preferred “deterministic” tools… separation of concerns, allows for literate programming 2+ Probabilistic Programming: 
 Why,What, How,When
 Beau Cronin
 Strata SC (2014)
 speakerdeck.com/beaucronin/ probabilistic-programming- strata-santa-clara-2014 Why Probabilistic Programming 
 Matters
 Rob Zinkov
 Convex Optimized (2012-06-27)
 zinkov.com/posts/ 2012-06-27-why-prob- programming-matters/
  • 48. Data Workflows Multiple abstraction layers in the tech stack are 
 needed, but still emerging …feedback loops based on machine data …optimizers feed on abstraction …metadata accounting: • track data lineage • propagate schema • model / feature usage • ensemble performance …app history accounting: • util stats, mixing workloads • heavy-hitters, bottlenecks • throughput, latency multiple abstraction layers 
 for metadata, feedback, 
 and optimization Cluster Scheduler Planners/ Optimizers Clusters DSLs Machine Data • app history • util stats • bottlenecksMixed Topologies Business Process Reusable Components Portable Models • data lineage • schema propagation • feature selection • tournaments 3
  • 49. Data Workflows Multiple needed, but still emerging …feedback loops based on …optimizers feed on abstraction …metadata accounting: • track data lineage • propagate • model / feature usage • ensemble performance …app history accounting: • util stats, mixing workloads • heavy-hitters, bottlenecks • throughput, latency multiple abstraction layers 
 for metadata, feedback, 
 and optimization Cluster Scheduler Planners/ Optimizers Clusters DSLs Machine Data • app history • util stats • bottlenecksMixed Topologies Business Process Reusable Components Portable Models • data lineage • schema propagation • feature selection • tournaments 3 Um, will compilers 
 ever look like this??
  • 50. Data Workflows Workflows must provide for test, which is no simple 
 matter …testing is required on three abstraction layers: • model portability/evaluation, ensembles, tournaments • TDD in reusable components and business process • continuous integration / continuous deployment …examples: Cascalog for TDD, PMML for model eval …still so much more to improve …keep in mind that workflows involve people, too testing: model evaluation, TDD, app deployment 4 Cluster Scheduler Planners/ Optimizers Clusters DSLs Machine Data • app history • util stats • bottlenecksMixed Topologies Business Process Reusable Components Portable Models • data lineage • schema propagation • feature selection • tournaments
  • 51. Data Workflows Workflows enable system integration …future-proof integrations and scale-out …build components and apps, not jobs and command lines …allow compiler optimizations across the DAG,
 i.e., cross-dept contributions …minimize operationalization costs: • troubleshooting, debugging at scale • exception handling • notifications, instrumentation …examples: KNIME, Cascading, etc. ! future-proof system integration, scale-out, ops Hadoop Cluster source tap source tap sink tap trap tap customer profile DBsCustomer Prefs logs logs Logs Data Workflow Cache Customers Support Web App Reporting Analytics Cubes sink tap Modeling PMML 5
  • 52. Data Workflows Visualizing workflows, what a great idea. …the practical basis for: • collaboration • rapid prototyping • component reuse …examples: KNIME wins best in category …Cascading generates flow diagrams, 
 which are a nice start visualizing allows people 
 to collaborate through code 6
  • 53. Data Workflows Abstract algebra, containerizing workflow metadata …monoids, semigroups, etc., allow for reusable components 
 with well-defined properties for running in parallel at scale …let business process be agnostic about underlying topologies, 
 with analogy to Linux containers (Docker, etc.) …compose functions, take advantage of sparsity (monoids) …because “data is fully functional” – FP for s/w eng benefits …aggregators are almost always “magic”, now we can solve 
 for common use cases …read Monoidify! Monoids as a Design Principle for Efficient MapReduce Algorithms by Jimmy Lin …Cascading introduced some notions of this circa 2007, 
 but didn’t make it explicit …examples: Algebird in Summingbird, Simmer, Spark, etc. abstract algebra and functional programming containerize business process 7
  • 54. Data Workflows Footnote – see also two excellent talks related to this point: ! Algebra for Analytics
 speakerdeck.com/johnynek/algebra-for-analytics
 Oscar Boykin, Strata SC (2014) Add ALL theThings:
 Abstract Algebra Meets Analytics
 infoq.com/presentations/abstract-algebra-analytics
 Avi Bryant, Strange Loop (2013) abstract algebra and functional programming containerize business process 7+
  • 55. Data Workflows Workflow needs vary in time, and need to blend time …something something batch, something something low-latency …scheduling batch isn’t hard; scheduling low-latency is 
 computationally brutal (see the Omega paper) …because people like saying “Lambda Architecture”, 
 it gives them goosebumps, or something …because real-time means so many different things …because batch windows are so 1964 …examples: Summingbird, Oryx blend results from different time scales: batch plus low- latency 8 Big Data
 Nathan Marz, James Warren
 manning.com/marz
  • 56. Data Workflows Workflows may define contexts in which model selection 
 possibly becomes a compiler problem …for example, see Boyd, Parikh, et al. …ultimately optimizing for loss function + regularization term …perhaps not ready for prime-time immediately …examples: MLbase optimize learners in context, to make model selection potentially a compiler problem 9 f(x): loss function g(z): regularization term
  • 57. Data Workflows For one of the best papers about what workflows really
 truly require, see Out of the Tar Pit by Moseley and Marks bonus
  • 58. Nine Points for Data Workflows — a wish list 1. includes people, defines oversight for exceptional data 2. separation of concerns, allows for literate programming 3. multiple abstraction layers for metadata, feedback, and optimization 4. testing: model evaluation, TDD, app deployment 5. future-proof system integration, scale-out, ops 6. visualizing workflows allows people to collaborate through code 7. abstract algebra and functional programming containerize business process 8. blend results from different time scales: 
 batch plus low-latency 9. optimize learners in context, to make model 
 selection potentially a compiler problem Cluster Scheduler Planners/ Optimizers Clusters DSLs Machine Data • app history • util stats • bottlenecksMixed Topologies Business Process Reusable Components Portable Models • data lineage • schema propagation • feature selection • tournaments
  • 59. Data Workflows for Machine Learning: 
 Summary… ! “feedback, discussion, outlook” 
 ?
  • 61. Nine Points for Data Workflows — a scorecard Spark Oryx Summing! bird Cascalog Cascading ! KNIME Py Data R Markdown MBrace includes people, exceptional data separation of concerns multiple 
 abstraction layers testing in depth future-proof 
 system integration visualize to collab can haz monoids blends batch + 
 “real-time” optimize learners 
 in context can haz PMML ? ✔ ✔ ✔ ✔ ✔
  • 62. Haskell Curry
 haskell.org Alonso Church
 wikipedia.org The General Case: Theory, Eight Decades Ago: Haskell Curry, known for seminal work on combinatory logic (1927) Alonzo Church, known for lambda calculus (1936) and much more! ! Both sought formal answers to the question, “What can be computed?”
  • 63. The General Case: Praxis, Four Decades Ago: Leveraging lambda calculus, combinators, etc., to increase the parallelism of apps as applicative systems John Backus
 acm.org David Turner
 wikipedia.org “Can Programming Be Liberated from the von Neumann
 Style? A Functional Style and Its Algebra of Programs”
 ACMTuring Award (1977)
 stanford.edu/class/cs242/readings/backus.pdf “A new implementation technique for applicative languages”
 Turner, D.A. (1979)
 Softw: Pract. Exper., 9: 31–49. doi: 10.1002/spe.4380090105
  • 64. Notebooks, cloud-based and otherwise… For now, the best practices appear to be heading toward 
 a world of notebooks… • IPython => Jupyter • can be containerized to run in a cloud • great for team collaboration • Don Knuth sez: try literate programming • great for dashboards, as end-use data products • also good as endpoints for service-oriented architectures
 (which are what ML use cases should be thinking about)
  • 65. Ali Ghodsi, Databricks Cloud youtu.be/lO7LhVZrNwA?t=23m27s Notebooks, cloud-based and otherwise…
  • 67. monthly newsletter for updates, 
 events, conf summaries, etc.: liber118.com/pxn/ Enterprise Data Workflows with Cascading O’Reilly, 2013 shop.oreilly.com/product/0636920028536.do Just Enough Math O’Reilly, 2014 oreilly.com/go/enough_math/
 preview: youtu.be/TQ58cWgdCpA
  • 68. Scala by the Bay
 SF, Aug 8
 scalabythebay.org #MesosCon
 Chicago, Aug 21
 events.linuxfoundation.org/events/mesoscon Cassandra Summit
 SF, Sep 10
 cvent.com/events/cassandra-summit-2014 Strata NYC + Hadoop World
 NYC, Oct 15
 strataconf.com/stratany2014 Strata EU
 Barcelona, Nov 20
 strataconf.com/strataeu2014 Data Day Texas
 Austin, Jan 10
 datadaytexas.com calendar: