Spark DataFrames and ML Pipelines

Joseph K. Bradley
May 1, 2015
MLconf Seattle

Who am I?
Joseph K. Bradley
Ph.D. in ML from CMU, postdoc at Berkeley
Apache Spark committer
Software Engineer @ Databricks Inc.

Databricks Inc.
Founded by the creators of Spark & driving its development
Databricks Cloud: the best place to run Spark
Guess what…we’re hiring!
databricks.com/company/careers

Concise APIs in Python, Java, Scala … and R in Spark 1.4!
500+ enterprises using or planning to use Spark in production (blog)
Spark: SparkSQL, Streaming, MLlib, GraphX
Distributed computing engine
•  Built for speed, ease of use, and sophisticated analytics
•  Apache open source

Beyond Hadoop
Early adopters: (Data) Engineers
MapReduce & functional API
Data Scientists & Statisticians

Spark for Data Science
DataFrames
Intuitive manipulation of distributed structured data
Machine Learning Pipelines
Simple construction and tuning of ML workflows

[Figure: Google Trends for “dataframe”]

DataFrames
dept   age   name
Bio    48    H Smith
CS     54    A Turing
Bio    43    B Jones
Chem   61    M Kennedy
RDD API
DataFrame API
Data grouped into named columns

DataFrames
DSL for common tasks
•  Project, filter, aggregate, join, …
•  Metadata
•  UDFs
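
To make the DSL operations concrete, here is a minimal plain-Python sketch of project, filter, and aggregate applied to the example table (dept, age, name). It deliberately does not use the Spark API: the helpers `project`, `where`, and `avg_by` are hypothetical names that only illustrate what each operation computes on named columns.

```python
from collections import defaultdict

# Example table from the slides: rows with named columns.
rows = [
    {"dept": "Bio", "age": 48, "name": "H Smith"},
    {"dept": "CS", "age": 54, "name": "A Turing"},
    {"dept": "Bio", "age": 43, "name": "B Jones"},
    {"dept": "Chem", "age": 61, "name": "M Kennedy"},
]


def project(table, columns):
    """Keep only the named columns (hypothetical helper)."""
    return [{c: row[c] for c in columns} for row in table]


def where(table, predicate):
    """Keep only rows for which predicate(row) is true (hypothetical helper)."""
    return [row for row in table if predicate(row)]


def avg_by(table, key, value):
    """Group rows by `key` and average `value` per group (hypothetical helper)."""
    groups = defaultdict(list)
    for row in table:
        groups[row[key]].append(row[value])
    return {k: sum(v) / len(v) for k, v in groups.items()}


over_45 = where(rows, lambda r: r["age"] > 45)   # filter
names = project(over_45, ["name"])               # project
avg_age = avg_by(rows, "dept", "age")            # aggregate
```

A distributed DataFrame applies the same column-level operations across a cluster and can optimize them before running, which a list of dicts cannot do.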

Spark DataFrames
API inspired by R and Python Pandas
•  Python, Scala, Java (+ R in dev)
•  Pandas integration
Distributed DataFrame
Highly optimized

Spark DataFrames are fast
[Bar chart: runtime of aggregating 10 million int pairs (secs), axis 0 to 10, comparing RDD Scala, RDD Python, Spark Scala DF, and Spark Python DF]
Uses SparkSQL Catalyst optimizer

Spark for Data Science
DataFrames
•  Structured data
•  Familiar API based on R & Python Pandas
•  Distributed, optimized implementation
Machine Learning Pipelines
Simple construction and tuning of ML workflows

About Spark MLlib
Started @ Berkeley
•  Spark 0.8
Now (Spark 1.3)
•  Contributions from 50+ orgs, 100+ individuals
•  Growing coverage of distributed algorithms
Spark: SparkSQL, Streaming, MLlib, GraphX

About Spark MLlib
Classification
•  Logistic regression
•  Naive Bayes
•  Streaming logistic regression
•  Linear SVMs
•  Decision trees
•  Random forests
•  Gradient-boosted trees
Regression
•  Ordinary least squares
•  Ridge regression
•  Lasso
•  Isotonic regression
•  Decision trees
•  Random forests
•  Gradient-boosted trees
•  Streaming linear methods
Statistics
•  Pearson correlation
•  Spearman correlation
•  Online summarization
•  Chi-squared test
•  Kernel density estimation
Linear algebra
•  Local dense & sparse vectors & matrices
•  Distributed matrices
   •  Block-partitioned matrix
   •  Row matrix
   •  Indexed row matrix
   •  Coordinate matrix
•  Matrix decompositions
Frequent itemsets
•  FP-growth
Model import/export
Clustering
•  Gaussian mixture models
•  K-Means
•  Streaming K-Means
•  Latent Dirichlet Allocation
•  Power Iteration Clustering
Recommendation
•  Alternating Least Squares
Feature extraction & selection
•  Word2Vec
•  Chi-Squared selection
•  Hashing term frequency
•  Inverse document frequency
•  Normalizer
•  Standard scaler
•  Tokenizer

ML Workflows are complex
Image classification pipeline*
→ Specify pipeline
→ Inspect & debug
→ Re-run on new data
→ Tune parameters
* Evan Sparks. “ML Pipelines.” amplab.cs.berkeley.edu/ml-pipelines

Example: Text Classification
Goal: Given a text document, predict its topic.
Subject: Re: Lexan Polish?
Suggest McQuires #1 plastic
polish. It will help somewhat
but nothing will remove deep
scratches without making it
worse than it already is.
McQuires will do something...
Label: 1 = about science, 0 = not about science
Features: the document text
Dataset: “20 Newsgroups”, from UCI KDD Archive

ML Workflow
Load data → Extract features → Train model → Evaluate

Load Data
Load data → Extract features → Train model → Evaluate
Data sources for DataFrames: built-in and external
{ JSON }, JDBC, and more …

Load Data
Load data → Extract features → Train model → Evaluate
Current data schema
label: Int
text: String

Extract Features
Load data → Tokenizer → Hashed Term Freq. → Train model → Evaluate
Transformer: Tokenizer, Hashed Term Freq.
Current data schema
label: Int
text: String
words: Seq[String]
features: Vector
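
The two feature-extraction steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Spark's implementation: `tokenize` and `hashed_term_freq` are hypothetical helpers, tokenization is a simple lowercase-and-split, `zlib.crc32` stands in for whatever hash function Spark uses, and the label value is a placeholder rather than a value taken from the dataset.

```python
import zlib
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def hashed_term_freq(words, num_features):
    """Hash each word to a bucket in [0, num_features) and count occurrences.

    Returns a sparse vector as {bucket_index: count}. zlib.crc32 is a
    deterministic stand-in, not necessarily the hash Spark uses.
    """
    buckets = (zlib.crc32(w.encode("utf-8")) % num_features for w in words)
    return dict(Counter(buckets))


# One row following the slide's schema; the label is a placeholder.
row = {"label": 0, "text": "Suggest McQuires #1 plastic polish"}
row["words"] = tokenize(row["text"])                                  # words: Seq[String]
row["features"] = hashed_term_freq(row["words"], num_features=1000)  # features: Vector
```

Hashing avoids building a vocabulary up front. The trade-off is that two different words can land in the same bucket, which becomes more likely as `num_features` shrinks.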

Train a Model
Load data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate
Transformer: Tokenizer, Hashed Term Freq.
Estimator: Logistic Regression
Current data schema
label: Int
text: String
words: Seq[String]
features: Vector
prediction: Int

Evaluate the Model
Load data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate
Transformer: Tokenizer, Hashed Term Freq.
Estimator: Logistic Regression
Evaluator: Evaluate
Current data schema
label: Int
text: String
words: Seq[String]
features: Vector
prediction: Int
By default, always append new columns
→ Can go back & inspect intermediate results
→ Made efficient by DataFrame optimizations

ML Pipelines
Load data → Pipeline (Tokenizer → Hashed Term Freq. → Logistic Regression) → Evaluate
Test data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate
Re-run exactly the same way
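
The Transformer / Estimator / Pipeline idea can be sketched in plain Python as below. This is a toy illustration, not the Spark ML API: every class here is hypothetical, rows are plain dicts, and `VoteClassifier` is a deliberately simple stand-in for logistic regression. What it tries to show is that each stage appends a column, that fitting an Estimator yields a Transformer, and that the fitted pipeline is re-run unchanged on test data.

```python
import zlib
from collections import Counter


class Tokenizer:
    """Transformer: appends a `words` column computed from `text`."""
    def transform(self, rows):
        return [{**r, "words": r["text"].lower().split()} for r in rows]


class HashingTF:
    """Transformer: appends a sparse `features` column computed from `words`."""
    def __init__(self, num_features=1 << 20):
        self.num_features = num_features

    def transform(self, rows):
        out = []
        for r in rows:
            idx = (zlib.crc32(w.encode("utf-8")) % self.num_features
                   for w in r["words"])
            out.append({**r, "features": dict(Counter(idx))})
        return out


class VoteModel:
    """Fitted model (a Transformer): appends a `prediction` column."""
    def __init__(self, weights):
        self.weights = weights

    def transform(self, rows):
        out = []
        for r in rows:
            score = sum(self.weights.get(i, 0) * c
                        for i, c in r["features"].items())
            out.append({**r, "prediction": 1 if score > 0 else 0})
        return out


class VoteClassifier:
    """Estimator: fit() reads `features` and `label` and returns a VoteModel.

    Stand-in for logistic regression: each feature votes +count for
    label 1 and -count for label 0.
    """
    def fit(self, rows):
        weights = Counter()
        for r in rows:
            sign = 1 if r["label"] == 1 else -1
            for i, c in r["features"].items():
                weights[i] += sign * c
        return VoteModel(dict(weights))


class PipelineModel:
    """Transformer: applies every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Estimator that chains stages; fit() returns a PipelineModel."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):       # Estimator: fit it, keep the model
                stage = stage.fit(rows)
            rows = stage.transform(rows)    # pass appended columns downstream
            fitted.append(stage)
        return PipelineModel(fitted)


train = [
    {"label": 1, "text": "gravity bends light"},
    {"label": 0, "text": "plastic polish scratches"},
]
test = [{"text": "light and gravity"}]

model = Pipeline([Tokenizer(), HashingTF(), VoteClassifier()]).fit(train)
predictions = model.transform(test)   # same stages, re-run on new data
```

Because each stage returns new rows with one extra column, the intermediate `words` and `features` values remain available for inspection in `predictions`.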

Parameter Tuning
Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate
hashingTF.numFeatures: {100, 1000, 10000}
lr.regParam: {0.01, 0.1, 0.5}
CrossValidator
Given:
•  Estimator
•  Parameter grid
•  Evaluator
Find best parameters
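
The search over a parameter grid can be sketched in plain Python as below. This is a minimal illustration of grid search with k-fold cross-validation, not Spark's CrossValidator: `param_grid`, `k_folds`, and `cross_validate` are hypothetical helpers, the round-robin fold assignment and `k=3` are assumptions, and `fit` and `evaluate` are caller-supplied functions standing in for the Estimator and the Evaluator.

```python
from itertools import product


def param_grid(grid):
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = list(grid)
    return [dict(zip(names, combo))
            for combo in product(*(grid[n] for n in names))]


def k_folds(rows, k):
    """Yield (train, validation) splits; row i is held out in fold i % k."""
    for fold in range(k):
        train = [r for i, r in enumerate(rows) if i % k != fold]
        valid = [r for i, r in enumerate(rows) if i % k == fold]
        yield train, valid


def cross_validate(fit, evaluate, grid, rows, k=3):
    """Return (best_params, best_score) by mean validation score.

    fit(train_rows, params) -> model       (plays the Estimator role)
    evaluate(model, valid_rows) -> float   (plays the Evaluator role; higher is better)
    """
    best_params, best_score = None, float("-inf")
    for params in param_grid(grid):
        scores = [evaluate(fit(train, params), valid)
                  for train, valid in k_folds(rows, k)]
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_params, best_score = params, mean
    return best_params, best_score


# The grid from the slide: 3 x 3 = 9 candidate settings.
grid = {
    "lr.regParam": [0.01, 0.1, 0.5],
    "hashingTF.numFeatures": [100, 1000, 10000],
}
candidates = param_grid(grid)
```

Each candidate is trained `k` times, so the cost is the grid size multiplied by the number of folds. That is one reason to keep the grid small.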

Recap
DataFrames
•  Structured data
•  Familiar API based on R & Python Pandas
•  Distributed, optimized implementation
Machine Learning Pipelines
•  Integration with DataFrames
•  Familiar API based on scikit-learn
•  Simple parameter tuning
Composable & DAG Pipelines
Schema validation
User-defined Transformers & Estimators

Looking Ahead
Collaborations with UC Berkeley & others
•  Auto-tuning models
DataFrames
•  Further optimization
•  API for R
ML Pipelines
•  More algorithms & pluggability
•  API for R

Thank you!
Spark documentation: spark.apache.org
Pipelines blog post: databricks.com/blog/2015/01/07
DataFrames blog post: databricks.com/blog/2015/02/17
Databricks Cloud Platform: databricks.com/product
Spark MOOCs on edX: Intro to Spark & ML with Spark
Spark Packages: spark-packages.org
