Presented at MLconf Seattle, this talk gives a quick introduction to Apache Spark, followed by an overview of two new features for data science: DataFrames and Machine Learning Pipelines.
2. Who am I?
Joseph K. Bradley
Ph.D. in ML from CMU, postdoc at Berkeley
Apache Spark committer
Software Engineer @ Databricks Inc.
3. Databricks Inc.
Founded by the creators of Spark & driving its development
Databricks Cloud: the best place to run Spark
Guess what…we’re hiring!
databricks.com/company/careers
4.
Concise APIs in Python, Java, Scala … and R in Spark 1.4!
500+ enterprises using or planning to use Spark in production (blog)
Spark, SparkSQL, Streaming, MLlib, GraphX
Distributed computing engine
• Built for speed, ease of use, and sophisticated analytics
• Apache open source
6. Spark for Data Science
DataFrames
Intuitive manipulation of distributed structured data
Machine Learning Pipelines
Simple construction and tuning of ML workflows
8. DataFrames
dept  age  name
Bio   48   H Smith
CS    54   A Turing
Bio   43   B Jones
Chem  61   M Kennedy
RDD API
DataFrame API
Data grouped into named columns
9. DataFrames
dept  age  name
Bio   48   H Smith
CS    54   A Turing
Bio   43   B Jones
Chem  61   M Kennedy
Data grouped into named columns
DSL for common tasks
• Project, filter, aggregate, join, …
• Metadata
• UDFs
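The project, filter, and aggregate operations listed above can be illustrated on the slide's four-row table. This is a minimal sketch in plain Python, not the Spark DataFrame API: the helper names (`project`, `filter_rows`, `avg_by`) and the `age > 45` condition are assumptions made for illustration only.

```python
from collections import defaultdict

# Rows from the slide's example table.
rows = [
    {"dept": "Bio", "age": 48, "name": "H Smith"},
    {"dept": "CS", "age": 54, "name": "A Turing"},
    {"dept": "Bio", "age": 43, "name": "B Jones"},
    {"dept": "Chem", "age": 61, "name": "M Kennedy"},
]

def project(rows, columns):
    """Keep only the named columns of each row (hypothetical helper)."""
    return [{c: r[c] for c in columns} for r in rows]

def filter_rows(rows, predicate):
    """Keep only the rows for which the predicate holds (hypothetical helper)."""
    return [r for r in rows if predicate(r)]

def avg_by(rows, key, value):
    """Group rows by `key` and average `value` within each group."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r[value])
    return {k: sum(v) / len(v) for k, v in groups.items()}

older = filter_rows(rows, lambda r: r["age"] > 45)  # filter
names = project(older, ["name"])                    # project
avg_age = avg_by(rows, "dept", "age")               # aggregate
```

Because every operation works on named columns, the intent (which column is filtered, which is averaged) stays visible in the call itself.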
10. Spark DataFrames
API inspired by R and Python Pandas
• Python, Scala, Java (+ R in dev)
• Pandas integration
Distributed DataFrame
Highly optimized
11. Spark DataFrames are fast
[Figure: bar chart of runtime of aggregating 10 million int pairs (secs), x-axis 0 to 10, for RDD Scala, RDD Python, Spark Scala DF, and Spark Python DF; lower is better]
Uses SparkSQL Catalyst optimizer
13. Spark for Data Science
DataFrames
• Structured data
• Familiar API based on R & Python Pandas
• Distributed, optimized implementation
Machine Learning Pipelines
Simple construction and tuning of ML workflows
14. About Spark MLlib
Started @ Berkeley
• Spark 0.8
Now (Spark 1.3)
• Contributions from 50+ orgs, 100+ individuals
• Growing coverage of distributed algorithms
Spark, SparkSQL, Streaming, MLlib, GraphX
15. About Spark MLlib
Classification
• Logistic regression
• Naive Bayes
• Streaming logistic regression
• Linear SVMs
• Decision trees
• Random forests
• Gradient-boosted trees
Regression
• Ordinary least squares
• Ridge regression
• Lasso
• Isotonic regression
• Decision trees
• Random forests
• Gradient-boosted trees
• Streaming linear methods
Statistics
• Pearson correlation
• Spearman correlation
• Online summarization
• Chi-squared test
• Kernel density estimation
Linear algebra
• Local dense & sparse vectors & matrices
• Distributed matrices
• Block-partitioned matrix
• Row matrix
• Indexed row matrix
• Coordinate matrix
• Matrix decompositions
Frequent itemsets
• FP-growth
Model import/export
Clustering
• Gaussian mixture models
• K-Means
• Streaming K-Means
• Latent Dirichlet Allocation
• Power Iteration Clustering
Recommendation
• Alternating Least Squares
Feature extraction & selection
• Word2Vec
• Chi-Squared selection
• Hashing term frequency
• Inverse document frequency
• Normalizer
• Standard scaler
• Tokenizer
16. ML Workflows are complex
Image classification pipeline*
* Evan Sparks. “ML Pipelines.” amplab.cs.berkeley.edu/ml-pipelines
→ Specify pipeline
→ Inspect & debug
→ Re-run on new data
→ Tune parameters
17. Example: Text Classification
Goal: Given a text document, predict its topic.
Subject: Re: Lexan Polish?
Suggest McQuires #1 plastic
polish. It will help somewhat
but nothing will remove deep
scratches without making it
worse than it already is.
McQuires will do something...
Features: the document text
Label: 1 = about science, 0 = not about science
Dataset: “20 Newsgroups” from UCI KDD Archive
22. Extract Features
Load data → Extract features → Train model → Evaluate
Extract features = Tokenizer → Hashed Term Freq. (each a Transformer)
Current data schema:
label: Int
text: String
words: Seq[String]
features: Vector
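The two feature-extraction steps above (text → words → features) can be sketched in a few lines. This is a plain-Python illustration under stated assumptions, not MLlib's implementation: the function names, the whitespace tokenization, the CRC32 hash, the bucket count of 16, and the placeholder label are all choices made for this example.

```python
import zlib

def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def hashed_term_freq(words, num_features=16):
    """Hash each word to a bucket index and count occurrences per bucket."""
    vec = [0] * num_features
    for w in words:
        # CRC32 is used here only because it is deterministic across runs.
        idx = zlib.crc32(w.encode("utf-8")) % num_features
        vec[idx] += 1
    return vec

# One row following the slide's schema; the label value is a placeholder,
# not taken from the dataset.
row = {"label": 0, "text": "Suggest McQuires #1 plastic polish"}
row["words"] = tokenize(row["text"])              # words: Seq[String]
row["features"] = hashed_term_freq(row["words"])  # features: Vector
```

Hashing avoids building a vocabulary up front: the vector length is fixed in advance, at the cost of occasional collisions between different words.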
23. Train a Model
Load data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate
Transformer: Tokenizer, Hashed Term Freq.
Estimator: Logistic Regression
Current data schema:
label: Int
text: String
words: Seq[String]
features: Vector
prediction: Int
24. Evaluate the Model
Load data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate
Transformer: Tokenizer, Hashed Term Freq.
Estimator: Logistic Regression
Evaluator: Evaluate
Current data schema:
label: Int
text: String
words: Seq[String]
features: Vector
prediction: Int
By default, always append new columns
→ Can go back & inspect intermediate results
→ Made efficient by DataFrame optimizations
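The append-only behavior described above can be sketched as follows. This is a plain-Python stand-in, not the Spark API: `ColumnTransformer`, the `num_words` column, and the placeholder label are hypothetical names and values introduced for illustration.

```python
class ColumnTransformer:
    """Hypothetical stage: reads one column and appends another."""

    def __init__(self, input_col, output_col, func):
        self.input_col = input_col
        self.output_col = output_col
        self.func = func

    def transform(self, rows):
        # Copy each row and add the output column; existing columns are kept.
        return [{**r, self.output_col: self.func(r[self.input_col])} for r in rows]

tokenizer = ColumnTransformer("text", "words", lambda t: t.lower().split())
counter = ColumnTransformer("words", "num_words", len)

data = [{"label": 0, "text": "Re: Lexan Polish?"}]  # label is a placeholder
step1 = tokenizer.transform(data)
step2 = counter.transform(step1)
```

After both stages, `step2` still carries `label`, `text`, and `words` next to the new `num_words` column, so any intermediate result can be inspected without re-running earlier stages.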
25. ML Pipelines
Load data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate
Pipeline: Tokenizer → Hashed Term Freq. → Logistic Regression
Test data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate
Re-run exactly the same way
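The idea of fitting a pipeline once and re-running it unchanged on test data can be sketched in plain Python. This is an illustration under assumptions, not Spark's implementation: the class names mirror the slide's vocabulary, `WordVoteClassifier` is a hypothetical toy estimator standing in for logistic regression, and the training and test rows are made up.

```python
from collections import defaultdict

class Tokenizer:
    """Transformer: appends a 'words' column derived from 'text'."""
    def transform(self, rows):
        return [{**r, "words": r["text"].lower().split()} for r in rows]

class WordVoteModel:
    """Fitted model (a Transformer): appends a 'prediction' column."""
    def __init__(self, word_scores):
        self.word_scores = word_scores

    def transform(self, rows):
        out = []
        for r in rows:
            score = sum(self.word_scores.get(w, 0) for w in r["words"])
            out.append({**r, "prediction": 1 if score > 0 else 0})
        return out

class WordVoteClassifier:
    """Toy Estimator: fit() learns per-word votes and returns a model."""
    def fit(self, rows):
        scores = defaultdict(int)
        for r in rows:
            for w in r["words"]:
                scores[w] += 1 if r["label"] == 1 else -1
        return WordVoteModel(dict(scores))

class PipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

class Pipeline:
    """Chains stages; estimators are fitted, transformers are reused as-is."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # estimator becomes a fitted model
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)

# Made-up rows for illustration.
train = [
    {"label": 1, "text": "orbit telescope physics"},
    {"label": 0, "text": "plastic polish scratches"},
]
test = [{"text": "telescope orbit data"}, {"text": "deep scratches"}]

model = Pipeline([Tokenizer(), WordVoteClassifier()]).fit(train)
predictions = [r["prediction"] for r in model.transform(test)]
```

Because the fitted model holds every stage, the test data passes through the same tokenization as the training data without any step being repeated by hand.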
28. Recap
DataFrames
• Structured data
• Familiar API based on R & Python Pandas
• Distributed, optimized implementation
Machine Learning Pipelines
• Integration with DataFrames
• Familiar API based on scikit-learn
• Simple parameter tuning
• Composable & DAG Pipelines
• Schema validation
• User-defined Transformers & Estimators
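The "simple parameter tuning" point in the recap can be pictured as a grid search over pipeline parameters. The sketch below is plain Python and makes several assumptions: `grid_search` is a hypothetical helper, the parameter names `num_features` and `reg_param` are illustrative, and `toy_evaluate` is a stand-in for fitting a pipeline with the given parameters and scoring it on held-out data.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every parameter combination and return the best (params, score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_evaluate(params):
    """Stand-in score: peaks at num_features=1000, reg_param=0.1."""
    return -(params["num_features"] - 1000) ** 2 - (params["reg_param"] - 0.1) ** 2

grid = {"num_features": [10, 100, 1000], "reg_param": [0.0, 0.1, 1.0]}
best_params, best_score = grid_search(grid, toy_evaluate)
```

Treating the whole pipeline as one unit means feature-extraction parameters and model parameters can be searched together in a single grid.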
29. Looking Ahead
Collaborations with UC Berkeley & others
• Auto-tuning models
DataFrames
• Further optimization
• API for R
ML Pipelines
• More algorithms & pluggability
• API for R
30. Thank you!
Spark documentation: spark.apache.org
Pipelines blog post: databricks.com/blog/2015/01/07
DataFrames blog post: databricks.com/blog/2015/02/17
Databricks Cloud Platform: databricks.com/product
Spark MOOCs on edX: Intro to Spark & ML with Spark
Spark Packages: spark-packages.org