Spark ML Intro & Extending

Hi LA Apache Spark friends!
Thanks for coming early! Enjoy burritos and the
WWGuest WiFi network
(or you know in person networking works too)

Intro & Extending Spark ML
With your “friend” @holdenkarau
& friend Boo!
Hella-Legit

Who’s presenting:
● My name is Holden Karau
● Preferred pronouns are she/her
● I’m a Principal Software Engineer at IBM’s Spark Technology Center
● Apache Spark committer (as of January!) :)
● previously Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Spark Videos http://bit.ly/holdenSparkVideos

Who is Boo?
● Boo uses she/her pronouns (as I told the Texas house committee)
● Best doge
● Lots of experience barking at computers to make them go faster
● Author of “Learning to Bark” & “High Performance Barking”
● On twitter @BooProgrammer

IBM Spark Technology Center
Founded in 2015.
Location:
Physical: 505 Howard St., San Francisco CA
Web: http://spark.tc Twitter: @apachespark_tc
Mission:
Contribute intellectual and technical capital to the Apache Spark
community.
Make the core technology enterprise- and cloud-ready.
Build data science skills to drive intelligence into business
applications — http://bigdatauniversity.com
Key statistics:
About 50 developers, co-located with 25 IBM designers.
Major contributions to Apache Spark http://jiras.spark.tc
Apache SystemML is now an Apache Incubator project.
Founding member of UC Berkeley AMPLab and RISE Lab
Member of R Consortium and Scala Center

What we are going to explore together!
● Who I think you all are
● Spark’s two different ML APIs
● Running through a simple example with one
● Model save/load
● Discussion of “serving” options
● Extending Spark ML
● Optional take home exercises
● Finish up with a raffle & cookies (sponsored by Qubole!)
○ Please fill out your ticket if you want to win!

The different pieces of Spark
● Apache Spark
● SQL & DataFrames
● Streaming
● Language APIs: Scala, Java, Python, & R
● Graph Tools: Bagel & GraphX
● Spark ML & MLlib
● Community Packages
The different pieces of Spark: 2.0+
● Apache Spark
● SQL & DataFrames
● Streaming & Structured Streaming
● Language APIs: Scala, Java, Python, & R
● Graph Tools: Bagel & GraphX
● Spark ML & MLlib
● Community Packages

Who do I think you all are?
● Nice people*
● Some knowledge of Apache Spark core & maybe SQL
● Interested in using Spark for Machine Learning
● Familiar-ish with Scala or Java or Python
Amanda

If you're planning to follow along:
● Spark 2+ (Spark 2.2 would be best!)
○ (built with Hive support if building from source)
● Since this is a regular talk, you won’t have time to do the
exercises as we go -- but you can come back and finish
them after :)
Amanda

Some resources:
http://bit.ly/sparkDocs
http://bit.ly/sparkPyDocs OR http://bit.ly/sparkScalaDoc
http://bit.ly/sparkMLGuide
https://github.com/holdenk/spark-intro-ml-pipeline-workshop
http://www.slideshare.net/hkarau
Download census data
https://archive.ics.uci.edu/ml/datasets/Adult
Dwight Sipler

Getting some data for working with:
● census data:
https://archive.ics.uci.edu/ml/datasets/Adult
● Goal: predict income > 50k
● Also included in the github repo
● Download that now if you haven’t already
● We will add a header to the data
○ http://pastebin.ca/3318687
Till Westermayer

So what are the two APIs?
● Traditional and Pipeline
○ Pipeline is the new shiny future which will fix all problems*
● Traditional API works on RDDs
○ Data preparation work is generally done in traditional Spark
transformations
● Pipeline API works on DataFrames
○ Often we want to apply some transformations to our data before
feeding to the machine learning algorithm
○ Makes it easy to chain these together
(*until replaced by a newer shinier future)
Steve Jurvetson

So what are DataFrames / Datasets?
● Spark SQL’s version of RDDs of the world
○ It’s for more than just SQL
● Restricted data types, schema information, compile time
untyped*
○ Datasets add the types back
● Slightly restricted operations (more relational style)
○ Still support many of the same functional programming magic
○ map & friends are here to stay, but at a cost
● Allow lots of fun extra optimizations
○ Tungsten, Apache Arrow, etc.
● Not Pandas or R DataFrames

What is DataFrame performance like?
Andrew Skudder

Spark ML pipelines
[Diagram: a pipeline of stages (Tokenizer, HashingTF, String Indexer, Naive Bayes); calling fit(df) on an Estimator stage produces a Transformer]
● scikit-learn inspired
● Consist of Estimators and Transformers
So what does a pipeline stage look like?
Each stage is either an:
● Estimator - has a method called “fit” which returns a transformer
● Transformer - no need to train, can directly transform with “transform”
(e.g. HashingTF)
Both must provide:
● transformSchema* (used to validate input schema is reasonable) & copy
Often have:
● Parameters for configuration (think input columns, regularization, etc.)
Wendy Piersall
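To make the Estimator / Transformer split concrete, here is a tiny Spark-free sketch in plain Python. Everything in it (MeanCenterer, MeanCenterModel, rows as dicts) is made up for illustration and is not Spark's API; it only mirrors the shape of the contract: fit() looks at data and hands back something that can transform().

```python
# Toy, Spark-free sketch of the Estimator / Transformer contract.
# All class names are hypothetical; rows are plain dicts.

class Transformer:
    def transform(self, rows):
        raise NotImplementedError


class Estimator:
    def fit(self, rows):
        """Learn something from the data and return a Transformer."""
        raise NotImplementedError


class MeanCenterModel(Transformer):
    """The fitted result: holds the learned mean and applies it."""

    def __init__(self, col, mean):
        self.col = col
        self.mean = mean

    def transform(self, rows):
        # Return new rows with an extra column; the inputs are left untouched.
        return [dict(r, centered=r[self.col] - self.mean) for r in rows]


class MeanCenterer(Estimator):
    """Needs training: it has to see the data to learn the mean."""

    def __init__(self, col):
        self.col = col

    def fit(self, rows):
        values = [r[self.col] for r in rows]
        return MeanCenterModel(self.col, sum(values) / len(values))


rows = [{"age": 20.0}, {"age": 30.0}, {"age": 40.0}]
model = MeanCenterer("age").fit(rows)   # Estimator -> Transformer
centered = model.transform(rows)        # Transformer -> new rows
```

The real thing works on DataFrames and adds schema checks and params, but the two-step rhythm is the same.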

How are transformers made?
Estimator
data
class Estimator extends PipelineStage {
def fit(dataset: Dataset[_]): Transformer = {
// magic happens here
}
}
Transformer

Let’s start with loading some data
● We’ve got some CSV data, we could use textFile and
parse by hand
● spark-packages can save us by providing the spark-csv
package by Hossein Falaki
○ If we were building a Java project we could include the maven coordinates
○ For the Spark shell, when launching add:
--packages com.databricks:spark-csv_2.10:1.3.0
Jess Johnson

Loading with sparkSQL & spark-csv
sqlContext.read returns a DataFrameReader
We can specify general properties & data specific options
● option(“key”, “value”)
○ spark-csv ones we will use are header & inferSchema
● format(“formatName”)
○ built in formats include parquet, jdbc, etc. today we will use csv (added
in Spark 2.0, prior to that as com.databricks.spark.csv)
● load(“path”)
Jess Johnson

Loading with sparkSQL & spark-csv
# Parentheses let the chained calls span several lines in Python
df = (sqlContext.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("resources/adult.data"))
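If the header and inferSchema options feel like magic, this plain-Python sketch shows roughly what they mean. It is not how Spark implements them; load_csv, infer_type and the three-column sample are all invented for illustration, and the only assumption borrowed from the talk is that the first line carries the column names.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest type (int, then float, then str) that fits every value."""
    for caster in (int, float):
        try:
            for v in values:
                caster(v)
            return caster
        except ValueError:
            continue
    return str


def load_csv(text):
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = next(reader)                  # header=true: first line names the columns
    raw = [row for row in reader if row]
    columns = list(zip(*raw))
    # inferSchema=true: an extra pass over the data to guess each column's type
    types = [infer_type(col) for col in columns]
    rows = [{name: t(v) for name, t, v in zip(header, types, row)}
            for row in raw]
    return dict(zip(header, types)), rows


# Made-up sample, not the real census file
sample = "age,hours,category\n39, 40.5, <=50K\n50, 13, >50K\n"
schema, rows = load_csv(sample)
```

Note the cost hiding in there: guessing the types needs its own pass over the data before any row can be parsed.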

Let’s explore training a Decision Tree
● Step 1: Data loading (done!)
● Step 2: Data prep (select features, etc.)
● Step 3: Train
● Step 4: Predict

Data prep / cleaning
● We need to predict a double (can be 0.0, 1.0, but type
must be double)
● We need to train with a vector of features
Imports:
# The Pipeline API (pyspark.ml) uses the pyspark.ml.linalg vectors in Spark 2+
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.param import Param, Params
from pyspark.ml.feature import (Bucketizer, VectorAssembler,
                                StringIndexer)
from pyspark.ml import Pipeline
Huang
Yun
Chung

Data prep / cleaning continued
# Combines a list of double input features into a vector
assembler = VectorAssembler(inputCols=["age", "education-num"],
                            outputCol="features")
# String indexer converts a set of strings into doubles
indexer = (StringIndexer(inputCol="category")
           .setOutputCol("category-index"))
# Can be used to combine pipeline components together
pipeline = Pipeline().setStages([assembler, indexer])
Huang
Yun
Chung

So a bit more about that pipeline
● Each of our previous components has a “fit” and/or a “transform”
step
● Constructing the pipeline this way makes it easier to
work with (only need to call one fit & one transform)
● Can re-use the fitted model on future data
model=pipeline.fit(df)
prepared = model.transform(df)
Andrey
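Here is a rough idea of what "one fit & one transform" buys you, as a Spark-free toy in plain Python. The stage names (AddLength, MaxScaler) and the mini Pipeline class are hypothetical stand-ins, not Spark's implementation; the point is only that fit walks the stages once and the fitted result can be reused on data it has never seen.

```python
class AddLength:
    """A plain transformer: nothing to learn."""
    def transform(self, rows):
        return [dict(r, length=len(r["word"])) for r in rows]


class MaxScaler:
    """An estimator: fit() learns the largest length seen."""
    def fit(self, rows):
        return MaxScalerModel(max(r["length"] for r in rows))


class MaxScalerModel:
    def __init__(self, max_length):
        self.max_length = max_length

    def transform(self, rows):
        return [dict(r, scaled=r["length"] / self.max_length) for r in rows]


class PipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fit on the data as transformed so far;
            # plain transformers are used as they are.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)


train = [{"word": "boo"}, {"word": "holden"}]
pipeline_model = Pipeline([AddLength(), MaxScaler()]).fit(train)  # one fit
prepared = pipeline_model.transform(train)                        # one transform
future = pipeline_model.transform([{"word": "dog"}])              # reuse later
```

The max learned at fit time (6) is what scales "dog" later on, which is exactly why you keep the fitted model around.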

What does the pipeline look like so far?
Input Data → Assembler → Input Data + Vectors → StringIndexer → Input Data + Cat ID + Vectors
● Assembler: this is a transformer - no fitting required.
● StringIndexer: while not an ML learning algorithm this still needs to be “fit”
Ray Bodden

Let's train a model on our prepared data:
# Specify model
dt = DecisionTreeClassifier(labelCol = "category-index",
featuresCol="features")
# Fit it
dt_model = dt.fit(prepared)
# Or as part of the pipeline
pipeline_and_model = Pipeline().setStages([assembler, indexer,
dt])
pipeline_model = pipeline_and_model.fit(df)
Edmund
Fitzgerald

Yay! You have an ML pipeline!
Photo by Jessica Fiess-Hill

And predict the results on the same data:
pipeline_model.transform(df).select("prediction",
"category-index").take(20)

Pipeline API has many models:
● org.apache.spark.ml.classification
○ LogisticRegression, DecisionTreeClassifier,
GBTClassifier, etc.
● org.apache.spark.ml.regression
○ DecisionTreeRegressor, GBTRegressor, IsotonicRegression,
LinearRegression, etc.
● org.apache.spark.ml.recommendation
○ ALS
● You can also check out spark-packages for some more
● But possibly not your special AwesomeFooBazinatorML
carterse

& data prep stages...
● org.apache.spark.ml.feature
○ ~30 elements from VectorAssembler to Tokenizer, to PCA, etc.
● Often simpler to understand while getting started with
building our own stages
carterse

What is/why Sparkling ML
● A place for useful Spark ML pipeline stages to live
○ Including both feature transformers and estimators
● The why: Spark ML can’t keep up with every new algorithm
● Lots of cool ML on Spark tools exist, but many don’t play nice with Spark ML
or together
● https://github.com/holdenk/sparklingml

So now begins our adventure to add stages

So what does a pipeline stage look like?
Must provide:
● Scala: transformSchema (used to validate input schema is
reasonable) & copy
● Both: Either a “fit” (for estimator) or transform (for
transformer)
Often have:
● Params for configuration (so we can do meta-algorithms)
Wendy Piersall

Building a simple transformer:
class HardCodedWordCountStage(override val uid: String) extends Transformer {
def this() = this(Identifiable.randomUID("hardcodedwordcount"))
def copy(extra: ParamMap): HardCodedWordCountStage = {
defaultCopy(extra)
}
...
}
Not to be confused with the Transformers franchise from Hasbro and Tomy.

Verify the input schema is reasonable:
override def transformSchema(schema: StructType): StructType = {
  // Check that the input type is a string
  val idx = schema.fieldIndex("happy_pandas")
  val field = schema.fields(idx)
  if (field.dataType != StringType) {
    throw new Exception(
      s"Input type ${field.dataType} did not match input type StringType")
  }
  // Add the return field
  schema.add(StructField("happy_panda_counts", IntegerType, false))
}

How is transformSchema used?
● When you call fit on a pipeline it calls transformSchema
on the pipeline stages in order
● This is used to verify that things should work
● Ideally allows pipelines to fail fast when misconfigured,
instead of at the final stage of a 48-hour process
● Doesn’t always work that way :p
● Not supported in Python (I’m sorry!)
Tricia Hall
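The fail-fast idea is easy to see without Spark. Below is a plain-Python toy where a schema is just a dict of column name to type; WordCountStage, transform_schema and validate are invented names for illustration, not Spark's API. Each stage checks its input and declares its output, and the schemas get threaded through in order before any real work happens.

```python
class StageSchemaError(Exception):
    pass


class WordCountStage:
    """Hypothetical stage: needs a string column, adds an int column."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def transform_schema(self, schema):
        if schema.get(self.input_col) is not str:
            raise StageSchemaError(
                "column %r must be a string, got %r"
                % (self.input_col, schema.get(self.input_col)))
        # Declare the column this stage will add
        return dict(schema, **{self.output_col: int})


def validate(stages, schema):
    """Walk the stages in order, threading the schema through each one."""
    for stage in stages:
        schema = stage.transform_schema(schema)
    return schema


good = validate([WordCountStage("happy_pandas", "happy_panda_counts")],
                {"happy_pandas": str})

try:
    # The second stage reads a column the first one produced as an int.
    validate([WordCountStage("happy_pandas", "counts"),
              WordCountStage("counts", "counts_of_counts")],
             {"happy_pandas": str})
    failed_fast = False
except StageSchemaError:
    failed_fast = True
```

No data was touched in either call, which is the whole appeal: the misconfigured pipeline dies in milliseconds instead of at hour 47.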

Do the “work” (e.g. predict labels or w/e):
def transform(df: Dataset[_]): DataFrame = {
val wordcount = udf { in: String => in.split(" ").size }
df.select(col("*"),
wordcount(df.col("happy_pandas")).as("happy_panda_counts"))
}
vic15

Do the “work” (e.g. call numpy or tensorflow or w/e):
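from pyspark import keyword_only
from pyspark.ml import Model
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import IntegerType

class StrLenPlus3Transformer(Model):
    @keyword_only
    def __init__(self):
        super(StrLenPlus3Transformer, self).__init__()

    def _transform(self, dataset):
        # Wrap a plain Python function as a Spark SQL UDF
        func = lambda x: len(x) + 3
        retType = IntegerType()
        udf = UserDefinedFunction(func, retType)
        return dataset.withColumn(
            "magic", udf("input")
        )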

What about configuring our stage?
class ConfigurableWordCount(override val uid: String) extends Transformer {
  final val inputCol = new Param[String](this, "inputCol", "The input column")
  final val outputCol = new Param[String](this, "outputCol", "The output column")
  def setInputCol(value: String): this.type = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)
  ...
}
Jason Wesley Upton

class StrLenPlusKTransformer(Model, HasInputCol, HasOutputCol):
    # We need a parameter to configure k
    k = Param(Params._dummy(),
              "k", "amount to add to str len",
              typeConverter=TypeConverters.toInt)

    @keyword_only
    def __init__(self, k=None, inputCol=None, outputCol=None):
        super(StrLenPlusKTransformer, self).__init__()
        kwargs = self._input_kwargs
        self.setParams(**kwargs)
Jason Wesley Upton

    @keyword_only
    def setParams(self, k=None, inputCol=None, outputCol=None):
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def setK(self, value):
        return self._set(k=value)

    def getK(self):
        return self.getOrDefault(self.k)
Jason Wesley Upton

So why do we configure it that way?
● Allow meta algorithms to work on it
● Scala:
○ If you look inside of spark you’ll see
“sharedParams.scala” for common params (like input
column)
○ We can’t access those unless we pretend to be inside
of org.apache.spark - so we have to make our own
● Python: Just import pyspark.ml.param.shared
Tricia Hall
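Why bother exposing settings as params instead of plain constructor arguments? A small Spark-free sketch of the "meta-algorithm" argument, in plain Python. StrLenPlusK, its copy method and param_grid are hypothetical toys here (Spark's own params machinery is richer); the takeaway is that generic code can re-configure a stage it knows nothing about.

```python
import itertools


class StrLenPlusK:
    """Hypothetical stage whose only setting, k, is exposed as a param."""

    def __init__(self, **params):
        self.params = {"k": 0}
        self.params.update(params)

    def copy(self, extra):
        # Same stage with some params overridden; the original is untouched.
        merged = dict(self.params)
        merged.update(extra)
        return StrLenPlusK(**merged)

    def transform(self, words):
        return [len(w) + self.params["k"] for w in words]


def param_grid(grid):
    """Expand {"k": [1, 2], ...} into one dict per combination."""
    names = sorted(grid)
    for combo in itertools.product(*(grid[n] for n in names)):
        yield dict(zip(names, combo))


base = StrLenPlusK()
# The "meta-algorithm" below only relies on copy() and the param names
# it was handed, nothing specific to StrLenPlusK.
results = {p["k"]: base.copy(p).transform(["boo", "holden"])
           for p in param_grid({"k": [0, 3]})}
```

Swap in any other stage with the same copy-with-overrides behaviour and the grid loop does not change, which is roughly what lets tuning tools work across stages.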

So how to make an estimator?
● Very similar, instead of directly providing transform
provide a `fit` which returns a “model” which implements
the transformer interface as shown above
● Also take a look at the algorithms in Spark itself (helpful
traits you can mix in to take care of many common things).
● Let’s look at a simple one now!
sneakerdog

A simple string indexer estimator
class SimpleIndexer(override val uid: String) extends
Estimator[SimpleIndexerModel] with SimpleIndexerParams {
….
override def fit(dataset: Dataset[_]): SimpleIndexerModel = {
import dataset.sparkSession.implicits._
val words = dataset.select(dataset($(inputCol)).as[String]).distinct
.collect()
new SimpleIndexerModel(uid, words)
}
}

Quick aside: What’s that “$(inputCol)”?
● How you get access to a configuration parameter
● Inside stage only (external use getInputCol just like
Java™ :p)

And our friend the transformer is back:
class SimpleIndexerModel(
    override val uid: String, words: Array[String]) extends
    Model[SimpleIndexerModel] with SimpleIndexerParams {
  ...
  // Each distinct word gets its position in the array as its index
  private val labelToIndex: Map[String, Double] = words.zipWithIndex.
    map{case (x, y) => (x, y.toDouble)}.toMap
  override def transform(dataset: Dataset[_]): DataFrame = {
    val indexer = udf { label: String => labelToIndex(label) }
    dataset.select(col("*"),
      indexer(dataset($(inputCol)).cast(StringType)).as($(outputCol)))
  }
}
Still not to be confused with the Transformers franchise from Hasbro and Tomy.
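For the Python folks, the same estimator and model pair can be sketched without Spark at all. This is a plain-Python toy with rows as dicts; the snake_case names are made up, and keeping first-seen order for the distinct words is an assumption of this sketch (a distributed distinct makes no such promise).

```python
class SimpleIndexerModel:
    def __init__(self, input_col, output_col, words):
        self.input_col = input_col
        self.output_col = output_col
        # Same idea as zipWithIndex + toDouble in the Scala version
        self.label_to_index = {w: float(i) for i, w in enumerate(words)}

    def transform(self, rows):
        return [dict(r, **{self.output_col: self.label_to_index[r[self.input_col]]})
                for r in rows]


class SimpleIndexer:
    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        # dict.fromkeys drops duplicates while keeping first-seen order
        words = list(dict.fromkeys(r[self.input_col] for r in rows))
        return SimpleIndexerModel(self.input_col, self.output_col, words)


rows = [{"category": "<=50K"}, {"category": ">50K"}, {"category": "<=50K"}]
indexer_model = SimpleIndexer("category", "category_index").fit(rows)
indexed = indexer_model.transform(rows)
```

Like the Scala version, this toy simply blows up on a label it never saw during fit; deciding what to do about that is left to you.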

Ok so how do you make the train function?
● Read some papers on the algorithm(s) you care about
● Most likely some iterative approach (pro-tip: RDDs >
Datasets for iterative)
○ Write down your cost function
○ Seth has some interesting work around pluggable
optimizers -- you can play with them
https://github.com/sethah/sparkopt or read about
parallel iterative algorithms.
● Closed form solution? Go have a party!
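As a tiny worked case of "write down your cost function, then iterate", here is single-machine gradient descent for a one-feature linear fit in plain Python, next to the closed-form answer it should agree with. The data, learning rate and step count are made up, and nothing here is distributed; it is a sketch of the shape of a train function, not a recipe for Spark.

```python
def cost(w, b, xs, ys):
    """Mean squared error of the line y = w*x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_step(w, b, xs, ys, lr):
    """One step downhill along the gradient of the cost above."""
    n = len(xs)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return w - lr * grad_w, b - lr * grad_b


def closed_form(xs, ys):
    """Ordinary least squares for one feature: no iteration needed."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    w = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return w, mean_y - w * mean_x


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]      # made-up points on the line y = 2x + 1

w, b = 0.0, 0.0
for _ in range(2000):
    w, b = gradient_step(w, b, xs, ys, lr=0.05)
```

In a distributed version the two sums inside gradient_step are the part you would compute across the cluster on each iteration, which is why the cost of each pass over the data matters so much.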

What else can you add to your models?
● Put in an ML pipeline
● Do hyper-parameter tuning
And if you have some coffee left over:
● Persistence*
○ MLWriter & MLReader give you the basics
○ You’ll have to do a lot of work yourself :(
● Serving*
*With enough coffee. Not guaranteed.

Ok so I put my new fancy thing on GitHub
● Yay thank you!
● Please publish to maven central
● Also consider contributing it to SparklingML
● Also consider listing on spark-packages + user@ list
○ Let me know ( holden@pigscanfly.ca ) :)
● Think of the Python users (and I guess the R users) too?

Custom Estimators/Transformers in the Wild
Classification/Regression
xgboost
Deep Learning!
MXNet
DL4J
etc.
Feature Transformation
FeatureHasher

More resources:
● High Performance Spark Example Repo has some
sample models
○ Of course buy several copies of the book - it is the gift of the season :p
● The models inside of Spark itself (internal APIs though)
● Sparkling ML - So much fun!
● Nick Pentreath’s FeatureHasher
● O’Reilly radar blog post
https://www.oreilly.com/learning/extend-structured-streaming-for-spark-ml
Captain Pancakes

Want to do Optional Exercise 1?
Go from the index to something useful
● We could manually look up the labels and then write a
select statement
● Or we could look at the features on the
StringIndexerModel and use IndexToString
● Our pipeline has an array of stages we can use for this

Solution:
from pyspark.ml.feature import IndexToString
# labels is a property on the fitted StringIndexerModel, not a method
labels = list(pipeline_model.stages[1].labels)
inverter = IndexToString(inputCol="prediction",
                         outputCol="prediction-label", labels=labels)
(inverter.transform(pipeline_model.transform(df))
    .select("prediction-label", "category").take(20))
# Pre Spark 1.6 use SQL if/else or similar

Exercise 2: Add more features to your tree
● Finished quickly? Help others!
● Or tell me if adding these features helped or not…
○ We can download a reserve “test” dataset but how would we know if
we couldn’t do that?
cobra libre

Exercise 3: Train a new model type
● Your choice!
● If you want to do regression - change what we are
predicting

Exercise 4: Make a pipeline stage!
● Birthday*-saurus** will thank you!
*Birthday in the sense you tell the olive garden it’s your birthday -- not like a “real” birthday

Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Spark in Action
High Performance Spark
Learning PySpark

High Performance Spark!
Available today!
The rest of you can buy it from that scrappy Seattle
bookstore :p
http://bit.ly/hkHighPerfSpark

And some upcoming talks:
● Data Day Seattle (SEA, October)
● Spark Summit EU (Dublin, October)
● Big Data Spain (November, Madrid)
● Bee Scala (November, Ljubljana)
● Strata Singapore (Singapore, December)
● ScalaX (London, December)
● Linux Conf AU (Melbourne, January)
● Know of interesting conferences/webinar things that
should be on my radar? Let me know!

k thnx bye :)
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
I have to give a talk on this in ~ 2
weeks (meeps!)
Will tweet results
“eventually” @holdenkarau
Any PySpark Users: Have some
simple UDFs you wish ran faster
you are willing to share?:
http://bit.ly/pySparkUDF
Pssst: Have feedback on the presentation? Give me a
shout (holden@pigscanfly.ca) if you feel comfortable doing
so :)

Cross-validation
because saving a test set is effort
● Automagically* fit your model params
● Because thinking is effort
● org.apache.spark.ml.tuning has the tools
○ (not in Python yet so skipping for now)
Jonathan Kotta
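The idea behind cross-validation fits in a few lines of plain Python. This is a Spark-free toy, not the org.apache.spark.ml.tuning API: k_fold_indices, cross_validate, the threshold "model" and the six data points are all invented for illustration.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs; every row is held out exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def cross_validate(fit, score, rows, candidates, k=3):
    """Average each candidate's score over k folds and return the best one."""
    averages = {}
    for param in candidates:
        scores = []
        for train_idx, test_idx in k_fold_indices(len(rows), k):
            model = fit([rows[i] for i in train_idx], param)
            scores.append(score(model, [rows[i] for i in test_idx]))
        averages[param] = sum(scores) / len(scores)
    return max(averages, key=averages.get), averages


def fit(train, offset):
    """Toy model: a threshold halfway between the class means, plus offset."""
    zeros = [x for x, y in train if y == 0]
    ones = [x for x, y in train if y == 1]
    midpoint = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return midpoint + offset


def accuracy(threshold, test):
    return sum((x > threshold) == (y == 1) for x, y in test) / len(test)


# Made-up (feature, label) pairs
rows = [(1, 0), (2, 0), (3, 0), (6, 1), (7, 1), (8, 1)]
best, averages = cross_validate(fit, accuracy, rows, [-5.0, 0.0, 5.0])
```

Every row gets a turn in the held-out fold, so no separate test set had to be saved, at the price of fitting k models per candidate.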


So serving...
● Generally refers to using your model online
○ Generating recommendations...
● In batch mode you can “just” save & use the Spark bits
● Spark’s “native” formats (often parquet w/metadata)
○ Understood by Spark libraries and that’s pretty much it
○ If you are serving in JVM can load these but need Spark
dependencies (albeit often not a Spark cluster)
● Some models support PMML export
○ https://github.com/jpmml/openscoring etc.
● We can also write our own export & serving by hand :(
Ambernectar 13
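For a sense of what "export & serving by hand" can look like in the simplest case, here is a plain-Python sketch for a linear model. The JSON layout, the function names and the weights are all made up (this is not a Spark or PMML format); the idea is just that the serving side needs the learned numbers and the feature order, not Spark.

```python
import json


def export_model(weights, intercept, feature_names):
    """Serialize only what is needed to score: no Spark objects involved."""
    return json.dumps({"features": feature_names,
                       "weights": weights,
                       "intercept": intercept})


def load_scorer(payload):
    """Turn the exported payload back into a plain scoring function."""
    model = json.loads(payload)

    def score(record):
        total = model["intercept"]
        for name, weight in zip(model["features"], model["weights"]):
            total += weight * record[name]
        return total

    return score


# Made-up weights standing in for a trained model
payload = export_model([0.5, 2.0], -1.0, ["age", "education-num"])
score = load_scorer(payload)
prediction = score({"age": 40.0, "education-num": 10.0})
```

The sad-face part is everything this skips: any feature prep done by earlier pipeline stages has to be re-implemented on the serving side too.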

So what models are PMML exportable?
● Right now “old” style models
○ KMeans, LinearRegression, RidgeRegression, Lasso, SVM, and Binary
LogisticRegression
○ However if we look in the code we can sometimes find converters to
the old style models and use this to export our “new” style model
● Waiting on
https://issues.apache.org/jira/browse/SPARK-11171 /
https://github.com/apache/spark/pull/9207 for pipeline
models
● Not getting in for 2.0 :(

How to PMML export*
toPMML
● returns a string or
● takes a path to local fs and saves results or
● takes a SparkContext & a distributed path and saves or
● takes a stream and writes result to stream
Oooor just wait for something better

Optional* exercise time
● Take a model you trained and save it to PMML
○ You will have to dig around in the Spark code to be able to do this
● Look at the file
● Load it into a serving system and try some predictions
● Note: PMML export currently only includes the model -
not any transformations beforehand
● Also: you might need to train a new model
● If you don’t get it don’t worry - hints to follow :)
