This document provides instructions for extending Spark ML pipelines by building new pipeline stages. It discusses the key components needed to build estimators and transformers, including implementing transformSchema, fit/transform methods, and parameter configuration. Examples are given of building a simple string indexer estimator and transformer. The document also briefly mentions additional features like persistence and serving that could be added.
Spark ML Intro & Extending
1. Hi LA Apache Spark friends!
Thanks for coming early! Enjoy burritos and the
WWGuest WiFi network
(or you know in person networking works too)
2. Intro & Extending Spark ML
With your “friend” @holdenkarau
& friend Boo!
Hella-Legit
3. Who’s presenting:
● My name is Holden Karau
● Preferred pronouns are she/her
● I’m a Principal Software Engineer at IBM’s Spark Technology Center
● Apache Spark committer (as of January!) :)
● previously Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Spark Videos http://bit.ly/holdenSparkVideos
4. Who is Boo?
● Boo uses she/her pronouns (as I told the Texas house committee)
● Best doge
● Lots of experience barking at computers to make them go faster
● Author of “Learning to Bark” & “High Performance Barking”
● On twitter @BooProgrammer
6. IBM Spark Technology Center
Founded in 2015.
Location:
Physical: 505 Howard St., San Francisco CA
Web: http://spark.tc Twitter: @apachespark_tc
Mission:
Contribute intellectual and technical capital to the Apache Spark
community.
Make the core technology enterprise- and cloud-ready.
Build data science skills to drive intelligence into business
applications — http://bigdatauniversity.com
Key statistics:
About 50 developers, co-located with 25 IBM designers.
Major contributions to Apache Spark http://jiras.spark.tc
Apache SystemML is now an Apache Incubator project.
Founding member of UC Berkeley AMPLab and RISE Lab
Member of R Consortium and Scala Center
7. What we are going to explore together!
● Who I think you all are
● Spark’s two different ML APIs
● Running through a simple example with one
● Model save/load
● Discussion of “serving” options
● Extending Spark ML
● Optional take home exercises
● Finish up with a raffle & cookies (sponsored by Qubole!)
○ Please fill out your ticket if you want to win!
8. The different pieces of Spark
[Diagram: Apache Spark at the center, with SQL & DataFrames; Streaming; Language APIs (Scala, Java, Python, & R); Graph Tools (Bagel & GraphX); Spark ML; MLlib; Community Packages]
9. The different pieces of Spark: 2.0+
[Diagram: Apache Spark at the center, with SQL & DataFrames; Streaming; Structured Streaming; Language APIs (Scala, Java, Python, & R); Graph Tools (Bagel & GraphX); Spark ML; MLlib; Community Packages]
10. Who do I think you all are?
● Nice people*
● Some knowledge of Apache Spark core & maybe SQL
● Interested in using Spark for Machine Learning
● Familiar-ish with Scala or Java or Python
11. If you're planning to follow along:
● Spark 2+ (Spark 2.2 would be best!)
○ (built with Hive support if building from source)
● Since this is a regular talk, you won’t have time to do the
exercises as we go -- but you can come back and finish
them after :)
12. Some resources:
http://bit.ly/sparkDocs
http://bit.ly/sparkPyDocs OR http://bit.ly/sparkScalaDoc
http://bit.ly/sparkMLGuide
https://github.com/holdenk/spark-intro-ml-pipeline-workshop
http://www.slideshare.net/hkarau
Download census data
https://archive.ics.uci.edu/ml/datasets/Adult
13. Getting some data for working with:
● census data:
https://archive.ics.uci.edu/ml/datasets/Adult
● Goal: predict income > 50k
● Also included in the github repo
● Download that now if you haven’t already
● We will add a header to the data
○ http://pastebin.ca/3318687
14. So what are the two APIs?
● Traditional and Pipeline
○ Pipeline is the new shiny future which will fix all problems*
● Traditional API works on RDDs
○ Data preparation work is generally done in traditional Spark
transformations
● Pipeline API works on DataFrames
○ Often we want to apply some transformations to our data before
feeding to the machine learning algorithm
○ Makes it easy to chain these together
(*until replaced by a newer shinier future)
15. So what are DataFrames / Datasets?
● Spark SQL’s version of RDDs of the world
○ It’s for more than just SQL
● Restricted data types, schema information, compile time
untyped*
○ Datasets add the types back
● Slightly restricted operations (more relational style)
○ Still support many of the same functional programming magic
○ map & friends are here to stay, but at a cost
● Allow lots of fun extra optimizations
○ Tungsten, Apache Arrow, etc.
● Not Pandas or R DataFrames
18. So what does a pipeline stage look like?
Each stage is either an:
● Estimator - has a method called “fit” which returns a transformer
● Transformer - no need to train, can directly transform (e.g. HashingTF) with
“transform”
Both must provide:
● transformSchema* (used to validate input schema is reasonable) & copy
Often have:
● Parameters for configuration (think input columns, regularization, etc.)
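The estimator/transformer split above can be sketched without Spark at all. This is a plain-Python stand-in, not Spark's classes: `MeanCenterer` and `MeanCentererModel` are hypothetical names, and rows are plain dicts rather than a DataFrame. The point is only the shape of the contract: the estimator has to look at data first, and what it hands back can transform directly.

```python
class MeanCentererModel:
    """Transformer: already holds what it needs, so it can transform directly."""

    def __init__(self, input_col, output_col, mean):
        self.input_col = input_col
        self.output_col = output_col
        self.mean = mean

    def transform(self, rows):
        # Return new rows with one extra column; the input rows are left untouched.
        return [{**r, self.output_col: r[self.input_col] - self.mean} for r in rows]


class MeanCenterer:
    """Estimator: must see data first; fit returns a transformer."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        values = [r[self.input_col] for r in rows]
        mean = sum(values) / len(values)
        return MeanCentererModel(self.input_col, self.output_col, mean)


rows = [{"age": 20.0}, {"age": 40.0}]
model = MeanCenterer("age", "age_centered").fit(rows)
centered = model.transform(rows)
```

The fitted model can be reused on any later batch of rows without calling fit again, which is the same reason a fitted Spark pipeline can be applied to future data.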
19. How are transformers made?
[Diagram: data → Estimator → Transformer]
class Estimator extends PipelineStage {
  def fit(dataset: Dataset[_]): Transformer = {
    // magic happens here
  }
}
20. Let’s start with loading some data
● We’ve got some CSV data, we could use textFile and
parse by hand
● spark-packages can save us by providing the spark-csv
package by Hossein Falaki
○ If we were building a Java project we could include the Maven coordinates
○ For the Spark shell, when launching add:
--packages com.databricks:spark-csv_2.10:1.3.0
21. Loading with Spark SQL & spark-csv
sqlContext.read returns a DataFrameReader
We can specify general properties & data specific options
● option(“key”, “value”)
○ spark-csv ones we will use are header & inferSchema
● format(“formatName”)
○ built in formats include parquet, jdbc, etc. today we will use csv (added
in Spark 2.0, prior to that as com.databricks.spark.csv)
● load(“path”)
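Put together, the reader calls above chain like this. A minimal sketch: `load_census` is a hypothetical helper name, `spark` is assumed to be a SparkSession (Spark 2.0+), and `path` is wherever you saved the census CSV. On Spark 1.x you would call the same chain on sqlContext.read with the spark-csv format name instead of "csv".

```python
def load_census(spark, path):
    """Read a CSV file that has a header row and let Spark infer column types.

    `spark` is assumed to be a SparkSession; nothing here is specific to
    the census data beyond the name.
    """
    return (
        spark.read
        .format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load(path)
    )
```

inferSchema costs an extra pass over the file, which is fine for a dataset of this size.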
23. Let's explore training a Decision Tree
● Step 1: Data loading (done!)
● Step 2: Data prep (select features, etc.)
● Step 3: Train
● Step 4: Predict
24. Data prep / cleaning
● We need to predict a double (can be 0.0, 1.0, but type
must be double)
● We need to train with a vector of features
Imports:
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.param import Param, Params
from pyspark.ml.feature import Bucketizer, VectorAssembler, StringIndexer
from pyspark.ml import Pipeline
25. Data prep / cleaning continued
# Combines a list of double input features into a vector
assembler = VectorAssembler(inputCols=["age", "education-num"],
                            outputCol="features")
# String indexer converts a set of strings into doubles
indexer = StringIndexer(inputCol="category").setOutputCol("category-index")
# Can be used to combine pipeline components together
pipeline = Pipeline().setStages([assembler, indexer])
26. So a bit more about that pipeline
● Each of our previous components has “fit” & “transform”
stage
● Constructing the pipeline this way makes it easier to
work with (only need to call one fit & one transform)
● Can re-use the fitted model on future data
model = pipeline.fit(df)
prepared = model.transform(df)
27. What does the pipeline look like so far?
[Diagram: Input Data → Assembler → Input Data + Vectors → StringIndexer → Input Data + Cat ID + Vectors]
Assembler: this is a transformer - no fitting required.
StringIndexer: while not an ML learning algorithm this still needs to be “fit”.
28. Let's train a model on our prepared data:
# Specify model
dt = DecisionTreeClassifier(labelCol="category-index",
                            featuresCol="features")
# Fit it
dt_model = dt.fit(prepared)
# Or as part of the pipeline
pipeline_and_model = Pipeline().setStages([assembler, indexer, dt])
pipeline_model = pipeline_and_model.fit(df)
29. Yay! You have an ML pipeline!
30. And predict the results on the same data:
pipeline_model.transform(df).select("prediction", "category-index").take(20)
31. Pipeline API has many models:
● org.apache.spark.ml.classification
○ LogisticRegression, DecisionTreeClassifier, GBTClassifier, etc.
● org.apache.spark.ml.regression
○ DecisionTreeRegressor, GBTRegressor, IsotonicRegression,
LinearRegression, etc.
● org.apache.spark.ml.recommendation
○ ALS
● You can also check out spark-packages for some more
● But possibly not your special AwesomeFooBazinatorML
32. & data prep stages...
● org.apache.spark.ml.feature
○ ~30 elements from VectorAssembler to Tokenizer, to PCA, etc.
● Often simpler to understand while getting started with
building our own stages
33. What is/why Sparkling ML
● A place for useful Spark ML pipeline stages to live
○ Including both feature transformers and estimators
● The why: Spark ML can’t keep up with every new algorithm
● Lots of cool ML on Spark tools exist, but many don’t play nice with Spark ML
or together
● https://github.com/holdenk/sparklingml
35. So what does a pipeline stage look like?
Must provide:
● Scala: transformSchema (used to validate input schema is
reasonable) & copy
● Both: Either a “fit” (for estimator) or transform (for
transformer)
Often have:
● Params for configuration (so we can do meta-algorithms)
36. Building a simple transformer:
class HardCodedWordCountStage(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("hardcodedwordcount"))

  def copy(extra: ParamMap): HardCodedWordCountStage = {
    defaultCopy(extra)
  }
  ...
}
Not to be confused with the Transformers franchise from Hasbro and Tomy.
37. Verify the input schema is reasonable:
override def transformSchema(schema: StructType): StructType = {
  // Check that the input type is a string
  val idx = schema.fieldIndex("happy_pandas")
  val field = schema.fields(idx)
  if (field.dataType != StringType) {
    throw new Exception(
      s"Input type ${field.dataType} did not match input type StringType")
  }
  // Add the return field
  schema.add(StructField("happy_panda_counts", IntegerType, false))
}
38. How is transformSchema used?
● When you call fit on a pipeline it calls transformSchema
on the pipeline stages in order
● This is used to verify that things should work
● Ideally allows pipelines to fail fast when misconfigured,
instead of at the final stage of a 48-hour process
● Doesn’t always work that way :p
● Not supported in Python (I’m sorry!)
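The fail-fast idea can be shown with a toy in plain Python. This is not Spark's implementation (and PySpark stages don't have transformSchema): the schema here is just a dict of column name to type name, and `check_word_count_schema` and `validate_pipeline` are hypothetical helpers that mirror the Scala check on the earlier slide.

```python
def check_word_count_schema(schema):
    """Mirror of the transformSchema slide, with a dict standing in for StructType."""
    if "happy_pandas" not in schema:
        raise ValueError("Missing input column happy_pandas")
    if schema["happy_pandas"] != "string":
        raise ValueError(
            "Input type %s did not match input type string" % schema["happy_pandas"])
    # Return a new schema with the output column added.
    return {**schema, "happy_panda_counts": "int"}


def validate_pipeline(schema_checks, schema):
    """Run each stage's schema check in order, before touching any data."""
    for check in schema_checks:
        schema = check(schema)
    return schema


final_schema = validate_pipeline([check_word_count_schema], {"happy_pandas": "string"})

try:
    validate_pipeline([check_word_count_schema], {"happy_pandas": "int"})
    failed_fast = False
except ValueError:
    failed_fast = True
```

Because each check returns the schema the next stage will see, a misconfigured later stage is caught before any of the expensive fitting starts.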
39. Do the “work” (e.g. predict labels or w/e):
def transform(df: Dataset[_]): DataFrame = {
  val wordcount = udf { in: String => in.split(" ").size }
  df.select(col("*"),
    wordcount(df.col("happy_pandas")).as("happy_panda_counts"))
}
40. Do the “work” (e.g. call numpy or tensorflow or w/e):
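from pyspark import keyword_only
from pyspark.ml import Model
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import IntegerType

class StrLenPlus3Transformer(Model):
    @keyword_only
    def __init__(self):
        super(StrLenPlus3Transformer, self).__init__()

    def _transform(self, dataset):
        # Add a "magic" column holding the length of "input" plus 3
        func = lambda x: len(x) + 3
        retType = IntegerType()
        udf = UserDefinedFunction(func, retType)
        return dataset.withColumn("magic", udf("input"))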
41. What about configuring our stage?
class ConfigurableWordCount(override val uid: String) extends Transformer {
  final val inputCol = new Param[String](this, "inputCol", "The input column")
  final val outputCol = new Param[String](this, "outputCol", "The output column")

  def setInputCol(value: String): this.type = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)
  ...
}
42. What about configuring our stage?
class StrLenPlusKTransformer(Model, HasInputCol, HasOutputCol):
    # We need a parameter to configure k
    k = Param(Params._dummy(),
              "k", "amount to add to str len",
              typeConverter=TypeConverters.toInt)

    @keyword_only
    def __init__(self, k=None, inputCol=None, outputCol=None):
        super(StrLenPlusKTransformer, self).__init__()
        kwargs = self._input_kwargs
        self.setParams(**kwargs)
43. What about configuring our stage?
    @keyword_only
    def setParams(self, k=None, inputCol=None, outputCol=None):
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def setK(self, value):
        return self._set(k=value)

    def getK(self):
        return self.getOrDefault(self.k)
44. So why do we configure it that way?
● Allow meta algorithms to work on it
● Scala:
○ If you look inside of spark you’ll see
“sharedParams.scala” for common params (like input
column)
○ We can’t access those unless we pretend to be inside
of org.apache.spark - so we have to make our own
● Python: Just import pyspark.ml.param.shared
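Why params help meta-algorithms is easier to see in a toy. The sketch below is plain Python, not Spark's Param machinery: `ConfigurableStage` and `param_grid` are hypothetical names, and the stage just computes string length plus k. The idea it illustrates is that when every setting lives in one uniform place and the stage can be copied with overrides, a tuner can try settings without knowing anything about the stage.

```python
import copy
import itertools


class ConfigurableStage:
    """Hypothetical stage whose settings all live in one params dict."""

    def __init__(self, **params):
        self.params = {"k": 3, "inputCol": "input", "outputCol": "output"}
        self.params.update(params)

    def copy(self, extra=None):
        # Same idea as copy(extra: ParamMap): a fresh stage with overrides applied.
        new_stage = copy.copy(self)
        new_stage.params = {**self.params, **(extra or {})}
        return new_stage

    def transform(self, rows):
        p = self.params
        return [{**r, p["outputCol"]: len(r[p["inputCol"]]) + p["k"]} for r in rows]


def param_grid(grid):
    """Expand {"k": [1, 2]} style grids into a list of param maps."""
    names = sorted(grid)
    combos = itertools.product(*(grid[n] for n in names))
    return [dict(zip(names, values)) for values in combos]


base = ConfigurableStage()
rows = [{"input": "boo"}]
# The "meta-algorithm" only needs copy(); it never looks inside the stage.
results = {
    pm["k"]: base.copy(pm).transform(rows)[0]["output"]
    for pm in param_grid({"k": [0, 1, 2]})
}
```

The base stage is never mutated, so the same instance can be reused across every combination in the grid.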
45. So how to make an estimator?
● Very similar, instead of directly providing transform
provide a `fit` which returns a “model” which implements
the transformer interface as shown above
● Also take a look at the algorithms in Spark itself (helpful
traits you can mix in to take care of many common things).
● Let’s look at a simple one now!
46. A simple string indexer estimator
class SimpleIndexer(override val uid: String) extends
    Estimator[SimpleIndexerModel] with SimpleIndexerParams {
  ...
  override def fit(dataset: Dataset[_]): SimpleIndexerModel = {
    import dataset.sparkSession.implicits._
    val words = dataset.select(dataset($(inputCol)).as[String]).distinct
      .collect()
    new SimpleIndexerModel(uid, words)
  }
}
47. Quick aside: What’s that “$(inputCol)”?
● How you get access to a configuration parameter
● Inside stage only (external use getInputCol just like
Java™ :p)
48. And our friend the transformer is back:
class SimpleIndexerModel(
    override val uid: String, words: Array[String]) extends
    Model[SimpleIndexerModel] with SimpleIndexerParams {
  ...
  private val labelToIndex: Map[String, Double] = words.zipWithIndex.
    map { case (x, y) => (x, y.toDouble) }.toMap

  override def transform(dataset: Dataset[_]): DataFrame = {
    val indexer = udf { label: String => labelToIndex(label) }
    dataset.select(col("*"),
      indexer(dataset($(inputCol)).cast(StringType)).as($(outputCol)))
  }
}
Still not to be confused with the Transformers franchise from Hasbro and Tomy.
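To see what the indexer pair is doing without a cluster, here is a plain-Python analogue. It is a sketch only: `fit_simple_indexer` is a hypothetical helper, rows are dicts, and the sorted() call is my addition to keep the toy deterministic. It also shows one consequence of the lookup on the slide: Scala's Map apply throws for a missing key, so a label that was not seen during fit fails at transform time.

```python
class SimpleIndexerModel:
    """Plain-Python analogue of the model on the slide above."""

    def __init__(self, words):
        # Same mapping as words.zipWithIndex: label -> index, stored as a float.
        self.label_to_index = {word: float(i) for i, word in enumerate(words)}

    def transform(self, rows, input_col, output_col):
        # Raises KeyError for a label that was not present at fit time.
        return [{**r, output_col: self.label_to_index[r[input_col]]} for r in rows]


def fit_simple_indexer(rows, input_col):
    # sorted() only makes this toy deterministic; distinct.collect() on the
    # slide promises no particular order.
    words = sorted({r[input_col] for r in rows})
    return SimpleIndexerModel(words)


train = [{"category": "panda"}, {"category": "doge"}, {"category": "panda"}]
model = fit_simple_indexer(train, "category")
indexed = model.transform(train, "category", "category-index")

try:
    model.transform([{"category": "llama"}], "category", "category-index")
    unseen_ok = True
except KeyError:
    unseen_ok = False
```

If unseen labels are possible in your data, the model needs a policy for them (skip, error, or a catch-all index) rather than a bare lookup.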
49. Ok so how do you make the train function?
● Read some papers on the algorithm(s) you care about
● Most likely some iterative approach (pro-tip: RDDs >
Datasets for iterative)
○ Write down your cost function
○ Seth has some interesting work around pluggable
optimizers -- you can play with them
https://github.com/sethah/sparkopt or read about
parallel iterative algorithms.
● Closed form solution? Go have a party!
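As a reminder of what "write down your cost function and iterate" looks like, here is a tiny single-machine sketch: plain gradient descent on a mean squared error cost for a one-feature line. Everything in it (the function names, step size, iteration count, and data) is an illustrative assumption, and it is not distributed; on Spark the two sums inside the loop are what you would compute across partitions each iteration.

```python
def cost(w, b, xs, ys):
    """Mean squared error of the line y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, step=0.05, iterations=2000):
    """Plain gradient descent on the cost above, starting from w = b = 0."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        # Gradient of the cost with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= step * grad_w
        b -= step * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b = train(xs, ys)
```

This particular problem has a closed form solution, so in practice you would go have the party instead; the loop is only here to show the shape of an iterative trainer.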
50. What else can you add to your models?
● Put in an ML pipeline
● Do hyper-parameter tuning
And if you have some coffee left over:
● Persistence*
○ MLWriter & MLReader give you the basics
○ You’ll have to do a lot of work yourself :(
● Serving*
*With enough coffee. Not guaranteed.
51. Ok so I put my new fancy thing on GitHub
● Yay thank you!
● Please publish to maven central
● Also consider contributing it to SparklingML
● Also consider listing on spark-packages + user@ list
○ Let me know ( holden@pigscanfly.ca ) :)
● Think of the Python users (and I guess the R users) too?
52. Custom Estimators/Transformers in the Wild
Classification/Regression
xgboost
Deep Learning!
MXNet
DL4J
etc.
Feature Transformation
FeatureHasher
53. More resources:
● High Performance Spark Example Repo has some
sample models
○ Of course buy several copies of the book - it is the gift of the season :p
● The models inside of Spark itself (internal APIs though)
● Sparkling ML - So much fun!
● Nick Pentreath’s FeatureHasher
● O’Reilly radar blog post
https://www.oreilly.com/learning/extend-structured-streaming-for-spark-ml
54. Want to do Optional Exercise 1?
Go from the index to something useful
● We could manually look up the labels and then write a
select statement
● Or we could look at the features on the
StringIndexerModel and use IndexToString
● Our pipeline has an array of stages we can use for this
55. Solution:
from pyspark.ml.feature import IndexToString
# labels is a property on the fitted StringIndexerModel (stage 1 of the pipeline)
labels = list(pipeline_model.stages[1].labels)
inverter = IndexToString(inputCol="prediction",
                         outputCol="prediction-label", labels=labels)
inverter.transform(pipeline_model.transform(df)).select(
    "prediction-label", "category").take(20)
# Pre Spark 1.6 use SQL if/else or similar
56. Exercise 2: Add more features to your tree
● Finished quickly? Help others!
● Or tell me if adding these features helped or not…
○ We can download a reserve “test” dataset but how would we know if
we couldn’t do that?
57. Exercise 3: Train a new model type
● Your choice!
● If you want to do regression - change what we are
predicting
58. Exercise 4: Make a pipeline stage!
● Birthday*-saurus** will thank you!
*Birthday in the sense you tell the olive garden it’s your birthday -- not like a “real” birthday
59. Books
● Learning Spark
● Fast Data Processing with Spark (Out of Date)
● Fast Data Processing with Spark (2nd edition)
● Advanced Analytics with Spark
● Spark in Action
● High Performance Spark
● Learning PySpark
60. High Performance Spark!
Available today!
The rest of you can buy it from that scrappy Seattle
bookstore :p
http://bit.ly/hkHighPerfSpark
61. And some upcoming talks:
● Data Day Seattle (SEA, October)
● Spark Summit EU (Dublin, October)
● Big Data Spain (November, Madrid)
● Bee Scala (November, Ljubljana)
● Strata Singapore (Singapore, December)
● ScalaX (London, December)
● Linux Conf AU (Melbourne, January)
● Know of interesting conferences/webinar things that
should be on my radar? Let me know!
62. k thnx bye :)
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
I have to give a talk on this in ~ 2
weeks (meeps!)
Will tweet results
“eventually” @holdenkarau
Any PySpark Users: Have some
simple UDFs you wish ran faster
you are willing to share?:
http://bit.ly/pySparkUDF
Pssst: Have feedback on the presentation? Give me a
shout (holden@pigscanfly.ca) if you feel comfortable doing
so :)
63. Cross-validation
because saving a test set is effort
● Automagically* fit your model params
● Because thinking is effort
● org.apache.spark.ml.tuning has the tools
○ (not in Python yet so skipping for now)
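For intuition about what the tuning tools automate, here is a toy k-fold loop in plain Python. It is not Spark's CrossValidator: `k_folds` and `cross_validate` are hypothetical helpers, and the "model" is just the training mean scaled by a candidate factor. Each candidate is scored only on rows it was not fit on, and the best average wins.

```python
def k_folds(n_rows, k):
    """Yield (train_indices, test_indices) pairs; each row is tested exactly once."""
    indices = list(range(n_rows))
    for fold in range(k):
        test = indices[fold::k]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def cross_validate(rows, candidates, fit, error, k=3):
    """Return the candidate with the lowest average held-out error, plus all scores."""
    scores = {}
    for param in candidates:
        fold_errors = []
        for train, test in k_folds(len(rows), k):
            model = fit([rows[i] for i in train], param)
            fold_errors.append(error(model, [rows[i] for i in test]))
        scores[param] = sum(fold_errors) / len(fold_errors)
    return min(scores, key=scores.get), scores


def fit_scaled_mean(train_rows, scale):
    # Toy "model": predict the training mean times a tunable scale factor.
    return scale * sum(train_rows) / len(train_rows)


def squared_error(prediction, test_rows):
    return sum((prediction - y) ** 2 for y in test_rows) / len(test_rows)


rows = [9.0, 11.0, 10.0, 10.0, 9.0, 11.0]
best, scores = cross_validate(rows, [0.5, 1.0, 1.5], fit_scaled_mean, squared_error)
```

The cost is k fits per candidate, which is why this is usually run over a small grid.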
65. So serving...
● Generally refers to using your model online
○ Generating recommendations...
● In batch mode you can “just” save & use the Spark bits
● Spark’s “native” formats (often parquet w/metadata)
○ Understood by Spark libraries and that's pretty much it
○ If you are serving in JVM can load these but need Spark
dependencies (albeit often not a Spark cluster)
● Some models support PMML export
○ https://github.com/jpmml/openscoring etc.
● We can also write our own export & serving by hand :(
66. So what models are PMML exportable?
● Right now “old” style models
○ KMeans, LinearRegression, RidgeRegression, Lasso, SVM, and Binary
LogisticRegression
○ However if we look in the code we can sometimes find converters to
the old style models and use this to export our “new” style model
● Waiting on
https://issues.apache.org/jira/browse/SPARK-11171 /
https://github.com/apache/spark/pull/9207 for pipeline
models
● Not getting in for 2.0 :(
67. How to PMML export*
toPMML
● returns a string or
● takes a path to local fs and saves results or
● takes a SparkContext & a distributed path and saves or
● takes a stream and writes result to stream
Oooor just wait for something better
70. Optional* exercise time
● Take a model you trained and save it to PMML
○ You will have to dig around in the Spark code to be able to do this
● Look at the file
● Load it into a serving system and try some predictions
● Note: PMML export currently only includes the model -
not any transformations beforehand
● Also: you might need to train a new model
● If you don’t get it don’t worry - hints to follow :)