Intro & Extending Spark ML
With your “friend” @holdenkarau
& friend Boo!
Who's presenting:
● My name is Holden Karau
● Preferred pronouns are she/her
● I’m a Principal Software Engineer at IBM’s Spark Technology Center
● Apache Spark committer (as of January!) :)
● previously Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● @holdenkarau
Who is Boo?
● Boo uses she/her pronouns (as I told the Texas house committee)
● Best doge
● Lots of experience barking at computers to make them go faster
● Author of “Learning to Bark” & “High Performance Barking”
● On twitter @BooProgrammer
What we are going to explore together!
● Who I think you all are
● Spark’s two different ML APIs
● Running through a simple example with one
● Model save/load
● Discussion of “serving” options
● Extending Spark ML
● Optional take home exercises
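The different pieces of Spark
● Apache Spark (core)
● SQL & DataFrames
● Streaming
● Language APIs: Scala, Java, Python, & R
● Graph tools: Bagel & GraphX
● Spark ML & MLlib
● Community packages
The different pieces of Spark: 2.0+
● Everything above, plus Structured Streaming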
Who do I think you all are?
● Nice people*
● Some knowledge of Apache Spark core & maybe SQL
● Interested in using Spark for Machine Learning
● Familiar-ish with Scala or Java or Python
If you're planning to follow along:
● Spark 2+ (Spark 2.2 would be best!)
○ (built with Hive support if building from source)
● Since this is a regular talk, you won’t have time to do the
exercises as we go -- but you can come back and finish
them after :)
Download census data
Getting some data for working with:
● census data:
● Goal: predict income > 50k
● Also included in the github repo
● Download that now if you haven’t already
● We will add a header to the data
So what are the two APIs?
● Traditional and Pipeline
○ Pipeline is the new shiny future which will fix all problems*
● Traditional API works on RDDs
○ Data preparation work is generally done in traditional Spark
transformations
● Pipeline API works on DataFrames
○ Often we want to apply some transformations to our data before
feeding to the machine learning algorithm
○ Makes it easy to chain these together
(*until replaced by a newer shinier future)
So what are DataFrames / Datasets?
● Spark SQL’s version of RDDs of the world
○ It’s for more than just SQL
● Restricted data types, schema information, compile time untyped
○ Datasets add the types back
● Slightly restricted operations (more relational style)
○ Still support many of the same functional programming magic
○ map & friends are here to stay, but at a cost
● Allow lots of fun extra optimizations
○ Tungsten, Apache Arrow, etc.
● Not Pandas or R DataFrames
What is DataFrame performance like?
Spark ML pipelines
[Diagram: Tokenizer -> HashingTF -> String Indexer -> Naive Bayes; calling fit(df) turns the estimator stages into transformers]
● Inspired by scikit-learn
● Consist of Estimators and Transformers
So what does a pipeline stage look like?
Each stage is either:
● An Estimator - has a method called “fit” which returns a transformer
● A Transformer - no need to train, can directly transform (e.g. HashingTF) (with
transform)
Both must provide:
● transformSchema* (used to validate input schema is reasonable) & copy
Often have:
● Parameters for configuration (think input columns, regularization, etc.)
How are transformers made?
class Estimator extends PipelineStage {
  def fit(dataset: Dataset[_]): Transformer = {
    // magic happens here
  }
}
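A toy sketch of that contract in plain Python, with no Spark involved: fit() inspects the data and hands back a separate object that only knows how to transform. The names MeanCenterEstimator and MeanCenterModel are made up for illustration and are not part of Spark.

```python
class MeanCenterModel:
    """Toy 'transformer': holds what fit() learned and applies it."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, values):
        # No training happens here, we only apply the learned state
        return [v - self.mean for v in values]


class MeanCenterEstimator:
    """Toy 'estimator': fit() looks at the data and returns a transformer."""

    def fit(self, values):
        mean = sum(values) / len(values)
        return MeanCenterModel(mean)


model = MeanCenterEstimator().fit([1.0, 2.0, 3.0])
centered = model.transform([1.0, 2.0, 3.0])
# The fitted model can be re-used on data it was not fit on
reused = model.transform([10.0])
```

The split matters because the fitted model is the thing you keep around (and eventually save or serve), while the estimator is only configuration.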
Let’s start with loading some data
● We’ve got some CSV data; we could use textFile and
parse by hand
● spark-packages can save us by providing the spark-csv
package by Hossein Falaki
○ If we were building a Java project we can include maven coordinates
○ For the Spark shell when launching add:
--packages com.databricks:spark-csv_2.10:1.3.0
Loading with sparkSQL & spark-csv
spark.read returns a DataFrameReader
We can specify general properties & data specific options
● option(“key”, “value”)
○ spark-csv ones we will use are header & inferSchema
● format(“formatName”)
○ built in formats include parquet, jdbc, etc. today we will use csv (built in
since Spark 2.0, prior to that as com.databricks.spark.csv)
● load(“path”)
Loading with sparkSQL & spark-csv
df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("resources/"))
Let's explore training a Decision Tree
● Step 1: Data loading (done!)
● Step 2: Data prep (select features, etc.)
● Step 3: Train
● Step 4: Predict
Data prep / cleaning
● We need to predict a double (can be 0.0, 1.0, but type
must be double)
● We need to train with a vector of features
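# Spark 2.0+ pipelines use the pyspark.ml.linalg vectors
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.param import Param, Params
from pyspark.ml.feature import Bucketizer, VectorAssembler, StringIndexer
from pyspark.ml import Pipeline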
Data prep / cleaning continued
# Combines a list of double input features into a vector
assembler = VectorAssembler(inputCols=["age", "education-num"],
                            outputCol="features")
# String indexer converts a set of strings into doubles
indexer = (StringIndexer(inputCol="category")
           .setOutputCol("category-index"))
# Can be used to combine pipeline components together
pipeline = Pipeline().setStages([assembler, indexer])
So a bit more about that pipeline
● Each of our previous components has “fit” & “transform”
● Constructing the pipeline this way makes it easier to
work with (only need to call one fit & one transform)
● Can re-use the fitted model on future data
model = pipeline.fit(df)
prepared = model.transform(df)
What does the pipeline look like so far?
[Diagram: Input Data -> Assembler -> Input Data + Vectors -> StringIndexer -> Input Data + Cat ID + Vectors]
● Assembler: this is a transformer - no fitting required.
● StringIndexer: while not an ML learning algorithm, this still needs to be “fit”.
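Roughly what happens when you call fit on a pipeline, sketched in plain Python rather than Spark. ToyPipeline, ToyAssembler and ToyIndexer are hypothetical stand-ins, the rows are made-up dicts, and the toy indexer sorts labels alphabetically (Spark's StringIndexer orders by frequency by default).

```python
class ToyAssembler:
    """Transformer: packs the input columns into a 'features' list."""

    def __init__(self, input_cols):
        self.input_cols = input_cols

    def transform(self, rows):
        return [{**r, "features": [r[c] for c in self.input_cols]} for r in rows]


class ToyIndexerModel:
    """Fitted indexer: applies the label -> index mapping learned by fit()."""

    def __init__(self, input_col, output_col, mapping):
        self.input_col = input_col
        self.output_col = output_col
        self.mapping = mapping

    def transform(self, rows):
        return [{**r, self.output_col: self.mapping[r[self.input_col]]} for r in rows]


class ToyIndexer:
    """Estimator: has to see the data before it can assign indexes."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        labels = sorted({r[self.input_col] for r in rows})  # alphabetical toy ordering
        mapping = {label: float(i) for i, label in enumerate(labels)}
        return ToyIndexerModel(self.input_col, self.output_col, mapping)


class ToyPipelineModel:
    """Fitted pipeline: every stage is now a transformer."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fit on the data as transformed so far,
            # plain transformers are used as they are
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return ToyPipelineModel(fitted)


rows = [
    {"age": 39, "education-num": 13, "category": "<=50K"},
    {"age": 52, "education-num": 9, "category": ">50K"},
]
stages = [ToyAssembler(["age", "education-num"]),
          ToyIndexer("category", "category-index")]
prepared = ToyPipeline(stages).fit(rows).transform(rows)
```

One fit and one transform on the pipeline stand in for a fit/transform pair per stage.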
Let's train a model on our prepared data:
# Specify model
dt = DecisionTreeClassifier(labelCol="category-index",
                            featuresCol="features")
# Fit it
dt_model = dt.fit(prepared)
# Or as part of the pipeline
pipeline_and_model = Pipeline().setStages([assembler, indexer, dt])
pipeline_model = pipeline_and_model.fit(df)
Yay! You have an ML pipeline!
And predict the results on the same data:
pipeline_model.transform(df).select("prediction", "category-index").take(20)
Pipeline API has many models:
● Classification
○ BinaryLogisticRegressionClassification, DecisionTreeClassification,
GBTClassifier, etc.
● Regression
○ DecisionTreeRegression, GBTRegressor, IsotonicRegression,
LinearRegression, etc.
● Recommendation
○ ALS
● You can also check out spark-packages for some more
● But possibly not your special AwesomeFooBazinatorML
& data prep stages...
○ ~30 elements from VectorAssembler to Tokenizer, to PCA, etc.
● Often simpler to understand while getting started with
building our own stages
What is/why Sparkling ML
● A place for useful Spark ML pipeline stages to live
○ Including both feature transformers and estimators
● The why: Spark ML can’t keep up with every new algorithm
● Lots of cool ML on Spark tools exist, but many don’t play nice with Spark ML
or together
So now begins our adventure to add stages
So what does a pipeline stage look like?
Must provide:
● Scala: transformSchema (used to validate input schema is
reasonable) & copy
● Both: Either a “fit” (for estimator) or transform (for
transformer)
Often have:
● Params for configuration (so we can do meta-algorithms)
Building a simple transformer:
class HardCodedWordCountStage(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("hardcodedwordcount"))
  def copy(extra: ParamMap): HardCodedWordCountStage = defaultCopy(extra)
  ...
}
Verify the input schema is reasonable:
override def transformSchema(schema: StructType): StructType = {
  // Check that the input type is a string
  val idx = schema.fieldIndex("happy_pandas")
  val field = schema.fields(idx)
  if (field.dataType != StringType) {
    throw new Exception(
      s"Input type ${field.dataType} did not match input type StringType")
  }
  // Add the return field
  schema.add(StructField("happy_panda_counts", IntegerType, false))
}
How is transformSchema used?
● When you call fit on a pipeline it calls transformSchema
on the pipeline stages in order
● This is used to verify that things should work
● Ideally allows pipelines to fail fast when misconfigured,
instead of at the final stage of a 48-hour process
● Doesn’t always work that way :p
● Not supported in Python (I’m sorry!)
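The fail-fast idea can be sketched in plain Python. This is a toy, not Spark's API: the schema is just a dict of column name to type name, and word_count_transform_schema and validate_stages are hypothetical helpers that mirror what a pipeline does with each stage's transformSchema before any data is touched.

```python
def word_count_transform_schema(schema):
    """Toy schema check for the word count stage."""
    found = schema.get("happy_pandas")
    if found != "string":
        raise TypeError("Input type %r did not match input type 'string'" % found)
    # Add the return field
    return {**schema, "happy_panda_counts": "int"}


def validate_stages(schema, stage_schema_fns):
    """Run every stage's schema check in order, without reading any rows."""
    for fn in stage_schema_fns:
        schema = fn(schema)
    return schema


good = validate_stages({"happy_pandas": "string"}, [word_count_transform_schema])

try:
    validate_stages({"happy_pandas": "double"}, [word_count_transform_schema])
    failed_fast = False
except TypeError:
    # The misconfiguration shows up before any (expensive) work starts
    failed_fast = True
```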
Do the “work” (e.g. predict labels or w/e):
def transform(df: Dataset[_]): DataFrame = {
  val wordcount = udf { in: String => in.split(" ").size }
  df.select(col("*"), wordcount(df.col("happy_pandas")).as("happy_panda_counts"))
}
Do the “work” (e.g. call numpy or tensorflow or w/e):
class StrLenPlus3Transformer(Model):
    @keyword_only
    def __init__(self):
        super(StrLenPlus3Transformer, self).__init__()

    def _transform(self, dataset):
        func = lambda x: len(x) + 3
        retType = IntegerType()
        udf = UserDefinedFunction(func, retType)
        return dataset.withColumn("magic", udf("input"))
What about configuring our stage?
class ConfigurableWordCount(override val uid: String) extends Transformer {
  final val inputCol = new Param[String](this, "inputCol", "The input column")
  final val outputCol = new Param[String](this, "outputCol", "The output column")

  def setInputCol(value: String): this.type = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)
  ...
}
What about configuring our stage?
class StrLenPlusKTransformer(Model, HasInputCol, HasOutputCol):
    # We need a parameter to configure k
    k = Param(Params._dummy(),
              "k", "amount to add to str len",
              typeConverter=TypeConverters.toInt)

    @keyword_only
    def __init__(self, k=None, inputCol=None, outputCol=None):
        super(StrLenPlusKTransformer, self).__init__()
        kwargs = self._input_kwargs
        self.setParams(**kwargs)
What about configuring our stage?
    @keyword_only
    def setParams(self, k=None, inputCol=None, outputCol=None):
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def setK(self, value):
        return self._set(k=value)

    def getK(self):
        return self.getOrDefault(self.k)
So why do we configure it that way?
● Allow meta algorithms to work on it
● Scala:
○ If you look inside of spark you’ll see
“sharedParams.scala” for common params (like input
column)
○ We can’t access those unless we pretend to be inside
of org.apache.spark - so we have to make our own
● Python: Just import
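Why declared params help meta-algorithms, as a toy sketch in plain Python (not Spark's Param machinery): because the settings live in one generic place, code that knows nothing about the stage can copy it with different values. ToyStage and param_grid are hypothetical names for this example.

```python
import itertools


class ToyStage:
    """Toy stage whose params live in a dict so generic code can get/set them."""

    def __init__(self, **params):
        self.params = {"k": 3, "inputCol": "input", "outputCol": "output"}
        self.params.update(params)

    def copy(self, extra):
        # Same stage, with some params overridden; the original is untouched
        merged = dict(self.params)
        merged.update(extra)
        return ToyStage(**merged)

    def transform(self, rows):
        p = self.params
        return [{**r, p["outputCol"]: len(r[p["inputCol"]]) + p["k"]} for r in rows]


def param_grid(grid):
    """Expand {"k": [1, 2]} into a list of param dicts to try."""
    names = sorted(grid)
    combos = itertools.product(*(grid[n] for n in names))
    return [dict(zip(names, combo)) for combo in combos]


base = ToyStage()
# A "meta algorithm" only needs copy() and the param names
variants = [base.copy(extra) for extra in param_grid({"k": [1, 2]})]
outputs = [v.transform([{"input": "boo"}])[0]["output"] for v in variants]
```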
So how to make an estimator?
● Very similar, instead of directly providing transform
provide a `fit` which returns a “model” which implements
the transformer interface as shown above
● Also take a look at the algorithms in Spark itself (helpful
traits you can mix in to take care of many common things).
● Let’s look at a simple one now!
A simple string indexer estimator
class SimpleIndexer(override val uid: String) extends
    Estimator[SimpleIndexerModel] with SimpleIndexerParams {
  ...
  override def fit(dataset: Dataset[_]): SimpleIndexerModel = {
    import dataset.sparkSession.implicits._
    val words = dataset.select(dataset($(inputCol)).as[String]).distinct
      .collect()
    new SimpleIndexerModel(uid, words)
  }
}
Quick aside: What’s that “$(inputCol)”?
● How you get access to a configuration parameter
● Inside stage only (external use getInputCol just like
Java™ :p)
And our friend the transformer is back:
class SimpleIndexerModel(
    override val uid: String, words: Array[String]) extends
    Model[SimpleIndexerModel] with SimpleIndexerParams {
  ...
  private val labelToIndex: Map[String, Double] = words.zipWithIndex.
    map{case (x, y) => (x, y.toDouble)}.toMap

  override def transform(dataset: Dataset[_]): DataFrame = {
    val indexer = udf { label: String => labelToIndex(label) }
    dataset.select(col("*"),
      indexer(dataset($(inputCol)).cast(StringType)).as($(outputCol)))
  }
}
Ok so how do you make the train function?
● Read some papers on the algorithm(s) you care about
● Most likely some iterative approach (pro-tip: RDDs >
Datasets for iterative)
○ Write down your cost function
○ Seth has some interesting work around pluggable
optimizers -- you can play with them or read about
parallel iterative algorithms.
● Closed form solution? Go have a party!
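To make "write down your cost function and iterate" concrete, here is a minimal single-machine sketch in plain Python. It assumes a one-weight least-squares problem chosen only for illustration; fit_slope, the learning rate and the data are made up, and nothing here is distributed.

```python
def fit_slope(xs, ys, learning_rate=0.01, iterations=1000):
    """Gradient descent on cost(w) = mean((w * x - y) ** 2) for one weight w."""
    w = 0.0
    n = len(xs)
    for _ in range(iterations):
        # d(cost)/dw = mean(2 * (w * x - y) * x)
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return w


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = fit_slope(xs, ys)

# This particular cost also has a closed form solution (party time):
# w = sum(x * y) / sum(x * x)
closed_form = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
```

In a Spark version the sum inside the loop is the part that would be computed across the cluster on each iteration.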
What else can you add to your models?
● Put in an ML pipeline
● Do hyper-parameter tuning
And if you have some coffee left over:
● Persistence*
○ MLWriter & MLReader give you the basics
○ You’ll have to do a lot of work yourself :(
● Serving*
*With enough coffee. Not guaranteed.
Ok so I put my new fancy thing on GitHub
● Yay thank you!
● Please publish to maven central
● Also consider contributing it to SparklingML
● Also consider listing on spark-packages + user@ list
○ Let me know :)
● Think of the Python users (and I guess the R users) too?
Custom Estimators/Transformers in the Wild
● Classification/Regression: xgboost
● Deep Learning!: MXNet, DL4J, etc.
● Feature Transformation: FeatureHasher
More resources:
● High Performance Spark Example Repo has some
sample models
○ Of course buy several copies of the book - it is the gift of the season :p
● The models inside of Spark itself (internal APIs though)
● Sparkling ML - So much fun!
● Nick Pentreath’s FeatureHasher
● O’Reilly radar blog post
Want to do Optional Exercise 1?
Go from the index to something useful
● We could manually look up the labels and then write a
select statement
● Or we could look at the features on the
StringIndexerModel and use IndexToString
● Our pipeline has an array of stages we can use for this
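Solution:
from pyspark.ml.feature import IndexToString
# labels is a property on the fitted StringIndexerModel
labels = list(pipeline_model.stages[1].labels)
inverter = IndexToString(inputCol="prediction",
                         outputCol="prediction-label", labels=labels)
inverter.transform(pipeline_model.transform(df)).select(
    "prediction-label", "category").take(20)
# Pre Spark 1.6 use SQL if/else or similar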
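What the inversion boils down to, as a plain Python toy: the predicted double is a position in the indexer's label list. The label values below are made up for illustration, and index_to_string is a hypothetical helper, not a Spark function.

```python
def index_to_string(predictions, labels):
    """Map each predicted index (a double) back to its original label."""
    return [labels[int(p)] for p in predictions]


labels = ["<=50K", ">50K"]  # hypothetical label order from the indexer
prediction_labels = index_to_string([0.0, 1.0, 0.0], labels)
```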
Exercise 2: Add more features to your tree
● Finished quickly? Help others!
● Or tell me if adding these features helped or not…
○ We can download a reserve “test” dataset but how would we know if
we couldn’t do that?
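One possible answer to that question is k-fold cross-validation: hold out a different slice of the training data each round. A minimal plain Python sketch of the splitting step, assuming round-robin folds (k_fold_splits is a hypothetical helper, and real implementations usually shuffle first):

```python
def k_fold_splits(rows, k):
    """Yield (train, test) pairs where each fold is the test set exactly once."""
    folds = [rows[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test


splits = list(k_fold_splits(list(range(6)), 3))
```

Each candidate model is then fit on train and scored on test, and the scores are averaged across the folds.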
Exercise 3: Train a new model type
● Your choice!
● If you want to do regression - change what we are
predicting
Exercise 4: Make a pipeline stage!
● Birthday*-saurus** will thank you!
So serving...
● Generally refers to using your model online
○ Generating recommendations...
● In batch mode you can “just” save & use the Spark bits
● Spark’s “native” formats (often parquet w/metadata)
○ Understood by Spark libraries and that’s pretty much it
○ If you are serving in JVM can load these but need Spark
dependencies (albeit often not a Spark cluster)
● Some models support PMML export
○ etc.
● We can also write our own export & serving by hand :(
So what models are PMML exportable?
● Right now “old” style models
○ KMeans, LinearRegression, RidgeRegression, Lasso, SVM, and Binary
LogisticRegression
○ However if we look in the code we can sometimes find converters to
the old style models and use this to export our “new” style model
● Waiting on / for pipeline models
● Not getting in for 2.0 :(
How to PMML export*
toPMML
● returns a string or
● takes a path to local fs and saves results or
● takes a SparkContext & a distributed path and saves or
● takes a stream and writes result to stream
Optional* exercise time
● Take a model you trained and save it to PMML
○ You will have to dig around in the Spark code to be able to do this
● Look at the file
● Load it into a serving system and try some predictions
● Note: PMML export currently only includes the model -
not any transformations beforehand
● Also: you might need to train a new model
● If you don’t get it don’t worry - hints to follow :)

Spark ML Intro & Extending

  • 1. Hi LA Apache Spark friends! Thanks for coming early! Enjoy burritos and the WWGuest WiFi network (or you know in person networking works too)
  • 2. Intro & Extending Spark ML With your “friend” @holdenkarau & friend Boo! Hella-Legit
  • 3. Who’s presenting: ● My name is Holden Karau ● Preferred pronouns are she/her ● I’m a Principal Software Engineer at IBM’s Spark Technology Center ● Apache Spark committer (as of January!) :) ● previously Alpine, Databricks, Google, Foursquare & Amazon ● co-author of Learning Spark & High Performance Spark ● @holdenkarau ● Slide share ● Linkedin ● Github ● Spark Videos
  • 4. Who is Boo? ● Boo uses she/her pronouns (as I told the Texas house committee) ● Best doge ● Lots of experience barking at computers to make them go faster ● Author of “Learning to Bark” & “High Performance Barking” ● On twitter @BooProgrammer
  • 5.
  • 6. Spark Technology Center 6 IBM Spark Technology Center Founded in 2015. Location: Physical: 505 Howard St., San Francisco CA Web: Twitter: @apachespark_tc Mission: Contribute intellectual and technical capital to the Apache Spark community. Make the core technology enterprise- and cloud-ready. Build data science skills to drive intelligence into business applications — Key statistics: About 50 developers, co-located with 25 IBM designers. Major contributions to Apache Spark Apache SystemML is now an Apache Incubator project. Founding member of UC Berkeley AMPLab and RISE Lab Member of R Consortium and Scala Center Spark Technology Center
  • 7. What we are going to explore together! ● Who I think you all are ● Spark’s two different ML APIs ● Running through a simple example with one ● Model save/load ● Discussion of “serving” options ● Extending Spark ML ● Optional take home exercises ● Finish up with a raffle & cookies (sponsored by Qubole!) ○ Please fill out your ticket if you want to win!
  • 8. The different pieces of Spark Apache Spark SQL & DataFrames Streaming Language APIs Scala, Java, Python, & R Graph Tools Spark ML bagel & Graph X MLLib Community Packages
  • 9. The different pieces of Spark: 2.0+ Apache Spark SQL & DataFrames Streaming Language APIs Scala, Java, Python, & R Graph Tools Spark ML bagel & Graph X MLLib Community Packages Structured Streaming
  • 10. Who do I think you all are? ● Nice people* ● Some knowledge of Apache Spark core & maybe SQL ● Interested in using Spark for Machine Learning ● Familiar-ish with Scala or Java or Python Amanda
  • 11. If you're planning to follow along: ● Spark 2+ (Spark 2.2 would be best!) ○ (built with Hive support if building from source) ● Since this is a regular talk, you won’t have time to do the exercises as we go -- but you can come back and finish them after :)
  • 12. Some resources: OR Download census data Dwight Sipler
  • 13. Getting some data for working with: ● census data: ● Goal: predict income > 50k ● Also included in the github repo ● Download that now if you haven’t already ● We will add a header to the data ○ PROTill Westermayer
  • 14. So what are the two APIs? ● Traditional and Pipeline ○ Pipeline is the new shiny future which will fix all problems* ● Traditional API works on RDDs ○ Data preparation work is generally done in traditional Spark transformations ● Pipeline API works on DataFrames ○ Often we want to apply some transformations to our data before feeding to the machine learning algorithm ○ Makes it easy to chain these together (*until replaced by a newer shinier future) Steve Jurvetson
  • 15. So what are DataFrames / Datasets? ● Spark SQL’s version of RDDs of the world ○ It’s for more than just SQL ● Restricted data types, schema information, compile time untyped* ○ Datasets add the types back ● Slightly restricted operations (more relational style) ○ Still support many of the same functional programming magic ○ map & friends are here to stay, but at a cost ● Allow lots of fun extra optimizations ○ Tungsten, Apache Arrow, etc. ● Not Pandas or R DataFrames
  • 16. What is DataFrame performance like? Andrew Skudder
  • 17. Spark ML pipelines Tokenizer HashingTF String Indexer Naive Bayes Tokenizer HashingTF Streaming String Indexer Streaming Naive Bayes fit(df) Estimator Transformer ● Sci-Kit Learn Inspired ● Consist of Estimators and Transformers
  • 18. So what does a pipeline stage look like? Are either an: ● Estimator - has a method called “fit” which returns an transformer ● Transformer - no need to train can directly transform (e.g. HashingTF) (with transform) Both must provide: ● transformSchema* (used to validate input schema is reasonable) & copy Often have: ● Parameters for configuration (think input columns, regularization, etc.) Wendy Piersall
  • 19. How are transformers made? Estimator data class Estimator extends PipelineStage { def fit(dataset: Dataset[_]): Transformer = { // magic happens here } } Transformer
  • 20. Let’s start with loading some data ● We’ve got some CSV data, we could use textfile and parse by hand ● spark-packages can save by providing the spark-csv package by Hossein Falaki ○ If we were building a Java project we can include maven coordinates ○ For the Spark shell when launching add: --packages com.databricks:spark-csv_2.10:1.3.0 Jess Johnson
  • 21. Loading with sparkSQL & spark-csv returns a DataFrameReader We can specify general properties & data specific options ● option(“key”, “value”) ○ spark-csv ones we will use are header & inferSchema ● format(“formatName”) ○ built in formats include parquet, jdbc, etc. today we will use csv (added in Spark 2.0, prior to that as com.databricks.csv) ● load(“path”) Jess Johnson
  • 22. Loading with sparkSQL & spark-csv df = spark.read .format("csv") .option("header", "true") .option("inferSchema", "true") .load("resources/")
  • 23. Lets explore training a Decision Tree ● Step 1: Data loading (done!) ● Step 2: Data prep (select features, etc.) ● Step 3: Train ● Step 4: Predict
  • 24. Data prep / cleaning ● We need to predict a double (can be 0.0, 1.0, but type must be double) ● We need to train with a vector of features Imports: from pyspark.ml.linalg import Vectors from pyspark.ml.classification import DecisionTreeClassifier from pyspark.ml.param import Param, Params from pyspark.ml.feature import Bucketizer, VectorAssembler, StringIndexer from pyspark.ml import Pipeline
  • 25. Data prep / cleaning continued # Combines a list of double input features into a vector assembler = VectorAssembler(inputCols=["age", "education-num"], outputCol="features") # String indexer converts a set of strings into doubles indexer = StringIndexer(inputCol="category") .setOutputCol("category-index") # Can be used to combine pipeline components together pipeline = Pipeline().setStages([assembler, indexer])
  • 26. So a bit more about that pipeline ● Each of our previous components has “fit” & “transform” stage ● Constructing the pipeline this way makes it easier to work with (only need to call one fit & one transform) ● Can re-use the fitted model on future data prepared = model.transform(df) Andrey
  • 27. What does the pipeline look like so far? Input Data Assembler Input Data + Vectors StringIndexer Input Data +Cat ID + Vectors While not an ML learning algorithm this still needs to be “fit” This is a transformer - no fitting required. Ray Bodden
  • 28. Let's train a model on our prepared data: # Specify model dt = DecisionTreeClassifier(labelCol = "category-index", featuresCol="features") # Fit it dt_model = dt.fit(prepared) # Or as part of the pipeline pipeline_and_model = Pipeline().setStages([assembler, indexer, dt]) pipeline_model = pipeline_and_model.fit(df)
  • 29. Yay! You have an ML pipeline! Photo by Jessica Fiess-Hill
  • 30. And predict the results on the same data: pipeline_model.transform(df).select("prediction", "category-index").take(20)
  • 31. Pipeline API has many models: ● Classification ○ BinaryLogisticRegressionClassification, DecisionTreeClassification, GBTClassifier, etc. ● Regression ○ DecisionTreeRegression, GBTRegressor, IsotonicRegression, LinearRegression, etc. ● Recommendation ○ ALS ● You can also check out spark-packages for some more ● But possibly not your special AwesomeFooBazinatorML
  • 32. & data prep stages... ○ ~30 elements from VectorAssembler to Tokenizer, to PCA, etc. ● Often simpler to understand while getting started with building our own stages
  • 33. What is/why Sparkling ML ● A place for useful Spark ML pipeline stages to live ○ Including both feature transformers and estimators ● The why: Spark ML can’t keep up with every new algorithm ● Lots of cool ML on Spark tools exist, but many don’t play nice with Spark ML or together ●
  • 34. So now begins our adventure to add stages
  • 35. So what does a pipeline stage look like? Must provide: ● Scala: transformSchema (used to validate input schema is reasonable) & copy ● Both: Either a “fit” (for estimator) or transform (for transformer) Often have: ● Params for configuration (so we can do meta-algorithms) Wendy Piersall
  • 36. Building a simple transformer: class HardCodedWordCountStage(override val uid: String) extends Transformer { def this() = this(Identifiable.randomUID("hardcodedwordcount")) def copy(extra: ParamMap): HardCodedWordCountStage = { defaultCopy(extra) } ... } Not to be confused with the Transformers franchise from Hasbro and Tomy.
  • 37. Verify the input schema is reasonable: override def transformSchema(schema: StructType): StructType = { // Check that the input type is a string val idx = schema.fieldIndex("happy_pandas") val field = schema.fields(idx) if (field.dataType != StringType) { throw new Exception(s"Input type ${field.dataType} did not match input type StringType") } // Add the return field schema.add(StructField("happy_panda_counts", IntegerType, false)) }
  • 38. How is transformSchema used? ● When you call fit on a pipeline it calls transformSchema on the pipeline stages in order ● This is used to verify that things should work ● Ideally allows pipelines to fail fast when misconfigured, instead of at the final stage of a 48-hour process ● Doesn’t always work that way :p ● Not supported in Python (I’m sorry!) Tricia Hall
  • 39. Do the “work” (e.g. predict labels or w/e): def transform(df: Dataset[_]): DataFrame = { val wordcount = udf { in: String => in.split(" ").size } df.select(col("*"), wordcount(df.col("happy_pandas")).as("happy_panda_counts")) }
  • 40. Do the “work” (e.g. call numpy or tensorflow or w/e): class StrLenPlus3Transformer(Model): @keyword_only def __init__(self): super(StrLenPlus3Transformer, self).__init__() def _transform(self, dataset): func = lambda x : len(x) + 3 retType = IntegerType() udf = UserDefinedFunction(func, retType) return dataset.withColumn( "magic", udf("input") )
  • 41. What about configuring our stage? class ConfigurableWordCount(override val uid: String) extends Transformer { final val inputCol= new Param[String](this, "inputCol", "The input column") final val outputCol = new Param[String](this, "outputCol", "The output column") def setInputCol(value: String): this.type = set(inputCol, value) def setOutputCol(value: String): this.type = set(outputCol, value) Jason Wesley Upton
  • 42. What about configuring our stage? class StrLenPlusKTransformer(Model, HasInputCol, HasOutputCol): # We need a parameter to configure k k = Param(Params._dummy(), "k", "amount to add to str len", typeConverter=TypeConverters.toInt) @keyword_only def __init__(self, k=None, inputCol=None, outputCol=None): super(StrLenPlusKTransformer, self).__init__() kwargs = self._input_kwargs self.setParams(**kwargs) Jason Wesley Upton
  • 43. What about configuring our stage? @keyword_only def setParams(self, k=None, inputCol=None, outputCol=None): kwargs = self._input_kwargs return self._set(**kwargs) def setK(self, value): return self._set(k=value) def getK(self): return self.getOrDefault(self.k) Jason Wesley Upton
  • 44. So why do we configure it that way? ● Allow meta algorithms to work on it ● Scala: ○ If you look inside of spark you’ll see “sharedParams.scala” for common params (like input column) ○ We can’t access those unless we pretend to be inside of org.apache.spark - so we have to make our own ● Python: Just import Tricia Hall
  • 45. So how to make an estimator? ● Very similar, instead of directly providing transform provide a `fit` which returns a “model” which implements the transformer interface as shown above ● Also take a look at the algorithms in Spark itself (helpful traits you can mix in to take care of many common things). ● Let’s look at a simple one now!
  • 46. A simple string indexer estimator class SimpleIndexer(override val uid: String) extends Estimator[SimpleIndexerModel] with SimpleIndexerParams { …. override def fit(dataset: Dataset[_]): SimpleIndexerModel = { import dataset.sparkSession.implicits._ val words = dataset.select(dataset($(inputCol)).as[String]).distinct .collect() new SimpleIndexerModel(uid, words) } }
  • 47. Quick aside: What’s that “$(inputCol)”? ● How you get access to a configuration parameter ● Inside stage only (external use getInputCol just like Java™ :p)
  • 48. And our friend the transformer is back: class SimpleIndexerModel( override val uid: String, words: Array[String]) extends Model[SimpleIndexerModel] with SimpleIndexerParams { ... private val labelToIndex: Map[String, Double] = words.zipWithIndex. map{case (x, y) => (x, y.toDouble)}.toMap override def transform(dataset: Dataset[_]): DataFrame = { val indexer = udf { label: String => labelToIndex(label) } dataset.select(col("*"), indexer(dataset($(inputCol)).cast(StringType)).as($(outputCol))) } } Still not to be confused with the Transformers franchise from Hasbro and Tomy.
  • 49. Ok so how do you make the train function? ● Read some papers on the algorithm(s) you care about ● Most likely some iterative approach (pro-tip: RDDs > Datasets for iterative) ○ Write down your cost function ○ Seth has some interesting work around pluggable optimizers -- you can play with them or read about parallel iterative algorithms. ● Closed form solution? Go have a party!
  • 50. What else can you add to your models? ● Put in an ML pipeline ● Do hyper-parameter tuning And if you have some coffee left over: ● Persistence* ○ MLWriter & MLReader give you the basics ○ You’ll have to do a lot of work yourself :( ● Serving* *With enough coffee. Not guaranteed.
  • 51. Ok so I put my new fancy thing on GitHub ● Yay thank you! ● Please publish to maven central ● Also consider contributing it to SparklingML ● Also consider listing on spark-packages + user@ list ○ Let me know ( ) :) ● Think of the Python users (and I guess the R users) too?
  • 52. Custom Estimators/Transformers in the Wild Classification/Regression xgboost Deep Learning! MXNet DL4J etc. Feature Transformation FeatureHasher
  • 53. More resources: ● High Performance Spark Example Repo has some sample models ○ Of course buy several copies of the book - it is the gift of the season :p ● The models inside of Spark itself (internal APIs though) ● Sparkling ML - So much fun! ● Nick Pentreath’s FeatureHasher ● O’Reilly radar blog post ng-for-spark-ml Captain Pancakes
  • 54. Want to do Optional Exercise 1? Go from the index to something useful ● We could manually look up the labels and then write a select statement ● Or we could look at the features on the StringIndexerModel and use IndexToString ● Our pipeline has an array of stages we can use for this
  • 55. Solution: from pyspark.ml.feature import IndexToString labels = list(pipeline_model.stages[1].labels) inverter = IndexToString(inputCol="prediction", outputCol="prediction-label", labels=labels) inverter.transform(pipeline_model.transform(df)).select("prediction-label", "category").take(20) # Pre Spark 1.6 use SQL if/else or similar
  • 56. Exercise 2: Add more features to your tree ● Finished quickly? Help others! ● Or tell me if adding these features helped or not… ○ We can download a reserve “test” dataset but how would we know if we couldn’t do that? cobra libre
  • 57. Exercise 3: Train a new model type ● Your choice! ● If you want to do regression - change what we are predicting
  • 58. Exercise 4: Make a pipeline stage! ● Birthday*-saurus** will thank you! *Birthday in the sense you tell the olive garden it’s your birthday -- not like a “real” birthday
  • 59. Learning Spark Fast Data Processing with Spark (Out of Date) Fast Data Processing with Spark (2nd edition) Advanced Analytics with Spark Spark in Action High Performance SparkLearning PySpark
  • 60. High Performance Spark! Available today! The rest of you can buy it from that scrappy Seattle bookstore :p
  • 61. And some upcoming talks: ● Data Day Seattle (SEA, October) ● Spark Summit EU (Dublin, October) ● Big Data Spain (November, Madrid) ● Bee Scala (November, Ljubljana) ● Strata Singapore (Singapore, December) ● ScalaX (London, December) ● Linux Conf AU (Melbourne, January) ● Know of interesting conferences/webinar things that should be on my radar? Let me know!
  • 62. k thnx bye :) If you care about Spark testing and don’t hate surveys: I have to give a talk on this in ~ 2 weeks (meeps!) Will tweet results “eventually” @holdenkarau Any PySpark Users: Have some simple UDFs you wish ran faster you are willing to share?: Pssst: Have feedback on the presentation? Give me a shout ( if you feel comfortable doing so :)
  • 63. Cross-validation because saving a test set is effort ● Automagically* fit your model params ● Because thinking is effort ● has the tools ○ (not in Python yet so skipping for now) Jonathan Kotta
  • 65. So serving... ● Generally refers to using your model online ○ Generating recommendations... ● In batch mode you can “just” save & use the Spark bits ● Spark’s “native” formats (often parquet w/metadata) ○ Understood by Spark libraries and thats pretty much it ○ If you are serving in JVM can load these but need Spark dependencies (albeit often not a Spark cluster) ● Some models support PMML export ○ etc. ● We can also write our own export & serving by hand :( Ambernectar 13
  • 66. So what models are PMML exportable? ● Right now “old” style models ○ KMeans, LinearRegression, RidgeRegression, Lasso, SVM, and Binary LogisticRegression ○ However if we look in the code we can sometimes find converters to the old style models and use this to export our “new” style model ● Waiting on / for pipeline models ● Not getting in for 2.0 :(
  • 67. How to PMML export* toPMML ● returns a string or ● takes a path to local fs and saves results or ● takes a SparkContext & a distributed path and saves or ● takes a stream and writes result to stream Oooor just wait for something better
  • 70. Optional* exercise time ● Take a model you trained and save it to PMML ○ You will have to dig around in the Spark code to be able to do this ● Look at the file ● Load it into a serving system and try some predictions ● Note: PMML export currently only includes the model - not any transformations beforehand ● Also: you might need to train a new model ● If you don’t get it don’t worry - hints to follow :)