Matthew Tovbin
Principal Engineer, Salesforce Einstein
mtovbin@salesforce.com
@tovbinm
Fantastic ML apps and how to
build them
“This lonely scene – the galaxies like
dust, is what most of space looks like.
This emptiness is normal. The richness
of our own neighborhood is the
exception.”
– Powers of Ten (1977), by Charles and Ray Eames
Powers of Ten (1977)
A journey between a quark and the observable universe
[10^-17, 10^24]
Powers of Ten for Machine Learning
•  Data collection
•  Data preparation
•  Feature engineering
•  Feature selection
•  Sampling
•  Algorithm implementation
•  Hyperparameter tuning
•  Model selection
•  Model serving (scoring)
•  Prediction insights
•  Metrics
How long does it take to build a machine learning application?
a)  Hours
b)  Days
c)  Weeks
d)  Months
e)  More
How to cope with this complexity?
E = mc²


Free[F[_], A]

M[A]

Functor[F[_]]

Cofree[S[_], A]

Months -> Hours
“The task of the software development
team is to engineer the illusion of
simplicity.”
– Grady Booch
Complexity vs. Abstraction
Appropriate Level of Abstraction
Language syntax & semantics define degrees of freedom
Higher abstraction:
•  Less flexible
•  Simpler syntax
•  Reuse
•  Suitable for complex problems
Lower abstraction:
•  Difficult to use
•  More complex
•  Error prone
???
“FP removes one important dimension
of complexity:
To understand a program part (a
function) you need no longer account
for the possible histories of executions
that can lead to that program part.”
– Martin Odersky
Functional Approach
•  Type-safe
•  No side effects
•  Composability
•  Concise
•  Fine-grained control
// Extracting URL features
def urlFeatures(s: String): (Text, Text) = {
  val url = Url(s)
  url.protocol -> url.domain
}
Seq("http://einstein.com", "").map(urlFeatures)

> Seq((Text("http"), Text("einstein.com")),
      (Text(), Text()))
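The same idea can be sketched without the library. This is a minimal, self-contained version, assuming a hypothetical `Text` wrapper around `Option[String]` and using `java.net.URI` in place of the slide's `Url` type; neither is the library's implementation. An unparseable or empty input produces empty features rather than an exception.

```scala
import java.net.URI
import scala.util.Try

object UrlFeaturesSketch {
  // Hypothetical stand-in for the slide's Text type: a nullable string value.
  final case class Text(value: Option[String])

  // Pure function: the result depends only on the input string.
  def urlFeatures(s: String): (Text, Text) = {
    val uri = Try(new URI(s)).toOption                    // None if the string is not a valid URI
    val protocol = uri.flatMap(u => Option(u.getScheme))  // getScheme returns null when absent
    val domain = uri.flatMap(u => Option(u.getHost))      // getHost returns null when absent
    Text(protocol) -> Text(domain)
  }

  def main(args: Array[String]): Unit =
    Seq("http://einstein.com", "").map(urlFeatures).foreach(println)
}
```

Because `urlFeatures` has no side effects, it composes with `map` over any collection and can be tested in isolation.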
Object-oriented Approach
•  Modularity
•  Code reuse
•  Polymorphism
// Extracting text features
val txt = Seq(
  Url("http://einstein.com"),
  Base64("b25lIHR3byB0aHJlZQ=="),
  Text("Hello world!"),
  Phone("650-123-4567"),
  Text.empty
)
txt.map(_.tokenize)

> Seq(
    TextList("http", "einstein.com"),
    TextList("one", "two", "three"),
    TextList("Hello", "world"),
    TextList("+1", "650", "1234567"),
    TextList()
  )
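The polymorphism behind `tokenize` can be sketched with a small hierarchy. The types below (`TextLike`, `PlainText`, `Base64Text`) are hypothetical and much simpler than the library's. Each subtype either inherits the default tokenizer or overrides it, so a mixed collection can be tokenized with one call.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.{Base64 => JBase64}
import scala.util.Try

object TokenizeSketch {
  // Hypothetical mini-hierarchy, for illustration only.
  sealed trait TextLike {
    def value: Option[String]
    // Default tokenizer: split on runs of characters that are not letters or digits.
    def tokenize: List[String] =
      value.toList.flatMap(_.split("[^\\p{L}\\p{N}]+").toList).filter(_.nonEmpty)
  }

  final case class PlainText(value: Option[String]) extends TextLike

  final case class Base64Text(value: Option[String]) extends TextLike {
    // Decode first, then reuse the default tokenizer; invalid input yields no tokens.
    override def tokenize: List[String] = {
      val decoded = value.flatMap(v => Try(new String(JBase64.getDecoder.decode(v), UTF_8)).toOption)
      PlainText(decoded).tokenize
    }
  }

  def main(args: Array[String]): Unit = {
    val txt: Seq[TextLike] = Seq(
      Base64Text(Some("b25lIHR3byB0aHJlZQ==")),
      PlainText(Some("Hello world!")),
      PlainText(None)
    )
    txt.map(_.tokenize).foreach(println)
  }
}
```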
Why Scala?
•  Combines FP & OOP
•  Strongly-typed
•  Expressive
•  Concise
•  Fun (mostly)
•  Default for Spark
Optimus Prime
An AutoML library for building modular,
reusable, strongly typed ML workflows on Spark
•  Declarative & intuitive syntax
•  Proper level of abstraction
•  Aimed for simplicity & reuse
•  >90% accuracy with 100X reduction in time
Types Hide the Complexity
[Figure: FeatureType class hierarchy diagram. Types shown: FeatureType, OPNumeric, OPCollection, OPSet, OPList, OPVector, OPMap, NonNullable, Text, Email, Base64, Phone, ID, URL, ComboBox, PickList, TextArea, BinaryMap, IntegralMap, RealMap, TextMap, StateMap, …, DateList, DateTimeList, TextList, Geolocation, Integral, Real, RealNN, Binary, Percent, Currency, Date, DateTime, MultiPickList, City, Street, Country, PostalCode, State, Location, SingleResponse, MultiResponse, Categorical]
Legend: bold - abstract type, normal - concrete type, italic - trait, solid line - inheritance, dashed line - trait mixin
Type Safety Everywhere
•  Value Operations
•  Feature Operations
•  Transformation Pipelines (aka Workflow)
// Typed value operations
def tokenize(t: Text): TextList = t.map(_.split(" ")).toTextList

// Typed feature operations
val title: Feature[Text] =
  FeatureBuilder.Text[Book].extract(_.title).asPredictor
val tokens: Feature[TextList] = title.map(tokenize)

// Transformation pipelines
new OpWorkflow().setInput(books).setResultFeatures(tokens.vectorize())
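One way to picture a typed feature is as a named, typed function from a raw record to a value, where `map` only composes functions and nothing runs until records are supplied. The sketch below assumes a hypothetical two-parameter `Feature[R, A]` and a toy `Book` record; the library's `Feature` type is different and richer.

```scala
object TypedFeatureSketch {
  final case class Book(title: String, price: Double)

  // Hypothetical feature: a name plus a typed extractor from record R to value A.
  final case class Feature[R, A](name: String, extract: R => A) {
    // map composes a typed value operation onto the extractor; nothing is evaluated yet.
    def map[B](f: A => B): Feature[R, B] = Feature[R, B](s"$name.map", extract.andThen(f))
  }

  // Typed value operation on a nullable string.
  def tokenize(t: Option[String]): List[String] =
    t.toList.flatMap(_.split(" ").toList).filter(_.nonEmpty)

  // Typed feature operations.
  val title: Feature[Book, Option[String]] =
    Feature[Book, Option[String]]("title", (b: Book) => Option(b.title))
  val tokens: Feature[Book, List[String]] = title.map(tokenize)

  // A mismatch is rejected at compile time, for example:
  // val bad = title.map((d: Double) => d * 2)

  def main(args: Array[String]): Unit =
    println(tokens.extract(Book("Powers of Ten", 10.0)))
}
```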
Book Price Prediction
// Raw feature definitions
val authr = FeatureBuilder.PickList[Book].extract(_.author).asPredictor
val title = FeatureBuilder.Text[Book].extract(_.title).asPredictor
val descr = FeatureBuilder.Text[Book].extract(_.description).asPredictor
val price = FeatureBuilder.RealNN[Book].extract(_.price).asResponse

// Feature engineering: tokenize, tfidf etc.
val tokns = (title + descr).tokenize(removePunctuation = true)
val tfidf = tokns.tf(numTerms = 1024).idf(minFreq = 0)
val feats = Seq(tfidf, authr).vectorize()

// Model training
implicit val spark = SparkSession.builder.config(new SparkConf).getOrCreate
val books = spark.read.csv("books.csv").as[Book]
val preds = RegressionModelSelector().setInput(price, feats).getOutput
new OpWorkflow().setInput(books).setResultFeatures(feats, preds).train()
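To make the `tf(numTerms = 1024).idf(...)` step concrete, here is a minimal plain-Scala sketch of hashed term frequency followed by inverse document frequency weighting. It is a simplification under stated assumptions: buckets come from `String.hashCode` (Spark MLlib's `HashingTF` uses a MurmurHash3 variant), the smoothed formula log((docs + 1) / (docFreq + 1)) follows the form documented for Spark MLlib's `IDF`, and the `minFreq` filter is omitted.

```scala
object TfIdfSketch {
  // Hashing trick: map each token to one of numTerms buckets and count occurrences.
  def tf(tokens: Seq[String], numTerms: Int): Map[Int, Double] =
    tokens
      .groupBy(t => Math.floorMod(t.hashCode, numTerms))
      .map { case (bucket, ts) => bucket -> ts.size.toDouble }

  // Smoothed IDF per bucket: log((numDocs + 1) / (docFreq + 1)).
  def idf(docs: Seq[Map[Int, Double]]): Map[Int, Double] = {
    val n = docs.size
    docs
      .flatMap(_.keys)      // each document contributes a bucket at most once
      .groupBy(identity)
      .map { case (bucket, occ) => bucket -> math.log((n + 1.0) / (occ.size + 1.0)) }
  }

  // Sparse TF-IDF vectors, keyed by bucket index.
  def tfidf(docs: Seq[Seq[String]], numTerms: Int): Seq[Map[Int, Double]] = {
    val tfs = docs.map(tf(_, numTerms))
    val weights = idf(tfs)
    tfs.map(_.map { case (bucket, count) => bucket -> count * weights(bucket) })
  }

  def main(args: Array[String]): Unit =
    tfidf(Seq(Seq("powers", "of", "ten"), Seq("powers", "of", "two")), numTerms = 1024)
      .foreach(println)
}
```

A term that appears in every document gets weight log(1) = 0, so only the distinguishing terms contribute to the vector.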
Magic Behind “vectorize()”
// Raw feature definitions
val authr = FeatureBuilder.PickList[Book].extract(_.author).asPredictor
val title = FeatureBuilder.Text[Book].extract(_.title).asPredictor
val descr = FeatureBuilder.Text[Book].extract(_.description).asPredictor
val price = FeatureBuilder.RealNN[Book].extract(_.price).asResponse

// Feature engineering: tokenize, tfidf etc.
val tokns = (title + descr).tokenize(removePunctuation = true)
val tfidf = tokns.tf(numTerms = 1024).idf(minFreq = 0)
val feats = Seq(tfidf, authr).vectorize() // <- magic here

// Model training
implicit val spark = SparkSession.builder.config(new SparkConf).getOrCreate
val books = spark.read.csv("books.csv").as[Book]
val preds = RegressionModelSelector().setInput(price, feats).getOutput
new OpWorkflow().setInput(books).setResultFeatures(feats, preds).train()
Automatic Feature Engineering
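Raw features and the features derived from them, combined into a single vector:
•  Zipcode -> Average Income
•  Subject -> Top TF-IDF Terms
•  Phone -> Country Code, Phone Is Valid
•  Email -> Email Is Spammy, Top Email Domains
•  Age -> Age [0-15], Age [15-35], Age [>35]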
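The type-driven idea behind `vectorize()` can be sketched as a function that chooses an encoding from the feature's type. Everything here is a hypothetical illustration, not the library's logic: the `Raw`, `Age`, and `Email` types, the bucket boundaries (below 15, 15 to 35 inclusive, above 35), and the extra "was null" slot are all assumptions.

```scala
object VectorizeSketch {
  // Hypothetical typed raw values; the type decides the default encoding.
  sealed trait Raw
  final case class Age(value: Option[Int]) extends Raw
  final case class Email(value: Option[String]) extends Raw

  private def oneHot(index: Int, size: Int): Vector[Double] =
    Vector.tabulate(size)(i => if (i == index) 1.0 else 0.0)

  def vectorize(raw: Raw, topDomains: Vector[String]): Vector[Double] = raw match {
    // Age: three buckets plus a trailing "was null" slot.
    case Age(None)               => oneHot(3, 4)
    case Age(Some(a)) if a < 15  => oneHot(0, 4)
    case Age(Some(a)) if a <= 35 => oneHot(1, 4)
    case Age(Some(_))            => oneHot(2, 4)
    // Email: one slot per top domain, then "other", then "was null".
    case Email(value) =>
      val size = topDomains.size + 2
      value match {
        case None => oneHot(size - 1, size)
        case Some(e) =>
          val domain = e.split("@").lift(1).map(_.toLowerCase)
          val idx = domain.map(d => topDomains.indexOf(d)).filter(_ >= 0).getOrElse(topDomains.size)
          oneHot(idx, size)
      }
  }

  def main(args: Array[String]): Unit = {
    val top = Vector("gmail.com", "yahoo.com")
    val row: Seq[Raw] = Seq(Age(Some(20)), Email(Some("jane@gmail.com")))
    // Concatenate the per-feature encodings into one vector.
    println(row.flatMap(vectorize(_, top)).toVector)
  }
}
```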
Automatic Feature Engineering
•  Numeric: Imputation, Track null value, Log transformation for large range, Scaling - zNormalize, Smart Binning
•  Categorical: Imputation, Track null value, One Hot Encoding, Dynamic Top K pivot, Smart Binning, LabelCount Encoding, Category Embedding
•  Spatial: Augment with external data, e.g. avg income; Spatial fraudulent behavior, e.g. impossible travel speed; Geo-encoding
•  Temporal: Time difference, Time Binning, Time extraction (day, week, month, year), Closeness to major events
•  Text: Tokenization, Hash Encoding, TF-IDF, Word2Vec, Sentiment Analysis, Language Detection
•  More…
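Three of the numeric steps named on this slide (imputation, null tracking, z-normalization) fit in a few lines. This is a minimal sketch under assumptions of my own: missing values are imputed with the mean of the present values, and the standard deviation is the population form. The library may make different choices.

```scala
object NumericVectorizerSketch {
  // For each input returns (zScoreOfImputedValue, wasNullIndicator).
  def fitTransform(values: Seq[Option[Double]]): Seq[(Double, Double)] = {
    val present = values.flatten
    val mean = if (present.isEmpty) 0.0 else present.sum / present.size
    val variance =
      if (present.isEmpty) 0.0
      else present.map(v => (v - mean) * (v - mean)).sum / present.size
    val std = math.sqrt(variance)

    values.map { v =>
      val filled = v.getOrElse(mean)                          // imputation with the mean
      val z = if (std == 0.0) 0.0 else (filled - mean) / std  // z-normalization
      (z, if (v.isEmpty) 1.0 else 0.0)                        // null tracking
    }
  }

  def main(args: Array[String]): Unit =
    fitTransform(Seq(Some(1.0), None, Some(3.0))).foreach(println)
}
```

Keeping the null indicator as its own column lets a model learn from the fact that a value was missing, which imputation alone would hide.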
Automatic Feature Selection
•  Analyze features & calculate statistics
•  Ensure features have acceptable ranges
•  Is this feature a leaker?
•  Does this feature help our model? Is it
predictive?
Automatic Feature Selection
// Sanity check your features against the label
val checked = price.check(
  featureVector = feats,
  checkSample = 0.3,
  sampleSeed = 1L,
  sampleLimit = 100000L,
  maxCorrelation = 0.95,
  minCorrelation = 0.0,
  correlationType = Pearson,
  minVariance = 0.00001,
  removeBadFeatures = true
)

new OpWorkflow().setInput(books).setResultFeatures(checked, preds).train()
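One plausible reading of the `maxCorrelation` and `minVariance` parameters is sketched below: drop columns that barely vary, and drop columns whose Pearson correlation with the label is suspiciously high (likely leakers). This is an illustration of that reading only. The `keep` function and its column-map input are hypothetical, and sampling and `minCorrelation` are left out.

```scala
object SanityCheckSketch {
  private def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Population variance.
  def variance(xs: Seq[Double]): Double = {
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / xs.size
  }

  // Pearson correlation coefficient of two equally long columns.
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    val (mx, my) = (mean(xs), mean(ys))
    val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum / xs.size
    cov / math.sqrt(variance(xs) * variance(ys))
  }

  // Returns the names of the feature columns that survive the check.
  def keep(
      columns: Map[String, Seq[Double]],
      label: Seq[Double],
      maxCorrelation: Double = 0.95,
      minVariance: Double = 0.00001
  ): Set[String] =
    columns.collect {
      case (name, xs)
          if variance(xs) >= minVariance &&
            math.abs(pearson(xs, label)) <= maxCorrelation =>
        name
    }.toSet

  def main(args: Array[String]): Unit = {
    val label = Seq(1.0, 2.0, 3.0, 4.0)
    val columns = Map(
      "leaker"   -> label.map(_ * 2),          // perfectly correlated with the label
      "constant" -> Seq(5.0, 5.0, 5.0, 5.0),   // zero variance
      "noisy"    -> Seq(1.0, 3.0, 2.0, 5.0)    // informative but not a leaker
    )
    println(keep(columns, label))
  }
}
```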
Automatic Model Selection
•  Multiple algorithms to pick from
•  Many hyperparameters for each algorithm
•  Automated hyperparameter tuning
–  Faster model creation with improved metrics
–  Search algorithms to find the optimal hyperparameters, e.g. grid search, random search, bandit methods
Automatic Model Selection
// Model selection and hyperparameter tuning
val preds =
  RegressionModelSelector
    .withCrossValidation(
      dataSplitter = DataSplitter(reserveTestFraction = 0.1),
      numFolds = 3,
      validationMetric = Evaluators.Regression.rmse(),
      trainTestEvaluators = Seq.empty,
      seed = 1L)
    .setModelsToTry(LinearRegression, RandomForestRegression)
    .setLinearRegressionElasticNetParam(0, 0.5, 1)
    .setLinearRegressionMaxIter(10, 100)
    .setLinearRegressionSolver(Solver.LBFGS)
    .setRandomForestMaxDepth(2, 10)
    .setRandomForestNumTrees(10)
    .setInput(price, checked).getOutput

new OpWorkflow().setInput(books).setResultFeatures(checked, preds).train()
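Grid search with k-fold cross-validation, the simplest of the search strategies mentioned, can be sketched in plain Scala. The names below (`select`, `kFolds`, `ridgeFit`) are hypothetical, and the toy model is a one-feature ridge regression through the origin chosen only to keep the example short. It is not how the library selects models.

```scala
object GridSearchSketch {
  type Row = (Double, Double) // (feature, label)

  def rmse(model: Double => Double, data: Seq[Row]): Double =
    math.sqrt(data.map { case (x, y) => math.pow(model(x) - y, 2) }.sum / data.size)

  // Split rows into k folds by position; returns (train, validation) pairs.
  def kFolds(data: Seq[Row], k: Int): Seq[(Seq[Row], Seq[Row])] = {
    val indexed = data.zipWithIndex
    (0 until k).map { fold =>
      val (valid, train) = indexed.partition { case (_, i) => i % k == fold }
      (train.map(_._1), valid.map(_._1))
    }
  }

  // Try every candidate, average validation RMSE over the folds, keep the best.
  def select[P](grid: Seq[P], data: Seq[Row], k: Int)(
      fit: (P, Seq[Row]) => Double => Double
  ): (P, Double) =
    grid.map { p =>
      val scores = kFolds(data, k).map { case (train, valid) => rmse(fit(p, train), valid) }
      p -> scores.sum / scores.size
    }.minBy(_._2)

  // Toy model: ridge regression through the origin, slope = sum(xy) / (sum(x^2) + lambda).
  def ridgeFit(lambda: Double, train: Seq[Row]): Double => Double = {
    val slope = train.map { case (x, y) => x * y }.sum /
      (train.map { case (x, _) => x * x }.sum + lambda)
    x => slope * x
  }

  def main(args: Array[String]): Unit = {
    val data: Seq[Row] = (1 to 6).map(i => (i.toDouble, 2.0 * i))
    println(select(Seq(0.0, 1.0, 10.0), data, k = 3)(ridgeFit))
  }
}
```

Random search and bandit methods keep the same evaluate-and-compare loop and change only how candidates are proposed and how much budget each one receives.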
Automatic Model Selection
Demo
How well does it work?
•  Most of our models deployed in production are completely hands-free
•  We serve 475,000,000+ predictions per day
Fantastic ML apps HOWTO
•  Define appropriate level of abstraction
•  Use types to express it
•  Automate everything:
–  feature engineering & selection
–  model selection
–  hyperparameter tuning
–  etc.
Months -> Hours
Further exploration
Talks @ Scale By The Bay 2017:
•  “Real Time ML Pipelines in Multi-Tenant Environments” by
Karl Skucha and Yan Yang
•  “Fireworks - lighting up the sky with millions of Sparks” by
Thomas Gerber
•  “Functional Linear Algebra in Scala” by Vlad Patryshev
•  “Complex Machine Learning Pipelines Made Easy” by Chris
Rupley and Till Bergmann
•  “Just enough DevOps for data scientists” by Anya Bida
We are hiring!
einstein-recruiting@salesforce.com
Thank You
