R, Scikit-Learn and Apache Spark ML -
What difference does it make?
Villu Ruusmann
Openscoring OÜ
Overview
● Identifying long-standing, high-value opportunities in the
applied predictive analytics domain
● Thinking about problems in API terms
● Providing solutions in API terms
● Developing and applying custom tools
+ A couple of tips if you're looking to buy or sell a VW Golf
The trade-off
"More data beats better algorithms"
The state of the art
Scaling out horizontally
Elements of reproducibility
Standardized, human- and machine-readable descriptions:
● Dataset
● Data pre- and post-processing steps:
○ From real-life input table (SQL, CSV) to model
○ From model to real-life output table
● Model
● Statistics
Calling R from within Apache Spark
1. Create and initialize R runtime
2. Format and upload input RDD; upload and execute R
model; download output and parse into result RDD
3. Destroy R runtime
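The three steps above can be sketched as a per-partition lifecycle. This is only an illustration: `FakeRuntime`, `runtime` and `score_partition` are hypothetical names, and the "model" is a stand-in that sums each row; no real R or Spark API is called.

```python
from contextlib import contextmanager


class FakeRuntime:
    """Hypothetical stand-in for an external R runtime (not a real API)."""

    def __init__(self):
        self.alive = True

    def score(self, rows):
        # Stand-in for "upload input, execute R model, download output".
        return [sum(row) for row in rows]

    def close(self):
        self.alive = False


@contextmanager
def runtime():
    rt = FakeRuntime()  # Step 1: create and initialize the runtime
    try:
        yield rt
    finally:
        rt.close()  # Step 3: destroy the runtime, even if scoring fails


def score_partition(rows):
    """Score one partition; the runtime is created once per partition, not per row."""
    with runtime() as rt:
        return rt.score(list(rows))  # Step 2: format, upload, execute, download, parse


# Two toy "partitions" standing in for the partitions of an input RDD.
partitions = [[(1.0, 2.0), (3.0, 4.0)], [(5.0, 6.0)]]
results = [score_partition(p) for p in partitions]
print(results)
```

Creating the runtime once per partition rather than once per row spreads the start-up cost of step 1 over many rows.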
Calling Scikit-Learn from within Apache Spark
1. Format input RDD (e.g. using Java NIO) as numpy.array
2. Invoke Scikit-Learn via Python/C API
3. Parse output numpy.array into result RDD
API prioritization
Training << Maintenance ~ Deployment
One-time activity << Repeated activities
Short-term << Long-term
JPMML - Java PMML API
● Conversion API
● Maintenance API
● Execution API
○ Interpreted mode
○ Translated + compiled ("Transpiled") mode
● Serving API
○ Integrations with popular Big Data frameworks
○ REST web service
Calling JPMML-Spark from within Apache Spark
org.jpmml.spark.TransformerBuilder pmmlTransformerBuilder = ..;
org.apache.spark.ml.Transformer pmmlTransformer = pmmlTransformerBuilder.build();
org.apache.spark.sql.Dataset<Row> input = ..;
org.apache.spark.sql.Dataset<Row> result = pmmlTransformer.transform(input);
The case study
Predicting the price of VW Golf cars using GBT algorithms:
● 71 columns:
○ A continuous label: log(price)
○ Two string and four numeric categorical features
○ 64 binary-like (0/1) and numeric continuous features
● 270'458 rows:
○ 153'978 complete cases
○ 116'480 incomplete (i.e. with missing values) cases
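A quick check of the figures above, plus a sketch of the log(price) label. The natural logarithm and the example price are assumptions; the slide states only that the label is log(price).

```python
import math

# Row counts as stated on the slide.
complete, incomplete = 153_978, 116_480
total = complete + incomplete
share_incomplete = incomplete / total  # roughly 0.43


def to_label(price):
    """Training label: log of the price (natural log assumed)."""
    return math.log(price)


def from_label(prediction):
    """Map a prediction on the log scale back to a price."""
    return math.exp(prediction)


price = 12_500.0  # hypothetical price, for illustration only
round_trip = from_label(to_label(price))
print(total, round(share_incomplete, 3), round_trip)
```

Because boosted trees add up per-tree scores, an additive contribution on the log scale acts as a multiplicative factor on the price itself.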
Gradient-Boosted Trees (GBTs)
R training and conversion API
#library("caret")
library("gbm")
library("r2pmml")

# Tab-separated input; "N/A" marks a missing value
cars = read.csv("cars.tsv", sep = "\t", na.strings = "N/A")

factor_cols = c("category", "colour", "ac", "fuel_type", "gearbox", "interior_color", "interior_type")
for(factor_col in factor_cols){
    cars[, factor_col] = as.factor(cars[, factor_col])
}

# Doesn't work with factors with missing values
#cars.gbm = train(price ~ ., data = cars, method = "gbm", na.action = na.pass, ..)
cars.gbm = gbm(price ~ ., data = cars, n.trees = 100, shrinkage = 0.1, interaction.depth = 6)

r2pmml(cars.gbm, "gbm.pmml")
Scikit-Learn training and conversion API
import pandas

from sklearn_pandas import DataFrameMapper
from sklearn.model_selection import GridSearchCV
from sklearn2pmml import sklearn2pmml, PMMLPipeline

# Tab-separated input; "N/A" and "NA" mark missing values
cars = pandas.read_csv("cars.tsv", sep = "\t", na_values = ["N/A", "NA"])

mapper = DataFrameMapper(..)
regressor = ..
tuner = GridSearchCV(regressor, param_grid = .., fit_params = ..)
tuner.fit(mapper.fit_transform(cars), cars["price"])

pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("regressor", tuner.best_estimator_)
])
sklearn2pmml(pipeline, "pipeline.pmml", with_repr = True)
Dataset
                   | R                 | LightGBM             | XGBoost                  | Scikit-Learn             | Apache Spark ML
Abstraction        | data.frame        | lgb.Dataset          | xgb.DMatrix              | numpy.array              | RDD<Vector>
Memory layout      | Contiguous, dense | Contiguous, dense(?) | Contiguous, dense/sparse | Contiguous, dense/sparse | Distributed, dense/sparse
Data type          | Any               | double               | float                    | float or double          | double
Categorical values | As-is (factor)    | Encoded              | Binarized                | Binarized                | Binarized
Missing values     | Yes               | Pseudo (NaN)         | Pseudo (NaN)             | No                       | No
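The difference between "Encoded" and "Binarized" categorical values can be sketched in plain Python. The helper names `label_encode` and `binarize` and the sample column are illustrative only; they are not the APIs of any of the libraries in the table.

```python
def label_encode(values):
    """Map each category to an integer code; None (missing) stays None."""
    categories = sorted({v for v in values if v is not None})
    index = {category: i for i, category in enumerate(categories)}
    codes = [index[v] if v is not None else None for v in values]
    return categories, codes


def binarize(values):
    """Expand one column into one 0/1 column per category."""
    categories = sorted({v for v in values if v is not None})
    rows = [[1 if v == category else 0 for category in categories] for v in values]
    return categories, rows


colours = ["Rot", "Blau", None, "Rot"]  # toy column with one missing value
cats, codes = label_encode(colours)
_, onehot = binarize(colours)
print(cats, codes, onehot)
```

In this sketch a missing value becomes an all-zero row after binarization, so it can no longer be told apart from "none of the known categories".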
LightGBM via Scikit-Learn
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.preprocessing import PMMLLabelEncoder
from lightgbm import LGBMRegressor

mapper = DataFrameMapper(
    [(factor_column, PMMLLabelEncoder()) for factor_column in factor_columns] +
    [(continuous_columns, None)]
)
transformed_cars = mapper.fit_transform(cars)

regressor = LGBMRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 6, num_leaves = 64)
regressor.fit(transformed_cars, cars["price"],
    categorical_feature = list(range(0, len(factor_columns))))
XGBoost via Scikit-Learn
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.preprocessing import PMMLLabelBinarizer
from xgboost.sklearn import XGBRegressor

mapper = DataFrameMapper(
    [(factor_column, PMMLLabelBinarizer()) for factor_column in factor_columns] +
    [(continuous_columns, None)]
)
transformed_cars = mapper.fit_transform(cars)

regressor = XGBRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 6)
regressor.fit(transformed_cars, cars["price"])
GBT algorithm (training)
                   | R              | LightGBM      | XGBoost           | Scikit-Learn              | Apache Spark ML
Abstraction        | gbm            | LGBMRegressor | XGBRegressor      | GradientBoostingRegressor | GBTRegressor
Parameterizability | Medium         | High          | High              | Medium                    | Medium
Split type         | Multi-way      | Binary        | Binary            | Binary                    | Binary
Categorical values | "set contains" | "equals"      | Pseudo ("equals") | Pseudo ("equals")         | "equals"
Missing values     | First-class    | Pseudo        | Pseudo            | No                        | No
gbm-style splits
<Node id="9">
    <SimplePredicate field="interior_type" operator="isMissing"/>
    <Node id="12" score="3.0702062395803734E-4">
        <SimplePredicate field="colour" operator="isMissing"/>
    </Node>
    <Node id="10" score="-0.018950416258408962">
        <SimpleSetPredicate field="colour" booleanOperator="isIn">
            <Array type="string">Grün Rot Violett Weiß</Array>
        </SimpleSetPredicate>
    </Node>
    <Node id="11" score="-0.0017446280908351925">
        <SimpleSetPredicate field="colour" booleanOperator="isIn">
            <Array type="string">Beige Blau Braun Gelb Gold Grau Orange Schwarz Silber</Array>
        </SimpleSetPredicate>
    </Node>
</Node>
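Read as code, the three children of node 9 form a multi-way split on colour: one branch for a missing value and two "set contains" branches. A minimal sketch follows, assuming `None` stands for a missing value and that the record has already reached node 9; the function name `score_below_node_9` is hypothetical.

```python
# Category sets and scores copied from the PMML fragment above.
SET_NODE_10 = {"Grün", "Rot", "Violett", "Weiß"}
SET_NODE_11 = {"Beige", "Blau", "Braun", "Gelb", "Gold", "Grau",
               "Orange", "Schwarz", "Silber"}


def score_below_node_9(colour):
    """Pick the child of node 9 that matches the colour (None = missing)."""
    if colour is None:
        return 3.0702062395803734e-4  # Node 12: colour is missing
    if colour in SET_NODE_10:
        return -0.018950416258408962  # Node 10: first "set contains" branch
    if colour in SET_NODE_11:
        return -0.0017446280908351925  # Node 11: second "set contains" branch
    return None  # Category not listed in the fragment: no child matches


print(score_below_node_9(None), score_below_node_9("Rot"), score_below_node_9("Blau"))
```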
LightGBM- and XGBoost-style splits (1/3)
<Node id="39" defaultChild="76">
    <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
    <Node id="76" score="0.0030283758">
        <SimplePredicate field="colour" operator="notEqual" value="Orange"/>
    </Node>
    <Node id="77" score="0.02483887">
        <SimplePredicate field="colour" operator="equal" value="Orange"/>
    </Node>
</Node>
LightGBM- and XGBoost-style splits (2/3)
<Node id="39">
    <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
    <!-- if(colour == null || !"Orange".equals(colour)) return 0.0030283758 -->
    <Node id="76" score="0.0030283758">
        <CompoundPredicate booleanOperator="or">
            <SimplePredicate field="colour" operator="isMissing"/>
            <SimplePredicate field="colour" operator="notEqual" value="Orange"/>
        </CompoundPredicate>
    </Node>
    <!-- else if("Orange".equals(colour)) return 0.02483887 -->
    <Node id="77" score="0.02483887">
        <SimplePredicate field="colour" operator="equal" value="Orange"/>
    </Node>
    <!-- else return null -->
</Node>
LightGBM- and XGBoost-style splits (3/3)
<Node id="39">
    <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
    <!-- if(colour != null && "Orange".equals(colour)) return 0.02483887 -->
    <Node id="77" score="0.02483887">
        <CompoundPredicate booleanOperator="and">
            <SimplePredicate field="colour" operator="isNotMissing"/>
            <SimplePredicate field="colour" operator="equal" value="Orange"/>
        </CompoundPredicate>
    </Node>
    <!-- else return 0.0030283758 -->
    <Node id="76" score="0.0030283758">
        <True/>
    </Node>
</Node>
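The three encodings of node 39 are meant to produce the same score for every colour, including a missing one. The sketch below checks that in plain Python, with `None` standing for a missing value. The function names are hypothetical, and `default_child` assumes that a record whose predicates cannot be decided is sent to the default child, which the slide implies but does not spell out.

```python
SCORE_76 = 0.0030283758
SCORE_77 = 0.02483887


def default_child(colour):
    """Variant 1: simple predicates plus defaultChild="76"."""
    if colour is None:
        return SCORE_76  # Neither predicate can be decided: take the default child
    return SCORE_77 if colour == "Orange" else SCORE_76


def explicit_missing(colour):
    """Variant 2: "isMissing OR notEqual" first, then "equal"."""
    if colour is None or colour != "Orange":
        return SCORE_76
    if colour == "Orange":
        return SCORE_77
    return None  # Unreachable for these inputs; mirrors "else return null"


def reordered(colour):
    """Variant 3: "isNotMissing AND equal" first, then a catch-all True."""
    if colour is not None and colour == "Orange":
        return SCORE_77
    return SCORE_76


for colour in (None, "Orange", "Blau"):
    print(colour, default_child(colour), explicit_missing(colour), reordered(colour))
```

The third variant needs only one compound predicate, and its last child always matches, so no record can fall through without a score.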
Model measurement using JPMML
org.dmg.pmml.tree.TreeModel treeModel = ..;
treeModel.accept(new org.jpmml.model.visitors.AbstractVisitor(){

    private int count = 0; // Number of Node elements
    private int maxDepth = 0; // Max "nesting depth" of Node elements

    @Override
    public VisitorAction visit(org.dmg.pmml.tree.Node node){
        this.count++;
        int depth = 0;
        for(org.dmg.pmml.PMMLObject parent : getParents()){
            if(!(parent instanceof org.dmg.pmml.tree.Node)) break;
            depth++;
        }
        this.maxDepth = Math.max(this.maxDepth, depth);
        return super.visit(node);
    }
});
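The same two measurements can be sketched without JPMML, using Python's standard XML parser on a small fragment. The fragment is shortened from the gbm-style example and carries no XML namespace; a real PMML file declares one, so the tag lookup would need adjusting. The function name `measure` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened, namespace-free fragment for illustration.
FRAGMENT = """
<Node id="9">
    <SimplePredicate field="interior_type" operator="isMissing"/>
    <Node id="12" score="3.0702062395803734E-4">
        <SimplePredicate field="colour" operator="isMissing"/>
    </Node>
    <Node id="10" score="-0.018950416258408962">
        <True/>
    </Node>
</Node>
"""


def measure(node, depth=0):
    """Return (count, max_depth) over nested Node elements; the top Node has depth 0."""
    count, max_depth = 1, depth
    for child in node.findall("Node"):  # Direct Node children only
        child_count, child_depth = measure(child, depth + 1)
        count += child_count
        max_depth = max(max_depth, child_depth)
    return count, max_depth


root = ET.fromstring(FRAGMENT.strip())
count, max_depth = measure(root)
print(count, max_depth)
```

As in the visitor above, depth is the number of enclosing Node elements, so the top-level Node has depth 0.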
GBT algorithm (interpretation)
                    | R            | LightGBM           | XGBoost                    | Scikit-Learn    | Apache Spark ML
Feature importances | Direct       | Direct             | Transformed                | Transformed     | Transformed
Decision path       | No           | No(?)              | No(?)                      | Transformed     | Transformed
Model persistence   | RDS (binary) | Proprietary (text) | Proprietary (binary, text) | Pickle (binary) | SER (binary) or JSON (text)
Model reusability   | Good         | Fair(?)            | Good                       | Fair            | Fair
Java API            | No           | No                 | Pseudo                     | No              | Yes
LightGBM feature importances
Age 936
Mileage 887
Performance 738
[Category] 205
New? 179
[Type of fuel] 170
[Type of interior] 167
Airbags? 130
[Colour] 129
[Type of gearbox] 105
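To compare the figures above, they can be turned into shares. The sketch assumes the numbers are raw counts (for example split counts; the slide does not say), and the shares are relative to the ten listed features only, not to all features in the model.

```python
# Values copied from the slide; bracketed names are categorical features.
importances = {
    "Age": 936,
    "Mileage": 887,
    "Performance": 738,
    "[Category]": 205,
    "New?": 179,
    "[Type of fuel]": 170,
    "[Type of interior]": 167,
    "Airbags?": 130,
    "[Colour]": 129,
    "[Type of gearbox]": 105,
}

total = sum(importances.values())
shares = {name: value / total for name, value in importances.items()}
top3_share = sum(sorted(shares.values(), reverse=True)[:3])

for name, share in shares.items():
    print(f"{name:20s} {share:6.1%}")
```

Among the listed features, the three continuous ones (Age, Mileage, Performance) account for about 70% of the total.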
Model execution using JPMML
org.dmg.pmml.PMML pmml;
try(InputStream is = ..){
    pmml = org.jpmml.model.PMMLUtil.unmarshal(is);
}

org.jpmml.evaluator.Evaluator evaluator =
    new org.jpmml.evaluator.mining.MiningModelEvaluator(pmml);

org.jpmml.evaluator.InputField inputField = selectField(evaluator.getInputFields(), ..);
org.jpmml.evaluator.TargetField targetField = selectField(evaluator.getTargetFields(), ..);

for(int value = min; value <= max; value += increment){
    Map<FieldName, FieldValue> arguments =
        Collections.singletonMap(inputField.getName(), inputField.prepare(value));
    Map<FieldName, ?> result = evaluator.evaluate(arguments);
    System.out.println(result.get(targetField.getName()));
}
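The loop above varies a single input field and leaves all other fields unset, which yields a one-dimensional response curve of the model. A language-neutral sketch of that pattern follows; `sweep` and `toy_model` are hypothetical names, and the toy model merely stands in for a real evaluator.

```python
def sweep(evaluate, field, start, stop, step):
    """Evaluate the model for field = start, start + step, ..., up to stop (inclusive)."""
    points = []
    value = start
    while value <= stop:
        # Only one field is supplied; all others are treated as missing.
        points.append((value, evaluate({field: value})))
        value += step
    return points


def toy_model(arguments):
    """Hypothetical stand-in for a model evaluator: score falls linearly with age."""
    return 10.0 - 0.1 * arguments["age"]


curve = sweep(toy_model, "age", 0, 10, 5)
print(curve)
```

This only gives meaningful results when the model tolerates missing values in the remaining fields, which is one of the properties compared in the tables above.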
Lessons (to be) learned
● Limits and limitations of individual APIs
● Vertical integration vs. horizontal integration:
○ All capabilities on a single platform
○ Specialized capabilities on specialized platforms
● Ease-of-use and robustness beat raw performance in
most application scenarios
● "Conventions over configuration"
Q&A
villu@openscoring.io
https://github.com/jpmml
https://github.com/openscoring
https://groups.google.com/forum/#!forum/jpmml

Más contenido relacionado

La actualidad más candente

Intro to Neo4j and Graph Databases
Intro to Neo4j and Graph DatabasesIntro to Neo4j and Graph Databases
Intro to Neo4j and Graph DatabasesNeo4j
 
An Introduction to Graph Databases
An Introduction to Graph DatabasesAn Introduction to Graph Databases
An Introduction to Graph DatabasesInfiniteGraph
 
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnApache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnDatabricks
 
Applied Data Science Capstone
Applied Data Science CapstoneApplied Data Science Capstone
Applied Data Science CapstoneOmid Karami
 
Data visualization using R
Data visualization using RData visualization using R
Data visualization using RUmmiya Mohammedi
 
Best Practices for Hyperparameter Tuning with MLflow
Best Practices for Hyperparameter Tuning with MLflowBest Practices for Hyperparameter Tuning with MLflow
Best Practices for Hyperparameter Tuning with MLflowDatabricks
 
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...CloudxLab
 
Data visualization in Python
Data visualization in PythonData visualization in Python
Data visualization in PythonMarc Garcia
 
Productionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices ArchitectureProductionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices ArchitectureDatabricks
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowDatabricks
 
Large Scale Geospatial Indexing and Analysis on Apache Spark
Large Scale Geospatial Indexing and Analysis on Apache SparkLarge Scale Geospatial Indexing and Analysis on Apache Spark
Large Scale Geospatial Indexing and Analysis on Apache SparkDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Deploying End-to-End Deep Learning Pipelines with ONNX
Deploying End-to-End Deep Learning Pipelines with ONNXDeploying End-to-End Deep Learning Pipelines with ONNX
Deploying End-to-End Deep Learning Pipelines with ONNXDatabricks
 
Reduce Side Joins
Reduce Side Joins Reduce Side Joins
Reduce Side Joins Edureka!
 
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...Edureka!
 

La actualidad más candente (20)

Apache spark
Apache sparkApache spark
Apache spark
 
Intro to Neo4j and Graph Databases
Intro to Neo4j and Graph DatabasesIntro to Neo4j and Graph Databases
Intro to Neo4j and Graph Databases
 
Python for Data Science
Python for Data SciencePython for Data Science
Python for Data Science
 
An Introduction to Graph Databases
An Introduction to Graph DatabasesAn Introduction to Graph Databases
An Introduction to Graph Databases
 
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnApache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
 
Applied Data Science Capstone
Applied Data Science CapstoneApplied Data Science Capstone
Applied Data Science Capstone
 
Data visualization using R
Data visualization using RData visualization using R
Data visualization using R
 
Best Practices for Hyperparameter Tuning with MLflow
Best Practices for Hyperparameter Tuning with MLflowBest Practices for Hyperparameter Tuning with MLflow
Best Practices for Hyperparameter Tuning with MLflow
 
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
 
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
 
Data visualization in Python
Data visualization in PythonData visualization in Python
Data visualization in Python
 
Productionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices ArchitectureProductionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices Architecture
 
Spark graphx
Spark graphxSpark graphx
Spark graphx
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflow
 
Large Scale Geospatial Indexing and Analysis on Apache Spark
Large Scale Geospatial Indexing and Analysis on Apache SparkLarge Scale Geospatial Indexing and Analysis on Apache Spark
Large Scale Geospatial Indexing and Analysis on Apache Spark
 
Lect12 graph mining
Lect12 graph miningLect12 graph mining
Lect12 graph mining
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Deploying End-to-End Deep Learning Pipelines with ONNX
Deploying End-to-End Deep Learning Pipelines with ONNXDeploying End-to-End Deep Learning Pipelines with ONNX
Deploying End-to-End Deep Learning Pipelines with ONNX
 
Reduce Side Joins
Reduce Side Joins Reduce Side Joins
Reduce Side Joins
 
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
 

Destacado

Representing TF and TF-IDF transformations in PMML
Representing TF and TF-IDF transformations in PMMLRepresenting TF and TF-IDF transformations in PMML
Representing TF and TF-IDF transformations in PMMLVillu Ruusmann
 
Ingesting Drone Data into Big Data Platforms
Ingesting Drone Data into Big Data Platforms Ingesting Drone Data into Big Data Platforms
Ingesting Drone Data into Big Data Platforms Timothy Spann
 
On the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) modelsOn the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) modelsVillu Ruusmann
 
Seattle spark-meetup-032317
Seattle spark-meetup-032317Seattle spark-meetup-032317
Seattle spark-meetup-032317Nan Zhu
 
Instaclustr webinar 50,000 transactions per second with Apache Spark on Apach...
Instaclustr webinar 50,000 transactions per second with Apache Spark on Apach...Instaclustr webinar 50,000 transactions per second with Apache Spark on Apach...
Instaclustr webinar 50,000 transactions per second with Apache Spark on Apach...Instaclustr
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache SparkDatabricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017MLconf
 
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...Chris Fregly
 
Apache Spark An Overview
Apache Spark An OverviewApache Spark An Overview
Apache Spark An OverviewMohit Jain
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkDatabricks
 
Getting Started with Alluxio + Spark + S3
Getting Started with Alluxio + Spark + S3Getting Started with Alluxio + Spark + S3
Getting Started with Alluxio + Spark + S3Alluxio, Inc.
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Databricks
 
Product Update: EDB Postgres Platform 2017
Product Update: EDB Postgres Platform 2017Product Update: EDB Postgres Platform 2017
Product Update: EDB Postgres Platform 2017EDB
 
Machine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMachine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMeeraj Kunnumpurath
 

Destacado (20)

Representing TF and TF-IDF transformations in PMML
Representing TF and TF-IDF transformations in PMMLRepresenting TF and TF-IDF transformations in PMML
Representing TF and TF-IDF transformations in PMML
 
Yace 3.0
Yace 3.0Yace 3.0
Yace 3.0
 
Ingesting Drone Data into Big Data Platforms
Ingesting Drone Data into Big Data Platforms Ingesting Drone Data into Big Data Platforms
Ingesting Drone Data into Big Data Platforms
 
On the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) modelsOn the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) models
 
Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14
 
Seattle spark-meetup-032317
Seattle spark-meetup-032317Seattle spark-meetup-032317
Seattle spark-meetup-032317
 
Weld Strata talk
Weld Strata talkWeld Strata talk
Weld Strata talk
 
Instaclustr webinar 50,000 transactions per second with Apache Spark on Apach...
Instaclustr webinar 50,000 transactions per second with Apache Spark on Apach...Instaclustr webinar 50,000 transactions per second with Apache Spark on Apach...
Instaclustr webinar 50,000 transactions per second with Apache Spark on Apach...
 
Production Grade Data Science for Hadoop
Production Grade Data Science for HadoopProduction Grade Data Science for Hadoop
Production Grade Data Science for Hadoop
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017
 
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
 
Apache Spark An Overview
Apache Spark An OverviewApache Spark An Overview
Apache Spark An Overview
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
 
Getting Started with Alluxio + Spark + S3
Getting Started with Alluxio + Spark + S3Getting Started with Alluxio + Spark + S3
Getting Started with Alluxio + Spark + S3
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
 
Product Update: EDB Postgres Platform 2017
Product Update: EDB Postgres Platform 2017Product Update: EDB Postgres Platform 2017
Product Update: EDB Postgres Platform 2017
 
Integrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data LakesIntegrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data Lakes
 
Machine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMachine Learning by Example - Apache Spark
Machine Learning by Example - Apache Spark
 

Similar a R, Scikit-Learn and Apache Spark ML - What difference does it make?

TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...Chetan Khatri
 
Semantic search in databases
Semantic search in databasesSemantic search in databases
Semantic search in databasesTomáš Drenčák
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaChetan Khatri
 
cbse 12 computer science investigatory project
cbse 12 computer science investigatory project  cbse 12 computer science investigatory project
cbse 12 computer science investigatory project D. j Vicky
 
cbse 12 computer science investigatory project
cbse 12 computer science investigatory project  cbse 12 computer science investigatory project
cbse 12 computer science investigatory project D. j Vicky
 
Use Angular Schematics to Simplify Your Life - Develop Denver 2019
Use Angular Schematics to Simplify Your Life - Develop Denver 2019Use Angular Schematics to Simplify Your Life - Develop Denver 2019
Use Angular Schematics to Simplify Your Life - Develop Denver 2019Matt Raible
 
cbse 12 computer science IP
cbse 12 computer science IPcbse 12 computer science IP
cbse 12 computer science IPD. j Vicky
 
A Gentle Introduction to Angular Schematics - Devoxx Belgium 2019
A Gentle Introduction to Angular Schematics - Devoxx Belgium 2019A Gentle Introduction to Angular Schematics - Devoxx Belgium 2019
A Gentle Introduction to Angular Schematics - Devoxx Belgium 2019Matt Raible
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with sparkModern Data Stack France
 
Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Karthik Murugesan
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkFaisal Siddiqi
 
Adaptive Query Optimization
Adaptive Query OptimizationAdaptive Query Optimization
Adaptive Query OptimizationAnju Garg
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured StreamingKnoldus Inc.
 
Computer graphics practical(jainam)
Computer graphics practical(jainam)Computer graphics practical(jainam)
Computer graphics practical(jainam)JAINAM KAPADIYA
 
VSSML18. Introduction to WhizzML
VSSML18. Introduction to WhizzMLVSSML18. Introduction to WhizzML
VSSML18. Introduction to WhizzMLBigML, Inc
 
Deep Dive Into Swift
Deep Dive Into SwiftDeep Dive Into Swift
Deep Dive Into SwiftSarath C
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandFrançois Garillot
 
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit
 
How to grow GraphQL and remove SQLAlchemy and REST API from a high-load Pytho...
How to grow GraphQL and remove SQLAlchemy and REST API from a high-load Pytho...How to grow GraphQL and remove SQLAlchemy and REST API from a high-load Pytho...
How to grow GraphQL and remove SQLAlchemy and REST API from a high-load Pytho...Oleksandr Tarasenko
 

Similar a R, Scikit-Learn and Apache Spark ML - What difference does it make? (20)

TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
 
Semantic search in databases
Semantic search in databasesSemantic search in databases
Semantic search in databases
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
 
cbse 12 computer science investigatory project
cbse 12 computer science investigatory project  cbse 12 computer science investigatory project
cbse 12 computer science investigatory project
 
cbse 12 computer science investigatory project
cbse 12 computer science investigatory project  cbse 12 computer science investigatory project
cbse 12 computer science investigatory project
 
Use Angular Schematics to Simplify Your Life - Develop Denver 2019
Use Angular Schematics to Simplify Your Life - Develop Denver 2019Use Angular Schematics to Simplify Your Life - Develop Denver 2019
Use Angular Schematics to Simplify Your Life - Develop Denver 2019
 
cbse 12 computer science IP
cbse 12 computer science IPcbse 12 computer science IP
cbse 12 computer science IP
 
A Gentle Introduction to Angular Schematics - Devoxx Belgium 2019
A Gentle Introduction to Angular Schematics - Devoxx Belgium 2019A Gentle Introduction to Angular Schematics - Devoxx Belgium 2019
A Gentle Introduction to Angular Schematics - Devoxx Belgium 2019
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with spark
 
Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talk
 
Adaptive Query Optimization
Adaptive Query OptimizationAdaptive Query Optimization
Adaptive Query Optimization
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
 
Computer graphics practical(jainam)
Computer graphics practical(jainam)Computer graphics practical(jainam)
Computer graphics practical(jainam)
 
VSSML18. Introduction to WhizzML
VSSML18. Introduction to WhizzMLVSSML18. Introduction to WhizzML
VSSML18. Introduction to WhizzML
 
Deep Dive Into Swift
Deep Dive Into SwiftDeep Dive Into Swift
Deep Dive Into Swift
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
 
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
 
R console
R consoleR console
R console
 
How to grow GraphQL and remove SQLAlchemy and REST API from a high-load Pytho...
How to grow GraphQL and remove SQLAlchemy and REST API from a high-load Pytho...How to grow GraphQL and remove SQLAlchemy and REST API from a high-load Pytho...
How to grow GraphQL and remove SQLAlchemy and REST API from a high-load Pytho...
 

R, Scikit-Learn and Apache Spark ML - What difference does it make?

  • 1. R, Scikit-Learn and Apache Spark ML - What difference does it make? Villu Ruusmann Openscoring OÜ
  • 2. Overview ● Identifying long-standing, high-value opportunities in the applied predictive analytics domain ● Thinking about problems in API terms ● Providing solutions in API terms ● Developing and applying custom tools + A couple of tips if you're looking to buy or sell a VW Golf
  • 4. "More data beats better algorithms"
  • 5. The state of the art
  • 7. Elements of reproducibility Standardized, human- and machine-readable descriptions: ● Dataset ● Data pre- and post-processing steps: ○ From real-life input table (SQL, CSV) to model ○ From model to real-life output table ● Model ● Statistics
  • 8. Calling R from within Apache Spark 1. Create and initialize R runtime 2. Format and upload input RDD; upload and execute R model; download output and parse into result RDD 3. Destroy R runtime
  • 9. Calling Scikit-Learn from within Apache Spark 1. Format input RDD (e.g. using Java NIO) as numpy.array 2. Invoke Scikit-Learn via Python/C API 3. Parse output numpy.array into result RDD
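Slides 8 and 9 describe the same three-step round trip: format the input, invoke a foreign runtime, parse the output. The following is a minimal sketch of that shape for one partition of rows, not the integration the talk refers to. The names `score_partition` and `stand_in_model` are hypothetical, and the stand-in function merely takes the place of a real R or Scikit-Learn model.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Sequence

Row = Dict[str, float]


def score_partition(
    rows: Iterable[Row],
    feature_names: Sequence[str],
    predict: Callable[[List[List[float]]], List[float]],
) -> Iterator[Row]:
    """Score one partition of rows with an external model (hypothetical helper)."""
    buffered = list(rows)
    if not buffered:
        return
    # Step 1: format the input rows as a dense, row-major matrix
    matrix = [[float(row[name]) for name in feature_names] for row in buffered]
    # Step 2: invoke the model once for the whole batch
    predictions = predict(matrix)
    # Step 3: parse the output back into result rows
    for row, prediction in zip(buffered, predictions):
        yield {**row, "prediction": prediction}


def stand_in_model(matrix: List[List[float]]) -> List[float]:
    """Placeholder for the foreign runtime; it only sums the features."""
    return [sum(values) for values in matrix]


partition = [{"age": 3.0, "mileage": 40.0}, {"age": 5.0, "mileage": 90.0}]
scored = list(score_partition(partition, ["age", "mileage"], stand_in_model))
```

In PySpark a function of this shape could be handed to `RDD.mapPartitions`, so that the cost of step 2 is paid once per partition rather than once per row.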
  • 10. API prioritization Training << Maintenance ~ Deployment One-time activity << Repeated activities Short-term << Long-term
  • 11. JPMML - Java PMML API ● Conversion API ● Maintenance API ● Execution API ○ Interpreted mode ○ Translated + compiled ("Transpiled") mode ● Serving API ○ Integrations with popular Big Data frameworks ○ REST web service
  • 12. Calling JPMML-Spark from within Apache Spark
        org.jpmml.spark.TransformerBuilder pmmlTransformerBuilder = ..;
        org.apache.spark.ml.Transformer pmmlTransformer = pmmlTransformerBuilder.build();
        org.apache.spark.sql.Dataset<Row> input = ..;
        org.apache.spark.sql.Dataset<Row> result = pmmlTransformer.transform(input);
  • 13. The case study Predicting the price of VW Golf cars using GBT algorithms: ● 71 columns: ○ A continuous label: log(price) ○ Two string and four numeric categorical features ○ 64 binary-like (0/1) and numeric continuous features ● 270'458 rows: ○ 153'978 complete cases ○ 116'480 incomplete (i.e. with missing values) cases
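Because the label on slide 13 is log(price), a model prediction has to be mapped back to a price before it reaches the output table. A minimal sketch of that post-processing step, assuming a natural logarithm (the slide does not state the base) and an illustrative price that is not taken from the dataset:

```python
import math


def to_label(price: float) -> float:
    """Forward transform applied to the target before training."""
    return math.log(price)


def to_price(prediction: float) -> float:
    """Inverse transform applied to a model prediction."""
    return math.exp(prediction)


# Row counts quoted on the slide: complete plus incomplete cases
complete_cases = 153_978
incomplete_cases = 116_480
total_rows = complete_cases + incomplete_cases

example_price = 12_500.0  # illustrative value only
round_trip = to_price(to_label(example_price))
```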
  • 15. R training and conversion API
        #library("caret")
        library("gbm")
        library("r2pmml")
        cars = read.csv("cars.tsv", sep = "\t", na.strings = "N/A")
        factor_cols = c("category", "colour", "ac", "fuel_type", "gearbox", "interior_color", "interior_type")
        for(factor_col in factor_cols){
          cars[, factor_col] = as.factor(cars[, factor_col])
        }
        # Doesn't work with factors with missing values
        #cars.gbm = train(price ~ ., data = cars, method = "gbm", na.action = na.pass, ..)
        cars.gbm = gbm(price ~ ., data = cars, n.trees = 100, shrinkage = 0.1, interaction.depth = 6)
        r2pmml(cars.gbm, "gbm.pmml")
  • 16. Scikit-Learn training and conversion API
        import pandas
        from sklearn_pandas import DataFrameMapper
        from sklearn.model_selection import GridSearchCV
        from sklearn2pmml import sklearn2pmml, PMMLPipeline
        cars = pandas.read_csv("cars.tsv", sep = "\t", na_values = ["N/A", "NA"])
        mapper = DataFrameMapper(..)
        regressor = ..
        tuner = GridSearchCV(regressor, param_grid = .., fit_params = ..)
        tuner.fit(mapper.fit_transform(cars), cars["price"])
        pipeline = PMMLPipeline([
          ("mapper", mapper),
          ("regressor", tuner.best_estimator_)
        ])
        sklearn2pmml(pipeline, "pipeline.pmml", with_repr = True)
  • 17. Dataset
                           | R                 | LightGBM             | XGBoost                  | Scikit-Learn             | Apache Spark ML
        Abstraction        | data.frame        | lgb.Dataset          | xgb.DMatrix              | numpy.array              | RDD<Vector>
        Memory layout      | Contiguous, dense | Contiguous, dense(?) | Contiguous, dense/sparse | Contiguous, dense/sparse | Distributed, dense/sparse
        Data type          | Any               | double               | float                    | float or double          | double
        Categorical values | As-is (factor)    | Encoded              | Binarized                | Binarized                | Binarized
        Missing values     | Yes               | Pseudo (NaN)         | Pseudo (NaN)             | No                       | No
  • 18. LightGBM via Scikit-Learn
        from sklearn_pandas import DataFrameMapper
        from sklearn2pmml.preprocessing import PMMLLabelEncoder
        from lightgbm import LGBMRegressor
        mapper = DataFrameMapper(
          [(factor_column, PMMLLabelEncoder()) for factor_column in factor_columns] +
          [(continuous_columns, None)]
        )
        transformed_cars = mapper.fit_transform(cars)
        regressor = LGBMRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 6, num_leaves = 64)
        regressor.fit(transformed_cars, cars["price"], categorical_feature = list(range(0, len(factor_columns))))
  • 19. XGBoost via Scikit-Learn
        from sklearn_pandas import DataFrameMapper
        from sklearn2pmml.preprocessing import PMMLLabelBinarizer
        from xgboost.sklearn import XGBRegressor
        mapper = DataFrameMapper(
          [(factor_column, PMMLLabelBinarizer()) for factor_column in factor_columns] +
          [(continuous_columns, None)]
        )
        transformed_cars = mapper.fit_transform(cars)
        regressor = XGBRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 6)
        regressor.fit(transformed_cars, cars["price"])
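Slides 18 and 19 differ mainly in how a categorical column is presented to the learner: as one column of integer codes ("Encoded") or as one 0/1 column per category ("Binarized"). The plain-Python sketch below illustrates the two layouts only. It does not reproduce `PMMLLabelEncoder` or `PMMLLabelBinarizer`, and its handling of missing values (kept as `None`, or an all-zero row) is an assumption made for the illustration.

```python
from typing import List, Optional, Sequence, Tuple


def label_encode(values: Sequence[Optional[str]]) -> Tuple[List[str], List[Optional[int]]]:
    """Map each category to an integer code; one output column."""
    categories = sorted({value for value in values if value is not None})
    index = {category: code for code, category in enumerate(categories)}
    codes = [index[value] if value is not None else None for value in values]
    return categories, codes


def binarize(values: Sequence[Optional[str]]) -> Tuple[List[str], List[List[int]]]:
    """Map each category to its own 0/1 column; one output column per category."""
    categories = sorted({value for value in values if value is not None})
    rows = [[1 if value == category else 0 for category in categories] for value in values]
    return categories, rows


colours = ["Rot", "Blau", None, "Rot"]
encoded_categories, encoded = label_encode(colours)
binarized_categories, binarized = binarize(colours)
```

With the encoded layout the learner can still recover the original category and test it with "equals"; with the binarized layout it only ever sees numeric 0/1 columns, which is why the training table marks such splits as pseudo.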
  • 20. GBT algorithm (training)
                           | R              | LightGBM      | XGBoost           | Scikit-Learn              | Apache Spark ML
        Abstraction        | gbm            | LGBMRegressor | XGBRegressor      | GradientBoostingRegressor | GBTRegressor
        Parameterizability | Medium         | High          | High              | Medium                    | Medium
        Split type         | Multi-way      | Binary        | Binary            | Binary                    | Binary
        Categorical values | "set contains" | "equals"      | Pseudo ("equals") | Pseudo ("equals")         | "equals"
        Missing values     | First-class    | Pseudo        | Pseudo            | No                        | No
  • 21. gbm-style splits
        <Node id="9">
          <SimplePredicate field="interior_type" operator="isMissing"/>
          <Node id="12" score="3.0702062395803734E-4">
            <SimplePredicate field="colour" operator="isMissing"/>
          </Node>
          <Node id="10" score="-0.018950416258408962">
            <SimpleSetPredicate field="colour" booleanOperator="isIn">
              <Array type="string">Grün Rot Violett Weiß</Array>
            </SimpleSetPredicate>
          </Node>
          <Node id="11" score="-0.0017446280908351925">
            <SimpleSetPredicate field="colour" booleanOperator="isIn">
              <Array type="string">Beige Blau Braun Gelb Gold Grau Orange Schwarz Silber</Array>
            </SimpleSetPredicate>
          </Node>
        </Node>
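The three children of node 9 on slide 21 form a multi-way split on colour: one branch for missing values and two "set contains" branches. A minimal sketch of how those children could be evaluated in document order, assuming the first child whose predicate holds wins. The function name `score_colour` is hypothetical; the scores and colour sets are copied from the slide.

```python
from typing import Optional

FIRST_SET = {"Grün", "Rot", "Violett", "Weiß"}
SECOND_SET = {"Beige", "Blau", "Braun", "Gelb", "Gold", "Grau", "Orange", "Schwarz", "Silber"}


def score_colour(colour: Optional[str]) -> Optional[float]:
    """Evaluate the children of node 9 for a given colour value."""
    if colour is None:        # node 12: operator="isMissing"
        return 3.0702062395803734E-4
    if colour in FIRST_SET:   # node 10: booleanOperator="isIn"
        return -0.018950416258408962
    if colour in SECOND_SET:  # node 11: booleanOperator="isIn"
        return -0.0017446280908351925
    return None               # a colour not seen in training matches no child
```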
  • 22. LightGBM- and XGBoost-style splits (1/3)
        <Node id="39" defaultChild="76">
          <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
          <Node id="76" score="0.0030283758">
            <SimplePredicate field="colour" operator="notEqual" value="Orange"/>
          </Node>
          <Node id="77" score="0.02483887">
            <SimplePredicate field="colour" operator="equal" value="Orange"/>
          </Node>
        </Node>
  • 23. LightGBM- and XGBoost-style splits (2/3)
        <Node id="39">
          <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
          <!-- if(colour == null || !"Orange".equals(colour)) return 0.0030283758 -->
          <Node id="76" score="0.0030283758">
            <CompoundPredicate booleanOperator="or">
              <SimplePredicate field="colour" operator="isMissing"/>
              <SimplePredicate field="colour" operator="notEqual" value="Orange"/>
            </CompoundPredicate>
          </Node>
          <!-- else if("Orange".equals(colour)) return 0.02483887 -->
          <Node id="77" score="0.02483887">
            <SimplePredicate field="colour" operator="equal" value="Orange"/>
          </Node>
          <!-- else return null -->
        </Node>
  • 24. LightGBM- and XGBoost-style splits (3/3)
        <Node id="39">
          <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
          <!-- if(colour != null && "Orange".equals(colour)) return 0.02483887 -->
          <Node id="77" score="0.02483887">
            <CompoundPredicate booleanOperator="and">
              <SimplePredicate field="colour" operator="isNotMissing"/>
              <SimplePredicate field="colour" operator="equal" value="Orange"/>
            </CompoundPredicate>
          </Node>
          <!-- else return 0.0030283758 -->
          <Node id="76" score="0.0030283758">
            <True/>
          </Node>
        </Node>
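Slides 22 to 24 show the same binary split written three ways. The sketch below restates each variant as a plain function and checks that they agree for a missing colour, for "Orange", and for any other colour. It assumes that the `defaultChild` attribute on slide 22 routes missing values to node 76; the function names are hypothetical.

```python
from typing import Optional

SCORE_76 = 0.0030283758
SCORE_77 = 0.02483887


def split_default_child(colour: Optional[str]) -> float:
    """Slide 22: a missing value falls through to the default child (node 76)."""
    if colour is None:
        return SCORE_76
    return SCORE_76 if colour != "Orange" else SCORE_77


def split_explicit_missing(colour: Optional[str]) -> Optional[float]:
    """Slide 23: the missing-value check is folded into node 76's predicate."""
    if colour is None or colour != "Orange":
        return SCORE_76
    elif colour == "Orange":
        return SCORE_77
    return None


def split_else_branch(colour: Optional[str]) -> float:
    """Slide 24: node 77 is tested first, node 76 becomes the catch-all branch."""
    if colour is not None and colour == "Orange":
        return SCORE_77
    return SCORE_76


results = {
    colour: (
        split_default_child(colour),
        split_explicit_missing(colour),
        split_else_branch(colour),
    )
    for colour in (None, "Orange", "Blau")
}
```

The third form needs no fall-through case, because the `<True/>` predicate always matches.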
  • 25. Model measurement using JPMML
        org.dmg.pmml.tree.TreeModel treeModel = ..;
        treeModel.accept(new org.jpmml.model.visitors.AbstractVisitor(){
          private int count = 0; // Number of Node elements
          private int maxDepth = 0; // Max "nesting depth" of Node elements
          @Override
          public VisitorAction visit(org.dmg.pmml.tree.Node node){
            this.count++;
            int depth = 0;
            for(org.dmg.pmml.PMMLObject parent : getParents()){
              if(!(parent instanceof org.dmg.pmml.tree.Node)) break;
              depth++;
            }
            this.maxDepth = Math.max(this.maxDepth, depth);
            return super.visit(node);
          }
        });
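The same two measurements, the number of Node elements and their maximum nesting depth, can be sketched outside JPMML with a recursive walk over the XML. This is an illustration using Python's standard library, not the JPMML visitor API. It runs on a small fragment without an XML namespace, whereas a complete PMML document declares one and would need namespace-aware tag names.

```python
import xml.etree.ElementTree as ET
from typing import Tuple

# Small namespace-free fragment in the style of the split examples
PMML_FRAGMENT = """
<Node id="39">
  <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
  <Node id="77" score="0.02483887">
    <SimplePredicate field="colour" operator="equal" value="Orange"/>
  </Node>
  <Node id="76" score="0.0030283758">
    <True/>
  </Node>
</Node>
"""


def measure(node: ET.Element, depth: int = 0) -> Tuple[int, int]:
    """Return (node count, max depth); the root Node sits at depth 0."""
    count, max_depth = 1, depth
    for child in node.findall("Node"):  # direct Node children only
        child_count, child_depth = measure(child, depth + 1)
        count += child_count
        max_depth = max(max_depth, child_depth)
    return count, max_depth


root = ET.fromstring(PMML_FRAGMENT)
node_count, max_depth = measure(root)
```

As in the Java visitor, depth is counted as the number of enclosing Node elements, so a tree consisting of a root and its leaves has a maximum depth of 1.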
  • 29. GBT algorithm (interpretation)
                            | R            | LightGBM           | XGBoost                    | Scikit-Learn    | Apache Spark ML
        Feature importances | Direct       | Direct             | Transformed                | Transformed     | Transformed
        Decision path       | No           | No(?)              | No(?)                      | Transformed     | Transformed
        Model persistence   | RDS (binary) | Proprietary (text) | Proprietary (binary, text) | Pickle (binary) | SER (binary) or JSON (text)
        Model reusability   | Good         | Fair(?)            | Good                       | Fair            | Fair
        Java API            | No           | No                 | Pseudo                     | No              | Yes
  • 30. LightGBM feature importances Age: 936; Mileage: 887; Performance: 738; [Category]: 205; New?: 179; [Type of fuel]: 170; [Type of interior]: 167; Airbags?: 130; [Colour]: 129; [Type of gearbox]: 105
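The raw numbers on slide 30 are easier to compare as shares. The sketch below normalises them over the ten features listed on the slide only. The model has more features than that, and the slide does not say what the numbers count, so these percentages describe the listed values relative to each other and nothing more.

```python
# Values copied from slide 30; bracketed names mark categorical features
importances = {
    "Age": 936,
    "Mileage": 887,
    "Performance": 738,
    "[Category]": 205,
    "New?": 179,
    "[Type of fuel]": 170,
    "[Type of interior]": 167,
    "Airbags?": 130,
    "[Colour]": 129,
    "[Type of gearbox]": 105,
}

total = sum(importances.values())

# Percentage share of each feature among the ten listed ones
shares = {name: round(100.0 * value / total, 1) for name, value in importances.items()}

# Combined share of the three leading continuous features
top_three = importances["Age"] + importances["Mileage"] + importances["Performance"]
```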
  • 31. Model execution using JPMML
        org.dmg.pmml.PMML pmml;
        try(InputStream is = ..){
          pmml = org.jpmml.model.PMMLUtil.unmarshal(is);
        }
        org.jpmml.evaluator.Evaluator evaluator = new org.jpmml.evaluator.mining.MiningModelEvaluator(pmml);
        org.jpmml.evaluator.InputField inputField = selectField(evaluator.getInputFields(), ..);
        org.jpmml.evaluator.TargetField targetField = selectField(evaluator.getTargetFields(), ..);
        for(int value = min; value <= max; value += increment){
          Map<FieldName, FieldValue> arguments = Collections.singletonMap(inputField.getName(), inputField.prepare(value));
          Map<FieldName, ?> result = evaluator.evaluate(arguments);
          System.out.println(result.get(targetField.getName()));
        }
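The loop on slide 31 sweeps a single input field across a range while every other field is left unset, and records the prediction at each step. A minimal sketch of that pattern follows, with a toy function in place of the JPMML evaluator. The names `sweep` and `toy_evaluate`, the field name "age", and the linear formula are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def sweep(
    evaluate: Callable[[Dict[str, int]], int],
    field: str,
    minimum: int,
    maximum: int,
    increment: int,
) -> List[Tuple[int, int]]:
    """Evaluate the model once per value of `field`, from minimum to maximum inclusive."""
    results = []
    value = minimum
    while value <= maximum:
        arguments = {field: value}  # all other input fields stay missing
        results.append((value, evaluate(arguments)))
        value += increment
    return results


def toy_evaluate(arguments: Dict[str, int]) -> int:
    """Placeholder model: the score drops by 2 per unit of the swept field."""
    return 100 - 2 * arguments["age"]


curve = sweep(toy_evaluate, "age", 0, 4, 2)
```

A sweep like this only works if the model tolerates missing inputs for the fields that are left out.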
  • 34. Lessons (to be) learned ● Limits and limitations of individual APIs ● Vertical integration vs. horizontal integration: ○ All capabilities on a single platform ○ Specialized capabilities on specialized platforms ● Ease-of-use and robustness beat raw performance in most application scenarios ● "Convention over configuration"