SlideShare una empresa de Scribd logo
1 de 19
Descargar para leer sin conexión
State of the (J)PMML art
Villu Ruusmann
Openscoring OÜ
Overview
Data Science paradigms
Structured ("ML") Unstructured ("AI")
Data Scalars Arrays, Sequences
Modeling Statistics/CPU,
Human-driven
Brute force/GPU,
"Deep learning"
Audience Global, from SMEs to
mega-corps
SV tech companies
(proprietary datasets)
Standards PMML ONNX, TensorFlow
Structured information
● Relational databases
○ Schemaful
○ High-level, conceptual features
● Human interpretable algorithms
○ Decision tree
○ Logistic regression
○ Scorecard
○ Business rules
● Machine interpretable algorithms
Training perspective
Structured training workflow
1. Feature engineering
1.1. Transformation
1.2. Filtering
2. Modeling
2.1. Choice of algorithm
2.2. Hyperparameter tuning
3. Decision engineering
3.1. Transformation
3.2. Packaging (following consumer API conventions)
OSS ML frameworks
R Scikit-Learn Apache Spark
Scale Desktop Desktop, Server Server, Cluster
Data model Rich Poor Mixed
Main concept Model formula Pipeline Pipeline
Feature eng. Average Very good Good
Modeling Very good Good Average
Decision eng. N/A N/A Very good
OSS ML algorithm providers
H2O.ai XGBoost LightGBM
Scale Desktop, Server, Cluster
Data model Mixed
Algorithms Decision tree
ensembles,
GLM,
Naive Bayes,
Neural Networks,
Stacking
Decision tree
ensembles
Decision tree
ensembles
Workflow implementation
● R/Scikit-Learn/Apache Spark everything
● R/Scikit-Learn feature eng. + H2O.ai/XGBoost/LightGBM
modeling
● Apache Spark feature eng. + H2O.ai/XGBoost modeling +
Apache Spark decision eng.
Mixed workflows bring better results, but are considearbly
harder to deploy.
Deployment perspective
The challenge
Productionalization happens on Java, C# or SQL, which have
very limited interoperability with R and Python:
● Desktop vs. Server/Cluster
● Language-level threading, memory management issues
● Zero overlap between library sets
ML training and deployment are completely separate
concerns and shouldn't be (force-)blended into one.
Possible solutions
● Translation from R/Python application code to
Java/C#/SQL application code
● Translation from R/Python application code to some
standardized intermediate representation:
○ Higher abstraction level
○ Reorganized, refactored and simplified
○ Stable in time
● Containerization
Deployment options on Java/JVM
● R: N/A
● Scikit-Learn: nok/sklearn-porter
● Apache Spark: Java/Scala APIs, which are tightly coupled
to the Apache Spark runtime
● H2O.ai: POJO and MOJO mechanisms
● XGBoost: JNI wrapper,
komiya-atsushi/xgboost-predictor-java
● LightGBM: JNI wrapper(?)
The JPMML ecosystem
● "The API is the product"
○ Conversion APIs
○ Analysis and interpretation APIs
○ Scoring APIs
● Layered, role-based design
○ Data scientists, business analysts
○ Data engineers
○ Java/Scala developers
● Automated annotation, quality control
Converting R to PMML
library("r2pmml")
audit = read.csv("Audit.csv")
audit$Adjusted = as.factor(audit$Adjusted)
audit.glm = glm(Adjusted ~ . - Age + cut(Age, breaks = c(0, 18, 65, 100))
+ Gender:Education + I(Income / (Hours * 52)), data = audit, family =
"binomial")
audit.glm = r2pmml::verify(audit.glm, audit[sample(nrow(audit), 100), ])
r2pmml::r2pmml(audit.glm, "GLMAudit.pmml")
Converting Scikit-Learn to PMML
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline
pipeline = PMMLPipeline([
("mapper", DataFrameMapper([...])),
("classifier", DecisionTreeClassifier())
])
pipeline.fit(audit_X, audit_y)
pipeline.configure(compact = True, flat = True)
pipeline.verify(audit_X.sample(100))
sklearn2pmml(pipeline, "DecisionTreeAudit.pmml")
Scoring PMML
The JPMML-Evaluator®
library:
● Small (<1 MB), nearly self-contained (F)OSS Java library
● Developer API consists a single Java interface and a
couple of utility classes (model loading, optimization)
● Supports all PMML 3.X and 4.X schema versions
● Vendor extensions:
○ Accessing 3rd party Java libraries in PMML data flow
○ MathML prediction reports
Scoring application scenarios
● Synchronous vs. asynchronous
● Low-latency vs. high-latency
● Individual data records vs. batches of data records
● Embedded vs. over-the-network
Q&A
villu@openscoring.io
https://github.com/jpmml
https://github.com/openscoring
https://groups.google.com/forum/#!forum/jpmml

Más contenido relacionado

Similar a State of the (J)PMML art

Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Ahsan Javed Awan
 
Rust & Apache Arrow @ RMS
Rust & Apache Arrow @ RMSRust & Apache Arrow @ RMS
Rust & Apache Arrow @ RMSAndy Grove
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Ganesh Raju
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterLinaro
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterLinaro
 
BigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for SparkBigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for SparkDESMOND YUEN
 
Joker'14 Java as a fundamental working tool of the Data Scientist
Joker'14 Java as a fundamental working tool of the Data ScientistJoker'14 Java as a fundamental working tool of the Data Scientist
Joker'14 Java as a fundamental working tool of the Data ScientistAlexey Zinoviev
 
Ai platform at scale
Ai platform at scaleAi platform at scale
Ai platform at scaleHenry Saputra
 
The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?Ivo Andreev
 
Near Data Computing Architectures: Opportunities and Challenges for Apache Spark
Near Data Computing Architectures: Opportunities and Challenges for Apache SparkNear Data Computing Architectures: Opportunities and Challenges for Apache Spark
Near Data Computing Architectures: Opportunities and Challenges for Apache SparkAhsan Javed Awan
 
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...Spark Summit
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Herman Wu
 
Mirabilis_Design AMD Versal System-Level IP Library
Mirabilis_Design AMD Versal System-Level IP LibraryMirabilis_Design AMD Versal System-Level IP Library
Mirabilis_Design AMD Versal System-Level IP LibraryDeepak Shankar
 
MLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkMLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkPetr Zapletal
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptSanket Shikhar
 
ISC Frankfurt 2015: Good, bad and ugly of accelerators and a complementary path
ISC Frankfurt 2015: Good, bad and ugly of accelerators and a complementary pathISC Frankfurt 2015: Good, bad and ugly of accelerators and a complementary path
ISC Frankfurt 2015: Good, bad and ugly of accelerators and a complementary pathJohn Holden
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKzmhassan
 
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedInGrokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedInGrokking VN
 
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Revolution Analytics
 

Similar a State of the (J)PMML art (20)

Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...
 
Rust & Apache Arrow @ RMS
Rust & Apache Arrow @ RMSRust & Apache Arrow @ RMS
Rust & Apache Arrow @ RMS
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
 
BigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for SparkBigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for Spark
 
Joker'14 Java as a fundamental working tool of the Data Scientist
Joker'14 Java as a fundamental working tool of the Data ScientistJoker'14 Java as a fundamental working tool of the Data Scientist
Joker'14 Java as a fundamental working tool of the Data Scientist
 
Ai platform at scale
Ai platform at scaleAi platform at scale
Ai platform at scale
 
The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?
 
Near Data Computing Architectures: Opportunities and Challenges for Apache Spark
Near Data Computing Architectures: Opportunities and Challenges for Apache SparkNear Data Computing Architectures: Opportunities and Challenges for Apache Spark
Near Data Computing Architectures: Opportunities and Challenges for Apache Spark
 
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
Mirabilis_Design AMD Versal System-Level IP Library
Mirabilis_Design AMD Versal System-Level IP LibraryMirabilis_Design AMD Versal System-Level IP Library
Mirabilis_Design AMD Versal System-Level IP Library
 
MLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkMLlib and Machine Learning on Spark
MLlib and Machine Learning on Spark
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 
ISC Frankfurt 2015: Good, bad and ugly of accelerators and a complementary path
ISC Frankfurt 2015: Good, bad and ugly of accelerators and a complementary pathISC Frankfurt 2015: Good, bad and ugly of accelerators and a complementary path
ISC Frankfurt 2015: Good, bad and ugly of accelerators and a complementary path
 
System mldl meetup
System mldl meetupSystem mldl meetup
System mldl meetup
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
 
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedInGrokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
 
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
 

Último

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 

Último (20)

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 

State of the (J)PMML art

  • 1. State of the (J)PMML art Villu Ruusmann Openscoring OÜ
  • 3. Data Science paradigms Structured ("ML") Unstructured ("AI") Data Scalars Arrays, Sequences Modeling Statistics/CPU, Human-driven Brute force/GPU, "Deep learning" Audience Global, from SMEs to mega-corps SV tech companies (proprietary datasets) Standards PMML ONNX, TensorFlow
  • 4. Structured information ● Relational databases ○ Schemaful ○ High-level, conceptual features ● Human interpretable algorithms ○ Decision tree ○ Logistic regression ○ Scorecard ○ Business rules ● Machine interpretable algorithms
  • 6. Structured training workflow 1. Feature engineering 1.1. Transformation 1.2. Filtering 2. Modeling 2.1. Choice of algorithm 2.2. Hyperparameter tuning 3. Decision engineering 3.1. Transformation 3.2. Packaging (following consumer API conventions)
  • 7. OSS ML frameworks R Scikit-Learn Apache Spark Scale Desktop Desktop, Server Server, Cluster Data model Rich Poor Mixed Main concept Model formula Pipeline Pipeline Feature eng. Average Very good Good Modeling Very good Good Average Decision eng. N/A N/A Very good
  • 8. OSS ML algorithm providers H2O.ai XGBoost LightGBM Scale Desktop, Server, Cluster Data model Mixed Algorithms Decision tree ensembles, GLM, Naive Bayes, Neural Networks, Stacking Decision tree ensembles Decision tree ensembles
  • 9. Workflow implementation ● R/Scikit-Learn/Apache Spark everything ● R/Scikit-Learn feature eng. + H2O.ai/XGBoost/LightGBM modeling ● Apache Spark feature eng. + H2O.ai/XGBoost modeling + Apache Spark decision eng. Mixed workflows bring better results, but are considearbly harder to deploy.
  • 11. The challenge Productionalization happens on Java, C# or SQL, which have very limited interoperability with R and Python: ● Desktop vs. Server/Cluster ● Language-level threading, memory management issues ● Zero overlap between library sets ML training and deployment are completely separate concerns and shouldn't be (force-)blended into one.
  • 12. Possible solutions ● Translation from R/Python application code to Java/C#/SQL application code ● Translation from R/Python application code to some standardized intermediate representation: ○ Higher abstraction level ○ Reorganized, refactored and simplified ○ Stable in time ● Containerization
  • 13. Deployment options on Java/JVM ● R: N/A ● Scikit-Learn: nok/sklearn-porter ● Apache Spark: Java/Scala APIs, which are tightly coupled to the Apache Spark runtime ● H2O.ai: POJO and MOJO mechanisms ● XGBoost: JNI wrapper, komiya-atsushi/xgboost-predictor-java ● LightGBM: JNI wrapper(?)
  • 14. The JPMML ecosystem ● "The API is the product" ○ Conversion APIs ○ Analysis and interpretation APIs ○ Scoring APIs ● Layered, role-based design ○ Data scientists, business analysts ○ Data engineers ○ Java/Scala developers ● Automated annotation, quality control
  • 15. Converting R to PMML library("r2pmml") audit = read.csv("Audit.csv") audit$Adjusted = as.factor(audit$Adjusted) audit.glm = glm(Adjusted ~ . - Age + cut(Age, breaks = c(0, 18, 65, 100)) + Gender:Education + I(Income / (Hours * 52)), data = audit, family = "binomial") audit.glm = r2pmml::verify(audit.glm, audit[sample(nrow(audit), 100), ]) r2pmml::r2pmml(audit.glm, "GLMAudit.pmml")
  • 16. Converting Scikit-Learn to PMML from sklearn2pmml import sklearn2pmml from sklearn2pmml.pipeline import PMMLPipeline pipeline = PMMLPipeline([ ("mapper", DataFrameMapper([...])), ("classifier", DecisionTreeClassifier()) ]) pipeline.fit(audit_X, audit_y) pipeline.configure(compact = True, flat = True) pipeline.verify(audit_X.sample(100)) sklearn2pmml(pipeline, "DecisionTreeAudit.pmml")
  • 17. Scoring PMML The JPMML-Evaluator® library: ● Small (<1 MB), nearly self-contained (F)OSS Java library ● Developer API consists a single Java interface and a couple of utility classes (model loading, optimization) ● Supports all PMML 3.X and 4.X schema versions ● Vendor extensions: ○ Accessing 3rd party Java libraries in PMML data flow ○ MathML prediction reports
  • 18. Scoring application scenarios ● Synchronous vs. asynchronous ● Low-latency vs. high-latency ● Individual data records vs. batches of data records ● Embedded vs. over-the-network