Converting R to PMML
Villu Ruusmann
Openscoring OÜ
"Train once, deploy anywhere"
R challenge
"Fat code around thin model objects":
● Many packages to solve the same problem, many ways of
doing and/or representing the same thing
● The unit of reusability is R script
● Model object is tightly coupled to the R script that
produced it. Incomplete/split schemas
● No (widely adopted) pipeline formalization
(J)PMML solution
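// Obtain an evaluator for the PMML document (construction elided on the slide)
Evaluator evaluator = ...;
// Data entry interface: the fields that the model expects as input
List<InputField> argumentFields = evaluator.getInputFields();
// Data exit interface: target fields followed by output fields
// (Lists.union, readRecord and writeRecord stand for application-specific helpers)
List<ResultField> resultFields =
Lists.union(evaluator.getTargetFields(), evaluator.getOutputFields());
Map<FieldName, ?> arguments = readRecord(argumentFields);
Map<FieldName, ?> result = evaluator.evaluate(arguments);
writeRecord(result, resultFields);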
● The R script (data pre- and post-processing, model) is
represented using standardized PMML data structures
● Well-defined data entry and exit interfaces
Workflow
R challenge
Training:
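audit_train = read.csv("Audit.csv")
# Derived column, computed by the script outside the model object
audit_train$HourlyIncome = audit_train$Income / (audit_train$Hours * 52)
# `model` is a placeholder for any model training function
audit.model = model(Adjusted ~ ., data = audit_train)
saveRDS(audit.model, "Audit.rds")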
Deployment:
audit_test = read.csv("Audit.csv")
audit.model = readRDS("Audit.rds")
# "Error in eval(expr, envir, enclos) : object 'HourlyIncome' not found"
predict(audit.model, newdata = audit_test)
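The failure above can be reproduced in a few lines. This is a minimal sketch, assuming a small made-up data frame in place of Audit.csv and `lm` standing in for the `model` placeholder; neither choice comes from the slides.

```r
# Made-up stand-in for the Audit dataset (assumption, not the real data)
audit_train <- data.frame(
  Income   = c(20000, 35000, 52000, 61000, 18000, 90000, 40000, 75000),
  Hours    = c(40, 35, 45, 50, 20, 60, 38, 42),
  Adjusted = c(0, 0, 1, 1, 0, 1, 0, 1)
)

# The derived column is created by the training script, not by the model
audit_train$HourlyIncome <- audit_train$Income / (audit_train$Hours * 52)

# `lm` stands in for the `model` placeholder used on the slide
audit.model <- lm(Adjusted ~ ., data = audit_train)

# At deployment time only the raw columns are available
audit_test <- data.frame(Income = 48000, Hours = 40)

# Capture the error message instead of stopping the script
outcome <- tryCatch(
  predict(audit.model, newdata = audit_test),
  error = function(e) conditionMessage(e)
)
print(outcome)
```

The captured message names the missing `HourlyIncome` column: the model object remembers that it needs the column, but not how to compute it.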
R solution
Matrix interface:
label = audit$Adjusted
features = audit[, !(colnames(audit) %in% c("Adjusted"))]
# Returns a `model` object
audit.model = model(x = features, y = label)
Formula interface:
# Returns a `model.formula` object
audit.model = model(Adjusted ~ . + I(Income / (Hours * 52)), data = audit)
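A minimal sketch of why the formula interface helps, assuming a small made-up data frame in place of Audit.csv and `lm` standing in for the `model` placeholder. Because the derived feature lives inside the formula, the fitted object can recompute it from raw columns.

```r
# Made-up stand-in for the Audit dataset (assumption, not the real data)
audit <- data.frame(
  Income   = c(20000, 35000, 52000, 61000, 18000, 90000, 40000, 75000),
  Hours    = c(40, 35, 45, 50, 20, 60, 38, 42),
  Adjusted = c(0, 0, 1, 1, 0, 1, 0, 1)
)

# The feature engineering step is part of the formula itself
audit.model <- lm(Adjusted ~ . + I(Income / (Hours * 52)), data = audit)

# New data carries raw columns only; the I(...) term is evaluated by predict()
new_data <- data.frame(Income = 48000, Hours = 40)
prediction <- predict(audit.model, newdata = new_data)
print(prediction)
```

No separate pre-processing script is needed at prediction time, which is the property a converter can carry over into a PMML document.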
From `pmml` to `r2pmml`
library("pmml")
library("r2pmml")
audit = read.csv("Audit.csv")
audit$Adjusted = as.factor(audit$Adjusted)
audit.glm = glm(Adjusted ~ . + I(Income / (Hours * 52)),
data = audit, family = "binomial")
saveXML(pmml(audit.glm), "Audit.pmml")
r2pmml(audit.glm, "Audit.pmml")
Package `pmml`
<PMML>
<DataDictionary>
<DataField name="I(Income/(Hours * 52))" optype="continuous" dataType="double"/>
</DataDictionary>
<GeneralRegressionModel>
<MiningSchema>
<MiningField name="I(Income/(Hours * 52))"/>
</MiningSchema>
</GeneralRegressionModel>
</PMML>
Package `r2pmml`
<PMML>
<DataDictionary>
<DataField name="Income" optype="continuous" dataType="double"/>
<DataField name="Hours" optype="continuous" dataType="double"/>
</DataDictionary>
<TransformationDictionary>
<DerivedField name="Income/(Hours * 52)" optype="continuous" dataType="double">
<Apply function="/">
<FieldRef field="Income"/>
<Apply function="*"><FieldRef field="Hours"/><Constant>52</Constant></Apply>
</Apply>
</DerivedField>
</TransformationDictionary>
<GeneralRegressionModel>
<MiningSchema>
<MiningField name="Income"/>
<MiningField name="Hours"/>
</MiningSchema>
</GeneralRegressionModel>
</PMML>
Manual conversion (1/2)
[Diagram: r2pmml::r2pmml covers the whole workflow. The R side: r2pmml::decorate, then base::saveRDS, producing an RDS file. The Java side: the JPMML-R library reads the RDS file.]
Manual conversion (2/2)
The R side:
audit.model = model(Adjusted ~ ., data = audit)
# Decorate the model object with supporting information
audit.model = r2pmml::decorate(audit.model, dataset = audit)
saveRDS(audit.model, "Audit.rds")
The Java side:
$ git clone https://github.com/jpmml/jpmml-r.git
$ cd jpmml-r; mvn clean package
$ java -jar target/converter-executable-1.2-SNAPSHOT.jar --rds-input /path/to/Audit.rds --pmml-output /path/to/Audit.pmml
In-formula feature engineering
Sanitization (1/2)
?base::ifelse
# Static replacement value
as.formula(Adjusted ~ ifelse(is.na(Hours), 0, Hours))
# Dynamic replacement value
as.formula(paste("Adjusted ~ ifelse((is.na(Hours) | Hours < 0 | Hours > 168), ",
mean(audit$Hours, na.rm = TRUE), ", Hours)", sep = ""))
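The effect of both sanitization styles can be checked on a plain vector. This is a sketch with made-up values; taking the replacement value from the valid observations only is an assumption of this example, not something the slide prescribes.

```r
# Made-up weekly hours with a missing value and two out-of-range values
hours <- c(40, NA, -5, 200, 38)

# Static replacement value: missing values become 0
static <- ifelse(is.na(hours), 0, hours)

# Dynamic replacement value: mean of the valid observations (assumption)
valid <- !is.na(hours) & hours >= 0 & hours <= 168
replacement <- mean(hours[valid])
dynamic <- ifelse(is.na(hours) | hours < 0 | hours > 168, replacement, hours)

# The value is frozen into the formula text at training time
f <- as.formula(paste0(
  "Adjusted ~ ifelse(is.na(Hours) | Hours < 0 | Hours > 168, ",
  replacement, ", Hours)"
))

print(static)
print(dynamic)
print(f)
```

Pasting the number into the formula text makes it a constant of the model, so it no longer depends on the training data being available.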
Sanitization (2/2)
<DerivedField name="ifelse(is.na(Hours))" optype="continuous" dataType="double">
<Extension>ifelse(is.na(Hours), 0, Hours)</Extension>
<Apply function="if">
<Apply function="isMissing">
<FieldRef field="Hours"/>
</Apply>
<Constant>0</Constant>
<FieldRef field="Hours"/>
</Apply>
</DerivedField>
Binarization (1/2)
?base::ifelse
as.formula(Adjusted ~ ifelse(Hours >= 40, "full-time", "part-time"))
Binarization (2/2)
<DerivedField name="ifelse(Hours &gt;= 40)" optype="categorical" dataType="string">
<Extension>ifelse(Hours &gt;= 40, "full-time", "part-time")</Extension>
<Apply function="if">
<Apply function="greaterOrEqual">
<FieldRef field="Hours"/>
<Constant>40</Constant>
</Apply>
<Constant>full-time</Constant>
<Constant>part-time</Constant>
</Apply>
</DerivedField>
Bucketization (1/2)
?base::cut
# Static intervals
as.formula(Adjusted ~ cut(Income, breaks = 3))
# Dynamic intervals
as.formula(Adjusted ~ cut(log10(Income), breaks = quantile(log10(Income))))
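A short sketch of how `cut` behaves, using made-up values. The explicit break points and the `include.lowest` variant are illustrative assumptions; the slide itself uses `breaks = 3` and plain `quantile` breaks.

```r
# Explicit break points give left-open, right-closed intervals
age <- c(10, 21, 30, 65, 70)
bins <- cut(age, breaks = c(0, 21, 65, 100))
print(levels(bins))

# Quantile-based break points: the minimum equals the lowest break,
# so it falls outside the first left-open interval and becomes NA
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
q_bins <- cut(x, breaks = quantile(x))
print(sum(is.na(q_bins)))

# include.lowest = TRUE closes the first interval on the left
q_bins_closed <- cut(x, breaks = quantile(x), include.lowest = TRUE)
print(sum(is.na(q_bins_closed)))
```

With quantile breaks it may be worth checking whether the smallest training value ends up as a missing bucket, since the same interval definitions are carried over to the deployed model.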
Bucketization (2/2)
<DerivedField name="cut(Income)" optype="categorical" dataType="string">
<Extension>cut(Income, breaks = 3)</Extension>
<Discretize field="Income">
<DiscretizeBin binValue="(129,1.61e+05]">
<Interval closure="openClosed" leftMargin="129.0" rightMargin="161000.0"/>
</DiscretizeBin>
<DiscretizeBin binValue="(1.61e+05,3.21e+05]">
<Interval closure="openClosed" leftMargin="161000.0" rightMargin="321000.0"/>
</DiscretizeBin>
<DiscretizeBin binValue="(3.21e+05,4.82e+05]">
<Interval closure="openClosed" leftMargin="321000.0" rightMargin="482000.0"/>
</DiscretizeBin>
</Discretize>
</DerivedField>
Mathematical expressions (1/2)
?base::I
as.formula(Adjusted ~ I(floor(Income / (Hours * 52))))
Supported constructs:
● Arithmetic operators `+`, `-`, `*`, `/`, `^` and `**`
● Relational operators `==`, `!=`, `<`, `<=`, `>=` and `>`
● Logical operators `!`, `&` and `|`
● Functions `abs`, `ceiling`, `exp`, `floor`, `is.na`, `log`, `log10`, `round`
and `sqrt`
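A minimal sketch of what `I()` protects against, using a made-up data frame. Inside `I()` the operators keep their arithmetic meaning; outside it, `*` is formula syntax for main effects plus an interaction.

```r
# Made-up stand-in for two Audit columns (assumption, not the real data)
audit <- data.frame(Income = c(20000, 52000, 90000), Hours = c(40, 45, 60))

# Arithmetic meaning: one derived column holding the floored hourly income
mm_math <- model.matrix(~ I(floor(Income / (Hours * 52))), data = audit)
hourly <- unname(mm_math[, 2])
print(hourly)

# Formula meaning: `*` expands to Income + Hours + Income:Hours
mm_plain <- model.matrix(~ Income * Hours, data = audit)
print(colnames(mm_plain))
```

The first design matrix has a single feature column next to the intercept, the second has three, which is why mathematical expressions need the `I()` wrapper.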
Mathematical expressions (2/2)
<DerivedField name="floor(Income/(Hours * 52))" optype="continuous" dataType="double">
<Extension>I(floor(Income/(Hours * 52)))</Extension>
<Apply function="floor">
<Apply function="/">
<FieldRef field="Income"/>
<Apply function="*">
<FieldRef field="Hours"/>
<Constant>52</Constant>
</Apply>
</Apply>
</Apply>
</DerivedField>
Renaming and regrouping categories (1/2)
library("plyr")
?plyr::revalue
?plyr::mapvalues
as.formula(Adjusted ~ plyr::revalue(Employment, c("PSFederal" = "Public",
"PSState" = "Public", "PSLocal" = "Public")))
as.formula(Adjusted ~ plyr::revalue(Education, c("Yr1t4" = "Yr1t6",
"Yr5t6" = "Yr1t6", "Yr7t8" = "Yr7t9", "Yr9" = "Yr7t9", "Yr10" = "Yr10t12",
"Yr11" = "Yr10t12", "Yr12" = "Yr10t12")))
as.formula(Adjusted ~ plyr::mapvalues(Gender, c("Male", "Female"), c(0, 1)))
Renaming and regrouping categories (2/2)
<DerivedField name="revalue(Employment)" optype="categorical" dataType="string">
<Extension>plyr::revalue(Employment, c(PSFederal = "Public", PSState = "Public",
PSLocal = "Public"))</Extension>
<MapValues outputColumn="to">
<FieldColumnPair field="Employment" column="from"/>
<InlineTable>
<row><from>PSFederal</from><to>Public</to></row>
<row><from>PSState</from><to>Public</to></row>
<row><from>PSLocal</from><to>Public</to></row>
<row><from>Consultant</from><to>Consultant</to></row>
<row><from>Private</from><to>Private</to></row>
<row><from>SelfEmp</from><to>SelfEmp</to></row>
<row><from>Volunteer</from><to>Volunteer</to></row>
</InlineTable>
</MapValues>
</DerivedField>
Interactions
as.formula(Adjusted ~
# Continuous vs. continuous
Age:Income +
# Continuous vs. transformed continuous
Age:I(log10(Income)) +
# Continuous vs. categorical
Gender:Income +
# Categorical vs. categorical
(Gender + Education) ^ 2
)
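The terms that R derives from this formula can be inspected without any data. A small sketch using only base R; it shows that `(Gender + Education) ^ 2` contributes both main effects in addition to their interaction.

```r
# Same formula as above, written on one logical line
f <- Adjusted ~ Age:Income + Age:I(log10(Income)) + Gender:Income +
  (Gender + Education) ^ 2

# Expand the formula into individual terms
tt <- terms(f)
term_labels <- attr(tt, "term.labels")
term_order <- attr(tt, "order")

# First-order terms are main effects, second-order terms are interactions
print(term_labels[term_order == 1])
print(term_labels[term_order == 2])
```

If only the interaction is wanted, `Gender:Education` can be written directly, in the same way as the other three terms.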
Estimator fitting
Alternative encodings (1/2)
# Continuous label ~ Continuous and categorical features
audit.glm = glm(Income ~ Age + Employment + Education + Marital +
Occupation + Gender + Hours, data = audit, family = "gaussian")
# Encoded as GeneralRegressionModel element
r2pmml(audit.glm, "Audit.pmml")
# Encoded as RegressionModel element
r2pmml(audit.glm, "Audit.pmml", converter = "org.jpmml.rexp.LMConverter")
Alternative encodings (2/2)
# Binary label ~ Categorical features
audit.glm = glm(Adjusted ~ cut(Age, breaks = c(0, 21, 65, 100)) +
Employment + Education + Marital + Occupation + cut(Income, breaks =
quantile(Income)) + Gender + ifelse(Hours >= 40, "full-time",
"part-time"), data = audit, family = "binomial")
audit.scorecard = r2pmml::as.scorecard(audit.glm)
# Encoded as Scorecard element
r2pmml(audit.scorecard, "Audit.pmml")
Q&A
villu@openscoring.io
https://github.com/jpmml
https://github.com/openscoring
https://groups.google.com/forum/#!forum/jpmml
Software (Nov 2017)
● The R side:
○ base 3.3(.1)
○ pmml 1.5(.2)
○ r2pmml 0.15(.0)
○ plyr 1.8(.4)
● The Java side:
○ JPMML-R 1.2(.20)
○ JPMML-Evaluator 1.3(.10)

Más contenido relacionado

La actualidad más candente

Serving earth observation data with GeoServer: addressing real world requirem...
Serving earth observation data with GeoServer: addressing real world requirem...Serving earth observation data with GeoServer: addressing real world requirem...
Serving earth observation data with GeoServer: addressing real world requirem...GeoSolutions
 
softCours design pattern m youssfi partie 9 creation des objets abstract fact...
softCours design pattern m youssfi partie 9 creation des objets abstract fact...softCours design pattern m youssfi partie 9 creation des objets abstract fact...
softCours design pattern m youssfi partie 9 creation des objets abstract fact...ENSET, Université Hassan II Casablanca
 
Sum and Product Types - The Fruit Salad & Fruit Snack Example - From F# to Ha...
Sum and Product Types -The Fruit Salad & Fruit Snack Example - From F# to Ha...Sum and Product Types -The Fruit Salad & Fruit Snack Example - From F# to Ha...
Sum and Product Types - The Fruit Salad & Fruit Snack Example - From F# to Ha...Philip Schwarz
 
Map(), flatmap() and reduce() are your new best friends: simpler collections,...
Map(), flatmap() and reduce() are your new best friends: simpler collections,...Map(), flatmap() and reduce() are your new best friends: simpler collections,...
Map(), flatmap() and reduce() are your new best friends: simpler collections,...Chris Richardson
 
Architecture of Hadoop
Architecture of HadoopArchitecture of Hadoop
Architecture of HadoopKnoldus Inc.
 
Mongodb basics and architecture
Mongodb basics and architectureMongodb basics and architecture
Mongodb basics and architectureBishal Khanal
 
Introduction to Sightly and Sling Models
Introduction to Sightly and Sling ModelsIntroduction to Sightly and Sling Models
Introduction to Sightly and Sling ModelsStefano Celentano
 
Event sourcing in the functional world (22 07-2021)
Event sourcing in the functional world (22 07-2021)Event sourcing in the functional world (22 07-2021)
Event sourcing in the functional world (22 07-2021)Vitaly Brusentsev
 
RxJS - The Basics & The Future
RxJS - The Basics & The FutureRxJS - The Basics & The Future
RxJS - The Basics & The FutureTracy Lee
 
MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka
MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka
MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka Edureka!
 
Hibernate
HibernateHibernate
HibernateAjay K
 
Functional Programming with Groovy
Functional Programming with GroovyFunctional Programming with Groovy
Functional Programming with GroovyArturo Herrero
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slidesDat Tran
 
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...t_ivanov
 

La actualidad más candente (20)

Serving earth observation data with GeoServer: addressing real world requirem...
Serving earth observation data with GeoServer: addressing real world requirem...Serving earth observation data with GeoServer: addressing real world requirem...
Serving earth observation data with GeoServer: addressing real world requirem...
 
softCours design pattern m youssfi partie 9 creation des objets abstract fact...
softCours design pattern m youssfi partie 9 creation des objets abstract fact...softCours design pattern m youssfi partie 9 creation des objets abstract fact...
softCours design pattern m youssfi partie 9 creation des objets abstract fact...
 
Clean coding-practices
Clean coding-practicesClean coding-practices
Clean coding-practices
 
Sum and Product Types - The Fruit Salad & Fruit Snack Example - From F# to Ha...
Sum and Product Types -The Fruit Salad & Fruit Snack Example - From F# to Ha...Sum and Product Types -The Fruit Salad & Fruit Snack Example - From F# to Ha...
Sum and Product Types - The Fruit Salad & Fruit Snack Example - From F# to Ha...
 
Map(), flatmap() and reduce() are your new best friends: simpler collections,...
Map(), flatmap() and reduce() are your new best friends: simpler collections,...Map(), flatmap() and reduce() are your new best friends: simpler collections,...
Map(), flatmap() and reduce() are your new best friends: simpler collections,...
 
Introduction to Rust
Introduction to RustIntroduction to Rust
Introduction to Rust
 
Architecture of Hadoop
Architecture of HadoopArchitecture of Hadoop
Architecture of Hadoop
 
Mongodb basics and architecture
Mongodb basics and architectureMongodb basics and architecture
Mongodb basics and architecture
 
Clang tidy
Clang tidyClang tidy
Clang tidy
 
Introduction to Sightly and Sling Models
Introduction to Sightly and Sling ModelsIntroduction to Sightly and Sling Models
Introduction to Sightly and Sling Models
 
SOLID _Principles.pptx
SOLID _Principles.pptxSOLID _Principles.pptx
SOLID _Principles.pptx
 
Event sourcing in the functional world (22 07-2021)
Event sourcing in the functional world (22 07-2021)Event sourcing in the functional world (22 07-2021)
Event sourcing in the functional world (22 07-2021)
 
RxJS - The Basics & The Future
RxJS - The Basics & The FutureRxJS - The Basics & The Future
RxJS - The Basics & The Future
 
MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka
MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka
MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka
 
Hibernate
HibernateHibernate
Hibernate
 
React workshop
React workshopReact workshop
React workshop
 
Functional Programming with Groovy
Functional Programming with GroovyFunctional Programming with Groovy
Functional Programming with Groovy
 
Lazy java
Lazy javaLazy java
Lazy java
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
 
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
 

Similar a Converting R to PMML

Innovative Specifications for Better Performance Logging and Monitoring
Innovative Specifications for Better Performance Logging and MonitoringInnovative Specifications for Better Performance Logging and Monitoring
Innovative Specifications for Better Performance Logging and MonitoringCary Millsap
 
nter-pod Revolutions: Connected Enterprise Solution in Oracle EPM Cloud
nter-pod Revolutions: Connected Enterprise Solution in Oracle EPM Cloud nter-pod Revolutions: Connected Enterprise Solution in Oracle EPM Cloud
nter-pod Revolutions: Connected Enterprise Solution in Oracle EPM Cloud Alithya
 
ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop
ACM Bay Area Data Mining Workshop: Pattern, PMML, HadoopACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop
ACM Bay Area Data Mining Workshop: Pattern, PMML, HadoopPaco Nathan
 
Angular 2 Architecture (Bucharest 26/10/2016)
Angular 2 Architecture (Bucharest 26/10/2016)Angular 2 Architecture (Bucharest 26/10/2016)
Angular 2 Architecture (Bucharest 26/10/2016)Eyal Vardi
 
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...Chester Chen
 
Groovy On Trading Desk (2010)
Groovy On Trading Desk (2010)Groovy On Trading Desk (2010)
Groovy On Trading Desk (2010)Jonathan Felch
 
Introduction to Zend Framework web services
Introduction to Zend Framework web servicesIntroduction to Zend Framework web services
Introduction to Zend Framework web servicesMichelangelo van Dam
 
Mind Your Business. And Its Logic
Mind Your Business. And Its LogicMind Your Business. And Its Logic
Mind Your Business. And Its LogicVladik Khononov
 
Finagle and Java Service Framework at Pinterest
Finagle and Java Service Framework at PinterestFinagle and Java Service Framework at Pinterest
Finagle and Java Service Framework at PinterestPavan Chitumalla
 
Mapfilterreducepresentation
MapfilterreducepresentationMapfilterreducepresentation
MapfilterreducepresentationManjuKumara GH
 
Functional Reactive Programming with RxJS
Functional Reactive Programming with RxJSFunctional Reactive Programming with RxJS
Functional Reactive Programming with RxJSstefanmayer13
 
An introduction to Test Driven Development on MapReduce
An introduction to Test Driven Development on MapReduceAn introduction to Test Driven Development on MapReduce
An introduction to Test Driven Development on MapReduceAnanth PackkilDurai
 
ZFConf 2010: Zend Framework & MVC, Model Implementation (Part 2, Dependency I...
ZFConf 2010: Zend Framework & MVC, Model Implementation (Part 2, Dependency I...ZFConf 2010: Zend Framework & MVC, Model Implementation (Part 2, Dependency I...
ZFConf 2010: Zend Framework & MVC, Model Implementation (Part 2, Dependency I...ZFConf Conference
 
Flink Batch Processing and Iterations
Flink Batch Processing and IterationsFlink Batch Processing and Iterations
Flink Batch Processing and IterationsSameer Wadkar
 
Sparkling Water Meetup
Sparkling Water MeetupSparkling Water Meetup
Sparkling Water MeetupSri Ambati
 

Similar a Converting R to PMML (20)

Innovative Specifications for Better Performance Logging and Monitoring
Innovative Specifications for Better Performance Logging and MonitoringInnovative Specifications for Better Performance Logging and Monitoring
Innovative Specifications for Better Performance Logging and Monitoring
 
nter-pod Revolutions: Connected Enterprise Solution in Oracle EPM Cloud
nter-pod Revolutions: Connected Enterprise Solution in Oracle EPM Cloud nter-pod Revolutions: Connected Enterprise Solution in Oracle EPM Cloud
nter-pod Revolutions: Connected Enterprise Solution in Oracle EPM Cloud
 
ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop
ACM Bay Area Data Mining Workshop: Pattern, PMML, HadoopACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop
ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop
 
Angular 2 Architecture (Bucharest 26/10/2016)
Angular 2 Architecture (Bucharest 26/10/2016)Angular 2 Architecture (Bucharest 26/10/2016)
Angular 2 Architecture (Bucharest 26/10/2016)
 
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
 
Groovy On Trading Desk (2010)
Groovy On Trading Desk (2010)Groovy On Trading Desk (2010)
Groovy On Trading Desk (2010)
 
Drupal 8 migrate!
Drupal 8 migrate!Drupal 8 migrate!
Drupal 8 migrate!
 
Zend framework service
Zend framework serviceZend framework service
Zend framework service
 
Zend framework service
Zend framework serviceZend framework service
Zend framework service
 
Introduction to Zend Framework web services
Introduction to Zend Framework web servicesIntroduction to Zend Framework web services
Introduction to Zend Framework web services
 
Mind Your Business. And Its Logic
Mind Your Business. And Its LogicMind Your Business. And Its Logic
Mind Your Business. And Its Logic
 
Finagle and Java Service Framework at Pinterest
Finagle and Java Service Framework at PinterestFinagle and Java Service Framework at Pinterest
Finagle and Java Service Framework at Pinterest
 
Reason and GraphQL
Reason and GraphQLReason and GraphQL
Reason and GraphQL
 
Mapfilterreducepresentation
MapfilterreducepresentationMapfilterreducepresentation
Mapfilterreducepresentation
 
Broadleaf Presents Thymeleaf
Broadleaf Presents ThymeleafBroadleaf Presents Thymeleaf
Broadleaf Presents Thymeleaf
 
Functional Reactive Programming with RxJS
Functional Reactive Programming with RxJSFunctional Reactive Programming with RxJS
Functional Reactive Programming with RxJS
 
An introduction to Test Driven Development on MapReduce
An introduction to Test Driven Development on MapReduceAn introduction to Test Driven Development on MapReduce
An introduction to Test Driven Development on MapReduce
 
ZFConf 2010: Zend Framework & MVC, Model Implementation (Part 2, Dependency I...
ZFConf 2010: Zend Framework & MVC, Model Implementation (Part 2, Dependency I...ZFConf 2010: Zend Framework & MVC, Model Implementation (Part 2, Dependency I...
ZFConf 2010: Zend Framework & MVC, Model Implementation (Part 2, Dependency I...
 
Flink Batch Processing and Iterations
Flink Batch Processing and IterationsFlink Batch Processing and Iterations
Flink Batch Processing and Iterations
 
Sparkling Water Meetup
Sparkling Water MeetupSparkling Water Meetup
Sparkling Water Meetup
 

Más de Villu Ruusmann

State of the (J)PMML art
State of the (J)PMML artState of the (J)PMML art
State of the (J)PMML artVillu Ruusmann
 
Converting Scikit-Learn to PMML
Converting Scikit-Learn to PMMLConverting Scikit-Learn to PMML
Converting Scikit-Learn to PMMLVillu Ruusmann
 
R, Scikit-Learn and Apache Spark ML - What difference does it make?
R, Scikit-Learn and Apache Spark ML - What difference does it make?R, Scikit-Learn and Apache Spark ML - What difference does it make?
R, Scikit-Learn and Apache Spark ML - What difference does it make?Villu Ruusmann
 
Representing TF and TF-IDF transformations in PMML
Representing TF and TF-IDF transformations in PMMLRepresenting TF and TF-IDF transformations in PMML
Representing TF and TF-IDF transformations in PMMLVillu Ruusmann
 
On the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) modelsOn the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) modelsVillu Ruusmann
 

Más de Villu Ruusmann (6)

State of the (J)PMML art
State of the (J)PMML artState of the (J)PMML art
State of the (J)PMML art
 
Converting Scikit-Learn to PMML
Converting Scikit-Learn to PMMLConverting Scikit-Learn to PMML
Converting Scikit-Learn to PMML
 
The case for (J)PMML
The case for (J)PMMLThe case for (J)PMML
The case for (J)PMML
 
R, Scikit-Learn and Apache Spark ML - What difference does it make?
R, Scikit-Learn and Apache Spark ML - What difference does it make?R, Scikit-Learn and Apache Spark ML - What difference does it make?
R, Scikit-Learn and Apache Spark ML - What difference does it make?
 
Representing TF and TF-IDF transformations in PMML
Representing TF and TF-IDF transformations in PMMLRepresenting TF and TF-IDF transformations in PMML
Representing TF and TF-IDF transformations in PMML
 
On the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) modelsOn the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) models
 

Último

From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 

Último (20)

From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2

Converting R to PMML

  • 1. Converting R to PMML
    Villu Ruusmann
    Openscoring OÜ
  • 3. R challenge
    "Fat code around thin model objects":
    ● Many packages to solve the same problem, many ways of doing and/or representing the same thing
    ● The unit of reusability is the R script
    ● The model object is tightly coupled to the R script that produced it. Incomplete/split schemas
    ● No (widely adopted) pipeline formalization
  • 4. (J)PMML solution
    Evaluator evaluator = ...;
    List<InputField> argumentFields = evaluator.getInputFields();
    List<ResultField> resultFields =
        Lists.union(evaluator.getTargetFields(), evaluator.getOutputFields());
    Map<FieldName, ?> arguments = readRecord(argumentFields);
    Map<FieldName, ?> result = evaluator.evaluate(arguments);
    writeRecord(result, resultFields);
    ● The R script (data pre- and post-processing, model) is represented using standardized PMML data structures
    ● Well-defined data entry and exit interfaces
  • 6. R challenge
    Training:
    audit_train = read.csv("Audit.csv")
    audit_train$HourlyIncome = (audit_train$Income / (audit_train$Hours * 52))
    audit.model = model(Adjusted ~ ., data = audit_train)
    saveRDS(audit.model, "Audit.rds")
    Deployment:
    audit_test = read.csv("Audit.csv")
    audit.model = readRDS("Audit.rds")
    # "Error in eval(expr, envir, enclos) : object 'HourlyIncome' not found"
    predict(audit.model, newdata = audit_test)
  • 7. R solution
    Matrix interface ( ):
    label = audit$Adjusted
    features = audit[, !(colnames(audit) %in% c("Adjusted"))]
    # Returns a `model` object
    audit.model = model(x = features, y = label)
    Formula interface ( ):
    # Returns a `model.formula` object
    audit.model = model(Adjusted ~ . + I(Income / (Hours * 52)), data = audit)
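The contrast between slides 6 and 7 can be reproduced with base R alone. The following is a minimal sketch, assuming a small synthetic data frame in place of `Audit.csv` (all values and the label-generating rule are made up for illustration) and `glm` standing in for the generic `model` function of the slides.

```r
# Synthetic stand-in for the Audit dataset (illustrative values only).
set.seed(42)
audit <- data.frame(
  Income = runif(200, 10000, 100000),
  Hours = sample(20:60, 200, replace = TRUE)
)
# Hypothetical noisy binary label that depends on the hourly income.
hourly <- audit$Income / (audit$Hours * 52)
audit$Adjusted <- rbinom(200, 1, plogis((hourly - 30) / 15))

# "Deployment" data holds only the raw columns.
audit_test <- data.frame(Income = c(30000, 80000), Hours = c(40, 35))

# Fragile variant: the derived column is computed outside the model.
audit_train <- audit
audit_train$HourlyIncome <- audit_train$Income / (audit_train$Hours * 52)
fragile.glm <- glm(Adjusted ~ Income + Hours + HourlyIncome,
                   data = audit_train, family = "binomial")
# Prediction on raw data fails because HourlyIncome is not available.
failure <- tryCatch(
  predict(fragile.glm, newdata = audit_test),
  error = function(e) conditionMessage(e)
)
print(failure)

# Robust variant: the derived feature is part of the formula,
# so the model object carries the transformation with it.
audit.glm <- glm(Adjusted ~ Income + Hours + I(Income / (Hours * 52)),
                 data = audit, family = "binomial")
probabilities <- predict(audit.glm, newdata = audit_test, type = "response")
print(probabilities)
```

In the second variant the transformation is recorded in the model's terms, which is the information a converter needs in order to emit a matching `DerivedField`.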
  • 8. From `pmml` to `r2pmml`
    library("pmml")
    library("r2pmml")
    audit = read.csv("Audit.csv")
    audit$Adjusted = as.factor(audit$Adjusted)
    audit.glm = glm(Adjusted ~ . + I(Income / (Hours * 52)),
        data = audit, family = "binomial")
    saveXML(pmml(audit.glm), "Audit.pmml")
    r2pmml(audit.glm, "Audit.pmml")
  • 9. Package `pmml` ( )
    <PMML>
      <DataDictionary>
        <DataField name="I(Income/(Hours * 52))" optype="continuous" dataType="double"/>
      </DataDictionary>
      <GeneralRegressionModel>
        <MiningSchema>
          <MiningField name="I(Income/(Hours * 52))"/>
        </MiningSchema>
      </GeneralRegressionModel>
    </PMML>
  • 10. Package `r2pmml` ( )
    <PMML>
      <DataDictionary>
        <DataField name="Income" optype="continuous" dataType="double"/>
        <DataField name="Hours" optype="continuous" dataType="double"/>
      </DataDictionary>
      <TransformationDictionary>
        <DerivedField name="Income/(Hours * 52)" optype="continuous" dataType="double">
          <Apply function="/">
            <FieldRef field="Income"/>
            <Apply function="*">
              <FieldRef field="Hours"/>
              <Constant>52</Constant>
            </Apply>
          </Apply>
        </DerivedField>
      </TransformationDictionary>
      <GeneralRegressionModel>
        <MiningSchema>
          <MiningField name="Income"/>
          <MiningField name="Hours"/>
        </MiningSchema>
      </GeneralRegressionModel>
    </PMML>
  • 11. Manual conversion (1/2)
    [Diagram of the r2pmml::r2pmml workflow. The R side: r2pmml::decorate, base::saveRDS. RDS file. The Java side: JPMML-R library.]
  • 12. Manual conversion (2/2)
    The R side:
    audit.model = model(Adjusted ~ ., data = audit)
    # Decorate the model object with supporting information
    audit.model = r2pmml::decorate(audit.model, dataset = audit)
    saveRDS(audit.model, "Audit.rds")
    The Java side:
    $ git clone https://github.com/jpmml/jpmml-r.git
    $ cd jpmml-r; mvn clean package
    $ java -jar target/converter-executable-1.2-SNAPSHOT.jar --rds-input /path/to/Audit.rds --pmml-output /path/to/Audit.pmml
  • 14. Sanitization (1/2)
    ?base::ifelse
    # Static replacement value
    as.formula(Adjusted ~ ifelse(is.na(Hours), 0, Hours))
    # Dynamic replacement value (na.rm = TRUE keeps the mean defined when Hours has missing values)
    as.formula(paste("Adjusted ~ ifelse((is.na(Hours) | Hours < 0 | Hours > 168),", mean(df$Hours, na.rm = TRUE), ", Hours)", sep = ""))
  • 15. Sanitization (2/2)
    <DerivedField name="ifelse(is.na(Hours))" optype="continuous" dataType="double">
      <Extension>ifelse(is.na(Hours), 0, Hours)</Extension>
      <Apply function="if">
        <Apply function="isMissing">
          <FieldRef field="Hours"/>
        </Apply>
        <Constant>0</Constant>
        <FieldRef field="Hours"/>
      </Apply>
    </DerivedField>
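To see what the two sanitization formulas do to actual values, here is a small base R sketch. The `df` data frame and its `Hours` values are made up, and computing the replacement from the valid values only is an assumption of this example, not something the slides prescribe.

```r
# Made-up Hours column with one missing and two out-of-range values.
df <- data.frame(Hours = c(40, NA, 35, -5, 200, 45))

# Static replacement: missing values become 0, everything else is kept.
static <- ifelse(is.na(df$Hours), 0, df$Hours)
print(static)

# Dynamic replacement: the constant is computed at training time
# (here from the valid values only) and pasted into the formula text,
# so the formula ends up containing a plain number.
valid <- !is.na(df$Hours) & df$Hours >= 0 & df$Hours <= 168
replacement <- mean(df$Hours[valid])
sanitize.formula <- as.formula(paste(
  "Adjusted ~ ifelse((is.na(Hours) | Hours < 0 | Hours > 168), ",
  replacement, ", Hours)", sep = ""))
print(sanitize.formula)

# Evaluating the same expression directly shows the sanitized column.
dynamic <- with(df, ifelse((is.na(Hours) | Hours < 0 | Hours > 168),
                           replacement, Hours))
print(dynamic)
```

Because the replacement is baked into the formula as a literal, the expression no longer depends on the training data frame being available at scoring time.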
  • 16. Binarization (1/2)
    ?base::ifelse
    as.formula(Adjusted ~ ifelse(Hours >= 40, "full-time", "part-time"))
  • 17. Binarization (2/2)
    <DerivedField name="ifelse(Hours &gt;= 40)" optype="categorical" dataType="string">
      <Extension>ifelse(Hours &gt;= 40, "full-time", "part-time")</Extension>
      <Apply function="if">
        <Apply function="greaterOrEqual">
          <FieldRef field="Hours"/>
          <Constant>40</Constant>
        </Apply>
        <Constant>full-time</Constant>
        <Constant>part-time</Constant>
      </Apply>
    </DerivedField>
  • 18. Bucketization (1/2)
    ?base::cut
    # Static intervals
    as.formula(Adjusted ~ cut(Income, breaks = 3))
    # Dynamic intervals
    as.formula(Adjusted ~ cut(log10(Income), breaks = quantile(log10(Income))))
  • 19. Bucketization (2/2)
    <DerivedField name="cut(Income)" optype="categorical" dataType="string">
      <Extension>cut(Income, breaks = 3)</Extension>
      <Discretize field="Income">
        <DiscretizeBin binValue="(129,1.61e+05]">
          <Interval closure="openClosed" leftMargin="129.0" rightMargin="161000.0"/>
        </DiscretizeBin>
        <DiscretizeBin binValue="(1.61e+05,3.21e+05]">
          <Interval closure="openClosed" leftMargin="161000.0" rightMargin="321000.0"/>
        </DiscretizeBin>
        <DiscretizeBin binValue="(3.21e+05,4.82e+05]">
          <Interval closure="openClosed" leftMargin="321000.0" rightMargin="482000.0"/>
        </DiscretizeBin>
      </Discretize>
    </DerivedField>
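The two `cut` variants can be tried on a plain vector before putting them into a formula. This is a minimal sketch with made-up income values. It also shows a base R detail that matters for quantile-based breaks: intervals are open on the left by default, so the smallest value falls outside the first bin unless `include.lowest = TRUE` is passed. Whether the converter accepts that extra argument is not stated in the slides.

```r
# Made-up income values (not taken from the Audit dataset).
income <- c(129, 20000, 60000, 161000, 250000, 321000, 400000, 482000)

# Static intervals: three equal-width bins over the observed range.
static <- cut(income, breaks = 3)
print(levels(static))

# Dynamic intervals: bin edges are the quartiles of the log-transformed values.
log.income <- log10(income)
dynamic <- cut(log.income, breaks = quantile(log.income))
print(table(dynamic, useNA = "ifany"))

# With left-open intervals the minimum is not covered by any bin.
print(sum(is.na(dynamic)))

# Closing the first interval on the left keeps every value.
closed <- cut(log.income, breaks = quantile(log.income), include.lowest = TRUE)
print(table(closed, useNA = "ifany"))
```

The bin labels printed by `levels(static)` are the strings that end up as `binValue` attributes in a `Discretize` element like the one on slide 19.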
  • 20. Mathematical expressions (1/2)
    ?base::I
    as.formula(Adjusted ~ I(floor(Income / (Hours * 52))))
    Supported constructs:
    ● Arithmetic operators `+`, `-`, `*`, `/`, `^` and `**`
    ● Relational operators `==`, `!=`, `<`, `<=`, `>=` and `>`
    ● Logical operators `!`, `&` and `|`
    ● Functions `abs`, `ceiling`, `exp`, `floor`, `is.na`, `log`, `log10`, `round` and `sqrt`
  • 21. Mathematical expressions (2/2)
    <DerivedField name="floor(Income/(Hours * 52))" optype="continuous" dataType="double">
      <Extension>I(floor(Income/(Hours * 52)))</Extension>
      <Apply function="floor">
        <Apply function="/">
          <FieldRef field="Income"/>
          <Apply function="*">
            <FieldRef field="Hours"/>
            <Constant>52</Constant>
          </Apply>
        </Apply>
      </Apply>
    </DerivedField>
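A quick way to check what an `I(...)` term computes is to build the model frame without fitting anything. The sketch below uses three made-up rows, and the variable names simply mirror the slides.

```r
# Three made-up records with the columns used on the slide.
audit <- data.frame(
  Adjusted = c(0, 1, 0),
  Income = c(52000, 104000, 33000),
  Hours = c(40, 50, 30)
)

# model.frame evaluates the formula terms against the data.
frame <- model.frame(Adjusted ~ I(floor(Income / (Hours * 52))), data = audit)

# The column name is the deparsed term.
print(names(frame))

# The column itself holds the floored hourly income.
hourly <- as.numeric(frame[[2]])
print(hourly)
```

The printed column name is R's own deparsed form of the term, which appears to be where the `DerivedField` names and `Extension` texts on the slides come from.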
  • 22. Renaming and regrouping categories (1/2)
    library("plyr")
    ?plyr::revalue
    ?plyr::mapvalues
    as.formula(Adjusted ~ plyr::revalue(Employment, c("PSFederal" = "Public", "PSState" = "Public", "PSLocal" = "Public")))
    as.formula(Adjusted ~ plyr::revalue(Education, c("Yr1t4" = "Yr1t6", "Yr5t6" = "Yr1t6", "Yr7t8" = "Yr7t9", "Yr9" = "Yr7t9", "Yr10" = "Yr10t12", "Yr11" = "Yr10t12", "Yr12" = "Yr10t12")))
    as.formula(Adjusted ~ plyr::mapvalues(Gender, c("Male", "Female"), c(0, 1)))
  • 23. Renaming and regrouping categories (2/2)
    <DerivedField name="revalue(Employment)" optype="categorical" dataType="string">
      <Extension>plyr::revalue(Employment, c(PSFederal = "Public", PSState = "Public", PSLocal = "Public"))</Extension>
      <MapValues outputColumn="to">
        <FieldColumnPair field="Employment" column="from"/>
        <InlineTable>
          <row><from>PSFederal</from><to>Public</to></row>
          <row><from>PSState</from><to>Public</to></row>
          <row><from>PSLocal</from><to>Public</to></row>
          <row><from>Consultant</from><to>Consultant</to></row>
          <row><from>Private</from><to>Private</to></row>
          <row><from>SelfEmp</from><to>SelfEmp</to></row>
          <row><from>Volunteer</from><to>Volunteer</to></row>
        </InlineTable>
      </MapValues>
    </DerivedField>
  • 24. Interactions
    as.formula(Adjusted ~
      # Continuous vs. continuous
      Age:Income +
      # Continuous vs. transformed continuous
      Age:I(log10(Income)) +
      # Continuous vs. categorical
      Gender:Income +
      # Categorical vs. categorical
      (Gender + Education) ^ 2
    )
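How R expands these interaction terms can be inspected with `model.matrix`, which builds the design matrix without fitting a model. This is a sketch over four made-up rows. The category values are illustrative, and the label is left out because only the right-hand side matters here.

```r
# Four made-up records with two continuous and two categorical columns.
audit <- data.frame(
  Age = c(25, 40, 58, 33),
  Income = c(30000, 80000, 52000, 61000),
  Gender = factor(c("Male", "Female", "Female", "Male")),
  Education = factor(c("College", "Master", "College", "Master"))
)

# Same right-hand side as on the slide.
design <- model.matrix(
  ~ Age:Income + Age:I(log10(Income)) + Gender:Income + (Gender + Education)^2,
  data = audit
)

# One column per expanded term, including dummy-coded categories.
print(colnames(design))

# A continuous vs. continuous interaction is the element-wise product.
print(design[, "Age:Income"])
```

Listing the column names makes it visible that `(Gender + Education) ^ 2` contributes both main effects and their pairwise interaction, while the `:` operator contributes the interaction only.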
  • 26. Alternative encodings (1/2)
    # Continuous label ~ Continuous and categorical features
    audit.glm = glm(Income ~ Age + Employment + Education + Marital + Occupation + Gender + Hours,
        data = audit, family = "gaussian")
    # Encoded as GeneralRegressionModel element
    r2pmml(audit.glm, "Audit.pmml")
    # Encoded as RegressionModel element
    r2pmml(audit.glm, "Audit.pmml", converter = "org.jpmml.rexp.LMConverter")
  • 27. Alternative encodings (2/2)
    # Binary label ~ Categorical features
    audit.glm = glm(Adjusted ~ cut(Age, breaks = c(0, 21, 65, 100)) + Employment + Education + Marital + Occupation + cut(Income, breaks = quantile(Income)) + Gender + ifelse(Hours >= 40, "full-time", "part-time"),
        data = audit, family = "binomial")
    audit.scorecard = r2pmml::as.scorecard(audit.glm)
    # Encoded as Scorecard element
    r2pmml(audit.scorecard, "Audit.pmml")
  • 29. Software (Nov 2017)
    ● The R side:
      ○ base 3.3(.1)
      ○ pmml 1.5(.2)
      ○ r2pmml 0.15(.0)
      ○ plyr 1.8(.4)
    ● The Java side:
      ○ JPMML-R 1.2(.20)
      ○ JPMML-Evaluator 1.3(.10)