Demonstration of using the featuretools package to generate features / aggregates from raw relational data, and of using MLflow to track the entire model building and hyperparameter optimization process
Databricks MLflow demonstration using automated feature engineering
1.
2. Overview of a typical machine learning model workflow
Fact #1: Doing machine learning IS complex
3. Fact #2: The hardest part of AI is actually not the AI code...
4.
5. Main concerns of machine learning projects
1- The open source ML ecosystem is crowded: for each phase of the ML process, there is a myriad of tools to choose from;
2- Tracking: it is difficult to track by hand which parameters, code, and data went into each experiment to produce a model, especially when working in teams;
3- Reproducibility: without detailed tracking, teams often have trouble getting the same code to work or to achieve the same results.
6. Introducing MLflow
- First release in June 2018
- Latest version: v1.5, released 19 Dec 2019
7. MLflow addresses machine learning challenges through its 3 main components
What is MLflow?
“MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility and deployment. It currently offers three components:”
1. Tracking 2. Projects 3. Models
9. MLflow Tracking API
A single API + UI to track, for each experiment:
▸ Parameters
▸ Metrics
▸ Artifacts (training datasets, …)
Can be used from a standalone script or from a notebook
11. MLflow Projects
- MLflow Projects define a standard packaging format to manage data science code.
- A project can be a simple directory / git repo with code to run.
- The running environment requirements are defined in a simple YAML file.
Sample MLflow project YAML file
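The original sample file is not reproduced in this transcript, so the sketch below writes a hypothetical one. The layout follows the documented `MLproject` fields (`name`, `conda_env`, `entry_points`), while the project name, the `n_estimators` parameter, the `train.py` script and the dependency list are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical project description: names, parameter and script are illustrative.
MLPROJECT = """\
name: fraud_detection_demo

conda_env: conda.yaml

entry_points:
  main:
    parameters:
      n_estimators: {type: int, default: 100}
    command: "python train.py --n-estimators {n_estimators}"
"""

# Hypothetical environment file referenced by `conda_env` above.
CONDA_ENV = """\
name: fraud_detection_demo
channels:
  - defaults
dependencies:
  - python=3.7
  - scikit-learn
  - pip
  - pip:
      - mlflow
"""


def write_project(root: Path) -> Path:
    """Create the two YAML files that describe a project inside `root`."""
    (root / "MLproject").write_text(MLPROJECT)
    (root / "conda.yaml").write_text(CONDA_ENV)
    return root


project_dir = write_project(Path(tempfile.mkdtemp()))
print(sorted(p.name for p in project_dir.iterdir()))
```

Once a matching `train.py` exists in the same folder, such a project would typically be launched with `mlflow run <project_dir> -P n_estimators=200`.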
12. MLflow Models
- MLflow Models is a convention for packaging machine learning models in multiple formats called “flavors”. MLflow offers a variety of tools to help you deploy different flavors of models.
- Each MLflow Model is saved as a directory containing arbitrary files and an MLmodel descriptor file that lists the flavors it can be used in.
Example of a scikit-learn model
13. Model serving commands
- mlflow models serve: deploys the model as a local REST API server.
- mlflow models build-docker: packages a REST API endpoint serving the model as a Docker image.
- mlflow models predict: uses the model to generate a prediction for a local CSV or JSON file.
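The three commands can be assembled as argument lists, as in the sketch below, which only builds and prints them without running anything. The model URI, port, image name and input file are hypothetical. The short flags used (`-m`, `-p`, `-n`, `-i`, `-t`) are the ones the `mlflow models` CLI documents, though availability may differ between MLflow versions.

```python
import shlex


def serving_commands(model_uri, port=5000, image_name="fraud-model",
                     input_path="profiles.csv"):
    """Build (but do not run) the three CLI invocations as argument lists."""
    return {
        # Local REST API server for the model.
        "serve": ["mlflow", "models", "serve", "-m", model_uri, "-p", str(port)],
        # Docker image exposing the same REST endpoint.
        "build-docker": ["mlflow", "models", "build-docker", "-m", model_uri,
                         "-n", image_name],
        # Batch prediction on a local CSV file.
        "predict": ["mlflow", "models", "predict", "-m", model_uri,
                    "-i", input_path, "-t", "csv"],
    }


commands = serving_commands("rf_model")
for name, argv in commands.items():
    print(name, "->", " ".join(shlex.quote(arg) for arg in argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` once a saved model exists at the given URI.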
15. E-commerce fraud detection
We have some JSON profiles representing fictional customers of an e-commerce company (courtesy of Ravelin: https://github.com/unravelin/code-test-data-science).
The profiles contain information about the customers, their orders, their transactions, the payment methods they used, and whether the customer is fraudulent or not.
Our task:
● Transform the JSON profiles into feature vectors:
a. Automated feature engineering using the featuretools package
● Construct a model to predict whether a customer is fraudulent based on their profile:
a. Modeling phase using Python + scikit-learn
b. Track experiment results using Databricks MLflow
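The first step of the task, turning nested JSON profiles into relational tables, might be sketched as follows. The two profiles and every field name in them are invented for illustration and should not be read as the actual schema of the Ravelin dataset.

```python
import json

import pandas as pd

# Invented profiles: field names are assumptions, not the real dataset schema.
raw_profiles = """
[
  {"customerId": "c1", "fraudulent": false,
   "orders": [{"orderId": "o1", "amount": 20.0},
              {"orderId": "o2", "amount": 35.0}],
   "transactions": [{"transactionId": "t1", "orderId": "o1", "amount": 20.0},
                    {"transactionId": "t2", "orderId": "o2", "amount": 35.0}]},
  {"customerId": "c2", "fraudulent": true,
   "orders": [{"orderId": "o3", "amount": 99.0}],
   "transactions": [{"transactionId": "t3", "orderId": "o3", "amount": 99.0}]}
]
"""
profiles = json.loads(raw_profiles)

# One row per customer, carrying the label.
customers = pd.DataFrame(
    [{"customerId": p["customerId"], "fraudulent": p["fraudulent"]} for p in profiles]
)


def child_table(profiles, key):
    """Flatten a nested list into one row per item, keyed by customerId."""
    rows = [{"customerId": p["customerId"], **item} for p in profiles for item in p[key]]
    return pd.DataFrame(rows)


orders = child_table(profiles, "orders")
transactions = child_table(profiles, "transactions")
print(customers, orders, transactions, sep="\n\n")
```

Keeping `customerId` on every child row is what later lets the tables be related to each other for aggregation.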
16.
Transform input data
1/ Transactions / orders
2/ Labels: fraudulent (true/false)
A Python script decodes each user profile JSON array into relational pandas dataframes.
Extract features
- Count of orders
- Min / max / avg transaction amount...
Data aggregation can be done using SQL / Spark SQL / pandas dataframes. In our case we will use the featuretools package to automate this phase.
Store analytical dataset
Parquet file with:
customerID | features X… | label
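For comparison with the automated route, the aggregates named on this slide can be written by hand in pandas. This is a sketch of the manual alternative, not of the featuretools API; the toy tables and column names are assumptions.

```python
import pandas as pd

# Toy relational tables (made-up values; column names are assumptions).
customers = pd.DataFrame({"customerId": ["c1", "c2"], "fraudulent": [False, True]})
orders = pd.DataFrame({"customerId": ["c1", "c1", "c2"], "orderId": ["o1", "o2", "o3"]})
transactions = pd.DataFrame({
    "customerId": ["c1", "c1", "c2"],
    "amount": [20.0, 35.0, 99.0],
})

# Count of orders per customer.
order_feats = orders.groupby("customerId").agg(n_orders=("orderId", "count"))

# Min / max / avg transaction amount per customer.
txn_feats = transactions.groupby("customerId").agg(
    min_amount=("amount", "min"),
    max_amount=("amount", "max"),
    avg_amount=("amount", "mean"),
)

dataset = (
    customers.set_index("customerId")
    .join(order_feats)
    .join(txn_feats)
    .rename(columns={"fraudulent": "label"})
    .reset_index()
)
# Put the label last: customerID | features X... | label
dataset = dataset[[c for c in dataset.columns if c != "label"] + ["label"]]
print(dataset)
```

The result could then be stored with `dataset.to_parquet(...)`, which requires a Parquet engine such as pyarrow to be installed.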
Baseline classifier
Train a random forest model with default parameters
Output: base AUC
Tuned classifier
Use GridSearchCV to tune the best parameters based on cross-validation results
Output: optimized AUC
MLflow tracking goes here
Model building & tracking process
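The baseline and tuned classifiers can be sketched with scikit-learn as below. Synthetic imbalanced data stands in for the customer feature table, and the parameter grid is a small illustrative choice, not the grid used in the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced data stands in for the customer features (assumption).
X, y = make_classification(n_samples=600, n_features=10, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Baseline: random forest with default parameters.
baseline = RandomForestClassifier(random_state=0).fit(X_train, y_train)
base_auc = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])

# Tuned: small illustrative grid, scored by cross-validated AUC.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      scoring="roc_auc", cv=3)
search.fit(X_train, y_train)
tuned_auc = roc_auc_score(y_test, search.best_estimator_.predict_proba(X_test)[:, 1])

print("best params:", search.best_params_)
print(f"base AUC={base_auc:.3f}  tuned AUC={tuned_auc:.3f}")
```

The tuned model is not guaranteed to beat the baseline on the held-out split. The best parameters and both AUC values are the kind of information a tracking step would record for each run.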
17. Appendix #1: Deep feature synthesis
Used in the featuretools Python package to generate aggregates / apply transformations on relational data
Source: featurelabs.com
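The core idea of deep feature synthesis, stacking aggregations across related tables, can be illustrated by hand in pandas. This is not the featuretools API: it is a hand-rolled sketch on invented tables, with feature names written in the style featuretools uses for generated features.

```python
import pandas as pd

# Invented tables: customers -> orders -> transactions (names are assumptions).
orders = pd.DataFrame({"orderId": ["o1", "o2", "o3"],
                       "customerId": ["c1", "c1", "c2"]})
transactions = pd.DataFrame({
    "orderId": ["o1", "o1", "o2", "o3"],
    "amount": [10.0, 5.0, 35.0, 99.0],
})

# Depth 1: aggregate transactions up to their parent order.
order_total = (
    transactions.groupby("orderId")["amount"].sum().rename("SUM(transactions.amount)")
)
orders = orders.join(order_total, on="orderId")

# Depth 2: aggregate the depth-1 feature up to the customer.
customer_feature = (
    orders.groupby("customerId")["SUM(transactions.amount)"]
    .mean()
    .rename("MEAN(orders.SUM(transactions.amount))")
)
print(customer_feature)
```

An automated tool enumerates many such combinations of aggregation and relationship path, which is what makes writing them manually impractical beyond a few features.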
18.
Appendix #2: How do random forests work?
Main parameters to tune:
- Max_depth: maximum depth of each tree
- Nb_estimators: # of trees
- Min_rows: minimum number of observations for a leaf
- Col_sample: column sample rate per tree
- Sample_rate: row sample rate per tree (defaults to 0.63333)
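The parameter names on this slide are not scikit-learn's, so the sketch below gives an assumed correspondence to `RandomForestClassifier` arguments. The mapping is approximate (for instance, scikit-learn's `max_features` samples columns at each split rather than once per tree), and the values are illustrative, not recommendations.

```python
from sklearn.ensemble import RandomForestClassifier

# Assumed correspondence between the slide's names and scikit-learn arguments.
slide_to_sklearn = {
    "Max_depth": "max_depth",          # maximum depth of each tree
    "Nb_estimators": "n_estimators",   # number of trees
    "Min_rows": "min_samples_leaf",    # minimum observations in a leaf
    "Col_sample": "max_features",      # column sampling (per split in scikit-learn)
    "Sample_rate": "max_samples",      # fraction of rows drawn for each tree
}

# Illustrative values only.
forest = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_leaf=5,
    max_features=0.5,
    max_samples=0.632,  # roughly the unique-row share of a bootstrap sample, 1 - 1/e
    random_state=0,
)
print({name: forest.get_params()[name] for name in slide_to_sklearn.values()})
```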
19.
Complete code on GitHub:
https://github.com/mmejdoubi/mlflow_fraud_ecom/blob/master/ravelin_fraud_RF_mlflow_v1.ipynb