This document discusses MLBox, an automated machine learning tool that aims to automate as many steps of the machine learning pipeline as possible with minimal human intervention. It focuses on explaining the automation of four main steps: 1) reading and merging data, 2) preprocessing like cleaning and encoding data, 3) model optimization through techniques like feature selection and hyperparameter tuning, and 4) making predictions on new data. MLBox handles common data types and tasks like classification and regression. The document outlines MLBox's features, compares it to other auto-ML tools, and discusses plans for future improvements.
3. • INTRODUCTION : WHY AUTO-ML IS GAINING MOMENTUM ?
• FOCUS ON THE AUTOMATION PROCESS
• MLBOX COMPARED WITH OTHER AUTOMATED TOOLS
• CONCLUSION
• LIVE DEMO
• Q&A
5. Auto Machine Learning
A fully automated process
Data Computation meansRobot
• Supervised tasks
- classification
- regression
• Structured data
- csv files
- json files
- …
• Unsupervised tasks
- outlier detection
- clustering
- …
• Unstructured data
- images
- texts
- …
6. What is auto-ML ?
We want to automate…
…the maximum number of steps in a ML pipeline…
…with minimum human intervention…
…while conserving a high performance !
7. Data
cleaning
(duplicates, ids,
correlations,
leaks, … )
Data
encoding
(NA, dates, text,
categorical
features, … )
STEP 2 : Preprocessing
STEP 1 : Reading /
merging
STEP 3 : Optimisation
Feature
selection
Feature
engineering
Model
selection
Prediction
Model
interpretation
STEP 4 : Application
Focus on the automation process
Diagram of a standard ML pipeline
8. Step 1: reading and merging
Maybe the hardest step. Not a priority for auto-ML at
the moment.
Some inputs : paths to the data + target name
Difficult to auto-merge different sources
9. Step 2: preprocessing
Not all packages tackle preprocessing
No inputs : heuristics to detect feature types are easy
Naive encoding for most auto-ML packages
10. Step 3: optimization
Top priority for auto-ML community
Some inputs : scoring function + a hyper-parameter space
Computation time can be long !
11. Step 4: application
Prediction is also a priority for auto-ML
No auto-monitoring after model deployment
Fast and accurate step
12. Maintenance
Ease of setup
Automated
steps
- Reading
- Encoding
- Optimisation
- Encoding
- Optimisation
- Reading/merging
- Cleaning
- Encoding
- Optimisation
- Encoding
- Optimisation - Optimisation
Quality of
automation
Open source ?
Overview of various solutions
14. STEP 1 : Reading / merging
From several raw datasets to one structured dataset.
• List of paths to all the datasets
• Target name
• Reading of several files (csv, xls, json and hdf5)
• Auto-merging of all the sources – information crunching
• Task detection (binary/multiclass classification or regression)
• Split between train and test sets
• Basic information display
• Dense structured train and test files (without duplicates)
Features
what is
Automated by MLBox ?
15. • A dataset
• Auto-cleaning/dropping : duplicates, drifts / covariate shifts (*), high correlations, highly sparse features
/ samples, constants, …
• Feature encoding : missing values, lists, dates, categorical features, text, … - SEVERAL STRATEGIES
• A cleaned dataset with numerical features
STEP 2 : Preprocessing
From a dirty dataset to a cleaned numerical one
Features
what is
Automated by MLBox ?
16. STEP 3 : Optimisation
A wide range of models are tested and cross-validated
• A metric (a wide choice, otherwise can be implemented) - OPTIONAL
• A hyper-parameter space – OPTIONAL
• The train set
• Feature engineering : using neural networks (*)
• Feature selection : filter methods, wrapper methods, embedded methods
• Model selection : a wide range of accurate models (LightGBM, XGBoost, Random Forest, Linear, …)
• Hyper-parameter optimisation : TPE (Bayesian optimisation method) – dumping of fitted pipelines
• Ensembling : multi-layer stacking, boosting, bagging, …
• The optimal pipeline configuration
Features
what is
Automated by MLBox ?
17. STEP 4 : Prediction
The best model is fitted and predicts on the test set
• Train and test datasets
• The optimal pipeline configuration - OPTIONAL
• Target prediction : classification and regression - dumping of predictions + optimal fitted pipeline
• Model interpretation : feature importances (saved)
• Leak detection : warning
• The predictions on the test set
Features
what is
Automated by MLBox ?
18. Quality: functional code : tested on Kaggle & codecov
Performance: fully distributed and optimised
AI: dumping and automatic reading of computations
Updates: latest algorithms
MLBox: about the package
Compatibility: Python 3.5-3.7 / Windows, Linux & Mac OS
Quick setup: $ pip install mlbox
User friendly: tutorials, docs, examples…
19. Conclusion: the benefits of auto-ML
Democratizes Machine Learning
Machine Learning for everybody (no coding)
Avoids errors
A robot never makes mistakes…
Increases productivity
Repetitive tasks are automated and accelerated ! A Data
Scientist can focus more on non-traditional issues !
2min
Data preprocessing and model tuning are both repetitive tasks that take a lot of time…
A Data Scientist is expensive !
2min
So why don’t we replace the DS by a robot ??? We would save time and money !
Let’s see what can be automated !
1min
Performance = computation time + accuracy
2min
2012-2017 : Machine Learning (understand the benefit of Machine Learning | identify use cases)
2018- ? : Auto ML (model deployment issue !)
2min
Companies will benefit from Auto-ml to deploy ML models !
Solution : identify supervised tasks that can be automated with a reusable/generic pipeline ! Other tasks are specifics (so difficult to be automated with a single tool/method: DS will focus on these tasks !).
Gap between IT companies that have already automated pipelines and other companies that are late !
2min
90% of machine learning tasks follow this pipeline
1min
Reading and creating the datasets has always been manual ! Here is the key point for automation !!!
1min
Preprocessing issues : drifts (ids), leaks, duplicates ??
Heuristics : str -> categorical features, long str -> text, …
Naive encoding : no NN !
1min
First step automated : no needs to automate more !! (some solution pretend to test « thousands » of models but it is useless !)
Optional inputs
Optimization algo : a lot of ways to tackle the high dimension issue
Meta learning !
1min
NO monitoring !
Feature interpretation : ok
3min
MLBox tackles merging !!
1min
Available on PyPI
Github with tutos, examples
Docs with articles, kaggle kernels, …
Performance : tested on Kaggle !
Features : drifts, embeddings, stacking, leak, feature importances,…
2min
2min
2min
2min
2min
2min
Democratize ML !!
Google CEO Sundar Pichai said that, “We hope AutoML will take an ability that a few Ph.D.s have today and will make it possible in three to five years for hundreds of thousands of developers to design new neural nets for their particular needs.”