SlideShare una empresa de Scribd logo
1 de 22
BY
AXEL DE ROMBLAY & HENRI GÉRARD
HTTPS://GITHUB.COM/AXELDEROMBLAY/MLBOX
• INTRODUCTION : WHY AUTO-ML IS GAINING MOMENTUM ?
• FOCUS ON THE AUTOMATION PROCESS
• MLBOX COMPARED WITH OTHER AUTOMATED TOOLS
• CONCLUSION
• LIVE DEMO
• Q&A
Data ScientistData Computation means
Data pre-processing Model tuning
Machine Learning
Almost an automated process…
Auto Machine Learning
A fully automated process
Data Computation meansRobot
• Supervised tasks
- classification
- regression
• Structured data
- csv files
- json files
- …
• Unsupervised tasks
- outlier detection
- clustering
- …
• Unstructured data
- images
- texts
- …
What is auto-ML ?
We want to automate…
…the maximum number of steps in a ML pipeline…
…with minimum human intervention…
…while conserving a high performance !
Data
cleaning
(duplicates, ids,
correlations,
leaks, … )
Data
encoding
(NA, dates, text,
categorical
features, … )
STEP 2 : Preprocessing
STEP 1 : Reading /
merging
STEP 3 : Optimisation
Feature
selection
Feature
engineering
Model
selection
Prediction
Model
interpretation
STEP 4 : Application
Focus on the automation process
Diagram of a standard ML pipeline
Step 1: reading and merging
Maybe the hardest step. Not a priority for auto-ML at
the moment.
Some inputs : paths to the data + target name
Difficult to auto-merge different sources
Step 2: preprocessing
Not all packages tackle preprocessing
No inputs : heuristics to detect feature types are easy
Naive encoding for most auto-ML packages
Step 3: optimization
Top priority for auto-ML community
Some inputs : scoring function + a hyper-parameter space
Computation time can be long !
Step 4: application
Prediction is also a priority for auto-ML
No auto-monitoring after model deployment
Fast and accurate step
Maintenance
Ease of setup
Automated
steps
- Reading
- Encoding
- Optimisation
- Encoding
- Optimisation
- Reading/merging
- Cleaning
- Encoding
- Optimisation
- Encoding
- Optimisation - Optimisation
Quality of
automation
Open source ?
Overview of various solutions
MLBox: a fully automated package
STEP 1 : Reading / merging
From several raw datasets to one structured dataset.
• List of paths to all the datasets
• Target name
• Reading of several files (csv, xls, json and hdf5)
• Auto-merging of all the sources – information crunching
• Task detection (binary/multiclass classification or regression)
• Split between train and test sets
• Basic information display
• Dense structured train and test files (without duplicates)
Features
what is
Automated by MLBox ?
• A dataset
• Auto-cleaning/dropping : duplicates, drifts / covariate shifts (*), high correlations, highly sparse features
/ samples, constants, …
• Feature encoding : missing values, lists, dates, categorical features, text, … - SEVERAL STRATEGIES
• A cleaned dataset with numerical features
STEP 2 : Preprocessing
From a dirty dataset to a cleaned numerical one
Features
what is
Automated by MLBox ?
STEP 3 : Optimisation
A wide range of models are tested and cross-validated
• A metric (a wide choice, otherwise can be implemented) - OPTIONAL
• A hyper-parameter space – OPTIONAL
• The train set
• Feature engineering : using neural networks (*)
• Feature selection : filter methods, wrapper methods, embedded methods
• Model selection : a wide range of accurate models (LightGBM, XGBoost, Random Forest, Linear, …)
• Hyper-parameter optimisation : TPE (Bayesian optimisation method) – dumping of fitted pipelines
• Ensembling : multi-layer stacking, boosting, bagging, …
• The optimal pipeline configuration
Features
what is
Automated by MLBox ?
STEP 4 : Prediction
The best model is fitted and predicts on the test set
• Train and test datasets
• The optimal pipeline configuration - OPTIONAL
• Target prediction : classification and regression - dumping of predictions + optimal fitted pipeline
• Model interpretation : feature importances (saved)
• Leak detection : warning
• The predictions on the test set
Features
what is
Automated by MLBox ?
 Quality: functional code : tested on Kaggle & codecov
 Performance: fully distributed and optimised
 AI: dumping and automatic reading of computations
 Updates: latest algorithms
MLBox: about the package
 Compatibility: Python 3.5-3.7 / Windows, Linux & Mac OS
 Quick setup: $ pip install mlbox
 User friendly: tutorials, docs, examples…
Conclusion: the benefits of auto-ML
Democratizes Machine Learning
Machine Learning for everybody (no coding)
Avoids errors
A robot never makes mistakes…
Increases productivity
Repetitive tasks are automated and accelerated ! A Data
Scientist can focus more on non-traditional issues !
Live demo
https://colab.research.google.com/drive/1EwYZg_4yayagfQfOrXUUFidUuYbj0fEi
Next releases
• FEATURES
TEXT & TIME SERIES ENCODING
AUTO OPTIMISATION (USING TUNE)
MODEL INTERPRETATION (USING LIME)
• PERFORMANCES
PIPELINE PARALLELISM PER VARIABLE TYPE (USING COLUMNTRANSFORMER IN SKLEARN)
PIPELINE MEMORY IMPROVEMENTS
• ERGONOMY
FIT AND PREDICT TASKS SPLIT
VERBOSITY IMPROVEMENTS (USING MLFLOW…)
ERGONOMY IMPROVEMENTS (PARAMETERS NAMES, …)
Thank you !
Questions ?
Please star or share the repo !

Más contenido relacionado

La actualidad más candente

Jake Mannix, MLconf 2013
Jake Mannix, MLconf 2013Jake Mannix, MLconf 2013
Jake Mannix, MLconf 2013MLconf
 
Machine Learning for (JVM) Developers
Machine Learning for (JVM) DevelopersMachine Learning for (JVM) Developers
Machine Learning for (JVM) DevelopersMateusz Dymczyk
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine LearningYuriy Guts
 
Making Netflix Machine Learning Algorithms Reliable
Making Netflix Machine Learning Algorithms ReliableMaking Netflix Machine Learning Algorithms Reliable
Making Netflix Machine Learning Algorithms ReliableJustin Basilico
 
AutoML - The Future of AI
AutoML - The Future of AIAutoML - The Future of AI
AutoML - The Future of AINing Jiang
 
Microsoft Introduction to Automated Machine Learning
Microsoft Introduction to Automated Machine LearningMicrosoft Introduction to Automated Machine Learning
Microsoft Introduction to Automated Machine LearningSetu Chokshi
 
Automatic Machine Learning, AutoML
Automatic Machine Learning, AutoMLAutomatic Machine Learning, AutoML
Automatic Machine Learning, AutoMLHimadri Mishra
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine Learningsafa cimenli
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkIvo Andreev
 
MLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkMLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkPetr Zapletal
 
Automatic machine learning (AutoML) 101
Automatic machine learning (AutoML) 101Automatic machine learning (AutoML) 101
Automatic machine learning (AutoML) 101QuantUniversity
 
Robust and declarative machine learning pipelines for predictive buying at Ba...
Robust and declarative machine learning pipelines for predictive buying at Ba...Robust and declarative machine learning pipelines for predictive buying at Ba...
Robust and declarative machine learning pipelines for predictive buying at Ba...Gianmario Spacagna
 
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...Spark Summit
 
A Generic Neural Network Architecture to Infer Heterogeneous Model Transforma...
A Generic Neural Network Architecture to Infer Heterogeneous Model Transforma...A Generic Neural Network Architecture to Infer Heterogeneous Model Transforma...
A Generic Neural Network Architecture to Infer Heterogeneous Model Transforma...Lola Burgueño
 
Open and Automated Machine Learning
Open and Automated Machine LearningOpen and Automated Machine Learning
Open and Automated Machine LearningJoaquin Vanschoren
 
Automated Machine Learning (Auto ML)
Automated Machine Learning (Auto ML)Automated Machine Learning (Auto ML)
Automated Machine Learning (Auto ML)Hayim Makabee
 
Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopGrant Ingersoll
 
Machine Learning for .NET Developers - ADC21
Machine Learning for .NET Developers - ADC21Machine Learning for .NET Developers - ADC21
Machine Learning for .NET Developers - ADC21Gülden Bilgütay
 
SDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the whySDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the whyKorea Sdec
 

La actualidad más candente (20)

Jake Mannix, MLconf 2013
Jake Mannix, MLconf 2013Jake Mannix, MLconf 2013
Jake Mannix, MLconf 2013
 
Machine Learning for (JVM) Developers
Machine Learning for (JVM) DevelopersMachine Learning for (JVM) Developers
Machine Learning for (JVM) Developers
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine Learning
 
Making Netflix Machine Learning Algorithms Reliable
Making Netflix Machine Learning Algorithms ReliableMaking Netflix Machine Learning Algorithms Reliable
Making Netflix Machine Learning Algorithms Reliable
 
AutoML - The Future of AI
AutoML - The Future of AIAutoML - The Future of AI
AutoML - The Future of AI
 
Microsoft Introduction to Automated Machine Learning
Microsoft Introduction to Automated Machine LearningMicrosoft Introduction to Automated Machine Learning
Microsoft Introduction to Automated Machine Learning
 
Automatic Machine Learning, AutoML
Automatic Machine Learning, AutoMLAutomatic Machine Learning, AutoML
Automatic Machine Learning, AutoML
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine Learning
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it Work
 
MLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkMLlib and Machine Learning on Spark
MLlib and Machine Learning on Spark
 
Automatic machine learning (AutoML) 101
Automatic machine learning (AutoML) 101Automatic machine learning (AutoML) 101
Automatic machine learning (AutoML) 101
 
Robust and declarative machine learning pipelines for predictive buying at Ba...
Robust and declarative machine learning pipelines for predictive buying at Ba...Robust and declarative machine learning pipelines for predictive buying at Ba...
Robust and declarative machine learning pipelines for predictive buying at Ba...
 
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
 
A Generic Neural Network Architecture to Infer Heterogeneous Model Transforma...
A Generic Neural Network Architecture to Infer Heterogeneous Model Transforma...A Generic Neural Network Architecture to Infer Heterogeneous Model Transforma...
A Generic Neural Network Architecture to Infer Heterogeneous Model Transforma...
 
Open and Automated Machine Learning
Open and Automated Machine LearningOpen and Automated Machine Learning
Open and Automated Machine Learning
 
Automated Machine Learning (Auto ML)
Automated Machine Learning (Auto ML)Automated Machine Learning (Auto ML)
Automated Machine Learning (Auto ML)
 
Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC Hadoop
 
OpenML 2019
OpenML 2019OpenML 2019
OpenML 2019
 
Machine Learning for .NET Developers - ADC21
Machine Learning for .NET Developers - ADC21Machine Learning for .NET Developers - ADC21
Machine Learning for .NET Developers - ADC21
 
SDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the whySDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the why
 

Similar a Auto-ML Tools Compared: MLBox Automates Machine Learning

Cutting Edge Computer Vision for Everyone
Cutting Edge Computer Vision for EveryoneCutting Edge Computer Vision for Everyone
Cutting Edge Computer Vision for EveryoneIvo Andreev
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsAnyscale
 
Automated product categorization
Automated product categorizationAutomated product categorization
Automated product categorizationAndreas Loupasakis
 
Automated product categorization
Automated product categorization   Automated product categorization
Automated product categorization Warply
 
Scaling up Machine Learning Development
Scaling up Machine Learning DevelopmentScaling up Machine Learning Development
Scaling up Machine Learning DevelopmentMatei Zaharia
 
Advances in Scientific Workflow Environments
Advances in Scientific Workflow EnvironmentsAdvances in Scientific Workflow Environments
Advances in Scientific Workflow EnvironmentsCarole Goble
 
Productionalizing ML : Real Experience
Productionalizing ML : Real ExperienceProductionalizing ML : Real Experience
Productionalizing ML : Real ExperienceIhor Bobak
 
END-TO-END MACHINE LEARNING STACK
END-TO-END MACHINE LEARNING STACKEND-TO-END MACHINE LEARNING STACK
END-TO-END MACHINE LEARNING STACKJan Wiegelmann
 
Data Management - Full Stack Deep Learning
Data Management - Full Stack Deep LearningData Management - Full Stack Deep Learning
Data Management - Full Stack Deep LearningSergey Karayev
 
Reproducibility and automation of machine learning process
Reproducibility and automation of machine learning processReproducibility and automation of machine learning process
Reproducibility and automation of machine learning processDenis Dus
 
The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?Ivo Andreev
 
Software Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringSoftware Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringTao Xie
 
Best Practices with OLAP Modeling with Cognos Transformer (Cognos 8)
Best Practices with OLAP Modeling with Cognos Transformer (Cognos 8)Best Practices with OLAP Modeling with Cognos Transformer (Cognos 8)
Best Practices with OLAP Modeling with Cognos Transformer (Cognos 8)Senturus
 
Sumo Logic QuickStart Webinar - Jan 2016
Sumo Logic QuickStart Webinar - Jan 2016Sumo Logic QuickStart Webinar - Jan 2016
Sumo Logic QuickStart Webinar - Jan 2016Sumo Logic
 
A machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesA machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesDataWorks Summit
 
Webinar: Fusion 3.1 - What's New
Webinar: Fusion 3.1 - What's NewWebinar: Fusion 3.1 - What's New
Webinar: Fusion 3.1 - What's NewLucidworks
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....Databricks
 
Welcome Webinar Slides
Welcome Webinar SlidesWelcome Webinar Slides
Welcome Webinar SlidesSumo Logic
 

Similar a Auto-ML Tools Compared: MLBox Automates Machine Learning (20)

Cutting Edge Computer Vision for Everyone
Cutting Edge Computer Vision for EveryoneCutting Edge Computer Vision for Everyone
Cutting Edge Computer Vision for Everyone
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
 
Ember
EmberEmber
Ember
 
Automated product categorization
Automated product categorizationAutomated product categorization
Automated product categorization
 
Automated product categorization
Automated product categorization   Automated product categorization
Automated product categorization
 
Scaling up Machine Learning Development
Scaling up Machine Learning DevelopmentScaling up Machine Learning Development
Scaling up Machine Learning Development
 
Advances in Scientific Workflow Environments
Advances in Scientific Workflow EnvironmentsAdvances in Scientific Workflow Environments
Advances in Scientific Workflow Environments
 
Productionalizing ML : Real Experience
Productionalizing ML : Real ExperienceProductionalizing ML : Real Experience
Productionalizing ML : Real Experience
 
Taming the shrew Power BI
Taming the shrew Power BITaming the shrew Power BI
Taming the shrew Power BI
 
END-TO-END MACHINE LEARNING STACK
END-TO-END MACHINE LEARNING STACKEND-TO-END MACHINE LEARNING STACK
END-TO-END MACHINE LEARNING STACK
 
Data Management - Full Stack Deep Learning
Data Management - Full Stack Deep LearningData Management - Full Stack Deep Learning
Data Management - Full Stack Deep Learning
 
Reproducibility and automation of machine learning process
Reproducibility and automation of machine learning processReproducibility and automation of machine learning process
Reproducibility and automation of machine learning process
 
The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?
 
Software Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringSoftware Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software Engineering
 
Best Practices with OLAP Modeling with Cognos Transformer (Cognos 8)
Best Practices with OLAP Modeling with Cognos Transformer (Cognos 8)Best Practices with OLAP Modeling with Cognos Transformer (Cognos 8)
Best Practices with OLAP Modeling with Cognos Transformer (Cognos 8)
 
Sumo Logic QuickStart Webinar - Jan 2016
Sumo Logic QuickStart Webinar - Jan 2016Sumo Logic QuickStart Webinar - Jan 2016
Sumo Logic QuickStart Webinar - Jan 2016
 
A machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesA machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companies
 
Webinar: Fusion 3.1 - What's New
Webinar: Fusion 3.1 - What's NewWebinar: Fusion 3.1 - What's New
Webinar: Fusion 3.1 - What's New
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
 
Welcome Webinar Slides
Welcome Webinar SlidesWelcome Webinar Slides
Welcome Webinar Slides
 

Último

Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Patrick Diehl
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfSumit Kumar yadav
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bSérgio Sacani
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PPRINCE C P
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxpradhanghanshyam7136
 
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxBroad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxjana861314
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCEPRINCE C P
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfnehabiju2046
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...jana861314
 

Último (20)

Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C P
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptx
 
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxBroad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdf
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
 

Auto-ML Tools Compared: MLBox Automates Machine Learning

  • 1. BY AXEL DE ROMBLAY & HENRI GÉRARD
  • 3. • INTRODUCTION : WHY AUTO-ML IS GAINING MOMENTUM ? • FOCUS ON THE AUTOMATION PROCESS • MLBOX COMPARED WITH OTHER AUTOMATED TOOLS • CONCLUSION • LIVE DEMO • Q&A
  • 4. Data ScientistData Computation means Data pre-processing Model tuning Machine Learning Almost an automated process…
  • 5. Auto Machine Learning A fully automated process Data Computation meansRobot • Supervised tasks - classification - regression • Structured data - csv files - json files - … • Unsupervised tasks - outlier detection - clustering - … • Unstructured data - images - texts - …
  • 6. What is auto-ML ? We want to automate… …the maximum number of steps in a ML pipeline… …with minimum human intervention… …while conserving a high performance !
  • 7. Data cleaning (duplicates, ids, correlations, leaks, … ) Data encoding (NA, dates, text, categorical features, … ) STEP 2 : Preprocessing STEP 1 : Reading / merging STEP 3 : Optimisation Feature selection Feature engineering Model selection Prediction Model interpretation STEP 4 : Application Focus on the automation process Diagram of a standard ML pipeline
  • 8. Step 1: reading and merging Maybe the hardest step. Not a priority for auto-ML at the moment. Some inputs : paths to the data + target name Difficult to auto-merge different sources
  • 9. Step 2: preprocessing Not all packages tackle preprocessing No inputs : heuristics to detect feature types are easy Naive encoding for most auto-ML packages
  • 10. Step 3: optimization Top priority for auto-ML community Some inputs : scoring function + a hyper-parameter space Computation time can be long !
  • 11. Step 4: application Prediction is also a priority for auto-ML No auto-monitoring after model deployment Fast and accurate step
  • 12. Maintenance Ease of setup Automated steps - Reading - Encoding - Optimisation - Encoding - Optimisation - Reading/merging - Cleaning - Encoding - Optimisation - Encoding - Optimisation - Optimisation Quality of automation Open source ? Overview of various solutions
  • 13. MLBox: a fully automated package
  • 14. STEP 1 : Reading / merging From several raw datasets to one structured dataset. • List of paths to all the datasets • Target name • Reading of several files (csv, xls, json and hdf5) • Auto-merging of all the sources – information crunching • Task detection (binary/multiclass classification or regression) • Split between train and test sets • Basic information display • Dense structured train and test files (without duplicates) Features what is Automated by MLBox ?
  • 15. • A dataset • Auto-cleaning/dropping : duplicates, drifts / covariate shifts (*), high correlations, highly sparse features / samples, constants, … • Feature encoding : missing values, lists, dates, categorical features, text, … - SEVERAL STRATEGIES • A cleaned dataset with numerical features STEP 2 : Preprocessing From a dirty dataset to a cleaned numerical one Features what is Automated by MLBox ?
  • 16. STEP 3 : Optimisation A wide range of models are tested and cross-validated • A metric (a wide choice, otherwise can be implemented) - OPTIONAL • A hyper-parameter space – OPTIONAL • The train set • Feature engineering : using neural networks (*) • Feature selection : filter methods, wrapper methods, embedded methods • Model selection : a wide range of accurate models (LightGBM, XGBoost, Random Forest, Linear, …) • Hyper-parameter optimisation : TPE (Bayesian optimisation method) – dumping of fitted pipelines • Ensembling : multi-layer stacking, boosting, bagging, … • The optimal pipeline configuration Features what is Automated by MLBox ?
  • 17. STEP 4 : Prediction The best model is fitted and predicts on the test set • Train and test datasets • The optimal pipeline configuration - OPTIONAL • Target prediction : classification and regression - dumping of predictions + optimal fitted pipeline • Model interpretation : feature importances (saved) • Leak detection : warning • The predictions on the test set Features what is Automated by MLBox ?
  • 18.  Quality: functional code : tested on Kaggle & codecov  Performance: fully distributed and optimised  AI: dumping and automatic reading of computations  Updates: latest algorithms MLBox: about the package  Compatibility: Python 3.5-3.7 / Windows, Linux & Mac OS  Quick setup: $ pip install mlbox  User friendly: tutorials, docs, examples…
  • 19. Conclusion: the benefits of auto-ML Democratizes Machine Learning Machine Learning for everybody (no coding) Avoids errors A robot never makes mistakes… Increases productivity Repetitive tasks are automated and accelerated ! A Data Scientist can focus more on non-traditional issues !
  • 21. Next releases • FEATURES TEXT & TIME SERIES ENCODING AUTO OPTIMISATION (USING TUNE) MODEL INTERPRETATION (USING LIME) • PERFORMANCES PIPELINE PARALLELISM PER VARIABLE TYPE (USING COLUMNTRANSFORMER IN SKLEARN) PIPELINE MEMORY IMPROVEMENTS • ERGONOMY FIT AND PREDICT TASKS SPLIT VERBOSITY IMPROVEMENTS (USING MLFLOW…) ERGONOMY IMPROVEMENTS (PARAMETERS NAMES, …)
  • 22. Thank you ! Questions ? Please star or share the repo !

Notas del editor

  1. 1min
  2. 1min
  3. 1min
  4. 2min Data preprocessing and model tuning are both repetitive tasks that take a lot of time… A Data Scientist is expensive !
  5. 2min So why don’t we replace the DS by a robot ??? We would save time and money ! Let’s see what can be automated !
  6. 1min Performance = computation time + accuracy
  7. 2min 2012-2017 : Machine Learning (understand the benefit of Machine Learning | identify use cases) 2018- ? : Auto ML (model deployment issue !)
  8. 2min Companies will benefit from Auto-ml to deploy ML models ! Solution : identify supervised tasks that can be automated with a reusable/generic pipeline ! Other tasks are specifics (so difficult to be automated with a single tool/method: DS will focus on these tasks !). Gap between IT companies that have already automated pipelines and other companies that are late !
  9. 2min 90% of machine learning tasks follow this pipeline
  10. 1min Reading and creating the datasets has always been manual ! Here is the key point for automation !!!
  11. 1min Preprocessing issues : drifts (ids), leaks, duplicates ?? Heuristics : str -> categorical features, long str -> text, … Naive encoding : no NN !
  12. 1min First step automated : no needs to automate more !! (some solution pretend to test « thousands » of models but it is useless !) Optional inputs Optimization algo : a lot of ways to tackle the high dimension issue Meta learning !
  13. 1min NO monitoring ! Feature interpretation : ok
  14. 3min MLBox tackles merging !!
  15. 1min Available on PyPI Github with tutos, examples Docs with articles, kaggle kernels, … Performance : tested on Kaggle ! Features : drifts, embeddings, stacking, leak, feature importances,…
  16. 2min
  17. 2min
  18. 2min
  19. 2min
  20. 2min
  21. 2min Democratize ML !! Google CEO Sundar Pichai said that, “We hope AutoML will take an ability that a few Ph.D.s have today and will make it possible in three to five years for hundreds of thousands of developers to design new neural nets for their particular needs.”
  22. 10min THANKS ! Q&A ?
  23. 10min THANKS ! Q&A ?
  24. 10min THANKS ! Q&A ?