Building ML models is a time-consuming endeavor that requires a thorough understanding of feature engineering, feature selection, algorithm choice, and hyperparameter tuning. Extensive experimentation is required to arrive at a robust and performant model. Additionally, keeping track of the models that have been developed and deployed can be complex. Solving these challenges is key to successfully implementing end-to-end ML pipelines at scale.
In this talk, we will present a seamless integration of automated machine learning within a Databricks notebook, providing a unified analytics lifecycle for data scientists and business users with improved speed and efficiency. Specifically, we will show an app that generates and executes a Databricks notebook to automatically train an ML model with H2O’s Driverless AI. The resulting model is automatically tracked and managed with MLflow. Furthermore, we will show several deployment options for scoring new data, either on a Databricks cluster or with an external REST server, all within the app.
Accelerate Your ML Pipeline with AutoML and MLflow
1. Accelerate Your Machine Learning Pipeline with AutoML and MLflow
Elena Boiarskaia
Senior Data Scientist at H2O.ai
Eric Gudgion
Senior Principal Solutions Architect at H2O.ai
2. Agenda
§ Challenges with ML
§ Integration solution
§ Data Scientist workflow
§ Live Demo
§ DevOps workflow
3. Challenges with Machine Learning
• Feature Engineering
▪ Feature selection
▪ Feature transformation
▪ Select which feature transformers work
▪ Reuse engineered features
• Model Training
▪ Select scoring metric
▪ Algorithm selection
▪ Hyperparameter tuning
▪ Ensemble methods
• Model Deployment
▪ Track experiments
▪ Select which model to deploy
▪ Deploy models in a variety of environments
• Presenting Results
▪ User-friendly interface
▪ Build visualizations and dashboards
▪ Update visuals with model predictions
4. Solution via Integration
• Feature Engineering
Driverless AI:
▪ Automated feature selection
▪ Automated feature engineering
▪ Custom feature transformers, models and scoring metrics
• Model Training
Driverless AI:
▪ Genetic algorithm
▪ Algorithm selection
▪ Hyperparameter tuning
▪ Explainability
▪ Standalone model object (MOJO)
• Model Deployment
MLflow:
▪ Track experiments
▪ Reproducible projects
▪ Model deployment
▪ Model registry
• Presenting Results
Wave:
▪ Rapid app development
▪ Python based
▪ Create dashboards and visualizations
▪ Real-time apps connected to models and data sources
6. Example Notebook Workflow
▪ Manage: Store and manage data with Databricks Delta
▪ Prepare: Prepare data with Spark on Databricks
▪ Train: Send data to Driverless AI and train model
▪ Log: Log Driverless AI model in MLflow
▪ Score: Score new data with Driverless AI model on Databricks
▪ Update: Update table with Databricks Delta
8. Example Wave App Workflow
▪ Import data
▪ Set up Driverless AI experiment
▪ Automatically generate Databricks notebook to run and log experiment
▪ Send notebook to a Databricks cluster to run
▪ Trains Driverless AI model
▪ Logs model in MLflow
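
The notebook-generation step above can be pictured as filling in a template. The following is only a minimal sketch, not the app's actual implementation: it assumes the notebook is written in the Databricks `.py` source format (a `# Databricks notebook source` header with cells separated by `# COMMAND ----------`). The function `render_notebook`, its parameters, and the cell contents are hypothetical, and the Driverless AI training call is left as a placeholder comment because the slides do not show it.

```python
from string import Template

# Markers used by the Databricks ".py" source format for notebooks.
NOTEBOOK_HEADER = "# Databricks notebook source"
CELL_SEPARATOR = "\n\n# COMMAND ----------\n\n"

# Hypothetical cell templates; the Driverless AI step is only a placeholder
# because the talk does not show the client calls.
CELL_TEMPLATES = [
    Template('df = spark.table("$table")'),
    Template("# Placeholder: send df to Driverless AI and train on target '$target'"),
    Template(
        "import mlflow\n"
        'with mlflow.start_run(run_name="$experiment"):\n'
        '    mlflow.log_param("target", "$target")'
    ),
]


def render_notebook(table: str, target: str, experiment: str) -> str:
    """Fill the cell templates and join them into one notebook source file."""
    values = {"table": table, "target": target, "experiment": experiment}
    cells = [template.substitute(values) for template in CELL_TEMPLATES]
    return NOTEBOOK_HEADER + "\n" + CELL_SEPARATOR.join(cells) + "\n"
```

Calling `render_notebook("sales", "churn", "exp1")` returns the notebook as a single string. Uploading it to a workspace and running it on a cluster is a separate step that is not sketched here.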
20. External REST API Scorer
• Invoke call from Databricks worker nodes
• Notebook-friendly API
• Create DataFrame
• Specify model to score
• Call endpoint
• Call multiple models in one call
• Returns DataFrame per model
• Handy to test new vs. old model
[Diagram: Data Warehouse, REST Endpoint, Predictions (MOJO), Other Model Users]
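
To make the "multiple models in one call" idea concrete, here is a minimal sketch of the client side. The request and response shapes (`models`, `rows`, `results`) are assumptions made for illustration, not the scorer's documented API, and plain lists of dictionaries stand in for DataFrames so the example needs only the standard library.

```python
import json
from typing import Dict, List

Row = Dict[str, float]


def build_request(rows: List[Row], models: List[str]) -> str:
    """Serialize one scoring request that names every model to call."""
    return json.dumps({"models": models, "rows": rows})


def split_response(body: str) -> Dict[str, List[Row]]:
    """Return one table (list of row dicts) per model from the response."""
    payload = json.loads(body)
    return {name: result["rows"] for name, result in payload["results"].items()}


def max_abs_difference(old: List[Row], new: List[Row], column: str) -> float:
    """Largest per-row gap between two models' predictions for one column."""
    return max(abs(o[column] - n[column]) for o, n in zip(old, new))
```

Keeping one table per model makes the new-versus-old comparison a simple row-wise operation, as `max_abs_difference` shows.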
22. External Batch Scorer
• Use Databricks as a JDBC data source
• Supports Delta tables and any tables accessible with SQL
• Connects via JDBC connection
[Diagram: Data Warehouse, Batch Job (multiple CPUs), Predictions (MOJO)]
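
A batch scorer of this kind typically reads the SQL result in chunks rather than loading the whole table. The sketch below illustrates that pattern under loud assumptions: Python's built-in `sqlite3` stands in for the JDBC connection to Databricks, and `score_row` is a hypothetical placeholder for the MOJO scoring call. Neither is the scorer's actual interface.

```python
import sqlite3
from typing import Callable, Iterator, List, Tuple


def read_batches(conn: sqlite3.Connection, query: str, size: int) -> Iterator[List[Tuple]]:
    """Yield query results in fixed-size batches instead of loading all rows."""
    cursor = conn.execute(query)
    while True:
        batch = cursor.fetchmany(size)
        if not batch:
            return
        yield batch


def score_table(
    conn: sqlite3.Connection,
    query: str,
    score_row: Callable[[Tuple], float],
    size: int = 1000,
) -> List[float]:
    """Score every row returned by the query, one batch at a time."""
    predictions: List[float] = []
    for batch in read_batches(conn, query, size):
        predictions.extend(score_row(row) for row in batch)
    return predictions
```

Because each batch is independent, batches could also be handed to separate workers. That parallel step is not shown here.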
24. Saving Results
§ Update existing table or insert into a new table
§ Update method
▪ Single row
▪ Bulk upload
§ Catching runtime data changes
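
The difference between the two update methods can be sketched as follows. This is an illustration only: `sqlite3` stands in for the target table, and the `scores` table, its columns, and both function names are hypothetical.

```python
import sqlite3
from typing import Iterable, Tuple

Prediction = Tuple[float, int]  # (prediction, row id)

UPDATE_SQL = "UPDATE scores SET prediction = ? WHERE id = ?"


def update_single_rows(conn: sqlite3.Connection, preds: Iterable[Prediction]) -> None:
    """Write predictions with one statement and one commit per row."""
    for pred in preds:
        conn.execute(UPDATE_SQL, pred)
        conn.commit()


def update_bulk(conn: sqlite3.Connection, preds: Iterable[Prediction]) -> None:
    """Write all predictions in one batched call and a single commit."""
    conn.executemany(UPDATE_SQL, list(preds))
    conn.commit()
```

Single-row updates make each prediction visible as soon as it is written, while the bulk path trades that immediacy for fewer round trips.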
25. Conclusion
Databricks and H2O.ai integration offers:
• End-to-end pipeline from data management to model deployment
• Highly scalable model training and scoring
• Advanced automated ML with Driverless AI
• Advanced feature engineering and feature selection
• Highly accurate and explainable models