This document discusses custom machine learning recipes in H2O Driverless AI. It introduces Bring Your Own Recipe (BYOR), which lets users write their own Python code snippets, called recipes, to customize Driverless AI's machine learning workflow. The document provides an overview of writing recipes, including how to build transformers, scorers, and models. It also covers testing and debugging recipes, and more advanced options such as importing packages and setting recipe parameters. The goal of BYOR is to give users flexibility and control over Driverless AI's automation so they can best solve their specific machine learning problems.
2. Confidential2 Confidential2
H2O Aquarium
• aquarium.h2o.ai
• H2O.ai’s software-as-a-service platform for training and initial exploration
• Recommended for training, workshops and tutorials
• Driverless AI Test Drive
• https://github.com/h2oai/tutorials/blob/master/DriverlessAI/Test-Drive/test-drive.md
• Your data will disappear after the time period
• Run as many times as needed
3. Confidential3
Make Your Own AI: Agenda
• Where does BYOR fit into Driverless AI?
• What are custom recipes?
• Tutorial: Using custom recipes
• What does it take to write a recipe?
• Example deep dive with the experts
4. Confidential4
Key Capabilities of H2O Driverless AI
• Automatic Feature Engineering
• Automatic Visualization
• Machine Learning Interpretability (MLI)
• Automatic Scoring Pipelines
• Natural Language Processing
• Time Series Forecasting
• Flexibility of Data & Deployment
• NVIDIA GPU Acceleration
• Bring-Your-Own Recipes
6. Confidential6
The Workflow of Driverless AI
[Workflow diagram]
1 Drag and Drop Data
• Data sources: SQL, HDFS, Snowflake, Amazon S3, Google BigQuery, Azure Blob Storage
2 Automatic Visualization
3 Bring Your Own Recipes
• Upload your own recipe(s): Transformations, Algorithms, Scorers
• Driverless AI executes automation on your recipes: feature engineering, model selection, hyper-parameter tuning, overfitting protection
4 Automatic Model Optimization
• Modelling dataset (X, Y)
• Model recipes: i.i.d. data, time-series, more on the way
• Advanced feature engineering + algorithm + model tuning: survival of the fittest
5 Automatic Scoring Pipelines
• Driverless AI automates model scoring and deployment using your recipes
• Deploy low-latency scoring to production
• Model documentation
7. Confidential7
What is a Recipe…
• Machine learning pipelines model prepped data to solve a business question
• Transformations are done on the original data to ensure it’s clean and most predictive
• Additional datasets may be brought in to add insights
• The data is modeled using an algorithm to find the optimal rules to solve the problem
• We determine the best model by using a specific metric, or scorer
• BYOR stands for Bring Your Own Recipe and it allows domain scientists to solve their
problems faster and with more precision by adding their expertise in the form of Python
code snippets
• By providing your own custom recipes, you can gain control over the optimization choices
that Driverless AI makes to best solve your machine learning problems
8. Confidential8
• Flexibility, extensibility and customizations built into the Driverless AI
platform
• New open source recipes built by the data science community, curated by
Kaggle Grand Masters @ H2O.ai
• Data scientists can focus on domain-specific functions to build
customizations
• 1-click upload of your recipes – models, scorers and transformations
• Driverless AI treats custom recipes as first-class citizens in the automatic
machine learning workflow
• Every business can have a recipe cookbook for collaborative data
science within their organization
…and Why Do You Care?
11. Confidential11
The Writing Recipes Process
• First write and test your idea on
sample data before wrapping it as
a recipe
• Download the Driverless AI
Recipes Repository for easy
access to examples
• Use the Recipe Templates to
ensure you have all required
components
https://github.com/h2oai/driverlessai-recipes
12. Confidential12
What does it take to write a custom recipe?
• Somewhere to write .py files
• To use or test your recipe you need Driverless AI 1.7.0 or later
• BYOR is not available in the current LTS release series (1.6.X)
• To test your code locally you need
• Python 3.6, numpy, datatable, & the Driverless AI Python client
• Python development environment such as PyCharm or Spyder
• To write recipes you need
• The ability to write Python code
13. Confidential13
The Testing Recipes Process
• Upload to Driverless AI to
automatically test on sample data
or
• Use the DAI Python or R client to
automate this process
or
• Test locally using a dummy
version of the RecipeTransformer
class we will be extending
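The local-testing option could look like the minimal sketch below. It assumes a hand-written stand-in base class (named `CustomTransformer` here, not imported from `h2oaicore`) and uses plain Python lists in place of datatable frames so it runs without Driverless AI installed. The `LogPlusOneTransformer` recipe is hypothetical.

```python
import math


class CustomTransformer:
    """Dummy stand-in for the Driverless AI base class (assumption: the
    real class lives in h2oaicore and offers far more functionality)."""

    def fit_transform(self, X, y=None):
        raise NotImplementedError

    def transform(self, X):
        raise NotImplementedError


class LogPlusOneTransformer(CustomTransformer):
    """Hypothetical recipe: log(1 + x) on one numeric column."""

    def fit_transform(self, X, y=None):
        # Nothing to learn for this stateless transformation.
        return self.transform(X)

    def transform(self, X):
        # Missing values (None) pass through unchanged.
        return [None if v is None else math.log1p(v) for v in X]


sample = [0.0, 1.0, None, 9.0]
result = LogPlusOneTransformer().fit_transform(sample)
print(result)
```

Once the logic behaves as expected on sample data, the same class body can be moved onto the real base class and uploaded for the automatic tests.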
14. Confidential14
What if I get stuck writing a custom recipe?
• Use error messages and stack traces from Driverless AI & your python development
environment to try to pinpoint what is causing the problem
• Write to the Driverless AI Experiment Logs (Example in Advanced Options below)
• Read the FAQ & look at the templates: https://github.com/h2oai/driverlessai-recipes
• Follow along with the tutorial (Coming Soon): https://h2oai.github.io/tutorials/
• Ask on the community channel: https://www.h2o.ai/community/
15. Confidential15
Build Your Own Recipe
Full customization of the entire ML Pipeline through scikit-learn Python API
Custom Feature Engineering – fit_transform & transform
• Custom statistical transformations and embeddings for numbers, categories,
text, date/time, time-series, image, audio, zip, lat/long, ICD, ...
Custom Optimization Functions – f(id, actual, predicted, weight)
• Ranking, Pricing, Yield Scoring, Cost/Reward, any Business Metrics
Custom ML Algorithms – fit & predict
• Access to ML ecosystem: H2O-3, sklearn, Keras, PyTorch, CatBoost, etc.
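As an illustration of the custom optimization function idea, here is a minimal sketch of a business metric with the shape f(actual, predicted, weight). The function name, the cost values, the threshold and the plain-list inputs are assumptions for illustration. This is not the Driverless AI scorer API, which wraps such logic in a recipe class.

```python
def business_cost(actual, predicted, weight=None,
                  fn_cost=5.0, fp_cost=1.0, threshold=0.5):
    """Weighted average cost (lower is better) where missing a positive
    (false negative) is more expensive than a false alarm."""
    if weight is None:
        weight = [1.0] * len(actual)
    total_cost = 0.0
    for a, p, w in zip(actual, predicted, weight):
        label = 1 if p >= threshold else 0
        if a == 1 and label == 0:
            total_cost += fn_cost * w   # false negative
        elif a == 0 and label == 1:
            total_cost += fp_cost * w   # false positive
    return total_cost / sum(weight)


actual = [1, 0, 1, 0]
predicted = [0.9, 0.7, 0.2, 0.1]
score = business_cost(actual, predicted)
print(score)
```

Encoding the asymmetric costs directly in the metric lets model selection optimize for the business outcome rather than a generic accuracy measure.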
19. Confidential19
Bring Your Own Recipes
• What is BYOR?
• Building a Transformer
• Building a Scorer
• Building a Model Algorithm
• Advanced Options
• Writing Recipes Help
20. Confidential20
Advanced Options: Importing Packages
• Install and use the exact version of the
exact package you need for your recipe
• _global_modules_needed_by_name
• Use before class definition for when there are
multiple recipes in one file that need the
package
• _modules_needed_by_name
• Use in the class definition
"""Row-by-row similarity between two text columns based
on FuzzyWuzzy"""
# https://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
# https://github.com/seatgeek/fuzzywuzzy
from h2oaicore.transformer_utils import CustomTransformer
import datatable as dt
import numpy as np
_global_modules_needed_by_name = ['nltk==3.4.3']
import nltk
21. Confidential21
Advanced Options: Similar Recipes
• Extend your custom recipes when there
are multiple options or similar methods
and you want all of them to be tested
class FuzzyQRatioTransformer(FuzzyBaseTransformer, CustomTransformer):
    _method = "QRatio"

class FuzzyWRatioTransformer(FuzzyBaseTransformer, CustomTransformer):
    _method = "WRatio"

class ZipcodeTypeTransformer(ZipcodeLightBaseTransformer, CustomTransformer):
    def get_property_name(self, value):
        return 'zip_code_type'

class ZipcodeCityTransformer(ZipcodeLightBaseTransformer, CustomTransformer):
    def get_property_name(self, value):
        return 'city'
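The subclasses in this pattern only set an attribute or override one small method, and a shared base class does the real work. Here is a minimal sketch of how such a base might read a `_method` attribute. It uses the standard library's `difflib.SequenceMatcher` in place of FuzzyWuzzy, and the class names and `score` method are hypothetical, not taken from the recipes repository.

```python
from difflib import SequenceMatcher


class SimilarityBase:
    """Hypothetical base: looks up the SequenceMatcher method named by
    the subclass's _method attribute."""
    _method = None  # set by each subclass

    def score(self, left, right):
        matcher = SequenceMatcher(None, left, right)
        return getattr(matcher, self._method)()


class RatioSimilarity(SimilarityBase):
    _method = "ratio"


class QuickRatioSimilarity(SimilarityBase):
    _method = "quick_ratio"


pairs = [("driverless", "driverless"), ("recipe", "receipt")]
for cls in (RatioSimilarity, QuickRatioSimilarity):
    print(cls.__name__, [round(cls().score(a, b), 3) for a, b in pairs])
```

Each variant costs one short class, so every option can be offered as its own recipe and tested alongside the others.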
22. Confidential22
Advanced Options: Recipe Parameters
• set_default_params
• Parameters of models or transformers
• Access in functions with self.params
from h2oaicore.models import CustomModel
from h2oaicore.systemutils import physical_cores_count

class ExtraTreesModel(CustomModel):
    _display_name = "ExtraTrees"
    _description = "Extra Trees Model based on sklearn"

    def set_default_params(self, accuracy=None, time_tolerance=None,
                           interpretability=None, **kwargs):
        # Defaults; n_estimators is capped at 1000
        self.params = dict(
            random_state=kwargs.get("random_state", 1234),
            n_estimators=min(kwargs.get("n_estimators", 100), 1000),
            criterion="gini" if self.num_classes >= 2 else "mse",
            n_jobs=self.params_base.get('n_jobs', max(1, physical_cores_count)))
23. Confidential23
Advanced Options: Recipe Parameters
• mutate_params
• Random permutations of parameter
options for transformers and models
• Can get the options chosen in final
model from Auto Doc
import numpy as np

class ExtraTreesModel(CustomModel):
    _display_name = "ExtraTrees"
    _description = "Extra Trees Model based on sklearn"

    def mutate_params(self, accuracy=10, **kwargs):
        # Higher accuracy settings allow larger ensembles
        if accuracy > 8:
            estimators_list = [100, 200, 300, 500, 1000, 2000]
        elif accuracy >= 5:
            estimators_list = [50, 100, 200, 300, 400, 500]
        else:
            estimators_list = [10, 50, 100, 150, 200, 250, 300]
        # Modify certain parameters for tuning
        self.params["n_estimators"] = int(np.random.choice(estimators_list))
        self.params["criterion"] = (
            np.random.choice(["gini", "entropy"]) if self.num_classes >= 2
            else np.random.choice(["mse", "mae"]))
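To see how `set_default_params` and `mutate_params` fit together, the following sketch mimics the two hooks on a standalone class. `ToyModel`, its constructor and the use of the standard `random` module instead of `np.random` are assumptions so that the example runs without Driverless AI. How and when the platform calls these hooks is not shown in the slides.

```python
import random


class ToyModel:
    """Standalone stand-in for a custom model recipe (hypothetical)."""

    def __init__(self, num_classes=2):
        self.num_classes = num_classes
        self.params = {}

    def set_default_params(self, accuracy=None, **kwargs):
        # Starting point: fixed defaults, capped at 1000 estimators.
        self.params = dict(
            random_state=kwargs.get("random_state", 1234),
            n_estimators=min(kwargs.get("n_estimators", 100), 1000),
            criterion="gini" if self.num_classes >= 2 else "mse",
        )

    def mutate_params(self, accuracy=10, **kwargs):
        # Tuning step: draw new values from an accuracy-dependent list.
        if accuracy > 8:
            estimators_list = [100, 200, 300, 500, 1000, 2000]
        elif accuracy >= 5:
            estimators_list = [50, 100, 200, 300, 400, 500]
        else:
            estimators_list = [10, 50, 100, 150, 200, 250, 300]
        self.params["n_estimators"] = random.choice(estimators_list)
        options = ["gini", "entropy"] if self.num_classes >= 2 else ["mse", "mae"]
        self.params["criterion"] = random.choice(options)


model = ToyModel(num_classes=2)
model.set_default_params(n_estimators=5000)
print("defaults:", model.params)
model.mutate_params(accuracy=3)
print("mutated:", model.params)
```

The defaults give a deterministic baseline, while each mutation overwrites only the parameters being tuned and leaves the rest (here `random_state`) untouched.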
24. Confidential24
Advanced Options: Writing to Logs
• Leave notes in the
experiment logs
from h2oaicore.systemutils import make_experiment_logger, loggerinfo, loggerwarning
...
if self.context and self.context.experiment_id:
    logger = make_experiment_logger(
        experiment_id=self.context.experiment_id,
        tmp_dir=self.context.tmp_dir,
        experiment_tmp_dir=self.context.experiment_tmp_dir
    )
...
loggerinfo(logger, "Prophet will use {} workers for fitting".format(n_jobs))
Editor's notes
Took ~ 6 minutes w/o pre-warming
/data/Smalldata/gbm_test/titanic.csv
/data/Kaggle/. CreditCard/CreditCard-train.csv
This is not up to date: https://github.com/h2oai/tutorials/blob/master/DriverlessAI/aquarium/aquarium.md
DAI quick overview
Types of problems we can handle: TS, NLP, bi, multi, regress
New engineered features to get new value out of your data
Not a black box!!! MLI & Autodoc
Production ready code (including all data transformations)
Recipes to augment this process with your business knowledge
Driverless AI is a platform that is applicable across industries
General purpose
It’s not built for a single vertical or use case, but can be used for a wide range of supervised problems
Name a few use cases & industries
Domain Scientists and SMEs are king when it comes to knowing their data and how to use it
Combine together – turbo charge time to solution
This is where recipes come in: we allow this expert knowledge to be added to Driverless AI, which refines the process for an individual use case
Horizontal, not vertical: core capabilities are agnostic, and specific datasets and use cases can be refined by domain expertise
Meant to save time, not replace people but augment them to make them more efficient and provide guidance on how to operate on the data
At a very high level, here’s how Driverless AI works:
Ingest data from any data source: Hadoop, Snowflake, S3 object storage, Google BigQuery – Driverless AI is agnostic about the data source.
Use Automatic Visualization and its various plots, graphics and charts to look at the data, and understand the data shape, outliers, missing values and so on. This is where a data scientist can quickly spot things such as bias in the data.
Based on the problem type, Driverless AI will use recipes to do advanced feature engineering (automatically), while the model continues to iterate across thousands of choices, does parameter tuning, and looks for the best fit of the model.
Finally, another amazing feature of Driverless AI is that it can build an automatic scoring pipeline, which means it can generate Python and Java code to deploy low latency scoring of that model into production. Imagine taking that scored model and propagating it across every edge device – on smart phones, or in cars, to continuously generate value.
Through this process, Machine Learning Interpretability gives the data scientist the reason codes and insight into what model was generated and which features were used to build the model. Automatic documentation gives one an in-depth explanation of the entire feature engineering process. This satisfies that desire to have trust in AI with explainability.
This entire process is done through a graphical user interface, making it easy for even a novice data scientist to be productive immediately.
Of course, acceleration to achieve faster time to insight is important, and an IBM Power System server with GPUs (such as the AC922) will give the highest level of acceleration to gain results and insights faster.
The slide makes reference to IID (Independent and Identically Distributed) data. This refers to data where the individual rows (or observations) are essentially independent of each other, unlike time series data, where there is a relationship between rows in the dataset because they are collected over time. For example, predicting whether a customer applying for a new mortgage is likely to default at some point in the future would use IID data such as age, income and current net worth. For time series data, think of a utility company that tracks resource utilization over days, weeks and years, where usage trends over those periods are a factor in the prediction.
Recipes: bring in your own domain knowledge
Bring in existing IP - reuse existing IP on top of the engine
Newer data scientists can use their senior’s IP ->