Automatic and Interpretable Machine Learning in R with H2O and LIME (Milan Edition)

Automatic and Interpretable Machine
Learning in R with H2O and LIME
Jo-fai (Joe) Chow
Data Science Evangelist /
Community Manager
joe@h2o.ai
@matlabulous
Download → https://bit.ly/
joe_eRum_2018

Agenda
2
Time Topics / Tasks
7:00 – 7:30 pm Welcome + Data Hack Italia
7:30 – 7:45 pm
Install h2o, lime, mlbench from CRAN
slides/code: bit.ly/joe_eRum_2018
7:45 – 8:00 pm Introduction (H2O, AutoML, LIME)
8:00 – 8:30 pm Regression Example
8:30 – 9:00 pm 🍕 🍕 🍕 🍕 🍕
9:00 – 9:20 pm Classification Example
9:20 – 9:30 pm Quick Recap
9:30 – 9:45 pm Real Use-Case: Moneyball
9:45 – 10:00 pm Other H2O News + Q & A

3
Time Topics / Tasks
7:30 – 7:45 pm
mlbench for datasets
install.packages(‘mlbench’)

4
Install H2O using tar.gz (if needed)
Go to https://www.h2o.ai/download/
Click
Click
Unzip
Install

5
Time Topics / Tasks
7:30 – 7:45 pm
8:30 – 9:00 pm

Why?
• Most users/organizations can
benefit from automatic machine
learning pipelines.
• Eliminate time wasted on human
errors, debugging etc.
• Model interpretations is crucial for
those who must explain their
models to regulators or customers.
You will learn …
• How to build high quality H2O
models (almost) automatically.
• How to explain predictions from
complex H2O models with LIME.
• Bonus: A real use-case that led to
multimillion-dollar baseball
decisions earlier this year.
6

+ +
$20Mtrade 2 weeks prior to the
season beginning
AutoML + LIME + Shiny = Real Moneyball

About Me
8
• Before H2O
• Water Engineer / EngD Researcher /
Matlab Fan Boy
(wonder why @matlabulous?)
• Discovered R, Python, H2O …
never look back again
• Data Scientist at Virgin Media (UK),
Domino Data Lab (US)
• At H2O …
• Data Scientist / Evangelist /
• Sales Engineer / Solution Architect /
• Community Manager
… The harsh reality of startup life …
• H2O SWAG Photographer
#AroundTheWorldWithH2Oai
Love H2O? Get some stickers!
Barcelona
Berlin
Brussels Paris
Milan (2016)

9
Reminder: #360Selfie
Budapest
Cologne
ToulouseAmsterdam
BerlinLondon

About H2O.ai …
Have you seen Avengers: Infinity War?
Do you know all the characters in the movie? (No spoilers - I promise)
10

11
We develop
machine learning platforms

CONFIDENTIAL
Gartner names H2O as Leader with the most completeness of vision
• H2O.ai recognized as a technology
leader with most completeness of
vision
• H2O.ai was recognized for the
mindshare, partner network and
status as a quasi-industry standard
for machine learning and AI.
• H2O customers gave the highest
overall score among all the vendors
for sales relationship and account
management, customer support
(onboarding, troubleshooting, etc.)
and overall service and support.

CONFIDENTIAL
Platforms with H2O integration
H2O + KNIME Talk
at KNIME Summit
Mar 2017

Founded 2012, Series C in Nov, 2017
Products • Driverless AI – Automated Machine Learning
• H2O Open Source Machine Learning
• Sparkling Water
Mission Democratize AI. Do Good
Team ~100 employees
• Distributed Systems Engineers doing Machine Learning
• World-class visualization designers
Offices Mountain View, London, Prague
14
Company Overview

H2O Products
In-Memory, Distributed
Machine Learning Algorithms
with H2O Flow GUI
H2O AI Open Source Engine
Integration with Spark
Lightning Fast machine
learning on GPUs
Automatic feature
engineering, machine
learning and interpretability
Secure multi-tenant H2O clusters

H2O Products
with H2O Flow GUI
learning on GPUs
Automatic feature
with H2O Flow GUI
This Workshop

CONFIDENTIAL
CONFIDENTIAL
Worldwide Recognition in the H2O.ai Community
Financial InsuranceMarketingHW
Vendors
Retail Advisory &
Accounting
Healthcare
“H2O.ai's reference customers gave it the highest overall score for sales
relationship and overall service and support” - Gartner MQ 2018
Open Source
Community Paying Customers

19
Our Mission:
Make Machine Learning Accessible to Everyone

Scientific Advisory Council
20

22H2O Team
Origin of R Package `ggplot2`

24H2O Team
Erin LeDell, Chief ML Scientist
Women in ML/DS & R-Ladies Global

25H2O Team
Erin LeDell Navdeep Gill
H2O AutoML

26H2O Team
1st
4th
25th
48th
33rd
Kaggle Grand Masters
(and their Highest Rank)
13th
About 80,000 Kagglers

27H2O Team
1st
4th
25th
48th
33rd
13th
181st
Hoping to get closer to them at some point …

28
https://www.meetup.com/pro/h2oai/

29
Encourage your friends/
colleagues to give a talk.
joe@h2o.ai
We sponsor
meetups
We encourage
diversity
Source: https://www.maddycoupons.in/blog/yummy-yummy-pizza/
… and more
Joe’s
Meetups
Female
Speaker
Female Speaker
Ratio
London
Dec 2017
Kasia Kulma 1/3
Amsterdam
Feb 2018
Andreea
Bejinaru
1/2
London
Mar 2018
Cheuk Ting Ho 1/3
London
Jun 2018
Torgyn
Shaikhina
1/4
Overall (so far) = 4/12 = 33.3%

About H2O AutoML
30
Automatic Machine Learning with H2O
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html

HDFS
S3
NFS
Distributed
In-Memory
Load Data
Loss-less
Compression
H2O Compute Engine
Production Scoring Environment
Exploratory &
Descriptive
Analysis
Feature
Engineering &
Selection
Supervised &
Unsupervised
Modeling
Model
Evaluation &
Selection
Predict
Data & Model
Storage
Model Export:
Plain Old Java Object
Your
Imagination
Data Prep Export:
Local
SQL
High Level Architecture
31

HDFS
S3
NFS
Distributed
In-Memory
Load Data
Loss-less
Compression
H2O Compute Engine
Exploratory &
Descriptive
Analysis
Feature
Engineering &
Selection
Supervised &
Unsupervised
Modeling
Model
Evaluation &
Selection
Predict
Data & Model
Storage
Model Export:
Your
Imagination
Data Prep Export:
Local
SQL
32
Import Data from
Multiple Sources

HDFS
S3
NFS
Distributed
In-Memory
Load Data
Loss-less
Compression
H2O Compute Engine
Exploratory &
Descriptive
Analysis
Feature
Engineering &
Selection
Supervised &
Unsupervised
Modeling
Model
Evaluation &
Selection
Predict
Data & Model
Storage
Model Export:
Your
Imagination
Data Prep Export:
Local
SQL
33
Fast, Scalable & Distributed
Compute Engine Written in
Java

Supervised Learning
• Generalized Linear Models: Binomial,
Gaussian, Gamma, Poisson and Tweedie
• Naïve Bayes
Statistical
Analysis
Ensembles
• Distributed Random Forest: Classification
or regression models
• Gradient Boosting Machine: Produces an
ensemble of decision trees with increasing
refined approximations
Deep Neural
Networks
• Deep learning: Create multi-layer feed
forward neural networks starting with an
input layer followed by multiple layers of
nonlinear transformations
H2O-3 Algorithms Overview
Unsupervised Learning
• K-means: Partitions observations into k
clusters/groups of the same spatial size.
Automatically detect optimal k
Clustering
Dimensionality
Reduction
• Principal Component Analysis: Linearly transforms
correlated variables to independent components
• Generalized Low Rank Models: extend the idea of
PCA to handle arbitrary data consisting of numerical,
Boolean, categorical, and missing data
Anomaly
Detection
• Autoencoders: Find outliers using a
nonlinear dimensionality reduction using
deep learning
34

H2O Core
CPU
Model Building
H2O

H2O Core
CPU CPU CPU
Model Building
H2O Distributed In-Memory

H2O Core
YARN
CPU CPU CPU
Model Building
H2O Distributed In-Memory
SQL NFS
S3
Firewall or Cloud

HDFS
S3
NFS
Distributed
In-Memory
Load Data
Loss-less
Compression
H2O Compute Engine
Exploratory &
Descriptive
Analysis
Feature
Engineering &
Selection
Supervised &
Unsupervised
Modeling
Model
Evaluation &
Selection
Predict
Data & Model
Storage
Model Export:
Your
Imagination
Data Prep Export:
Local
SQL
39
Multiple Interfaces

Using H2O with R and Python
41

42H2O Team
Erin LeDell Navdeep Gill
H2O AutoML

Supervised Learning
Gaussian, Gamma, Poisson and Tweedie
• Naïve Bayes
Statistical
Analysis
Ensembles
• Distributed Random Forest: Classification
or regression models
• Gradient Boosting Machine: Produces an
ensemble of decision trees with increasing
refined approximations
Deep Neural
Networks
forward neural networks starting with an
input layer followed by multiple layers of
nonlinear transformations
H2O-3 Supervised Algorithms in AutoML
• K-means: Partitions observations into k
clusters/groups of the same spatial size.
Automatically detect optimal k
Clustering
Dimensionality
Reduction
• Principal Component Analysis: Linearly transforms
correlated variables to independent components
PCA to handle arbitrary data consisting of numerical,
Boolean, categorical, and missing data
Anomaly
Detection
nonlinear dimensionality reduction using
deep learning
45

HDFS
S3
NFS
Distributed
In-Memory
Load Data
Loss-less
Compression
H2O Compute Engine
Exploratory &
Descriptive
Analysis
Feature
Engineering &
Selection
Supervised &
Unsupervised
Modeling
Model
Evaluation &
Selection
Predict
Data & Model
Storage
Model Export:
Your
Imagination
Data Prep Export:
Local
SQL
48
Export Standalone Models
for Production

About Machine Learning
Interpretability
50
LIME (Local Interpretable Model-Agnostic Explanations)
… and more

Why Should I Trust Your Model?
52

Age
Age
Propensity to PurchasePropensity to Purchase
Linear Models
Machine Learning
Exact explanations for
approximate models.
Lost profits.
Wasted marketing.
“For a one unit increase in age, propensity to
purchase increases by 3% on average.”
Approximate explanations
for exact models.
Slope begins to decrease
here. Act to optimize
savings.
Slope begins to increase
here sharply. Act to
optimize profits.

Theory
• LIME approximates model locally
as logistic or linear model
• Repeats process many times
• Outputs features that are most
important to local models
Outcome
• Approximate reasoning
• Complex models can be
interpreted
• Neural nets, Random Forest,
Ensembles etc.
55
Local Interpretable Model-Agnostic Explanations
LIME - How does it work?

56
https://github.com/jphall663/awesome-machine-learning-interpretability

57
Time Topics / Tasks
7:30 – 7:45 pm
8:00 – 8:30 pm
Regression Example
examplesregression_...Rmd
8:30 – 9:00 pm

Learning from Boston Housing Data
Features:
No. of rooms, Crime Rate,
Age …
Median
House
Value
(medv)
Predictions
H2O Machine Learning
Learn the Pattern
58

🍕🍕🍕@h2oai @matlabulous
#AutoML #LIME
Pizzas & Drinks 8:30 – 9:00 pm
Sponsored by H2O.ai

60
Time Topics / Tasks
7:30 – 7:45 pm
8:30 – 9:00 pm
9:00 – 9:20 pm
Classification Example
examplesclassification_...Rmd

Learning from Diabetes Data
Features:
Pregnant, Age, Glucose …
Positive
/
Negative
Predictions
H2O Machine Learning
Learn the Pattern
61

62
Time Topics / Tasks
7:30 – 7:45 pm
8:30 – 9:00 pm 🍕 🍕 🍕 🍕 🍕

Why?
• Most users/organizations can
benefit from automatic machine
learning pipelines.
• Eliminate time wasted on human
errors, debugging etc.
• Model interpretations is crucial for
those who must explain their
models to regulators or customers.
You will learn …
• How to build high quality H2O
models (almost) automatically.
• How to explain predictions from
complex H2O models with LIME.
• Bonus: A real use-case that led to
multimillion-dollar baseball
decisions earlier this year.
63

64
Time Topics / Tasks
7:30 – 7:45 pm
8:30 – 9:00 pm 🍕 🍕 🍕 🍕 🍕

66
Time Topics / Tasks
7:30 – 7:45 pm
8:30 – 9:00 pm 🍕 🍕 🍕 🍕 🍕

H2O Products
with H2O Flow GUI
learning on GPUs
Automatic feature
learning on GPUs

“Confidential and property of H2O.ai. All rights reserved”
Supervised Learning
Gaussian, Gamma, Poisson and
Tweedie
• Naïve Bayes
Statistical
Analysis
Ensembles
• Distributed Random Forest:
Classification or regression models
• Gradient Boosting Machine: Produces
an ensemble of decision trees with
increasing refined approximations
Deep Neural
Networks
forward neural networks starting with
an input layer followed by multiple
layers of nonlinear transformations
Algorithms on H2O-3 (CPU)
• K-means: Partitions observations into
k clusters/groups of the same spatial
size. Automatically detect optimal k
Clustering
Dimensionality
Reduction
• Principal Component Analysis: Linearly
transforms correlated variables to independent
components
PCA to handle arbitrary data consisting of
numerical, Boolean, categorical, and missing
data
Anomaly
Detection
nonlinear dimensionality reduction
using deep learning

“Confidential and property of H2O.ai. All rights reserved”
Supervised Learning
Gaussian, Gamma, Poisson and
Tweedie
• Naïve Bayes
Statistical
Analysis
Ensembles
• Distributed Random Forest:
Classification or regression models
• Gradient Boosting Machine: Produces
an ensemble of decision trees with
increasing refined approximations
Deep Neural
Networks
forward neural networks starting with
an input layer followed by multiple
layers of nonlinear transformations
Algorithms on H2O4GPU (more to come)
• K-means: Partitions observations into
k clusters/groups of the same spatial
size. Automatically detect optimal k
Clustering
Dimensionality
Reduction
• Principal Component Analysis: Linearly
transforms correlated variables to independent
components
PCA to handle arbitrary data consisting of
numerical, Boolean, categorical, and missing
data
Anomaly
Detection
nonlinear dimensionality reduction
using deep learning

70
https://github.com/h2oai/h2o4gpu

Coming Up
72
Date City Talk / Workshop
25 June Milan H2O + LIME Workshop
28 June Rome H2O Intro + Use Cases
Mid-July London Next London Meetup (T.B.C.)
Mid-Oct London H2O World London
13-14 Nov London
Big Data LDN
Keynote by Sri (CEO) + Meetup

• Organizers & Sponsors
73
Thanks!
• Code, Slides & Documents
• bit.ly/h2o_meetups
• bit.ly/joe_eRum_2018
• docs.h2o.ai
• Contact
• joe@h2o.ai
• @matlabulous
• github.com/woobe
• Please search/ask questions on
Stack Overflow
• Use the tag `h2o` (not h2 zero)

Supported Formats & Data Sources
CSV
XLS
XLSX
ORC*
Hive*
SVMLight
ARFF
Parquet
Avro 1.8.0*
HDFS
S3
NFS
LOCAL
SQL
9Formats 5Sources
File type or Folder of Files
* 1. only if H2O is running as a Hadoop job
* 2. Hive files that are saved in ORC format
* 3. without multi-file parsing or column type modification

Distributed Algorithms
• Foundation for In-Memory Distributed Algorithm
Calculation - Distributed Data Frames and columnar
compression
• All algorithms are distributed in H2O: GBM, GLM, DRF, Deep
Learning and more. Fine-grained map-reduce iterations.
• Only enterprise-grade, open-source distributed algorithms
in the market
User Benefits
Advantageous Foundation
• “Out-of-box” functionalities for all algorithms (NO MORE
SCRIPTING) and uniform interface across all languages: R,
Python, Java
• Designed for all sizes of data sets, especially large data
• Highly optimized Java code for model exports
• In-house expertise for all algorithms
Parallel Parse into Distributed Rows
Fine Grain Map Reduce Illustration: Scalable
Distributed Histogram Calculation for GBM
Foundation for Distributed Algorithms
76

Demo:
H2O on a 320-Core Hadoop
Cluster
(Web Interface)
77

78
https://www.kaggle.com/c/higgs-boson

Learning from Higgs Boson Machine Data
79
Sensors
(Detector)
Data
Historical Outcome
Is it a Higgs Particle
(Yes/No)
Predicted Outcome
Learn the Pattern
11M
Rows
28 Features
Raw Data Size: 7.48 GB

80
11M Rows Size (Raw): 7.48 GB
Compressed: 2.00 GB (≈ 27% of Raw)

81
10 nodes
10 x 32 =
320 Cores
10 x 29.6 = 296
GB Memory

H2O Water Meter
(CPU Monitor)
82
10 x 32 = 320 Cores

Automatic and Interpretable Machine Learning in R with H2O and LIME (Milan Edition)

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Similar a Automatic and Interpretable Machine Learning in R with H2O and LIME (Milan Edition)

Similar a Automatic and Interpretable Machine Learning in R with H2O and LIME (Milan Edition) (20)

Más de Sri Ambati

Más de Sri Ambati (20)

Último

Último (20)

Automatic and Interpretable Machine Learning in R with H2O and LIME (Milan Edition)