H2O World 2015
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
5. Accuracy with Speed and Scale
[Diagram labels]
Data sources: HDFS, S3, SQL, NoSQL
Classification, Regression, Feature Engineering
In-Memory, Map Reduce / Fork Join, Columnar Compression
Deep Learning; PCA, GLM, Cox; Random Forest / GBM; Ensembles
Fast Modeling Engine
Streaming
Nano-Fast Java Scoring Engines
Matrix Factorization, Clustering, Munging
6. What’s New in H2O-3
H2O-3 vs H2O-2:
• Total rewrite of the core in Java: built for data scientists AND developers!
• Unique Flow GUI (Notebook and more)
• REST Schemas for self-describing API for all methods/algos
• New R client: cleaner, faster
• Sparkling Water: H2O is the Killer App on Spark
• Fully featured Python client (incl. Pipelines, scikit-learn look & feel)
• New expression parser & backend execution engine for R, Py, Flow
• New Algo: GLRM - Generalized Low Rank Modeling (unifies PCA, K-Means, Matrix Factorization, Imputation, etc.)
• New Solvers for GLM: Coordinate Descent and L-BFGS
continued…
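How GLRM can unify PCA, K-Means, matrix factorization and imputation is easier to see from the objective it minimizes. The formula below is the generalized low rank model objective as it is usually written in the literature; the symbols (A, X, Y, L, r, Ω, k) are not from the slides and are introduced here only for illustration.

```latex
\[
\min_{X \in \mathbb{R}^{m \times k},\; Y \in \mathbb{R}^{k \times n}}
\;\sum_{(i,j) \in \Omega} L_{ij}\!\left(x_i y_j,\, A_{ij}\right)
\;+\; \sum_{i=1}^{m} r_i(x_i)
\;+\; \sum_{j=1}^{n} \tilde{r}_j(y_j)
\]
```

Here A is the m-by-n data table, Ω is the set of observed entries, x_i is row i of X, y_j is column j of Y, L_ij is a per-entry loss, and r_i and r̃_j are regularizers. With quadratic loss and no regularization this reduces to PCA; with quadratic loss and each x_i constrained to have exactly one entry equal to 1 and the rest 0, it becomes K-Means; a missing entry A_ij can be imputed from x_i y_j.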
7. What’s New in H2O-3
Additional New Features:
• Grid Search for all Algorithms (R/Py/Flow)
• N-fold Cross-Validation for all Algorithms
• Early Stopping (check for convergence) for GBM/DRF/DL
• Stochastic GBM (row/col sampling)
• Distributions (Gaussian, Laplace, Poisson, Gamma, Tweedie) for GBM/DL
• Improved sparse data handling for DL
• Multi-node auto-tuning for DL
• Multinomial GLM
• Scalable Scatter Plots for numeric and categorical data
• Big-Big Joins (“distributed data.table”) - in QA
…and many more!
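To make the first two items concrete, here is a minimal pure-Python sketch of grid search combined with N-fold cross-validation. It does not use H2O and is not H2O's implementation: the toy model (predict `shrinkage` times the training mean), the modulo fold assignment, and all function names are assumptions made for illustration.

```python
from itertools import product
from statistics import mean


def fold_indices(n_rows, nfolds):
    """Assign row i to fold i % nfolds (a simple deterministic scheme)."""
    return [[i for i in range(n_rows) if i % nfolds == k] for k in range(nfolds)]


def cross_validated_mse(y, nfolds, shrinkage):
    """N-fold CV for a toy model that predicts shrinkage * mean(training y)."""
    fold_errors = []
    for holdout in fold_indices(len(y), nfolds):
        held = set(holdout)
        train = [v for i, v in enumerate(y) if i not in held]
        prediction = shrinkage * mean(train)
        # Score only on the rows that were held out of training.
        fold_errors.append(mean((y[i] - prediction) ** 2 for i in holdout))
    return mean(fold_errors)


def grid_search(y, grid, nfolds):
    """Score every hyperparameter combination; return (best_params, best_mse)."""
    names = sorted(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((cross_validated_mse(y, nfolds, **params), params))
    best_mse, best_params = min(results, key=lambda r: r[0])
    return best_params, best_mse


# Hypothetical toy data: six response values, one hyperparameter with two settings.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
best_params, best_mse = grid_search(y, {"shrinkage": [0.5, 1.0]}, nfolds=3)
```

Each grid point is scored by the average held-out error across folds, so the winning combination is chosen on data the toy model did not see during fitting.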
8. Convergence-Based Early Stopping in H2O
Before: trains too long, but at least overwrite_with_best_model=true prevents overfitting (returns the model with lowest validation error)
Now: specify additional convergence criterion, e.g. stopping_rounds=5, stopping_metric="MSE", stopping_tolerance=1e-3, to stop as soon as the moving average (length 5) of the validation MSE does not improve by at least 0.1% for 5 consecutive scoring events
[Figure: validation error and training error vs. training time / epochs, Deep Learning with Higgs data; annotations: overwrite_with_best_model=true, Best Model]
Use Flow to inspect the model
Early stopping saves tons of time
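The convergence criterion on this slide can be sketched in a few lines of plain Python. This is a simplified reading of the rule, not H2O's actual implementation: the function names are hypothetical, the metric is assumed to be one where lower is better (such as MSE), and the comparison against the best earlier moving average is an assumption about how "does not improve" is measured.

```python
def moving_average(values, window):
    """Trailing moving averages of the given window length."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def should_stop(metric_history, stopping_rounds=5, stopping_tolerance=1e-3):
    """Return True once the validation metric (lower is better) has converged.

    Simplified rule: take moving averages of length stopping_rounds and stop
    when none of the last stopping_rounds averages improves on the best
    earlier average by at least the relative stopping_tolerance.
    """
    k = stopping_rounds
    if k <= 0:
        return False  # treated as "early stopping disabled" in this sketch
    averages = moving_average(metric_history, k)
    if len(averages) <= k:
        return False  # need a reference average plus k later ones
    reference = min(averages[:-k])    # best average before the last k events
    recent_best = min(averages[-k:])  # best average among the last k events
    return recent_best > reference * (1 - stopping_tolerance)


def first_stopping_event(metric_history, **kwargs):
    """Number of scoring events after which training would stop, or None."""
    for n in range(1, len(metric_history) + 1):
        if should_stop(metric_history[:n], **kwargs):
            return n
    return None


# Hypothetical validation MSE per scoring event: improves, then plateaus.
history = [0.30, 0.25, 0.22, 0.20, 0.19] + [0.19] * 12
stop_at = first_stopping_event(history, stopping_rounds=5, stopping_tolerance=1e-3)
```

Averaging over several scoring events keeps a single noisy validation score from ending training early, and the relative tolerance (1e-3, i.e. 0.1%) sets how much improvement still counts as progress.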
9. What do these stickers mean?
• I have H2O installed
• I have Python installed
• I have R installed
• I have the H2O World data sets
Pick up stickers or get install help at the information booth