Gradient Boosting Machine:
Distributed Regression Trees
on H2O
Cliff Click, CTO 0xdata
cliffc@0xdata.com
http://0xdata.com
http://cliffc.org/blog
H2O is...
● Pure Java, Open Source: 0xdata.com
● https://github.com/0xdata/h2o/
● A Platform for doing Math
● Parallel Distributed Math
● In-memory analytics: GLM, GBM, RF, Logistic Reg
● Accessible via REST & JSON
● A K/V Store: ~150ns per get or put
● Distributed Fork/Join + Map/Reduce + K/V
0xdata.com

Agenda
● Building Blocks For Big Data:
  ● Vecs & Frames & Chunks
● Distributed Tree Algorithms
  ● Access Patterns & Execution
● GBM on H2O
  ● Performance

A Collection of Distributed Vectors
// A Distributed Vector
// much more than 2 billion elements
class Vec {
  long length();                 // more than an int's worth
  // fast random access
  double at(long idx);           // Get the idx'th elem
  boolean isNA(long idx);
  void set(long idx, double d);  // writable
  void append(double d);         // variable sized
}

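To make the long-indexed idea concrete, here is a minimal single-JVM sketch of a vector split into fixed-size pieces. The class name `ChunkedVecSketch`, the fixed `chunkSize`, and the use of NaN to mark a missing value are assumptions made for this example; they are not taken from H2O's code.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy, single-JVM stand-in for a long-indexed vector stored as fixed-size chunks. */
class ChunkedVecSketch {
    private final int chunkSize;                        // hypothetical: elements per chunk
    private final List<double[]> chunks = new ArrayList<>();
    private long length = 0;

    ChunkedVecSketch(int chunkSize) { this.chunkSize = chunkSize; }

    long length() { return length; }

    // Append fills the last chunk and allocates a new one when it is full.
    void append(double d) {
        int off = (int) (length % chunkSize);
        if (off == 0) chunks.add(new double[chunkSize]);
        chunks.get(chunks.size() - 1)[off] = d;
        length++;
    }

    // A long index splits into (chunk number, offset inside the chunk).
    double at(long idx) {
        checkIndex(idx);
        return chunks.get((int) (idx / chunkSize))[(int) (idx % chunkSize)];
    }

    // Assumption for this sketch: NaN marks a missing value.
    boolean isNA(long idx) { return Double.isNaN(at(idx)); }

    void set(long idx, double d) {
        checkIndex(idx);
        chunks.get((int) (idx / chunkSize))[(int) (idx % chunkSize)] = d;
    }

    private void checkIndex(long idx) {
        if (idx < 0 || idx >= length) throw new IndexOutOfBoundsException("idx=" + idx);
    }

    public static void main(String[] args) {
        ChunkedVecSketch v = new ChunkedVecSketch(4);
        for (int i = 0; i < 10; i++) v.append(i * 1.5);
        v.set(9, Double.NaN);
        System.out.println(v.length() + " " + v.at(5) + " " + v.isNA(9));
    }
}
```

In a distributed setting the chunk number would also decide which JVM holds the data; this sketch keeps everything in one heap.
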
Frames
A Frame: Vec[]
[Diagram: a Frame with Vec columns age, sex, zip, ID, car; each Vec spans the heaps of JVM 1 through JVM 4]
● Vecs aligned in heaps
● Optimized for concurrent access
● Random access any row, any JVM
● But faster if local... more on that later

Distributed Data Taxonomy
A Chunk, Unit of Parallel Access
[Diagram: five Vecs spanning the heaps of JVM 1 through JVM 4]
● Typically 1e3 to 1e6 elements
● Stored compressed
● In byte arrays
● Get/put is a few clock cycles including compression

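One way a chunk of doubles could live compressed in a byte array, while still decoding in a few operations per access, is a bias-plus-byte encoding. The sketch below is only an illustration of that idea: the class `ByteChunkSketch` and its encoding rules are assumptions for this example, not H2O's actual chunk formats.

```java
/** Toy compressed chunk: integer-valued doubles packed one per byte, relative to a bias. */
class ByteChunkSketch {
    private final byte[] mem;   // compressed storage, one byte per element
    private final long bias;    // smallest value in the chunk

    private ByteChunkSketch(byte[] mem, long bias) { this.mem = mem; this.bias = bias; }

    /** Returns a packed chunk, or null when the values do not fit this encoding. */
    static ByteChunkSketch compress(double[] vals) {
        if (vals.length == 0) return new ByteChunkSketch(new byte[0], 0);
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double d : vals) {
            // Reject fractions, NaN, infinities and values too large to round-trip.
            if (d != Math.rint(d) || Math.abs(d) > 1e15) return null;
            min = Math.min(min, d);
            max = Math.max(max, d);
        }
        if (max - min > 255) return null;           // range must fit in one unsigned byte
        long bias = (long) min;
        byte[] mem = new byte[vals.length];
        for (int i = 0; i < vals.length; i++) mem[i] = (byte) ((long) vals[i] - bias);
        return new ByteChunkSketch(mem, bias);
    }

    int len() { return mem.length; }

    // Decompress on access: one load, one mask, one add.
    double at(int i) { return (mem[i] & 0xFF) + bias; }

    public static void main(String[] args) {
        ByteChunkSketch c = compress(new double[] {100, 101, 355, 100});
        System.out.println(c.len() + " elements in " + c.len() + " bytes, at(2)=" + c.at(2));
    }
}
```

A real system would fall back to a wider encoding when `compress` returns null; that fallback is left out here.
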
Distributed Parallel Execution
[Diagram: five Vecs spanning the heaps of JVM 1 through JVM 4]
● All CPUs grab Chunks in parallel
● F/J load balances
● Code moves to Data
● Map/Reduce & F/J handles all sync
● H2O handles all comm, data manage

Distributed Data Taxonomy

Frame – a collection of Vecs
Vec – a collection of Chunks
Chunk – a collection of 1e3 to 1e6 elems
elem – a java double
Row i – i'th elements of all the Vecs in a Frame

Distributed Coding Taxonomy
● No Distribution Coding:
  ● Whole Algorithms, Whole Vector-Math
  ● REST + JSON: e.g. load data, GLM, get results
● Simple Data-Parallel Coding:
  ● Per-Row (or neighbor row) Math
  ● Map/Reduce-style: e.g. Any dense linear algebra
● Complex Data-Parallel Coding
  ● K/V Store, Graph Algo's, e.g. PageRank

Distributed Coding Taxonomy
● No Distribution Coding: (Read the docs!)
  ● Whole Algorithms, Whole Vector-Math
  ● REST + JSON: e.g. load data, GLM, get results
● Simple Data-Parallel Coding: (This talk!)
  ● Per-Row (or neighbor row) Math
  ● Map/Reduce-style: e.g. Any dense linear algebra
● Complex Data-Parallel Coding (Join our GIT!)
  ● K/V Store, Graph Algo's, e.g. PageRank

Simple Data-Parallel Coding
● Map/Reduce Per-Row: Stateless
  ● Example from Linear Regression, Σ y²
double sumY2 = new MRTask() {
  double map( double d ) { return d*d; }
  double reduce( double d1, double d2 ) {
    return d1+d2;
  }
}.doAll( vecY );
● Auto-parallel, auto-distributed
● Near Fortran speed, Java Ease

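For readers without an H2O cluster at hand, the same stateless map/reduce shape can be tried on one machine with the JDK's parallel streams. This is a stand-in, not H2O's `MRTask`: the class and method names are made up for the example, and the data is a plain `double[]` rather than a Vec.

```java
import java.util.stream.DoubleStream;

/** Single-JVM stand-in for the stateless per-row map/reduce: sum of y squared. */
class SumOfSquaresSketch {
    static double sumY2(double[] y) {
        return DoubleStream.of(y)
                .parallel()                  // rows are processed by several threads
                .map(d -> d * d)             // map: one row in, one value out
                .reduce(0.0, Double::sum);   // reduce: must be associative, with identity 0
    }

    public static void main(String[] args) {
        System.out.println(sumY2(new double[] {1, 2, 3, 4}));
    }
}
```

Because the reduce runs in an unspecified grouping, floating-point results can differ in the last bits from a sequential loop; with small integers as above the sum is exact.
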
Simple Data-Parallel Coding
● Map/Reduce Per-Row: State-full
  ● Linear Regression Pass1: Σ x, Σ y, Σ y²
class LRPass1 extends MRTask {
  double sumX, sumY, sumY2; // I Can Haz State?
  void map( double X, double Y ) {
    sumX += X; sumY += Y; sumY2 += Y*Y;
  }
  void reduce( LRPass1 that ) {
    sumX += that.sumX ;
    sumY += that.sumY ;
    sumY2 += that.sumY2;
  }
}

Simple Data-Parallel Coding
● Map/Reduce Per-Row: Batch State-full
class LRPass1 extends MRTask {
  double sumX, sumY, sumY2;
  void map( Chunk CX, Chunk CY ) {   // Whole Chunks
    for( int i=0; i<CX.len; i++ ) {  // Batch!
      double X = CX.at(i), Y = CY.at(i);
      sumX += X; sumY += Y; sumY2 += Y*Y;
    }
  }
  void reduce( LRPass1 that ) {
    sumX += that.sumX ;
    sumY += that.sumY ;
    sumY2 += that.sumY2;
  }
}

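The batch pattern above (one task object per chunk, partial state rolled up in reduce) can be mimicked on a single JVM with the JDK's Fork/Join framework. The sketch below is an assumption-laden stand-in: chunks are plain `double[]` arrays, and `LRPass1Sketch`, `Driver` and `doAll` are names invented for this example rather than H2O classes.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

/** Single-JVM sketch of a stateful, chunk-at-a-time map/reduce. */
class LRPass1Sketch {
    double sumX, sumY, sumY2;

    // map: accumulate over one whole chunk into this task's private fields.
    void map(double[] cx, double[] cy) {
        for (int i = 0; i < cx.length; i++) {
            sumX += cx[i];
            sumY += cy[i];
            sumY2 += cy[i] * cy[i];
        }
    }

    // reduce: fold another task's partial sums into this one.
    void reduce(LRPass1Sketch that) {
        sumX += that.sumX;
        sumY += that.sumY;
        sumY2 += that.sumY2;
    }

    /** Divide-and-conquer driver over the chunk range [lo, hi). */
    static final class Driver extends RecursiveTask<LRPass1Sketch> {
        final double[][] xs, ys;
        final int lo, hi;

        Driver(double[][] xs, double[][] ys, int lo, int hi) {
            this.xs = xs; this.ys = ys; this.lo = lo; this.hi = hi;
        }

        @Override
        protected LRPass1Sketch compute() {
            if (hi - lo == 1) {                  // leaf: map one chunk
                LRPass1Sketch t = new LRPass1Sketch();
                t.map(xs[lo], ys[lo]);
                return t;
            }
            int mid = (lo + hi) >>> 1;
            Driver left = new Driver(xs, ys, lo, mid);
            Driver right = new Driver(xs, ys, mid, hi);
            left.fork();                         // left half runs on another worker
            LRPass1Sketch r = right.compute();
            LRPass1Sketch l = left.join();
            l.reduce(r);                         // pairwise roll-up
            return l;
        }
    }

    static LRPass1Sketch doAll(double[][] xChunks, double[][] yChunks) {
        if (xChunks.length == 0) return new LRPass1Sketch();
        return ForkJoinPool.commonPool().invoke(new Driver(xChunks, yChunks, 0, xChunks.length));
    }
}
```

Each task writes only its own fields, so map needs no locking; all cross-task combination happens in reduce.
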
GBM (for K-classifier)
Elements of Statistical Learning, 2nd Ed, 2009, Pg 387
Trevor Hastie, Robert Tibshirani, Jerome Friedman

Distributed Trees
● Overlay a Tree over the data
  ● Really: Assign a Tree Node to each Row
  ● Number the Nodes
  ● Store "Node_ID" per row in a temp Vec
      Vec nids = v.makeZero();
      … nids.set(row,nid)...
● Make a pass over all Rows
  ● Nodes not visited in order...
  ● but all rows, all Nodes efficiently visited
● Do work (e.g. histogram) per Row/Node

Distributed Trees
● An initial Tree
  ● All rows start on n0
  ● MRTask: compute stats

Tree: n0

X   Y     nids
0   1.3   0
1   1.1   0
2   3.1   0
3   -2.1  0

MRTask.avg=0.85
MRTask.var=3.5075
● Use the stats to make a decision...
  ● (varies by algorithm)!
  ● (e.g. lowest MSE, best col, best split)

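The avg and var figures for n0 follow from three running sums (n, Σ y, Σ y²), which is what makes them cheap to compute in map and roll up in reduce. A minimal sketch, using the four Y values from the table and the population variance Σy²/n − avg²; the class name `NodeStatsSketch` is an assumption for this example.

```java
/** Mergeable running sums for one tree node. */
class NodeStatsSketch {
    long n;
    double sumY, sumY2;

    void add(double y) { n++; sumY += y; sumY2 += y * y; }

    // Roll up another partial result, as a reduce call would.
    void merge(NodeStatsSketch that) {
        n += that.n;
        sumY += that.sumY;
        sumY2 += that.sumY2;
    }

    double avg() { return sumY / n; }

    // Population variance: E[y^2] - (E[y])^2.
    double var() {
        double m = avg();
        return sumY2 / n - m * m;
    }

    public static void main(String[] args) {
        // Two "chunks" of the example column Y, computed separately and then merged.
        NodeStatsSketch a = new NodeStatsSketch();
        a.add(1.3);
        a.add(1.1);
        NodeStatsSketch b = new NodeStatsSketch();
        b.add(3.1);
        b.add(-2.1);
        a.merge(b);
        System.out.printf("avg=%.2f var=%.4f%n", a.avg(), a.var());
    }
}
```
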
Distributed Trees
● Next layer in the Tree (and MRTask across rows)
  ● Each row: decide!
    – If "X<1.5" go right else left
  ● Compute stats per new leaf
● Each pass across all rows builds entire layer

Tree:
n0
  X>=1.5: n1 (avg=0.5, var=6.76)
  X<1.5:  n2 (avg=1.2, var=0.01)

X   Y     nids
0   1.3   2
1   1.1   2
2   3.1   1
3   -2.1  1

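The pass for this layer can be sketched in plain Java: each row on n0 tests the split, records its new node ID, and adds its Y to that node's running sums. Arrays stand in for Vecs here, and `LevelPassSketch`, `Stats` and `splitRoot` are names assumed for the example. The node numbering (n1 for X>=1.5, n2 for X<1.5) is taken from the example tree.

```java
/** One data pass that moves rows from the root to its two children and gathers stats. */
class LevelPassSketch {
    /** Running sums for one node; avg and var are derived from n, sum(y), sum(y^2). */
    static final class Stats {
        long n;
        double sumY, sumY2;

        void add(double y) { n++; sumY += y; sumY2 += y * y; }

        double avg() { return sumY / n; }

        double var() {
            double m = avg();
            return sumY2 / n - m * m;
        }
    }

    /**
     * Rows currently on node 0 move to node 1 (x >= split) or node 2 (x < split).
     * Updates nids in place and returns per-node stats indexed by node ID.
     */
    static Stats[] splitRoot(double[] x, double[] y, int[] nids, double split) {
        Stats[] stats = new Stats[3];
        for (int r = 0; r < x.length; r++) {
            if (nids[r] != 0) continue;            // only rows still on the root
            int nid = x[r] < split ? 2 : 1;
            nids[r] = nid;                         // reassign the row
            if (stats[nid] == null) stats[nid] = new Stats();   // lazy allocation
            stats[nid].add(y[r]);                  // score and build in the same pass
        }
        return stats;
    }

    public static void main(String[] args) {
        double[] x = {0, 1, 2, 3};
        double[] y = {1.3, 1.1, 3.1, -2.1};
        int[] nids = new int[4];
        Stats[] s = splitRoot(x, y, nids, 1.5);
        System.out.printf("n1 avg=%.2f var=%.2f, n2 avg=%.2f var=%.2f%n",
                s[1].avg(), s[1].var(), s[2].avg(), s[2].var());
    }
}
```
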
Distributed Trees
● Another MRTask, another layer...
  ● i.e., a 5-deep tree takes 5 passes
● Fully data-parallel for each tree level

Tree:
n0
  X>=1.5: n1
    X>=2.5: n3 (avg=-2.1)
    X<2.5:  n4 (avg=3.1)
  X<1.5:  n2 (avg=1.2)

X   Y     nids
0   1.3   2
1   1.1   2
2   3.1   4
3   -2.1  3

Distributed Trees
● Each pass is over one layer in the tree
● Builds per-node histogram in map+reduce calls
class Pass extends MRTask2<Pass> {
  void map( Chunk chks[] ) {
    Chunk nids = chks[...];              // Node-IDs per row
    for( int r=0; r<nids.len; r++ ) {    // All rows
      int nid = (int)nids.at80(r);       // Node-ID THIS row
      // Lazy: not all Chunks see all Nodes
      if( dHisto[nid]==null ) dHisto[nid]=...
      // Accumulate histogram stats per node
      dHisto[nid].accum(chks,r);
    }
  }
}
new Pass().doAll(myDataFrame,nids);

Distributed Trees
● Each pass analyzes one Tree level
  ● Then decide how to build next level
  ● Reassign Rows to new levels in another pass
    – (actually merge the two passes)
● Builds a Histogram-per-Node
  ● Which requires a reduce() call to roll up
  ● All Histograms for one level done in parallel

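A histogram-per-node pass with its reduce step might look like the following on a single column. This is a sketch under assumptions: fixed-width bins over a known [min, max) range, arrays in place of Chunks, and invented names (`HistoPassSketch`, `Histo`); H2O's real histogram classes are not shown in the slides.

```java
/** Per-node histograms built in map and rolled up in reduce (single column). */
class HistoPassSketch {
    /** Fixed-width bins holding a row count and a sum of y per bin. */
    static final class Histo {
        final double min, max;
        final long[] counts;
        final double[] sumY;

        Histo(double min, double max, int nbins) {
            this.min = min; this.max = max;
            counts = new long[nbins];
            sumY = new double[nbins];
        }

        int bin(double x) {
            int b = (int) ((x - min) / (max - min) * counts.length);
            return Math.min(Math.max(b, 0), counts.length - 1);   // clamp to valid bins
        }

        void accum(double x, double y) {
            int b = bin(x);
            counts[b]++;
            sumY[b] += y;
        }

        void merge(Histo that) {
            for (int b = 0; b < counts.length; b++) {
                counts[b] += that.counts[b];
                sumY[b] += that.sumY[b];
            }
        }
    }

    final Histo[] histos;          // indexed by node ID, allocated lazily
    final double min, max;
    final int nbins;

    HistoPassSketch(int numNodes, double min, double max, int nbins) {
        histos = new Histo[numNodes];
        this.min = min; this.max = max; this.nbins = nbins;
    }

    // map: one chunk's rows; not every chunk sees every node.
    void map(double[] x, double[] y, int[] nids) {
        for (int r = 0; r < nids.length; r++) {
            int nid = nids[r];
            if (histos[nid] == null) histos[nid] = new Histo(min, max, nbins);
            histos[nid].accum(x[r], y[r]);
        }
    }

    // reduce: merge node by node, adopting histograms this task never allocated.
    void reduce(HistoPassSketch that) {
        for (int n = 0; n < histos.length; n++) {
            if (that.histos[n] == null) continue;
            if (histos[n] == null) histos[n] = that.histos[n];
            else histos[n].merge(that.histos[n]);
        }
    }
}
```

After the final reduce, one object holds every node's histogram for the level, which is the input to the split decision for the next level.
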
Distributed Trees: utilities
● “score+build” in one pass:
  ● Test each row against decision from prior pass
  ● Assign to a new leaf
  ● Build histogram on that leaf
● “score”: just walk the tree, and get results
● “compress”: Tree from POJO to byte[]
  ● Easily 10x smaller, can still walk, score, print
● Plus utilities to walk, print, display

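To show how a tree can be flattened into a byte[] and still be walked for scoring, here is one possible layout. Everything about the encoding is an assumption for this sketch (pre-order nodes, a 1-byte tag, a 2-byte column, a 4-byte float split, and a 4-byte skip over the left subtree); it is not H2O's compressed-tree format, and the names are invented.

```java
import java.nio.ByteBuffer;

/** Packs a POJO decision tree into a byte[] and scores rows directly from the bytes. */
class CompressedTreeSketch {
    /** Plain-object node: a leaf with a value, or a split "row[col] < split" with two children. */
    static final class Node {
        final int col;
        final float split, value;
        final Node left, right;

        Node(float value) { this(-1, 0f, value, null, null); }

        Node(int col, float split, Node left, Node right) { this(col, split, 0f, left, right); }

        private Node(int col, float split, float value, Node left, Node right) {
            this.col = col; this.split = split; this.value = value;
            this.left = left; this.right = right;
        }

        boolean isLeaf() { return left == null; }
    }

    // Leaf: tag(1) + value(4). Split: tag(1) + col(2) + split(4) + skip(4) + both subtrees.
    static int size(Node n) {
        return n.isLeaf() ? 5 : 11 + size(n.left) + size(n.right);
    }

    private static void write(Node n, ByteBuffer bb) {
        if (n.isLeaf()) {
            bb.put((byte) 0).putFloat(n.value);
            return;
        }
        bb.put((byte) 1).putShort((short) n.col).putFloat(n.split).putInt(size(n.left));
        write(n.left, bb);     // left subtree follows the header directly
        write(n.right, bb);    // right subtree starts "skip" bytes later
    }

    static byte[] compress(Node root) {
        ByteBuffer bb = ByteBuffer.allocate(size(root));
        write(root, bb);
        return bb.array();
    }

    /** Walks the packed tree without rebuilding any objects. */
    static float score(byte[] tree, double[] row) {
        ByteBuffer bb = ByteBuffer.wrap(tree);
        int pos = 0;
        while (bb.get(pos) == 1) {
            int col = bb.getShort(pos + 1);
            float split = bb.getFloat(pos + 3);
            int skip = bb.getInt(pos + 7);
            pos += 11;                                // start of the left subtree
            if (!(row[col] < split)) pos += skip;     // go right: jump over the left subtree
        }
        return bb.getFloat(pos + 1);
    }
}
```

The three-leaf example tree from the earlier slides (splits at 1.5 and 2.5) packs into 37 bytes with this layout.
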
GBM on Distributed Trees
● GBM builds 1 Tree, 1 level at a time, but...
● We run the entire level in parallel & distributed
  ● Built breadth-first because it's "free"
  ● More data offset by more CPUs
● Classic GBM otherwise
  ● Build residuals tree-by-tree
  ● ESL2, page 387
  ● Tuning knobs: trees, depth, shrinkage, min_rows
● Pure Java

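The "build residuals tree-by-tree" loop can be sketched in a few dozen lines. To stay short, this example makes several simplifying assumptions that differ from the K-class algorithm referenced above: it does squared-error regression on a single column, fixes depth at 1 (stumps), and runs on one thread. The knobs `trees`, `shrinkage` and `minRows` mirror the tuning knobs listed; all class and method names are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Minimal gradient boosting for squared error, using depth-1 trees on one column. */
class GbmSketch {
    /** Depth-1 regression tree: rows with x < split go left, the rest go right. */
    static final class Stump {
        final double split, left, right;

        Stump(double split, double left, double right) {
            this.split = split; this.left = left; this.right = right;
        }

        double predict(double x) { return x < split ? left : right; }
    }

    final double f0;                          // initial prediction: mean of y
    final double shrinkage;                   // learning rate applied to every tree
    final List<Stump> stumps = new ArrayList<>();

    GbmSketch(double f0, double shrinkage) { this.f0 = f0; this.shrinkage = shrinkage; }

    double predict(double x) {
        double f = f0;
        for (Stump s : stumps) f += shrinkage * s.predict(x);
        return f;
    }

    /** Least-squares stump on residuals r; null when no split leaves minRows on each side. */
    static Stump fitStump(double[] x, double[] r, int minRows) {
        double[] xs = x.clone();
        Arrays.sort(xs);
        Stump best = null;
        double bestSse = Double.POSITIVE_INFINITY;
        for (int i = 1; i < xs.length; i++) {
            if (xs[i] == xs[i - 1]) continue;
            double split = (xs[i] + xs[i - 1]) / 2;       // candidate: midpoint of neighbors
            double sl = 0, sr = 0;
            int nl = 0, nr = 0;
            for (int k = 0; k < x.length; k++) {
                if (x[k] < split) { sl += r[k]; nl++; } else { sr += r[k]; nr++; }
            }
            if (nl < minRows || nr < minRows) continue;
            double ml = sl / nl, mr = sr / nr, sse = 0;
            for (int k = 0; k < x.length; k++) {
                double e = r[k] - (x[k] < split ? ml : mr);
                sse += e * e;
            }
            if (sse < bestSse) { bestSse = sse; best = new Stump(split, ml, mr); }
        }
        return best;
    }

    static GbmSketch fit(double[] x, double[] y, int trees, double shrinkage, int minRows) {
        double mean = 0;
        for (double v : y) mean += v / y.length;
        GbmSketch model = new GbmSketch(mean, shrinkage);
        double[] f = new double[y.length];
        Arrays.fill(f, mean);
        for (int t = 0; t < trees; t++) {
            double[] r = new double[y.length];
            for (int k = 0; k < y.length; k++) r[k] = y[k] - f[k];   // residuals so far
            Stump s = fitStump(x, r, minRows);
            if (s == null) break;
            model.stumps.add(s);
            for (int k = 0; k < y.length; k++) f[k] += shrinkage * s.predict(x[k]);
        }
        return model;
    }

    static double mse(GbmSketch m, double[] x, double[] y) {
        double sse = 0;
        for (int k = 0; k < y.length; k++) {
            double e = y[k] - m.predict(x[k]);
            sse += e * e;
        }
        return sse / y.length;
    }
}
```

With zero trees the model predicts the mean of y for every row; each added stump fits what the earlier trees left unexplained, scaled down by the shrinkage.
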
GBM on Distributed Trees
● Limiting factor: latency in turning over a level
  ● About 4x faster than R single-node on covtype
  ● Does the per-level compute in parallel
  ● Requires sending histograms over network
    – Can get big for very deep tree
● Adding more data offset by adding more Nodes

Summary: Write (parallel) Java
● Most simple Java “just works”
● Fast: parallel distributed reads, writes, appends
  ● Reads same speed as plain Java array loads
  ● Writes, appends: slightly slower (compression)
  ● Typically memory bandwidth limited
    – (may be CPU limited in a few cases)
● Slower: conflicting writes (but follows strict JMM)
  ● Also supports transactional updates

Summary: Writing Analytics
● We're writing Big Data Analytics
  ● Generalized Linear Modeling (ADMM, GLMNET)
    – Logistic Regression, Poisson, Gamma
  ● Random Forest, GBM, KMeans++, KNN
● State-of-the-art Algorithms, running Distributed
● Solidly working on 100G datasets
  ● Heading for Tera Scale
● Paying customers (in production!)
● Come write your own (distributed) algorithm!!!

Cool Systems Stuff...
● … that I ran out of space for
● Reliable UDP, integrated w/RPC
● TCP is reliably UNReliable
  ● Already have a reliable UDP framework, so no prob
● Fork/Join Goodies:
  ● Distributed F/J
  ● Priority Queues
  ● Surviving fork bombs & lost threads
● K/V does JMM via hardware-like MESI protocol

H2O is...
● Pure Java, Open Source: 0xdata.com
● https://github.com/0xdata/h2o/
● A Platform for doing Math
● Parallel Distributed Math
● In-memory analytics: GLM, GBM, RF, Logistic Reg
● Accessible via REST & JSON
● A K/V Store: ~150ns per get or put
● Distributed Fork/Join + Map/Reduce + K/V

The Platform
[Diagram: JVM 1 and JVM 2 each show the same stack. Class layers: User code? extends MRTask, extends DRemoteTask, extends DTask, extends Iced. Runtime layers: D/F/J, RPC, K/V get/put, AutoBuffer, byte[]. The two JVMs are connected by UDP / TCP, and each sits on NFS and HDFS.]

Other Simple Examples
● Filter & Count (underage males):
  ● (can pass in any number of Vecs or a Frame)
long count = new MRTask() {
  long map( long age, long sex ) {
    return (age<=17 && sex==MALE) ? 1 : 0;
  }
  long reduce( long d1, long d2 ) {
    return d1+d2;
  }
}.doAll( vecAge, vecSex );

Other Simple Examples
● Filter into new set (underage males):
  ● Can write or append subset of rows
    – (append order is preserved)
class Filter extends MRTask {
  void map(Chunk CRisk, Chunk CAge, Chunk CSex){
    for( int i=0; i<CAge.len; i++ )
      if( CAge.at(i)<=17 && CSex.at(i)==MALE )
        CRisk.append(CAge.at(i)); // build a set
  }
};
Vec risk = new AppendableVec();
new Filter().doAll( risk, vecAge, vecSex );
...risk... // all the underage males

Other Simple Examples
● Group-by: count of car-types by age
class AgeHisto extends MRTask {
  long carAges[][]; // count of cars by age
  void map( Chunk CAge, Chunk CCar ) {
    carAges = new long[numAges][numCars];
    for( int i=0; i<CAge.len; i++ )
      carAges[(int)CAge.at(i)][(int)CCar.at(i)]++;
  }
  void reduce( AgeHisto that ) {
    for( int i=0; i<carAges.length; i++ )
      for( int j=0; j<carAges[i].length; j++ )
        carAges[i][j] += that.carAges[i][j];
  }
}

Other Simple Examples
● Group-by: count of car-types by age
class AgeHisto extends MRTask {
  // Setting carAges in map() makes it an output field.
  // Private per-map call, single-threaded write access.
  // Must be rolled-up in the reduce call.
  long carAges[][]; // count of cars by age
  void map( Chunk CAge, Chunk CCar ) {
    carAges = new long[numAges][numCars];
    for( int i=0; i<CAge.len; i++ )
      carAges[(int)CAge.at(i)][(int)CCar.at(i)]++;
  }
  void reduce( AgeHisto that ) {
    for( int i=0; i<carAges.length; i++ )
      for( int j=0; j<carAges[i].length; j++ )
        carAges[i][j] += that.carAges[i][j];
  }
}

Other Simple Examples
● Uniques
  ● Uses distributed hash set
class Uniques extends MRTask {
  DNonBlockingHashSet<Long> dnbhs = new ...;
  void map( long id ) { dnbhs.add(id); }
  void reduce( Uniques that ) {
    dnbhs.putAll(that.dnbhs);
  }
};
long uniques = new Uniques().
  doAll( vecVisitors ).dnbhs.size();

Other Simple Examples
● Uniques
  ● Uses distributed hash set
class Uniques extends MRTask {
  // Setting dnbhs in <init> makes it an input field.
  // Shared across all maps(). Often read-only.
  // This one is written, so needs a reduce.
  DNonBlockingHashSet<Long> dnbhs = new ...;
  void map( long id ) { dnbhs.add(id); }
  void reduce( Uniques that ) {
    dnbhs.putAll(that.dnbhs);
  }
};
long uniques = new Uniques().
  doAll( vecVisitors ).dnbhs.size();

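On a single JVM the same "shared, written-by-all-maps" pattern can be tried with a concurrent set from the JDK. This is only a local analogue: `UniquesSketch` and `countUniques` are names assumed for the example, and because all threads share one heap there is no per-node set to merge, so no reduce step appears.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.LongStream;

/** Single-JVM analogue of counting unique IDs with a shared concurrent set. */
class UniquesSketch {
    static long countUniques(long[] ids) {
        Set<Long> seen = ConcurrentHashMap.newKeySet();      // thread-safe, shared by all workers
        LongStream.of(ids).parallel().forEach(seen::add);    // "map": every row adds its id
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(countUniques(new long[] {1, 2, 2, 3, 3, 3}));
    }
}
```

In the distributed version each JVM would hold its own copy of the set, which is why the task above needs reduce to union them.
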

Cliff Click Explains GBM at Netflix, October 10, 2013

Sv big datascience_cliffclick_5_2_2013
Sv big datascience_cliffclick_5_2_2013Sv big datascience_cliffclick_5_2_2013
Sv big datascience_cliffclick_5_2_2013Sri Ambati
 
H2O Design and Infrastructure with Matt Dowle
H2O Design and Infrastructure with Matt DowleH2O Design and Infrastructure with Matt Dowle
H2O Design and Infrastructure with Matt DowleSri Ambati
 
Shared Database Concurrency
Shared Database ConcurrencyShared Database Concurrency
Shared Database ConcurrencyAivars Kalvans
 
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Martin Zapletal
 
Introduction to Postrges-XC
Introduction to Postrges-XCIntroduction to Postrges-XC
Introduction to Postrges-XCAshutosh Bapat
 
Big data should be simple
Big data should be simpleBig data should be simple
Big data should be simpleDori Waldman
 
Computer Graphics - Lecture 01 - 3D Programming I
Computer Graphics - Lecture 01 - 3D Programming IComputer Graphics - Lecture 01 - 3D Programming I
Computer Graphics - Lecture 01 - 3D Programming I💻 Anton Gerdelan
 
Netflix machine learning
Netflix machine learningNetflix machine learning
Netflix machine learningAmer Ather
 
Cassandra Talk: Austin JUG
Cassandra Talk: Austin JUGCassandra Talk: Austin JUG
Cassandra Talk: Austin JUGStu Hood
 
AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
AI Infra Day | Composable PyTorch Distributed with PT2 @ MetaAI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
AI Infra Day | Composable PyTorch Distributed with PT2 @ MetaAlluxio, Inc.
 
GDC 2012: Advanced Procedural Rendering in DX11
GDC 2012: Advanced Procedural Rendering in DX11GDC 2012: Advanced Procedural Rendering in DX11
GDC 2012: Advanced Procedural Rendering in DX11smashflt
 
PGConf APAC 2018 - High performance json postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json  postgre-sql vs. mongodbPGConf APAC 2018 - High performance json  postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json postgre-sql vs. mongodbPGConf APAC
 
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive Omid Vahdaty
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3 Omid Vahdaty
 
HBaseCon 2015: OpenTSDB and AsyncHBase Update
HBaseCon 2015: OpenTSDB and AsyncHBase UpdateHBaseCon 2015: OpenTSDB and AsyncHBase Update
HBaseCon 2015: OpenTSDB and AsyncHBase UpdateHBaseCon
 
mloc.js 2014 - JavaScript and the browser as a platform for game development
mloc.js 2014 - JavaScript and the browser as a platform for game developmentmloc.js 2014 - JavaScript and the browser as a platform for game development
mloc.js 2014 - JavaScript and the browser as a platform for game developmentDavid Galeano
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifyNeville Li
 
Etl confessions pg conf us 2017
Etl confessions   pg conf us 2017Etl confessions   pg conf us 2017
Etl confessions pg conf us 2017Corey Huinker
 
Web-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchWeb-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchEdward Capriolo
 

Similar a Cliff Click Explains GBM at Netflix October 10 2013 (20)

Sv big datascience_cliffclick_5_2_2013
Sv big datascience_cliffclick_5_2_2013Sv big datascience_cliffclick_5_2_2013
Sv big datascience_cliffclick_5_2_2013
 
H2O Design and Infrastructure with Matt Dowle
H2O Design and Infrastructure with Matt DowleH2O Design and Infrastructure with Matt Dowle
H2O Design and Infrastructure with Matt Dowle
 
Shared Database Concurrency
Shared Database ConcurrencyShared Database Concurrency
Shared Database Concurrency
 
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
 
Introduction to Postrges-XC
Introduction to Postrges-XCIntroduction to Postrges-XC
Introduction to Postrges-XC
 
Big data should be simple
Big data should be simpleBig data should be simple
Big data should be simple
 
Computer Graphics - Lecture 01 - 3D Programming I
Computer Graphics - Lecture 01 - 3D Programming IComputer Graphics - Lecture 01 - 3D Programming I
Computer Graphics - Lecture 01 - 3D Programming I
 
Netflix machine learning
Netflix machine learningNetflix machine learning
Netflix machine learning
 
Cassandra Talk: Austin JUG
Cassandra Talk: Austin JUGCassandra Talk: Austin JUG
Cassandra Talk: Austin JUG
 
AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
AI Infra Day | Composable PyTorch Distributed with PT2 @ MetaAI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
 
GDC 2012: Advanced Procedural Rendering in DX11
GDC 2012: Advanced Procedural Rendering in DX11GDC 2012: Advanced Procedural Rendering in DX11
GDC 2012: Advanced Procedural Rendering in DX11
 
PGConf APAC 2018 - High performance json postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json  postgre-sql vs. mongodbPGConf APAC 2018 - High performance json  postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json postgre-sql vs. mongodb
 
HPC Essentials 0
HPC Essentials 0HPC Essentials 0
HPC Essentials 0
 
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
 
HBaseCon 2015: OpenTSDB and AsyncHBase Update
HBaseCon 2015: OpenTSDB and AsyncHBase UpdateHBaseCon 2015: OpenTSDB and AsyncHBase Update
HBaseCon 2015: OpenTSDB and AsyncHBase Update
 
mloc.js 2014 - JavaScript and the browser as a platform for game development
mloc.js 2014 - JavaScript and the browser as a platform for game developmentmloc.js 2014 - JavaScript and the browser as a platform for game development
mloc.js 2014 - JavaScript and the browser as a platform for game development
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
 
Etl confessions pg conf us 2017
Etl confessions   pg conf us 2017Etl confessions   pg conf us 2017
Etl confessions pg conf us 2017
 
Web-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchWeb-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batch
 

Más de Sri Ambati

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Generative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxGenerative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxSri Ambati
 
AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek Sri Ambati
 
LLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thLLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thSri Ambati
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionSri Ambati
 
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Sri Ambati
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMsSri Ambati
 
Open-Source AI: Community is the Way
Open-Source AI: Community is the WayOpen-Source AI: Community is the Way
Open-Source AI: Community is the WaySri Ambati
 
Building Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OBuilding Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OSri Ambati
 
Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Sri Ambati
 
Cutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersCutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersSri Ambati
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Sri Ambati
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Sri Ambati
 
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...Sri Ambati
 
LLM Interpretability
LLM Interpretability LLM Interpretability
LLM Interpretability Sri Ambati
 
Never Reply to an Email Again
Never Reply to an Email AgainNever Reply to an Email Again
Never Reply to an Email AgainSri Ambati
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Sri Ambati
 
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...Sri Ambati
 
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...Sri Ambati
 
AI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneyAI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneySri Ambati
 

Cliff Click Explains GBM at Netflix October 10 2013

  • 1. Gradient Boosting Machine: Distributed Regression Trees on H2O Cliff Click, CTO 0xdata cliffc@0xdata.com http://0xdata.com http://cliffc.org/blog
  • 2. H2O is...
    ● Pure Java, Open Source: 0xdata.com
      ● https://github.com/0xdata/h2o/
    ● A Platform for doing Math
      ● Parallel Distributed Math
      ● In-memory analytics: GLM, GBM, RF, Logistic Reg
    ● Accessible via REST & JSON
    ● A K/V Store: ~150ns per get or put
    ● Distributed Fork/Join + Map/Reduce + K/V
  • 3. Agenda
    ● Building Blocks For Big Data:
      ● Vecs & Frames & Chunks
    ● Distributed Tree Algorithms
      ● Access Patterns & Execution
    ● GBM on H2O
    ● Performance
  • 4. A Collection of Distributed Vectors
        // A Distributed Vector
        //   much more than 2 billion elements
        class Vec {
          long length();                 // more than an int's worth
          // fast random access
          double at(long idx);           // Get the idx'th elem
          boolean isNA(long idx);
          void set(long idx, double d);  // writable
          void append(double d);         // variable sized
        }
  • 5. Frames
    A Frame: Vec[]
    [Diagram: five Vecs (age, sex, zip, ID, car) laid out across the heaps of JVM 1 to JVM 4]
    ● Vecs aligned in heaps
    ● Optimized for concurrent access
    ● Random access any row, any JVM
    ● But faster if local... more on that later
  • 6. Distributed Data Taxonomy
    A Chunk, Unit of Parallel Access
    [Diagram: five Vecs split into Chunks across the heaps of JVM 1 to JVM 4]
    ● Typically 1e3 to 1e6 elements
    ● Stored compressed
    ● In byte arrays
    ● Get/put is a few clock cycles including compression
  • 7. Distributed Parallel Execution
    [Diagram: five Vecs split into Chunks across the heaps of JVM 1 to JVM 4]
    ● All CPUs grab Chunks in parallel
    ● F/J load balances
    ● Code moves to Data
    ● Map/Reduce & F/J handles all sync
    ● H2O handles all comm, data manage
  • 8. Distributed Data Taxonomy
    Frame – a collection of Vecs
    Vec – a collection of Chunks
    Chunk – a collection of 1e3 to 1e6 elems
    elem – a java double
    Row i – i'th elements of all the Vecs in a Frame
  • 9. Distributed Coding Taxonomy
    ● No Distribution Coding:
      ● Whole Algorithms, Whole Vector-Math
      ● REST + JSON: e.g. load data, GLM, get results
    ● Simple Data-Parallel Coding:
      ● Per-Row (or neighbor row) Math
      ● Map/Reduce-style: e.g. Any dense linear algebra
    ● Complex Data-Parallel Coding
      ● K/V Store, Graph Algo's, e.g. PageRank
  • 10. Distributed Coding Taxonomy
    ● No Distribution Coding: Read the docs!
      ● Whole Algorithms, Whole Vector-Math
      ● REST + JSON: e.g. load data, GLM, get results
    ● Simple Data-Parallel Coding: This talk!
      ● Per-Row (or neighbor row) Math
      ● Map/Reduce-style: e.g. Any dense linear algebra
    ● Complex Data-Parallel Coding: Join our GIT!
      ● K/V Store, Graph Algo's, e.g. PageRank
  • 11. Simple Data-Parallel Coding
    ● Map/Reduce Per-Row: Stateless
    ● Example from Linear Regression, Σ y²
        double sumY2 = new MRTask() {
          double map( double d ) { return d*d; }
          double reduce( double d1, double d2 ) { return d1+d2; }
        }.doAll( vecY );
    ● Auto-parallel, auto-distributed
    ● Near Fortran speed, Java Ease
  • 12. Simple Data-Parallel Coding
    ● Map/Reduce Per-Row: State-full
    ● Linear Regression Pass1: Σ x, Σ y, Σ y²
        class LRPass1 extends MRTask {
          double sumX, sumY, sumY2; // I Can Haz State?
          void map( double X, double Y ) {
            sumX += X; sumY += Y; sumY2 += Y*Y;
          }
          void reduce( LRPass1 that ) {
            sumX += that.sumX ;
            sumY += that.sumY ;
            sumY2 += that.sumY2;
          }
        }
  • 13. Simple Data-Parallel Coding
    ● Map/Reduce Per-Row: Batch State-full
        class LRPass1 extends MRTask {
          double sumX, sumY, sumY2;
          void map( Chunk CX, Chunk CY ) {   // Whole Chunks
            for( int i=0; i<CX.len; i++ ) {  // Batch!
              double X = CX.at(i), Y = CY.at(i);
              sumX += X; sumY += Y; sumY2 += Y*Y;
            }
          }
          void reduce( LRPass1 that ) {
            sumX += that.sumX ;
            sumY += that.sumY ;
            sumY2 += that.sumY2;
          }
        }
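The batch map/reduce pattern on slides 12 and 13 can be tried without an H2O cluster. The following is a minimal single-JVM sketch using only the JDK: plain `double[]` arrays stand in for Vecs, index ranges stand in for Chunks, and a parallel stream stands in for the distributed Fork/Join runtime. The class name `LRPass1Sketch`, the `doAll` signature, and the `chunkLen` parameter are assumptions made for illustration and are not the H2O API.

```java
import java.util.stream.IntStream;

// Plain-JDK stand-in for the batch MRTask pattern; NOT the H2O API.
final class LRPass1Sketch {
    double sumX, sumY, sumY2;

    // "map": fold one chunk, i.e. rows [from, to), into this task's local sums.
    void map(double[] xs, double[] ys, int from, int to) {
        for (int i = from; i < to; i++) {
            double x = xs[i], y = ys[i];
            sumX += x;
            sumY += y;
            sumY2 += y * y;
        }
    }

    // "reduce": roll another task's partial sums into this one.
    void reduce(LRPass1Sketch that) {
        sumX += that.sumX;
        sumY += that.sumY;
        sumY2 += that.sumY2;
    }

    // Cut the rows into chunks of chunkLen, map chunks in parallel, then reduce.
    // IntStream.collect gives each worker its own accumulator, so map() never
    // sees concurrent writes; partial results are merged with reduce().
    static LRPass1Sketch doAll(double[] xs, double[] ys, int chunkLen) {
        int nChunks = (xs.length + chunkLen - 1) / chunkLen;
        return IntStream.range(0, nChunks).parallel().collect(
            LRPass1Sketch::new,
            (acc, c) -> acc.map(xs, ys, c * chunkLen,
                                Math.min(xs.length, (c + 1) * chunkLen)),
            LRPass1Sketch::reduce);
    }

    public static void main(String[] args) {
        double[] xs = {0, 1, 2, 3};
        double[] ys = {1.3, 1.1, 3.1, -2.1};
        LRPass1Sketch r = doAll(xs, ys, 2);
        System.out.println(r.sumX + " " + r.sumY + " " + r.sumY2);
    }
}
```

As on the slides, each task instance owns its sums while mapping, so the only place partial results meet is `reduce`.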
  • 14. GBM (for K-classifier)
    [Figure: the algorithm from Elements of Statistical Learning, 2nd Ed, 2009, Pg 387; Trevor Hastie, Robert Tibshirani, Jerome Friedman]
  • 15. Distributed Trees
    ● Overlay a Tree over the data
      ● Really: Assign a Tree Node to each Row
      ● Number the Nodes
      ● Store "Node_ID" per row in a temp Vec
          Vec nids = v.makeZero(); … nids.set(row,nid)...
    ● Make a pass over all Rows
      ● Nodes not visited in order...
      ● but all rows, all Nodes efficiently visited
      ● Do work (e.g. histogram) per Row/Node
  • 16. Distributed Trees
    ● An initial Tree
      ● All rows start on n0
      ● MRTask: compute stats
    Tree: n0 (MRTask.avg=0.85, MRTask.var=3.5075)
        X    Y     nids
        0    1.3   0
        1    1.1   0
        2    3.1   0
        3   -2.1   0
    ● Use the stats to make a decision...
      ● (varies by algorithm)!
      ● (e.g. lowest MSE, best col, best split)
  • 17. Distributed Trees
    ● Next layer in the Tree (and MRTask across rows)
      ● Each row: decide!
        – If "X<1.5" go right else left
      ● Compute stats per new leaf
    ● Each pass across all rows builds entire layer
    Tree: n0
            X>=1.5 → n1 (avg=0.5, var=6.76)
            X<1.5  → n2 (avg=1.2, var=0.01)
        X    Y     nids
        0    1.3   2
        1    1.1   2
        2    3.1   1
        3   -2.1   1
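The numbers on slides 16 and 17 can be reproduced with a small per-node statistics pass. The sketch below is plain Java, not H2O code: the class `NodeStatsSketch` and its methods are hypothetical names, an `int[]` plays the role of the temporary node-ID Vec, and the variance is the population variance E[y²] − E[y]², which is the convention that matches the slide values.

```java
// Single-JVM illustration of "store a node ID per row, then gather stats per node".
final class NodeStatsSketch {
    final long[] n;        // rows seen per node
    final double[] sum;    // sum of y per node
    final double[] sumSq;  // sum of y^2 per node

    NodeStatsSketch(int numNodes) {
        n = new long[numNodes];
        sum = new double[numNodes];
        sumSq = new double[numNodes];
    }

    void accum(int nid, double y) { n[nid]++; sum[nid] += y; sumSq[nid] += y * y; }

    double avg(int nid) { return sum[nid] / n[nid]; }

    // Population variance of the rows currently assigned to node nid.
    double var(int nid) { double m = avg(nid); return sumSq[nid] / n[nid] - m * m; }

    // One pass over all rows: each row contributes to the node it sits on.
    static NodeStatsSketch pass(double[] y, int[] nids, int numNodes) {
        NodeStatsSketch s = new NodeStatsSketch(numNodes);
        for (int r = 0; r < y.length; r++) s.accum(nids[r], y[r]);
        return s;
    }

    // Reassign rows after a decision: x < split goes to ltNode, the rest to geNode.
    static int[] split(double[] x, double split, int ltNode, int geNode) {
        int[] nids = new int[x.length];
        for (int r = 0; r < x.length; r++) nids[r] = x[r] < split ? ltNode : geNode;
        return nids;
    }

    public static void main(String[] args) {
        double[] x = {0, 1, 2, 3};
        double[] y = {1.3, 1.1, 3.1, -2.1};
        NodeStatsSketch root = pass(y, new int[x.length], 1);   // all rows on n0
        System.out.println("n0 avg=" + root.avg(0) + " var=" + root.var(0));
        NodeStatsSketch level1 = pass(y, split(x, 1.5, 2, 1), 3);
        System.out.println("n1 avg=" + level1.avg(1) + " n2 avg=" + level1.avg(2));
    }
}
```

With the four example rows this gives avg 0.85 and var 3.5075 at n0, then (0.5, 6.76) at n1 and (1.2, 0.01) at n2, up to floating-point rounding.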
  • 18. Distributed Trees
    ● Another MRTask, another layer...
    ● i.e., a 5-deep tree takes 5 passes
    ● Fully data-parallel for each tree level
    Tree: n0
            X>=1.5 → n1
                       X>=2.5 → n3 (avg=-2.1)
                       X<2.5  → n4 (avg=3.1)
            X<1.5  → n2 (avg=1.2)
        X    Y     nids
        0    1.3   2
        1    1.1   2
        2    3.1   4
        3   -2.1   3
  • 19. Distributed Trees
    ● Each pass is over one layer in the tree
    ● Builds per-node histogram in map+reduce calls
        class Pass extends MRTask2<Pass> {
          void map( Chunk chks[] ) {
            Chunk nids = chks[...];            // Node-IDs per row
            for( int r=0; r<nids.len; r++ ) {  // All rows
              int nid = (int)nids.at80(r);     // Node-ID THIS row
              // Lazy: not all Chunks see all Nodes
              if( dHisto[nid]==null ) dHisto[nid]=...
              // Accumulate histogram stats per node
              dHisto[nid].accum(chks,r);
            }
          }
        }.doAll(myDataFrame,nids);
  • 20. Distributed Trees
    ● Each pass analyzes one Tree level
      ● Then decide how to build next level
      ● Reassign Rows to new levels in another pass
        – (actually merge the two passes)
    ● Builds a Histogram-per-Node
      ● Which requires a reduce() call to roll up
      ● All Histograms for one level done in parallel
  • 21. Distributed Trees: utilities
    ● "score+build" in one pass:
      ● Test each row against decision from prior pass
      ● Assign to a new leaf
      ● Build histogram on that leaf
    ● "score": just walk the tree, and get results
    ● "compress": Tree from POJO to byte[]
      ● Easily 10x smaller, can still walk, score, print
    ● Plus utilities to walk, print, display
  • 22. GBM on Distributed Trees
    ● GBM builds 1 Tree, 1 level at a time, but...
    ● We run the entire level in parallel & distributed
      ● Built breadth-first because it's "free"
      ● More data offset by more CPUs
    ● Classic GBM otherwise
      ● ESL2, page 387
      ● Build residuals tree-by-tree
      ● Tuning knobs: trees, depth, shrinkage, min_rows
    ● Pure Java
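To make "build residuals tree-by-tree" concrete, here is a deliberately tiny boosting loop for squared-error regression on a single column, with depth-1 trees (stumps) and a shrinkage factor. It is a sketch under stated assumptions, not H2O's GBM: the class `GbmStumpSketch` is hypothetical, everything runs single-threaded in one JVM, and the depth and min_rows knobs from the slide are fixed or omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy gradient boosting for squared error: every tree is a one-split stump
// fitted to the current residuals. Illustration only.
final class GbmStumpSketch {
    static final class Stump {
        final double split, left, right;
        Stump(double split, double left, double right) {
            this.split = split; this.left = left; this.right = right;
        }
        double predict(double x) { return x < split ? left : right; }
    }

    final double init;          // initial prediction: mean of y
    final double shrinkage;     // learning rate applied to every stump
    final List<Stump> stumps = new ArrayList<>();

    GbmStumpSketch(double[] x, double[] y, int trees, double shrinkage) {
        this.shrinkage = shrinkage;
        this.init = Arrays.stream(y).average().orElse(0.0);
        double[] f = new double[y.length];
        Arrays.fill(f, init);
        double[] resid = new double[y.length];
        for (int t = 0; t < trees; t++) {
            for (int i = 0; i < y.length; i++) resid[i] = y[i] - f[i];
            Stump s = fitStump(x, resid);
            if (s == null) break;               // no split possible
            stumps.add(s);
            for (int i = 0; i < y.length; i++) f[i] += shrinkage * s.predict(x[i]);
        }
    }

    double predict(double x) {
        double p = init;
        for (Stump s : stumps) p += shrinkage * s.predict(x);
        return p;
    }

    double mse(double[] x, double[] y) {
        double sse = 0;
        for (int i = 0; i < y.length; i++) {
            double d = y[i] - predict(x[i]);
            sse += d * d;
        }
        return sse / y.length;
    }

    // Try a split midway between each pair of adjacent distinct x values and
    // keep the one with the lowest squared error; leaves predict the mean residual.
    static Stump fitStump(double[] x, double[] r) {
        double[] xs = x.clone();
        Arrays.sort(xs);
        Stump best = null;
        double bestErr = Double.POSITIVE_INFINITY;
        for (int k = 1; k < xs.length; k++) {
            if (xs[k] == xs[k - 1]) continue;
            double split = (xs[k] + xs[k - 1]) / 2;
            double sl = 0, sr = 0;
            int nl = 0, nr = 0;
            for (int i = 0; i < x.length; i++) {
                if (x[i] < split) { sl += r[i]; nl++; } else { sr += r[i]; nr++; }
            }
            double ml = sl / nl, mr = sr / nr, err = 0;
            for (int i = 0; i < x.length; i++) {
                double d = r[i] - (x[i] < split ? ml : mr);
                err += d * d;
            }
            if (err < bestErr) { bestErr = err; best = new Stump(split, ml, mr); }
        }
        return best;
    }
}
```

Calling `new GbmStumpSketch(x, y, 50, 0.1).mse(x, y)` on the four example rows from the tree slides gives a lower training error than the zero-tree model, which simply predicts the mean. In the distributed version described on the slides, the part that changes is how each tree is fitted (one parallel pass per level), not this outer residual loop.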
  • 23. GBM on Distributed Trees
    ● Limiting factor: latency in turning over a level
    ● About 4x faster than R single-node on covtype
    ● Does the per-level compute in parallel
    ● Requires sending histograms over network
      – Can get big for very deep tree
    ● Adding more data offset by adding more Nodes
  • 24. Summary: Write (parallel) Java
    ● Most simple Java "just works"
    ● Fast: parallel distributed reads, writes, appends
      ● Reads same speed as plain Java array loads
      ● Writes, appends: slightly slower (compression)
      ● Typically memory bandwidth limited
        – (may be CPU limited in a few cases)
    ● Slower: conflicting writes (but follows strict JMM)
      ● Also supports transactional updates
  • 25. Summary: Writing Analytics
    ● We're writing Big Data Analytics
      ● Generalized Linear Modeling (ADMM, GLMNET)
        – Logistic Regression, Poisson, Gamma
      ● Random Forest, GBM, KMeans++, KNN
    ● State-of-the-art Algorithms, running Distributed
    ● Solidly working on 100G datasets
      ● Heading for Tera Scale
    ● Paying customers (in production!)
    ● Come write your own (distributed) algorithm!!!
  • 26. Cool Systems Stuff...
    ● … that I ran out of space for
    ● Reliable UDP, integrated w/RPC
    ● TCP is reliably UNReliable
      ● Already have a reliable UDP framework, so no prob
    ● Fork/Join Goodies:
      ● Priority Queues
      ● Distributed F/J
      ● Surviving fork bombs & lost threads
    ● K/V does JMM via hardware-like MESI protocol
  • 27. H2O is...
    ● Pure Java, Open Source: 0xdata.com
      ● https://github.com/0xdata/h2o/
    ● A Platform for doing Math
      ● Parallel Distributed Math
      ● In-memory analytics: GLM, GBM, RF, Logistic Reg
    ● Accessible via REST & JSON
    ● A K/V Store: ~150ns per get or put
    ● Distributed Fork/Join + Map/Reduce + K/V
  • 28. The Platform
    [Diagram: two JVMs (JVM 1, JVM 2) with the same layered stack.
     Class layers: User code? → extends MRTask → extends DRemoteTask → extends DTask → extends Iced → byte[]
     Runtime layers: D/F/J, RPC, AutoBuffer
     Between the JVMs: K/V get/put over UDP / TCP
     Storage under each JVM: NFS, HDFS]
  • 29. Other Simple Examples
    ● Filter & Count (underage males):
      ● (can pass in any number of Vecs or a Frame)
        long count = new MRTask() {
          long map( long age, long sex ) {
            return (age<=17 && sex==MALE) ? 1 : 0;
          }
          long reduce( long d1, long d2 ) {
            return d1+d2;
          }
        }.doAll( vecAge, vecSex );
  • 30. Other Simple Examples
    ● Filter into new set (underage males):
      ● Can write or append subset of rows
        – (append order is preserved)
        class Filter extends MRTask {
          void map(Chunk CRisk, Chunk CAge, Chunk CSex){
            for( int i=0; i<CAge.len; i++ )
              if( CAge.at(i)<=17 && CSex.at(i)==MALE )
                CRisk.append(CAge.at(i)); // build a set
          }
        };
        Vec risk = new AppendableVec();
        new Filter().doAll( risk, vecAge, vecSex );
        ...risk... // all the underage males
  • 32. Other Simple Examples
    ● Group-by: count of car-types by age
        class AgeHisto extends MRTask {
          long carAges[][]; // count of cars by age
          void map( Chunk CAge, Chunk CCar ) {
            carAges = new long[numAges][numCars];
            for( int i=0; i<CAge.len; i++ )
              carAges[(int)CAge.at(i)][(int)CCar.at(i)]++;
          }
          void reduce( AgeHisto that ) {
            for( int i=0; i<carAges.length; i++ )
              for( int j=0; j<carAges[i].length; j++ )
                carAges[i][j] += that.carAges[i][j];
          }
        }
  • 33. Other Simple Examples
    ● Group-by: count of car-types by age
    Annotations on the code:
      ● Setting carAges in map() makes it an output field.
      ● Private per-map call, single-threaded write access.
      ● Must be rolled-up in the reduce call.
        class AgeHisto extends MRTask {
          long carAges[][]; // count of cars by age
          void map( Chunk CAge, Chunk CCar ) {
            carAges = new long[numAges][numCars];
            for( int i=0; i<CAge.len; i++ )
              carAges[(int)CAge.at(i)][(int)CCar.at(i)]++;
          }
          void reduce( AgeHisto that ) {
            for( int i=0; i<carAges.length; i++ )
              for( int j=0; j<carAges[i].length; j++ )
                carAges[i][j] += that.carAges[i][j];
          }
        }
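The group-by slides rely on two properties: each map call writes to its own private table, and reduce adds tables element by element. A plain-JDK sketch of that pattern follows. It assumes ages and car types are already encoded as small non-negative ints, and the names `AgeHistoSketch`, `doAll`, and `chunkLen` are hypothetical stand-ins rather than H2O classes or parameters.

```java
import java.util.stream.IntStream;

// Single-JVM illustration of a group-by count with private per-task tables.
final class AgeHistoSketch {
    final long[][] counts;   // counts[age][carType]

    AgeHistoSketch(int numAges, int numCars) { counts = new long[numAges][numCars]; }

    // "map": count the rows of one chunk, rows [from, to), into this task's table.
    void map(int[] age, int[] car, int from, int to) {
        for (int i = from; i < to; i++) counts[age[i]][car[i]]++;
    }

    // "reduce": element-wise roll-up of another task's table.
    void reduce(AgeHistoSketch that) {
        for (int i = 0; i < counts.length; i++)
            for (int j = 0; j < counts[i].length; j++)
                counts[i][j] += that.counts[i][j];
    }

    static AgeHistoSketch doAll(int[] age, int[] car,
                                int numAges, int numCars, int chunkLen) {
        int nChunks = (age.length + chunkLen - 1) / chunkLen;
        return IntStream.range(0, nChunks).parallel().collect(
            () -> new AgeHistoSketch(numAges, numCars),
            (acc, c) -> acc.map(age, car, c * chunkLen,
                                Math.min(age.length, (c + 1) * chunkLen)),
            AgeHistoSketch::reduce);
    }
}
```

Note the inner loop bound in `reduce`: it tests `j` against `counts[i].length`, which is the loop the slide code intends.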
  • 34. Other Simple Examples
    ● Uniques
      ● Uses distributed hash set
        class Uniques extends MRTask {
          DNonBlockingHashSet<Long> dnbhs = new ...;
          void map( long id ) { dnbhs.add(id); }
          void reduce( Uniques that ) {
            dnbhs.putAll(that.dnbhs);
          }
        };
        long uniques = new Uniques().
          doAll( vecVisitors ).dnbhs.size();
  • 35. Other Simple Examples
    ● Uniques
      ● Uses distributed hash set
    Annotations on the code:
      ● Setting dnbhs in <init> makes it an input field.
      ● Shared across all maps(). Often read-only.
      ● This one is written, so needs a reduce.
        class Uniques extends MRTask {
          DNonBlockingHashSet<Long> dnbhs = new ...;
          void map( long id ) { dnbhs.add(id); }
          void reduce( Uniques that ) {
            dnbhs.putAll(that.dnbhs);
          }
        };
        long uniques = new Uniques().
          doAll( vecVisitors ).dnbhs.size();
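For comparison, the "shared, written input field" idea can be sketched inside one JVM with a concurrent set from the JDK. This is an illustration only: `UniquesSketch` is a hypothetical name, and `ConcurrentHashMap.newKeySet()` stands in for the distributed hash set on the slide. Within a single JVM all workers see the same set, so no reduce step is needed; across JVMs each node would hold its own copy, which is why the slide's version merges them in `reduce`.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.LongStream;

// Count distinct IDs with one thread-safe set shared by all parallel workers.
final class UniquesSketch {
    static long countUniques(long[] ids) {
        Set<Long> seen = ConcurrentHashMap.newKeySet();
        // Every worker adds into the same set; duplicates are ignored by add().
        LongStream.of(ids).parallel().forEach(id -> seen.add(id));
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(countUniques(new long[] {7, 3, 7, 9, 3, 3}));
    }
}
```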