2. H2O is...
● Pure Java, Open Source: 0xdata.com
  ● https://github.com/0xdata/h2o/
● A Platform for doing Math
  ● Parallel Distributed Math
  ● In-memory analytics: GLM, GBM, RF, Logistic Reg
  ● Accessible via REST & JSON
● A K/V Store: ~150ns per get or put
● Distributed Fork/Join + Map/Reduce + K/V
0xdata.com
3. Agenda
● Building Blocks For Big Data: Vecs & Frames & Chunks
● Distributed Tree Algorithms: Access Patterns & Execution
● GBM on H2O
● Performance
4. A Collection of Distributed Vectors
// A Distributed Vector: much more than 2 billion elements
class Vec {
  long length();                // more than an int's worth
  double at(long idx);          // fast random access: get the idx'th elem
  boolean isNA(long idx);
  void set(long idx, double d); // writable
  void append(double d);        // variable sized
}
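As a rough mental model of that contract (not H2O's real distributed, compressed implementation), a minimal single-JVM sketch might look like this; the NaN-as-NA encoding is an assumption of the sketch:

```java
import java.util.ArrayList;

// Toy, in-memory stand-in for the Vec contract above -- hypothetical,
// single-node, uncompressed; real Vecs hold far more than 2B elements.
class ToyVec {
  private final ArrayList<Double> elems = new ArrayList<>();
  long length() { return elems.size(); }
  double at(long idx) { return elems.get((int)idx); }       // random access
  boolean isNA(long idx) { return Double.isNaN(at(idx)); }  // NA encoded as NaN here
  void set(long idx, double d) { elems.set((int)idx, d); }  // writable
  void append(double d) { elems.add(d); }                   // variable sized
}
```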
5. Frames
A Frame: a Vec[] (e.g. columns: age, sex, zip, ID, car)
[Diagram: the Vecs of a Frame aligned across the heaps of JVMs 1-4]
● Vecs aligned in heaps
● Optimized for concurrent access
● Random access: any row, any JVM
● But faster if local... more on that later
6. Distributed Data Taxonomy
A Chunk: the Unit of Parallel Access
[Diagram: five Vecs, each split into Chunks across the heaps of JVMs 1-4]
● Typically 1e3 to 1e6 elements
● Stored compressed, in byte arrays
● Get/put is a few clock cycles, including compression
7. Distributed Parallel Execution
[Diagram: five Vecs, chunked across the heaps of JVMs 1-4]
● All CPUs grab Chunks in parallel
● F/J load balances: Code moves to Data
● Map/Reduce & F/J handle all sync
● H2O handles all comm & data management
8. Distributed Data Taxonomy
Frame – a collection of Vecs
Vec – a collection of Chunks
Chunk – a collection of 1e3 to 1e6 elems
elem – a Java double
Row i – i'th elements of all the Vecs in a Frame
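Assuming, for illustration, a fixed chunk length (real chunks are variable-sized, per the taxonomy above), locating an element is pure arithmetic: the chunk index and the offset inside it:

```java
// Sketch: with a (hypothetical) constant chunk length, a global element
// index maps to a (chunk, offset) pair by division and remainder.
class ChunkAddr {
  static long chunkIdx(long idx, long chunkLen) { return idx / chunkLen; }
  static long offset  (long idx, long chunkLen) { return idx % chunkLen; }
}
```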
9. Distributed Coding Taxonomy
● No Distribution Coding:
  ● Whole Algorithms, Whole Vector-Math
  ● REST + JSON: e.g. load data, GLM, get results
● Simple Data-Parallel Coding:
  ● Per-Row (or neighbor-row) Math
  ● Map/Reduce-style: e.g. any dense linear algebra
● Complex Data-Parallel Coding:
  ● K/V Store, Graph Algos: e.g. PageRank
10. Distributed Coding Taxonomy
● No Distribution Coding: Read the docs!
  ● Whole Algorithms, Whole Vector-Math
  ● REST + JSON: e.g. load data, GLM, get results
● Simple Data-Parallel Coding: This talk!
  ● Per-Row (or neighbor-row) Math
  ● Map/Reduce-style: e.g. any dense linear algebra
● Complex Data-Parallel Coding: Join our GIT!
  ● K/V Store, Graph Algos: e.g. PageRank
11. Simple Data-Parallel Coding
● Map/Reduce Per-Row: Stateless
● Example from Linear Regression: Σ y²

double sumY2 = new MRTask() {
  double map( double d ) { return d*d; }
  double reduce( double d1, double d2 ) { return d1+d2; }
}.doAll( vecY );

● Auto-parallel, auto-distributed
● Near-Fortran speed, Java ease
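The same stateless map(d*d)/reduce(+) pattern can be sketched in plain Java with parallel streams, to show what the MRTask above automates across a cluster; here `y` is a hypothetical in-memory stand-in for vecY:

```java
import java.util.stream.DoubleStream;

class SumY2 {
  // Map each element to d*d, then reduce with +.
  // Parallel like MRTask, but single-JVM instead of cluster-wide.
  static double sumY2(double[] y) {
    return DoubleStream.of(y).parallel().map(d -> d * d).sum();
  }
}
```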
14. GBM (for K-classifier)
Elements of Statistical Learning, 2nd Ed, 2009, pg 387
Trevor Hastie, Robert Tibshirani, Jerome Friedman
15. Distributed Trees
● Overlay a Tree over the data
  ● Really: assign a Tree Node to each Row

  Vec nids = v.makeZero();
  ... nids.set(row,nid) ...

● Number the Nodes
● Store "Node_ID" per row in a temp Vec
● Make a pass over all Rows
  ● Nodes are not visited in order...
  ● but all rows, all Nodes are efficiently visited
  ● Do work (e.g. histogram) per Row/Node
16. Distributed Trees
● An initial Tree
  ● All rows start on node n0
  ● MRTask: compute stats

Tree: n0 (all rows)

 X |  Y   | nids
---+------+-----
 0 |  1.3 |  0
 1 |  1.1 |  0
 2 |  3.1 |  0
 3 | -2.1 |  0

MRTask.avg = 0.85, MRTask.var = 3.5075

● Use the stats to make a decision...
  ● (varies by algorithm!)
  ● (e.g. lowest MSE, best col, best split)
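The per-node stats here are just the mean and the population variance; a quick sketch reproduces the slide's 0.85 / 3.5075 numbers for Y = {1.3, 1.1, 3.1, -2.1}:

```java
class NodeStats {
  // Mean of the responses landing on a node.
  static double mean(double[] y) {
    double s = 0;
    for (double d : y) s += d;
    return s / y.length;
  }
  // Population variance (divide by n), matching the slide's MRTask.var.
  static double var(double[] y) {
    double m = mean(y), s = 0;
    for (double d : y) s += (d - m) * (d - m);
    return s / y.length;
  }
}
```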
17. Distributed Trees
● Next layer in the Tree (and an MRTask across rows)
● Each row: decide!
  – If "X<1.5" go right else left
● Compute stats per new leaf
● Each pass across all rows builds an entire layer

Tree: n0 ── X>=1.5 ──> n1 (avg=0.5, var=6.76)
      └─── X<1.5  ──> n2 (avg=1.2, var=0.01)

 X |  Y   | nids
---+------+-----
 0 |  1.3 |  2
 1 |  1.1 |  2
 2 |  3.1 |  1
 3 | -2.1 |  1
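The reassignment step can be sketched in plain Java: one pass over all rows, applying the split to write each row's new node id. The node numbering below follows this slide (n2 for X<1.5, n1 for X>=1.5); the method name is hypothetical:

```java
class SplitPass {
  // One pass over all rows: decide the new leaf per row.
  // Rows left of the split go to n2, the rest to n1 (slide's numbering).
  static int[] assign(double[] x, double split) {
    int[] nids = new int[x.length];
    for (int r = 0; r < x.length; r++)
      nids[r] = x[r] < split ? 2 : 1;
    return nids;
  }
}
```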
18. Distributed Trees
● Another MRTask, another layer...
  ● i.e., a 5-deep tree takes 5 passes
● Fully data-parallel for each tree level

Tree: n0 ── X<1.5  ──> n2 (leaf, avg=1.2)
      └─── X>=1.5 ──> n1 ── X<2.5  ──> n4 (avg=3.1)
                        └── X>=2.5 ──> n3 (avg=-2.1)

 X |  Y   | nids
---+------+-----
 0 |  1.3 |  2
 1 |  1.1 |  2
 2 |  3.1 |  4
 3 | -2.1 |  3
19. Distributed Trees
● Each pass is over one layer in the tree
● Builds per-node histograms in map+reduce calls

class Pass extends MRTask2<Pass> {
  void map( Chunk chks[] ) {
    Chunk nids = chks[...];            // Node-IDs per row
    for( int r=0; r<nids.len; r++ ) {  // All rows
      int nid = nids.at80(r);          // Node-ID of THIS row
      // Lazy: not all Chunks see all Nodes
      if( dHisto[nid]==null ) dHisto[nid]=...
      // Accumulate histogram stats per node
      dHisto[nid].accum(chks,r);
    }
  }
}.doAll(myDataFrame,nids);
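A single-JVM toy of the same pass (hypothetical names; the real MRTask2/Chunk machinery is elided): walk the rows, lazily create an accumulator the first time a node id is seen, and accumulate per-node stats. Here the "histogram" is just a count and a sum of Y:

```java
import java.util.HashMap;
import java.util.Map;

class HistoPass {
  // Lazy per-node accumulators, keyed by node id: [count, sum(Y)].
  static Map<Integer, double[]> pass(int[] nids, double[] y) {
    Map<Integer, double[]> dHisto = new HashMap<>();
    for (int r = 0; r < nids.length; r++) {   // all rows
      int nid = nids[r];                      // node-ID of THIS row
      double[] h = dHisto.computeIfAbsent(nid, k -> new double[2]); // lazy
      h[0] += 1;                              // count
      h[1] += y[r];                           // sum of Y
    }
    return dHisto;
  }
}
```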
20. Distributed Trees
● Each pass analyzes one Tree level
● Then decide how to build the next level
● Reassign Rows to new levels in another pass
  – (actually the two passes are merged)
● Builds a Histogram-per-Node
  ● Which requires a reduce() call to roll up
  ● All Histograms for one level are done in parallel
21. Distributed Trees: utilities
● "score+build" in one pass:
  ● Test each row against the decision from the prior pass
  ● Assign it to a new leaf
  ● Build a histogram on that leaf
● "score": just walk the tree, and get results
● "compress": Tree from POJO to byte[]
  ● Easily 10x smaller; can still walk, score, print
● Plus utilities to walk, print, display
22. GBM on Distributed Trees
● GBM builds 1 Tree, 1 level at a time, but...
  ● We run the entire level in parallel & distributed
  ● Built breadth-first because it's "free"
  ● More data is offset by more CPUs
● Classic GBM otherwise
  ● Build residuals tree-by-tree
  ● ESL2, page 387
  ● Tuning knobs: trees, depth, shrinkage, min_rows
● Pure Java
23. GBM on Distributed Trees
● Limiting factor: latency in turning over a level
● About 4x faster than single-node R on covtype
● Does the per-level compute in parallel
● Requires sending histograms over the network
  – These can get big for very deep trees
● Adding more data is offset by adding more Nodes
24. Summary: Write (parallel) Java
● Most simple Java "just works"
● Fast: parallel distributed reads, writes, appends
  ● Reads: same speed as plain Java array loads
  ● Writes, appends: slightly slower (compression)
  ● Typically memory-bandwidth limited
    – (may be CPU limited in a few cases)
● Slower: conflicting writes (but follows a strict JMM)
  ● Also supports transactional updates
25. Summary: Writing Analytics
● We're writing Big Data Analytics
  ● Generalized Linear Modeling (ADMM, GLMNET)
    – Logistic Regression, Poisson, Gamma
  ● Random Forest, GBM, KMeans++, KNN
● State-of-the-art Algorithms, running Distributed
● Solidly working on 100G datasets
● Heading for Tera Scale
● Paying customers (in production!)
● Come write your own (distributed) algorithm!
26. Cool Systems Stuff...
● ...that I ran out of space for
● Reliable UDP, integrated w/ RPC
  ● TCP is reliably UNreliable
  ● We already have a reliable UDP framework, so no problem
● Fork/Join goodies:
  ● Distributed F/J
  ● Priority queues
  ● Surviving fork bombs & lost threads
● K/V does the JMM via a hardware-like MESI protocol
27. H2O is...
● Pure Java, Open Source: 0xdata.com
  ● https://github.com/0xdata/h2o/
● A Platform for doing Math
  ● Parallel Distributed Math
  ● In-memory analytics: GLM, GBM, RF, Logistic Reg
  ● Accessible via REST & JSON
● A K/V Store: ~150ns per get or put
● Distributed Fork/Join + Map/Reduce + K/V
29. Other Simple Examples
● Filter & Count (underage males):
  ● (can pass in any number of Vecs or a Frame)

long count = new MRTask() {
  long map( long age, long sex ) {
    return (age<=17 && sex==MALE) ? 1 : 0;
  }
  long reduce( long d1, long d2 ) { return d1+d2; }
}.doAll( vecAge, vecSex );
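The same filter-and-count can be sketched with plain-Java parallel streams over two (hypothetical) in-memory columns; the MALE=1 encoding is an assumption of the sketch:

```java
import java.util.stream.IntStream;

class FilterCount {
  static final long MALE = 1; // hypothetical category encoding

  // Row-wise predicate, reduced to a count -- parallel like the MRTask,
  // but single-JVM.
  static long countUnderageMales(long[] age, long[] sex) {
    return IntStream.range(0, age.length).parallel()
        .filter(i -> age[i] <= 17 && sex[i] == MALE)
        .count();
  }
}
```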
30. Other Simple Examples
● Filter into a new set (underage males):
  ● Can write or append a subset of rows
    – (append order is preserved)

class Filter extends MRTask {
  void map( Chunk CRisk, Chunk CAge, Chunk CSex ) {
    for( int i=0; i<CAge.len; i++ )
      if( CAge.at(i)<=17 && CSex.at(i)==MALE )
        CRisk.append(CAge.at(i)); // build a set
  }
};
Vec risk = new AppendableVec();
new Filter().doAll( risk, vecAge, vecSex );
...risk... // all the underage males
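A single-JVM sketch of the same append-a-subset pattern (hypothetical names; MALE=1 is an assumed encoding): matching rows are appended in row order, mirroring the order-preserving AppendableVec behavior:

```java
import java.util.ArrayList;
import java.util.List;

class FilterSet {
  static final long MALE = 1; // hypothetical category encoding

  // Append each matching row's age; append order follows row order.
  static List<Long> risk(long[] age, long[] sex) {
    List<Long> out = new ArrayList<>();
    for (int i = 0; i < age.length; i++)
      if (age[i] <= 17 && sex[i] == MALE)
        out.add(age[i]);
    return out;
  }
}
```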
32. Other Simple Examples
● Group-by: count of car-types by age

class AgeHisto extends MRTask {
  long carAges[][]; // count of cars by age
  void map( Chunk CAge, Chunk CCar ) {
    carAges = new long[numAges][numCars];
    for( int i=0; i<CAge.len; i++ )
      carAges[(int)CAge.at(i)][(int)CCar.at(i)]++;
  }
  void reduce( AgeHisto that ) {
    for( int i=0; i<carAges.length; i++ )
      for( int j=0; j<carAges[i].length; j++ )
        carAges[i][j] += that.carAges[i][j];
  }
}
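A plain-Java toy of the same pattern (hypothetical names): each "map" call builds a private 2-D count array over its chunk of rows, then a "reduce" rolls one map's counts into another's:

```java
class AgeHistoToy {
  // One "map" over a chunk: private 2-D counts, single-threaded writes.
  static long[][] map(int[] age, int[] car, int numAges, int numCars) {
    long[][] h = new long[numAges][numCars];
    for (int i = 0; i < age.length; i++)
      h[age[i]][car[i]]++;
    return h;
  }
  // "reduce": roll another map's private counts into this one's.
  static void reduce(long[][] a, long[][] b) {
    for (int i = 0; i < a.length; i++)
      for (int j = 0; j < a[i].length; j++)
        a[i][j] += b[i][j];
  }
}
```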
33. Other Simple Examples
● Group-by: count of car-types by age
● Setting carAges in map() makes it an output field: private per map() call, with single-threaded write access. It must be rolled up in the reduce call.

class AgeHisto extends MRTask {
  long carAges[][]; // count of cars by age
  void map( Chunk CAge, Chunk CCar ) {
    carAges = new long[numAges][numCars];
    for( int i=0; i<CAge.len; i++ )
      carAges[(int)CAge.at(i)][(int)CCar.at(i)]++;
  }
  void reduce( AgeHisto that ) {
    for( int i=0; i<carAges.length; i++ )
      for( int j=0; j<carAges[i].length; j++ )
        carAges[i][j] += that.carAges[i][j];
  }
}
34. Other Simple Examples
● Uniques
  ● Uses a distributed hash set

class Uniques extends MRTask {
  DNonBlockingHashSet<Long> dnbhs = new ...;
  void map( long id ) { dnbhs.add(id); }
  void reduce( Uniques that ) { dnbhs.putAll(that.dnbhs); }
};
long uniques = new Uniques().doAll( vecVisitors ).dnbhs.size();
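A single-JVM sketch of the same idea, substituting a standard concurrent set for the distributed DNonBlockingHashSet (the class and method names here are the sketch's own): maps add ids into a shared set, and reduce merges the sets from different nodes:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class UniquesToy {
  // Stand-in for DNonBlockingHashSet: a concurrent set shared by all maps.
  final Set<Long> ids = ConcurrentHashMap.newKeySet();
  void map(long id) { ids.add(id); }                     // add one row's id
  void reduce(UniquesToy that) { ids.addAll(that.ids); } // merge per-node sets
}
```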
35. Other Simple Examples
● Uniques
  ● Uses a distributed hash set
● Setting dnbhs in <init> makes it an input field: shared across all map() calls, and often read-only. This one is written, so it needs a reduce.

class Uniques extends MRTask {
  DNonBlockingHashSet<Long> dnbhs = new ...;
  void map( long id ) { dnbhs.add(id); }
  void reduce( Uniques that ) { dnbhs.putAll(that.dnbhs); }
};
long uniques = new Uniques().doAll( vecVisitors ).dnbhs.size();