2. H2O is...
● Pure Java, Open Source: 0xdata.com
  ● https://github.com/0xdata/h2o/
● A Platform for doing Math
  ● Parallel Distributed Math
  ● In-memory analytics: GLM, GBM, RF, Logistic Reg
  ● Accessible via REST & JSON
● A K/V Store: ~150ns per get or put
● Distributed Fork/Join + Map/Reduce + K/V
0xdata.com
3. Agenda
● Building Blocks For Big Data: Vecs & Frames & Chunks
● Distributed Tree Algorithms: Access Patterns & Execution
● GBM on H2O
● Performance
4. A Collection of Distributed Vectors
// A Distributed Vector: much more than 2 billion elements
class Vec {
  long length();                // more than an int's worth
  double at(long idx);          // fast random access: get the idx'th elem
  boolean isNA(long idx);
  void set(long idx, double d); // writable
  void append(double d);        // variable sized
}
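As a rough mental model of that contract (not H2O's real distributed, compressed implementation), a minimal single-JVM sketch might look like this; the NaN-as-NA encoding is an assumption of the sketch:

```java
import java.util.ArrayList;

// Toy, in-memory stand-in for the Vec contract above -- hypothetical,
// single-node, uncompressed; real Vecs hold far more than 2B elements.
class ToyVec {
  private final ArrayList<Double> elems = new ArrayList<>();
  long length() { return elems.size(); }
  double at(long idx) { return elems.get((int)idx); }       // random access
  boolean isNA(long idx) { return Double.isNaN(at(idx)); }  // NA encoded as NaN here
  void set(long idx, double d) { elems.set((int)idx, d); }  // writable
  void append(double d) { elems.add(d); }                   // variable sized
}
```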
5. Frames
A Frame: a Vec[] (e.g. columns: age, sex, zip, ID, car)
[Diagram: the Vecs of a Frame aligned across the heaps of JVMs 1-4]
● Vecs aligned in heaps
● Optimized for concurrent access
● Random access: any row, any JVM
● But faster if local... more on that later
6. Distributed Data Taxonomy
A Chunk: the Unit of Parallel Access
[Diagram: five Vecs, each split into Chunks across the heaps of JVMs 1-4]
● Typically 1e3 to 1e6 elements
● Stored compressed, in byte arrays
● Get/put is a few clock cycles, including compression
7. Distributed Parallel Execution
[Diagram: five Vecs, chunked across the heaps of JVMs 1-4]
● All CPUs grab Chunks in parallel
● F/J load balances: Code moves to Data
● Map/Reduce & F/J handle all sync
● H2O handles all comm & data management
8. Distributed Data Taxonomy
Frame – a collection of Vecs
Vec – a collection of Chunks
Chunk – a collection of 1e3 to 1e6 elems
elem – a Java double
Row i – i'th elements of all the Vecs in a Frame
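Assuming, for illustration, a fixed chunk length (real chunks are variable-sized, per the taxonomy above), locating an element is pure arithmetic: the chunk index and the offset inside it:

```java
// Sketch: with a (hypothetical) constant chunk length, a global element
// index maps to a (chunk, offset) pair by division and remainder.
class ChunkAddr {
  static long chunkIdx(long idx, long chunkLen) { return idx / chunkLen; }
  static long offset  (long idx, long chunkLen) { return idx % chunkLen; }
}
```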
9. Distributed Coding Taxonomy
● No Distribution Coding:
  ● Whole Algorithms, Whole Vector-Math
  ● REST + JSON: e.g. load data, GLM, get results
● Simple Data-Parallel Coding:
  ● Per-Row (or neighbor-row) Math
  ● Map/Reduce-style: e.g. any dense linear algebra
● Complex Data-Parallel Coding:
  ● K/V Store, Graph Algos: e.g. PageRank
10. Distributed Coding Taxonomy
● No Distribution Coding: Read the docs!
  ● Whole Algorithms, Whole Vector-Math
  ● REST + JSON: e.g. load data, GLM, get results
● Simple Data-Parallel Coding: This talk!
  ● Per-Row (or neighbor-row) Math
  ● Map/Reduce-style: e.g. any dense linear algebra
● Complex Data-Parallel Coding: Join our GIT!
  ● K/V Store, Graph Algos: e.g. PageRank
11. Simple Data-Parallel Coding
● Map/Reduce Per-Row: Stateless
● Example from Linear Regression: Σ y²

double sumY2 = new MRTask() {
  double map( double d ) { return d*d; }
  double reduce( double d1, double d2 ) { return d1+d2; }
}.doAll( vecY );

● Auto-parallel, auto-distributed
● Near-Fortran speed, Java ease
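The same stateless map(d*d)/reduce(+) pattern can be sketched in plain Java with parallel streams, to show what the MRTask above automates across a cluster; here `y` is a hypothetical in-memory stand-in for vecY:

```java
import java.util.stream.DoubleStream;

class SumY2 {
  // Map each element to d*d, then reduce with +.
  // Parallel like MRTask, but single-JVM instead of cluster-wide.
  static double sumY2(double[] y) {
    return DoubleStream.of(y).parallel().map(d -> d * d).sum();
  }
}
```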
14. GBM (for K-classifier)
Elements of Statistical Learning, 2nd Ed, 2009, pg 387
Trevor Hastie, Robert Tibshirani, Jerome Friedman
15. Distributed Trees
● Overlay a Tree over the data
  ● Really: assign a Tree Node to each Row

  Vec nids = v.makeZero();
  ... nids.set(row,nid) ...

● Number the Nodes
● Store "Node_ID" per row in a temp Vec
● Make a pass over all Rows
  ● Nodes are not visited in order...
  ● but all rows, all Nodes are efficiently visited
  ● Do work (e.g. histogram) per Row/Node
16. Distributed Trees
● An initial Tree
  ● All rows start on node n0
  ● MRTask: compute stats

Tree: n0 (all rows)

 X |  Y   | nids
---+------+-----
 0 |  1.3 |  0
 1 |  1.1 |  0
 2 |  3.1 |  0
 3 | -2.1 |  0

MRTask.avg = 0.85, MRTask.var = 3.5075

● Use the stats to make a decision...
  ● (varies by algorithm!)
  ● (e.g. lowest MSE, best col, best split)
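The per-node stats here are just the mean and the population variance; a quick sketch reproduces the slide's 0.85 / 3.5075 numbers for Y = {1.3, 1.1, 3.1, -2.1}:

```java
class NodeStats {
  // Mean of the responses landing on a node.
  static double mean(double[] y) {
    double s = 0;
    for (double d : y) s += d;
    return s / y.length;
  }
  // Population variance (divide by n), matching the slide's MRTask.var.
  static double var(double[] y) {
    double m = mean(y), s = 0;
    for (double d : y) s += (d - m) * (d - m);
    return s / y.length;
  }
}
```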
17. Distributed Trees
● Next layer in the Tree (and an MRTask across rows)
● Each row: decide!
  – If "X<1.5" go right else left
● Compute stats per new leaf
● Each pass across all rows builds an entire layer

Tree: n0 ── X>=1.5 ──> n1 (avg=0.5, var=6.76)
      └─── X<1.5  ──> n2 (avg=1.2, var=0.01)

 X |  Y   | nids
---+------+-----
 0 |  1.3 |  2
 1 |  1.1 |  2
 2 |  3.1 |  1
 3 | -2.1 |  1
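The reassignment step can be sketched in plain Java: one pass over all rows, applying the split to write each row's new node id. The node numbering below follows this slide (n2 for X<1.5, n1 for X>=1.5); the method name is hypothetical:

```java
class SplitPass {
  // One pass over all rows: decide the new leaf per row.
  // Rows left of the split go to n2, the rest to n1 (slide's numbering).
  static int[] assign(double[] x, double split) {
    int[] nids = new int[x.length];
    for (int r = 0; r < x.length; r++)
      nids[r] = x[r] < split ? 2 : 1;
    return nids;
  }
}
```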
18. Distributed Trees
● Another MRTask, another layer...
  ● i.e., a 5-deep tree takes 5 passes
● Fully data-parallel for each tree level

Tree: n0 ── X<1.5  ──> n2 (leaf, avg=1.2)
      └─── X>=1.5 ──> n1 ── X<2.5  ──> n4 (avg=3.1)
                        └── X>=2.5 ──> n3 (avg=-2.1)

 X |  Y   | nids
---+------+-----
 0 |  1.3 |  2
 1 |  1.1 |  2
 2 |  3.1 |  4
 3 | -2.1 |  3
19. Distributed Trees
● Each pass is over one layer in the tree
● Builds per-node histograms in map+reduce calls

class Pass extends MRTask2<Pass> {
  void map( Chunk chks[] ) {
    Chunk nids = chks[...];            // Node-IDs per row
    for( int r=0; r<nids.len; r++ ) {  // All rows
      int nid = nids.at80(r);          // Node-ID of THIS row
      // Lazy: not all Chunks see all Nodes
      if( dHisto[nid]==null ) dHisto[nid]=...
      // Accumulate histogram stats per node
      dHisto[nid].accum(chks,r);
    }
  }
}.doAll(myDataFrame,nids);
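A single-JVM toy of the same pass (hypothetical names; the real MRTask2/Chunk machinery is elided): walk the rows, lazily create an accumulator the first time a node id is seen, and accumulate per-node stats. Here the "histogram" is just a count and a sum of Y:

```java
import java.util.HashMap;
import java.util.Map;

class HistoPass {
  // Lazy per-node accumulators, keyed by node id: [count, sum(Y)].
  static Map<Integer, double[]> pass(int[] nids, double[] y) {
    Map<Integer, double[]> dHisto = new HashMap<>();
    for (int r = 0; r < nids.length; r++) {   // all rows
      int nid = nids[r];                      // node-ID of THIS row
      double[] h = dHisto.computeIfAbsent(nid, k -> new double[2]); // lazy
      h[0] += 1;                              // count
      h[1] += y[r];                           // sum of Y
    }
    return dHisto;
  }
}
```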
20. Distributed Trees
● Each pass analyzes one Tree level
● Then decide how to build the next level
● Reassign Rows to new levels in another pass
  – (actually the two passes are merged)
● Builds a Histogram-per-Node
  ● Which requires a reduce() call to roll up
  ● All Histograms for one level are done in parallel
21. Distributed Trees: utilities
● "score+build" in one pass:
  ● Test each row against the decision from the prior pass
  ● Assign it to a new leaf
  ● Build a histogram on that leaf
● "score": just walk the tree, and get results
● "compress": Tree from POJO to byte[]
  ● Easily 10x smaller; can still walk, score, print
● Plus utilities to walk, print, display
22. GBM on Distributed Trees
● GBM builds 1 Tree, 1 level at a time, but...
  ● We run the entire level in parallel & distributed
  ● Built breadth-first because it's "free"
  ● More data is offset by more CPUs
● Classic GBM otherwise
  ● Build residuals tree-by-tree
  ● ESL2, page 387
  ● Tuning knobs: trees, depth, shrinkage, min_rows
● Pure Java
23. GBM on Distributed Trees
● Limiting factor: latency in turning over a level
● About 4x faster than single-node R on covtype
● Does the per-level compute in parallel
● Requires sending histograms over the network
  – These can get big for very deep trees
● Adding more data is offset by adding more Nodes
24. Summary: Write (parallel) Java
● Most simple Java "just works"
● Fast: parallel distributed reads, writes, appends
  ● Reads: same speed as plain Java array loads
  ● Writes, appends: slightly slower (compression)
  ● Typically memory-bandwidth limited
    – (may be CPU limited in a few cases)
● Slower: conflicting writes (but follows a strict JMM)
  ● Also supports transactional updates
25. Summary: Writing Analytics
● We're writing Big Data Analytics
  ● Generalized Linear Modeling (ADMM, GLMNET)
    – Logistic Regression, Poisson, Gamma
  ● Random Forest, GBM, KMeans++, KNN
● State-of-the-art Algorithms, running Distributed
● Solidly working on 100G datasets
● Heading for Tera Scale
● Paying customers (in production!)
● Come write your own (distributed) algorithm!
26. Cool Systems Stuff...
● ...that I ran out of space for
● Reliable UDP, integrated w/ RPC
  ● TCP is reliably UNreliable
  ● We already have a reliable UDP framework, so no problem
● Fork/Join goodies:
  ● Distributed F/J
  ● Priority queues
  ● Surviving fork bombs & lost threads
● K/V does the JMM via a hardware-like MESI protocol
27. H2O is...
● Pure Java, Open Source: 0xdata.com
  ● https://github.com/0xdata/h2o/
● A Platform for doing Math
  ● Parallel Distributed Math
  ● In-memory analytics: GLM, GBM, RF, Logistic Reg
  ● Accessible via REST & JSON
● A K/V Store: ~150ns per get or put
● Distributed Fork/Join + Map/Reduce + K/V
29. Other Simple Examples
● Filter & Count (underage males):
  ● (can pass in any number of Vecs or a Frame)

long count = new MRTask() {
  long map( long age, long sex ) {
    return (age<=17 && sex==MALE) ? 1 : 0;
  }
  long reduce( long d1, long d2 ) { return d1+d2; }
}.doAll( vecAge, vecSex );
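The same filter-and-count can be sketched with plain-Java parallel streams over two (hypothetical) in-memory columns; the MALE=1 encoding is an assumption of the sketch:

```java
import java.util.stream.IntStream;

class FilterCount {
  static final long MALE = 1; // hypothetical category encoding

  // Row-wise predicate, reduced to a count -- parallel like the MRTask,
  // but single-JVM.
  static long countUnderageMales(long[] age, long[] sex) {
    return IntStream.range(0, age.length).parallel()
        .filter(i -> age[i] <= 17 && sex[i] == MALE)
        .count();
  }
}
```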
30. Other Simple Examples
● Filter into a new set (underage males):
  ● Can write or append a subset of rows
    – (append order is preserved)

class Filter extends MRTask {
  void map( Chunk CRisk, Chunk CAge, Chunk CSex ) {
    for( int i=0; i<CAge.len; i++ )
      if( CAge.at(i)<=17 && CSex.at(i)==MALE )
        CRisk.append(CAge.at(i)); // build a set
  }
};
Vec risk = new AppendableVec();
new Filter().doAll( risk, vecAge, vecSex );
...risk... // all the underage males
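A single-JVM sketch of the same append-a-subset pattern (hypothetical names; MALE=1 is an assumed encoding): matching rows are appended in row order, mirroring the order-preserving AppendableVec behavior:

```java
import java.util.ArrayList;
import java.util.List;

class FilterSet {
  static final long MALE = 1; // hypothetical category encoding

  // Append each matching row's age; append order follows row order.
  static List<Long> risk(long[] age, long[] sex) {
    List<Long> out = new ArrayList<>();
    for (int i = 0; i < age.length; i++)
      if (age[i] <= 17 && sex[i] == MALE)
        out.add(age[i]);
    return out;
  }
}
```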
32. Other Simple Examples
● Group-by: count of car-types by age

class AgeHisto extends MRTask {
  long carAges[][]; // count of cars by age
  void map( Chunk CAge, Chunk CCar ) {
    carAges = new long[numAges][numCars];
    for( int i=0; i<CAge.len; i++ )
      carAges[(int)CAge.at(i)][(int)CCar.at(i)]++;
  }
  void reduce( AgeHisto that ) {
    for( int i=0; i<carAges.length; i++ )
      for( int j=0; j<carAges[i].length; j++ )
        carAges[i][j] += that.carAges[i][j];
  }
}
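A plain-Java toy of the same pattern (hypothetical names): each "map" call builds a private 2-D count array over its chunk of rows, then a "reduce" rolls one map's counts into another's:

```java
class AgeHistoToy {
  // One "map" over a chunk: private 2-D counts, single-threaded writes.
  static long[][] map(int[] age, int[] car, int numAges, int numCars) {
    long[][] h = new long[numAges][numCars];
    for (int i = 0; i < age.length; i++)
      h[age[i]][car[i]]++;
    return h;
  }
  // "reduce": roll another map's private counts into this one's.
  static void reduce(long[][] a, long[][] b) {
    for (int i = 0; i < a.length; i++)
      for (int j = 0; j < a[i].length; j++)
        a[i][j] += b[i][j];
  }
}
```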
33. Other Simple Examples
● Group-by: count of car-types by age
● Setting carAges in map() makes it an output field: private per map() call, with single-threaded write access. It must be rolled up in the reduce call.

class AgeHisto extends MRTask {
  long carAges[][]; // count of cars by age
  void map( Chunk CAge, Chunk CCar ) {
    carAges = new long[numAges][numCars];
    for( int i=0; i<CAge.len; i++ )
      carAges[(int)CAge.at(i)][(int)CCar.at(i)]++;
  }
  void reduce( AgeHisto that ) {
    for( int i=0; i<carAges.length; i++ )
      for( int j=0; j<carAges[i].length; j++ )
        carAges[i][j] += that.carAges[i][j];
  }
}
34. Other Simple Examples
● Uniques
  ● Uses a distributed hash set

class Uniques extends MRTask {
  DNonBlockingHashSet<Long> dnbhs = new ...;
  void map( long id ) { dnbhs.add(id); }
  void reduce( Uniques that ) { dnbhs.putAll(that.dnbhs); }
};
long uniques = new Uniques().doAll( vecVisitors ).dnbhs.size();
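A single-JVM sketch of the same idea, substituting a standard concurrent set for the distributed DNonBlockingHashSet (the class and method names here are the sketch's own): maps add ids into a shared set, and reduce merges the sets from different nodes:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class UniquesToy {
  // Stand-in for DNonBlockingHashSet: a concurrent set shared by all maps.
  final Set<Long> ids = ConcurrentHashMap.newKeySet();
  void map(long id) { ids.add(id); }                     // add one row's id
  void reduce(UniquesToy that) { ids.addAll(that.ids); } // merge per-node sets
}
```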
35. Other Simple Examples
● Uniques
  ● Uses a distributed hash set
● Setting dnbhs in <init> makes it an input field: shared across all map() calls, and often read-only. This one is written, so it needs a reduce.

class Uniques extends MRTask {
  DNonBlockingHashSet<Long> dnbhs = new ...;
  void map( long id ) { dnbhs.add(id); }
  void reduce( Uniques that ) { dnbhs.putAll(that.dnbhs); }
};
long uniques = new Uniques().doAll( vecVisitors ).dnbhs.size();