Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infrastructure

© 2015 IBM Corporation
Shivakumar Vaithyanathan
IBM Fellow
Watson & IBM Research

IBM Research
Credit Risk Scoring Application at a Large Financial Institution
Credit Risk Scoring: Payment History, Amount Owed, Length of Credit History, New Credit, Types of Credit Used
• Problem size
– 300 million rows, 1,500 features
– Reduced set: 500 features
• Data size on disk
– 3.6 TB (uncompressed)
– Even for the reduced set: 1.2 TB
• Algorithm of interest
– Regression
– …
• To execute on one machine (with a hypothetical statistical package/engine)
– 3.6 TB of RAM required (an underestimate); reduced set: 1.2 TB of RAM (also an underestimate)
– In practice more RAM is required: outputs and intermediates also need to be stored along with the input
Prototypical of problems in other industries, ranging from automotive to insurance to transportation
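The sizes quoted on this slide are consistent with storing the data as a dense matrix of 8-byte values. The cell width is an assumption here (the slide does not state it), and the helper name `dense_size_tb` is purely illustrative. A quick check:

```python
# Back-of-the-envelope size of a dense matrix, assuming 8-byte doubles
# (an assumption: the slide does not state the storage format).
BYTES_PER_CELL = 8
TB = 10**12  # decimal terabyte


def dense_size_tb(rows, cols, bytes_per_cell=BYTES_PER_CELL):
    """Return the size in TB of a dense rows x cols matrix."""
    return rows * cols * bytes_per_cell / TB


full = dense_size_tb(300_000_000, 1500)    # all 1,500 features
reduced = dense_size_tb(300_000_000, 500)  # reduced set of 500 features
print(f"full set: {full:.1f} TB, reduced set: {reduced:.1f} TB")
```

Under that assumption the input alone accounts for the 3.6 TB and 1.2 TB figures, before any outputs or intermediates are counted.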

Big Data Analytics Use Cases
Problem descriptions by industry:
• Insurance
– Predict customer monetary loss
– Multi-million observations, 95 features; evaluate several hundred models for the optimal subset of features
• Automotive
– Customer satisfaction
– Multi-million cars with few reacquired cars
– Feature expansion from ~250 to ~21,800
• DaaS (Retail Finance): Risk
– Consumer risk modeling
– Consumer data with ~300M rows and ~500 attributes

A Day in the Life of a Data Scientist…
[Diagram: the data scientist works from the original data, a data sample, and the data characteristics to develop a new algorithm or modify an existing algorithm, written in a custom syntax.]
Algorithms:
• Bayesian networks
• Neural networks
• Random forests
• Support vector machines
• …

Bottleneck: Moving the Algorithm onto Big Data Infrastructure
[Diagram: data scientist; Hadoop programmer, Spark programmer, MPI programmer.]

What If…
[Diagram: data scientist; Hadoop programmer, Spark programmer, MPI programmer; compiler and optimizer.]

Simplified View of What We Want to Build…
The What
• High-level language, tooling
• Write any algorithm
The How
• Compiler, optimizer
• Adapt to different data and program characteristics
• Support different backend architectures and configurations

SystemML: IBM Research Project, Soon to Be in Open Source
• IBM Research project started 6 years ago
• More than 10 papers in major conferences
• In beta for more than a year and used in multiple applications
What
• R-like, Python-like syntax, …
• Rich set of statistical functions
• User-defined and external functions
How
• Single-node, embeddable, and Hadoop & Spark
• Dense / sparse matrix representation
• Library of more than 15 algorithms
[Architecture diagram: R parser and Python parser; Higher Ops (HOP); Lower Ops (LOP); In-Memory Single Node; Hadoop / Spark]
Writing a Python-syntax parser took less than 2 man-months

How Should the “What” Work?
package gnmf;
import java.io.IOException;
import java.net.URISyntaxException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
public class MatrixGNMF
{
public static void main(String[] args) throws IOException, URISyntaxException
{
if(args.length < 10)
{
System.out.println("missing parameters");
System.out.println("expected parameters: [directory of v] [directory of w] [directory of h] " +
"[k] [num mappers] [num reducers] [replication] [working directory] " +
"[final directory of w] [final directory of h]");
System.exit(1);
}
String vDir = args[0];
String wDir = args[1];
String hDir = args[2];
int k = Integer.parseInt(args[3]);
int numMappers = Integer.parseInt(args[4]);
int numReducers = Integer.parseInt(args[5]);
int replication = Integer.parseInt(args[6]);
String outputDir = args[7];
String wFinalDir = args[8];
String hFinalDir = args[9];
JobConf mainJob = new JobConf(MatrixGNMF.class);
String vDirectory;
String wDirectory;
String hDirectory;
FileSystem.get(mainJob).delete(new Path(outputDir));
vDirectory = vDir;
hDirectory = hDir;
wDirectory = wDir;
String workingDirectory;
String resultDirectoryX;
String resultDirectoryY;
long start = System.currentTimeMillis();
System.gc();
System.out.println("starting calculation");
System.out.print("calculating X = WT * V... ");
workingDirectory = UpdateWHStep1.runJob(numMappers, numReducers, replication,
UpdateWHStep1.UPDATE_TYPE_H, vDirectory, wDirectory, outputDir, k);
resultDirectoryX = UpdateWHStep2.runJob(numMappers, numReducers, replication,
workingDirectory, outputDir);
FileSystem.get(mainJob).delete(new Path(workingDirectory));
System.out.println("done");
System.out.print("calculating Y = WT * W * H... ");
wDirectory, outputDir);
resultDirectoryY = UpdateWHStep4.runJob(numMappers, replication, workingDirectory,
UpdateWHStep4.UPDATE_TYPE_H, hDirectory, outputDir);
System.out.print("calculating H = H .* X ./ Y... ");
hDirectory, resultDirectoryX, resultDirectoryY, hFinalDir, k);
FileSystem.get(mainJob).delete(new Path(resultDirectoryX));
FileSystem.get(mainJob).delete(new Path(resultDirectoryY));
System.out.print("storing back H... ");
FileSystem.get(mainJob).delete(new Path(hDirectory));
hDirectory = workingDirectory;
System.out.print("calculating X = V * HT... ");
workingDirectory = UpdateWHStep1.runJob(numMappers, numReducers, replication,
UpdateWHStep1.UPDATE_TYPE_W, vDirectory, hDirectory, outputDir, k);
resultDirectoryX = UpdateWHStep2.runJob(numMappers, numReducers, replication,
workingDirectory, outputDir);
System.out.print("calculating Y = W * H * HT... ");
hDirectory, outputDir);
resultDirectoryY = UpdateWHStep4.runJob(numMappers, replication, workingDirectory,
UpdateWHStep4.UPDATE_TYPE_W, wDirectory, outputDir);
System.out.print("calculating W = W .* X ./ Y... ");
wDirectory, resultDirectoryX, resultDirectoryY, wFinalDir, k);
FileSystem.get(mainJob).delete(new Path(resultDirectoryX));
FileSystem.get(mainJob).delete(new Path(resultDirectoryY));
System.out.print("storing back W... ");
FileSystem.get(mainJob).delete(new Path(wDirectory));
wDirectory = workingDirectory;
long requiredTime = System.currentTimeMillis() - start;
long requiredTimeMilliseconds = requiredTime % 1000;
requiredTime -= requiredTimeMilliseconds;
requiredTime /= 1000;
long requiredTimeSeconds = requiredTime % 60;
requiredTime -= requiredTimeSeconds;
requiredTime /= 60;
long requiredTimeMinutes = requiredTime % 60;
requiredTime -= requiredTimeMinutes;
requiredTime /= 60;
long requiredTimeHours = requiredTime;
}
}
package gnmf;
import gnmf.io.MatrixObject;
import gnmf.io.MatrixVector;
import gnmf.io.TaggedIndex;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
public class UpdateWHStep2
{
static class UpdateWHStep2Mapper extends MapReduceBase
implements Mapper<TaggedIndex, MatrixVector, TaggedIndex, MatrixVector>
{
@Override
public void map(TaggedIndex key, MatrixVector value,
OutputCollector<TaggedIndex, MatrixVector> out,
Reporter reporter) throws IOException
{
out.collect(key, value);
}
}
static class UpdateWHStep2Reducer extends MapReduceBase
implements Reducer<TaggedIndex, MatrixVector, TaggedIndex, MatrixObject>
{
@Override
public void reduce(TaggedIndex key, Iterator<MatrixVector> values,
OutputCollector<TaggedIndex, MatrixObject> out, Reporter reporter)
throws IOException
{
MatrixVector result = null;
while(values.hasNext())
{
MatrixVector current = values.next();
if(result == null)
{
result = current.getCopy();
} else
{
result.addVector(current);
}
}
if(result != null)
{
out.collect(new TaggedIndex(key.getIndex(), TaggedIndex.TYPE_VECTOR_X),
new MatrixObject(result));
}
}
}
public static String runJob(int numMappers, int numReducers, int replication,
String inputDir, String outputDir) throws IOException
{
String workingDirectory = outputDir + System.currentTimeMillis() + "-UpdateWHStep2/";
JobConf job = new JobConf(UpdateWHStep2.class);
job.setJobName("MatrixGNMFUpdateWHStep2");
job.setInputFormat(SequenceFileInputFormat.class);
FileInputFormat.setInputPaths(job, new Path(inputDir));
job.setOutputFormat(SequenceFileOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path(workingDirectory));
job.setNumMapTasks(numMappers);
job.setMapperClass(UpdateWHStep2Mapper.class);
job.setMapOutputKeyClass(TaggedIndex.class);
job.setMapOutputValueClass(MatrixVector.class);
job.setNumReduceTasks(numReducers);
job.setReducerClass(UpdateWHStep2Reducer.class);
job.setOutputKeyClass(TaggedIndex.class);
job.setOutputValueClass(MatrixObject.class);
JobClient.runJob(job);
return workingDirectory;
}
}
package gnmf;
import gnmf.io.MatrixCell;
import gnmf.io.MatrixFormats;
import gnmf.io.MatrixObject;
import gnmf.io.MatrixVector;
import gnmf.io.TaggedIndex;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
public class UpdateWHStep1
{
public static final int UPDATE_TYPE_H = 0;
public static final int UPDATE_TYPE_W = 1;
static class UpdateWHStep1Mapper extends MapReduceBase
implements Mapper<TaggedIndex, MatrixObject, TaggedIndex, MatrixObject>
{
private int updateType;
@Override
public void map(TaggedIndex key, MatrixObject value,
OutputCollector<TaggedIndex, MatrixObject> out,
Reporter reporter) throws IOException
{
if(updateType == UPDATE_TYPE_W && key.getType() == TaggedIndex.TYPE_CELL)
{
MatrixCell current = (MatrixCell) value.getObject();
out.collect(new TaggedIndex(current.getColumn(), TaggedIndex.TYPE_CELL),
new MatrixObject(new MatrixCell(key.getIndex(), current.getValue())));
} else
{
out.collect(key, value);
}
}
@Override
public void configure(JobConf job)
{
updateType = job.getInt("gnmf.updateType", 0);
}
}
static class UpdateWHStep1Reducer extends MapReduceBase
implements Reducer<TaggedIndex, MatrixObject, TaggedIndex, MatrixVector>
{
private double[] baseVector = null;
private int vectorSizeK;
@Override
public void reduce(TaggedIndex key, Iterator<MatrixObject> values,
OutputCollector<TaggedIndex, MatrixVector> out, Reporter reporter)
throws IOException
{
if(key.getType() == TaggedIndex.TYPE_VECTOR)
{
if(!values.hasNext())
throw new RuntimeException("expected vector");
MatrixFormats current = values.next().getObject();
if(!(current instanceof MatrixVector))
throw new RuntimeException("expected vector");
baseVector = ((MatrixVector) current).getValues();
} else
{
while(values.hasNext())
{
MatrixCell current = (MatrixCell) values.next().getObject();
if(baseVector == null)
{
out.collect(new TaggedIndex(current.getColumn(), TaggedIndex.TYPE_VECTOR),
new MatrixVector(vectorSizeK));
} else
{
if(baseVector.length == 0)
throw new RuntimeException("base vector is corrupted");
MatrixVector resultingVector = new MatrixVector(baseVector);
resultingVector.multiplyWithScalar(current.getValue());
if(resultingVector.getValues().length == 0)
throw new RuntimeException("multiplying with scalar failed");
out.collect(new TaggedIndex(current.getColumn(), TaggedIndex.TYPE_VECTOR),
resultingVector);
}
}
baseVector = null;
}
}
@Override
public void configure(JobConf job)
{
vectorSizeK = job.getInt("dml.matrix.gnmf.k", 0);
if(vectorSizeK == 0)
throw new RuntimeException("invalid k specified");
}
}
public static String runJob(int numMappers, int numReducers, int replication,
int updateType, String matrixInputDir, String whInputDir, String outputDir,
int k) throws IOException
{
R syntax (10 lines of code)
Python syntax (10 lines of code)
A factor of 7–10 advantage in man-months over multiple algorithms
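The progress messages in the Java listing name the GNMF multiplicative updates it computes: H = H .* (WT V) ./ (WT W H) and W = W .* (V HT) ./ (W H HT). As a rough illustration of how compact those updates are once matrix operations are available, here is a single-machine sketch in plain Python. It is not the SystemML script from the slide; the helper names (`matmul`, `transpose`, `update`, `gnmf`, `frobenius_error`), the random initialization, and the small `eps` guard against division by zero are all assumptions made for this example.

```python
# Single-machine sketch of the GNMF multiplicative updates
#   H = H .* (W^T V) ./ (W^T W H)
#   W = W .* (V H^T) ./ (W H H^T)
# on plain Python lists. Helper names are illustrative, not SystemML's.
import random


def matmul(a, b):
    """Dense product of two list-of-lists matrices."""
    bt = list(zip(*b))  # columns of b
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def update(m, num, den, eps=1e-12):
    """Element-wise m .* num ./ den; eps is an added guard against 0/0."""
    return [[mv * nv / (dv + eps) for mv, nv, dv in zip(mr, nr, dr)]
            for mr, nr, dr in zip(m, num, den)]


def gnmf(v, k, iterations=50, seed=0):
    """Factor a non-negative n x m matrix v into w (n x k) and h (k x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        wt = transpose(w)
        h = update(h, matmul(wt, v), matmul(matmul(wt, w), h))
        ht = transpose(h)
        w = update(w, matmul(v, ht), matmul(w, matmul(h, ht)))
    return w, h


def frobenius_error(v, w, h):
    """Squared Frobenius norm of v - w h."""
    wh = matmul(w, h)
    return sum((a - b) ** 2 for vr, pr in zip(v, wh) for a, b in zip(vr, pr))


v = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]]  # rank-1 toy matrix
w, h = gnmf(v, k=1)
print(frobenius_error(v, w, h))
```

The two `update` calls inside the loop are the whole algorithm; everything the Java listing adds around them is job configuration, data movement, and intermediate-directory bookkeeping.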

Scalability and Performance – GNMF Example
• All operations execute on a single machine: 0 MR jobs
• Hybrid execution (majority of operations execute on a single machine): 4 MR jobs
• Hybrid execution (majority of operations execute in MapReduce): 6 MR jobs

What Does the “How” Do?
[Figure: execution plans chosen for the same regression program as the data and the cluster change]
• Original data: X is 300M × 500, y is 300M × 1. Execution plan: one combined X’X and X’y job, then solve.
• Change in data characteristics: X has 3 times more columns (X is 300M × 1,500, y is 300M × 1), or X has 2 times more rows (X is 600M × 500, y is 600M × 1).
• Change in cluster configuration: from 2.5 to … GB map task JVM; 7 GB in-memory master JVM.
• Execution plans shown for the changed settings:
– X’y job1, X’y job2, X’X job, solve
– X’X job1, X’X job2, X’y job, solve
– X’X and X’y job, solve
• 3X faster!
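Every plan on this slide computes X’X and X’y and then calls solve, which is the normal-equations form of linear regression: (X’X) b = X’y. The sketch below shows that computation on one machine in plain Python. The function names and the toy data are assumptions for illustration, and the single pass that accumulates both products is only loosely analogous to the combined "X’X and X’y job" plan.

```python
# Sketch: least squares via the normal equations (X'X) b = X'y.
# Plain Python on one machine; function names are illustrative only.

def xtx_and_xty(rows, ys):
    """Accumulate X'X and X'y in a single pass over the rows of X."""
    d = len(rows[0])
    xtx = [[0.0] * d for _ in range(d)]
    xty = [0.0] * d
    for x, y in zip(rows, ys):
        for i in range(d):
            xty[i] += x[i] * y
            for j in range(d):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - tail) / m[i][i]
    return z


# Toy data generated from y = 2*x1 + 3*x2 with no noise.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.0, 3.0, 5.0, 7.0]
xtx, xty = xtx_and_xty(X, y)
beta = solve(xtx, xty)
print(beta)
```

Note that X’X has only as many rows and columns as X has columns (500 × 500 for the original data), so it stays small no matter how many rows X has, while tripling the columns makes it nine times larger.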

Compilation Chain Overview with Example
Example expression: Q = y * (X %*% (b + sb))
Parse tree: * at the root with operands y and X %*% (b + sb); + combines b and sb
Validate: if dimensions are unknown at compile time, validate will pass through and additional checks will be made at run time
HOPs DAG, then LOPs DAG, then runtime instructions:
CP: b+sb → _mvar1
MR-Job: [map=X%*%_mvar1 → _mvar2]
CP: y*_mvar2 → _mvar3

Some Performance Numbers for Spark / Hadoop
In-Memory Data Set (160 GB)
• Data fits in aggregated memory: SystemML optimizations give ~10X over Hadoop
ML Program   MR Backend        Spark Backend     Spark Backend
             (All ML optims)   (All ML optims)   (Limited ML optims)
LinregDS     479s              342s              456s
LinregCG     954s              188s              243s
L2SVM        1,517s            237s              531s
GLM          1,989s            205s              318s
Large-Scale Data Set (1.6 TB)
• Data larger than aggregated memory: SystemML optimizations give ~2X
ML Program   MR Backend        Spark Backend
             (All ML optims)   (All ML optims)
LinregDS     5,429s            6,779s
LinregCG     12,469s           10,014s
L2SVM        24,360s           12,795s
GLM          32,521s           17,301s
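One way to read the headline figures is as the ratio of MR backend time to Spark backend time with all ML optimizations enabled. That reading is an assumption (the slide does not say which columns the ~10X and ~2X compare), and the `speedups` helper is illustrative. The ratios can be recomputed from the tables:

```python
# Ratios recomputed from the timings on this slide (seconds).
# Each entry is (MR backend, Spark backend), both with all ML optims.
in_memory = {
    "LinregDS": (479, 342),
    "LinregCG": (954, 188),
    "L2SVM": (1517, 237),
    "GLM": (1989, 205),
}
large_scale = {
    "LinregDS": (5429, 6779),
    "LinregCG": (12469, 10014),
    "L2SVM": (24360, 12795),
    "GLM": (32521, 17301),
}


def speedups(timings):
    """Return MR time divided by Spark time for each program."""
    return {name: mr / spark for name, (mr, spark) in timings.items()}


for label, table in (("160 GB", in_memory), ("1.6 TB", large_scale)):
    for name, ratio in speedups(table).items():
        print(f"{label} {name}: {ratio:.1f}x")
```

On that reading the in-memory ratios range from about 1.4x (LinregDS) to about 9.7x (GLM), and the large-scale ratios from about 0.8x (LinregDS, where the MR backend is faster) to about 1.9x (L2SVM), so the headline numbers describe the best cases.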

Thank You
