This document summarizes a presentation on Inferno, a system for scalable deep learning on Apache Spark. Inferno allows deep learning models built with Blaze, La Trobe University's deep learning system, to be trained faster using a Spark cluster. It coordinates distributed training of Blaze models across worker nodes, with optimized communication of weights and hyperparameters. Evaluation shows Inferno can train ResNet models on ImageNet up to 4-5 times faster than a single GPU. The presentation provides an overview of deep learning and Spark, demonstrates how Blaze allows easy model building, and explains Inferno's architecture for distributed deep learning training on Spark.
Inferno Scalable Deep Learning on Spark
1. Inferno
Scalable Deep Learning on Spark
Matthias Langer
m.langer@latrobe.edu.au
Dr. Zhen He
z.he@latrobe.edu.au
Prof. Wenny Rahayu
w.rahayu@latrobe.edu.au
Department of Computer Science &
Computer Engineering
2. Topics
• Deep Learning – Introduction
• Spark & Deep Learning
• Our solution:
La Trobe University’s Deep Learning System
• Conclusion, Timeline, Q&A
5. Object/Action Recognition
• Automatic Captioning
• Navigating Artificial Agents
• Deep Learning performs 100% better than the best non-deep learning algorithms in many Computer Vision tasks!
Source: Research @ Facebook (left), google.com/selfdrivingcar (right)
6. Voice Recognition
• Deep Learning performs 30% better than the best non-deep learning algorithms!
7. Natural Language Processing
• Translation
• Thought Vector Q&A
• …
• Deep Learning tends to perform “better” than traditional machine learning algorithms!
Source: Google Inc. / Google Translate
9. Spark & DL
How they could be an ideal tandem, but there are challenges…
10. Why do you want to use a cluster to train Deep Neural Networks?
Deep Learning is SLOW
11. Two approaches to speed up DL
Scaling Up
• Superior scaling until fundamental limits of the hardware are reached
(max. number of PCIe lanes, max. read speed of HDD)
• Costs scale up non-linearly (DGX-1 = $129,000)
Scaling Out
• Highly scalable
• No relevant hardware limits
• Extensible
Source: https://developer.nvidia.com/devbox
12. More reasons why you would want to use Hadoop/Spark for DL:
• You already have all your valuable data in Spark/Hadoop
• DL (often) requires a lot of data to train
• You need a lot of memory
• Pre-processing has massive I/O requirements (disk & network)
13. How could you implement DL on Spark?
(Diagram: a Master and Workers 1–3, each holding a copy of the model 𝑏2 𝑥2 + 𝑏3 𝑥3 + ⋯, fed by a Spark RDD of mini-batches.)
• Draw a mini-batch of data for each worker
• Map: compute an updated model in each worker
• Reduce: assemble the results into a “better” model via the Master node
• Broadcast the “better” model and repeat
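A minimal sketch of this loop in plain Spark Scala (no Blaze/Inferno APIs involved; Model, localUpdate and the mini-batch type are placeholder stand-ins for whatever your DL engine provides):

import org.apache.spark.rdd.RDD

// Placeholder model: a flat weight vector. `localUpdate` stands in for one
// SGD step on a mini-batch; a real DL engine would supply both.
case class Model(weights: Array[Float])

def localUpdate(m: Model, batch: Array[Float]): Model =
  m // placeholder: compute gradients on `batch` and step the weights here

def train(batches: RDD[Array[Float]], init: Model, rounds: Int): Model = {
  val n = batches.count().toFloat // number of mini-batches = local models
  var model = init
  for (_ <- 1 to rounds) {
    val bc = batches.sparkContext.broadcast(model)      // broadcast model
    val summed = batches
      .map(b => localUpdate(bc.value, b).weights)       // map: update per worker
      .reduce((x, y) => x.zip(y).map(t => t._1 + t._2)) // reduce: sum all models
    model = Model(summed.map(_ / n))                    // uniform average
    bc.destroy()
  }
  model
}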
14. Problem 1: Big Parameters = High shuffle cost!
(Diagram: Master and Workers 1–3 as before, each holding a 500 MB model. Pie chart: Compute 5%, Communication 95%.)
• Compute updated models (typically 50 – 500 ms)
• Reduce models (at best 5 s over 1 GbE)
• Broadcast combined model (at best 5 s over 1 GbE)
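A back-of-the-envelope check of those numbers (assuming the 500 MB model and ideal line rates):

\[
t_{1\,\mathrm{GbE}} = \frac{500 \cdot 8\ \mathrm{Mbit}}{1000\ \mathrm{Mbit/s}} = 4\,\mathrm{s}\ \text{per direction}\ (\approx 5\,\mathrm{s}\ \text{with overhead}),
\qquad
\frac{t_{\mathrm{comm}}}{t_{\mathrm{comm}} + t_{\mathrm{comp}}} = \frac{5 + 5}{5 + 5 + 0.5} \approx 95\%.
\]

Even over 10 GbE the same model needs roughly 0.4 s per direction, so with a 0.4 s compute step communication still takes about 0.8 / (0.8 + 0.4) ≈ 66% of each round, which matches the figure quoted in the notes below.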
16. Blaze
La Trobe University DL-System
Single Machine:
• Blaze: Scala-based standalone deep learning system
• CUBlaze: GPU acceleration for Blaze
Cluster:
• Inferno: coordinates distributed computation of Blaze models in a synchronous Spark environment
17. A (probably biased) comparison
Systems compared: Inferno, SparkNet (Caffe), CaffeOnSpark, deeplearning4j, H2O
• ConvNets, AutoEncoders, etc. (planned in some systems)
• Communication protocol during training (Inferno: Spark MR; SparkNet (Caffe): Spark MR; CaffeOnSpark: MPI/RDMA; deeplearning4j: Spark MR among others; H2O: Grpc/MPI/RDMA)
• Build complex models (e.g. ResNet) (some systems only)
• Dynamic branching support (path altering / dropping)
• Pluggable preprocessing pipeline (partial in some systems)
• Pluggable update policies for hyper parameters
• Pluggable & visualizable online cross validation
• Entire execution path determined in a single runtime environment
• Model description language (Inferno: JVM code; SparkNet (Caffe): config file; CaffeOnSpark: config file; deeplearning4j: JVM code; H2O: multiple)
• GPU acceleration
22. How Blaze works (example)
(Pipeline diagram:) Data Source (HDD, SparkRDD, HDFS) → Cached Samples → Sample Merger → Augmenter (model with fixed weights, fprop only) → Prefetcher → Optimizer. The Optimizer drives the Model (tunable weights), consults the Scope Delimiter, the Hyper Parameters and the Objectives, and reports to Terminal, File, Showoff, etc.
23. Easy Setup: Model
• Blaze automatically infers most layer parameters based on the actual input
• Usually no need to specify input and output dimensions or whether to use CPU or GPU
val noClasses = 100
// Kernels
val kernelConv1 = Kernel2D(dims = (11, 11), stride = (4, 4), padding = (2, 2))
val kernelConv2 = Kernel2D.centered((3, 3))
val kernelPool = Kernel2D((3, 3), (2, 2))
// Layers
val bias = AddBiasBuilder()
val relu = ReLUBuilder()
val lrn = LateralResponseNormalizationBuilder(n = 5, k = 2, alpha = 1e-4f, beta = 0.75f)
val pool = MaxPoolingBuilder(kernelPool)
// Lego!
val mb = SequenceBuilder(
ConvolutionFilterBuilder(kernelConv1, 48), bias, relu, pool, lrn,
ConvolutionFilterBuilder(kernelConv2, 192), bias, relu,
ConvolutionFilterBuilder(kernelConv2, 128), bias, relu, pool,
ReshapeBuilder.collapseDimensions(),
LinearBuilder(noClasses), bias,
SoftmaxBuilder(), ClassLLConstraintBuilder()
)
24. Easy Setup: CPU and GPU
• Blaze maintains a variant table for each module type.
• When you “build” an instance of a module, all variants are scored and the
“best” variant for the current situation is selected automatically.
You can configure what “best” means.
// Input data
val data = Array[Batch](...)
// Inspect batches
val hints = BuildHints.derive(data)
// Build compatible model
val m = mb.build(hints)
19:25:20 INFO Scoring ConvolutionFilter[Kernel2[(3, 3), (1, 1)] x 2, 0/1 = filter]:
19:25:20 DEBUG 0000800a => CUDA_CUDNN, preferred, input type matches
19:25:20 DEBUG 0000400a => JVM_BLAS_IMPLICITMM, preferred
19:25:20 DEBUG 00000004 => JVM_BLAS_MM
19:25:20 DEBUG 0000000a => JVM_BREEZE_MM, preferred
19:25:20 DEBUG 00000002 => JVM_BREEZE_SPARSEMM
19:25:20 INFO CUDA_CUDNN selected!
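The log above shows this in action. Purely as an illustration of the mechanism, a scored variant table could look like the following sketch (all names are hypothetical, not Blaze's internals):

// Hypothetical sketch of a variant table with scored selection.
final case class Hints(inputOnGpu: Boolean)

trait Variant {
  def name: String
  def score(h: Hints): Option[Int] // None = variant not applicable here
}

object CudaCudnn extends Variant {
  val name = "CUDA_CUDNN"
  def score(h: Hints) = if (h.inputOnGpu) Some(100) else None // input type matches
}

object JvmBlasMM extends Variant {
  val name = "JVM_BLAS_MM"
  def score(h: Hints) = Some(10) // CPU fallback, always applicable
}

def select(variants: Seq[Variant], h: Hints): Variant =
  variants
    .flatMap(v => v.score(h).map(s => (s, v))) // drop inapplicable variants
    .maxBy(_._1)._2                            // highest score wins

// select(Seq(JvmBlasMM, CudaCudnn), Hints(inputOnGpu = true)).name == "CUDA_CUDNN"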
25. Working with large models!
val mb = SequenceBuilder(...)
val hints = ...
val g = mb.toGraph(hints)
SvgRenderer.render(g)
28. Other Features
• Tensor Memory Management
Automatically monitors the dependencies between all tensors
Reallocates space occupied by unneeded tensors on the fly
Will automatically toggle “inPlace” processing when it is safe
Saves up to 40% GPU memory during training!
• Intermediate results are stored separately from the model
Forward passes yield backpropagation contexts that can be consumed or discarded at any time (see the sketch below).
Very interesting property for:
Live Query/Training
Fancy Optimizers
Hyper Parameter Search
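A sketch of the “forward pass yields a backpropagation context” idea (hypothetical signatures, not Blaze's actual API):

// Intermediate activations live in the context, not in the module, so
// several contexts can coexist and be consumed or dropped independently.
final case class BackpropContext(activations: Vector[Array[Float]]) {
  def backward(gradOut: Array[Float]): Array[Float] =
    gradOut // placeholder: consume `activations` to produce input gradients
}

trait Module {
  def weights: Array[Float]
  def forward(input: Array[Float]): (Array[Float], BackpropContext)
}

Because the context is a plain value, you can, for example, forward-propagate several candidate settings during a hyper parameter search and only backpropagate through the contexts you decide to keep.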
32. Performance
ResNet 34 on ImageNet
Blaze (2 x 8 core Xeon CPU + 1 x NVIDIA TitanX): 2 hours, 42 minutes
Inferno over 1 GbE (8 x 8 core Xeon CPU + 4 x NVIDIA TitanX): 57 minutes
Reached 20% Top-1 accuracy 2.84 times faster!
33. Performance
PreAct ResNet 152 on ImageNet
(Chart: Top-1 and Top-5 accuracy, 0–80%, over training time, 0 h – 50 h, for 1x TitanX and an Inferno cluster with 5x TitanX over 1 GbE.)
Reached 30% Top-1 accuracy 4.81 times faster using 5 GPUs!*
* 6.8 h vs. 32.7 h
34. Conclusion
• Blaze & CUBlaze
Fast
Huge extensible module library
Easy to use
• Inferno
Allows you to accelerate Blaze DL tasks on Spark
Uses Spark MR methods for all data transmissions:
Can run rather nicely along with other Spark jobs.
Can be used without high-speed / low latency equipment
(usually required to make RDMA solutions perform well)
Plain old (and even slow) Ethernet is enough!
* Note that using “Showoff” to visualize progress may open separate HTTP connections to the Showoff-Server.
35. Where can I get it?
• Blaze & CUBlaze & Example Code
Stable; we have been training models with it for months. A snapshot of the current stable release is available at:
https://github.com/bashimao/ltudl (Apache License 2.0)
• Showoff
Multi-purpose live visualization system developed by Aiden Nibali (La Trobe University):
https://github.com/anibali/showoff
• Inferno
I am writing a paper about Inferno’s optimization system right now.
Once it has been accepted for publication, we will release the full source code on GitHub.
The best way to prepare for Inferno is to download Blaze now and get familiar with it.
36. Questions?
Matthias Langer, PhD cand.
m.langer@latrobe.edu.au
Supervisors:
Dr. Zhen He
z.he@latrobe.edu.au
Prof. Wenny Rahayu
w.rahayu@latrobe.edu.au
37. Deep Learning & Spark @ LaTrobe
Students
• Master of Data Science degree
http://tinyurl.com/hf4wmn2
Advanced data science lab established in 2016 with newest hardware.
CSE5BDC
Big Data Management on the Cloud (I tutor this!)
CSE5DEV
Data Exploration and Visualization
(~50% lectures on deep learning)
CSE5WDC
Web Development on the Cloud
• Research
GPU research cluster capable of running distributed deep learning
tasks.
In-house development of a distributed deep learning system.
Dedicated research group working with various Deep Learning systems.
CSE4DLJ
Weekly Deep Learning Journal Club
Industry
• If you have a data analytics problem:
… we have a dedicated deep learning research team!
… and probably also a deep learning solution for it!
• Spark & Deep Learning workshops for Torch
available on demand.
• Past & current machine learning research
collaborations
Alfred Hospital
ZenDesk
AIS (Australian Institute of Sport)
• Contact: z.he@latrobe.edu.au
Editor's notes
Time Budget: 30 seconds
Hi, my name is Matthias Langer. I am currently a PhD student at La Trobe University.
Today I would like to present to you Inferno, a deep learning system that we are developing here in Melbourne and that can run on top of Spark.
Time Budget: 30 seconds
My talk will be structured as follows:
I will talk with you a little bit about DL.
… then about DL and Spark…
… our own DL system ….
… and then we will conclude, and I will also tell you where you can download our stuff.
Time Budget: 30 seconds
Talking Points:
So without further ado, let’s start…
Time Budget: 1 minute
So, what is deep learning?
Deep learning is a machine learning algorithm that tries to extract hierarchical features from input data.
In itself, that is kind of similar to how the brain does it, as illustrated on this slide.
So how does that work:
Let’s say a stimulus (or input) comes from the eye and eventually ends up in region V1.
There primitive features like edges are extracted.
Then in V2 these features are combined into more complex features.
This is done many times to grasp very complex features.
Time Budget: 30 seconds
Talking Points:
Now, where can DL be used?
For example, for in computer vision.
In this area, DL has completely reshaped the landscape.
Time Budget: 30 seconds
Talking Points:
But also in voice recognition DL is now used a lot!
Time Budget: 30 seconds
The same goes for natural language processing.
I could now go on with examples, but… (next slide)
Time Budget: 30 seconds
… I think this slide from GoogleBrain sums it up pretty well.
This is the number of projects at Google that take advantage of DL to achieve their functionality.
You can draw your own conclusions. But.. Well.. I would say this is an exponential development.
Time Budget: 30 seconds
So the first question that arises is probably... (next slide)
Time Budget: 1 minute
“Why do you want to use cluster resources to train DNNs?”
When you dive into the literature available about DL, you will often see comments like this:
(click) “This model took about 22 days to train.”
(wait 5) Another frequent comment could be: (click) “I trained 50x from scratch…”
(wait 5)
So, let me sum this up in one short sentence (click!)
DEEP LEARNING IS SLOW!
Time Budget: 1.5 minutes
Scaling Up
Scaling up works super-well until a certain point.
And then it becomes either fundamentally hardware limited and/or expensive!
Also consider that you then have a box that can do ML very well but might not be a good host for your data.
Scaling Out (click)
On the other hand we have the scaling out approach by using a cluster of computers and clever software like Hadoop & Spark.
Here you have no hardware limits.
And even better, it is extensible: So, you can gradually buy more resources for DL as you run more DL jobs.
Time Budget: 1 minute
Here are a few more reasons why you might want to try running DL on Spark:
If you are here at this conference today, chances are that you already have all your valuable data in Hadoop and use Spark to process them.
DL requires a lot of data, and in your HDFS there is a lot of data.
DL needs a lot of memory; your Spark cluster probably has a lot of memory.
Preprocessing data requires lots of memory and I/O. Spark and Hadoop are masters at doing this.
Time Budget: 1.5 minutes
OK, Done deal! Let’s implement DL on Spark.
As always, we first put all our data into a Spark RDD.
(click) Now start a bunch of workers and give them our model.
(click) Each worker then pulls one batch from the RDD and updates the model. This is a map-job in Spark.
(click) Then we combine the changes from all workers into a joint model. This would be a reduce-job in Spark.
(click) And finally, we take this model and pass it back to the workers for the next optimization round. You could do this with a broadcast-job in Spark.
Time Budget: 1.5 minutes
The aforementioned approach looks theoretically sound.
But let’s take a closer look.
(click) Typical DL models need 50-500 ms to compute on a modern GPU.
(click) But presuming the model is large (e.g. 500 MB)
Then the reduction will take at least 5 seconds, because that is the minimum flight-time of a single instance of such a model over 1 GbE.
(click) And then we also need at least another 5 seconds for rebroadcasting the model.
In this scenario we spend about 95% of the time at communication.
Now you could say: “But I have 10 GbE.” 10 GbE is of course faster, but at best you still spend at least 66% of the time budget on communication.
Time Budget: 1 minute
Another thing to consider is that map/reduce in Spark is synchronous.
Only after the slowest worker has responded to the master can it finish the reduction process.
The master itself and its network connections can quickly become the bottleneck that slows down the entire system.
So synchronous is kind of problematic.
Time Budget: 1.5 minutes
So let’s talk about what we have to offer.
The LTU DL system consists of 3 major components.
Blaze
Is a standalone deep learning system that can train DL models on a single node…
Now you might want to ask: why did we have to create a new DL system? Blaze was designed from the ground up for use in a distributed MapReduce environment. So it is highly portable and scalable.
CUBlaze
A plugin for Blaze that adds support for NVIDIA GPUs.
Inferno
Is a coordinator service and a set of advanced optimizers for Blaze that leverage cluster resources to accelerate training of DNNs.
Time Budget: 1.5 minutes
There are already solutions for DL on Spark.
Now why Inferno?
If you type DL + Spark into Google you end up with a couple of systems, and they are all very different. So I will just pick a few things here.
This presentation will be available later for downloading. So you can compare more thoroughly.
(click) Our system is not only a deep learning system but covers the entire pipeline, including preprocessing. So it is an all-in-one solution.
(click) We also have pluggable online cross validation support. So you can see live how well your model generalizes right now.
(click) Last but not least, this is the primary communication protocol used. As you can see, while some systems say they are Apache Spark based, they do not use Spark for communication. Actually, some of them just kick off the learning task using Spark and then open other communication channels. Hence, they are not really Spark DL systems. This is quite important, because Spark's resource management is completely thrown out of the window when you do that.
Time Budget: 30 seconds
So, let’s dig into our DL system…
And start with Blaze.
Time Budget: 1 minute
Blaze is not only a Deep Learning Engine.
It also comes with built-in support for a vast array of DL modules and optimizers.
This is an incomplete list, but note that you see Convolution only once in this list, and not things like Spatial, Volumetric, etc. Keep that in mind; it will come back in a minute.
Time Budget: 45 seconds
Going distributed is useless if your base performance is horrible.
Here is a benchmark that pits CUBlaze against other famous DL engines on AlexNet.
As you can see, our single GPU performance is comparable with TensorFlow. (lower is better)
Time Budget: 30 seconds
But not only for AlexNet.
We score similarly well for other network architectures.
Time Budget: 2.5 min
Talking Points:
Next I want to show you how Blaze fundamentally works.
As for all data science tasks, everything starts with the data itself.
(click) Blaze gives you two options: lazily cached and uncached data loading. In this case we went for cached data loading. This is only interesting if you have very slow network connections and/or use a regular access pattern.
(click) Anyway, data is pulled from the data source by the first preprocessing stage. In this presentation these stages are always depicted as yellow hexagons.
In this case it is a merger that merges multiple samples together to form a mini-batch.
(click) It then hands it over to the next processing stage.
In this example it is an augmenter. Augmenters allow you to add a wide array of modules (including entire NNs) to mangle the data in order to make it consumable for the model under test.
(click) So the augmenter hands the data over to the underlying model.
The model then consumes the batch and produces a new batch.
(click) Which it returns to the augmenter, which
(click) in turn hands it over to the next processing stage.
Here it is a prefetcher. Prefetchers mitigate performance drops through I/O bottlenecks, by pulling in batches ahead of time.
(click) However, regardless what the last preprocessing stage is, now the batch in its current form is handed over to the optimizer.
(click) The optimizer will consult the scope delimiter to decide to what degree the model should be modified next. This is a pretty unique property of Blaze and opens up very interesting possibilities for special-purpose networks, for using different optimization strategies for different parts of the model, and in fact for the distributed optimization itself.
(click) Then it reads the current hyper parameters and
(click) begins running the batch through the model.
(click) No surprises here. The model uses its current weights and hyper parameters to compute a cost and returns it to the optimizer.
(click) The optimizer will then process its current objectives and take action (depends what the objective is about).
(click) Objectives can for example result in an output to a file or a Showoff server. They could also result in a yield signal to the optimizer; in that case the optimization would be finished.
(click) If it is not finished, Blaze will now use the gradients returned by the model to improve the current weights. It will also trigger update procedures in all hyper parameters.
As you can see there are a few technicalities. But no fancy surprises or magic here.
Arguments:
Remember that caching is not useful if you can afford a prefetcher.
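Condensed into code, the loop just described looks roughly like this (a sketch with placeholder function types, not Blaze's actual interfaces):

// Single-node training loop as described above: pull a batch through the
// preprocessing stages, run it through the model, evaluate the objectives,
// then update weights and hyper parameters. All signatures are placeholders.
def optimize(
    nextBatch:  () => Array[Float],     // merger/augmenter/prefetcher output
    runModel:   Array[Float] => Float,  // fprop, returns the cost
    objectives: Seq[Float => Boolean],  // true = "yield"/stop signal
    updateStep: () => Unit              // apply gradients, update hyper params
): Unit = {
  var finished = false
  while (!finished) {
    val cost = runModel(nextBatch())           // run the batch through the model
    finished = objectives.exists(o => o(cost)) // objectives may log, dump, or stop
    if (!finished) updateStep()
  }
}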
Time Budget: 1.5 min
REMEMBER: Mentioning the ConvLayer again when mentioned in previous slides.
So, what does working with Blaze actually look like? Here we go…
(click) For defining a convolution NN, we best start off with defining kernels (Kernels represent the size of the feature maps)
We could do that later, but it is cleaner like this.
There are many ways to initialize a 2D kernel.
(click) In most NNs there are layers that we frequently use. So let’s just define those upfront.
And now it is Lego time. Here we define a sequence
And add convolution layers that use the previously defined kernels. The first one will create 48 feature maps of kernelConv1 we defined above.
As you can see, you can simply mix defining individual layers on the fly with using the layers that we just defined above.
Note that we are creating a network here that is 17 modules deep. And it is still pretty readable.
And the reason why that is still quite readable is that every piece of information here has to do with what we want to do. Not how.
(click) As you can see we do not define the actual input and output dimensions of the layers. This is inferred automatically.
Time Budget: 1 minute
Here is why we do not have to specify CPU or GPU.
Blaze will automatically pick the best available implementation depending on many factors, especially the runtime type of the tensor coming from the previous module.
However, you have the option to set preferences to override our built-in mechanics if you want.
Blaze supplies fallback implementations for everything.
If something is not supported in the desired implementation, Blaze will temporarily switch to a fallback solution.
So if you give a model to a friend, they will always be able to compute it.
TODO: Give examples for how it can be configured!
Time Budget: 1 minute
With many things being done automatically, you sometimes want to know how Blaze will actually process the data.
To do this, you can transform any NN into a graph.
Just call the toGraph method.
You can also render it for on screen display.
Then Blaze will show you what will happen.
… there are two branches coming from above.
… after the branches join, a table is formed containing those two tensors.
… this table is then collapsed down into a single CUDA tensor by a Merge operation that adds the tensors on top of each other.
Time Budget: 30 seconds
Of course, visualization is not limited to the model.
You can also visualize other things as well.
Here is a preprocessing pipeline for ImageNet. (wait 10 seconds)
Time Budget: 2 minutes
So, last but not least, here is an example of how you set up an optimization job in Blaze.
First you create an optimizer builder.
(click) Then you could set hyper parameters.
In this case we set up a learningRate schedule with discrete steps.
You can extend the functionality of optimizers with so-called objectives.
(click) Objectives include stop conditions like this one where we simply say that we want to stop after 1000 iterations.
But you can also execute complex functions
(click) Let’s add a “Online Cross Validation” module.
(click) Now let’s print the status again. That would now print the cost and other figures regarding the learning to the command line.
(click) Let’s say we do not want this information to end up on the command line but in a file. For example in a Hadoop file. Then we would just add two arrows and let them point to a sink.
(click) That was nice. But how about more advanced visualizations? You can build them yourself or use presets that we frequently use.
(click) Well… where do we visualize? We could write an image file. Or we could also send it to our Showoff visualization system. Like this.
(click) This will automatically render the image on the Showoff server in a frame titled “Cross Validation Performance”
(click) And produce a graphic like this in the Showoff server.
(click) You can also use logical operations to combine objectives. Here is a periodic trigger that we set to 3600. That means this objective evaluates true once every hour.
(click) We combine this using an &&-operator with a dump command. Now every hour the weights of the model will be dumped to stdout.
(click) But that is not very useful. So, let's add a directory sink. This will redirect the output of dump to files in the directory “/tmp”.
(click) There are lots of other things you can do.
But eventually you want to build the optimizer by providing a model and a data source.
And then “run()” it.
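The objective algebra described above might be sketched like this (hypothetical names; the real Blaze objective API may differ):

// Composable objectives: each one inspects the optimizer state and may fire.
trait Objective { self =>
  def check(iteration: Long): Boolean // true = objective fires
  def &&(other: Objective): Objective = new Objective {
    def check(it: Long): Boolean = self.check(it) && other.check(it)
  }
}

// Fires every `n` iterations, a stand-in for the periodic time trigger.
def every(n: Long): Objective = new Objective {
  def check(it: Long): Boolean = it > 0 && it % n == 0
}

// Stop condition: fires once `n` iterations have been reached.
def stopAfter(n: Long): Objective = new Objective {
  def check(it: Long): Boolean = it >= n
}

// Conceptually, as in the talk: every(3600) && dumpWeights would fire the
// dump once per hour, redirected to “/tmp” by a directory sink.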
Time Budget: 1 minute
Talking Points:
Blaze has many other features.
This here is merely a selection.
Blaze has an automatic Tensor memory management.
It will automatically monitor the relationships between tensors in your network to utilize the available memory as efficiently as possible.
The tensor management system will also automatically toggle inPlace processing if it is deemed safe.
(click) This can save up to 40% of GPU memory during training.
Also note that in Blaze, intermediate results are always stored separate from the model.
So you can forward propagate multiple times without losing the ability to backprop separately for the previous mini-batch.
This is a nice property to have if you are optimizing hyper parameters…
… or if you want to write fancy optimizers that explore the hyperplane of the cost function.
As you have already seen in the previous slides, we have the ability to visualize lots of things. Right now we support only our own visualization system, Showoff, but the system is extensible.
Time Budget: 30 seconds
Talking Points:
Now, for the last part of our deep learning system.
Inferno itself.
(click)
Time Budget: 1.5 minutes
Talking Points:
To be able to utilize a cluster for training your models, you have to use Inferno.
In Inferno everything always starts with the Cluster coordinator.
First, you will have to provide a SparkConf so that we know what Spark master you want to connect to.
(click) The Cluster Coordinator creates and takes control of the Spark Context.
Now the automatic initialization procedure starts.
(click) First, the coordinator will briefly claim all cluster resources.
(click) It then probes each executor and checks for specific settings in its local configuration.
(click) Then it frees all cluster resources that cannot be used for one reason or another, to make them available for the Spark Scheduler again.
(click) Now special plugins like for example CUBlaze can be loaded.
Now the system is initialized.
(click) Typically you would now somehow load your dataset.
(click) Here we used the Inferno FileRDD, which is a special RDD that can handle huge numbers of files much faster than the built-in Spark RDDs.
This way we can, for instance, just drop the entire ImageNet dataset into our HDFS filesystem and have it accessible from the entire cluster.
(click) Anyway, sooner or later you want to create samples that you can use for learning. Notice that sample creation is lazy: once we have the meta-data for the HDFS files in the FileRDD, we do not need to access a file again until we really need it for learning.
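Since Inferno is not released yet, the following is only a hypothetical sketch of that initialization sequence; apart from SparkConf and SparkContext, every name here is invented:

import org.apache.spark.{SparkConf, SparkContext}

// Invented stand-in for Inferno's cluster coordinator; the real init steps
// (claim all executors, probe their local configuration, free unusable
// resources, load plugins such as CUBlaze) are described in the notes above.
class ClusterCoordinator(conf: SparkConf) {
  val sc = new SparkContext(conf) // the coordinator owns the Spark context
}

object InfernoSketch extends App {
  val conf = new SparkConf().setAppName("inferno-sketch") // plus your master URL
  val coordinator = new ClusterCoordinator(conf)
  // A FileRDD-style source would then expose e.g. ImageNet from HDFS, with
  // lazy sample creation on top of the cached file meta-data.
}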
Time Budget: 1.5 minutes
Talking Points:
Presuming that you already have your Blaze optimizer, the Inferno optimizer is easy to use.
As always, everything starts with the
(click) Just provide the Blaze model you want to tune and cache it.
(click) Provide the description of the Blaze optimizer and cache it.
(click) Provide the description of the preprocessing pipeline and cache it.
(click) Now come the Inferno optimizer's objectives, hyper parameters and scope delimiter.
(click) Now, I know that looks similar, but in fact both sets of parameters are different.
The optimizer will be distributed to the workers, and so are its objectives; they are only evaluated there.
The Inferno parameters are evaluated in the master.
(click) Anyway, finally you will have to provide your sample data and call the build() function.
Now you have your Inferno optimizer. The only thing left is to call the “run()” function.
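In the same hypothetical style, here are the setup steps above as a builder chain; none of these identifiers are Inferno's real API:

// Invented builder mirroring the steps in the notes. `AnyRef` placeholders
// stand in for the cached Blaze model/optimizer/pipeline descriptions.
final class InfernoOptimizer { def run(): Unit = () } // drives distributed training

final class InfernoOptimizerBuilder {
  def withModel(m: AnyRef): this.type      = this // cached Blaze model
  def withOptimizer(o: AnyRef): this.type  = this // cached optimizer description
  def withPipeline(p: AnyRef): this.type   = this // cached preprocessing pipeline
  def withObjectives(o: AnyRef): this.type = this // evaluated on the master
  def withData(d: AnyRef): this.type       = this // sample data
  def build(): InfernoOptimizer            = new InfernoOptimizer
}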
Time Budget: 1 minute
Now for the performance.
Here we have trained a ResNet 34 on a single GPU.
I deliberately took this picture. This is how Blaze would visualize its progress to you.
With 1 GPU, we reached 20% Top-1 accuracy after 2 hours, 42 minutes.
Now the same network, training on 4 machines with the same specs, using Inferno.
We reached the same result in 57 minutes.
So: 4x the hardware, a 2.8 times speed improvement…
Time Budget: 1 minute
Well, nice, but not impressive.
But ResNet 34 is very small and still doable on a single GPU.
Let’s take on something larger. ResNet 152, with pre-activation units.
As you can see from the horizontal axis, it takes incredibly long to train this on a single GPU (blue line). I basically gave up after about 44 hours.
The distributed version (green), in this case 5 TitanX cards in an Inferno cluster with a poor 1 GbE link speed:
we can reach 30% Top-1 accuracy with similar hyper parameters in less than 7 hours.
That is about 4.8 times faster than 33 hours using a single GPU.
Time Budget: 1.5 minute
So, to sum up
Blaze & CUBlaze & Inferno
They are fast, have a huge and extensible module library, and are quite easy to use.
Inferno
Allows you to accelerate DL tasks.
And it uses Spark MR for all communication.
So no shady network connections that punch holes into your security.
And of course we can achieve decent results using cheap Ethernet hardware where others can't.
Time Budget: 1.5 minute
So now the big question that remains is: How can you obtain this software to start playing around.
For Blaze and CUBlaze, I have published snapshots of the current stable release on GitHub.
There is example code. So just grab them and follow the instructions.
Our visualization system Showoff, can be found at Aiden Nibali’s GitHub repo as a docker-image.
For Inferno things are more complicated.
I am in fact writing a paper about our optimizer right now.
Unfortunately, I have to wait until that paper has been accepted before I can release the code.
However, as soon as that happens, you will find it next to Blaze & CUBlaze in the above mentioned repository.
The best way to prepare for Inferno is to get familiar with Blaze now.
Time Budget: 5 minutes
Talking points:
So, I don’t have many slides left. Any questions?
(if people stand up, switch to the next slide.)
Time Budget: -
At LaTrobe we do quite a lot with deep learning.
If you are interested, regardless whether you are a student or industry representative, you can contact us here.