This document summarizes a presentation on Inferno, a system for scalable deep learning on Apache Spark. Inferno allows deep learning models built with Blaze, La Trobe University's deep learning system, to be trained faster using a Spark cluster. It coordinates distributed training of Blaze models across worker nodes, with optimized communication of weights and hyperparameters. Evaluation shows Inferno can train ResNet models on ImageNet up to 4-5 times faster than a single GPU. The presentation provides an overview of deep learning and Spark, demonstrates how Blaze allows easy model building, and explains Inferno's architecture for distributed deep learning training on Spark.
Inferno Scalable Deep Learning on Spark
1. Inferno
Scalable Deep Learning on Spark
Matthias Langer
m.langer@latrobe.edu.au
Dr. Zhen He
z.he@latrobe.edu.au
Prof. Wenny Rahayu
w.rahayu@latrobe.edu.au
Department of Computer Science &
Computer Engineering
2. Topics
• Deep Learning – Introduction
• Spark & Deep Learning
• Our solution:
La Trobe University’s Deep Learning System
• Conclusion, Timeline, Q&A
5. Object/Action Recognition
• Automatic Captioning
• Navigating Artificial Agents
• Deep Learning performs 100% better than the best non-deep learning algorithms in many Computer Vision tasks!
Source: Research @ Facebook (left), google.com/selfdrivingcar (right)
6. Voice Recognition
• Deep Learning performs 30% better than the best non-deep learning algorithms!
7. Natural Language Processing
• Translation
• Thought Vector Q&A
• …
• Deep Learning tends to perform “better” than traditional machine learning algorithms!
Source: Google Inc. / Google Translate
9. Spark & DL
How they could be an ideal tandem, but there are challenges…
10. Why do you want to use a cluster to train Deep Neural Networks?
Deep Learning is SLOW
11. Two approaches to speed up DL
Scaling Up
• Superior scaling until fundamental limits of the hardware are reached
(max. number of PCIe lanes, max. read speed of HDD)
• Costs scale up non-linearly (DGX-1 = $129,000)
Scaling Out
• Highly scalable
• No relevant hardware limits
• Extensible
Source: https://developer.nvidia.com/devbox
12. More reasons why you would want to use Hadoop/Spark for DL:
• You already have all your valuable data in Spark/Hadoop
• DL (often) requires a lot of data to train
• You need a lot of memory
• Pre-processing has massive I/O requirements (disk & network)
13. How could you implement DL on Spark?
(Diagram: a Master and Workers 1–3, each holding a copy of the model 𝑏2 𝑥2 + 𝑏3 𝑥3 + ⋯, fed by a Spark RDD of mini-batches.)
• Draw a mini-batch of data for each worker
• Map: compute an updated model in each worker
• Reduce: assemble the results into a “better” model via the Master node
• Broadcast the “better” model and repeat
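A minimal sketch of this loop in plain Spark Scala (no Blaze/Inferno APIs involved; Model, localUpdate and the mini-batch type are placeholder stand-ins for whatever your DL engine provides):

import org.apache.spark.rdd.RDD

// Placeholder model: a flat weight vector. `localUpdate` stands in for one
// SGD step on a mini-batch; a real DL engine would supply both.
case class Model(weights: Array[Float])

def localUpdate(m: Model, batch: Array[Float]): Model =
  m // placeholder: compute gradients on `batch` and step the weights here

def train(batches: RDD[Array[Float]], init: Model, rounds: Int): Model = {
  val n = batches.count().toFloat // number of mini-batches = local models
  var model = init
  for (_ <- 1 to rounds) {
    val bc = batches.sparkContext.broadcast(model)      // broadcast model
    val summed = batches
      .map(b => localUpdate(bc.value, b).weights)       // map: update per worker
      .reduce((x, y) => x.zip(y).map(t => t._1 + t._2)) // reduce: sum all models
    model = Model(summed.map(_ / n))                    // uniform average
    bc.destroy()
  }
  model
}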
14. Problem 1: Big Parameters = High shuffle cost!
(Diagram: Master and Workers 1–3 as before, each holding a 500 MB model. Pie chart: Compute 5%, Communication 95%.)
• Compute updated models (typically 50 – 500 ms)
• Reduce models (at best 5 s over 1 GbE)
• Broadcast combined model (at best 5 s over 1 GbE)
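A back-of-the-envelope check of those numbers (assuming the 500 MB model and ideal line rates):

\[
t_{1\,\mathrm{GbE}} = \frac{500 \cdot 8\ \mathrm{Mbit}}{1000\ \mathrm{Mbit/s}} = 4\,\mathrm{s}\ \text{per direction}\ (\approx 5\,\mathrm{s}\ \text{with overhead}),
\qquad
\frac{t_{\mathrm{comm}}}{t_{\mathrm{comm}} + t_{\mathrm{comp}}} = \frac{5 + 5}{5 + 5 + 0.5} \approx 95\%.
\]

Even over 10 GbE the same model needs roughly 0.4 s per direction, so with a 0.4 s compute step communication still takes about 0.8 / (0.8 + 0.4) ≈ 66% of each round, which matches the figure quoted in the notes below.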
16. Blaze
La Trobe University DL-System
Single Machine:
• Blaze: Scala-based standalone deep learning system
• CUBlaze: GPU acceleration for Blaze
Cluster:
• Inferno: coordinates distributed computation of Blaze models in a synchronous Spark environment
17. A (probably biased) comparison
Systems compared: Inferno, SparkNet (Caffe), CaffeOnSpark, deeplearning4j, H2O
• ConvNets, AutoEncoders, etc. (planned in some systems)
• Communication protocol during training (Inferno: Spark MR; SparkNet (Caffe): Spark MR; CaffeOnSpark: MPI/RDMA; deeplearning4j: Spark MR among others; H2O: Grpc/MPI/RDMA)
• Build complex models (e.g. ResNet) (some systems only)
• Dynamic branching support (path altering / dropping)
• Pluggable preprocessing pipeline (partial in some systems)
• Pluggable update policies for hyper parameters
• Pluggable & visualizable online cross validation
• Entire execution path determined in a single runtime environment
• Model description language (Inferno: JVM code; SparkNet (Caffe): config file; CaffeOnSpark: config file; deeplearning4j: JVM code; H2O: multiple)
• GPU acceleration
22. How Blaze works (example)
(Pipeline diagram:) Data Source (HDD, SparkRDD, HDFS) → Cached Samples → Sample Merger → Augmenter (model with fixed weights, fprop only) → Prefetcher → Optimizer. The Optimizer drives the Model (tunable weights), consults the Scope Delimiter, the Hyper Parameters and the Objectives, and reports to Terminal, File, Showoff, etc.
23. Easy Setup: Model
• Blaze automatically infers most layer parameters based on the actual input
• Usually no need to specify input and output dimensions or whether to use CPU or GPU
val noClasses = 100
// Kernels
val kernelConv1 = Kernel2D(dims = (11, 11), stride = (4, 4), padding = (2, 2))
val kernelConv2 = Kernel2D.centered((3, 3))
val kernelPool = Kernel2D((3, 3), (2, 2))
// Layers
val bias = AddBiasBuilder()
val relu = ReLUBuilder()
val lrn = LateralResponseNormalizationBuilder(n = 5, k = 2, alpha = 1e-4f, beta = 0.75f)
val pool = MaxPoolingBuilder(kernelPool)
// Lego!
val mb = SequenceBuilder(
ConvolutionFilterBuilder(kernelConv1, 48), bias, relu, pool, lrn,
ConvolutionFilterBuilder(kernelConv2, 192), bias, relu,
ConvolutionFilterBuilder(kernelConv2, 128), bias, relu, pool,
ReshapeBuilder.collapseDimensions(),
LinearBuilder(noClasses), bias,
SoftmaxBuilder(), ClassLLConstraintBuilder()
)
24. Easy Setup: CPU and GPU
• Blaze maintains a variant table for each module type.
• When you “build” an instance of a module, all variants are scored and the
“best” variant for the current situation is selected automatically.
You can configure what “best” means.
// Input data
val data = Array[Batch](...)
// Inspect batches
val hints = BuildHints.derive(data)
// Build compatible model
val m = mb.build(hints)
19:25:20 INFO Scoring ConvolutionFilter[Kernel2[(3, 3), (1, 1)] x 2, 0/1 = filter]:
19:25:20 DEBUG 0000800a => CUDA_CUDNN, preferred, input type matches
19:25:20 DEBUG 0000400a => JVM_BLAS_IMPLICITMM, preferred
19:25:20 DEBUG 00000004 => JVM_BLAS_MM
19:25:20 DEBUG 0000000a => JVM_BREEZE_MM, preferred
19:25:20 DEBUG 00000002 => JVM_BREEZE_SPARSEMM
19:25:20 INFO CUDA_CUDNN selected!
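The log above shows this in action. Purely as an illustration of the mechanism, a scored variant table could look like the following sketch (all names are hypothetical, not Blaze's internals):

// Hypothetical sketch of a variant table with scored selection.
final case class Hints(inputOnGpu: Boolean)

trait Variant {
  def name: String
  def score(h: Hints): Option[Int] // None = variant not applicable here
}

object CudaCudnn extends Variant {
  val name = "CUDA_CUDNN"
  def score(h: Hints) = if (h.inputOnGpu) Some(100) else None // input type matches
}

object JvmBlasMM extends Variant {
  val name = "JVM_BLAS_MM"
  def score(h: Hints) = Some(10) // CPU fallback, always applicable
}

def select(variants: Seq[Variant], h: Hints): Variant =
  variants
    .flatMap(v => v.score(h).map(s => (s, v))) // drop inapplicable variants
    .maxBy(_._1)._2                            // highest score wins

// select(Seq(JvmBlasMM, CudaCudnn), Hints(inputOnGpu = true)).name == "CUDA_CUDNN"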
25. Working with large models!
val mb = SequenceBuilder(...)
val hints = ...
val g = mb.toGraph(hints)
SvgRenderer.render(g)
28. Other Features
• Tensor Memory Management
Automatically monitors the dependencies between all tensors
Reallocates space occupied by unneeded tensors on the fly
Will automatically toggle “inPlace” processing when it is safe
Saves up to 40% GPU memory during training!
• Intermediate results are stored separately from the model
Forward passes yield backpropagation contexts that can be consumed or discarded at any time (see the sketch below).
Very interesting property for:
Live Query/Training
Fancy Optimizers
Hyper Parameter Search
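A sketch of the “forward pass yields a backpropagation context” idea (hypothetical signatures, not Blaze's actual API):

// Intermediate activations live in the context, not in the module, so
// several contexts can coexist and be consumed or dropped independently.
final case class BackpropContext(activations: Vector[Array[Float]]) {
  def backward(gradOut: Array[Float]): Array[Float] =
    gradOut // placeholder: consume `activations` to produce input gradients
}

trait Module {
  def weights: Array[Float]
  def forward(input: Array[Float]): (Array[Float], BackpropContext)
}

Because the context is a plain value, you can, for example, forward-propagate several candidate settings during a hyper parameter search and only backpropagate through the contexts you decide to keep.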
32. Performance
ResNet 34 on ImageNet
Blaze (2 x 8 core Xeon CPU + 1 x NVIDIA TitanX): 2 hours, 42 minutes
Inferno over 1 GbE (8 x 8 core Xeon CPU + 4 x NVIDIA TitanX): 57 minutes
Reached 20% Top-1 accuracy 2.84 times faster!
33. Performance
PreAct ResNet 152 on ImageNet
(Chart: Top-1 and Top-5 accuracy, 0–80%, over training time, 0 h – 50 h, for 1x TitanX and an Inferno cluster with 5x TitanX over 1 GbE.)
Reached 30% Top-1 accuracy 4.81 times faster using 5 GPUs!*
* 6.8 h vs. 32.7 h
34. Conclusion
• Blaze & CUBlaze
Fast
Huge extensible module library
Easy to use
• Inferno
Allows you to accelerate Blaze DL tasks on Spark
Uses Spark MR methods for all data transmissions:
Can run rather nicely along with other Spark jobs.
Can be used without high-speed / low latency equipment
(usually required to make RDMA solutions perform well)
Plain old (and even slow) Ethernet is enough!
* Note that using “Showoff” to visualize progress may open separate HTTP connections to the Showoff-Server.
35. Where can I get it?
• Blaze & CUBlaze & Example Code
Stable; we have been training models with it for months. A snapshot of the current stable release is available at:
https://github.com/bashimao/ltudl (Apache License 2.0)
• Showoff
Multi-purpose live visualization system developed by Aiden Nibali (La Trobe University):
https://github.com/anibali/showoff
• Inferno
I am writing a paper about Inferno’s optimization system right now.
Once it has been accepted for publication, we will release the full source code on GitHub.
The best way to prepare for Inferno is to download Blaze now and get familiar with it.
36. Questions?
Matthias Langer, PhD cand.
m.langer@latrobe.edu.au
Supervisors:
Dr. Zhen He
z.he@latrobe.edu.au
Prof. Wenny Rahayu
w.rahayu@latrobe.edu.au
37. Deep Learning & Spark @ LaTrobe
Students
• Master of Data Science degree
http://tinyurl.com/hf4wmn2
Advanced data science lab established in 2016 with newest hardware.
CSE5BDC
Big Data Management on the Cloud (I tutor this!)
CSE5DEV
Data Exploration and Visualization
(~50% lectures on deep learning)
CSE5WDC
Web Development on the Cloud
• Research
GPU research cluster capable of running distributed deep learning
tasks.
In-house development of a distributed deep learning system.
Dedicated research group working with various Deep Learning systems.
CSE4DLJ
Weekly Deep Learning Journal Club
Industry
• If you have a data analytics problem:
… we have a dedicated deep learning research team!
… and probably also a deep learning solution for it!
• Spark & Deep Learning workshops for Torch
available on demand.
• Past & current machine learning research
collaborations
Alfred Hospital
ZenDesk
AIS (Australian Institute of Sport)
• Contact: z.he@latrobe.edu.au
Editor's notes
Time Budget: 30 seconds
Hi, my name is Matthias Langer. I am currently a PhD student at La Trobe University.
Today I would like to present to you Inferno, a deep learning system that we are developing here in Melbourne and that can run on top of Spark.
Time Budget: 30 seconds
My talk will be structured as follows:
I will talk with you a little bit about DL.
… then about DL and Spark…
… our own DL system ….
… and then we will conclude, and I will also tell you where you can download our stuff.
Time Budget: 30 seconds
Talking Points:
So without further ado, let’s start…
Time Budget: 1 minute
So, what is deep learning?
Deep learning is a machine learning algorithm that tries to extract hierarchical features from input data.
In itself, that is kind of similar to how the brain does it, as illustrated on this slide.
So how does that work:
Let’s say a stimulus (or input) comes from the eye and eventually ends up in region V1.
There primitive features like edges are extracted.
Then in V2 these features are combined into more complex features.
This is done many times to grasp very complex features.
Time Budget: 30 seconds
Talking Points:
Now, where can DL be used?
For example, for in computer vision.
In this area, DL has completely reshaped the landscape.
Time Budget: 30 seconds
Talking Points:
But also in voice recognition DL is now used a lot!
Time Budget: 30 seconds
The same goes for natural language processing.
I could now go on with examples, but… (next slide)
Time Budget: 30 seconds
… I think this slide from GoogleBrain sums it up pretty well.
This is the number of projects at Google that take advantage of DL to achieve their functionality.
You can draw your own conclusions. But.. Well.. I would say this is an exponential development.
Time Budget: 30 seconds
So the first question that arises is probably... (next slide)
Time Budget: 1 minute
“Why do you want to use cluster resources to train DNNs?”
When you dive into the literature available about DL, you will often see comments like this:
(click) “This model took about 22 days to train.”
(wait 5) Another frequent comment could be: (click) “I trained 50x from scratch…”
(wait 5)
So, let me sum this up in one short sentence (click!)
DEEP LEARNING IS SLOW!
Time Budget: 1.5 minutes
Scaling Up
Scaling up works super-well until a certain point.
And then it becomes either fundamentally hardware limited and/or expensive!
Also consider that you then have a box that can do ML very well but might not be a good host for your data.
Scaling Out (click)
On the other hand we have the scaling out approach by using a cluster of computers and clever software like Hadoop & Spark.
Here you have no hardware limits.
And even better, it is extensible: So, you can gradually buy more resources for DL as you run more DL jobs.
Time Budget: 1 minute
Here are a few more reasons why you might want to try running DL on Spark:
If you are here at this conference today, chances are that you already have all your valuable data in Hadoop and use Spark to process them.
DL requires a lot of data, and in your HDFS there is a lot of data.
DL needs a lot of memory; your Spark cluster probably has a lot of memory.
Preprocessing data requires lots of memory and I/O. Spark and Hadoop are masters at doing this.
Time Budget: 1.5 minutes
OK, Done deal! Let’s implement DL on Spark.
As always, we first put all our data into a Spark RDD.
(click) Now start a bunch of workers and give them our model.
(click) Each worker then pulls one batch from the RDD and updates the model. This is a map-job in Spark.
(click) Then we combine the changes from all workers into a joint model. This would be a reduce-job in Spark.
(click) And finally, we take this model and pass it back to the workers for the next optimization round. You could do this with a broadcast-job in Spark.
Time Budget: 1.5 minutes
The aforementioned approach looks theoretically sound.
But let’s take a closer look.
(click) Typical DL models need 50-500 ms to compute on a modern GPU.
(click) But presuming the model is large (e.g. 500 MB)
Then the reduction will take at least 5 seconds, because that is the minimum flight-time of a single instance of such a model over 1 GbE.
(click) And then we also need at least another 5 seconds for rebroadcasting the model.
In this scenario we spend about 95% of the time at communication.
Now you could say: “But I have 10 GbE.” 10 GbE is of course faster, but at best you still spend at least 66% of the time budget on communication.
Time Budget: 1 minute
Another thing to consider is that map/reduce in Spark is synchronous.
Only after the slowest worker has responded to the master can it finish the reduction process.
The master itself and its network connections can quickly become the bottleneck that slows down the entire system.
So synchronous is kind of problematic.
Time Budget: 1.5 minutes
So let’s talk about what we have to offer.
The LTU DL system consists of 3 major components.
Blaze
Is a standalone deep learning system that can train DL models on a single node…
Now you might want to ask: why did we have to create a new DL system? Blaze was designed from the ground up for use in a distributed MapReduce environment. So it is highly portable and scalable.
CUBlaze
A plugin for Blaze that adds support for NVIDIA GPUs.
Inferno
Is a coordinator service and a set of advanced optimizers for Blaze that leverage cluster resources to accelerate training of DNNs.
Time Budget: 1.5 minutes
There are already solutions for DL on Spark.
Now why Inferno?
If you type DL + Spark into Google you end up with a couple of systems, and they are all very different. So I will just pick a few things here.
This presentation will be available later for downloading. So you can compare more thoroughly.
(click) Our system is not only a deep learning system but covers the entire pipeline, including preprocessing. So it is an all-in-one solution.
(click) We also have pluggable online cross validation support. So you can see live how well your model generalizes right now.
(click) Last but not least, this is the primary communication protocol used. As you can see, while some systems say they are Apache Spark based, they do not use Spark for communication. Actually, some of them just kick off the learning task using Spark and then open other communication channels. Hence, they are not really Spark DL systems. This is quite important, because Spark's resource management is completely thrown out of the window when you do that.
Time Budget: 30 seconds
So, let’s dig into our DL system…
And start with Blaze.
Time Budget: 1 minute
Blaze is not only a Deep Learning Engine.
It also comes with built-in support for a vast array of DL modules and optimizers.
This is an incomplete list, but note that you see Convolution only once in this list, and not things like Spatial, Volumetric, etc. Keep that in mind; it will come back in a minute.
Time Budget: 45 seconds
Going distributed is useless if your base performance is horrible.
Here is a benchmark that pits CUBlaze against other famous DL engines on AlexNet.
As you can see, our single GPU performance is comparable with TensorFlow. (lower is better)
Time Budget: 30 seconds
But not only for AlexNet.
We score similarly well for other network architectures.
Time Budget: 2.5 min
Talking Points:
Next I want to show you how Blaze fundamentally works.
As for all data science tasks, everything starts with the data itself.
(click) Blaze gives you two options: lazily cached and uncached data loading. In this case we went for cached data loading. This is only interesting if you have very slow network connections and/or use a regular access pattern.
(click) Anyway, data is pulled from the data source by the first preprocessing stage. In this presentation these stages are always depicted as yellow hexagons.
In this case it is a merger that merges multiple samples together to form a mini-batch.
(click) It then hands it over to the next processing stage.
In this example it is an augmenter. Augmenters allow you to add a wide array of modules (including entire NNs) to mangle the data in order to make it consumable for the model under test.
(click) So the augmenter hands the data over to the underlying model.
The model then consumes the batch and produces a new batch.
(click) Which it returns to the augmenter, which
(click) in turn hands it over to the next processing stage.
Here it is a prefetcher. Prefetchers mitigate performance drops through I/O bottlenecks, by pulling in batches ahead of time.
(click) However, regardless what the last preprocessing stage is, now the batch in its current form is handed over to the optimizer.
(click) The optimizer will consult the scope delimiter to decide to what degree the model should be modified next. This is a pretty unique property of Blaze and opens up very interesting possibilities for special-purpose networks, for using different optimization strategies for different parts of the model, and in fact for the distributed optimization itself.
(click) Then it reads the current hyper parameters and
(click) begins running the batch through the model.
(click) No surprises here. The model uses its current weights and hyper parameters to compute a cost and returns it to the optimizer.
(click) The optimizer will then process its current objectives and take action (depends what the objective is about).
(click) Objectives can for example result in an output to a file or a Showoff server. They could also result in a yield signal to the optimizer; in that case the optimization would be finished.
(click) If it is not finished, Blaze will now use the gradients returned by the model to improve the current weights. It will also trigger update procedures in all hyper parameters.
As you can see there are a few technicalities. But no fancy surprises or magic here.
Arguments:
Remember that caching is not useful if you can afford a prefetcher.
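Condensed into code, the loop just described looks roughly like this (a sketch with placeholder function types, not Blaze's actual interfaces):

// Single-node training loop as described above: pull a batch through the
// preprocessing stages, run it through the model, evaluate the objectives,
// then update weights and hyper parameters. All signatures are placeholders.
def optimize(
    nextBatch:  () => Array[Float],     // merger/augmenter/prefetcher output
    runModel:   Array[Float] => Float,  // fprop, returns the cost
    objectives: Seq[Float => Boolean],  // true = "yield"/stop signal
    updateStep: () => Unit              // apply gradients, update hyper params
): Unit = {
  var finished = false
  while (!finished) {
    val cost = runModel(nextBatch())           // run the batch through the model
    finished = objectives.exists(o => o(cost)) // objectives may log, dump, or stop
    if (!finished) updateStep()
  }
}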
Time Budget: 1.5 min
REMEMBER: Mentioning the ConvLayer again when mentioned in previous slides.
So, what does working with Blaze actually look like? Here we go…
(click) For defining a convolution NN, we best start off with defining kernels (Kernels represent the size of the feature maps)
We could do that later, but it is cleaner like this.
There are many ways to initialize a 2D kernel.
(click) In most NNs there are layers that we frequently use. So let’s just define those upfront.
And now it is Lego time. Here we define a sequence
And add convolution layers that use the previously defined kernels. The first one will create 48 feature maps of kernelConv1 we defined above.
As you can see, you can simply mix defining individual layers on the fly with using the layers that we just defined above.
Note that we are creating a network here that is 17 modules deep. And it is still pretty readable.
And the reason why that is still quite readable is that every piece of information here has to do with what we want to do. Not how.
(click) As you can see we do not define the actual input and output dimensions of the layers. This is inferred automatically.
Time Budget: 1 minute
Here is why we do not have to specify CPU or GPU.
Blaze will automatically pick the best available implementation depending on many factors, especially the runtime type of the tensor coming from the previous module.
However, you have the option to set preferences to override our built-in mechanics if you want.
Blaze supplies fallback implementations for everything.
If something is not supported in the desired implementation, Blaze will temporarily switch to a fallback solution.
So if you give a model to a friend, they will always be able to compute it.
TODO: Give examples for how it can be configured!
Time Budget: 1 minute
With many things being done automatically, you sometimes want to know how Blaze will actually process the data.
To do this, you can transform any NN into a graph.
Just call the toGraph method.
You can also render it for on screen display.
Then Blaze will show you what will happen.
… there are two branches coming from above.
… after the branches join, a table is formed containing those two tensors.
… this table is then collapsed down into a single CUDA tensor by a Merge operation that adds the tensors on top of each other.
Time Budget: 30 seconds
Of course, visualization is not limited to the model.
You can also visualize other things as well.
Here is a preprocessing pipeline for ImageNet. (wait 10 seconds)
Time Budget: 2 minutes
So, last but not least, here is an example of how you set up an optimization job in Blaze.
First you create an optimizer builder.
(click) Then you could set hyper parameters.
In this case we set up a learningRate schedule with discrete steps.
You can extend the functionality of optimizers with so-called objectives.
(click) Objectives include stop conditions like this one where we simply say that we want to stop after 1000 iterations.
But you can also execute complex functions
(click) Let’s add a “Online Cross Validation” module.
(click) Now let’s print the status again. That would now print the cost and other figures regarding the learning to the command line.
(click) Let’s say we do not want this information to end up on the command line but in a file. For example in a Hadoop file. Then we would just add two arrows and let them point to a sink.
(click) That was nice. But how about more advanced visualizations? You can build them yourself or use presets that we frequently use.
(click) Well… where do we visualize? We could write an image file. Or we could also send it to our Showoff visualization system. Like this.
(click) This will automatically render the image on the Showoff server in a frame titled “Cross Validation Performance”
(click) And produce a graphic like this in the Showoff server.
(click) You can also use logical operations to combine objectives. Here is a periodic trigger that we set to 3600. That means this objective evaluates true once every hour.
(click) We combine this using an &&-operator with a dump command. Now every hour the weights of the model will be dumped to stdout.
(click) But that is not very useful. So, let's add a directory sink. This will redirect the output of dump to files in the directory “/tmp”.
(click) There are lots of other things you can do.
But eventually you want to build the optimizer by providing a model and a data source.
And then “run()” it.
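The objective algebra described above might be sketched like this (hypothetical names; the real Blaze objective API may differ):

// Composable objectives: each one inspects the optimizer state and may fire.
trait Objective { self =>
  def check(iteration: Long): Boolean // true = objective fires
  def &&(other: Objective): Objective = new Objective {
    def check(it: Long): Boolean = self.check(it) && other.check(it)
  }
}

// Fires every `n` iterations, a stand-in for the periodic time trigger.
def every(n: Long): Objective = new Objective {
  def check(it: Long): Boolean = it > 0 && it % n == 0
}

// Stop condition: fires once `n` iterations have been reached.
def stopAfter(n: Long): Objective = new Objective {
  def check(it: Long): Boolean = it >= n
}

// Conceptually, as in the talk: every(3600) && dumpWeights would fire the
// dump once per hour, redirected to “/tmp” by a directory sink.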
Time Budget: 1 minute
Talking Points:
Blaze has many other features.
This here is merely a selection.
Blaze has an automatic Tensor memory management.
It will automatically monitor the relationships between tensors in your network to utilize the available memory as efficiently as possible.
The tensor management system will also automatically toggle inPlace processing if it is deemed safe.
(click) This can save up to 40% of GPU memory during training.
Also note that in Blaze, intermediate results are always stored separate from the model.
So you can forward propagate multiple times without losing the ability to backprop separately for the previous mini-batch.
This is a nice property to have if you are optimizing hyper parameters…
… or if you want to write fancy optimizers that explore the hyperplane of the cost function.
As you have already seen in the previous slides, we have the ability to visualize lots of things. Right now we support only our own visualization system, Showoff, but the system is extensible.
Time Budget: 30 seconds
Talking Points:
Now, for the last part of our deep learning system.
Inferno itself.
(click)
Time Budget: 1.5 minutes
Talking Points:
To be able to utilize a cluster for training your models, you have to use Inferno.
In Inferno everything always starts with the Cluster coordinator.
First, you will have to provide a SparkConf so that we know what Spark master you want to connect to.
(click) The Cluster Coordinator creates and takes control of the Spark Context.
Now the automatic initialization procedure starts.
(click) First, the coordinator will briefly claim all cluster resources.
(click) It then probes each executor and checks for specific settings in its local configuration.
(click) Then it frees all cluster resources that cannot be used for one reason or another, to make them available for the Spark Scheduler again.
(click) Now special plugins like for example CUBlaze can be loaded.
Now the system is initialized.
(click) Typically you would now somehow load your dataset.
(click) Here we used the Inferno FileRDD, which is a special RDD that can handle huge numbers of files much faster than the built-in Spark RDDs.
This way we can, for instance, just drop the entire ImageNet dataset into our HDFS filesystem and have it accessible from the entire cluster.
(click) Anyway, sooner or later you want to create samples that you can use for learning. Notice that sample creation is lazy: once we have the meta-data for the HDFS files in the FileRDD, we do not need to access a file again until we really need it for learning.
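Since Inferno is not released yet, the following is only a hypothetical sketch of that initialization sequence; apart from SparkConf and SparkContext, every name here is invented:

import org.apache.spark.{SparkConf, SparkContext}

// Invented stand-in for Inferno's cluster coordinator; the real init steps
// (claim all executors, probe their local configuration, free unusable
// resources, load plugins such as CUBlaze) are described in the notes above.
class ClusterCoordinator(conf: SparkConf) {
  val sc = new SparkContext(conf) // the coordinator owns the Spark context
}

object InfernoSketch extends App {
  val conf = new SparkConf().setAppName("inferno-sketch") // plus your master URL
  val coordinator = new ClusterCoordinator(conf)
  // A FileRDD-style source would then expose e.g. ImageNet from HDFS, with
  // lazy sample creation on top of the cached file meta-data.
}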
Time Budget: 1.5 minutes
Talking Points:
Presuming that you already have your Blaze optimizer, the Inferno optimizer is easy to use.
As always, everything starts with the
(click) Just provide the Blaze model you want to tune and cache it.
(click) Provide the description of the Blaze optimizer and cache it.
(click) Provide the description of the preprocessing pipeline and cache it.
(click) Now come the Inferno optimizer's objectives, hyper parameters and scope delimiter.
(click) Now, I know that looks similar, but in fact both sets of parameters are different.
The optimizer will be distributed to the workers, and so are its objectives; they are only evaluated there.
The Inferno parameters are evaluated in the master.
(click) Anyway, finally you will have to provide your sample data and call the build() function.
Now you have your Inferno optimizer. The only thing left is to call the “run()” function.
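In the same hypothetical style, here are the setup steps above as a builder chain; none of these identifiers are Inferno's real API:

// Invented builder mirroring the steps in the notes. `AnyRef` placeholders
// stand in for the cached Blaze model/optimizer/pipeline descriptions.
final class InfernoOptimizer { def run(): Unit = () } // drives distributed training

final class InfernoOptimizerBuilder {
  def withModel(m: AnyRef): this.type      = this // cached Blaze model
  def withOptimizer(o: AnyRef): this.type  = this // cached optimizer description
  def withPipeline(p: AnyRef): this.type   = this // cached preprocessing pipeline
  def withObjectives(o: AnyRef): this.type = this // evaluated on the master
  def withData(d: AnyRef): this.type       = this // sample data
  def build(): InfernoOptimizer            = new InfernoOptimizer
}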
Time Budget: 1 minute
Now for the performance.
Here we have trained a ResNet 34 on a single GPU.
I deliberately took this picture. This is how Blaze would visualize its progress to you.
With 1 GPU, we reached 20% Top-1 accuracy after 2 hours, 42 minutes.
Now the same network, training on 4 machines with the same specs, using Inferno.
We reached the same result in 57 minutes.
So: 4x the hardware, a 2.8 times speed improvement…
Time Budget: 1 minute
Well, nice, but not impressive.
But ResNet 34 is very small and still doable on a single GPU.
Let’s take on something larger. ResNet 152, with pre-activation units.
As you can see from the horizontal axis, it takes incredibly long to train this on a single GPU (blue line). I basically gave up after about 44 hours.
The distributed version (green), in this case 5 TitanX cards in an Inferno cluster with a poor 1 GbE link speed:
we can reach 30% Top-1 accuracy with similar hyper parameters in less than 7 hours.
That is about 4.8 times faster than 33 hours using a single GPU.
Time Budget: 1.5 minute
So, to sum up
Blaze & CUBlaze & Inferno
They are fast, have a huge and extensible module library, and are quite easy to use.
Inferno
Allows you to accelerate DL tasks.
And it uses Spark MR for all communication.
So no shady network connections that punch holes into your security.
And of course we can achieve decent results using cheap Ethernet hardware where others can't.
Time Budget: 1.5 minute
So now the big question that remains is: How can you obtain this software to start playing around.
For Blaze and CUBlaze, I have published snapshots of the current stable release on GitHub.
There is example code. So just grab them and follow the instructions.
Our visualization system Showoff, can be found at Aiden Nibali’s GitHub repo as a docker-image.
For Inferno things are more complicated.
I am in fact writing a paper about our optimizer right now.
Unfortunately, I have to wait until that paper has been accepted before I can release the code.
However, as soon as that happens, you will find it next to Blaze & CUBlaze in the above mentioned repository.
The best way to prepare for Inferno is to get familiar with Blaze now.
Time Budget: 5 minutes
Talking points:
So, I don’t have many slides left. Any questions?
(if people stand up, switch to the next slide.)
Time Budget: -
At LaTrobe we do quite a lot with deep learning.
If you are interested, regardless whether you are a student or industry representative, you can contact us here.