Deep Learning Frameworks Using Spark on YARN by Vartika Singh

Vartika Singh
Field Data Science Architect

©2014 Cloudera, Inc. All rights reserved.
"Would you tell me, please, which road do I take?"
"That depends a good deal on where you want to get to."
"I don't much care where..."
"Then it doesn't matter which way you go."

An overview of the ML pipeline
• Raw data: many sources, many formats, varying validity
• Data Engineering (cleaning, merging, filtering) turns raw data into well-formatted data
• Data Science (model building, model training, hyper-param tuning) works on training, validation, and test data and produces validated ML models
• Data Engineering (pipeline execution, production operation, ongoing data ingestion) carries the validated models to the end user
• End user: consumption for analysis

Deep Learning in Big Data
• A major source of difficulty in many real-world
artificial intelligence applications is that many of the
factors of variation influence every single piece of
data we can observe.
• Deep learning solves this central problem via
representation learning by introducing representations
that are expressed in terms of other, simpler
representations.

Deep Learning in Hadoop
• http://blog.cloudera.com/blog/2017/04/deep-learning-frameworks-on-cdh-and-cloudera-data-science-workbench/
• http://blog.cloudera.com/blog/2017/04/bigdl-on-cdh-and-cloudera-data-science-workbench/

Analysis Pipeline
• Raw binary data in S3
• Data Engineering: metadata, feature extraction, filter
• Processed data in S3: training, validation, test
• Data Science and Exploration (CDSW): model, train, tune; hyperparameters/code
• Validated model: model, parameters
• Data Engineering: processing, execution; live data ingest
• Data Lake Cluster (S3/HDFS/Kudu): Search and SQL
• End results UI: insights, predictions, results
• Formats shown in the diagram: Parquet, PMML
• Need archived data

Deep Learning at scale
• A significant amount of effort has been put into developing deep learning systems that can scale to very large
models and large training sets.
• Large models in the literature are now top performers in supervised visual recognition tasks.
• They can even learn to detect objects when trained from unlabeled images alone.
• The largest of these systems can train neural networks with over 1 billion trainable parameters.
• While such extremely large networks are potentially valuable objects of AI research, the expense of training them is
overwhelming, as the distributed computing infrastructure known as "DistBelief" illustrates.

When to do it?
Distributed training isn’t free.
Setup time.
Continue to train your networks on a single machine, until the training time becomes
prohibitive.

Scaling Out and Up
• Using multiple machines in a large cluster
• Leveraging graphics processing units (GPUs).

GPUs
• The use of GPUs is a significant advance in recent years that
makes the training of modestly sized deep networks practical.
• A known limitation of the GPU approach is that the training
speed-up is small when the model does not fit in GPU memory
(typically less than 6 gigabytes).
• To use a GPU effectively, researchers often reduce the size of
the data or parameters so that CPU-to-GPU transfers are not a
significant bottleneck.
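The memory limit above can be checked with back-of-the-envelope arithmetic. The sketch below is only an estimate under stated assumptions: 4 bytes per parameter (float32), one extra copy of the parameters for gradients, and no accounting for activations or framework overhead. The function names and the 60-million-parameter model are hypothetical; the 6 GB figure and the 1-billion-parameter figure come from the slides.

```python
def param_memory_gb(num_params, bytes_per_param=4, copies=2):
    """Estimate memory (in GB) for parameters plus extra copies such as gradients.

    Assumptions: float32 parameters (4 bytes) and two copies in total
    (weights and gradients). Activations are ignored.
    """
    return num_params * bytes_per_param * copies / 1e9


def fits_on_gpu(num_params, gpu_memory_gb=6.0, **kwargs):
    """Return True if the estimated parameter memory fits in GPU memory."""
    return param_memory_gb(num_params, **kwargs) <= gpu_memory_gb


small_model = 60_000_000      # hypothetical modestly sized network
large_model = 1_000_000_000   # the 1-billion-parameter scale mentioned earlier

for name, n in [("small", small_model), ("large", large_model)]:
    print(name, round(param_memory_gb(n), 2), "GB, fits:", fits_on_gpu(n))
```

Under these assumptions the large model needs more than the 6 GB budget for parameters and gradients alone, which is where model parallelism or multiple machines come in.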

Parallelism
• Within machines
• Multithreading
• Across machines
• Message passing
Model parallelism
In model parallelism, different machines in the distributed system are responsible for the computations in different parts of a single network.
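A toy sketch of model parallelism, assuming a single linear layer whose weight rows are split between two hypothetical machines. Each machine computes only its slice of the output, and concatenating the slices reproduces the result of the full layer. The names `matvec`, `machine_a`, and `machine_b` are illustrative only.

```python
def matvec(rows, x):
    """Multiply a weight matrix (given as a list of rows) by a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]


# A layer with 4 output units and 2 inputs.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
x = [1.0, -1.0]

# Single-machine result.
full = matvec(W, x)

# Model-parallel result: each machine owns half of the output units.
machine_a = matvec(W[:2], x)
machine_b = matvec(W[2:], x)
combined = machine_a + machine_b  # list concatenation of the two slices

print(full == combined)
```

In a real system the input `x` has to be sent to both machines and the slices gathered afterwards, which is where the communication cost comes from.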

Data parallelism
In data parallelism, different machines have a complete copy of the model; each machine simply gets a different portion of the data, and the results from each machine are combined.

Typical considerations in Cloud
What we see out there

Driver Libraries
• cuDNN and Intel’s MKL
• One of the primary goals of driver libraries is to enable the community of neural
network frameworks to benefit equally from their APIs.
• cuDNN exposes a host-callable C language API, but requires that input and
output data be resident on the GPU.
• The library is thread-safe, and its routines can be called from different host
threads.
• The convolution routines in cuDNN provide competitive performance with zero
auxiliary memory required.

CPUs and GPUs
• The most important feature for deep learning performance is memory bandwidth.
• GPUs are optimized for memory bandwidth at the expense of memory access time
(latency).
• Batch methods, such as limited-memory BFGS (L-BFGS) or Conjugate Gradient (CG), combined
with a line search procedure, are usually much more stable to train and easier to
check for convergence.
• These methods, conventionally considered slow, can be fast thanks to the availability
of large amounts of RAM, multicore CPUs, GPUs, and computer clusters with fast network
hardware.
• Balance the number of CPUs and GPUs.
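To make the line search ingredient of batch methods concrete, here is a minimal sketch of full-batch gradient descent with a backtracking (Armijo) line search. It is not L-BFGS or CG; it only shows how a line search picks a step size that guarantees the objective decreases. The quadratic objective, the constants `shrink` and `c`, and all function names are assumptions made for illustration.

```python
def backtracking_line_search(f, x, g, step=1.0, shrink=0.5, c=1e-4):
    """Shrink the step until the Armijo sufficient-decrease condition holds."""
    fx = f(x)
    g2 = sum(gi * gi for gi in g)  # squared norm of the gradient
    while f([xi - step * gi for xi, gi in zip(x, g)]) > fx - c * step * g2:
        step *= shrink
    return step


def f(x):
    """Hypothetical objective: a badly scaled quadratic with minimum at (1, -2)."""
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2


def grad_f(x):
    """Exact gradient of f, computed over the whole (batch) objective."""
    return [2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)]


x = [0.0, 0.0]
for _ in range(200):
    g = grad_f(x)
    step = backtracking_line_search(f, x, g)
    x = [xi - step * gi for xi, gi in zip(x, g)]

print(x)
```

Because every accepted step lowers the objective, convergence is easy to monitor, which is the stability property the slide attributes to batch methods with line search.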

Communication costs

AWS EBS
• Use volumes that are attached to an EBS-optimized instance
or an instance with 10 Gigabit network connectivity.
• EC2 instances that do not meet these criteria offer no
guarantee of network resources.
• On an instance without EBS optimization, you can use all of the network
bandwidth for traffic to Amazon EBS if your application isn't pushing other
network traffic that contends with it.

Optimized EBS
• EBS-optimized connections are full-duplex, and can drive
more throughput and IOPS in a 50/50 read/write workload
where both communication lanes are used.
• In some cases, network, file system, and Amazon EBS
encryption overhead can reduce the maximum throughput
and IOPS available.

AWS EC2
• Physical proximity of EC2 instances
• EC2 instance maximum transmission unit (MTU)
• The size of your EC2 instance.
• EC2 enhanced networking support for Linux
• Placement groups

Security
Security mechanisms in cloud technology are often weak, so tampering with data in the
public cloud is a big concern, and using the public cloud calls for a robust security
mechanism. In addition to the firewalls, VPNs, and encryption usually
provided by cloud service providers, CDH provides:
Authentication
Authorization
Encryption

Elasticity
Managed Service for elastic data pipelines
No data silos
Backward compatibility and platform portability
Built-in workload management
Data Governance

Exploration and development
Fast and interactive data analysis
Isolated filesystem
Custom environment

Deep Learning Frameworks on Hadoop
What we see out there

AWS
• Components: Impala, Search, Spark, ML/DL
• All independent
• Managing upgrades is on the user
• Debugging is tricky
• Easy snapshot
• Configurable
• Scalable
AWS and CDH
• Scale with AWS
• Manage with CDH
• Manage from CDH
• Upgrade from CDH
• Debug using CDH
• Easy snapshot
• Configurable
• Scalable

Caffe2 - Synch SGD
• Data parallel
• Using 8 GPUs to run a batch of 32 each is equivalent to
one GPU running a mini-batch of 256.
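The equivalence between 8 GPUs with 32 examples each and one GPU with 256 examples can be checked numerically. The sketch below assumes a one-parameter linear model with mean squared error and synthetic data; it is not Caffe2 code, and `grad_mse` and the data are hypothetical. Averaging the 8 per-shard gradients gives the same value as the gradient over the full mini-batch, because all shards have the same size.

```python
import random


def grad_mse(w, batch):
    """Gradient of mean squared error for the model y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(256)]
data = [(x, 3.0 * x + random.gauss(0, 0.1)) for x in xs]  # synthetic examples
w = 0.5  # current model parameter, identical on every GPU

# One GPU, mini-batch of 256.
single_gpu = grad_mse(w, data)

# Eight GPUs, 32 examples each, gradients averaged synchronously.
shards = [data[i * 32:(i + 1) * 32] for i in range(8)]
multi_gpu = sum(grad_mse(w, shard) for shard in shards) / len(shards)

print(abs(single_gpu - multi_gpu) < 1e-9)
```

The two values agree up to floating-point rounding, so a synchronous update with the averaged gradient takes the same step as the single large mini-batch.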

TensorFlow - Synch and Asynch SGD
• Data parallel
• Synchronous SGD
• Asynchronous SGD
• Model parallel
• Concurrent Steps for Model computation in a pipeline
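The difference between synchronous and asynchronous SGD can be sketched with a deterministic simulation. This is not TensorFlow code: the one-dimensional objective, the round-robin worker schedule, and the names `sync_sgd` and `async_sgd` are assumptions chosen to keep the example small. In the synchronous version every worker computes its gradient from the same parameters and the gradients are averaged; in the asynchronous version each worker applies a gradient computed from the parameters it last pulled, which may be stale.

```python
def grad(w, target=4.0):
    """Gradient of the toy objective 0.5 * (w - target) ** 2."""
    return w - target


def sync_sgd(num_workers, steps, lr=0.1, w=0.0):
    """All workers see the same parameters; one averaged update per step."""
    for _ in range(steps):
        grads = [grad(w) for _ in range(num_workers)]
        w -= lr * sum(grads) / num_workers
    return w


def async_sgd(num_workers, steps, lr=0.1, w=0.0):
    """Workers update in round-robin order using their last pulled parameters."""
    snapshots = [w] * num_workers          # parameters each worker last pulled
    for step in range(steps):
        k = step % num_workers
        w -= lr * grad(snapshots[k])       # gradient may be based on stale parameters
        snapshots[k] = w                   # worker k pulls the fresh parameters
    return w


w_sync = sync_sgd(num_workers=4, steps=200)
w_async = async_sgd(num_workers=4, steps=200)
print(w_sync, w_async)
```

With a small learning rate both variants approach the target here; with larger learning rates or more workers the stale gradients in the asynchronous variant can slow or destabilize training.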

CaffeOnSpark
• Caffe is a deep learning framework from the Berkeley Vision Lab, implemented in C++,
where models and optimizations are defined as plaintext schemas instead of code. It
has a command line as well as a Python interface and has been widely adopted,
especially for vision-related tasks.
• Yahoo released a Spark interface for Caffe, which gives you the ability to run the DNN
model within the same cluster where your ingested data and other analytical
frameworks reside, conforming to the company-wide security and governance
policies.

TensorFlowOnSpark
• In sequence, Google released TensorFlow, then enhanced distributed deep learning
capabilities in TensorFlow, and then HDFS support.
• Supports direct tensor communication between processes.
• Scales easily by adding more machines.
• TensorFlow ingests data using QueueRunners or feed_dict and does not leverage Spark
for data ingestion.

DL4J
• Supports Apache Spark (1 and 2) for distributed training on a cluster.
• Supports data-parallel synchronous parameter averaging.
• Recently added support for asynchronous gradient descent.
• Uses Aeron for message passing.

Data parallelism in Spark
Data parallel approaches to distributed training keep a copy of the entire model on each
worker machine, processing different subsets of the training data set on each. Data
parallel training approaches all require some method of combining results and
synchronizing the model parameters between each worker. A number of different
approaches have been discussed in the literature, and the primary differences between
approaches are
• Parameter averaging vs. update (gradient)-based approaches
• Synchronous vs. asynchronous methods
• Centralized vs. distributed synchronization
Deeplearning4j’s current Spark implementation is a synchronous parameter averaging approach.
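Synchronous parameter averaging can be sketched in a few lines. This is a toy illustration, not Deeplearning4j's implementation: the one-parameter model, the noise-free data, and the names `local_sgd` and `parameter_averaging` are assumptions. Each worker starts a round from the same parameters, runs a few SGD steps on its own shard, and the parameters are then averaged before the next round.

```python
def local_sgd(w, shard, lr=0.1, steps=5):
    """Run a few full-shard gradient steps on one worker's copy of the model."""
    for _ in range(steps):
        g = sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
        w -= lr * g
    return w


def parameter_averaging(shards, rounds=20, w=0.0):
    """Synchronous parameter averaging across workers, one shard per worker."""
    for _ in range(rounds):
        local = [local_sgd(w, shard) for shard in shards]  # full model copy per worker
        w = sum(local) / len(local)                        # average the parameters
    return w


# Three workers, each with a different subset of data generated by y = 3 * x.
shards = [
    [(x, 3.0 * x) for x in xs]
    for xs in ([0.5, 1.0], [-1.0, 2.0], [1.5, -0.5])
]
w = parameter_averaging(shards)
print(w)
```

An update-based approach would instead send gradients (or parameter deltas) to be combined, which is the first of the distinctions listed above.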

What is BigDL?
GitHub: github.com/intel-analytics/BigDL
http://software.intel.com/ai
• Open source deep learning framework for Apache Spark
• High performance and efficient scale-out leveraging the Spark architecture
• Feature parity with Caffe, Torch, etc.
• Efficient implementations of synchronous stochastic gradient descent (SGD) and all-reduce communications in Spark
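The all-reduce pattern mentioned above can be sketched without Spark. The code below is not BigDL's implementation; it only simulates one common way to structure an all-reduce, assumed here for illustration: each worker is responsible for summing one slice of the gradient across all workers (reduce-scatter), and the summed slices are then shared with everyone (all-gather). The function name `all_reduce` and the sample gradients are hypothetical.

```python
def all_reduce(worker_grads):
    """Simulate all-reduce (sum) over per-worker gradient lists of equal length."""
    n = len(worker_grads)
    size = len(worker_grads[0])
    # Slice boundaries: worker i owns the elements in [lo, hi).
    bounds = [(i * size // n, (i + 1) * size // n) for i in range(n)]

    # Reduce-scatter: worker i sums its own slice across all workers.
    reduced = [
        [sum(g[j] for g in worker_grads) for j in range(lo, hi)]
        for lo, hi in bounds
    ]

    # All-gather: every worker receives all of the summed slices.
    full = [value for piece in reduced for value in piece]
    return [list(full) for _ in range(n)]


grads = [
    [1.0, 2.0, 3.0, 4.0],      # gradient computed by worker 0
    [10.0, 20.0, 30.0, 40.0],  # gradient computed by worker 1
]
result = all_reduce(grads)
print(result)
```

Splitting the work by slice means no single node has to receive every worker's full gradient, which is the motivation for using all-reduce instead of a single central aggregator.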

Thank you
