
Parallel/Distributed Deep Learning and CDSW


Deep learning (DL) is still one of the fastest-developing areas in machine learning. As models increase in complexity and data sets grow in size, model training can last hours or even days. In this session we will explore some of the trends in deep neural networks that accelerate training using parallelized/distributed deep learning.

We will also present how to apply some of these strategies using Cloudera Data Science Workbench and some popular open-source DL frameworks like Uber Horovod, TensorFlow and Keras.

Speakers
Rafael Arana, Senior Solutions Architect
Cloudera
Zuling Kang, Senior Solutions Architect
Cloudera Inc.


Parallel/Distributed Deep Learning and CDSW

  1. 1. © Cloudera, Inc. All rights reserved. Parallel/Distributed Deep Learning and CDSW Rafael Arana - Senior Solutions Architect Zuling Kang - Senior Solutions Architect
  2. 2. © Cloudera, Inc. All rights reserved. 2 TABLE OF CONTENTS ● Initiative of distributed deep learning and distributed model training ● Distributing the model training processes ● Integrating the distributed model training into CDSW ● Discussions and future
  3. 3. © Cloudera, Inc. All rights reserved. 3 BACKGROUND • CONNECT products & services (IoT) • PROTECT • DRIVE customer insights
  4. 4. © Cloudera, Inc. All rights reserved. 4 Are we there yet? QUAID Where am I? JOHNNY (cheerful) You're in a JohnnyCab! QUAID I mean...what am I doing here? JOHNNY I'm sorry. Would you please rephrase the question. QUAID (impatient, enunciates) How did I get in this taxi?! JOHNNY The door opened. You got it.
  5. 5. © Cloudera, Inc. All rights reserved. 5 Increase in compute Source: https://blog.openai.com/ai-and-compute/
  6. 6. © Cloudera, Inc. All rights reserved. 6 Model lifecycle
  7. 7. © Cloudera, Inc. All rights reserved. 7 The Power-law Region More compute + more training data -> Better Accuracy Reference: https://arxiv.org/abs/1712.00409
  8. 8. © Cloudera, Inc. All rights reserved. 8 The Power-law Region More compute + more training data -> Better Accuracy Reference: https://arxiv.org/abs/1712.00409
  9. 9. © Cloudera, Inc. All rights reserved. 9 PROBLEM: LABELED TRAINING DATA • Supervised learning • Reuse public data sets • Data Augmentation • Enterprise Data and data privacy regulations
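The data-augmentation bullet above is easy to make concrete. Below is a minimal, hedged sketch using the Keras ImageDataGenerator (not taken from the slides); the directory path, image size and parameter values are placeholders.

    # Hypothetical augmentation pipeline for an image data set
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    augmenter = ImageDataGenerator(
        rotation_range=15,        # random rotations
        width_shift_range=0.1,    # random horizontal shifts
        height_shift_range=0.1,   # random vertical shifts
        horizontal_flip=True,     # mirror images left/right
        rescale=1.0 / 255,        # scale pixel values to [0, 1]
    )
    # train_gen = augmenter.flow_from_directory("data/train", target_size=(224, 224),
    #                                           batch_size=32, class_mode="categorical")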
  10. 10. © Cloudera, Inc. All rights reserved. 10 TRANSFER LEARNING • Low budget (computation, data set labelling, …) • Use transfer learning to transfer knowledge from large public data sets to your own problem (a Keras sketch follows below). • Small data set: replace the softmax layer • Medium data set: replace the last layers • Large data set: use the pre-trained weights just for initialization • Sample image detection based on RetinaNet using Keras: person, car, … • But… what is that prediction on Ringo’s left leg?
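As a hedged illustration of the transfer-learning recipe above (not the exact code used in the talk), the sketch below reuses an ImageNet-pretrained ResNet50 in Keras, freezes it, and replaces only the top softmax layer, i.e. the small-data-set case; num_classes and the input size are placeholders.

    import tensorflow as tf

    num_classes = 10  # hypothetical number of target classes

    # Pre-trained backbone without its classification head
    base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                          input_shape=(224, 224, 3), pooling="avg")
    base.trainable = False  # small data set: train only the new softmax layer

    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(base.output)
    model = tf.keras.models.Model(inputs=base.input, outputs=outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(train_images, train_labels, epochs=5)

For the medium-data-set case you would also unfreeze the last few layers of the backbone; for the large-data-set case you would leave everything trainable and use the pre-trained weights only as initialization.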
  11. 11. © Cloudera, Inc. All rights reserved. 11 Neural Networks Architectures Training Data Set
  12. 12. © Cloudera, Inc. All rights reserved. 12 Neural Network Architecture and Accuracy Do DNN models with more parameters produce higher classification accuracy? • Example: popular computer-vision convnets • VGG and AlexNet each have more than 150MB of fully-connected layer parameters; GoogLeNet has smaller fully-connected layers, and NiN has no fully-connected layers at all • GoogLeNet and NiN use convolutions of resolution 1x1 instead of 3x3 or larger • Models with fewer parameters are more amenable to scaling, while still delivering high accuracy. Reference: https://arxiv.org/pdf/1511.00175
  13. 13. © Cloudera, Inc. All rights reserved. 13 Let’s put our model in production!!!! Photos by Unsplash
  14. 14. © Cloudera, Inc. All rights reserved. 14 Industrialization of ML – Efficient training Photos by Unsplash
  15. 15. © Cloudera, Inc. All rights reserved. 15 Machine Learning Development Life Cycle
  16. 16. © Cloudera, Inc. All rights reserved. 16 Let’s scale
  17. 17. © Cloudera, Inc. All rights reserved. 17 Cloudera Data Science Workbench Architecture [diagram: CDSW master and engine hosts, with a container registry and Git repo, running on gateway node(s) of a CDH cluster managed by Cloudera Manager, with access to Hive, HDFS, ...]
  18. 18. © Cloudera, Inc. All rights reserved. 18 Cloudera Data Science Workbench Architecture [diagram: the same layout on HDP — CDSW master and engine hosts on gateway node(s) of an HDP cluster managed by Ambari, with access to Hive, HDFS, ...]
  19. 19. © Cloudera, Inc. All rights reserved. 19 Adding GPUs Step 1. Admin > Engines > Engine Images
  20. 20. © Cloudera, Inc. All rights reserved. 20 Adding GPUs Step 2. Project > Settings > Engine
  21. 21. © Cloudera, Inc. All rights reserved. 21 Adding GPUs [diagram: GPU support — CDSW (CPU + GPU) used for single-node training; CDH/HDP (CPU) used for distributed training and scoring; GPU on CDH coming in C6]
  22. 22. © Cloudera, Inc. All rights reserved. 22 Distributed Tensorflow Package • Main concepts • Workers • Parameter Servers • tf.train.Server(), tf.train.ClusterSpec(), tf.train.SyncReplicasOptimizer(), tf.train.replica_device_setter() (a minimal sketch of how these fit together follows below)
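As a minimal sketch (not taken from the slides) of how these pieces fit together in TF 1.x: every process builds the same cluster description, starts a server for its own role, and workers place variables on the parameter server. The host names, ports and task indices are placeholders that would normally come from flags or environment variables.

    import tensorflow as tf

    cluster = tf.train.ClusterSpec({
        "ps":     ["ps0.example.com:2222"],                                # parameter server(s)
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    })

    job_name, task_index = "worker", 0   # per-process role (placeholder values)
    server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

    if job_name == "ps":
        server.join()  # parameter servers only serve variables
    else:
        # Place variables on the PS job and computation on this worker
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
            global_step = tf.train.get_or_create_global_step()
            # ... build the model; optionally wrap the optimizer with
            # tf.train.SyncReplicasOptimizer for synchronized updates ...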
  23. 23. © Cloudera, Inc. All rights reserved. 23 Local Multi-GPU Training - TF Distribution Strategy, Keras API [diagram: CDSW host with CPUs and multiple GPUs]
    distribution = tf.contrib.distribute.MirroredStrategy()
    with distribution.scope():
        inputs = tf.keras.layers.Input(shape=(1,))
        predictions = tf.keras.layers.Dense(1)(inputs)
        model = tf.keras.models.Model(inputs=inputs, outputs=predictions)
        model.compile(loss='mean_squared_error',
                      optimizer=tf.train.GradientDescentOptimizer(learning_rate=0.2))
    model.fit(train_dataset, epochs=5, steps_per_epoch=10)
  24. 24. © Cloudera, Inc. All rights reserved. 24 Local Multi-GPU Training - TF Distribution Strategy, Estimator API [diagram: CDSW host with CPUs and multiple GPUs running a TF Estimator]
    def model_fn(features, labels, mode):
        layer = tf.layers.Dense(1)
        logits = layer(features)
        # ... (truncated on the slide: compute the loss and return a tf.estimator.EstimatorSpec)

    def input_fn():
        features = tf.data.Dataset.from_tensors([[1.]]).repeat(100)
        labels = tf.data.Dataset.from_tensors(1.).repeat(100)
        return tf.data.Dataset.zip((features, labels))

    distribution = tf.contrib.distribute.MirroredStrategy()
    config = tf.estimator.RunConfig(train_distribute=distribution)
    classifier = tf.estimator.Estimator(model_fn=model_fn, config=config)
    classifier.train(input_fn=input_fn)
    classifier.evaluate(input_fn=input_fn)
  25. 25. © Cloudera, Inc. All rights reserved. Distributing the Model Training Processes
  26. 26. © Cloudera, Inc. All rights reserved. 26 PROCEDURES OF TRAINING A DEEP LEARNING MODEL
    Repeat the following for num_epoch times:
        For each mini_batch (x, y) in dataset:
            Set pred_tensor = model(x)                  // feeding forward
            Set diff_tensor = L2_loss(y, pred_tensor)   // OR cross_entropy_loss(y, pred_tensor)
            Set grad = gradient of diff_tensor
            Update the model using grad
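Written out concretely (a hedged, TensorFlow 2-style sketch rather than code from the talk), the loop above looks as follows; model, dataset and num_epochs are assumed to exist already.

    import tensorflow as tf

    loss_fn = tf.keras.losses.MeanSquaredError()          # or a cross-entropy loss
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

    for epoch in range(num_epochs):
        for x, y in dataset:                              # dataset yields mini-batches
            with tf.GradientTape() as tape:
                pred = model(x, training=True)            # feeding forward
                loss = loss_fn(y, pred)                   # L2 / cross-entropy loss
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))  # update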
  27. 27. © Cloudera, Inc. All rights reserved. 27 FOUR MAJOR ISSUES IN DISTRIBUTED MODEL TRAINING • Shall we use data parallelism or model parallelism? • How do we efficiently distribute the model parameters, which are normally huge in number? • How do we aggregate the model parameters on different training nodes into a global model? • Model updating algorithms • How do we efficiently scale the training load and give it efficient access to the huge amount of training data? • The first 3 issues are covered in this section, while the 4th one will be addressed in the next section.
  28. 28. © Cloudera, Inc. All rights reserved. 28 TENSORFLOW AND MODEL PARALLELISM ● The initial idea, DistBelief, was proposed by Google ● First published in its research paper in 2012 ● Used as the built-in distributed implementation for Tensorflow ● Parameter server (PS) ● A centralized server for sharing neural network parameters ● Model parallelism: ● A method to distribute the training parameters across worker nodes ● Model updating algorithm ● Downpour SGD Jeffrey Dean, et al. “Large scale distributed deep networks”, Advances in Neural Information Processing Systems (NIPS), 2012.
  29. 29. © Cloudera, Inc. All rights reserved. 29 FROM MODEL TO DATA PARALLELISM • Strength of model parallelism • Applicable to models whose size exceeds the memory or GPU capacity of ONE worker node • Weakness • Unable to take full advantage of our hardware resources • For models whose parameters can be held in the GPUs of ONE worker node • Data parallelism
  30. 30. © Cloudera, Inc. All rights reserved. 30 HARDWARE USE RATE AS TRAINING NODE INCREASES https://eng.uber.com/horovod/
  31. 31. © Cloudera, Inc. All rights reserved. 31 WHOLE PICTURE OF DATA PARALLELISM https://eng.uber.com/horovod/
  32. 32. © Cloudera, Inc. All rights reserved. 32 FROM PS TO MPI ALLREDUCE ● Based on the Baidu ring-allreduce algorithm (see http://andrew.gibiansky.com/ for details) ● Uses an HPC/MPI framework, originally written in C and currently wrapped in Python ● Implemented in Uber Horovod, Baidu, PyTorch, MXNet, etc. ● Found to be faster at small node counts (8-64) ○ https://cwiki.apache.org/confluence/display/MXNET/Extend+MXNet+Distributed+Training+by+MPI+AllReduce ● A minimal Horovod sketch follows below.
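Below is a minimal Horovod + Keras sketch of ring-allreduce data parallelism, written against the TF 2-style Keras API as an assumption rather than the exact code used in the talk; the model and dataset are placeholders.

    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()                                            # one process per GPU/worker

    # Pin each process to its own GPU, if any are visible
    gpus = tf.config.experimental.list_physical_devices("GPU")
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])

    # Scale the learning rate by the number of workers, then wrap the optimizer
    # so gradients are averaged with allreduce at every step
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
    model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # sync initial weights
    # model.fit(train_dataset, epochs=5, callbacks=callbacks,
    #           verbose=1 if hvd.rank() == 0 else 0)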
  33. 33. © Cloudera, Inc. All rights reserved. 33 PERFORMANCE GAINS OF INFINIBAND/RDMA https://eng.uber.com/horovod/
  34. 34. © Cloudera, Inc. All rights reserved. 34 MODEL UPDATING ALGORITHMS Synchronized Asynchronous From: Strategies and Principles of Distributed Machine Learning on Big Data, https://doi.org/10.1016/J.ENG.2016.02.008
  35. 35. © Cloudera, Inc. All rights reserved. 35 UPDATING ALGORITHM: SYNCHRONIZED VS. ASYNCHRONOUS • Synchronized algorithms lead to a more precise and consistent model; however, some workers sometimes have to wait a long time at the synchronization barrier, which lengthens training. • When the minibatch is large, this inefficiency is largely reduced. • From: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, https://arxiv.org/abs/1706.02677 • Asynchronous algorithms are said to be stochastic in their descent directions, which can make the model imprecise. • However, in practice there is some momentum that leads the model to converge very close to its synchronized counterpart.
  36. 36. © Cloudera, Inc. All rights reserved. 36 MODEL ERROR VS. BATCH SIZE Priya Goyal, et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. https://arxiv.org/abs/1706.02677
  37. 37. © Cloudera, Inc. All rights reserved. 37 FAMOUS SYNCHRONIZED AND ASYNCHRONOUS EXAMPLES • Synchronized updating algorithms • Microsoft CNTK: Model average after certain iterations. • Uber Horovod: Using large minibatches. • Asynchronous updating algorithms • Google Tensorflow: Downpour SGD.
  38. 38. © Cloudera, Inc. All rights reserved. 38 ALGORITHM FRAMEWORK FOR SYNCHRONIZED SGD (a sketch of the update rule follows below)
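The slide itself only shows a figure; as a hedged sketch of the standard synchronized data-parallel SGD update (following the Goyal et al. paper cited on the neighbouring slides), with K workers each holding a minibatch B_k:

    w_{t+1} = w_t - \eta \cdot \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|B_k|} \sum_{(x,y) \in B_k} \nabla_w \, \ell\big(f_{w_t}(x),\, y\big)

Each worker computes gradients on its own minibatch, the gradients are averaged across all workers (via parameter servers or allreduce) before a single update is applied, and the learning rate \eta is typically scaled linearly with the total minibatch size.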
  39. 39. © Cloudera, Inc. All rights reserved. Integrating the Distributed Model Training into CDSW
  40. 40. © Cloudera, Inc. All rights reserved. 40 OVERVIEW OF THE ARCHITECTURE
  41. 41. © Cloudera, Inc. All rights reserved. 41 USING THE CDSW API TO SPAWN TRAIN-WORKERS Use cdsw.launch_workers() to spawn worker containers; each worker connects back to the master at the IP address given by the CDSW_MASTER_IP environment variable. The trainer-master can then distribute the collected IP addresses so that all workers can set up mutual communication.
    master.py:
    import cdsw, socket
    import threading
    import time
    workers = cdsw.launch_workers(n=2, cpu=0.2, memory=0.5, script="worker.py")
    # Collect the workers' IP addresses by accepting their connections
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("0.0.0.0", 6000))
    s.listen(1)
    conns = dict()
    for i in range(2):
        conn, addr = s.accept()
        print("IP address of %d: %s" % (i, addr[0]))
        conns[i] = (conn, addr[0])
    # Not shown on the original slide: send the collected address list back so
    # each worker learns its peers (a minimal sketch)
    peer_list = ",".join(addr for (_, addr) in conns.values())
    for conn, _ in conns.values():
        conn.sendall(peer_list.encode())
    worker.py:
    import os, time, socket
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((os.environ["CDSW_MASTER_IP"], 6000))
    data = s.recv(1024).decode()
    print("Response from the master:", data)
    s.close()
  42. 42. © Cloudera, Inc. All rights reserved. 42 CREATING CDSW DOCKER IMAGES • CDSW Docker images for distributed model training can be created by extending the following base image: • docker.repository.cloudera.com/cdsw/engine:7 • Run the base image and install OpenMPI 4.0.0 from source inside the Docker instance. • Do not install the OS-provided OpenMPI package, as its version is below Horovod's requirement. • Install the core packages: • pip install petastorm tensorflow pytorch horovod • If you wish to use GPUs for model training, make sure to install the NVIDIA driver and use the GPU builds of Tensorflow and/or PyTorch. • See: https://www.cloudera.com/documentation/data-science-workbench/latest/topics/cdsw_gpu.html
  43. 43. © Cloudera, Inc. All rights reserved. 43 USING PRE-BUILT IMAGES • You can also use our pre-built Docker image from our public Docker repo. • docker pull rarana73/cdsw-7-horovod-gpu:1 • Content: • CDSW Base Image v7 • CUDA_VERSION 9.0.176 • NCCL_VERSION 2.4.2 • CUDNN_VERSION 7.4.2.24 • Tensorflow 1.12.0 • Open MPI 4.0.0
  44. 44. © Cloudera, Inc. All rights reserved. 44 INITIALIZING OPEN-MPI PEERS • Normally, OpenMPI peers are initialized by directly spawning Python/OpenMPI processes via the mpirun command. • Similarly, for CDSW-Horovod processes this can also be done by invoking the mpirun command from Python (a sketch follows below). • However, when doing so, make sure the train-worker containers are still alive.
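A hedged sketch of that step, using Python's subprocess module to invoke mpirun from the master session; the worker IPs are assumed to have been collected as in master.py above, and the slot count and training script name are placeholders.

    import subprocess

    worker_ips = ["10.0.0.11", "10.0.0.12"]               # hypothetical train-worker IPs
    hosts = ",".join("%s:1" % ip for ip in worker_ips)    # one slot (GPU) per worker

    cmd = [
        "mpirun", "-np", str(len(worker_ips)),
        "-H", hosts,
        "-bind-to", "none", "-map-by", "slot",
        "-x", "NCCL_DEBUG=INFO", "-x", "LD_LIBRARY_PATH", "-x", "PATH",
        "python", "train_horovod.py",                     # hypothetical training script
    ]
    subprocess.run(cmd, check=True)                       # workers must stay alive meanwhile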
  45. 45. © Cloudera, Inc. All rights reserved. 45 Horovod in Action • Applying Horovod to a WideResNet model, trained on the Fashion MNIST dataset • 2 GPUs: NVIDIA Quadro P600 • CUDA Cores: 384 / 2 GB GDDR5 horovodrun -np 1 python fashion_mnist/fashion_mnist_solution.py --log-dir log/np-1 horovodrun -np 2 python fashion_mnist/fashion_mnist_solution.py --log-dir log/np-2
  46. 46. © Cloudera, Inc. All rights reserved. Discussions and Future
  47. 47. © Cloudera, Inc. All rights reserved. 47 Around the corner • SPARK On K8S & GPU support • Horovod in Spark • TensorFlow 2.0 & Distribution Strategy • Apache Submarine - https://hadoop.apache.org/submarine/ • …
  48. 48. © Cloudera, Inc. All rights reserved. THANK YOU
