[AWS Tech Talk] Using containers for deep learning workflows
- 1. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Shashank Prasanna,
Sr. Technical Evangelist, AI/ML
30th September 2019
Using Containers for Deep
Learning Workflows
- 2.
Agenda
Common deep learning setups and challenges
Using containers for deep learning workflows
• Demo 1: Containers for deep learning training workflows
Scaling deep learning training
• Demo 2: Submitting training jobs using containers to Amazon Elastic Kubernetes Service (Amazon EKS)
• Demo 3: Running large-scale experiments using containers on Amazon SageMaker
Summary and Q&A
- 3.
Common machine learning setups
1. Code & frameworks
2. Compute (CPUs, GPUs)
3. Storage
Diagram labels: CLI, EC2 instance, DL AMI, Amazon S3; CLI, On-premises
- 4.
Deep learning workflow
Data acquisition, curation, and labeling
Data preparation for training
Large-scale experimentation
Distributed training
Model optimization and validation
Deployment
Need for scale
- 5.
Deep learning is computationally expensive, but can be scaled out
this…: CLI, EC2 instance
…to this: CLI, Cluster
- 6.
Scaling out deep learning training
Distributed training: distributing training of a single model to train faster
Parallel experiments: different models running in parallel to find the best model
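A minimal sketch of the idea behind distributed (data-parallel) training, assuming a toy one-parameter linear model and equal-sized data shards. The function names are hypothetical and the "workers" run sequentially here; real setups use a library (for example horovod, which appears in the stack on later slides) to average gradients across GPUs.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pair


def gradient(w: float, batch: List[Sample]) -> float:
    """Gradient of the mean squared error for the model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def split_into_shards(data: List[Sample], num_workers: int) -> List[List[Sample]]:
    """Give each worker an equal-sized slice (leftover samples are dropped)."""
    size = len(data) // num_workers
    return [data[i * size:(i + 1) * size] for i in range(num_workers)]


def data_parallel_step(w: float, data: List[Sample], num_workers: int, lr: float) -> float:
    """One step: per-shard gradients, averaged, then one shared update."""
    shards = split_into_shards(data, num_workers)
    worker_grads = [gradient(w, shard) for shard in shards]  # parallel in practice
    avg_grad = sum(worker_grads) / num_workers  # the "all-reduce" step
    return w - lr * avg_grad


data = [(float(x), 3.0 * x) for x in range(1, 9)]  # 8 samples, true w = 3
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, data, num_workers=4, lr=0.01)
print(round(w, 3))
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, which is why adding workers speeds up a step without changing the result.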
- 7.
…but there are challenges to scaling
Diagram labels: CLI, Cluster
1. Code and dependencies
2. Infrastructure management
3. Cluster management
- 8.
Machine learning stack is complex
• “My code requires building several dependencies from source”
• “My code isn’t taking advantage of the GPU/GPUs”
• “Are cuDNN and NCCL installed, and are they the right versions?”
• “My code is running slow on CPUs”
• “Oh wait, is it taking advantage of the AVX instruction set?!?”
• “I updated my drivers and training is now slower/errors out”
• “My cluster runs a different version of the framework/Linux distro”
Makes portability, collaboration, and scaling training really, really hard!
Challenge 1: Code and dependencies
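Several of the questions above can be partly answered by a small environment report. The sketch below is an illustration only, assuming Python 3.8+ and a Linux x86 host; the function names are hypothetical. It shows whether the CPU supports AVX, not whether a given framework build actually uses it.

```python
import platform
from importlib import metadata
from typing import Dict, Iterable, Optional, Set


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Extract CPU feature flags from the text of Linux /proc/cpuinfo."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def read_cpuinfo(path: str = "/proc/cpuinfo") -> str:
    """Return the file contents, or an empty string on non-Linux hosts."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""


report = {
    "python": platform.python_version(),
    "os": platform.platform(),
    "packages": installed_versions(["tensorflow", "numpy", "scipy", "horovod"]),
    "cpu_supports_avx": "avx" in cpu_flags(read_cpuinfo()),
}
print(report)
```

Running the same report on the development machine and on a cluster node gives two outputs that can be compared line by line.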
- 9.
Development system:
My code
Python, numpy, scipy, scikit-learn, pandas, openmpi, Keras, horovod, others…
TensorFlow 1.13
CPU: MKL 2019 v3
GPU: cuDNN 7.1, cuBLAS 10, NCCL 2, CUDA toolkit 10
NVIDIA drivers 436.15
Ubuntu 16.04
Training cluster:
My code
Python, numpy, scipy, scikit-learn, pandas, openmpi, Keras, horovod, others…
TensorFlow 1.14
CPU: MKL 2019 v2
GPU: cuDNN 7.5, cuBLAS 10, NCCL 2.4, CUDA toolkit 10
NVIDIA drivers 410.68
CentOS 7
Multiple points of failure
Challenge 1: Code and dependencies
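To make the "multiple points of failure" concrete, the following sketch compares the two stacks on this slide. The version numbers are copied from the slide as best they can be read from the transcript; the dictionary keys and the function name are hypothetical.

```python
from typing import Dict, Tuple


def diff_environments(dev: Dict[str, str], cluster: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return components whose versions differ, including ones missing on one side."""
    mismatches: Dict[str, Tuple[str, str]] = {}
    for name in sorted(set(dev) | set(cluster)):
        dev_version = dev.get(name, "<missing>")
        cluster_version = cluster.get(name, "<missing>")
        if dev_version != cluster_version:
            mismatches[name] = (dev_version, cluster_version)
    return mismatches


dev_system = {
    "NVIDIA drivers": "436.15", "OS": "Ubuntu 16.04", "TensorFlow": "1.13",
    "MKL": "2019 v3", "cuDNN": "7.1", "cuBLAS": "10", "NCCL": "2",
    "CUDA toolkit": "10",
}
training_cluster = {
    "NVIDIA drivers": "410.68", "OS": "CentOS 7", "TensorFlow": "1.14",
    "MKL": "2019 v2", "cuDNN": "7.5", "cuBLAS": "10", "NCCL": "2.4",
    "CUDA toolkit": "10",
}

mismatches = diff_environments(dev_system, training_cluster)
for name, (dev_version, cluster_version) in mismatches.items():
    print(f"{name}: dev={dev_version} cluster={cluster_version}")
```

Six of the eight listed components differ between the two systems, and each difference is a place where the same training script can behave differently.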
- 10.
Containers for Machine Learning
TensorFlow container image:
Your training scripts
Python, numpy, scipy, scikit-learn, pandas, openmpi, Keras, horovod, others…
Packages: TensorFlow
CPU: mkl
GPU: cudnn, cublas, nccl, CUDA toolkit
Container runtime
NVIDIA drivers
Host OS
Infrastructure
ML environments that are:
Challenge 1: Code and dependencies
- 11.
Development system:
TensorFlow container image + your training scripts
Python, numpy, scipy, scikit-learn, pandas, openmpi, Keras, horovod, others…
TensorFlow
CPU: mkl
GPU: cudnn, cublas, nccl, CUDA toolkit
Container runtime
NVIDIA drivers
Host OS
Container registry: push from the development system, pull on the training cluster
Training cluster:
The same TensorFlow container image + your training scripts
Container runtime
NVIDIA drivers
Host OS
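The push and pull steps in this diagram can be sketched as a list of Docker commands. This is an illustration, not the demo's actual script: the account ID, repository name, and tag are hypothetical placeholders, the region is borrowed from a later slide, and the registry login step is omitted. The code only prints the commands; it does not run them.

```python
from typing import List


def ecr_image_uri(account_id: str, region: str, repository: str, tag: str) -> str:
    """Build an Amazon ECR image URI:
    <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def push_commands(local_image: str, remote_uri: str) -> List[List[str]]:
    """Commands run on the development system."""
    return [
        ["docker", "build", "-t", local_image, "."],
        ["docker", "tag", local_image, remote_uri],
        ["docker", "push", remote_uri],
    ]


def pull_commands(remote_uri: str) -> List[List[str]]:
    """Commands run on each node of the training cluster."""
    return [["docker", "pull", remote_uri]]


# Hypothetical values for illustration only.
uri = ecr_image_uri("123456789012", "us-west-2", "dl-training", "latest")
for cmd in push_commands("dl-training:latest", uri) + pull_commands(uri):
    print(" ".join(cmd))
```

To execute a command instead of printing it, each list can be passed to `subprocess.run(cmd, check=True)` on a machine with Docker installed and registry credentials configured.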
- 12.
AWS Deep Learning Containers
https://docs.aws.amazon.com/dlami/latest/devguide/deep-learning-containers-images.html
Challenge 1: Code and dependencies
- 13.
DEMO 1: Containers for deep learning workflows
Diagram labels (AWS Cloud):
Amazon ECR: deep learning container images (AWS DL containers)
EC2 instance: GPUs, CLI
Amazon EBS: datasets and checkpoints
- 14.
Challenges with scaling deep learning
Diagram labels: CLI, Cluster
Code and dependencies
Infrastructure management
Cluster management
- 15.
ML infrastructure and cluster management
Image registry: container image repository
• Amazon Elastic Container Registry (Amazon ECR)
Compute: where the containers run
• Amazon EC2
Management: deployment, scheduling, scaling, and management of containerized applications
• Amazon Elastic Kubernetes Service (Amazon EKS)
• Amazon Elastic Container Service (Amazon ECS)
ML services: fully-managed service that covers the entire machine learning workflow
• Amazon SageMaker: Jupyter notebook instances, high performance algorithms, large-scale training, optimization, one-click deployment, fully managed with auto-scaling
- 16.
DEMO 2: Submitting training jobs to Amazon Elastic Kubernetes Service (Amazon EKS)
Approach:
1. Provision a Kubernetes cluster
Diagram labels: Code files, Custom container, Container registry, Amazon EKS cluster
- 17.
Create a Kubernetes cluster
Create cluster / Submit a training job
CLI
# Replace <public-key> with your SSH public key before running.
eksctl create cluster \
  --name eks-gpu \
  --version 1.13 \
  --region us-west-2 \
  --nodegroup-name gpu-nodes \
  --node-type p3.8xlarge \
  --nodes 4 \
  --timeout=40m \
  --ssh-access \
  --ssh-public-key=<public-key> \
  --auto-kubeconfig
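Once the cluster exists, a training job is submitted as a Kubernetes resource that points at the container image. The slide does not show the demo's manifest, so the sketch below is an assumption: it builds a plain `batch/v1` Job (the demo may have used a different resource type), with a hypothetical job name, image URI, and training command. Requesting `nvidia.com/gpu` assumes the NVIDIA device plugin is running on the cluster.

```python
import json
from typing import Any, Dict, List


def training_job_manifest(name: str, image: str, command: List[str], gpus: int) -> Dict[str, Any]:
    """Build a Kubernetes batch/v1 Job that runs one training container."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed training run
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "training",
                            "image": image,
                            "command": command,
                            "resources": {"limits": {"nvidia.com/gpu": gpus}},
                        }
                    ],
                }
            },
        },
    }


# Hypothetical image URI and command; 4 GPUs matches one p3.8xlarge node.
manifest = training_job_manifest(
    name="tf-training",
    image="123456789012.dkr.ecr.us-west-2.amazonaws.com/dl-training:latest",
    command=["python", "train.py"],
    gpus=4,
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted with `kubectl apply -f <file>`, since kubectl accepts JSON as well as YAML.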
- 18.
Learn more: Amazon EKS, Kubeflow and Katib
Amazon Elastic Kubernetes Service (Amazon EKS): aws.amazon.com/eks/
Kubeflow (machine learning workflows on Kubernetes): kubeflow.org/docs/aws/
Katib: Hyperparameter Tuning and Neural Architecture Search
- 19.
DEMO 3: Hyperparameter search experiment using Amazon SageMaker
Approach (diagram labels): Code files, Docker build, Custom container, Container registry, Amazon S3, SageMaker SDK, Fully-managed SageMaker cluster
Webinar: Machine Learning with Containers and Amazon SageMaker
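A hyperparameter search boils down to generating many configurations and running each as an independent training job in the same container. The sketch below shows a plain random search in standard-library Python; it does not use the SageMaker SDK, and the hyperparameter names and ranges are hypothetical.

```python
import random
from typing import Any, Dict, List


def sample_configs(num_trials: int, seed: int = 0) -> List[Dict[str, Any]]:
    """Draw independent hyperparameter configurations for a random search."""
    rng = random.Random(seed)  # seeded so the experiment can be reproduced
    configs = []
    for _ in range(num_trials):
        configs.append({
            "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform, 1e-4 to 1e-1
            "batch_size": rng.choice([64, 128, 256]),
            "momentum": rng.uniform(0.8, 0.99),
        })
    return configs


# Each configuration would be passed to one training job as arguments.
configs = sample_configs(num_trials=8)
for trial, config in enumerate(configs):
    print(trial, config)
```

Because the trials do not depend on each other, they can all run at once on separate instances, which is the "parallel experiments" form of scaling from earlier in the deck.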
- 20.
Takeaways
• Containers let you build l
• Leverage services such as Amazon
SageMaker and Kubernetes + Kubeflow
to manage large-scale ML workloads.
• Choose fully-managed or self-managed
based on needs
Code and
dependencies
Infrastructure
management
Cluster
management
- 21.
Resources
Documentation: docs.aws.amazon.com/sagemaker/latest/dg/whatis.html
Examples on GitHub: github.com/awslabs/amazon-sagemaker-examples
AWS ML Blog: aws.amazon.com/blogs/machine-learning/category/artificial-intelligence/
AWS Deep Learning Containers: docs.aws.amazon.com/dlami/latest/devguide/deep-learning-containers-images.html
Webinar: Machine Learning with Containers and Amazon SageMaker
- 22.
Thank you!
Shashank Prasanna,
Sr. Technical Evangelist, AI/ML
Questions? Happy to help:
Twitter: @shshnkp
LinkedIn: linkedin.com/in/shashankprasanna
Demo code and configuration scripts:
https://github.com/shashankprasanna/using-containers-for-dl