© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Using Containers for Deep Learning Workflows
Shashank Prasanna, Sr. Technical Evangelist, AI/ML
30th September 2019
Agenda
Common deep learning setups and challenges
Using containers for deep learning workflows
• Demo 1: Containers for deep learning training workflows
Scaling deep learning training
• Demo 2: Submitting training jobs using containers to Amazon Elastic Kubernetes Service (Amazon EKS)
• Demo 3: Running large-scale experiments using containers on Amazon SageMaker
Summary and Q&A
Common machine learning setups
1. Code & frameworks
2. Compute (CPUs, GPUs)
3. Storage
Diagram: CLI, EC2 instance (DL AMI), Amazon S3; CLI, On-premises
Deep learning workflow
1. Data acquisition, curation, and labeling
2. Data preparation for training
3. Large-scale experimentation
4. Distributed training
5. Model optimization and validation
6. Deployment
Need for scale
Deep learning is computationally expensive, but can be scaled out
From this: CLI, EC2 instance
To this: CLI, Cluster
Scaling out deep learning training
• Parallel experiments: different models running in parallel to find the best model
• Distributed training: distributing training of a single model to train faster
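The distributed-training idea can be sketched on a toy problem. The block below is a minimal illustration, assuming synchronous data parallelism on a one-parameter linear model; the function names and the data are hypothetical and not from the talk. Frameworks such as Horovod perform the averaging step with an allreduce across workers, which this sketch replaces with a plain Python sum.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (x, y) pair


def gradient(w: float, batch: Sequence[Sample]) -> float:
    """Gradient of the mean squared error for the model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def shard(batch: Sequence[Sample], workers: int) -> List[Sequence[Sample]]:
    """Split a batch into equally sized shards, one per worker."""
    if len(batch) % workers != 0:
        raise ValueError("batch size must be divisible by the worker count")
    size = len(batch) // workers
    return [batch[i * size:(i + 1) * size] for i in range(workers)]


def data_parallel_gradient(w: float, batch: Sequence[Sample], workers: int) -> float:
    """Each worker computes a gradient on its shard; the results are averaged."""
    local_grads = [gradient(w, s) for s in shard(batch, workers)]
    return sum(local_grads) / workers


if __name__ == "__main__":
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
    # With equal shard sizes, the averaged gradient matches the full-batch one.
    print(gradient(0.0, data), data_parallel_gradient(0.0, data, workers=2))
```

With equally sized shards the averaged gradient equals the single-machine gradient, which is why the work can be spread over more workers without changing the update.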
…but there are challenges to scaling
CLI, Cluster
1. Code and dependencies
2. Infrastructure management
3. Cluster management
Machine learning stack is complex
• “My code requires building several dependencies from source”
• “My code isn’t taking advantage of the GPU/GPUs”
• “Are cuDNN and NCCL installed, and are they the right versions?”
• “My code is running slow on CPUs”
• “Oh wait, is it taking advantage of the AVX instruction set?!”
• “I updated my drivers and training is now slower/errors out”
• “My cluster runs a different version of the framework/Linux distro”
This makes portability, collaboration, and scaling training really hard!
Challenge 1: Code and dependencies
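Several of the questions above come down to "what is actually installed here?". The following is a minimal sketch, using only the Python standard library, of a report that could be run on both the development system and the cluster. The function names are hypothetical. It only sees Python-level packages; native libraries such as cuDNN and NCCL and the NVIDIA driver are not visible to it and would need separate checks.

```python
import platform
from importlib import metadata
from typing import Dict, Iterable


def installed_versions(packages: Iterable[str]) -> Dict[str, str]:
    """Return the installed version of each package, or 'not installed'."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def environment_report(packages: Iterable[str]) -> Dict[str, str]:
    """Combine interpreter and OS details with the package versions."""
    report = {
        "python": platform.python_version(),
        "os": platform.platform(),
        "machine": platform.machine(),
    }
    report.update(installed_versions(packages))
    return report


if __name__ == "__main__":
    for key, value in environment_report(["tensorflow", "numpy", "horovod"]).items():
        print(f"{key}: {value}")
```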
Multiple points of failure between the development system and the training cluster

Development system
• My code
• Frameworks and packages: TensorFlow 1.13, Keras, horovod, numpy, scipy, scikit-learn, pandas, openmpi, Python, others…
• CPU: MKL 2019 v3
• GPU: cuDNN 7.1, cuBLAS 10, NCCL 2, CUDA toolkit 10
• NVIDIA drivers 436.15
• Ubuntu 16.04

Training cluster
• My code
• Frameworks and packages: TensorFlow 1.14, Keras, horovod, numpy, scipy, scikit-learn, pandas, openmpi, Python, others…
• CPU: MKL 2019 v2
• GPU: cuDNN 7.5, cuBLAS 10, NCCL 2.4, CUDA toolkit 10
• NVIDIA drivers 410.68
• CentOS 7

Challenge 1: Code and dependencies
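The mismatch between the two systems can be made explicit with a small comparison. This is an illustrative sketch only: the version strings are the ones shown on the slide, while the dictionary keys and the `diff_environments` helper are hypothetical names chosen for this example.

```python
from typing import Dict, Tuple

# Versions as listed on the slide for each system.
DEVELOPMENT = {
    "nvidia_driver": "436.15", "os": "Ubuntu 16.04", "tensorflow": "1.13",
    "mkl": "2019 v3", "cudnn": "7.1", "cublas": "10", "nccl": "2",
    "cuda_toolkit": "10",
}
CLUSTER = {
    "nvidia_driver": "410.68", "os": "CentOS 7", "tensorflow": "1.14",
    "mkl": "2019 v2", "cudnn": "7.5", "cublas": "10", "nccl": "2.4",
    "cuda_toolkit": "10",
}


def diff_environments(a: Dict[str, str], b: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return every component whose version differs between the two systems."""
    keys = sorted(set(a) | set(b))
    return {
        key: (a.get(key, "missing"), b.get(key, "missing"))
        for key in keys
        if a.get(key) != b.get(key)
    }


if __name__ == "__main__":
    for name, (dev, cluster) in diff_environments(DEVELOPMENT, CLUSTER).items():
        print(f"{name}: development={dev} cluster={cluster}")
```

Six of the eight listed components differ, and each one is a place where code that works on the development system can fail or slow down on the cluster.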
Containers for Machine Learning

TensorFlow container image
• Packages: TensorFlow, Keras, horovod, numpy, scipy, scikit-learn, pandas, openmpi, Python, others…
• CPU: mkl
• GPU: cudnn, cublas, nccl, CUDA toolkit
• + Your training scripts

Host
• Container runtime
• NVIDIA drivers
• Host OS
• Infrastructure

ML environments that are:
Challenge 1: Code and dependencies
Development system
• TensorFlow container image: TensorFlow, Keras, horovod, numpy, scipy, scikit-learn, pandas, openmpi, Python, others…
• CPU: mkl
• GPU: cudnn, cublas, nccl, CUDA toolkit
• + Your training scripts
• Container runtime
• NVIDIA drivers
• Host OS

Container registry
• The development system pushes the image; the training cluster pulls it

Training cluster
• TensorFlow container image: TensorFlow, Keras, horovod, numpy, scipy, scikit-learn, pandas, openmpi, Python, others…
• CPU: mkl
• GPU: cudnn, cublas, nccl, CUDA toolkit
• + Your training scripts
• Container runtime
• NVIDIA drivers
• Host OS
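The push and pull steps can be sketched as the Docker commands each side would run. This is a hedged illustration, not the talk's demo code: the account ID, repository name, and helper functions are hypothetical, and registry authentication (a `docker login` step) is left out. The commands are built as argument lists that could be handed to `subprocess.run`.

```python
from typing import List


def ecr_image_uri(account_id: str, region: str, repository: str, tag: str) -> str:
    """Format an Amazon ECR image URI for the given repository and tag."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def build_and_push_commands(image_uri: str, context_dir: str = ".") -> List[List[str]]:
    """Commands run on the development system: build the image, then push it."""
    return [
        ["docker", "build", "-t", image_uri, context_dir],
        ["docker", "push", image_uri],
    ]


def pull_command(image_uri: str) -> List[str]:
    """Command run on the training cluster: pull the identical image."""
    return ["docker", "pull", image_uri]


if __name__ == "__main__":
    # Hypothetical account ID and repository name, for illustration only.
    uri = ecr_image_uri("123456789012", "us-west-2", "my-training", "latest")
    for cmd in build_and_push_commands(uri) + [pull_command(uri)]:
        print(" ".join(cmd))
```

Because both systems refer to the same image URI, the training cluster runs exactly the environment that was built and tested on the development system.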
AWS Deep Learning Containers
https://docs.aws.amazon.com/dlami/latest/devguide/deep-learning-containers-images.html
Code and
dependencies
1
DEMO 1: Containers for deep learning workflows
Diagram (AWS Cloud):
• Amazon ECR: deep learning container images (AWS DL containers)
• EC2 instance with GPUs, accessed from the CLI
• Amazon EBS: datasets and checkpoints
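A single-instance run like the one in this demo might be launched along the following lines. This is a sketch under stated assumptions rather than the demo's actual command: it assumes Docker 19.03 or later with the NVIDIA container toolkit (for the `--gpus` flag), and the image URI, host paths, and script name are hypothetical placeholders. The two mounts stand in for dataset and checkpoint directories on an attached volume.

```python
from typing import List


def training_run_command(
    image_uri: str,
    data_dir: str,
    checkpoint_dir: str,
    script: str,
    gpus: str = "all",
) -> List[str]:
    """Build a `docker run` argv that exposes GPUs and mounts host directories."""
    return [
        "docker", "run", "--rm",
        "--gpus", gpus,                           # expose host GPUs to the container
        "-v", f"{data_dir}:/data:ro",             # datasets, mounted read-only
        "-v", f"{checkpoint_dir}:/checkpoints",   # checkpoints persist on the host
        image_uri,
        "python", script,
    ]


if __name__ == "__main__":
    cmd = training_run_command(
        "example-registry/my-training:latest",  # hypothetical image
        "/mnt/ebs/datasets",
        "/mnt/ebs/checkpoints",
        "train.py",
    )
    print(" ".join(cmd))
```

Keeping datasets and checkpoints outside the container means the container itself stays disposable: it can be removed after each run without losing training state.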
Challenges with scaling deep learning
CLI, Cluster
• Code and dependencies
• Infrastructure management
• Cluster management
ML infrastructure and cluster management
Image registry: container image repository
• Amazon Elastic Container Registry (Amazon ECR)

Compute: where the containers run
• Amazon EC2

Management: deployment, scheduling, scaling, and management of containerized applications
• Amazon Elastic Kubernetes Service (Amazon EKS)
• Amazon Elastic Container Service (Amazon ECS)

ML services: fully-managed service that covers the entire machine learning workflow
• Amazon SageMaker: Jupyter notebook instances, high-performance algorithms, large-scale training, optimization, one-click deployment, fully managed with auto-scaling
DEMO 2: Submitting training jobs to Amazon Elastic Kubernetes Service (Amazon EKS)
Approach:
1. Provision a Kubernetes cluster
Diagram: Code files, Custom container, Container registry, Amazon EKS cluster
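Create a Kubernetes cluster
Tabs: Create cluster | Submit a training job
CLI:
# Create a 4-node GPU cluster of p3.8xlarge instances in us-west-2.
# Replace <public-key> with your SSH public key.
eksctl create cluster \
  --name eks-gpu \
  --version 1.13 \
  --region us-west-2 \
  --nodegroup-name gpu-nodes \
  --node-type p3.8xlarge \
  --nodes 4 \
  --timeout=40m \
  --ssh-access \
  --ssh-public-key=<public-key> \
  --auto-kubeconfig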
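Once the cluster exists, a training job can be described as a Kubernetes Job that requests GPUs. The sketch below generates such a manifest as JSON, which `kubectl apply -f` accepts alongside YAML. It is an illustration under assumptions, not the demo's manifest: the job name, image, and command are hypothetical, it covers a single-node job only, and it assumes the NVIDIA device plugin is running on the cluster so that the `nvidia.com/gpu` resource is available. Multi-node distributed training usually needs an additional operator, such as those provided by Kubeflow.

```python
import json
from typing import Any, Dict, List


def training_job_manifest(name: str, image: str, gpus: int, command: List[str]) -> Dict[str, Any]:
    """Build a batch/v1 Job that runs one training container with GPU limits."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed training run automatically
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "training",
                            "image": image,
                            "command": command,
                            "resources": {"limits": {"nvidia.com/gpu": gpus}},
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = training_job_manifest(
        name="tf-training",                           # hypothetical job name
        image="example-registry/my-training:latest",  # hypothetical image
        gpus=4,                                       # a p3.8xlarge has 4 GPUs
        command=["python", "train.py"],
    )
    print(json.dumps(manifest, indent=2))
```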
Learn more: Amazon EKS, Kubeflow and Katib
• Amazon Elastic Kubernetes Service (Amazon EKS): aws.amazon.com/eks/
• Kubeflow, machine learning workflows on Kubernetes: kubeflow.org/docs/aws/
• Katib: hyperparameter tuning and neural architecture search
DEMO 3: Hyperparameter search experiment using Amazon SageMaker
Approach:
Diagram: Code files, Docker build, Custom container, Container registry, Amazon S3, SageMaker SDK, Fully-managed SageMaker cluster
Webinar: Machine Learning with Containers and Amazon SageMaker
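The idea behind a hyperparameter search experiment can be shown without any cloud service. The sketch below is plain Python, not the SageMaker SDK: it runs a random search in which several trials execute in parallel, with a toy objective standing in for a real training run. The search space, the objective, and all function names are assumptions made for illustration.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Tuple

Config = Dict[str, float]


def sample_config(rng: random.Random) -> Config:
    """Draw one configuration: log-uniform learning rate, categorical batch size."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),
        "batch_size": rng.choice([32, 64, 128, 256]),
    }


def toy_objective(config: Config) -> float:
    """Stand-in for a training run; returns a validation loss to minimize."""
    lr_term = (math.log10(config["learning_rate"]) + 2.5) ** 2
    batch_term = abs(config["batch_size"] - 128) / 128
    return lr_term + batch_term


def random_search(trials: int, parallel_jobs: int, seed: int = 0) -> Tuple[Config, float]:
    """Evaluate `trials` sampled configurations, `parallel_jobs` at a time."""
    rng = random.Random(seed)
    configs = [sample_config(rng) for _ in range(trials)]
    with ThreadPoolExecutor(max_workers=parallel_jobs) as pool:
        losses = list(pool.map(toy_objective, configs))  # order is preserved
    best = min(range(trials), key=losses.__getitem__)
    return configs[best], losses[best]


if __name__ == "__main__":
    config, loss = random_search(trials=20, parallel_jobs=4)
    print(config, loss)
```

In a managed setting each trial would be a containerized training job on its own instance rather than a thread, but the structure is the same: sample configurations, run them in parallel, and keep the best result.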
Takeaways
• Containers let you build l
• Leverage services such as Amazon
SageMaker and Kubernetes + Kubeflow
to manage large-scale ML workloads.
• Choose fully-managed or self-managed
based on needs
Code and
dependencies
Infrastructure
management
Cluster
management
Resources
• Documentation: docs.aws.amazon.com/sagemaker/latest/dg/whatis.html
• Examples on GitHub: github.com/awslabs/amazon-sagemaker-examples
• AWS ML Blog: aws.amazon.com/blogs/machine-learning/category/artificial-intelligence/
• AWS Deep Learning Containers: docs.aws.amazon.com/dlami/latest/devguide/deep-learning-containers-images.html
• Webinar: Machine Learning with Containers and Amazon SageMaker
Thank you!
Shashank Prasanna,
Sr. Technical Evangelist, AI/ML
Questions? Happy to help:
Twitter: @shshnkp
LinkedIn: linkedin.com/in/shashankprasanna
Demo code and configuration scripts:
https://github.com/shashankprasanna/using-containers-for-dl
