When HPC Meets ML/DL: Manage HPC Data Center with Kubernetes
Yong Feng (yongfeng@ca.ibm.com)
IBM Systems
Please Note:
• IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice
and at IBM’s sole discretion.
• Information regarding potential future products is intended to outline our general product direction and it
should not be relied on in making a purchasing decision.
• The information mentioned regarding potential future products is not a commitment, promise, or legal
obligation to deliver any material, code or functionality. Information about potential future products may not be
incorporated into any contract.
• The development, release, and timing of any future features or functionality described for our products
remains at our sole discretion.
• Performance is based on measurements and projections using standard IBM benchmarks in a controlled
environment. The actual throughput or performance that any user will experience will vary depending upon
many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the
I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be
given that an individual user will achieve results similar to those stated here.
Who am I?
Senior Architect of IBM Spectrum (formerly Platform Computing)
• Have worked on resource managers and workload schedulers for 12+ years since Ph.D.
• Led teams on open source development across OpenStack, YARN, Mesos, Kubernetes, Spark, etc.
• Lead the team on core platform development of IBM Cloud Private
Agenda
• What does ML/DL mean for HPC?
• What does Container/Docker mean for HPC?
• Kubernetes Basics
• Run MPI jobs on Kubernetes
• Run ML/DL Pipeline on Kubernetes
• Gaps of Kubernetes for HPC Data Center
• What about Now?
What ML/DL Means for HPC
ML/DL is HPC’s 1st Consumer Killer App?
• New business challenges, especially Big Data, bring new topics: HPDA, AI, and IoT.
• Algorithm scientists have to keep optimizing their code with new technology.
• ML/DL solves business problems across many domains.
• New hardware technology makes ML/DL possible.
HPC Solution Workflow
[Diagram: Scheduler; Compute Resources & Network; Simulation, Visualization, Analytics, and Machine Learning applications linked by data exchange; Remote Users]
• Scheduler controls job start and placement
• Applications exchange data as needed
  • Producers
  • Consumers
  • Both
• Remote users receive/provide feedback
Infrastructure and Software Challenge
• Common HPC requirements
  • Hardware: high-IOPS storage, low-latency networks, powerful CPUs, large memory, etc.
  • Software: parallel accelerators, job scheduler
• GPUs become critical
• Various frameworks beyond batch jobs, such as in-memory databases and long-running services
• MPI is still important
• Development pipeline
• Containers do matter
What Container/Docker Means for HPC
Values
• Portability to resolve complexity
• Scalability to fit the nature of distributed/parallel computing
• Developer friendly, with a pipeline of develop, build, distribute, and deploy
• Improved resource utilization
• Less overhead
• Network and resource isolation
• Supported by existing HPC job schedulers
Challenge
• Old Linux kernels
• Support for infrastructure devices/software: IB, parallel FS, GPU, FPGA, etc.
• Security
• Limited HPC-specific optimization
• Image control
• Troubleshooting
From: https://www.hpcwire.com/2017/05/04/singularity-hpc-container-technology-moves-lab/
From: http://www.hpctoday.com/viewpoints/containers-meet-hpc/
Kubernetes
Kubernetes Features
• Intelligent scheduling
• Self-healing
• Horizontal scaling
• Service discovery & load balancing
• Automated rollouts & rollbacks
• Management of secrets & configuration
• Storage orchestration
• Batch execution
Kubernetes Concepts
• Pod: a group of co-located containers.
• Service: defines a set of pods and a means by which to access them, such as a single stable IP address and corresponding DNS name.
• Volume: a directory, possibly with some data in it, which is accessible to a container as part of its filesystem.
• Label: a key/value pair that is attached to a resource, such as a pod, to convey a user-defined identifying attribute.
• ReplicaSet: ensures that a specified number of pod replicas are running at any one time.
• StatefulSet: a controller that provides a unique identity to its pods. It provides guarantees about the ordering of deployment and scaling.
• Job: creates one or more pods and ensures that a specified number of them successfully terminate.
• Secret: an object that contains a small amount of sensitive data. Such information might otherwise be put in a pod specification or in an image.
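The way labels tie a Service to its pods can be illustrated with a small sketch. It models only equality-based selection (every key/value pair in the selector must appear in the pod's labels); the pod names and labels are hypothetical and nothing here talks to a real cluster.

```python
from typing import Dict, List

Labels = Dict[str, str]


def matches(selector: Labels, labels: Labels) -> bool:
    """True when every key/value pair in the selector is present in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def select_pods(selector: Labels, pods: List[dict]) -> List[str]:
    """Return the names of the pods whose labels satisfy the selector."""
    return [pod["name"] for pod in pods if matches(selector, pod["labels"])]


# Hypothetical pods, for illustration only.
PODS = [
    {"name": "serving-0", "labels": {"app": "serving", "tier": "inference"}},
    {"name": "serving-1", "labels": {"app": "serving", "tier": "inference"}},
    {"name": "train-worker-0", "labels": {"app": "training", "role": "worker"}},
]

if __name__ == "__main__":
    print(select_pods({"app": "serving"}, PODS))
    print(select_pods({"app": "training", "role": "worker"}, PODS))
```

A Service with the selector `app: serving` would route to both serving pods here, which is how a stable address can front a changing set of replicas.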
Kubernetes Architecture
Getting Started
Manage GPU Resources
• Auto-discovery of GPU resources
• GPU scheduling
• Monitoring of GPU resource utilization
• GPU driver injection
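As a rough illustration of GPU scheduling, the sketch below builds a pod manifest that asks the scheduler for GPUs. It assumes the cluster exposes GPUs through the NVIDIA device plugin under the extended resource name `nvidia.com/gpu`; the deck does not say which mechanism was used, and the pod name and image are hypothetical.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int) -> dict:
    """Build a pod manifest whose single container requests `gpus` GPUs."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Assumption: GPUs are advertised as nvidia.com/gpu.
                    # Extended resources are requested through limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical name and image, for illustration only.
    manifest = gpu_pod_manifest("dl-train", "example.com/dl-train:latest", 2)
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and submitted with `kubectl apply -f`; the scheduler would then place the pod only on a node with two unallocated GPUs.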
Run MPI in Kubernetes
• Docker image of the MPI runtime environment
• Kubernetes batch Job to manage the MPI job lifecycle
• Kubernetes Secret for password-less SSH access among workers
• Bootstrap to integrate with MPI Process Lifecycle Management (PLM)
• Kubernetes platform to work with other services and resources
• Kubernetes as a general data center platform
[Diagram: three Job pods, each started by a bootstrap; one runs mpirun and two run sshd; kube-api]
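One piece of what such a bootstrap might do in the launcher pod can be sketched as follows: once the worker pods running sshd are known, write a hostfile and assemble the mpirun command line. This assumes Open MPI (its `hostname slots=N` hostfile format and its `--hostfile` and `-np` options); the worker addresses, file path, and program name are hypothetical, and how the bootstrap discovers workers through kube-api is not shown.

```python
from typing import List


def build_hostfile(hosts: List[str], slots_per_host: int) -> str:
    """Render an Open MPI style hostfile, one 'host slots=N' line per worker."""
    if slots_per_host < 1:
        raise ValueError("slots_per_host must be at least 1")
    return "".join(f"{host} slots={slots_per_host}\n" for host in hosts)


def mpirun_command(hostfile_path: str, total_ranks: int, program: List[str]) -> List[str]:
    """Assemble the argument vector for launching `program` on all ranks."""
    return ["mpirun", "--hostfile", hostfile_path, "-np", str(total_ranks), *program]


if __name__ == "__main__":
    # Hypothetical worker pod addresses, for illustration only.
    workers = ["10.0.0.11", "10.0.0.12"]
    slots = 4
    print(build_hostfile(workers, slots), end="")
    print(" ".join(mpirun_command("/tmp/hostfile", len(workers) * slots, ["./mpi_app"])))
```

In a real bootstrap the argument vector would be handed to something like `subprocess.run`, and mpirun would reach the workers over SSH using the key distributed through the Kubernetes Secret.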
Run TensorFlow Pipeline in Kubernetes
• Docker image of the TensorFlow runtime environment
• Kubernetes batch Job to manage the TensorFlow training job lifecycle
• Kubernetes Volume to share the data
• Kubernetes Deployment/Service to provide the TensorFlow serving service
• Kubernetes platform to work with other services and resources
• Kubernetes as a general data center platform
[Diagram: training Job (two ps tasks, three worker tasks); Volume (input, log, model); dashboard (Deployment/Service); serving (Deployment/Service, two replicas); test (Job)]
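Each ps and worker task in a distributed TensorFlow training job needs to know the whole cluster and its own role. One common way to pass this is the `TF_CONFIG` environment variable; the sketch below generates its JSON value for each task. The deck does not say how the tasks were configured, so this is an assumption, and the host names and port are hypothetical.

```python
import json
from typing import Dict, List


def tf_config(cluster: Dict[str, List[str]], task_type: str, index: int) -> str:
    """Return the TF_CONFIG JSON string for one task of the cluster."""
    if task_type not in cluster:
        raise ValueError(f"unknown task type: {task_type}")
    if not 0 <= index < len(cluster[task_type]):
        raise ValueError(f"index {index} out of range for {task_type}")
    return json.dumps({"cluster": cluster, "task": {"type": task_type, "index": index}})


# Hypothetical addresses matching the diagram: two ps tasks, three worker tasks.
CLUSTER = {
    "ps": ["ps-0:2222", "ps-1:2222"],
    "worker": ["worker-0:2222", "worker-1:2222", "worker-2:2222"],
}

if __name__ == "__main__":
    for task_type, hosts in CLUSTER.items():
        for index in range(len(hosts)):
            print(task_type, index, tf_config(CLUSTER, task_type, index))
```

Each generated string would be set as the `TF_CONFIG` environment variable of the corresponding container in the training Job's pod template.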
Extend the Pipeline to Iterative Development
• Kubernetes Deployment/Service for rolling upgrades
• Integration with CI/CD utilities
[Diagram: training Job (two ps tasks, three worker tasks); Volume (input, log, model); dashboard (Deployment/Service); serving (Deployment/Service, two replicas); test (Job); a new algorithm delivered as a new image]
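The rolling upgrade step can be sketched as a Deployment manifest with a `RollingUpdate` strategy plus a helper that swaps in the new image, which is roughly what a CI/CD job would do after building it. The Deployment name, labels, image tags, and the `maxSurge`/`maxUnavailable` values are assumptions chosen for illustration.

```python
import copy
import json


def serving_deployment(image: str, replicas: int = 2) -> dict:
    """Build a Deployment for the serving tier that upgrades one pod at a time."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "serving"},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": "serving"}},
            "strategy": {
                "type": "RollingUpdate",
                # Start one new pod before removing an old one.
                "rollingUpdate": {"maxSurge": 1, "maxUnavailable": 0},
            },
            "template": {
                "metadata": {"labels": {"app": "serving"}},
                "spec": {"containers": [{"name": "serving", "image": image}]},
            },
        },
    }


def with_new_image(deployment: dict, image: str) -> dict:
    """Return a copy of the Deployment whose container uses the new image."""
    updated = copy.deepcopy(deployment)
    updated["spec"]["template"]["spec"]["containers"][0]["image"] = image
    return updated


if __name__ == "__main__":
    # Hypothetical image tags, for illustration only.
    current = serving_deployment("example.com/serving:v1")
    upgraded = with_new_image(current, "example.com/serving:v2")
    print(json.dumps(upgraded["spec"]["template"], indent=2))
```

Applying the upgraded manifest changes the pod template, which is what triggers Kubernetes to replace the serving pods gradually rather than all at once.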
Gaps
Gaps of Kubernetes for HPC
• Missing job scheduling features
  • Job group: ps task and worker task
  • Job queue: priority, fair-sharing, pre-emption, etc.
  • MPI: gang scheduling, PLM integration, placement policy
  • Advance reservation
• Missing container support features
  • MPI optimization: optimization based on placement topology, shared IPC, NUMA/CPU binding, job recovery
• Missing security features
  • Image control
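To make the gang scheduling gap concrete: an MPI job is only useful if all of its ranks start together, so placement has to be all-or-nothing. The toy sketch below shows that rule on a dictionary of free slots per node. It is an illustration of the idea only, not how Kubernetes or any HPC scheduler implements it, and the node names are hypothetical.

```python
from typing import Dict, Optional


def gang_place(free_slots: Dict[str, int], tasks: int) -> Optional[Dict[str, int]]:
    """Place all `tasks` at once, or return None if they do not all fit.

    The result maps node name to the number of tasks placed on that node.
    """
    if tasks <= 0:
        raise ValueError("tasks must be positive")
    if sum(free_slots.values()) < tasks:
        return None  # Partial placement is never returned.

    placement: Dict[str, int] = {}
    remaining = tasks
    # Fill the emptiest nodes first so the job spans as few nodes as possible.
    for node, free in sorted(free_slots.items(), key=lambda item: (-item[1], item[0])):
        if remaining == 0:
            break
        take = min(free, remaining)
        if take > 0:
            placement[node] = take
            remaining -= take
    return placement


if __name__ == "__main__":
    nodes = {"node-a": 4, "node-b": 2}
    print(gang_place(nodes, 5))
    print(gang_place(nodes, 7))
```

A scheduler that places pods one at a time can start some ranks of a job and leave the rest pending, holding resources without making progress; the all-or-nothing check avoids that situation.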
Planned Project in Community
• Job queue (#36716)
  • Introduce a job queue concept and related resource sharing policy
What about Now?
Kubernetes + HPC Job Scheduler
• Run an HPC job scheduler as the workload manager on Kubernetes
  • IBM Spectrum LSF
  • Univa
Q&A

Más contenido relacionado

La actualidad más candente

Introduction to ibm cloud paks concept license and minimum config public
Introduction to ibm cloud paks concept license and minimum config publicIntroduction to ibm cloud paks concept license and minimum config public
Introduction to ibm cloud paks concept license and minimum config publicPetchpaitoon Krungwong
 
Introduction to KubeDirector - SF Kubernetes Meetup
Introduction to KubeDirector - SF Kubernetes MeetupIntroduction to KubeDirector - SF Kubernetes Meetup
Introduction to KubeDirector - SF Kubernetes MeetupBlueData, Inc.
 
Kubernetes Basics - ICP Workshop Batch II
Kubernetes Basics - ICP Workshop Batch IIKubernetes Basics - ICP Workshop Batch II
Kubernetes Basics - ICP Workshop Batch IIPT Datacomm Diangraha
 
Pivotal Greenplum in Action on AWS, Azure, and GCP - Greenplum Summit 2018
Pivotal Greenplum in Action on AWS, Azure, and GCP - Greenplum Summit 2018Pivotal Greenplum in Action on AWS, Azure, and GCP - Greenplum Summit 2018
Pivotal Greenplum in Action on AWS, Azure, and GCP - Greenplum Summit 2018VMware Tanzu
 
Unclouding Container Challenges
 Unclouding  Container Challenges Unclouding  Container Challenges
Unclouding Container ChallengesRakuten Group, Inc.
 
Ai pipelines powered by jupyter notebooks
Ai pipelines powered by jupyter notebooksAi pipelines powered by jupyter notebooks
Ai pipelines powered by jupyter notebooksLuciano Resende
 
Regarding Clouds, Mainframes, and Desktops … and Linux
Regarding Clouds, Mainframes, and Desktops … and LinuxRegarding Clouds, Mainframes, and Desktops … and Linux
Regarding Clouds, Mainframes, and Desktops … and LinuxRobert Sutor
 
DCEU 18: Edge Computing with Docker Enterprise
DCEU 18: Edge Computing with Docker EnterpriseDCEU 18: Edge Computing with Docker Enterprise
DCEU 18: Edge Computing with Docker EnterpriseDocker, Inc.
 
Kubernetes and Cloud Native Update Q4 2018
Kubernetes and Cloud Native Update Q4 2018Kubernetes and Cloud Native Update Q4 2018
Kubernetes and Cloud Native Update Q4 2018CloudOps2005
 
Cloud Native PostgreSQL
Cloud Native PostgreSQLCloud Native PostgreSQL
Cloud Native PostgreSQLEDB
 
Delivering Cloud Native Batch Solutions - Dodd Pfeffer
Delivering Cloud Native Batch Solutions - Dodd PfefferDelivering Cloud Native Batch Solutions - Dodd Pfeffer
Delivering Cloud Native Batch Solutions - Dodd PfefferVMware Tanzu
 
Jupyter Enterprise Gateway Overview
Jupyter Enterprise Gateway OverviewJupyter Enterprise Gateway Overview
Jupyter Enterprise Gateway OverviewLuciano Resende
 
Strata - Scaling Jupyter with Jupyter Enterprise Gateway
Strata - Scaling Jupyter with Jupyter Enterprise GatewayStrata - Scaling Jupyter with Jupyter Enterprise Gateway
Strata - Scaling Jupyter with Jupyter Enterprise GatewayLuciano Resende
 
MongoDB, Cloudformation and Chef
MongoDB, Cloudformation and ChefMongoDB, Cloudformation and Chef
MongoDB, Cloudformation and ChefMongoDB
 
DevNexus 2015: Kubernetes & Container Engine
DevNexus 2015: Kubernetes & Container EngineDevNexus 2015: Kubernetes & Container Engine
DevNexus 2015: Kubernetes & Container EngineKit Merker
 
Micro services vs hadoop
Micro services vs hadoopMicro services vs hadoop
Micro services vs hadoopGergely Devenyi
 
Hadoop {Submarine} Project: Running Deep Learning Workloads on YARN
Hadoop {Submarine} Project: Running Deep Learning Workloads on YARNHadoop {Submarine} Project: Running Deep Learning Workloads on YARN
Hadoop {Submarine} Project: Running Deep Learning Workloads on YARNDataWorks Summit
 

La actualidad más candente (20)

Introduction to ibm cloud paks concept license and minimum config public
Introduction to ibm cloud paks concept license and minimum config publicIntroduction to ibm cloud paks concept license and minimum config public
Introduction to ibm cloud paks concept license and minimum config public
 
Introduction to KubeDirector - SF Kubernetes Meetup
Introduction to KubeDirector - SF Kubernetes MeetupIntroduction to KubeDirector - SF Kubernetes Meetup
Introduction to KubeDirector - SF Kubernetes Meetup
 
Kubernetes Basics - ICP Workshop Batch II
Kubernetes Basics - ICP Workshop Batch IIKubernetes Basics - ICP Workshop Batch II
Kubernetes Basics - ICP Workshop Batch II
 
Pivotal Greenplum in Action on AWS, Azure, and GCP - Greenplum Summit 2018
Pivotal Greenplum in Action on AWS, Azure, and GCP - Greenplum Summit 2018Pivotal Greenplum in Action on AWS, Azure, and GCP - Greenplum Summit 2018
Pivotal Greenplum in Action on AWS, Azure, and GCP - Greenplum Summit 2018
 
Unclouding Container Challenges
 Unclouding  Container Challenges Unclouding  Container Challenges
Unclouding Container Challenges
 
Ai pipelines powered by jupyter notebooks
Ai pipelines powered by jupyter notebooksAi pipelines powered by jupyter notebooks
Ai pipelines powered by jupyter notebooks
 
Regarding Clouds, Mainframes, and Desktops … and Linux
Regarding Clouds, Mainframes, and Desktops … and LinuxRegarding Clouds, Mainframes, and Desktops … and Linux
Regarding Clouds, Mainframes, and Desktops … and Linux
 
Big data and Kubernetes
Big data and KubernetesBig data and Kubernetes
Big data and Kubernetes
 
DCEU 18: Edge Computing with Docker Enterprise
DCEU 18: Edge Computing with Docker EnterpriseDCEU 18: Edge Computing with Docker Enterprise
DCEU 18: Edge Computing with Docker Enterprise
 
Kubernetes and Cloud Native Update Q4 2018
Kubernetes and Cloud Native Update Q4 2018Kubernetes and Cloud Native Update Q4 2018
Kubernetes and Cloud Native Update Q4 2018
 
Cloud Native PostgreSQL
Cloud Native PostgreSQLCloud Native PostgreSQL
Cloud Native PostgreSQL
 
Watson on bluemix
Watson on bluemixWatson on bluemix
Watson on bluemix
 
Delivering Cloud Native Batch Solutions - Dodd Pfeffer
Delivering Cloud Native Batch Solutions - Dodd PfefferDelivering Cloud Native Batch Solutions - Dodd Pfeffer
Delivering Cloud Native Batch Solutions - Dodd Pfeffer
 
Jupyter Enterprise Gateway Overview
Jupyter Enterprise Gateway OverviewJupyter Enterprise Gateway Overview
Jupyter Enterprise Gateway Overview
 
Strata - Scaling Jupyter with Jupyter Enterprise Gateway
Strata - Scaling Jupyter with Jupyter Enterprise GatewayStrata - Scaling Jupyter with Jupyter Enterprise Gateway
Strata - Scaling Jupyter with Jupyter Enterprise Gateway
 
MongoDB, Cloudformation and Chef
MongoDB, Cloudformation and ChefMongoDB, Cloudformation and Chef
MongoDB, Cloudformation and Chef
 
DevNexus 2015: Kubernetes & Container Engine
DevNexus 2015: Kubernetes & Container EngineDevNexus 2015: Kubernetes & Container Engine
DevNexus 2015: Kubernetes & Container Engine
 
Deeplearningusingcloudpakfordata
DeeplearningusingcloudpakfordataDeeplearningusingcloudpakfordata
Deeplearningusingcloudpakfordata
 
Micro services vs hadoop
Micro services vs hadoopMicro services vs hadoop
Micro services vs hadoop
 
Hadoop {Submarine} Project: Running Deep Learning Workloads on YARN
Hadoop {Submarine} Project: Running Deep Learning Workloads on YARNHadoop {Submarine} Project: Running Deep Learning Workloads on YARN
Hadoop {Submarine} Project: Running Deep Learning Workloads on YARN
 

Similar a When HPC meet ML/DL: Manage HPC Data Center with Kubernetes

Building a PaaS Platform like Bluemix on OpenStack
Building a PaaS Platform like Bluemix on OpenStackBuilding a PaaS Platform like Bluemix on OpenStack
Building a PaaS Platform like Bluemix on OpenStackAnimesh Singh
 
Containers as Infrastructure for New Gen Apps
Containers as Infrastructure for New Gen AppsContainers as Infrastructure for New Gen Apps
Containers as Infrastructure for New Gen AppsKhalid Ahmed
 
Deploying and Managing HPC Clusters with IBM Platform and Intel Xeon Phi Copr...
Deploying and Managing HPC Clusters with IBM Platform and Intel Xeon Phi Copr...Deploying and Managing HPC Clusters with IBM Platform and Intel Xeon Phi Copr...
Deploying and Managing HPC Clusters with IBM Platform and Intel Xeon Phi Copr...Intel IT Center
 
Build your own private Cloud environment
Build your own private Cloud environmentBuild your own private Cloud environment
Build your own private Cloud environmentNico Meisenzahl
 
DNUG46 - Build your own private Cloud environment
DNUG46 - Build your own private Cloud environmentDNUG46 - Build your own private Cloud environment
DNUG46 - Build your own private Cloud environmentpanagenda
 
Hpc lunch and learn
Hpc lunch and learnHpc lunch and learn
Hpc lunch and learnJohn D Almon
 
Enabling a hardware accelerated deep learning data science experience for Apa...
Enabling a hardware accelerated deep learning data science experience for Apa...Enabling a hardware accelerated deep learning data science experience for Apa...
Enabling a hardware accelerated deep learning data science experience for Apa...Indrajit Poddar
 
OpenPOWER Roadmap Toward CORAL
OpenPOWER Roadmap Toward CORALOpenPOWER Roadmap Toward CORAL
OpenPOWER Roadmap Toward CORALinside-BigData.com
 
Migrating Enterprise Microservices From Cloud Foundry to Kubernetes
Migrating Enterprise Microservices From Cloud Foundry to KubernetesMigrating Enterprise Microservices From Cloud Foundry to Kubernetes
Migrating Enterprise Microservices From Cloud Foundry to KubernetesTony Erwin
 
Lessons Learned from Deploying Apache Spark as a Service on IBM Power Systems...
Lessons Learned from Deploying Apache Spark as a Service on IBM Power Systems...Lessons Learned from Deploying Apache Spark as a Service on IBM Power Systems...
Lessons Learned from Deploying Apache Spark as a Service on IBM Power Systems...Indrajit Poddar
 
OIT552 Cloud Computing Material
OIT552 Cloud Computing MaterialOIT552 Cloud Computing Material
OIT552 Cloud Computing Materialpkaviya
 
Cloud computing-2 (1)
Cloud computing-2 (1)Cloud computing-2 (1)
Cloud computing-2 (1)JUDYFLAVIAB
 
The Download: Tech Talks by the HPCC Systems Community, Episode 11
The Download: Tech Talks by the HPCC Systems Community, Episode 11The Download: Tech Talks by the HPCC Systems Community, Episode 11
The Download: Tech Talks by the HPCC Systems Community, Episode 11HPCC Systems
 
Mainframe Application Testing both With and Without Live Data
Mainframe Application Testing both With and Without Live DataMainframe Application Testing both With and Without Live Data
Mainframe Application Testing both With and Without Live DataDevOps for Enterprise Systems
 
A Complete Guide Cloud Computing
A Complete Guide Cloud ComputingA Complete Guide Cloud Computing
A Complete Guide Cloud ComputingSripati Mahapatra
 
Software Defined Infrastructure
Software Defined InfrastructureSoftware Defined Infrastructure
Software Defined Infrastructureinside-BigData.com
 
Why kubernetes matters
Why kubernetes mattersWhy kubernetes matters
Why kubernetes mattersPlatform9
 
Introduction to Cloud Computing
Introduction to Cloud ComputingIntroduction to Cloud Computing
Introduction to Cloud ComputingBharat Kalia
 
Service-Level Objective for Serverless Applications
Service-Level Objective for Serverless ApplicationsService-Level Objective for Serverless Applications
Service-Level Objective for Serverless Applicationsalekn
 

Similar a When HPC meet ML/DL: Manage HPC Data Center with Kubernetes (20)

Building a PaaS Platform like Bluemix on OpenStack
Building a PaaS Platform like Bluemix on OpenStackBuilding a PaaS Platform like Bluemix on OpenStack
Building a PaaS Platform like Bluemix on OpenStack
 
Containers as Infrastructure for New Gen Apps
Containers as Infrastructure for New Gen AppsContainers as Infrastructure for New Gen Apps
Containers as Infrastructure for New Gen Apps
 
Deploying and Managing HPC Clusters with IBM Platform and Intel Xeon Phi Copr...
Deploying and Managing HPC Clusters with IBM Platform and Intel Xeon Phi Copr...Deploying and Managing HPC Clusters with IBM Platform and Intel Xeon Phi Copr...
Deploying and Managing HPC Clusters with IBM Platform and Intel Xeon Phi Copr...
 
Build your own private Cloud environment
Build your own private Cloud environmentBuild your own private Cloud environment
Build your own private Cloud environment
 
DNUG46 - Build your own private Cloud environment
DNUG46 - Build your own private Cloud environmentDNUG46 - Build your own private Cloud environment
DNUG46 - Build your own private Cloud environment
 
Migrating from ibm to hpe
Migrating from ibm to hpeMigrating from ibm to hpe
Migrating from ibm to hpe
 
Hpc lunch and learn
Hpc lunch and learnHpc lunch and learn
Hpc lunch and learn
 
Enabling a hardware accelerated deep learning data science experience for Apa...
Enabling a hardware accelerated deep learning data science experience for Apa...Enabling a hardware accelerated deep learning data science experience for Apa...
Enabling a hardware accelerated deep learning data science experience for Apa...
 
OpenPOWER Roadmap Toward CORAL
OpenPOWER Roadmap Toward CORALOpenPOWER Roadmap Toward CORAL
OpenPOWER Roadmap Toward CORAL
 
Migrating Enterprise Microservices From Cloud Foundry to Kubernetes
Migrating Enterprise Microservices From Cloud Foundry to KubernetesMigrating Enterprise Microservices From Cloud Foundry to Kubernetes
Migrating Enterprise Microservices From Cloud Foundry to Kubernetes
 
Lessons Learned from Deploying Apache Spark as a Service on IBM Power Systems...
Lessons Learned from Deploying Apache Spark as a Service on IBM Power Systems...Lessons Learned from Deploying Apache Spark as a Service on IBM Power Systems...
Lessons Learned from Deploying Apache Spark as a Service on IBM Power Systems...
 
OIT552 Cloud Computing Material
OIT552 Cloud Computing MaterialOIT552 Cloud Computing Material
OIT552 Cloud Computing Material
 
Cloud computing-2 (1)
Cloud computing-2 (1)Cloud computing-2 (1)
Cloud computing-2 (1)
 
The Download: Tech Talks by the HPCC Systems Community, Episode 11
The Download: Tech Talks by the HPCC Systems Community, Episode 11The Download: Tech Talks by the HPCC Systems Community, Episode 11
The Download: Tech Talks by the HPCC Systems Community, Episode 11
 
Mainframe Application Testing both With and Without Live Data
Mainframe Application Testing both With and Without Live DataMainframe Application Testing both With and Without Live Data
Mainframe Application Testing both With and Without Live Data
 
A Complete Guide Cloud Computing
A Complete Guide Cloud ComputingA Complete Guide Cloud Computing
A Complete Guide Cloud Computing
 
Software Defined Infrastructure
Software Defined InfrastructureSoftware Defined Infrastructure
Software Defined Infrastructure
 
Why kubernetes matters
Why kubernetes mattersWhy kubernetes matters
Why kubernetes matters
 
Introduction to Cloud Computing
Introduction to Cloud ComputingIntroduction to Cloud Computing
Introduction to Cloud Computing
 
Service-Level Objective for Serverless Applications
Service-Level Objective for Serverless ApplicationsService-Level Objective for Serverless Applications
Service-Level Objective for Serverless Applications
 

Último

2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge GraphsEleniIlkou
 
Trump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts SweatshirtTrump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts Sweatshirtrahman018755
 
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...gajnagarg
 
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查ydyuyu
 
20240508 QFM014 Elixir Reading List April 2024.pdf
20240508 QFM014 Elixir Reading List April 2024.pdf20240508 QFM014 Elixir Reading List April 2024.pdf
20240508 QFM014 Elixir Reading List April 2024.pdfMatthew Sinclair
 
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency""Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency"growthgrids
 
PowerDirector Explination Process...pptx
PowerDirector Explination Process...pptxPowerDirector Explination Process...pptx
PowerDirector Explination Process...pptxgalaxypingy
 
75539-Cyber Security Challenges PPT.pptx
75539-Cyber Security Challenges PPT.pptx75539-Cyber Security Challenges PPT.pptx
75539-Cyber Security Challenges PPT.pptxAsmae Rabhi
 
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查ydyuyu
 
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrStory Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrHenryBriggs2
 
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查ydyuyu
 
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girlsRussian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girlsMonica Sydney
 
Best SEO Services Company in Dallas | Best SEO Agency Dallas
Best SEO Services Company in Dallas | Best SEO Agency DallasBest SEO Services Company in Dallas | Best SEO Agency Dallas
Best SEO Services Company in Dallas | Best SEO Agency DallasDigicorns Technologies
 
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac RoomVip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Roommeghakumariji156
 
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdfMatthew Sinclair
 
Real Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirtReal Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirtrahman018755
 
Microsoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck MicrosoftMicrosoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck MicrosoftAanSulistiyo
 
Russian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
Russian Escort Abu Dhabi 0503464457 Abu DHabi EscortsRussian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
Russian Escort Abu Dhabi 0503464457 Abu DHabi EscortsMonica Sydney
 
Nagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime Nagercoil
Nagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime NagercoilNagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime Nagercoil
Nagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime Nagercoilmeghakumariji156
 
Meaning of On page SEO & its process in detail.
Meaning of On page SEO & its process in detail.Meaning of On page SEO & its process in detail.
Meaning of On page SEO & its process in detail.krishnachandrapal52
 

Último (20)

2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
 
Trump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts SweatshirtTrump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts Sweatshirt
 
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
 
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
 
20240508 QFM014 Elixir Reading List April 2024.pdf
20240508 QFM014 Elixir Reading List April 2024.pdf20240508 QFM014 Elixir Reading List April 2024.pdf
20240508 QFM014 Elixir Reading List April 2024.pdf
 
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency""Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
 
PowerDirector Explination Process...pptx
PowerDirector Explination Process...pptxPowerDirector Explination Process...pptx
PowerDirector Explination Process...pptx
 
75539-Cyber Security Challenges PPT.pptx
75539-Cyber Security Challenges PPT.pptx75539-Cyber Security Challenges PPT.pptx
75539-Cyber Security Challenges PPT.pptx
 
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
 
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrStory Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
 
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
 
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girlsRussian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
 
Best SEO Services Company in Dallas | Best SEO Agency Dallas
Best SEO Services Company in Dallas | Best SEO Agency DallasBest SEO Services Company in Dallas | Best SEO Agency Dallas
Best SEO Services Company in Dallas | Best SEO Agency Dallas
 
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac RoomVip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
 
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
 
Real Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirtReal Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirt
 
Microsoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck MicrosoftMicrosoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck Microsoft
 
Russian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
Russian Escort Abu Dhabi 0503464457 Abu DHabi EscortsRussian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
Russian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
 
Nagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime Nagercoil
Nagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime NagercoilNagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime Nagercoil
Nagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime Nagercoil
 
Meaning of On page SEO & its process in detail.
Meaning of On page SEO & its process in detail.Meaning of On page SEO & its process in detail.
Meaning of On page SEO & its process in detail.
 

When HPC meet ML/DL: Manage HPC Data Center with Kubernetes

  • 1. When HPC Meet ML/DL manage HPC Data Center with Kubernetes Yong Feng (yongfeng@ca.ibm.com)
  • 2. IBM Systems Please Note: • IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice and at IBM’s sole discretion. • Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. • The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. • The development, release, and timing of any future features or functionality described for our products remains at our sole discretion. • Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here. | 2
  • 3. Who am I?
      Senior Architect of IBM Spectrum (formerly Platform Computing)
      • Worked on resource managers and workload schedulers for 12+ years after Ph.D.
      • Led teams on open source development, from OpenStack, Yarn, Mesos, and Kubernetes to Spark, etc.
      • Led the team on core platform development of IBM Cloud Private
  • 4. Agenda
      • What does ML/DL mean for HPC?
      • What does Container/Docker mean for HPC?
      • Kubernetes Basic
      • Run MPI job on Kubernetes
      • Run ML/DL Pipeline on Kubernetes
      • Gaps of Kubernetes for HPC DataCenter
      • What about Now?
  • 6. ML/DL is HPC’s 1st Consumer Killer App?
      • New business challenges, especially Big Data, bring new topics: HPDA, AI, and IoT.
      • Algorithm scientists have to keep optimizing their code with new technology
      • ML/DL solves business problems across many domains
      • New hardware technology makes ML/DL possible.
  • 7. HPC Solution Workflow
      • Scheduler controls job start and placement
      • Applications exchange data as needed
          • Producers
          • Consumers
          • Both
      • Remote users receive/provide feedback
      [Diagram labels: Remote Users; Scheduler; Compute Resources & Network; Simulation, Visualization, Analytics, Machine Learning; data exchange]
  • 8. Infrastructure and Software Challenge
      • HPC common requirements
          • Hardware: high-IOPS storage, low-latency networks, powerful CPUs, large memory, etc.
          • Software: parallel accelerators, job scheduler
      • GPU becomes critical
      • Various frameworks, more than just jobs, such as in-memory databases, long-running services, etc.
      • MPI is still important
      • Development pipeline
      • Container does matter
  • 10. Values
      • Portability to resolve the complexity
      • Scalability to fit the nature of distributed/parallel computing
      • Developer-friendly, with a pipeline of develop, build, distribute, and deploy
      • Improve resource utilization
      • Less overhead
      • Network and resource isolation
      • Supported by existing HPC job schedulers
  • 11. Challenge
      • Old Linux kernel
      • Support for infrastructure devices/software: IB, parallel FS, GPU, FPGA, etc.
      • Security
      • Limits on HPC-specific optimization
      • Image control
      • Troubleshooting
      From: https://www.hpcwire.com/2017/05/04/singularity-hpc-container-technology-moves-lab/
      From: http://www.hpctoday.com/viewpoints/containers-meet-hpc/
  • 13. Kubernetes Features
      • Intelligent scheduling
      • Self-healing
      • Horizontal scaling
      • Service discovery & load balancing
      • Automated rollouts & rollbacks
      • Management of secrets & configuration
      • Storage orchestration
      • Batch execution
  • 14. Kubernetes Concepts
      • Pod: a group of co-located containers.
      • Service: defines a set of pods and a means by which to access them, such as a single stable IP address and corresponding DNS name.
      • Volume: a directory, possibly with some data in it, which is accessible to a container as part of its filesystem.
      • Label: a key/value pair that is attached to a resource, such as a pod, to convey a user-defined identifying attribute.
      • ReplicaSet: ensures that a specified number of pod replicas are running at any one time.
      • StatefulSet: a controller that provides a unique identity to its pods. It provides guarantees about the ordering of deployment and scaling.
      • Batch job: creates one or more pods and ensures that a specified number of them successfully terminate.
      • Secret: an object that contains a small amount of sensitive data. Such information might otherwise be put in a pod specification or in an image.
  • 17. Manage GPU Resources
      • Auto-discovery of GPU resources
      • GPU scheduling
      • Monitor GPU resource utilization
      • GPU driver injection
  • 18. Run MPI in Kubernetes
      • Docker image of MPI running environment
      • Kubernetes BatchJob to manage MPI job lifecycle
      • Kubernetes Secret for password-less ssh access among workers
      • Bootstrap to integrate with MPI Process Lifecycle Management (PLM)
      • Kubernetes platform to work with other services and resources
      • Kubernetes platform for general data center platform
      [Diagram labels: Job pod with (bootstrap) mpirun; Job pods with (bootstrap) sshd; kube-api]
  • 19. Run TensorFlow Pipeline in Kubernetes
      • Docker image of TensorFlow running environment
      • Kubernetes BatchJob to manage TensorFlow training job lifecycle
      • Kubernetes Volume to share the data
      • Kubernetes Deployment/Service to provide TensorFlow serving service
      • Kubernetes platform to work with other services and resources
      • Kubernetes platform for general data center platform
      [Diagram labels: ps task, worker task, Job; input, log, model, Volume; dashboard, Deployment/Service; serving, Deployment/Service; test, Job]
  • 20. Extend the Pipeline to Iterative Development
      • Kubernetes Deployment/Service for rolling upgrade
      • Integrate with CI/CD utilities
      [Diagram labels: ps task, worker task, Job; input, log, model, Volume; dashboard, Deployment/Service; serving, Deployment/Service; test, Job; new algorithm, new image]
  • 21. Gaps
  • 22. Gaps of Kubernetes for HPC
      • Lack of features for job scheduling
          • Job group: ps task and worker task
          • Job queue: priority, fair-sharing, pre-emption, etc.
          • MPI: gang-scheduling, PLM integration, placement policy
          • Advance reservation
      • Lack of features for container support
          • MPI optimization: optimization based on placement topology, shared IPC, NUMA/CPU binding, job recovery
      • Lack of features for security
          • Image control
  • 23. Planned Project in Community
      • Job queue: (#36716)
          • Introduce job queue concept and related resource sharing policy
  • 25. Kubernetes + HPC Job Scheduler
      • Run HPC Job Scheduler as workload manager on Kubernetes
          • IBM Spectrum LSF
          • Univa
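
Slide 22 lists gang-scheduling among the features an MPI job needs and Kubernetes lacks. The idea can be illustrated with a minimal Python sketch: either every task of the job is placed at once, or nothing is started. The function name `gang_schedule`, the node names, and the "most free slots first" packing rule are assumptions made for illustration; they are not part of Kubernetes or of any scheduler named in the slides.

```python
from typing import Dict, Optional


def gang_schedule(free_slots: Dict[str, int], tasks: int) -> Optional[Dict[str, int]]:
    """All-or-nothing placement for a parallel job (hypothetical helper).

    free_slots maps a node name to its number of free task slots.
    Returns a node -> task-count placement when the whole job fits,
    or None when it does not, so that no partial job is ever started.
    """
    if tasks <= 0:
        return {}
    if sum(free_slots.values()) < tasks:
        # Not enough capacity for every task: start nothing.
        return None

    placement: Dict[str, int] = {}
    remaining = tasks
    # Assumed policy: fill the nodes with the most free slots first so the
    # job spans as few nodes as possible; ties are broken by node name.
    for node, free in sorted(free_slots.items(), key=lambda kv: (-kv[1], kv[0])):
        if remaining == 0:
            break
        take = min(free, remaining)
        if take > 0:
            placement[node] = take
            remaining -= take
    return placement


if __name__ == "__main__":
    cluster = {"node-a": 2, "node-b": 4, "node-c": 1}
    print(gang_schedule(cluster, 5))  # fits: every task gets a slot
    print(gang_schedule(cluster, 8))  # does not fit: None, nothing starts
```

A per-pod scheduler would instead start whichever tasks fit and leave the rest pending, which can leave an MPI job holding resources while it waits for ranks that cannot start.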

Editor's notes

  1. HPDA = data-intensive computing using HPC. Domains: manufacturing, retail, life science, travel, finance, energy & utility.
  3. Applications are different, and each serves a purpose in computing an overall actionable solution to a problem. Not all applications need the same data, or any data at all, hence each application is classified as a data producer, a consumer, or both. Remote users can be located on an intranet or the Internet. There are a lot of point-to-point data transfers: every application needs to know whom it must send data to and whom it should receive data from. This is very cumbersome and potentially complicated if an application fails or a new application starts.
  4. GPU: http://www.intersect360.com/industry/downloadsummary.php?id=131
  5. Complexity. Dependencies: tools, compilers, libraries, etc. Software stack: academic software is difficult to install, configure, and deploy. Heterogeneous platform/architecture: laptop -> supercomputer, x86 -> Power. http://www.hpctoday.com/viewpoints/containers-meet-hpc/ https://www.nextplatform.com/2016/09/13/will-containers-total-package-hpc/
  6. Security: containers launched as root; access to bare metal, filesystems & device drivers. Infrastructure device: incompatibility of low-level kernel. Image control: vulnerabilities. Limit HPC-specific optimization: MPI local memory sharing, HDFS/GPFS data locality.
  7. https://kubernetes.io/ https://github.com/kubernetes/features/