www.univa.com
Ian Lumb
Solutions Architect
SUSE, Booth #1681
SC17, Denver, CO
Managing Containerized HPC and AI Workloads on TSUBAME3.0
Copyright © Univa Corporation, 2017. All Rights Reserved. Internal Use Only.
TSUBAME 3.0 - Compute Node Overview
A compute-node:
■ 256 GB DDR4 RAM
■ 2 TB SSDs
■ 2x 14 cores
■ 4x GPUs
■ 4x HFI (100 Gbps each)
⇒ This is what they call a “fat compute node”
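The node figures above imply 28 cores shared by 4 GPUs, or 7 cores per GPU. The sketch below works through that core budget. The per-task figure of 4 cores is an assumption taken from the `-binding one_socket_balanced:4` request in the qsub example of this deck, and "one task per GPU" is likewise an assumption made only for illustration.

```python
# Node figures from the slide; the per-task core count is an assumption
# borrowed from the "-binding one_socket_balanced:4" request in this deck.
SOCKETS = 2
CORES_PER_SOCKET = 14
GPUS_PER_NODE = 4
CORES_BOUND_PER_GPU_TASK = 4  # assumption: one task per GPU, 4 cores each


def core_budget(gpu_tasks: int) -> dict:
    """Return the core accounting for a node running `gpu_tasks` GPU tasks."""
    if not 0 <= gpu_tasks <= GPUS_PER_NODE:
        raise ValueError("a node has at most 4 GPUs")
    total = SOCKETS * CORES_PER_SOCKET
    used = gpu_tasks * CORES_BOUND_PER_GPU_TASK
    return {
        "total_cores": total,
        "cores_per_gpu": total // GPUS_PER_NODE,
        "cores_used": used,
        "cores_free": total - used,
    }


print(core_budget(4))
```

Under these assumptions, a node whose four GPUs are all busy still has 12 cores free for jobs that need neither a GPU nor an HFI.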
TSUBAME 3.0 - The Challenges
12.2 PetaFLOPS within only 20 racks, or 540 compute nodes
➢ It is the smallest >10 PFLOPS machine in the world
➢ Wasted or unreachable resources (parts of a node) have a much bigger impact on such a “small” cluster
➢ Performance also depends heavily on job placement because of additional resources such as GPUs and HFI devices (the closer, the better)
➢ It needs smart and flexible partitioning to ensure high utilization
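A rough way to see why unreachable resources hurt more on a small cluster is to compute what share of the machine one node represents. The sketch below uses the 12.2 PFLOPS and 540 node figures from the slide. It assumes peak performance is spread evenly across nodes, and the 5,400-node comparison cluster is purely hypothetical.

```python
PEAK_PFLOPS = 12.2  # from the slide
NODES = 540         # from the slide


def peak_per_node_tflops(peak_pflops: float = PEAK_PFLOPS, nodes: int = NODES) -> float:
    """Average peak per node in TFLOPS, assuming an even spread across nodes."""
    return peak_pflops * 1000.0 / nodes


def idle_loss_percent(idle_nodes: float, total_nodes: int) -> float:
    """Share of the machine (in percent) lost when `idle_nodes` nodes sit unused."""
    return 100.0 * idle_nodes / total_nodes


print(f"peak per node: {peak_per_node_tflops():.2f} TFLOPS")
print(f"one idle node of 540:  {idle_loss_percent(1, 540):.4f} % of the machine")
# Hypothetical larger cluster, for comparison only.
print(f"one idle node of 5400: {idle_loss_percent(1, 5400):.4f} % of the machine")
```

The same calculation applies to fractions of a node: passing `idle_nodes=0.25` models a single unreachable GPU on a four-GPU node.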
TSUBAME 3.0 - UGE Enhancements
▪ Core Bindings
  ▪ Enhanced PE support and strategies
▪ RSMAPs
  ▪ Enhanced PE support and chaining
▪ Docker
  ▪ Define unique but known container hostnames
  ▪ Configure InfiniBand device in the container
  ▪ Map all job users into the container
  ▪ Provide execution host and Docker container hostnames to the job
Putting it all together …
qsub -l docker,docker_images="*ubuntu:14.04*" \
     -l gpu=1,hfi=1,hosts=1 \
     -xd '--device=/dev/gpu${gpu(0)}:/dev/gpu,--device=/dev/hfi${hfi(0)}:/dev/hfi' \
     -xd '--hostname ${hosts(0)}' \
     -binding one_socket_balanced:4 \
     -pe rr 4 jobscript.sh
➢ No matter the host OS, the application gets whatever OS it needs (if they run their own Docker repository, the image can even be prepared however they need it).
➢ Each PE task gets 1 GPU and 1 HFI device (both with the same ID, i.e. in the same “location”) and a unique hostname.
➢ No matter which devices are granted, the application sees only /dev/gpu and /dev/hfi inside the container and can use them directly, without any performance penalty!
➢ Even if the RSMAP would occupy 7 cores per GPU, we only want 4 per PE task. This leaves room for other jobs that do not need a GPU or HFI. Also, we only go on one socket per host.
➢ The container gets a unique, known (!) hostname.
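The placeholders in the qsub command, such as `${gpu(0)}` and `${hfi(0)}`, stand for the resource IDs granted to each task. The sketch below is a toy model of that substitution, written only to illustrate the idea. It is not Univa Grid Engine's implementation, and the `expand` function, the `granted` dictionary, and the example IDs and hostname are all hypothetical.

```python
import re

# Matches placeholders of the form ${name(index)}, e.g. ${gpu(0)}.
PLACEHOLDER = re.compile(r"\$\{(\w+)\((\d+)\)\}")


def expand(template: str, granted: dict) -> str:
    """Replace each ${name(i)} with the i-th ID granted for resource `name`.

    `granted` maps a resource name to the list of IDs given to one task.
    This is an illustrative model, not the scheduler's actual logic.
    """
    def substitute(match):
        resource, index = match.group(1), int(match.group(2))
        return granted[resource][index]

    return PLACEHOLDER.sub(substitute, template)


device_opts = "--device=/dev/gpu${gpu(0)}:/dev/gpu,--device=/dev/hfi${hfi(0)}:/dev/hfi"
# Hypothetical grant: the task received GPU 2 and HFI 2 (same "location").
granted = {"gpu": ["2"], "hfi": ["2"], "hosts": ["container-a"]}

print(expand(device_opts, granted))
print(expand("--hostname ${hosts(0)}", granted))
```

Whatever IDs are granted on the host side, the right-hand side of each `--device` mapping stays fixed, which is why the application always finds its devices at /dev/gpu and /dev/hfi.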