SlideShare una empresa de Scribd logo
1 de 19
Descargar para leer sin conexión
National Aeronautics and Space Administration
www.nasa.gov
Suitability of Commercial Clouds for
NASA's HPC Applications
Robert Hood, InuTeq
NASA Advanced Supercomputing Division
NASA Ames Research Center
HPC User Forum
The HECC Project at NASA Ames
High End Computing Capability
•  Facilitating Science/Engineering at NASA
•  4 compute systems, including:
•  Pleiades: #24 in Top500; #14 in HPCG
•  Electra: #43 & #37 (both should go up)
•  1300+ users from across Agency Directorates
•  Aeronautics, Human Exploration, Science
•  ~900 MPI applications (+ non-MPI codes)
Goal: Maximize resource delivery to users
•  Strive for >80% utilization of resources
•  Embrace new facility technologies
•  Continually evaluate procurement options
NASA HQ: Is HECC doing the best it can?
Marketing Reps say
USE
Answering the Question: A Trade Study
Key Questions:
•  Would clouds be a cost-effective replacement for on-premises resources?
•  Under what conditions could cloud usage be cost effective?
Approach: Compare costs of running NASA workload on premises & in cloud
•  Evaluation Systems
•  Existing resources on premises at NASA InfiniBand interconnect
•  Amazon Web Services (AWS) Ethernet interconnect
•  Penguin-On-Demand (POD) InfiniBand, OmniPath interconnects
•  Benchmarks
•  NAS Parallel Benchmarks (NPBs): Classes C and D
•  6 full-sized applications:
•  Weather/Climate: FVCore, ECCO (really MITgcm), WRF/nuWRF
•  Science: Athena++, Enzo
•  CFD: OpenFoam representing NASA applications that are export controlled
and couldn’t be used for security reasons
Trade Study Approach, continued
Cost Basis:
•  Use worst-case cost estimate for on premises: the full cost
•  Compute cost of delivering a System Billing Unit (SBU) of work:
CostPerSBU = (Total HECC Budget) ➗ (number of SBUs delivered to Agency)
•  Cost of run is: (number of SBUs used in run) ✖ CostPerSBU
•  Note: includes all hardware, software, power, maintenance, staff, and facility costs
•  Use best-case cost estimate for cloud: the compute-only cost
•  AWS: use estimate of best spot price (30% of on-demand price)
•  POD: use published rates (discounts likely available, esp. for volume)
•  Note: does not include costs for storage, software licenses, bandwidth, HECC staff, etc.
Results and Analysis
•  Do a performance and cost analysis for jobs running on similar processor types
•  Produce findings and identify actions for HECC
Comparison of Compute Resources Used
Finding: The per-hour full cost of HECC resources is cheaper than the (compute-only) spot
price of similar resources at AWS and significantly cheaper than the (compute-only)
price of similar resources at POD.
NAS Parallel Benchmark Results
Performed runs from to 16–1024
MPI ranks to evaluate scaling
•  POD performance was similar to
HECC
•  AWS performance was worse
than HECC on communication-
intensive benchmarks
Cost to run the full set of the NPBs
•  AWS was 5.8–12 times more
expensive than HECC,
depending on the processor type
used
•  POD was 4.7 times more
expensive than equivalent runs at
HECC
Full-Sized Application Results
•  Large computations took substantially more time on AWS than on HECC
•  When cores/node are equal (BRO), POD performance is close to HECC’s, but still lags
•  AWS was 1.9x more expensive than HECC for the application workload, even with spot pricing
•  POD was 5.3x more expensive than HECC for the application workload
Finding: Tightly-coupled, multi-node applications from the NASA workload take somewhat more time
when run on cloud-based nodes connected with HPC-level interconnects; they take significantly
more time when run on cloud-based nodes that use conventional, Ethernet-based interconnects.
Main Finding
Finding: Commercial clouds do not offer a viable, cost-effective approach for
replacing in-house HPC resources at NASA
Why might others get different results?
•  Our applications have different computation & communication patterns from theirs?
•  Low HECC Costs
•  Equipment, power, cooling
•  Staff: size of system means lower staff costs per SBU than smaller centers
•  High HECC Utilization
•  Evolution of system:
•  Effectively same system for 10 years
•  User familiarity reduces ramp up costs to reach high utilization of new resources
•  Minimize interruptions for maintenance
•  Rolling upgrades, suspend/resume, …
So, are there cases in HECC where clouds might be cost effective?
Cost-Effective Uses of Cloud for NASA
When utilization would be low
•  New resource types, e.g. GPU-accelerated nodes
•  Use cloud-based resources until demand rises to the point where it is more cost
effective to acquire resources and move work on premises
When there are other costs to consider
•  Opportunity costs associated with high utilization, e.g. longer waits in the queue
•  What if we move 1-node jobs to the cloud? Would that speed up scheduling of
remaining jobs?
Real-time requirements
•  e.g web services
Finding: Commercial clouds provide a variety of resources not available at HECC. Certain use
cases may be cost-effective to run there.
Follow-up Work Identified
Get a better understanding of potential benefits and costs of having a portion of the
HECC workload in the cloud
•  Impact of 1-node jobs on scheduling of rest of workload
•  Understand performance characteristics of jobs that might run there
Define a comprehensive model that allows accurate comparisons of cost of running
jobs depending on resources used
Prepare for a broadening of services provided by HECC to include a portion of its
workload running on commercial cloud resources
•  For HECC users
•  For non-HECC users
Running 1-Node Jobs in the Cloud
Motivation
•  One-node jobs not as likely to see performance impact due to commodity interconnect
•  Does moving them to the cloud speed up scheduling of wider jobs?
Experiment
•  Simulate scheduling of actual job mix with 1-node jobs removed
•  Determine impact to starting time of remaining jobs
Results
•  While some jobs started much sooner, they

tended to be using relatively few nodes
•  Weighted average of improvement was ~2 hours
•  Not clear that benefit would be worth added

cost of using cloud resources
Integrating Cloud Usage into HECC
Pilot project to move jobs from HECC resources to AWS
•  HECC user logs into cloud front end to build executable
•  User annotates batch scripts, indicating files that need to be staged to/from cloud
•  User submits batch jobs to “cloud” queue
•  PBS server moves jobs to server running at AWS; stages input files
•  Server in cloud allocates resources, runs job, stages output files back
•  Accounting is done manually; HECC pays
•  Limited to non-export controlled codes and data (i.e. “low” security plan)
User-defined software stacks
•  Security concerns with Singularity (runs as root)
•  Charliecloud-based containers are unprivileged
•  Deployed Charliecloud on HECC; tested on AWS as well
•  Don’t yet have a solution for MPI programs running on multiple nodes
Future Work
Extend Pilot Project to Include:
•  Full accounting, with account limits and automated tracking of consumption
•  Moderate security plan
Extend Services to Include non-HECC Users
•  They bring their own funding
•  Provide user interface for defining and running jobs, moving data, etc.
•  HECC would add cost-recovery fee to make this self sustaining
Cost Methodology
•  Need to be able to do meaningful cost comparisons between on-premises and in-cloud in
order to determine which would be most cost effective
•  Include opportunity costs, too
•  Benchmark suite
•  Adjust processes for acquisition and phase out of resources to include commercial cloud
Acknowledgements
The Team:
S. Chang, R. Hood, H. Jin, S. Heistand, J. Chang, S. Cheung, J. Djomehri, G. Jost, D. Kokron
Helpful Advice from:
R. Biswas, B. Ciotti, L. Hogle, T. Lee, P. Mehrotra, and W. Thigpen
Resources:
•  The NASA High-End Computing Capability (HECC)
•  Penguin Computing made their cloud-based resources available to our evaluation at no cost
and provided early access to their Skylake-based nodes
The Report: NAS Technical Report NAS-2018-001 https://www.nas.nasa.gov/assets/pdf/
papers/NAS_Technical_Report_NAS-2018-01.pdf
?
Backup
Full-Sized Application Results: 

AWS vs HECC
•  Large computations took substantially more time on AWS than on HECC
•  AWS was 1.9x more expensive than HECC for the application workload
•  Even with optimistic assumptions about getting spot pricing
•  When cores/node are equal, POD performance is just a little bit worse than HECC’s
•  POD was 5.3 times more expensive than HECC for the application workload
Finding: Tightly-coupled, multi-node applications from the NASA workload take somewhat
more time when run on cloud-based nodes connected with HPC-level
interconnects; they take significantly more time when run on cloud-based nodes
that use conventional, Ethernet-based interconnects.
APP Case NCPUS
# of HECC
nodes
# of POD
nodes
HECC time
(sec)
Total HECC
full cost
POD time
(sec)
Total POD
compute
cost
ENZO (HAS) SBU2 196 9 10 1827 $2.44 2355 $11.78
ENZO (BRO) SBU2 196 7 7 1625 $2.04 1870 $10.18
ENZO (SKY) SBU2 196 5 5 1519 $2.15 1751 $10.70
WRF (HAS) SBU1 384 16 20 1243 $2.95 1802 $18.02
WRF (BRO) SBU1 384 14 14 1225 $3.08 1436 $15.64
WRF (SKY) SBU1 384 10 10 1069 $3.02 1352 $16.52
Total Cost $15.68 $82.84
Full-Sized Application Results: 

POD vs HECC

Más contenido relacionado

La actualidad más candente

Big Data and High Performance Computing Solutions in the AWS Cloud
Big Data and High Performance Computing Solutions in the AWS CloudBig Data and High Performance Computing Solutions in the AWS Cloud
Big Data and High Performance Computing Solutions in the AWS CloudAmazon Web Services
 
Optimization of Resource Provisioning Cost in Cloud Computing
Optimization of Resource Provisioning Cost in Cloud ComputingOptimization of Resource Provisioning Cost in Cloud Computing
Optimization of Resource Provisioning Cost in Cloud ComputingAswin Kalarickal
 
(CMP202) Engineering Simulation and Analysis in the Cloud
(CMP202) Engineering Simulation and Analysis in the Cloud(CMP202) Engineering Simulation and Analysis in the Cloud
(CMP202) Engineering Simulation and Analysis in the CloudAmazon Web Services
 
High Performance Computing (HPC) in cloud
High Performance Computing (HPC) in cloudHigh Performance Computing (HPC) in cloud
High Performance Computing (HPC) in cloudAccubits Technologies
 
2016 Utah Cloud Summit: TCO & Cost Optimization
2016 Utah Cloud Summit: TCO & Cost Optimization2016 Utah Cloud Summit: TCO & Cost Optimization
2016 Utah Cloud Summit: TCO & Cost Optimization1Strategy
 
NAVGEM on the Cloud: Computational Evaluation of Cloud HPC with a Global Atmo...
NAVGEM on the Cloud: Computational Evaluation of Cloud HPC with a Global Atmo...NAVGEM on the Cloud: Computational Evaluation of Cloud HPC with a Global Atmo...
NAVGEM on the Cloud: Computational Evaluation of Cloud HPC with a Global Atmo...inside-BigData.com
 
Cost Optimization as Major Architectural Consideration for Cloud Application
Cost Optimization as Major Architectural Consideration for Cloud ApplicationCost Optimization as Major Architectural Consideration for Cloud Application
Cost Optimization as Major Architectural Consideration for Cloud ApplicationUdayan Banerjee
 
Cloud Computing: Safe Haven from the Data Deluge? AGBT 2011
Cloud Computing: Safe Haven from the Data Deluge? AGBT 2011Cloud Computing: Safe Haven from the Data Deluge? AGBT 2011
Cloud Computing: Safe Haven from the Data Deluge? AGBT 2011Toby Bloom
 
Cloud cost optimization (AWS, GCP)
Cloud cost optimization (AWS, GCP)Cloud cost optimization (AWS, GCP)
Cloud cost optimization (AWS, GCP)Szabolcs Zajdó
 
Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Sujee Maniyam
 
Demonstrating 100 Gbps in and out of the Clouds
Demonstrating 100 Gbps in and out of the CloudsDemonstrating 100 Gbps in and out of the Clouds
Demonstrating 100 Gbps in and out of the CloudsIgor Sfiligoi
 
Взгляд на облака с точки зрения HPC
Взгляд на облака с точки зрения HPCВзгляд на облака с точки зрения HPC
Взгляд на облака с точки зрения HPCOlga Lavrentieva
 
AWS Summit 2013 | Auckland - Understanding AWS Storage Options
AWS Summit 2013 | Auckland - Understanding AWS Storage OptionsAWS Summit 2013 | Auckland - Understanding AWS Storage Options
AWS Summit 2013 | Auckland - Understanding AWS Storage OptionsAmazon Web Services
 
Scalable Scientific Computing with Dask
Scalable Scientific Computing with DaskScalable Scientific Computing with Dask
Scalable Scientific Computing with DaskUwe Korn
 
Characterizing network paths in and out of the Clouds
Characterizing network paths in and out of the CloudsCharacterizing network paths in and out of the Clouds
Characterizing network paths in and out of the CloudsIgor Sfiligoi
 
Characterizing Network Paths in and out of the Clouds
Characterizing Network Paths in and out of the CloudsCharacterizing Network Paths in and out of the Clouds
Characterizing Network Paths in and out of the Cloudsinside-BigData.com
 

La actualidad más candente (20)

Big Data and High Performance Computing Solutions in the AWS Cloud
Big Data and High Performance Computing Solutions in the AWS CloudBig Data and High Performance Computing Solutions in the AWS Cloud
Big Data and High Performance Computing Solutions in the AWS Cloud
 
HPC on AWS
HPC on AWSHPC on AWS
HPC on AWS
 
Optimization of Resource Provisioning Cost in Cloud Computing
Optimization of Resource Provisioning Cost in Cloud ComputingOptimization of Resource Provisioning Cost in Cloud Computing
Optimization of Resource Provisioning Cost in Cloud Computing
 
HPC in AWS - Technical Workshop
HPC in AWS - Technical WorkshopHPC in AWS - Technical Workshop
HPC in AWS - Technical Workshop
 
(CMP202) Engineering Simulation and Analysis in the Cloud
(CMP202) Engineering Simulation and Analysis in the Cloud(CMP202) Engineering Simulation and Analysis in the Cloud
(CMP202) Engineering Simulation and Analysis in the Cloud
 
High Performance Computing (HPC) in cloud
High Performance Computing (HPC) in cloudHigh Performance Computing (HPC) in cloud
High Performance Computing (HPC) in cloud
 
AWS for HPC in Drug Discovery
AWS for HPC in Drug DiscoveryAWS for HPC in Drug Discovery
AWS for HPC in Drug Discovery
 
2016 Utah Cloud Summit: TCO & Cost Optimization
2016 Utah Cloud Summit: TCO & Cost Optimization2016 Utah Cloud Summit: TCO & Cost Optimization
2016 Utah Cloud Summit: TCO & Cost Optimization
 
NAVGEM on the Cloud: Computational Evaluation of Cloud HPC with a Global Atmo...
NAVGEM on the Cloud: Computational Evaluation of Cloud HPC with a Global Atmo...NAVGEM on the Cloud: Computational Evaluation of Cloud HPC with a Global Atmo...
NAVGEM on the Cloud: Computational Evaluation of Cloud HPC with a Global Atmo...
 
Cost Optimization as Major Architectural Consideration for Cloud Application
Cost Optimization as Major Architectural Consideration for Cloud ApplicationCost Optimization as Major Architectural Consideration for Cloud Application
Cost Optimization as Major Architectural Consideration for Cloud Application
 
Cloud Computing: Safe Haven from the Data Deluge? AGBT 2011
Cloud Computing: Safe Haven from the Data Deluge? AGBT 2011Cloud Computing: Safe Haven from the Data Deluge? AGBT 2011
Cloud Computing: Safe Haven from the Data Deluge? AGBT 2011
 
Cloud cost optimization (AWS, GCP)
Cloud cost optimization (AWS, GCP)Cloud cost optimization (AWS, GCP)
Cloud cost optimization (AWS, GCP)
 
Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2
 
Demonstrating 100 Gbps in and out of the Clouds
Demonstrating 100 Gbps in and out of the CloudsDemonstrating 100 Gbps in and out of the Clouds
Demonstrating 100 Gbps in and out of the Clouds
 
Взгляд на облака с точки зрения HPC
Взгляд на облака с точки зрения HPCВзгляд на облака с точки зрения HPC
Взгляд на облака с точки зрения HPC
 
AWS Summit 2013 | Auckland - Understanding AWS Storage Options
AWS Summit 2013 | Auckland - Understanding AWS Storage OptionsAWS Summit 2013 | Auckland - Understanding AWS Storage Options
AWS Summit 2013 | Auckland - Understanding AWS Storage Options
 
Scalable Scientific Computing with Dask
Scalable Scientific Computing with DaskScalable Scientific Computing with Dask
Scalable Scientific Computing with Dask
 
The Future of Apache Storm
The Future of Apache StormThe Future of Apache Storm
The Future of Apache Storm
 
Characterizing network paths in and out of the Clouds
Characterizing network paths in and out of the CloudsCharacterizing network paths in and out of the Clouds
Characterizing network paths in and out of the Clouds
 
Characterizing Network Paths in and out of the Clouds
Characterizing Network Paths in and out of the CloudsCharacterizing Network Paths in and out of the Clouds
Characterizing Network Paths in and out of the Clouds
 

Similar a Suitability of Commercial Clouds for NASA's HPC Applications

AWS re:Invent 2016: Building HPC Clusters as Code in the (Almost) Infinite Cl...
AWS re:Invent 2016: Building HPC Clusters as Code in the (Almost) Infinite Cl...AWS re:Invent 2016: Building HPC Clusters as Code in the (Almost) Infinite Cl...
AWS re:Invent 2016: Building HPC Clusters as Code in the (Almost) Infinite Cl...Amazon Web Services
 
AWS Summit London 2014 | Optimising TCO for the AWS Cloud (100)
AWS Summit London 2014 | Optimising TCO for the AWS Cloud (100)AWS Summit London 2014 | Optimising TCO for the AWS Cloud (100)
AWS Summit London 2014 | Optimising TCO for the AWS Cloud (100)Amazon Web Services
 
High Performance Computing with AWS
High Performance Computing with AWSHigh Performance Computing with AWS
High Performance Computing with AWSAmazon Web Services
 
Migrating thousands of workloads to AWS at enterprise scale
Migrating thousands of workloads to AWS at enterprise scaleMigrating thousands of workloads to AWS at enterprise scale
Migrating thousands of workloads to AWS at enterprise scaleTom Laszewski
 
Migration Recipes for Success - AWS Summit Cape Town 2017
Migration Recipes for Success - AWS Summit Cape Town 2017 Migration Recipes for Success - AWS Summit Cape Town 2017
Migration Recipes for Success - AWS Summit Cape Town 2017 Amazon Web Services
 
Cloud Overview
Cloud OverviewCloud Overview
Cloud Overviewiasaglobal
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computingiasaglobal
 
Simulation of Heterogeneous Cloud Infrastructures
Simulation of Heterogeneous Cloud InfrastructuresSimulation of Heterogeneous Cloud Infrastructures
Simulation of Heterogeneous Cloud InfrastructuresCloudLightning
 
Webinar: Burst ANSYS Workloads to the Cloud with Univa & UberCloud
Webinar: Burst ANSYS Workloads to the Cloud with Univa & UberCloudWebinar: Burst ANSYS Workloads to the Cloud with Univa & UberCloud
Webinar: Burst ANSYS Workloads to the Cloud with Univa & UberCloudThomas Francis
 
Platform as a service standard for hadoop environment
Platform as a service standard for hadoop environmentPlatform as a service standard for hadoop environment
Platform as a service standard for hadoop environmentAbhay Pai
 
Research on vector spatial data storage scheme based
Research on vector spatial data storage scheme basedResearch on vector spatial data storage scheme based
Research on vector spatial data storage scheme basedAnant Kumar
 
IEEE Paper - A Study Of Cloud Computing Environments For High Performance App...
IEEE Paper - A Study Of Cloud Computing Environments For High Performance App...IEEE Paper - A Study Of Cloud Computing Environments For High Performance App...
IEEE Paper - A Study Of Cloud Computing Environments For High Performance App...Angela Williams
 
Optimized placement in Openstack for NFV
Optimized placement in Openstack for NFVOptimized placement in Openstack for NFV
Optimized placement in Openstack for NFVDebojyoti Dutta
 
Navops talk at hpc in the cloud meetup 19 march 2019
Navops talk at hpc in the cloud meetup 19 march 2019Navops talk at hpc in the cloud meetup 19 march 2019
Navops talk at hpc in the cloud meetup 19 march 2019Abhishek Gupta
 
(ENT206) Migrating Thousands of Workloads to AWS at Enterprise Scale | AWS re...
(ENT206) Migrating Thousands of Workloads to AWS at Enterprise Scale | AWS re...(ENT206) Migrating Thousands of Workloads to AWS at Enterprise Scale | AWS re...
(ENT206) Migrating Thousands of Workloads to AWS at Enterprise Scale | AWS re...Amazon Web Services
 
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...Amazon Web Services
 
Guide to Application Performance: Planning to Continued Optimization
Guide to Application Performance: Planning to Continued OptimizationGuide to Application Performance: Planning to Continued Optimization
Guide to Application Performance: Planning to Continued OptimizationMuleSoft
 

Similar a Suitability of Commercial Clouds for NASA's HPC Applications (20)

AWS re:Invent 2016: Building HPC Clusters as Code in the (Almost) Infinite Cl...
AWS re:Invent 2016: Building HPC Clusters as Code in the (Almost) Infinite Cl...AWS re:Invent 2016: Building HPC Clusters as Code in the (Almost) Infinite Cl...
AWS re:Invent 2016: Building HPC Clusters as Code in the (Almost) Infinite Cl...
 
AWS Summit London 2014 | Optimising TCO for the AWS Cloud (100)
AWS Summit London 2014 | Optimising TCO for the AWS Cloud (100)AWS Summit London 2014 | Optimising TCO for the AWS Cloud (100)
AWS Summit London 2014 | Optimising TCO for the AWS Cloud (100)
 
High Performance Computing with AWS
High Performance Computing with AWSHigh Performance Computing with AWS
High Performance Computing with AWS
 
Migrating thousands of workloads to AWS at enterprise scale
Migrating thousands of workloads to AWS at enterprise scaleMigrating thousands of workloads to AWS at enterprise scale
Migrating thousands of workloads to AWS at enterprise scale
 
Cloud
CloudCloud
Cloud
 
Migration Recipes for Success - AWS Summit Cape Town 2017
Migration Recipes for Success - AWS Summit Cape Town 2017 Migration Recipes for Success - AWS Summit Cape Town 2017
Migration Recipes for Success - AWS Summit Cape Town 2017
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
 
Cloud Overview
Cloud OverviewCloud Overview
Cloud Overview
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
 
Simulation of Heterogeneous Cloud Infrastructures
Simulation of Heterogeneous Cloud InfrastructuresSimulation of Heterogeneous Cloud Infrastructures
Simulation of Heterogeneous Cloud Infrastructures
 
Cloud computing_Final
Cloud computing_FinalCloud computing_Final
Cloud computing_Final
 
Webinar: Burst ANSYS Workloads to the Cloud with Univa & UberCloud
Webinar: Burst ANSYS Workloads to the Cloud with Univa & UberCloudWebinar: Burst ANSYS Workloads to the Cloud with Univa & UberCloud
Webinar: Burst ANSYS Workloads to the Cloud with Univa & UberCloud
 
Platform as a service standard for hadoop environment
Platform as a service standard for hadoop environmentPlatform as a service standard for hadoop environment
Platform as a service standard for hadoop environment
 
Research on vector spatial data storage scheme based
Research on vector spatial data storage scheme basedResearch on vector spatial data storage scheme based
Research on vector spatial data storage scheme based
 
IEEE Paper - A Study Of Cloud Computing Environments For High Performance App...
IEEE Paper - A Study Of Cloud Computing Environments For High Performance App...IEEE Paper - A Study Of Cloud Computing Environments For High Performance App...
IEEE Paper - A Study Of Cloud Computing Environments For High Performance App...
 
Optimized placement in Openstack for NFV
Optimized placement in Openstack for NFVOptimized placement in Openstack for NFV
Optimized placement in Openstack for NFV
 
Navops talk at hpc in the cloud meetup 19 march 2019
Navops talk at hpc in the cloud meetup 19 march 2019Navops talk at hpc in the cloud meetup 19 march 2019
Navops talk at hpc in the cloud meetup 19 march 2019
 
(ENT206) Migrating Thousands of Workloads to AWS at Enterprise Scale | AWS re...
(ENT206) Migrating Thousands of Workloads to AWS at Enterprise Scale | AWS re...(ENT206) Migrating Thousands of Workloads to AWS at Enterprise Scale | AWS re...
(ENT206) Migrating Thousands of Workloads to AWS at Enterprise Scale | AWS re...
 
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
 
Guide to Application Performance: Planning to Continued Optimization
Guide to Application Performance: Planning to Continued OptimizationGuide to Application Performance: Planning to Continued Optimization
Guide to Application Performance: Planning to Continued Optimization
 

Más de inside-BigData.com

Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...inside-BigData.com
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networksinside-BigData.com
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...inside-BigData.com
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...inside-BigData.com
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...inside-BigData.com
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networksinside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoringinside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecastsinside-BigData.com
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Updateinside-BigData.com
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuninginside-BigData.com
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODinside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Accelerationinside-BigData.com
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficientlyinside-BigData.com
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Erainside-BigData.com
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computinginside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Clusterinside-BigData.com
 

Más de inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Último

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfOverkill Security
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 

Último (20)

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 

Suitability of Commercial Clouds for NASA's HPC Applications

  • 1. National Aeronautics and Space Administration www.nasa.gov Suitability of Commercial Clouds for NASA's HPC Applications Robert Hood, InuTeq NASA Advanced Supercomputing Division NASA Ames Research Center HPC User Forum
  • 2. The HECC Project at NASA Ames High End Computing Capability •  Facilitating Science/Engineering at NASA •  4 compute systems, including: •  Pleiades: #24 in Top500; #14 in HPCG •  Electra: #43 & #37 (both should go up) •  1300+ users from across Agency Directorates •  Aeronautics, Human Exploration, Science •  ~900 MPI applications (+ non-MPI codes) Goal: Maximize resource delivery to users •  Strive for >80% utilization of resources •  Embrace new facility technologies •  Continually evaluate procurement options
  • 3. NASA HQ: Is HECC doing the best it can? Marketing Reps say USE
  • 4. Answering the Question: A Trade Study Key Questions: •  Would clouds be a cost-effective replacement for on-premises resources? •  Under what conditions could cloud usage be cost effective? Approach: Compare costs of running NASA workload on premises & in cloud •  Evaluation Systems •  Existing resources on premises at NASA InfiniBand interconnect •  Amazon Web Services (AWS) Ethernet interconnect •  Penguin-On-Demand (POD) InfiniBand, OmniPath interconnects •  Benchmarks •  NAS Parallel Benchmarks (NPBs): Classes C and D •  6 full-sized applications: •  Weather/Climate: FVCore, ECCO (really MITgcm), WRF/nuWRF •  Science: Athena++, Enzo •  CFD: OpenFoam representing NASA applications that are export controlled and couldn’t be used for security reasons
  • 5. Trade Study Approach, continued Cost Basis: •  Use worst-case cost estimate for on premises: the full cost •  Compute cost of delivering a System Billing Unit (SBU) of work: CostPerSBU = (Total HECC Budget) ➗ (number of SBUs delivered to Agency) •  Cost of run is: (number of SBUs used in run) ✖ CostPerSBU •  Note: includes all hardware, software, power, maintenance, staff, and facility costs •  Use best-case cost estimate for cloud: the compute-only cost •  AWS: use estimate of best spot price (30% of on-demand price) •  POD: use published rates (discounts likely available, esp. for volume) •  Note: does not include costs for storage, software licenses, bandwidth, HECC staff, etc. Results and Analysis •  Do a performance and cost analysis for jobs running on similar processor types •  Produce findings and identify actions for HECC
  • 6. Comparison of Compute Resources Used Finding: The per-hour full cost of HECC resources is cheaper than the (compute-only) spot price of similar resources at AWS and significantly cheaper than the (compute-only) price of similar resources at POD.
  • 7. NAS Parallel Benchmark Results Performed runs from to 16–1024 MPI ranks to evaluate scaling •  POD performance was similar to HECC •  AWS performance was worse than HECC on communication- intensive benchmarks Cost to run the full set of the NPBs •  AWS was 5.8–12 times more expensive than HECC, depending on the processor type used •  POD was 4.7 times more expensive than equivalent runs at HECC
  • 8. Full-Sized Application Results •  Large computations took substantially more time on AWS than on HECC •  When cores/node are equal (BRO), POD performance is close to HECC’s, but still lags •  AWS was 1.9x more expensive than HECC for the application workload, even with spot pricing •  POD was 5.3x more expensive than HECC for the application workload Finding: Tightly-coupled, multi-node applications from the NASA workload take somewhat more time when run on cloud-based nodes connected with HPC-level interconnects; they take significantly more time when run on cloud-based nodes that use conventional, Ethernet-based interconnects.
  • 9. Main Finding Finding: Commercial clouds do not offer a viable, cost-effective approach for replacing in-house HPC resources at NASA Why might others get different results? •  Our applications have different computation & communication patterns from theirs? •  Low HECC Costs •  Equipment, power, cooling •  Staff: size of system means lower staff costs per SBU than smaller centers •  High HECC Utilization •  Evolution of system: •  Effectively same system for 10 years •  User familiarity reduces ramp up costs to reach high utilization of new resources •  Minimize interruptions for maintenance •  Rolling upgrades, suspend/resume, … So, are there cases in HECC where clouds might be cost effective?
  • 10. Cost-Effective Uses of Cloud for NASA When utilization would be low •  New resource types, e.g. GPU-accelerated nodes •  Use cloud-based resources until demand rises to the point where it is more cost effective to acquire resources and move work on premises When there are other costs to consider •  Opportunity costs associated with high utilization, e.g. longer waits in the queue •  What if we move 1-node jobs to the cloud? Would that speed up scheduling of remaining jobs? Real-time requirements •  e.g web services Finding: Commercial clouds provide a variety of resources not available at HECC. Certain use cases may be cost-effective to run there.
  • 11. Follow-up Work Identified Get a better understanding of potential benefits and costs of having a portion of the HECC workload in the cloud •  Impact of 1-node jobs on scheduling of rest of workload •  Understand performance characteristics of jobs that might run there Define a comprehensive model that allows accurate comparisons of cost of running jobs depending on resources used Prepare for a broadening of services provided by HECC to include a portion of its workload running on commercial cloud resources •  For HECC users •  For non-HECC users
  • 12. Running 1-Node Jobs in the Cloud Motivation •  One-node jobs not as likely to see performance impact due to commodity interconnect •  Does moving them to the cloud speed up scheduling of wider jobs? Experiment •  Simulate scheduling of actual job mix with 1-node jobs removed •  Determine impact to starting time of remaining jobs Results •  While some jobs started much sooner, they
 tended to be using relatively few nodes •  Weighted average of improvement was ~2 hours •  Not clear that benefit would be worth added
 cost of using cloud resources
  • 13. Integrating Cloud Usage into HECC Pilot project to move jobs from HECC resources to AWS •  HECC user logs into cloud front end to build executable •  User annotates batch scripts, indicating files that need to be staged to/from cloud •  User submits batch jobs to “cloud” queue •  PBS server moves jobs to server running at AWS; stages input files •  Server in cloud allocates resources, runs job, stages output files back •  Accounting is done manually; HECC pays •  Limited to non-export controlled codes and data (i.e. “low” security plan) User-defined software stacks •  Security concerns with Singularity (runs as root) •  Charliecloud-based containers are unprivileged •  Deployed Charliecloud on HECC; tested on AWS as well •  Don’t yet have a solution for MPI programs running on multiple nodes
  • 14. Future Work Extend Pilot Project to Include: •  Full accounting, with account limits and automated tracking of consumption •  Moderate security plan Extend Services to Include non-HECC Users •  They bring their own funding •  Provide user interface for defining and running jobs, moving data, etc. •  HECC would add cost-recovery fee to make this self sustaining Cost Methodology •  Need to be able to do meaningful cost comparisons between on-premises and in-cloud in order to determine which would be most cost effective •  Include opportunity costs, too •  Benchmark suite •  Adjust processes for acquisition and phase out of resources to include commercial cloud
  • 15. Acknowledgements The Team: S. Chang, R. Hood, H. Jin, S. Heistand, J. Chang, S. Cheung, J. Djomehri, G. Jost, D. Kokron Helpful Advice from: R. Biswas, B. Ciotti, L. Hogle, T. Lee, P. Mehrotra, and W. Thigpen Resources: •  The NASA High-End Computing Capability (HECC) •  Penguin Computing made their cloud-based resources available to our evaluation at no cost and provided early access to their Skylake-based nodes The Report: NAS Technical Report NAS-2018-001 https://www.nas.nasa.gov/assets/pdf/ papers/NAS_Technical_Report_NAS-2018-01.pdf
  • 16. ?
  • 18. Full-Sized Application Results: 
 AWS vs HECC •  Large computations took substantially more time on AWS than on HECC •  AWS was 1.9x more expensive than HECC for the application workload •  Even with optimistic assumptions about getting spot pricing
  • 19. •  When cores/node are equal, POD performance is just a little bit worse than HECC’s •  POD was 5.3 times more expensive than HECC for the application workload Finding: Tightly-coupled, multi-node applications from the NASA workload take somewhat more time when run on cloud-based nodes connected with HPC-level interconnects; they take significantly more time when run on cloud-based nodes that use conventional, Ethernet-based interconnects. APP Case NCPUS # of HECC nodes # of POD nodes HECC time (sec) Total HECC full cost POD time (sec) Total POD compute cost ENZO (HAS) SBU2 196 9 10 1827 $2.44 2355 $11.78 ENZO (BRO) SBU2 196 7 7 1625 $2.04 1870 $10.18 ENZO (SKY) SBU2 196 5 5 1519 $2.15 1751 $10.70 WRF (HAS) SBU1 384 16 20 1243 $2.95 1802 $18.02 WRF (BRO) SBU1 384 14 14 1225 $3.08 1436 $15.64 WRF (SKY) SBU1 384 10 10 1069 $3.02 1352 $16.52 Total Cost $15.68 $82.84 Full-Sized Application Results: 
 POD vs HECC