© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Set Up a Million-Core Cluster to
Accelerate HPC Workloads
CMP404
Alex Emilcar
Solutions Architect
Amazon Web Services
Aydn Bekirov
Technical Account Manager
Amazon Web Services
Workshop architecture
[Diagram, steps numbered 1 to 5: Submit Job; Job Definition; Job Queue; AWS Batch Scheduler; ECS RunTask; Spot Fleet (Managed Compute Environment); S3 Bucket with Container Pre-Reqs; S3 Bucket for Job Input/Outputs; Logs + Metrics; all in the EU-West-1 (Ireland) Region across 3 AZs]
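As a rough illustration of the Submit Job step in this architecture, the sketch below builds the keyword arguments for a Boto3 `submit_job` call on the AWS Batch client. The job name, queue name, job definition name, bucket URIs, and the `INPUT_S3_URI` / `OUTPUT_S3_URI` environment variable names are hypothetical placeholders, not values from the workshop. The actual API call is left commented out because it needs Boto3 and AWS credentials.

```python
def build_submit_job_request(job_name, job_queue, job_definition,
                             input_uri, output_uri):
    """Return keyword arguments suitable for boto3's batch.submit_job()."""
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            # Hypothetical variable names; the container decides what it reads.
            "environment": [
                {"name": "INPUT_S3_URI", "value": input_uri},
                {"name": "OUTPUT_S3_URI", "value": output_uri},
            ]
        },
    }


# All names below are made up for illustration.
request = build_submit_job_request(
    job_name="sim-0001",
    job_queue="workshop-queue",
    job_definition="workshop-jobdef",
    input_uri="s3://example-input-bucket/sim-0001/input.dat",
    output_uri="s3://example-output-bucket/sim-0001/",
)

# To actually submit (requires boto3 and credentials):
# import boto3
# boto3.client("batch", region_name="eu-west-1").submit_job(**request)
```

Keeping the request construction separate from the API call makes it easy to inspect or log what would be submitted before any job is queued.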
Lessons from large-scale runs (>100K vCPUs)
Extended AWS Team
Anh Tran
Sr HPC Specialized SA
Matt Easton
Enterprise Account Manager
Andrea Rodolico
Principal HPC BDM
Pierre-Yves Aquilanti
Sr HPC Specialized SA
© 2018 Western Digital Corporation or its affiliates. All rights reserved. 11/26/2018
CMP404―Set Up a Million-Core
Cluster to Accelerate HPC
Workloads
Hiroshi Kobayashi
26th November, 2018 @ AWS re:Invent
Senior Solution Architect
Safe Harbor | Disclaimers
Forward-Looking Statements
This presentation contains certain forward-looking statements that involve risks and uncertainties, including, but not limited to, statements regarding HDD technology progression, need for energy-assisted recording, potential growth, and trends. Forward-looking statements should not be read as a guarantee of future performance or results, and will not necessarily be accurate indications of the times at, or by, which such performance or results will be achieved, if at all. Forward-looking statements are subject to risks and uncertainties that could cause actual performance or results to differ materially from those expressed in or suggested by the forward-looking statements.
Key risks and uncertainties include volatility in global economic conditions, actions by competitors, business conditions, growth in our markets, pricing trends and fluctuations in average selling prices, and other risks and uncertainties listed in our filings with the Securities and Exchange Commission (the “SEC”) and available on the SEC’s website at www.sec.gov, including our most recently filed periodic report, to which your attention is directed. We do not undertake any obligation to publicly update or revise any forward-looking statement, whether as a result of new information, future developments or otherwise, except as otherwise required by law.
Agenda
1. Who am I?
2. Western Digital
3. AWS Batch architecture
4. Design principles
5. Conclusion
Who am I?
---
- Name: Hiroshi Kobayashi
  Company: Western Digital
  Team: Global Engineering Services
  Location: Japan
- Favorites:
    AWS services: Amazon Simple Storage Service (Amazon S3), AWS Batch
    Community: Japan AWS User Group (JAWS-UG)
    Tools: AWS Command Line Interface (AWS CLI), Ansible, Jenkins
Brand architecture
Corporate and product brands
[Diagram: corporate brand (Western Digital), Western Digital branded B2B product lines (OpenFlex™, Ultrastar®, iNAND®, …), and product brands with target audiences labeled Consumer, Consumer, Consumer, Consumer, B2B; brand logos not recoverable]
Major enablers
“Moore’s Law” for HDDs
60 years → 500 million × increase in areal density
[Chart: areal density (Gbit/in2) over time, from IBM RAMAC through IBM 3380 to He10. Growth-rate segments as extracted: 10~25% CGR, 60% CGR, 80% CGR, 30% CGR, 70% CGR, 30% CGR. Enabler labels as extracted: PMR, 2006, He Sealed Drive, ferrite head, thin film inductive reader, thin film media, AMR reader, PRML channel, AFC media, GMR reader, EPRML channel, TMR reader, LDPC channel, Energy Assist]
Writeability at high TPI (tracks per inch)
Why energy assisted recording is required
Scaling beyond Perpendicular Magnetic Recording (PMR) requires Energy Assisted Recording
[Diagram labels: Scaling Capacity, Switching Challenge, Magnetic Stability; Energy Assist: Microwave, Laser Heat]
Why cloud?
• The “width” of the cloud far outsizes any internal resource (throughput limited)
• Advantages
– Scale
– Newest hardware available
• Concerns and resolutions
– Cost for computing, storage and data transfer: Spot Instances, Amazon S3, compress everything
– Security: Amazon Virtual Private Cloud (Amazon VPC), AWS Direct Connect, Amazon S3 VPC endpoints
• In a nutshell: the throughput-to-cost ratio is the key metric
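One way to make the throughput-to-cost metric concrete is to divide jobs completed per hour by dollars spent per hour, which yields jobs per dollar. The sketch below assumes that definition; the slide does not spell out the exact formula, and the two option names and all numbers are invented purely to exercise the function.

```python
def throughput_to_cost(jobs_per_hour, dollars_per_hour):
    """Jobs completed per dollar spent (higher is better)."""
    if dollars_per_hour <= 0:
        raise ValueError("dollars_per_hour must be positive")
    return jobs_per_hour / dollars_per_hour


# Made-up figures, not measurements from the talk.
options = {
    "option_a": throughput_to_cost(jobs_per_hour=1_000, dollars_per_hour=50),
    "option_b": throughput_to_cost(jobs_per_hour=100_000, dollars_per_hour=2_500),
}
best = max(options, key=options.get)
```

Comparing configurations on this single ratio keeps instance type, pricing model, and fleet size on one scale.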
AWS Batch + Jenkins
Architecture
• Docker image build process is automated by Jenkins
• Fetch & Run
– Pull input files from Amazon S3
– Run simulations
– Put output files to Amazon S3
• Responsibility
– User = Dockerfile, job queue, job definition, Jenkins
– IT Admin = AMI, compute environment, Jenkins
[Diagram labels: users, Code Repo, Jenkins, Spot Fleet; arrows: Push, Hook, Build, Push, Submit job, Pull, Input/Output]
The Jenkins logo is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License (https://creativecommons.org/licenses/by-sa/3.0/) from the Jenkins project (https://jenkins.io).
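The Fetch & Run pattern (pull input, run, put output) can be sketched as three commands executed in order. This is a minimal illustration, not the presenters' script: the local file names `input.dat` and `output.dat`, the simulation command, and the S3 URIs are hypothetical. The `run` parameter defaults to `subprocess.run` and can be swapped for a stub when trying the flow without the AWS CLI installed.

```python
import subprocess


def s3_copy_command(source, destination):
    """Build an `aws s3 cp` command line as an argument list."""
    return ["aws", "s3", "cp", source, destination]


def fetch_and_run(input_uri, output_uri, simulate_cmd, run=subprocess.run):
    """Pull the input, run the simulation, push the output.

    Each command runs with check=True, so a failing step raises and the
    later steps are skipped. Returns the list of commands issued.
    """
    steps = [
        s3_copy_command(input_uri, "input.dat"),    # hypothetical local name
        simulate_cmd,
        s3_copy_command("output.dat", output_uri),  # hypothetical local name
    ]
    for cmd in steps:
        run(cmd, check=True)
    return steps
```

In a container, `fetch_and_run("s3://example-bucket/in.dat", "s3://example-bucket/out.dat", ["./simulate"])` would be the whole entry point, with the URIs arriving through environment variables.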
Design principles
Design the workflow for failure
• Minimize cost
– Compress everything to reduce network and storage cost
– Minimize “local” storage
– Minimize Amazon S3 access
• Assume Spot Instances can go away without warning
– Keep compute short (<15 minutes) or support “checkpoints”
– Save results as soon as possible to a persistent store such as Amazon S3
• Keep it simple, try to standardize
– Easy to debug
– AWS CLI, Bash v4 and Boto3
• Computational characteristics
– Single threaded and pleasingly parallel, 1:45-2:15 job durations, 0.2-2 GB memory, minimal network
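The “checkpoints” idea can be sketched as a loop that records how many steps have finished, so a job restarted after a Spot interruption resumes instead of starting over. This is an assumption-laden illustration: the JSON checkpoint format and the `completed_steps` field are invented here, and the checkpoint is written to a local file, whereas a real job would have to copy it to a persistent store such as Amazon S3 to survive losing the instance.

```python
import json
import os


def run_with_checkpoints(steps, checkpoint_path):
    """Run callables in order, resuming after the last recorded step.

    Returns the number of steps executed during this invocation.
    """
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)["completed_steps"]

    for index in range(done, len(steps)):
        steps[index]()
        # Write to a temporary file, then rename, so an interruption
        # mid-write never leaves a truncated checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"completed_steps": index + 1}, f)
        os.replace(tmp_path, checkpoint_path)

    return len(steps) - done
```

The alternative named on the slide, keeping each job under 15 minutes, avoids this bookkeeping entirely by making a lost job cheap to rerun.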
Simple run script
• Construct a generic run script at submission time
– Environment variables: set in the Docker image or the submit-job command
– Set up run directory and input files
– Run job, switching executable to match the AVX generation (40% faster per job)
– Push results to Amazon S3
– Clean up run directory
– Exit code
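The executable-switching step might look like the sketch below, which reads the CPU feature flags that Linux exposes in `/proc/cpuinfo` on x86 and picks the build for the newest AVX generation available. The binary names (`sim_avx512`, `sim_avx2`, `sim_avx`, `sim_generic`) are hypothetical, and the slide does not say how the presenters detected the AVX generation.

```python
def read_cpu_flags(cpuinfo_text):
    """Extract the feature-flag set from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def pick_executable(cpu_flags):
    """Choose the binary built for the newest AVX generation reported."""
    if "avx512f" in cpu_flags:
        return "sim_avx512"   # hypothetical binary names throughout
    if "avx2" in cpu_flags:
        return "sim_avx2"
    if "avx" in cpu_flags:
        return "sim_avx"
    return "sim_generic"
```

On a Linux host, `pick_executable(read_cpu_flags(open("/proc/cpuinfo").read()))` returns the name to launch. Because a Spot Fleet can mix instance types, making this choice at run time rather than at image build time keeps one container image usable across all of them.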
Optimize Amazon S3 access
• Construct a generic run script at submission time
– Compress/decompress automatically to reduce storage and network cost
– Expect problems: every AWS CLI command is wrapped in an ‘until’ construct
– Add a random prefix to object keys to increase Amazon S3 performance
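The three tactics on this slide can be sketched in Python as follows. `retry_until_success` mirrors a shell `until` loop by retrying until the call stops raising. `prefixed_key` uses a hash-derived prefix rather than a truly random one so the key can be recomputed when reading the object back; that substitution is an assumption of this sketch, not what the slide describes. Whether prefix spreading still helps depends on current S3 behavior, so check the S3 performance guidance before relying on it.

```python
import gzip
import hashlib
import time


def retry_until_success(action, delay_seconds=1.0, sleep=time.sleep):
    """Call action() until it stops raising; return (result, attempts).

    Like a shell `until` loop, this never gives up on its own.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return action(), attempts
        except Exception:
            sleep(delay_seconds)


def prefixed_key(key, width=4):
    """Prepend a short hash of the key so keys spread across prefixes."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[:width]}/{key}"


def compress(data):
    """Gzip a bytes payload before upload."""
    return gzip.compress(data)
```

An unbounded retry suits short batch jobs where a stuck job is eventually reclaimed anyway; a longer-lived service would probably want a maximum attempt count and backoff.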
Save results as soon as they appear
• Avoid losing computation
– Use the kernel’s ‘inotify’ and ‘inotify-tools’
– Only triggered when a file is “closed after write”
– Spawn compression, push to Amazon S3, and deletion of the file so the watcher is unblocked for the next event
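A minimal sketch of this watcher, assuming the `inotifywait` tool from inotify-tools is installed: it listens for `close_write` events and, for each finished file, compresses it, hands it to a caller-supplied `push` function, and deletes the local copies. The `push` callback stands in for the S3 upload and is hypothetical. Unlike the slide's description, this version handles each file synchronously rather than spawning the work in the background.

```python
import gzip
import os
import shutil
import subprocess


def compress_push_delete(path, push):
    """Gzip a finished file, hand it to push(), then remove local copies."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    push(gz_path)
    os.remove(path)
    os.remove(gz_path)
    return gz_path


def watch(directory, push):
    """Process every file in directory once it is closed after writing."""
    cmd = ["inotifywait", "-m", "-q", "-e", "close_write",
           "--format", "%w%f", directory]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            path = line.rstrip("\n")
            # The .gz files written above also raise close_write; skip them
            # so the watcher does not try to compress its own output.
            if path.endswith(".gz"):
                continue
            compress_push_delete(path, push)
```

Reacting to `close_write` rather than file creation means a result is only shipped once the simulation has finished writing it.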
Conclusion
• We containerized our HPC applications
• We scaled to over 1 million vCPUs
– Using containers, Spot Fleet and Amazon S3
• We optimized for the cloud
– Profiled over 30 instance offerings, narrowing down to medium-sized, AVX-capable instances
– 20 days’ worth of work in eight hours at 1M vCPUs
– 40% faster per job than previously (CPU_Time vs vCPU)
– Used for real design of future HDD components
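The headline numbers imply a wall-clock speedup that is easy to work out. The calculation below takes the slide's figures at face value (20 days of work, eight hours, one million vCPUs) and assumes that “20 days” refers to elapsed time on the previous setup, which the slide does not state explicitly.

```python
baseline_days = 20
cloud_hours = 8
vcpus = 1_000_000

# 20 days is 480 hours; finishing in 8 hours is a 60x wall-clock speedup.
speedup = baseline_days * 24 / cloud_hours

# Total compute consumed during the run, in vCPU-hours.
vcpu_hours = vcpus * cloud_hours
```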
Thank you!
Alex Emilcar
Solutions Architect
Amazon Web Services
Aydn Bekirov
Technical Account Manager
Amazon Web Services
Hiroshi Kobayashi
Sr. Solutions Architect
Western Digital

More Related Content

What's hot

OTT 成功的關鍵:打造影劇品質監控儀表板 (Level: 200)
OTT 成功的關鍵:打造影劇品質監控儀表板 (Level: 200)OTT 成功的關鍵:打造影劇品質監控儀表板 (Level: 200)
OTT 成功的關鍵:打造影劇品質監控儀表板 (Level: 200)Amazon Web Services
 
Studio in the Cloud: Producing Content on AWS (MAE202) - AWS re:Invent 2018
Studio in the Cloud: Producing Content on AWS (MAE202) - AWS re:Invent 2018Studio in the Cloud: Producing Content on AWS (MAE202) - AWS re:Invent 2018
Studio in the Cloud: Producing Content on AWS (MAE202) - AWS re:Invent 2018Amazon Web Services
 
AWS 微服務中的 Container 選項比較 (Level 400)
AWS 微服務中的 Container 選項比較   (Level 400)AWS 微服務中的 Container 選項比較   (Level 400)
AWS 微服務中的 Container 選項比較 (Level 400)Amazon Web Services
 
Networking for VMware Cloud on AWS (NET307-R1) - AWS re:Invent 2018
Networking for VMware Cloud on AWS (NET307-R1) - AWS re:Invent 2018Networking for VMware Cloud on AWS (NET307-R1) - AWS re:Invent 2018
Networking for VMware Cloud on AWS (NET307-R1) - AWS re:Invent 2018Amazon Web Services
 
ENT208 Transform your Business with VMware Cloud on AWS
ENT208 Transform your Business with VMware Cloud on AWSENT208 Transform your Business with VMware Cloud on AWS
ENT208 Transform your Business with VMware Cloud on AWSAmazon Web Services
 
ENT307 Move your Desktops and Apps to AWS with Amazon WorkSpaces and AppStre...
 ENT307 Move your Desktops and Apps to AWS with Amazon WorkSpaces and AppStre... ENT307 Move your Desktops and Apps to AWS with Amazon WorkSpaces and AppStre...
ENT307 Move your Desktops and Apps to AWS with Amazon WorkSpaces and AppStre...Amazon Web Services
 
Deep Dive - Amazon Relational Database Services_AWSPSSummit_Singapore
Deep Dive - Amazon Relational Database Services_AWSPSSummit_SingaporeDeep Dive - Amazon Relational Database Services_AWSPSSummit_Singapore
Deep Dive - Amazon Relational Database Services_AWSPSSummit_SingaporeAmazon Web Services
 
Earn Your DevOps Black Belt: Deployment Scenarios with AWS CloudFormation (DE...
Earn Your DevOps Black Belt: Deployment Scenarios with AWS CloudFormation (DE...Earn Your DevOps Black Belt: Deployment Scenarios with AWS CloudFormation (DE...
Earn Your DevOps Black Belt: Deployment Scenarios with AWS CloudFormation (DE...Amazon Web Services
 
Use SD-WAN to Manage Your AWS Environment and Branch Office Connectivity (NET...
Use SD-WAN to Manage Your AWS Environment and Branch Office Connectivity (NET...Use SD-WAN to Manage Your AWS Environment and Branch Office Connectivity (NET...
Use SD-WAN to Manage Your AWS Environment and Branch Office Connectivity (NET...Amazon Web Services
 
Another Day in the Life of a Cloud Network Engineer at Netflix (NET312) - AWS...
Another Day in the Life of a Cloud Network Engineer at Netflix (NET312) - AWS...Another Day in the Life of a Cloud Network Engineer at Netflix (NET312) - AWS...
Another Day in the Life of a Cloud Network Engineer at Netflix (NET312) - AWS...Amazon Web Services
 
Computação de Alta Performance (HPC) na AWS - CMP201 - Sao Paulo Summit
Computação de Alta Performance (HPC) na AWS -  CMP201 - Sao Paulo SummitComputação de Alta Performance (HPC) na AWS -  CMP201 - Sao Paulo Summit
Computação de Alta Performance (HPC) na AWS - CMP201 - Sao Paulo SummitAmazon Web Services
 
Optimizing Network Performance for Amazon EC2 Instances (CMP308-R1) - AWS re:...
Optimizing Network Performance for Amazon EC2 Instances (CMP308-R1) - AWS re:...Optimizing Network Performance for Amazon EC2 Instances (CMP308-R1) - AWS re:...
Optimizing Network Performance for Amazon EC2 Instances (CMP308-R1) - AWS re:...Amazon Web Services
 
Hands-On: Deploy Remote Graphics Desktops for Content Production (CMP422) - A...
Hands-On: Deploy Remote Graphics Desktops for Content Production (CMP422) - A...Hands-On: Deploy Remote Graphics Desktops for Content Production (CMP422) - A...
Hands-On: Deploy Remote Graphics Desktops for Content Production (CMP422) - A...Amazon Web Services
 
Supercharge VMware Cloud on AWS Environments with Native AWS Services (CMP360...
Supercharge VMware Cloud on AWS Environments with Native AWS Services (CMP360...Supercharge VMware Cloud on AWS Environments with Native AWS Services (CMP360...
Supercharge VMware Cloud on AWS Environments with Native AWS Services (CMP360...Amazon Web Services
 
Using Containers and Serverless to Deploy Microservices
Using Containers and Serverless to Deploy MicroservicesUsing Containers and Serverless to Deploy Microservices
Using Containers and Serverless to Deploy MicroservicesAmazon Web Services
 
DevSecOps 的規模化實踐 (Level: 300-400)
DevSecOps 的規模化實踐 (Level: 300-400)DevSecOps 的規模化實踐 (Level: 300-400)
DevSecOps 的規模化實踐 (Level: 300-400)Amazon Web Services
 
Reuters Lives: Scaling & Monitoring Live Video in the Cloud (DEV316-S) - AWS ...
Reuters Lives: Scaling & Monitoring Live Video in the Cloud (DEV316-S) - AWS ...Reuters Lives: Scaling & Monitoring Live Video in the Cloud (DEV316-S) - AWS ...
Reuters Lives: Scaling & Monitoring Live Video in the Cloud (DEV316-S) - AWS ...Amazon Web Services
 
Migrating to VMware on AWS as the First Step Towards the AWS Cloud (GPSCT206)...
Migrating to VMware on AWS as the First Step Towards the AWS Cloud (GPSCT206)...Migrating to VMware on AWS as the First Step Towards the AWS Cloud (GPSCT206)...
Migrating to VMware on AWS as the First Step Towards the AWS Cloud (GPSCT206)...Amazon Web Services
 
Accelerating Development Using Custom Hardware Accelerations with Amazon EC2 ...
Accelerating Development Using Custom Hardware Accelerations with Amazon EC2 ...Accelerating Development Using Custom Hardware Accelerations with Amazon EC2 ...
Accelerating Development Using Custom Hardware Accelerations with Amazon EC2 ...Amazon Web Services
 

What's hot (20)

OTT 成功的關鍵:打造影劇品質監控儀表板 (Level: 200)
OTT 成功的關鍵:打造影劇品質監控儀表板 (Level: 200)OTT 成功的關鍵:打造影劇品質監控儀表板 (Level: 200)
OTT 成功的關鍵:打造影劇品質監控儀表板 (Level: 200)
 
Studio in the Cloud: Producing Content on AWS (MAE202) - AWS re:Invent 2018
Studio in the Cloud: Producing Content on AWS (MAE202) - AWS re:Invent 2018Studio in the Cloud: Producing Content on AWS (MAE202) - AWS re:Invent 2018
Studio in the Cloud: Producing Content on AWS (MAE202) - AWS re:Invent 2018
 
AWS 微服務中的 Container 選項比較 (Level 400)
AWS 微服務中的 Container 選項比較   (Level 400)AWS 微服務中的 Container 選項比較   (Level 400)
AWS 微服務中的 Container 選項比較 (Level 400)
 
Networking for VMware Cloud on AWS (NET307-R1) - AWS re:Invent 2018
Networking for VMware Cloud on AWS (NET307-R1) - AWS re:Invent 2018Networking for VMware Cloud on AWS (NET307-R1) - AWS re:Invent 2018
Networking for VMware Cloud on AWS (NET307-R1) - AWS re:Invent 2018
 
ENT208 Transform your Business with VMware Cloud on AWS
ENT208 Transform your Business with VMware Cloud on AWSENT208 Transform your Business with VMware Cloud on AWS
ENT208 Transform your Business with VMware Cloud on AWS
 
Amazon Aurora 深度探討
Amazon Aurora 深度探討Amazon Aurora 深度探討
Amazon Aurora 深度探討
 
ENT307 Move your Desktops and Apps to AWS with Amazon WorkSpaces and AppStre...
 ENT307 Move your Desktops and Apps to AWS with Amazon WorkSpaces and AppStre... ENT307 Move your Desktops and Apps to AWS with Amazon WorkSpaces and AppStre...
ENT307 Move your Desktops and Apps to AWS with Amazon WorkSpaces and AppStre...
 
Deep Dive - Amazon Relational Database Services_AWSPSSummit_Singapore
Deep Dive - Amazon Relational Database Services_AWSPSSummit_SingaporeDeep Dive - Amazon Relational Database Services_AWSPSSummit_Singapore
Deep Dive - Amazon Relational Database Services_AWSPSSummit_Singapore
 
Earn Your DevOps Black Belt: Deployment Scenarios with AWS CloudFormation (DE...
Earn Your DevOps Black Belt: Deployment Scenarios with AWS CloudFormation (DE...Earn Your DevOps Black Belt: Deployment Scenarios with AWS CloudFormation (DE...
Earn Your DevOps Black Belt: Deployment Scenarios with AWS CloudFormation (DE...
 
Use SD-WAN to Manage Your AWS Environment and Branch Office Connectivity (NET...
Use SD-WAN to Manage Your AWS Environment and Branch Office Connectivity (NET...Use SD-WAN to Manage Your AWS Environment and Branch Office Connectivity (NET...
Use SD-WAN to Manage Your AWS Environment and Branch Office Connectivity (NET...
 
Another Day in the Life of a Cloud Network Engineer at Netflix (NET312) - AWS...
Another Day in the Life of a Cloud Network Engineer at Netflix (NET312) - AWS...Another Day in the Life of a Cloud Network Engineer at Netflix (NET312) - AWS...
Another Day in the Life of a Cloud Network Engineer at Netflix (NET312) - AWS...
 
Computação de Alta Performance (HPC) na AWS - CMP201 - Sao Paulo Summit
Computação de Alta Performance (HPC) na AWS -  CMP201 - Sao Paulo SummitComputação de Alta Performance (HPC) na AWS -  CMP201 - Sao Paulo Summit
Computação de Alta Performance (HPC) na AWS - CMP201 - Sao Paulo Summit
 
Optimizing Network Performance for Amazon EC2 Instances (CMP308-R1) - AWS re:...
Optimizing Network Performance for Amazon EC2 Instances (CMP308-R1) - AWS re:...Optimizing Network Performance for Amazon EC2 Instances (CMP308-R1) - AWS re:...
Optimizing Network Performance for Amazon EC2 Instances (CMP308-R1) - AWS re:...
 
Hands-On: Deploy Remote Graphics Desktops for Content Production (CMP422) - A...
Hands-On: Deploy Remote Graphics Desktops for Content Production (CMP422) - A...Hands-On: Deploy Remote Graphics Desktops for Content Production (CMP422) - A...
Hands-On: Deploy Remote Graphics Desktops for Content Production (CMP422) - A...
 
Supercharge VMware Cloud on AWS Environments with Native AWS Services (CMP360...
Supercharge VMware Cloud on AWS Environments with Native AWS Services (CMP360...Supercharge VMware Cloud on AWS Environments with Native AWS Services (CMP360...
Supercharge VMware Cloud on AWS Environments with Native AWS Services (CMP360...
 
Using Containers and Serverless to Deploy Microservices
Using Containers and Serverless to Deploy MicroservicesUsing Containers and Serverless to Deploy Microservices
Using Containers and Serverless to Deploy Microservices
 
DevSecOps 的規模化實踐 (Level: 300-400)
DevSecOps 的規模化實踐 (Level: 300-400)DevSecOps 的規模化實踐 (Level: 300-400)
DevSecOps 的規模化實踐 (Level: 300-400)
 
Reuters Lives: Scaling & Monitoring Live Video in the Cloud (DEV316-S) - AWS ...
Reuters Lives: Scaling & Monitoring Live Video in the Cloud (DEV316-S) - AWS ...Reuters Lives: Scaling & Monitoring Live Video in the Cloud (DEV316-S) - AWS ...
Reuters Lives: Scaling & Monitoring Live Video in the Cloud (DEV316-S) - AWS ...
 
Migrating to VMware on AWS as the First Step Towards the AWS Cloud (GPSCT206)...
Migrating to VMware on AWS as the First Step Towards the AWS Cloud (GPSCT206)...Migrating to VMware on AWS as the First Step Towards the AWS Cloud (GPSCT206)...
Migrating to VMware on AWS as the First Step Towards the AWS Cloud (GPSCT206)...
 
Accelerating Development Using Custom Hardware Accelerations with Amazon EC2 ...
Accelerating Development Using Custom Hardware Accelerations with Amazon EC2 ...Accelerating Development Using Custom Hardware Accelerations with Amazon EC2 ...
Accelerating Development Using Custom Hardware Accelerations with Amazon EC2 ...
 

Similar to Set Up a Million-Core Cluster to Accelerate HPC Workloads (CMP404) - AWS re:Invent 2018

Building Serverless ETL Pipelines
Building Serverless ETL PipelinesBuilding Serverless ETL Pipelines
Building Serverless ETL PipelinesAmazon Web Services
 
Rightsizing Your Silicon Design Environment: Elastic Clusters for EDA Workloa...
Rightsizing Your Silicon Design Environment: Elastic Clusters for EDA Workloa...Rightsizing Your Silicon Design Environment: Elastic Clusters for EDA Workloa...
Rightsizing Your Silicon Design Environment: Elastic Clusters for EDA Workloa...Amazon Web Services
 
Getting Started with ARM-Based EC2 A1 Instances - CMP302 - Anaheim AWS Summit
Getting Started with ARM-Based EC2 A1 Instances - CMP302 - Anaheim AWS SummitGetting Started with ARM-Based EC2 A1 Instances - CMP302 - Anaheim AWS Summit
Getting Started with ARM-Based EC2 A1 Instances - CMP302 - Anaheim AWS SummitAmazon Web Services
 
Use HPC on AWS for Physics-Based Simulation, ML, and Statistics in CAE (CMP32...
Use HPC on AWS for Physics-Based Simulation, ML, and Statistics in CAE (CMP32...Use HPC on AWS for Physics-Based Simulation, ML, and Statistics in CAE (CMP32...
Use HPC on AWS for Physics-Based Simulation, ML, and Statistics in CAE (CMP32...Amazon Web Services
 
Real-Time Web Analytics with Amazon Kinesis Data Analytics (ADT401) - AWS re:...
Real-Time Web Analytics with Amazon Kinesis Data Analytics (ADT401) - AWS re:...Real-Time Web Analytics with Amazon Kinesis Data Analytics (ADT401) - AWS re:...
Real-Time Web Analytics with Amazon Kinesis Data Analytics (ADT401) - AWS re:...Amazon Web Services
 
AWS Snowball Edge and AWS Greengrass for Fun and Profit (STG388) - AWS re:Inv...
AWS Snowball Edge and AWS Greengrass for Fun and Profit (STG388) - AWS re:Inv...AWS Snowball Edge and AWS Greengrass for Fun and Profit (STG388) - AWS re:Inv...
AWS Snowball Edge and AWS Greengrass for Fun and Profit (STG388) - AWS re:Inv...Amazon Web Services
 
Performing serverless analytics in AWS Glue - ADB202 - Chicago AWS Summit
Performing serverless analytics in AWS Glue - ADB202 - Chicago AWS SummitPerforming serverless analytics in AWS Glue - ADB202 - Chicago AWS Summit
Performing serverless analytics in AWS Glue - ADB202 - Chicago AWS SummitAmazon Web Services
 
Observability for Modern Applications (CON306-R1) - AWS re:Invent 2018
Observability for Modern Applications (CON306-R1) - AWS re:Invent 2018Observability for Modern Applications (CON306-R1) - AWS re:Invent 2018
Observability for Modern Applications (CON306-R1) - AWS re:Invent 2018Amazon Web Services
 
How Nubank Automates Fine-Grained Security with IAM, AWS Lambda, and CI/CD (F...
How Nubank Automates Fine-Grained Security with IAM, AWS Lambda, and CI/CD (F...How Nubank Automates Fine-Grained Security with IAM, AWS Lambda, and CI/CD (F...
How Nubank Automates Fine-Grained Security with IAM, AWS Lambda, and CI/CD (F...Amazon Web Services
 
Become a Serverless Black Belt - Optimizing Your Serverless Applications - AW...
Become a Serverless Black Belt - Optimizing Your Serverless Applications - AW...Become a Serverless Black Belt - Optimizing Your Serverless Applications - AW...
Become a Serverless Black Belt - Optimizing Your Serverless Applications - AW...Amazon Web Services
 
How UCSD Simplified Data Protection with Rubrik and AWS (STG207-S) - AWS re:I...
How UCSD Simplified Data Protection with Rubrik and AWS (STG207-S) - AWS re:I...How UCSD Simplified Data Protection with Rubrik and AWS (STG207-S) - AWS re:I...
How UCSD Simplified Data Protection with Rubrik and AWS (STG207-S) - AWS re:I...Amazon Web Services
 
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...Amazon Web Services
 
Breaking the Monolith road to containers.pdf
Breaking the Monolith road to containers.pdfBreaking the Monolith road to containers.pdf
Breaking the Monolith road to containers.pdfAmazon Web Services
 
Easy and Efficient Batch Computing on AWS
Easy and Efficient Batch Computing on AWSEasy and Efficient Batch Computing on AWS
Easy and Efficient Batch Computing on AWSAmazon Web Services
 
Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018
Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018
Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018Amazon Web Services
 
Building a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon RedshiftBuilding a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon RedshiftAmazon Web Services
 
What Can Your Logs Tell You? (ANT215) - AWS re:Invent 2018
What Can Your Logs Tell You? (ANT215) - AWS re:Invent 2018What Can Your Logs Tell You? (ANT215) - AWS re:Invent 2018
What Can Your Logs Tell You? (ANT215) - AWS re:Invent 2018Amazon Web Services
 
Loading Data into Amazon Redshift
Loading Data into Amazon RedshiftLoading Data into Amazon Redshift
Loading Data into Amazon RedshiftAmazon Web Services
 
利用 Fargate - 無伺服器的容器環境建置高可用的系統
利用 Fargate - 無伺服器的容器環境建置高可用的系統利用 Fargate - 無伺服器的容器環境建置高可用的系統
利用 Fargate - 無伺服器的容器環境建置高可用的系統Amazon Web Services
 

Similar to Set Up a Million-Core Cluster to Accelerate HPC Workloads (CMP404) - AWS re:Invent 2018 (20)

Building Serverless ETL Pipelines
Building Serverless ETL PipelinesBuilding Serverless ETL Pipelines
Building Serverless ETL Pipelines
 
Rightsizing Your Silicon Design Environment: Elastic Clusters for EDA Workloa...
Rightsizing Your Silicon Design Environment: Elastic Clusters for EDA Workloa...Rightsizing Your Silicon Design Environment: Elastic Clusters for EDA Workloa...
Rightsizing Your Silicon Design Environment: Elastic Clusters for EDA Workloa...
 
Getting Started with ARM-Based EC2 A1 Instances - CMP302 - Anaheim AWS Summit
Getting Started with ARM-Based EC2 A1 Instances - CMP302 - Anaheim AWS SummitGetting Started with ARM-Based EC2 A1 Instances - CMP302 - Anaheim AWS Summit
Getting Started with ARM-Based EC2 A1 Instances - CMP302 - Anaheim AWS Summit
 
Use HPC on AWS for Physics-Based Simulation, ML, and Statistics in CAE (CMP32...
Use HPC on AWS for Physics-Based Simulation, ML, and Statistics in CAE (CMP32...Use HPC on AWS for Physics-Based Simulation, ML, and Statistics in CAE (CMP32...
Use HPC on AWS for Physics-Based Simulation, ML, and Statistics in CAE (CMP32...
 
Real-Time Web Analytics with Amazon Kinesis Data Analytics (ADT401) - AWS re:...
Real-Time Web Analytics with Amazon Kinesis Data Analytics (ADT401) - AWS re:...Real-Time Web Analytics with Amazon Kinesis Data Analytics (ADT401) - AWS re:...
Real-Time Web Analytics with Amazon Kinesis Data Analytics (ADT401) - AWS re:...
 
AWS Snowball Edge and AWS Greengrass for Fun and Profit (STG388) - AWS re:Inv...
AWS Snowball Edge and AWS Greengrass for Fun and Profit (STG388) - AWS re:Inv...AWS Snowball Edge and AWS Greengrass for Fun and Profit (STG388) - AWS re:Inv...
AWS Snowball Edge and AWS Greengrass for Fun and Profit (STG388) - AWS re:Inv...
 
Performing serverless analytics in AWS Glue - ADB202 - Chicago AWS Summit
Performing serverless analytics in AWS Glue - ADB202 - Chicago AWS SummitPerforming serverless analytics in AWS Glue - ADB202 - Chicago AWS Summit
Performing serverless analytics in AWS Glue - ADB202 - Chicago AWS Summit
 
Observability for Modern Applications (CON306-R1) - AWS re:Invent 2018
Observability for Modern Applications (CON306-R1) - AWS re:Invent 2018Observability for Modern Applications (CON306-R1) - AWS re:Invent 2018
Observability for Modern Applications (CON306-R1) - AWS re:Invent 2018
 
How Nubank Automates Fine-Grained Security with IAM, AWS Lambda, and CI/CD (F...
How Nubank Automates Fine-Grained Security with IAM, AWS Lambda, and CI/CD (F...How Nubank Automates Fine-Grained Security with IAM, AWS Lambda, and CI/CD (F...
How Nubank Automates Fine-Grained Security with IAM, AWS Lambda, and CI/CD (F...
 
Become a Serverless Black Belt - Optimizing Your Serverless Applications - AW...
Become a Serverless Black Belt - Optimizing Your Serverless Applications - AW...Become a Serverless Black Belt - Optimizing Your Serverless Applications - AW...
Become a Serverless Black Belt - Optimizing Your Serverless Applications - AW...
 
How UCSD Simplified Data Protection with Rubrik and AWS (STG207-S) - AWS re:I...
How UCSD Simplified Data Protection with Rubrik and AWS (STG207-S) - AWS re:I...How UCSD Simplified Data Protection with Rubrik and AWS (STG207-S) - AWS re:I...
How UCSD Simplified Data Protection with Rubrik and AWS (STG207-S) - AWS re:I...
 
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
 
Breaking the Monolith road to containers.pdf
Breaking the Monolith road to containers.pdfBreaking the Monolith road to containers.pdf
Breaking the Monolith road to containers.pdf
 
Easy and Efficient Batch Computing on AWS
Easy and Efficient Batch Computing on AWSEasy and Efficient Batch Computing on AWS
Easy and Efficient Batch Computing on AWS
 
Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018
Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018
Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018
 
Building a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon RedshiftBuilding a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon Redshift
 
Loading Data into Redshift
Loading Data into RedshiftLoading Data into Redshift
Loading Data into Redshift
 
What Can Your Logs Tell You? (ANT215) - AWS re:Invent 2018
What Can Your Logs Tell You? (ANT215) - AWS re:Invent 2018What Can Your Logs Tell You? (ANT215) - AWS re:Invent 2018
What Can Your Logs Tell You? (ANT215) - AWS re:Invent 2018
 
Loading Data into Amazon Redshift
Loading Data into Amazon RedshiftLoading Data into Amazon Redshift
Loading Data into Amazon Redshift
 
利用 Fargate - 無伺服器的容器環境建置高可用的系統
利用 Fargate - 無伺服器的容器環境建置高可用的系統利用 Fargate - 無伺服器的容器環境建置高可用的系統
利用 Fargate - 無伺服器的容器環境建置高可用的系統
 

Set Up a Million-Core Cluster to Accelerate HPC Workloads (CMP404) - AWS re:Invent 2018

  • 2. Set Up a Million-Core Cluster to Accelerate HPC Workloads (CMP404). Alex Emilcar, Solutions Architect, Amazon Web Services; Aydn Bekirov, Technical Account Manager, Amazon Web Services. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 3. Workshop architecture (diagram, steps numbered 1 to 5)
      • Components: Submit Job; Job Definition; Job Queue; AWS Batch Scheduler; ECS RunTask; Spot Fleet (Managed Compute Environment); Logs + Metrics
      • Storage: S3 bucket for job inputs/outputs; S3 bucket with container prerequisites
      • Region: EU-West-1 (Ireland), 3 AZs
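
The submit step in this architecture can be sketched with Boto3, which the talk names as one of its standard tools. This is a minimal sketch, not the workshop's code: the job name, queue name, job definition name, bucket URIs, and environment variable names are all hypothetical placeholders. The request is built separately from the API call so it can be inspected without AWS credentials.

```python
def build_submit_request(job_name, queue, job_definition, env, array_size=None):
    """Build keyword arguments for the AWS Batch SubmitJob API."""
    request = {
        "jobName": job_name,
        "jobQueue": queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            "environment": [
                {"name": key, "value": str(value)}
                for key, value in sorted(env.items())
            ]
        },
    }
    if array_size is not None:
        # An array job fans one submission out into many child jobs.
        request["arrayProperties"] = {"size": array_size}
    return request


def submit(request):
    """Send the request to AWS Batch (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    client = boto3.client("batch", region_name="eu-west-1")
    return client.submit_job(**request)["jobId"]


# Hypothetical names, for illustration only.
request = build_submit_request(
    job_name="hdd-sim-0001",
    queue="workshop-queue",
    job_definition="workshop-jobdef",
    env={
        "INPUT_S3_URI": "s3://example-input-bucket/case-0001.tar.gz",
        "OUTPUT_S3_URI": "s3://example-output-bucket/results/",
    },
    array_size=1000,
)
```

Calling `submit(request)` would place the job on the queue; the AWS Batch scheduler then starts it on the compute environment.
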
  • 4. Lessons from large scale runs (>100K vCPUs)
  • 5. Extended AWS Team: Anh Tran, Sr HPC Specialized SA; Matt Easton, Enterprise Account Manager; Andrea Rodolico, Principal HPC BDM; Pierre-Yves Aquilanti, Sr HPC Specialized SA
  • 6. CMP404: Set Up a Million-Core Cluster to Accelerate HPC Workloads. Hiroshi Kobayashi, Senior Solution Architect. 26th November, 2018 @ AWS re:Invent. © 2018 Western Digital Corporation or its affiliates. All rights reserved.
  • 7. Safe Harbor | Disclaimers: Forward-Looking Statements. This presentation contains certain forward-looking statements that involve risks and uncertainties, including, but not limited to, statements regarding HDD technology progression, need for energy-assisted recording, potential growth, and trends. Forward-looking statements should not be read as a guarantee of future performance or results, and will not necessarily be accurate indications of the times at, or by, which such performance or results will be achieved, if at all. Forward-looking statements are subject to risks and uncertainties that could cause actual performance or results to differ materially from those expressed in or suggested by the forward-looking statements. Key risks and uncertainties include volatility in global economic conditions, actions by competitors, business conditions, growth in our markets, pricing trends and fluctuations in average selling prices, and other risks and uncertainties listed in our filings with the Securities and Exchange Commission (the “SEC”) and available on the SEC’s website at www.sec.gov, including our most recently filed periodic report, to which your attention is directed. We do not undertake any obligation to publicly update or revise any forward-looking statement, whether as a result of new information, future developments or otherwise, except as otherwise required by law.
  • 8. Agenda
      1. Who am I?
      2. Western Digital
      3. AWS Batch architecture
      4. Design principles
      5. Conclusion
  • 9. Who am I?
      - Name: Hiroshi Kobayashi
        Company: Western Digital
        Team: Global Engineering Services
        Location: Japan
      - Favorites:
          AWS services: Amazon Simple Storage Service (Amazon S3), AWS Batch
          Community: Japan AWS User Group (JAWS-UG)
          Tools: AWS Command Line Interface (AWS CLI), Ansible, Jenkins
  • 10. Brand architecture: corporate and product brands (diagram; brand logos not recoverable)
      • Rows: corporate brand; product brands; target audience
      • Western Digital branded B2B product lines: OpenFlex™, Ultrastar®, iNAND®, …
      • Target audiences shown for the product brands: Consumer (four brands), B2B (one brand)
  • 11. “Moore’s Law” for HDDs (chart of areal density in Gbit/in², from IBM RAMAC through IBM 3380 to He10)
      • 60 years → 500 million X increase in areal density
      • Growth-rate labels, in extraction order: 10~25% CGR, 60% CGR, 80% CGR, 30% CGR, 70% CGR, 30% CGR
      • Major enablers: ferrite head; thin film inductive reader; thin film media; AMR reader; PRML channel; AFC media; GMR reader; EPRML channel; TMR reader; LDPC channel; PMR (2006); He sealed drive; energy assist
  • 12. Why energy assisted recording is required
      • Scaling beyond Perpendicular Magnetic Recording (PMR) requires Energy Assisted Recording
      • Diagram labels: scaling capacity; writeability at high TPI (tracks per inch); switching challenge; magnetic stability; energy assist (microwave, laser heat)
  • 13. Why cloud?
      • The “width” of the cloud far outsizes any internal resource (throughput limited)
      • Advantages
          – Scale
          – Newest hardware available
      • Concerns and resolutions
          – Cost for computing, storage and data transfer: Spot Instances, Amazon S3, compress everything
          – Security: Amazon Virtual Private Cloud (Amazon VPC), AWS Direct Connect, Amazon S3 VPC endpoints
      • In a nutshell: the throughput-to-cost ratio is the key metric
  • 14. AWS Batch + Jenkins Architecture
      • Docker image build process is automated by Jenkins
      • Fetch & Run
          – Pull input files from Amazon S3
          – Run simulations
          – Put output files to Amazon S3
      • Responsibility
          – User = Dockerfile, job queue, job definition, Jenkins
          – IT Admin = AMI, compute environment, Jenkins
      • Diagram labels: users, Code Repo, Jenkins, Spot Fleet; Push, Hook, Push, Build, Submit job, Input/Output, Pull
      • The Jenkins logo is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License (https://creativecommons.org/licenses/by-sa/3.0/) from the Jenkins project (https://jenkins.io).
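
The Fetch & Run flow on this slide (pull inputs, run, put outputs) can be sketched as a small driver. This is an illustrative sketch, not the presenter's script: the three steps are injected as plain functions so the flow can be exercised without Amazon S3 or a real simulator, and the choice to skip the upload when the run fails is an assumption of this example.

```python
import shutil
import tempfile
from pathlib import Path


def fetch_and_run(fetch, run, put):
    """Pull inputs, run the simulation, push outputs; return the exit code.

    fetch(workdir), run(workdir) -> int and put(workdir) are supplied by the
    caller. In the talk's setup, fetch and put would talk to Amazon S3.
    """
    workdir = Path(tempfile.mkdtemp(prefix="job-"))
    try:
        fetch(workdir)            # 1. pull input files
        exit_code = run(workdir)  # 2. run the simulation
        if exit_code == 0:
            put(workdir)          # 3. put output files
        return exit_code
    finally:
        # Remove the run directory so local storage stays minimal.
        shutil.rmtree(workdir, ignore_errors=True)


# Local stand-ins for the three steps (hypothetical, for illustration).
def demo_fetch(workdir):
    (workdir / "input.txt").write_text("3 4")


def demo_run(workdir):
    a, b = map(int, (workdir / "input.txt").read_text().split())
    (workdir / "output.txt").write_text(str(a + b))
    return 0


uploaded = {}


def demo_put(workdir):
    uploaded["output.txt"] = (workdir / "output.txt").read_text()


code = fetch_and_run(demo_fetch, demo_run, demo_put)
```
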
  • 15. Design principles: design the workflow for failure
      • Minimize cost
          – Compress everything to reduce network and storage cost
          – Minimize “local” storage
          – Minimize Amazon S3 access
      • Assume Spot Instances can go away without warning
          – Keep compute short (<15 minutes) or support “checkpoints”
          – Save results as soon as possible to a persistent store like Amazon S3
      • Keep it simple, try to standardize
          – Easy to debug
          – AWS CLI, Bash v4 and Boto3
      • Computational characteristics
          – Single threaded and pleasingly parallel, 1:45-2:15 job durations, 0.2-2 GB memory, minimal network
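
The “checkpoints” idea on this slide can be illustrated with a resumable loop. This is a generic sketch under stated assumptions, not the presenter's implementation: the state layout, the file name, and the `stop_after` parameter (which simulates an interruption) are invented for the example, and the checkpoint is written to local disk only. On a Spot Instance the file would also need to be copied to a persistent store such as Amazon S3, because local disk goes away with the instance.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    try:
        with open(path) as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {"next_step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write the state atomically so an interruption never leaves a partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp_path, path)


def run_steps(path, n_steps, stop_after=None):
    """Run steps from the last checkpoint; stop_after simulates an interruption."""
    state = load_checkpoint(path)
    done_this_run = 0
    while state["next_step"] < n_steps:
        if stop_after is not None and done_this_run == stop_after:
            return state  # simulated interruption
        state["total"] += state["next_step"]  # stand-in for real work
        state["next_step"] += 1
        save_checkpoint(path, state)
        done_this_run += 1
    return state


checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
first = run_steps(checkpoint_path, n_steps=10, stop_after=4)  # interrupted
second = run_steps(checkpoint_path, n_steps=10)               # resumes at step 4
```

The second call repeats none of the first four steps, which is the property that makes an interruption cheap.
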
  • 16. Simple run script
      • Construct a generic run script at submission time
          – Environment variables: set in the Docker image or by the submit-job command
          – Set up the run directory and input files
          – Run the job, switching the executable to match the AVX generation (40% faster per job)
          – Push results to Amazon S3
          – Clean up the run directory
          – Exit code
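
The step of switching the executable by AVX generation could look roughly like the sketch below. The slides do not show how the presenter detected the AVX level or named the binaries, so everything here is an assumption: the `SIM_EXECUTABLE` variable, the `_avx2`-style suffixes, and reading the CPU flags from `/proc/cpuinfo` on Linux.

```python
import os


def avx_generation(cpu_flags):
    """Return the newest AVX level present in a set of CPU flag names."""
    if "avx512f" in cpu_flags:
        return "avx512"
    if "avx2" in cpu_flags:
        return "avx2"
    if "avx" in cpu_flags:
        return "avx"
    return "generic"


def read_cpu_flags(path="/proc/cpuinfo"):
    """Read CPU flags from /proc/cpuinfo; return an empty set if unavailable."""
    try:
        with open(path) as fh:
            for line in fh:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()


def pick_executable(env, cpu_flags):
    """Choose the binary name for this host from environment settings."""
    base = env.get("SIM_EXECUTABLE", "simulate")  # hypothetical variable name
    level = avx_generation(cpu_flags)
    return base if level == "generic" else f"{base}_{level}"


chosen = pick_executable(dict(os.environ), read_cpu_flags())
```

A run script built this way would then start `chosen` with `subprocess.run` and pass its return code to `sys.exit`, so that AWS Batch can tell a failed job from a successful one.
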
  • 17. Optimize Amazon S3 access
      • Construct a generic run script at submission time
          – Compress/decompress automatically: reduces storage and network cost
          – Expect problems: every AWS CLI command is wrapped in an ‘until’ construct
          – Add a random prefix to object keys to increase Amazon S3 performance
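
The three techniques on this slide can be sketched in Python. The talk used a Bash ‘until’ loop around AWS CLI commands; the retry helper below is a Python analogue of that idea, with a backoff delay added as an assumption. The key helper uses a hash-derived prefix rather than a truly random one so that a key can be recomputed from the file name; that substitution is a choice made for this example, not something the slides state.

```python
import gzip
import hashlib
import time


def retry_until_success(action, delay_s=1.0, max_delay_s=60.0, sleep=time.sleep):
    """Call action() until it stops raising, like a shell `until` loop.

    Retries forever, as `until` does; a production version may want a cap.
    """
    while True:
        try:
            return action()
        except Exception:
            sleep(delay_s)
            delay_s = min(delay_s * 2, max_delay_s)


def prefixed_key(name, width=4):
    """Prepend a short hash so object keys spread across many prefixes."""
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    return f"{digest[:width]}/{name}"


def compress(data):
    """Gzip bytes before upload to cut storage and transfer cost."""
    return gzip.compress(data)


# A stand-in upload that fails twice before succeeding.
attempts = []


def flaky_upload():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("transient failure")
    return "uploaded"


result = retry_until_success(flaky_upload, sleep=lambda seconds: None)
key = prefixed_key("case-0001.out.gz")
payload = compress(b"0" * 1000)
```
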
  • 18. Save results as soon as they appear
      • Avoid losing computation
          – Use the kernel’s ‘inotify’ and ‘inotify-tools’
          – Only triggered when a file is “closed after write”
          – Spawn compression, the push to Amazon S3, and deletion of the file, so the watcher is unblocked for the next event
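
The pattern on this slide can be sketched by driving `inotifywait` (from inotify-tools) from Python. This is a hedged sketch rather than the presenter's script: the `upload` callable stands in for the push to Amazon S3, the worker count is arbitrary, and staging the compressed copy outside the watched directory is a choice made here so that writing the `.gz` file does not trigger another event. `watch` requires Linux with inotify-tools installed; the demo at the end exercises only the handler.

```python
import gzip
import os
import shutil
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor


def handle_closed_file(path, upload, staging_dir):
    """Compress a finished file, pass it to upload(), then delete both copies."""
    gz_path = os.path.join(staging_dir, os.path.basename(path) + ".gz")
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    upload(gz_path)
    os.remove(gz_path)
    os.remove(path)


def watch(directory, upload):
    """React to each "closed after write" event in directory."""
    cmd = ["inotifywait", "-m", "-q", "-e", "close_write",
           "--format", "%w%f", directory]
    staging_dir = tempfile.mkdtemp(prefix="staging-")
    with ThreadPoolExecutor(max_workers=4) as pool, \
            subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            # Hand off immediately so the loop is free for the next event.
            pool.submit(handle_closed_file, line.rstrip("\n"), upload, staging_dir)


# Exercise the handler without inotify or Amazon S3.
demo_dir = tempfile.mkdtemp()
demo_staging = tempfile.mkdtemp()
result_path = os.path.join(demo_dir, "result.dat")
with open(result_path, "wb") as fh:
    fh.write(b"simulation output")

uploaded = {}


def fake_upload(gz_path):
    with gzip.open(gz_path, "rb") as fh:
        uploaded[os.path.basename(gz_path)] = fh.read()


handle_closed_file(result_path, fake_upload, demo_staging)
```
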
  • 19. Conclusion
      • We containerized our HPC applications
      • We scaled to over 1 million vCPUs
          – Using containers, Spot Fleet and Amazon S3
      • We optimized for the cloud
          – Profiled over 30 instance offerings, narrowing down to medium-sized, AVX-supporting instances
          – 20 days’ worth of work in eight hours at 1M vCPUs
          – 40% faster per job than previously (CPU_Time vs vCPU)
          – Used for real design of future HDD components
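
The headline figures on this slide imply a simple ratio. The arithmetic below assumes that “20 days” is the wall-clock time the same work took before; the slide does not say so explicitly.

```python
baseline_hours = 20 * 24   # "20 days' worth of work"
cloud_hours = 8            # "in eight hours"
vcpus = 1_000_000          # "at 1M vCPUs"

# How much sooner the results arrive, under the assumption above.
wall_clock_speedup = baseline_hours / cloud_hours

# Compute consumed by the run if all vCPUs were busy for the full eight hours.
vcpu_hours = vcpus * cloud_hours

print(wall_clock_speedup)  # 60.0
print(vcpu_hours)          # 8000000
```
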
  • 21. Thank you! Alex Emilcar, Solutions Architect, Amazon Web Services; Aydn Bekirov, Technical Account Manager, Amazon Web Services; Hiroshi Kobayashi, Sr. Solutions Architect, Western Digital