Set Up a Million-Core Cluster to Accelerate HPC Workloads (CMP404) - AWS re:Invent 2018

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Alex Emilcar
Solutions Architect
Amazon Web Services
Aydn Bekirov
Technical Account Manager
Amazon Web Services

Workshop architecture
[Architecture diagram, steps 1-5: Submit Job, Job Queue, AWS Batch Scheduler, ECS RunTask, and a Spot Fleet (Managed Compute Environment) in the EU-West-1 (Ireland) Region across 3 AZs, supported by a Job Definition, an S3 bucket with container prerequisites, an S3 bucket for job inputs/outputs, and Logs + Metrics.]

Lessons from large scale runs (>100K vCPUs)

Extended AWS Team
Anh Tran
Sr HPC Specialized SA
Matt Easton
Enterprise Account Manager
Andrea Rodolico
Principal HPC BDM
Pierre-Yves Aquilanti
Sr HPC Specialized SA

© 2018 Western Digital Corporation or its affiliates. All rights reserved. 11/26/2018
CMP404: Set Up a Million-Core Cluster to Accelerate HPC Workloads
Hiroshi Kobayashi
Senior Solution Architect
26th November, 2018 @ AWS re:Invent

Safe Harbor | Disclaimers
Forward-Looking Statements
This presentation contains certain forward-looking statements that involve risks and uncertainties, including, but not limited to, statements regarding HDD technology progression, need for energy-assisted recording, potential growth, and trends. Forward-looking statements should not be read as a guarantee of future performance or results, and will not necessarily be accurate indications of the times at, or by, which such performance or results will be achieved, if at all. Forward-looking statements are subject to risks and uncertainties that could cause actual performance or results to differ materially from those expressed in or suggested by the forward-looking statements.
Key risks and uncertainties include volatility in global economic conditions, actions by competitors, business conditions, growth in our markets, pricing trends and fluctuations in average selling prices, and other risks and uncertainties listed in our filings with the Securities and Exchange Commission (the “SEC”) and available on the SEC’s website at www.sec.gov, including our most recently filed periodic report, to which your attention is directed. We do not undertake any obligation to publicly update or revise any forward-looking statement, whether as a result of new information, future developments or otherwise, except as otherwise required by law.

Agenda
1. Who am I?
2. Western Digital
3. AWS Batch architecture
4. Design principles
5. Conclusion

Who am I?
---
- Name: Hiroshi Kobayashi
  Company: Western Digital
  Team: Global Engineering Services
  Location: Japan
- Favorites:
  AWS services: Amazon Simple Storage Service (Amazon S3), AWS Batch
  Community: Japan AWS User Group (JAWS-UG)
  Tools: AWS Command Line Interface (AWS CLI), Ansible, Jenkins

Brand architecture
Corporate and product brands
[Figure: the Western Digital corporate brand and its product brands, each mapped to a target audience (Consumer or B2B). The Western Digital branded B2B product lines shown include OpenFlex™, Ultrastar®, iNAND®, …]

Major enablers
“Moore’s Law” for HDDs
60 years: 500 million X increase in areal density
[Chart: areal density (Gbit/in2) over time, from the IBM RAMAC and the IBM 3380 to the He10, with growth segments labeled 10~25%, 60%, 80%, 30%, 70% and 30% CGR. Enablers marked on the chart: ferrite head, thin film inductive reader, thin film media, AMR reader, PRML channel, AFC media, GMR reader, EPRML channel, TMR reader, LDPC channel, PMR (2006), He sealed drive, energy assist.]

Writeability at high TPI (tracks per inch)
Why energy assisted recording is required
Scaling beyond Perpendicular Magnetic Recording (PMR) requires Energy Assisted Recording
[Figure: scaling capacity, the switching challenge, and magnetic stability, with energy assist (microwave or laser heat) as the way forward.]

Why cloud?
• The “width” of the cloud far outsizes any internal resource (throughput limited)
• Advantages
– Scale
– Newest hardware available
• Concerns and resolutions
– Cost for computing, storage and data transfer: Spot Instances, Amazon S3, compress everything
– Security: Amazon Virtual Private Cloud (Amazon VPC), AWS Direct Connect, Amazon S3 VPC endpoints
• In a nutshell: throughput-to-cost ratio is the key metric

AWS Batch + Jenkins
Architecture
• Docker image build process is automated by Jenkins
• Fetch & Run
– Pull input files from Amazon S3
– Run simulations
– Put output files to Amazon S3
• Responsibility
– User = Dockerfile, job queue, job definition, Jenkins
– IT Admin = AMI, compute environment, Jenkins
[Diagram labels: users, Code Repo, Jenkins, Spot Fleet; arrows: Push, Hook, Build, Push, Submit job, Pull, Input/Output.]
The Jenkins logo is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License (https://creativecommons.org/licenses/by-sa/3.0/) from the Jenkins project (https://jenkins.io).
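The Fetch & Run steps listed on this slide might be sketched as follows. This is a minimal sketch under stated assumptions, not the presenters' script: the function name `fetch_and_run`, its arguments, and the shape of the simulation command are hypothetical. The `s3` argument is assumed to be a Boto3 S3 client (for example `boto3.client("s3")`); only its `download_file` and `upload_file` methods are used.

```python
import subprocess
import tempfile
from pathlib import Path


def fetch_and_run(s3, bucket, input_key, output_key, command):
    """Pull one input from S3, run a command on it, push the output back.

    `s3` is assumed to be a Boto3 S3 client, e.g. boto3.client("s3").
    `command` is a hypothetical simulation command; it receives the input
    path and the output path as its last two arguments.
    Returns the command's exit code.
    """
    with tempfile.TemporaryDirectory() as workdir:
        in_path = Path(workdir) / "input"
        out_path = Path(workdir) / "output"

        # 1. Pull input files from Amazon S3.
        s3.download_file(bucket, input_key, str(in_path))

        # 2. Run the simulation.
        result = subprocess.run(list(command) + [str(in_path), str(out_path)])

        # 3. Put output files to Amazon S3 (only if the run succeeded).
        if result.returncode == 0 and out_path.exists():
            s3.upload_file(str(out_path), bucket, output_key)

        return result.returncode
```

Returning the exit code lets the container exit with the same status, so a failed run is visible to the job scheduler. The temporary directory is removed on exit, which keeps local storage use small.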

Design principles
Design the workflow for failure
• Minimize cost
– Compress everything to reduce network and storage cost
– Minimize “local” storage
– Minimize Amazon S3 access
• Assume Spot Instances can go away without warning
– Keep compute short (<15 minutes) or support “checkpoints”
– Save results as soon as possible to a persistent store such as Amazon S3
• Keep it simple, try to standardize
– Easy to debug
– AWS CLI, Bash v4 and Boto3
• Computational characteristics
– Single threaded and pleasingly parallel, 1:45-2:15 job durations, 0.2-2 GB memory, minimal network
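One way to support the “checkpoints” mentioned above is to record progress after every unit of work and resume from that record on restart. The following is a minimal sketch, not the presenters' implementation: the name `run_with_checkpoint` and the JSON checkpoint format are hypothetical, and the checkpoint is a local file. In the workflow described here it would also have to be copied to a persistent store such as Amazon S3, because a reclaimed Spot Instance loses its local disk.

```python
import json
import os


def run_with_checkpoint(steps, checkpoint_path, do_step):
    """Run do_step(i) for i in range(steps), resuming after an interruption.

    The index of the next step is written to `checkpoint_path` after each
    completed step, so a restarted job skips work that already finished.
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_step"]

    for i in range(start, steps):
        do_step(i)
        # Write to a temporary file and rename, so a crash mid-write
        # never leaves a truncated checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_step": i + 1}, f)
        os.replace(tmp_path, checkpoint_path)
```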

Simple run script
• Construct a generic run script at submission time
– Environment variables: set in the Docker image or by the submit-job command
– Set up run directory and input files
– Run job, switching the executable to match the AVX generation (40% faster per job)
– Push results to Amazon S3
– Clean up run directory
– Exit code
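The step of switching the executable by AVX generation could look roughly like this. It is a sketch under assumptions: the slide does not say how the generation is detected, so reading the `flags` line of `/proc/cpuinfo` on Linux is one plausible approach, and the executable names (`sim_avx512`, `sim_avx2`, `sim_avx`, `sim_generic`) are hypothetical placeholders.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def pick_executable(flags):
    """Choose the build that matches the newest AVX generation available.

    The executable names are hypothetical placeholders.
    """
    if "avx512f" in flags:
        return "sim_avx512"
    if "avx2" in flags:
        return "sim_avx2"
    if "avx" in flags:
        return "sim_avx"
    return "sim_generic"
```

On a Linux instance the run script would call `pick_executable(parse_cpu_flags(open("/proc/cpuinfo").read()))` once at startup and launch the returned binary.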

Optimize Amazon S3 access
• Construct a generic run script at submission time
– Compressing and decompressing automatically reduces storage and network cost
– Expect problems: every AWS CLI command is wrapped in an ‘until’ construct
– Add random prefix to increase Amazon S3 performance
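The slide describes these three techniques in terms of the AWS CLI and a shell ‘until’ loop. A rough Python equivalent is sketched below, with hypothetical helper names. One deliberate substitution: the key prefix is derived from a hash of the key rather than drawn at random, so the same key always maps to the same object name. Whether the original workflow used a random or a hash-derived prefix is not stated.

```python
import gzip
import hashlib
import time


def retry_until_success(action, delay_seconds=5.0, max_attempts=None):
    """Call action() until it stops raising, like a shell `until` loop.

    max_attempts=None retries forever, as `until` does; a finite value
    re-raises the last error once the attempts are used up.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return action()
        except Exception:
            if max_attempts is not None and attempt >= max_attempts:
                raise
            time.sleep(delay_seconds)


def prefixed_key(key, prefix_length=4):
    """Prepend a short hash of the key so object names spread across prefixes.

    Assumption: a hash-derived prefix stands in for the "random prefix"
    on the slide, which keeps the mapping reproducible.
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_length]}/{key}"


def compress(data):
    """Gzip a bytes payload before it is uploaded."""
    return gzip.compress(data)
```

A transfer would then be wrapped as `retry_until_success(lambda: upload(prefixed_key(key), compress(data)))`, where `upload` is whatever S3 call the job uses.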

Save results as soon as they appear
• Avoid losing computation
– Use the kernel’s ‘inotify’ and ‘inotify-tools’
– Only triggered when a file is “closed after write”
– Spawn compression, push to Amazon S3 and deletion of the file, to unblock for the next event
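The pattern on this slide can be sketched by reading events from `inotifywait` (part of inotify-tools, Linux only) and handling each finished file. The function names and the `upload` callback are hypothetical. Unlike the slide, which spawns the work so the watcher is free for the next event, this sketch handles files one at a time for brevity.

```python
import gzip
import os
import shutil
import subprocess


def handle_closed_file(path, upload):
    """Compress a finished file, hand it to `upload`, then delete both copies."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    upload(gz_path)      # e.g. a wrapper around an S3 upload
    os.remove(gz_path)
    os.remove(path)      # free local disk for the next result


def watch(directory, upload):
    """Process each file in `directory` once it is closed after writing.

    Requires the `inotifywait` command from inotify-tools.
    """
    proc = subprocess.Popen(
        ["inotifywait", "-m", "-q", "-e", "close_write",
         "--format", "%w%f", directory],
        stdout=subprocess.PIPE, text=True,
    )
    for line in proc.stdout:
        path = line.rstrip("\n")
        # Skip the temporary .gz files this handler itself writes.
        if path.endswith(".gz"):
            continue
        handle_closed_file(path, upload)
```

Reacting to `close_write` rather than to file creation means a result is only picked up once the simulation has finished writing it.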

Conclusion
• We containerized our HPC applications
• We scaled to over 1 million vCPUs
– Using containers, Spot Fleet and Amazon S3
• We optimized for the cloud
– Profiled over 30 instance offerings, narrowing down to medium-sized instances that support AVX
– 20 days’ worth of work in eight hours at 1M vCPUs
– 40% faster per job than previously (CPU_Time vs vCPU)
– Used for real design of future HDD components
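As a quick check on the headline numbers, the arithmetic below works out the wall-clock speedup. It assumes that “20 days” refers to wall-clock time on the previous setup, and it treats the vCPU-hour figure as an upper bound in which all 1M vCPUs are held for the full eight hours. Neither assumption is stated in the slides.

```python
baseline_hours = 20 * 24   # 20 days of work
cloud_hours = 8            # wall-clock time at 1M vCPUs
vcpus = 1_000_000

speedup = baseline_hours / cloud_hours
# Upper bound, assuming all vCPUs were held for the full eight hours.
vcpu_hours = vcpus * cloud_hours

print(f"wall-clock speedup: {speedup:.0f}x")                  # 60x
print(f"vCPU-hours consumed (upper bound): {vcpu_hours:,}")   # 8,000,000
```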

Thank you!
Alex Emilcar
Solutions Architect
Amazon Web Services
Aydn Bekirov
Technical Account Manager
Amazon Web Services
Hiroshi Kobayashi
Sr. Solutions Architect
Western Digital
