ABSTRACT: Perlmutter is the newest supercomputer at Berkeley Lab, California, and features a whopping 35 PB all-flash Lustre file system. We dive into its architecture and show early performance figures, including low-level Lustre tests that achieve over 90% of the theoretical bandwidth of the SSDs, to illustrate how Perlmutter delivers the performance of a burst buffer with the resilience of a scratch file system. We close with performance considerations unique to an all-flash Lustre file system, along with tips on how better I/O patterns can make the most of such a powerful architecture.
BIO: Alberto Chiusole studied Data Science and Scientific Computing in Trieste, where he had the opportunity to spend a few months at CERN, in Geneva, benchmarking CERN's Ceph file system against a classic Lustre file system operated by eXact lab, the HPC consulting company in Trieste he was working for at the time. He then worked as a Storage and I/O Software Engineer at Berkeley Lab, a US national scientific laboratory in California, where he helped scientists improve their I/O patterns and meet their data needs. He now works at Seqera Labs as an HPC DevOps Engineer, focusing on infrastructure support.
1. Architecting a 35 PB distributed parallel file system for science
(formerly) Storage and I/O Software Engineer at NERSC, Berkeley Lab, US
(currently) HPC DevOps Engineer at Seqera Labs, Barcelona
Speck&Tech #53
Trento - May 29, 2023
Alberto Chiusole
2. - (2014-2017) BSc in Information and Business Organization Eng. - U. of Trento
- 5 months exchange student at Technical University of Denmark, Copenhagen
- (2017-2019) MSc in Data Science and Scientific Computing - U. of Trieste
- (2017-2020) HPC Sysadmin and Scientific software developer - eXact Lab, Trieste
- 3 months at CERN, Geneva, to work on Master’s thesis
- Comparison between CephFS at CERN and Lustre FS at eXact lab
- Presented at ISC High Performance in Frankfurt, July 2019
- (2020-2022) Storage and I/O Software Engineer - NERSC, Berkeley Lab, Cal., US
- Worked on Perlmutter and its Lustre FS, the first all-flash 35 PB parallel FS
- (2023 - now) HPC DevOps Engineer - Seqera Labs (remote)
https://www.linkedin.com/in/albertochiusole/
https://bit.ly/Alberto-Chiusole-Scholar
How I ended up working on Supercomputers
3. High Performance Computing (HPC) empowers breakthroughs
- Supercomputers run parallel applications to solve complex problems
- Applications come from all kinds of science
- astrophysics, nuclear physics, molecular design, computational fluid dynamics, nuclear
warhead status simulations, climate and weather forecasts, COVID vaccines (!), to name a few
- Different from grid computing (nodes in HPC are more tightly coupled)
- At massive scale several complex problems appear
- Extremely expensive setups
- Certain labs are a matter of national security (think Men in Black)
Let’s step back: why would anyone need such a FS?
4. So… how do we get there? The hardware
HPC is a combination of advanced hardware and specialized software
5. A Namesake for Remarkable Contributions
Perlmutter is the newest supercomputer at NERSC (Berkeley
Lab, California, US)
Named after Saul Perlmutter, Nobel Prize in Physics (2011)
for his 1998 discovery that the expansion of the universe is accelerating.
He confirmed his observations by running thousands of
simulations at NERSC, and his research team is believed to
have been the first to use supercomputers to analyze and
validate observational data in cosmology.
7. The hardware (Perlmutter)
- Hardware is made of several racks of “blades”
- CPU, GPU and now FPGA-enhanced nodes
- Fast network interconnection
- On PM: Cray (HPE) Slingshot 11
- Single-digit µs latency (~1-2 µs, <10 µs under heavy load)
- Optimized for HPC: offload into silicon
- Mix of Ethernet and InfiniBand protocols over fiber
- InfiniBand cheaper for same performance
- Liquid cooled units (note the colored pipes)
- Requires maintenance (& downtime) to change liquid
- Fast and large file systems
- Different tiers, for different time-scales
Special tiles!
8. The software landscape
- Linux-only world (mainly Red Hat, some SUSE, a few Ubuntu, some custom)
- https://top500.org/statistics/list/
- Parallel programming
- OpenMP for intra-node comm., Message Passing Interface (MPI) for inter-node comm. (see the sketch below)
- Fortran kingdom!
- And C… rarely C++. Python is gaining traction for data analysis and ML/AI steps
- Job schedulers to allocate resources to users
- Slurm (most popular), PBS, Torque, LSF, Moab, Grid Engine, etc
- User requests a certain “portion” of the cluster for their jobs
- Jobs are placed in a queue and wait for enough resources to start
- The scheduler prepares the environment, collects logs, wraps up when jobs are done
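To make the intra-node vs. inter-node split concrete, here is a minimal hybrid MPI + OpenMP sketch (not from the talk; the file name and the srun line are illustrative and assume a Slurm allocation):

```c
/* Minimal hybrid MPI + OpenMP sketch: MPI spans nodes, OpenMP spans cores.
 * Build (assuming an MPI compiler wrapper): mpicc -fopenmp hello.c -o hello
 * Run inside a scheduler allocation, e.g.: srun -N 2 -n 2 -c 8 ./hello
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Request a threading level compatible with OpenMP regions. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    #pragma omp parallel
    {
        printf("rank %d/%d, thread %d/%d\n",
               rank, nranks, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```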
9. - 💿 Storage usage, I/O and data transfer
- Write as little as possible to disk; write smartly (more on this soon); avoid I/O bottlenecks
- 📐 Data locality
- Keep data as much as possible inside the node/rack
- ⚡ Power usage
- Servers use a lot of energy resources
- Perlmutter (US): 2.5 MW at full power – Fugaku (JP): 29.9 MW
- ~830 households at max power (3 kW, the typical contract size in Italy)
- 🥶 Cooling
- Location of data center is important
- Berkeley Lab benefits from the always-cool temperature of the Bay Area (~19 °C max year-round)
- Water is needed: can’t place DCs in deserts
Some of the challenges
10. What is I/O?
- Input/Output: everything that works with data and its storage
- At large scale you need multiple disks/drives and servers to store data
- Synchronization and consistency issues
- Two processes writing to a single file (strong or eventual consistency?)
- A process reading a file just written by another process (cache invalidation)
- A process writing to/reading from a file on a disk that crashed (fault tolerance)
- Duplicating files to increase aggregate read bandwidth
- Data locality: a temporary file may be written to a local FS rather than parallel FS
- Optimizing I/O is crucial
- CPUs work on the order of ns (10⁻⁹ s); network and NVMe work at best on the order of µs (10⁻⁶ s)
- Reducing the time spent in network and I/O phases improves overall compute walltime considerably (see the sketch below)
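As a concrete illustration of the point above, here is a minimal POSIX sketch (illustrative, not from the talk) that aggregates many tiny records into large buffered writes instead of issuing one small write() per record; the file name and sizes are arbitrary:

```c
/* Sketch: aggregate many tiny writes into one large write.
 * Each write() is a syscall and, on a parallel FS, potentially a network
 * round trip; batching records in a user-space buffer amortizes that cost.
 * Record size, record count, and buffer size below are illustrative.
 */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define RECORD_SIZE 64          /* tiny record, e.g. one line of results */
#define N_RECORDS   (1 << 20)
#define BUF_SIZE    (4 << 20)   /* flush in 4 MiB chunks */

int main(void)
{
    char record[RECORD_SIZE];
    char *buf = malloc(BUF_SIZE);
    size_t used = 0;
    int fd = open("results.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0 || buf == NULL)
        return 1;

    memset(record, 'x', sizeof(record));
    for (int i = 0; i < N_RECORDS; i++) {
        if (used + RECORD_SIZE > BUF_SIZE) {     /* buffer full: flush once */
            if (write(fd, buf, used) < 0)
                return 1;
            used = 0;
        }
        memcpy(buf + used, record, RECORD_SIZE); /* cheap in-memory copy */
        used += RECORD_SIZE;
    }
    if (used > 0 && write(fd, buf, used) < 0)    /* final partial flush */
        return 1;

    close(fd);
    free(buf);
    return 0;
}
```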
11. Different file system scopes
The slower the drive, the higher the capacity
- Memory/NVMe drives are blazingly fast,
but they are expensive
- Scratch file systems should only be used
for temporary storage (are purged often)
- Data accessed on a monthly time-scale should be moved to HDD
- Archive data should be moved to tape (it’s like VHS!)
- Movement of data may be enforced or automatic (like
S3 → Glacier)
12. - PM ships with the first all-flash file system in HPC
- 3,480 Samsung PM1733 PCIe NVMe drives (15.36 TB each)
- 3.5 GB/s seq. read, 3.2 GB/s seq. write speed per specifications
- 35 PB of usable POSIX storage (as in 'df -h')
- Directly integrated in the Slingshot compute network
- No need for LNet routers
Perlmutter scratch file system
13. - PM ships with the first all-flash file system in HPC
- 3,480 Samsung PM1733 PCIe NVMe drives (15.36 TB each)
- 3.5 GB/s seq. read, 3.2 GB/s seq. write speed per specifications (quick math below)
- 35 PB of usable POSIX storage (as in 'df -h')
- Directly integrated in the Slingshot compute network
- No need for LNet routers
- Enough to backup The Lord of The Rings trilogy 2.7M times
- Or 152k times for the extended cut in 4k Ultra HD
Perlmutter scratch file system
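Quick back-of-the-envelope math from the per-drive specifications above (raw device aggregate, not the delivered file-system bandwidth): 3,480 drives × 3.5 GB/s ≈ 12.2 TB/s of theoretical sequential read, and 3,480 × 3.2 GB/s ≈ 11.1 TB/s of theoretical sequential write.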
14. Metadata servers (MDS)
- Store the directory structure, file names,
and the layout of each file's objects on the OSSs, etc
- Decide the file layout on OSSs (striping, etc)
- “Metadata” I/O, not bandwidth I/O
Object storage servers (OSS)
- Store chunks of data as binary
- Writes are striped in 1 MiB chunks across OSSs (like a RAID-0); see the sketch below
On PM: 16 MDS, 274 OSS
Parallel and distributed FS: Lustre
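As an illustration of how a layout can be requested explicitly, here is a hedged sketch using liblustreapi's llapi_file_create(); the path, stripe count, and build line are assumptions, and in practice the same layout is usually set from the shell with lfs setstripe:

```c
/* Sketch: pre-create a file with an explicit Lustre stripe layout so large
 * writes are spread over several OSTs. Values are illustrative.
 * Build on a Lustre client (assumption): cc stripe.c -llustreapi
 */
#include <lustre/lustreapi.h>
#include <stdio.h>

int main(void)
{
    const char *path = "/pscratch/demo/output.dat";  /* hypothetical path */

    /* 1 MiB stripe size, start on any OST (-1), stripe over 8 OSTs,
     * default RAID-0 pattern (0). */
    int rc = llapi_file_create(path, 1 << 20, -1, 8, 0);
    if (rc != 0) {
        fprintf(stderr, "llapi_file_create failed: %d\n", rc);
        return 1;
    }
    printf("created %s striped over 8 OSTs\n", path);
    return 0;
}
```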
15. Inside ClusterStor E1000
MDS/OSS unit in the rack: twin servers
- Single-socket AMD Rome (128x PCIe Gen4 lanes)
- Allows switchless design
- 48 lanes for 24x NVMes, 32 lanes for 2x NICs
- Each server responsible for 12 NVMe drives, can
take over the other half if needed
- GridRAID (HPE) + ldiskfs to maximize
performance
- OSS = 8+2+1 RAID6 (GridRAID)
- MDS = 11-way RAID10 (mdraid)
16. Common HPC software used
Several tools are available to ease coding for HPC; they are often intertwined
MPI is the bread and butter for multi-node communication
MPI-IO is its I/O layer, which helps manage files and transfer data (see the sketch below)
- File preallocation, offset management, etc
HDF5 uses MPI/MPI-IO to perform parallel I/O
NetCDF uses HDF5 as its storage format
IOR: a benchmarking tool that generates synthetic I/O patterns resembling those of HPC applications
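A minimal MPI-IO sketch of the collective, shared-file pattern these libraries build on (illustrative, not from the talk; the file name and block size are arbitrary):

```c
/* Sketch: each MPI rank writes its own contiguous block of a shared file
 * using collective MPI-IO, letting the MPI library aggregate requests.
 * Build (assumption): mpicc mpiio_write.c -o mpiio_write
 */
#include <mpi.h>
#include <stdlib.h>

#define BLOCK (4 << 20)   /* 4 MiB per rank, illustrative */

int main(int argc, char **argv)
{
    int rank;
    MPI_File fh;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf = malloc(BLOCK);
    for (int i = 0; i < BLOCK; i++)
        buf[i] = (char)rank;                 /* fill with rank-specific data */

    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: every rank lands at its own offset in the same file. */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * BLOCK, buf, BLOCK,
                          MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```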
18. Metadata performance of Perlmutter
Using IOR in a “production” run
- 230 clients x 6 procs/client = 1380 procs
- 1.6 M files/s created
In a “full-scale” run
- 1382 clients x 2 procs/client = 2764 procs
- 1.3 M files/s deleted
A much smoother user experience than the previous HDD-based Cori file system
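For scale (derived from the numbers above): 1.6 M creates/s over 1,380 processes is roughly 1,160 creates per second per process, and 1.3 M deletes/s over 2,764 processes is roughly 470 deletes per second per process.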
19. Some surprises found during performance evaluation
SSDs slow down with age
- Like “HDD fragmentation”
- -10% write bandwidth after 5x capacity
written to an OST
- An fstrim is enough to fix it (see the sketch below)
- 5x OST size: 665 TB
- 2.2-2.9 PB daily expected writes
- 5x writes ≈ 60-80 days
- The longer you wait, the longer fstrim takes
- Performed nightly to keep performance up
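For the curious, this is roughly what fstrim does under the hood: a FITRIM ioctl on the mounted filesystem, sketched below (the mount point is hypothetical, and on a production OST this runs as the appliance's nightly job, not by hand):

```c
/* Sketch: what fstrim does under the hood. The FITRIM ioctl asks the
 * filesystem to pass discard/TRIM hints down to the SSD for free blocks.
 * Needs root and a mounted filesystem; the mount point is illustrative.
 */
#include <fcntl.h>
#include <linux/fs.h>      /* FITRIM, struct fstrim_range */
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical ldiskfs/ext4 mount point of one OST. */
    int fd = open("/mnt/ost0000", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct fstrim_range range = {
        .start  = 0,
        .len    = UINT64_MAX,   /* trim the whole filesystem */
        .minlen = 0,
    };

    if (ioctl(fd, FITRIM, &range) < 0) {
        perror("FITRIM");
        return 1;
    }
    printf("trimmed %llu bytes\n", (unsigned long long)range.len);
    close(fd);
    return 0;
}
```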
20. Thanks! Questions?
By the way, I use arch
PS: Seqera Labs is hiring! seqera.io/careers
This material is based upon work supported by the U.S. Department of Energy, Office of Science, under contract
DE-AC02-05CH11231. This research used resources and data generated from resources of the National Energy Research Scientific
Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under
Contract No. DE-AC02-05CH11231.