
Lustre, RoCE, and MAN

In this deck from the DDN User Group at ISC 2019, Marek Magryś from Cyfronet presents: Lustre, RoCE, and MAN.

"This talk will describe the architecture and implementation of high capacity Lustre file system for the need of a data intensive project. Storage is based on DDN ES7700 building block and uses RDMA over Converged Ethernet as network transport. What is unusual is that the storage system is located over 10 kilometers away from the supercomputer. Challenges, performance benchmarks and tuning will be the main topic of the presentation."

Watch the video: https://wp.me/p3RLHQ-kAn

Learn more: http://www.cyfronet.krakow.pl/
and
https://www.ddn.com/company/events/isc-user-group/

Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter



  1. 1. Lustre, RoCE and MAN Łukasz Flis, Marek Magryś Dominika Kałafut, Patryk Lasoń, Adrian Marszalik, Maciej Pawlik
  2. 2. Academic Computer Centre Cyfronet AGH ● The biggest Polish academic computer centre ○ Over 45 years of experience in IT provision ○ Centre of excellence in HPC and grid computing ○ Home of the Prometheus and Zeus supercomputers ● Legal status: an autonomous unit within the AGH University of Science and Technology ● Staff: > 160, ca. 60 in R&D ● Leader of PLGrid: the Polish Grid and Cloud Infrastructure for Science ● NGI coordination in the EGI e-Infrastructure 2
  3. 3. Network backbone ● 4 main links to achieve maximum reliability ● Each link with 7x 10 Gbps capacity ● Additional 2x 100 Gbps dedicated links ● Direct connection with the GEANT scientific network ● Over 40 switches ● Security ● Monitoring 3
  4. 4. Academic Computer Centre Cyfronet AGH Prometheus ● 2.4 PFLOPS ● 53 604 cores ● 1st HPC system in Poland (174th on Top500, highest: 38th in 2015) 4 Zeus ● 374 TFLOPS ● 25 468 cores ● 1st HPC system in Poland from 2009 to 2015 (highest Top500 rank: 81st in 2011) Computing portals and frameworks ● OneData ● PLG-Data ● DataNet ● Rimrock ● InSilicoLab Data centres ● 3 independent data centres ● dedicated backbone links Research & development ● distributed computing environments ● computing acceleration ● machine learning ● software development & optimization Storage ● 48 PB ● hierarchical data management Computational cloud ● based on OpenStack
  5. 5. HPC@Cyfronet 5 ● Prometheus and Zeus clusters ○ 6475 active users (at the end of 2018) ○ 350+ computational grants ○ 8+ million jobs in 2018 ○ 371+ million CPU hours spent in 2018 ○ Biggest jobs in 2018 ■ 27 648 cores ■ 261 152 CPU hours in one job ○ 900+ (Prometheus) and 600+ (Zeus) software modules ○ Custom user helper tools developed in-house
  6. 6. The fastest supercomputer in Poland: Prometheus 6 ● Installed in Q2 2015 (upgraded in Q4 2015) ● CentOS 7 + SLURM ● HP Apollo 8000 - direct warm-water cooled system – PUE 1.06 ○ 20 racks (4 CDU, 16 compute) ● 2235 nodes, 53 604 CPU cores (Haswell, Xeon E5-2680v3 12C 2.5 GHz), 282 TB RAM ○ 2160 regular nodes (2 CPUs, 128 GB RAM) ○ 72 nodes with GPGPUs (2x NVIDIA Tesla K40 XL) ○ 4 islands ● Main storage based on Lustre ○ Scratch: 5 PB, 120 GB/s, 4x DDN SFA12KX ○ Archive: 5 PB, 60 GB/s, 2x DDN SFA12KX ● 2.4 PFLOPS total performance (Rpeak) ● < 850 kW power (including cooling) ● TOP500: currently 174th, highest: 38th (XI 2015)
  7. 7. Project background 7 ● Industrial partner ● Areas: ○ Data storage ■ POSIX ■ 10s of PBs ■ Incremental growth ○ HPC ○ Networking ○ Consulting ● PoC in 2017 ● Infrastructure tests and design in 2018 ● Production in Q1 2019 Photo: wikipedia.org
  8. 8. Challenges 8 ● How to separate industrial and academic workloads? ○ Isolated storage platform ○ Dedicated network + dedicated IB partition ○ Custom compute OS image ○ Scheduler (SLURM) setup ○ Do not mix funding sources ● Which hardware platform to use? ○ ZFS JBOD vs RAID ○ InfiniBand vs Ethernet ○ Capacity/performance ratio ○ Single vs partitioned namespace
  9. 9. Location 9 Storage to compute distance: 14 km over fibre (81 µs) DC Nawojki DC Pychowice Map: openstreetmap.org MAN backup link Dark fibre
  10. 10. Infrastructure overview 10
  11. 11. Solution 11 ● DDN SFA200NV for Lustre MDT ○ 10x 1.5 TB NVMe + 1 spare ● DDN ES7990 building block for OSTs ○ > 4 PiB usable space ○ ~ 20 GB/s performance ○ 450x 14 TB NL-SAS ○ 4x 100 Gb/s Ethernet ○ Embedded EXAScaler ● Juniper QFX10008 ○ Deep buffers (100 ms) ● Vertiv DCM racks ○ 48 U, custom depth: 130 cm ○ 1500 kg static load
  12. 12. Network: RDMA over Converged Ethernet RoCE v1: ● L2 - Ethernet Link Layer Protocol (Ethertype 0x8915) ● requires link level flow control for lossless Ethernet (PAUSE frames or Priority Flow Control) ● not routable RoCE v2: ● L3 - uses UDP/IP packets, port 4791 ● link level flow control optional ● can use ECN (Explicit Congestion Notification) for controlling flows on lossy networks ● routable Mellanox ConnectX HCAs implement hardware offload for RoCE protocols 12
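On Mellanox ConnectX HCAs, the RoCE version used for new RDMA CM connections can be inspected and switched through the rdma_cm configfs interface; a minimal sketch, assuming a device named mlx5_0 (your device name may differ) and a kernel with the rdma_cm configfs module loaded:

```shell
# Assumes Mellanox mlx5 driver; "mlx5_0" is an example device name.
mount | grep -q configfs || mount -t configfs none /sys/kernel/config

# Creating the directory instantiates the per-device rdma_cm settings
mkdir -p /sys/kernel/config/rdma_cm/mlx5_0

# Show the current default RoCE mode for port 1 (e.g. "RoCE v2")
cat /sys/kernel/config/rdma_cm/mlx5_0/ports/1/default_roce_mode

# Force routable RoCE v2 (UDP/IP, destination port 4791)
echo "RoCE v2" > /sys/kernel/config/rdma_cm/mlx5_0/ports/1/default_roce_mode
```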
  13. 13. LNET: TCP vs RoCE v2 LNET selftest, default tuning for ksocknald and ko2iblnd, Lustre 2.10.5, ConnectX-4 adapters, 100 GbE, congestion-free environment, MTU 9216 (RoCE uses 4k max) Local: MAX TCP: 4114.7 MiB/s @ 4 RPCs vs MAX RoCE v2: 10874.4 MiB/s @ 16 RPCs Remote: MAX TCP: 3662.2 MiB/s @ 4 RPCs vs MAX RoCE v2: 6805.7 MiB/s @ 32 RPCs Theoretical max: 11682 MiB/s (12250 MB/s) 13
  14. 14. LNET: TCP vs RoCE v2 Short summary: TCP vs RoCE v2 p2p (no congestion) Short range test: ● RoCE v2 out-of-box LNET bandwidth 2.6x better than TCP ● link saturation 93% Long range test (14 km): ● out-of-box LNET: RoCE v2 1.85x better than TCP ● link saturation: 58% (default settings) ● tuning required - ko2iblnd concurrent_sends=4, peer_credits=64 gives 11332.66 MiB/s (97% saturation) HW-offloaded RoCE allows full link utilization at low CPU usage. A single LNET router can easily saturate a 100 Gb/s link 14
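The long-range tuning above can be made persistent as LNET module options; a sketch, assuming the o2ib LNET driver is used over the RoCE interface (the parameter values are the ones quoted on the slide, the file path is the conventional location, not from the talk):

```shell
# /etc/modprobe.d/lustre.conf -- ko2iblnd tuning from the 14 km test
cat > /etc/modprobe.d/lustre.conf <<'EOF'
options ko2iblnd concurrent_sends=4 peer_credits=64
EOF

# Reload LNET so the new options take effect
# (Lustre must be unmounted and LNET down first)
lustre_rmmod
modprobe lnet
lctl network up
```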
  15. 15. Explicit Congestion Notification ● RoCEv2 can be used over lossy links ● Packet drops == retransmissions == bandwidth hiccups ● Enabling ECN effectively reduces packet drops on congested ports ● ECN must be enabled on all devices along the path ● If the HCA sees an ECN mark on a received packet: ○ 1. a CNP packet is sent back to the sender ○ 2. the sender reduces its transmission speed in reaction to the CNP 15
  16. 16. ECN how-to 1. Use ECN-capable switches 2. Use RoCE-capable host adapters (CX4 and CX5 were tested) 3. Use the DSCP field in the IP header to tag RDMA and CNP packets on the host (cma_roce_tos) 4. Enable ECN for RoCE traffic on switches 5. Prioritize CNP packets to ensure proper congestion signaling 6. Enjoy stable transfers and significantly reduced frame drops 7. Optionally use L3 and OSPF or BGP to handle backup routes 16
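The host-side part of the steps above (DSCP tagging and ECN on the NIC) can be sketched via sysfs on an mlx5 adapter; the device and interface names (mlx5_0, ens1f0), DSCP values (26 for RDMA, 48 for CNP) and priority 3 are assumptions for illustration, not values from the talk:

```shell
# Tag RDMA CM traffic with TOS 106 (DSCP 26); "mlx5_0" is an example device
mkdir -p /sys/kernel/config/rdma_cm/mlx5_0
echo 106 > /sys/kernel/config/rdma_cm/mlx5_0/ports/1/default_roce_tos

# Enable the ECN reaction point (sender) and notification point (receiver)
# for priority 3 on the mlx5 NIC; "ens1f0" is an example interface
echo 1 > /sys/class/net/ens1f0/ecn/roce_rp/enable/3
echo 1 > /sys/class/net/ens1f0/ecn/roce_np/enable/3

# Mark generated CNP packets with their own DSCP so switches can
# prioritize them (step 5)
echo 48 > /sys/class/net/ens1f0/ecn/roce_np/cnp_dscp
```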
  17. 17. LNET: congested long link Lustre 2.10.5, DC1 to DC2 2x 100 GbE, test: write 4:2. Congestion appears on the DC1-to-DC2 link due to the 4:2 link reduction. RoCEv2 no FC: 12818.9 MiB/s (54.86%), TCP no FC: 15368.3 MiB/s (65.78%), RoCEv2 ECN: 19426.8 MiB/s (83.14%) 17
  18. 18. RoCEv2: ECN vs no ECN Effects of disabling ECN 18
  19. 19. Real-life test 2x DDN ES7990 (4 OSS), 4 LNET routers (RoCE <-> IB FDR), 14 km Bandwidth: IOR, 112 tasks @ 28 client nodes Max write: 29872.21 MiB/s (31323.28 MB/s) Max read: 34368.27 MiB/s (36037.74 MB/s) 19
  20. 20. Conclusions 20 ● For bandwidth workloads, latency over MAN distances is not an issue ● ECN for RoCE needs to be enabled to significantly reduce packet drops during congestion ● Aggregation of links (LACP + adaptive load balancing, or ECMP for L3) allows bandwidth to scale linearly by evenly utilizing the available links ● RoCE allows more flexibility in transport links than IB, e.g. backup routing and cheaper, more scalable infrastructure
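The L3 scaling mentioned in the conclusions (ECMP) can be sketched with iproute2; all addresses and interface names below are purely illustrative:

```shell
# Two equal-cost next hops toward the remote DC; the kernel hashes
# flows across them, so aggregate bandwidth scales with link count
ip route add 10.20.0.0/24 \
    nexthop via 192.168.1.1 dev ens1f0 weight 1 \
    nexthop via 192.168.2.1 dev ens1f1 weight 1

# Include L4 ports in the multipath hash so many RoCE v2 QPs
# (all UDP dst 4791, differing src ports) spread across both links
sysctl -w net.ipv4.fib_multipath_hash_policy=1
```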
  21. 21. Acknowledgements 21 Thanks for the test infrastructure and support
  22. 22. 22 Visit us at booth H-710! (and taste some krówka) Thank you!
