Lustre, RoCE and MAN
Łukasz Flis, Marek Magryś
Dominika Kałafut, Patryk Lasoń, Adrian Marszalik, Maciej Pawlik
Academic Computer Centre Cyfronet AGH
● The biggest Polish Academic Computer Centre
○ Over 45 years of experience in IT provision
○ Centre of excellence in HPC and Grid Computing
○ Home for Prometheus and Zeus supercomputers
● Legal status: an autonomous unit within AGH University of Science and Technology
● Staff: > 160, ca. 60 in R&D
● Leader of PLGrid: Polish Grid and Cloud Infrastructure for Science
● NGI Coordination in EGI e-Infrastructure
Network backbone
● 4 main links to achieve maximum reliability
● Each link with 7x 10 Gbps capacity
● Additional 2x 100 Gbps dedicated links
● Direct connection with GEANT scientific network
● Over 40 switches
● Security
● Monitoring
Academic Computer Centre Cyfronet AGH
Prometheus
● 2.4 PFLOPS
● 53 604 cores
● 1st HPC system in Poland (174th on Top500, 38th in 2015)
Zeus
● 374 TFLOPS
● 25 468 cores
● 1st HPC system in Poland
(from 2009 to 2015, highest
rank on Top500 – 81st in 2011)
Computing portals and
frameworks
● OneData
● PLG-Data
● DataNet
● Rimrock
● InSilicoLab
Data Centres
● 3 independent data centres
● dedicated backbone links
Research & Development
● distributed computing environments
● computing acceleration
● machine learning
● software development & optimization
Storage
● 48 PB
● hierarchical data management
Computational Cloud
● based on OpenStack
HPC@Cyfronet
● Prometheus and Zeus clusters
○ 6475 active users (at the end of 2018)
○ 350+ computational grants
○ 8+ million jobs in 2018
○ 371+ million CPU hours spent in 2018
○ Biggest jobs in 2018
■ 27 648 cores
■ 261 152 CPU hours in one job
○ 900+ (Prometheus) and 600+ (Zeus) software modules
○ Custom user helper tools developed in-house
The fastest supercomputer in Poland: Prometheus
● Installed in Q2 2015 (upgraded in Q4 2015)
● CentOS 7 + SLURM
● HP Apollo 8000 - direct warm-water-cooled system – PUE 1.06
○ 20 racks (4 CDU, 16 compute)
● 2235 nodes, 53 604 CPU cores (Haswell, Xeon E5-2680v3 12C 2.5GHz), 282 TB RAM
○ 2160 regular nodes (2 CPUs, 128 GB RAM)
○ 72 nodes with GPGPUs (2x NVIDIA Tesla K40 XL)
○ 4 islands
● Main storage based on Lustre
○ Scratch: 5 PB, 120 GB/s, 4x DDN SFA12kx
○ Archive: 5 PB, 60 GB/s, 2x DDN SFA12kx
● 2.4 PFLOPS total performance (Rpeak)
● < 850 kW power (including cooling)
● TOP500: current 174th position, highest: 38th (XI 2015)
Project background
● Industrial partner
● Areas:
○ Data storage
■ POSIX
■ 10s of PBs
■ Incremental growth
○ HPC
○ Networking
○ Consulting
● PoC in 2017
● Infrastructure tests and design in 2018
● Production in Q1 2019
Photo: wikipedia.org
Challenges
● How to separate industrial and academic workloads?
○ Isolated storage platform
○ Dedicated network + dedicated IB partition
○ Custom compute OS image
○ Scheduler (SLURM) setup
○ Do not mix funding sources
● Which hardware platform to use?
○ ZFS JBOD vs RAID
○ InfiniBand vs Ethernet
○ Capacity/performance ratio
○ Single vs partitioned namespace
Location
Storage to compute distance: 14 km over fibre (81 µs)
[Map: DC Nawojki and DC Pychowice, linked by dark fibre with a MAN backup link; map source: openstreetmap.org]
Infrastructure overview
Solution
● DDN SFA200NV for Lustre MDT
○ 10x 1.5 TB NVMe + 1 spare
● DDN ES7990 building block for OST
○ > 4 PiB usable space
○ ~ 20 GB/s performance
○ 450x 14 TB NL SAS
○ 4x 100 Gb/s Ethernet
○ Embedded Exascaler
● Juniper QFX10008
○ Deep buffers (100 ms)
● Vertiv DCM racks
○ 48 U, custom depth: 130 cm
○ 1500 kg static load
Network: RDMA over Converged Ethernet
RoCE v1:
● L2 - Ethernet Link Layer Protocol (Ethertype 0x8915)
● requires link level flow control for lossless Ethernet
(PAUSE frames or Priority Flow Control)
● not routable
RoCE v2:
● L3 - uses UDP/IP packets, port 4791
● link level flow control optional
● can use ECN (Explicit Congestion Notification) for
controlling flows on lossy networks
● routable
Mellanox ConnectX HCAs implement hardware offload for
RoCE protocols
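The two encapsulations listed above can be told apart from the frame headers alone. A minimal sketch, assuming the caller has already parsed the Ethertype, the IP protocol number and the UDP destination port; the function name classify_roce and its argument names are hypothetical, while 0x8915 and port 4791 are the values from the slide:

```python
from typing import Optional

ROCE_V1_ETHERTYPE = 0x8915  # RoCE v1: its own Ethernet link-layer protocol
ROCE_V2_UDP_PORT = 4791     # RoCE v2: carried in UDP/IP packets
IPV4_ETHERTYPE = 0x0800
IPV6_ETHERTYPE = 0x86DD
UDP_PROTO = 17              # IP protocol number for UDP


def classify_roce(ethertype: int,
                  ip_proto: Optional[int] = None,
                  udp_dst_port: Optional[int] = None) -> str:
    """Classify a frame as RoCE v1, RoCE v2 or other (hypothetical helper)."""
    if ethertype == ROCE_V1_ETHERTYPE:
        # L2 only: there is no IP header, so the frame cannot be routed.
        return "RoCE v1"
    is_ip = ethertype in (IPV4_ETHERTYPE, IPV6_ETHERTYPE)
    if is_ip and ip_proto == UDP_PROTO and udp_dst_port == ROCE_V2_UDP_PORT:
        # L3: ordinary UDP/IP on the outside, hence routable.
        return "RoCE v2"
    return "other"


print(classify_roce(0x8915))            # frame with the RoCE v1 Ethertype
print(classify_roce(0x0800, 17, 4791))  # IPv4 + UDP to port 4791
```

Because RoCE v2 looks like ordinary UDP/IP from the outside, it can cross L3 boundaries, which is what makes routed backup paths possible.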
LNET: TCP vs RoCE v2
LNET selftest, default tuning for ksocklnd and ko2iblnd, Lustre 2.10.5, ConnectX-4 adapters, 100 GbE, congestion-free environment, MTU 9216 (RoCE uses 4k max)
Local: MAX TCP: 4114.7 MiB/s @ 4 RPCs vs MAX RoCE v2: 10874.4 MiB/s @ 16 RPCs
Remote: MAX TCP: 3662.2 MiB/s @ 4 RPCs vs MAX RoCE v2: 6805.7 MiB/s @ 32 RPCs
Theoretical Max: 11682 MiB/s (12250 MB/s)
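The figures above mix MiB/s and MB/s, so a short conversion helps when comparing them with the theoretical maximum. A minimal sketch using only the numbers on the slide; the helper names mb_to_mib and saturation are hypothetical:

```python
MIB = 2 ** 20  # bytes in one MiB
MB = 10 ** 6   # bytes in one MB


def mb_to_mib(mb_per_s: float) -> float:
    """Convert decimal MB/s to binary MiB/s."""
    return mb_per_s * MB / MIB


def saturation(measured_mib_s: float, ceiling_mib_s: float) -> float:
    """Measured bandwidth as a percentage of the ceiling."""
    return 100.0 * measured_mib_s / ceiling_mib_s


# Theoretical maximum quoted on the slide: 12250 MB/s.
ceiling = mb_to_mib(12250)

# Peak LNET selftest results from the slide, in MiB/s.
results = {
    "local TCP": 4114.7,
    "local RoCE v2": 10874.4,
    "remote TCP": 3662.2,
    "remote RoCE v2": 6805.7,
}

for label, mib_s in results.items():
    pct = saturation(mib_s, ceiling)
    print(f"{label}: {mib_s} MiB/s = {pct:.1f}% of {ceiling:.0f} MiB/s")
```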
LNET: TCP vs RoCE v2
Short summary TCP vs RoCE v2 p2p (no congestion)
Short range test:
● RoCE v2 out-of-box LNET bandwidth 2.6x better than TCP
● link saturation 93%
Long range test (14km):
● out-of-box LNET: RoCE v2 1.85x better than TCP
● link saturation: 58% (default settings)
● tuning required - ko2iblnd concurrent_sends=4, peer_credits=64
gives 11332.66 MiB/s (97% saturation)
HW offloaded RoCE allows for full link utilization and low CPU usage.
Single LNET router is easily able to saturate 100 Gb/s link
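As a rough guide to why the long link needs more outstanding data than the local one, the bandwidth-delay product can be estimated. This is a sketch under stated assumptions: the 81 µs figure for the 14 km path is treated as a one-way delay (the slides do not say whether it is one-way or round-trip), and the calculation says nothing about how LNET credits map to bytes in flight. The function name bdp_bytes is hypothetical.

```python
def bdp_bytes(link_gbps: float, one_way_delay_us: float) -> float:
    """Bytes that must be in flight to keep a link busy for one round trip."""
    rtt_s = 2 * one_way_delay_us * 1e-6      # round trip = 2 x one-way (assumption)
    bytes_per_s = link_gbps * 1e9 / 8        # line rate in bytes per second
    return bytes_per_s * rtt_s


LINK_GBPS = 100.0        # 100 GbE, from the test setup
ONE_WAY_DELAY_US = 81.0  # 14 km fibre figure, assumed to be one-way

bdp = bdp_bytes(LINK_GBPS, ONE_WAY_DELAY_US)
print(f"BDP: {bdp / 1e6:.3f} MB ({bdp / 2**20:.2f} MiB) in flight per 100 Gb/s link")
```

If the sender cannot keep at least this much data outstanding, the link idles while waiting for acknowledgements, regardless of the transport used.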
Explicit Congestion Notification
● RoCEv2 can be used over lossy links
● Packet drops == retransmissions == bandwidth hiccups
● Enabling ECN effectively reduces packet drops on
congested ports
● ECN must be enabled on all devices over the path
● If HCA sees ECN mark on received packet:
○ 1. CNP packet is sent back to the sender
○ 2. Sender reduces transmission speed in reaction to CNP
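The ECN mark mentioned above lives in the two low-order bits of the IP ToS / Traffic Class byte. A minimal sketch of the receiver-side check, with hypothetical function names; it only illustrates the decision to signal congestion and does not model how the HCA builds a CNP or how the sender adjusts its rate:

```python
# ECN codepoints (the two low-order bits of the ToS / Traffic Class byte).
ECN_NAMES = {
    0b00: "Not-ECT",  # sender is not ECN capable
    0b01: "ECT(1)",   # ECN capable transport
    0b10: "ECT(0)",   # ECN capable transport
    0b11: "CE",       # congestion experienced, set by a switch on the path
}


def ecn_codepoint(tos_byte: int) -> str:
    """Name of the ECN codepoint carried in a ToS / Traffic Class byte."""
    return ECN_NAMES[tos_byte & 0b11]


def receiver_should_send_cnp(tos_byte: int) -> bool:
    """True when the packet arrived with the congestion-experienced mark."""
    return (tos_byte & 0b11) == 0b11


for tos in (0x68, 0x6A, 0x6B):  # arbitrary example bytes
    print(hex(tos), ecn_codepoint(tos), receiver_should_send_cnp(tos))
```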
ECN how to
1. Use ECN capable switches
2. Use RoCE capable host adapters (CX4 and CX5 were tested)
3. Use DSCP field in IP header to tag RDMA and CNP packets
on host (cma_roce_tos)
4. Enable ECN for RoCE traffic on switches
5. Prioritize CNP packets to assure proper congestion signaling
6. Enjoy stable transfers and significantly reduced frame drops
7. Optionally use L3 and OSPF or BGP to handle backup routes
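Step 3 tags packets through the DSCP field, but host tools typically take a whole ToS byte, so the value has to be shifted into place. A minimal sketch of that arithmetic; the helper names and the DSCP values 26 and 48 are hypothetical examples, not the values used in the tested setup:

```python
def tos_from_dscp(dscp: int, ecn: int = 0) -> int:
    """Build a ToS byte: DSCP in the upper 6 bits, ECN in the lower 2 bits."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in 6 bits (0..63)")
    if not 0 <= ecn <= 3:
        raise ValueError("ECN must fit in 2 bits (0..3)")
    return (dscp << 2) | ecn


def dscp_from_tos(tos: int) -> int:
    """Recover the DSCP value from a ToS byte."""
    return (tos & 0xFF) >> 2


ECT0 = 0b10  # ECN-capable transport codepoint

# Hypothetical marking plan: one DSCP for RDMA traffic, another for CNP packets.
rdma_tos = tos_from_dscp(26, ECT0)
cnp_tos = tos_from_dscp(48)

print(f"RDMA: DSCP 26 + ECT(0) -> ToS {rdma_tos}")
print(f"CNP:  DSCP 48          -> ToS {cnp_tos}")
```

The switch configuration in steps 4 and 5 then has to match on the same DSCP values, otherwise ECN marking and CNP prioritisation apply to the wrong traffic class.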
LNET: congested long link
Lustre 2.10.5, DC1 to DC2 2x100 GbE, test: write 4:2
Congestion appears on the DC1 to DC2 link due to 4:2 link reduction
RoCEv2 no FC: 12818.9 MiB/s 54.86%
TCP no FC: 15368.3 MiB/s 65.78%
RoCEv2 ECN: 19426.8 MiB/s 83.14%
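The relative gains between the three runs above are easy to derive from the reported bandwidths. A minimal sketch using only the slide's numbers; the helper name ratio is hypothetical:

```python
# Aggregate write bandwidth over the congested 2x 100 GbE link, in MiB/s.
results = {
    "RoCEv2 no FC": 12818.9,
    "TCP no FC": 15368.3,
    "RoCEv2 ECN": 19426.8,
}


def ratio(a: str, b: str) -> float:
    """How many times faster run a is than run b."""
    return results[a] / results[b]


ecn_vs_plain_roce = ratio("RoCEv2 ECN", "RoCEv2 no FC")
ecn_vs_tcp = ratio("RoCEv2 ECN", "TCP no FC")
tcp_vs_plain_roce = ratio("TCP no FC", "RoCEv2 no FC")

print(f"RoCEv2 with ECN vs RoCEv2 without flow control: {ecn_vs_plain_roce:.2f}x")
print(f"RoCEv2 with ECN vs TCP:                         {ecn_vs_tcp:.2f}x")
print(f"TCP vs RoCEv2 without flow control:             {tcp_vs_plain_roce:.2f}x")
```

Under congestion, RoCEv2 without any congestion signalling falls behind TCP, and enabling ECN is what puts it back ahead.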
RoCEv2: ECN vs no ECN
Effects of disabling ECN
Real life test
2x DDN ES7990 (4 OSS), 4 LNET routers (RoCE <-> IB FDR), 14 km
Bandwidth: IOR 112 tasks @ 28 client nodes
Max Write: 29872.21 MiB/sec (31323.28 MB/sec)
Max Read: 34368.27 MiB/sec (36037.74 MB/sec)
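To put the IOR results in context, they can be compared with the nominal throughput of the storage building blocks. This is a sketch under assumptions: the "~20 GB/s" figure per ES7990 is read as decimal GB/s and is assumed to add up linearly across the two units; neither is stated on the slides.

```python
BLOCKS = 2              # 2x DDN ES7990 in the test
PER_BLOCK_GB_S = 20.0   # "~20 GB/s" per building block (decimal GB assumed)
CLIENT_NODES = 28
TASKS = 112

max_write_mb_s = 31323.28  # IOR max write, MB/s
max_read_mb_s = 36037.74   # IOR max read, MB/s


def percent_of_nominal(mb_s: float) -> float:
    """Measured bandwidth as a percentage of the assumed storage ceiling."""
    nominal_mb_s = BLOCKS * PER_BLOCK_GB_S * 1000.0
    return 100.0 * mb_s / nominal_mb_s


def per_node(mb_s: float) -> float:
    """Average bandwidth per client node."""
    return mb_s / CLIENT_NODES


tasks_per_node = TASKS // CLIENT_NODES

for name, value in (("write", max_write_mb_s), ("read", max_read_mb_s)):
    print(f"{name}: {percent_of_nominal(value):.1f}% of nominal, "
          f"{per_node(value):.0f} MB/s per node, {tasks_per_node} tasks per node")
```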
Conclusions
● For bandwidth-bound workloads, latency over MAN distances is not an issue
● ECN for RoCE needs to be enabled to significantly reduce packet drops during congestion
● Link aggregation (LACP + Adaptive Load Balancing, or ECMP for L3) lets bandwidth scale linearly by utilizing the available links evenly
● RoCE allows more flexibility in transport links than IB, i.e. backup routing and a cheaper, more scalable infrastructure
Acknowledgements
Thanks for the test infrastructure and support
Visit us at booth H-710!
(and taste some krówka)
Thank you!

Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 

Lustre, RoCE, and MAN

  • 1. Lustre, RoCE and MAN
      Łukasz Flis, Marek Magryś, Dominika Kałafut, Patryk Lasoń, Adrian Marszalik, Maciej Pawlik
  • 2. Academic Computer Centre Cyfronet AGH
      ● The biggest Polish Academic Computer Centre
        ○ Over 45 years of experience in IT provision
        ○ Centre of excellence in HPC and Grid Computing
        ○ Home for Prometheus and Zeus supercomputers
      ● Legal status: an autonomous unit within AGH University of Science and Technology
      ● Staff: > 160, ca. 60 in R&D
      ● Leader of PLGrid: Polish Grid and Cloud Infrastructure for Science
      ● NGI Coordination in EGI e-Infrastructure
  • 3. Network backbone
      ● 4 main links to achieve maximum reliability
      ● Each link with 7x 10 Gbps capacity
      ● Additional 2x 100 Gbps dedicated links
      ● Direct connection with GEANT scientific network
      ● Over 40 switches
      ● Security
      ● Monitoring
  • 4. Academic Computer Centre Cyfronet AGH
      Prometheus
      ● 2.4 PFLOPS
      ● 53 604 cores
      ● 1st HPC system in Poland (174th on TOP500, 38th in 2015)
      Zeus
      ● 374 TFLOPS
      ● 25 468 cores
      ● 1st HPC system in Poland (from 2009 to 2015, highest rank on TOP500: 81st in 2011)
      Computing portals and frameworks
      ● OneData
      ● PLG-Data
      ● DataNet
      ● Rimrock
      ● InSilicoLab
      Data Centres
      ● 3 independent data centres
      ● dedicated backbone links
      Research & Development
      ● distributed computing environments
      ● computing acceleration
      ● machine learning
      ● software development & optimization
      Storage
      ● 48 PB
      ● hierarchical data management
      Computational Cloud
      ● based on OpenStack
  • 5. HPC@Cyfronet
      ● Prometheus and Zeus clusters
        ○ 6475 active users (at the end of 2018)
        ○ 350+ computational grants
        ○ 8+ million jobs in 2018
        ○ 371+ million CPU hours spent in 2018
        ○ Biggest jobs in 2018
          ■ 27 648 cores
          ■ 261 152 CPU hours in one job
        ○ 900+ (Prometheus) and 600+ (Zeus) software modules
        ○ Custom user helper tools developed in-house
  • 6. The fastest supercomputer in Poland: Prometheus
      ● Installed in Q2 2015 (upgraded in Q4 2015)
      ● CentOS 7 + SLURM
      ● HP Apollo 8000: direct warm water cooled system, PUE 1.06
        ○ 20 racks (4 CDU, 16 compute)
      ● 2235 nodes, 53 604 CPU cores (Haswell, Xeon E5-2680v3 12C 2.5 GHz), 282 TB RAM
        ○ 2160 regular nodes (2 CPUs, 128 GB RAM)
        ○ 72 nodes with GPGPUs (2x NVIDIA Tesla K40 XL)
        ○ 4 islands
      ● Main storage based on Lustre
        ○ Scratch: 5 PB, 120 GB/s, 4x DDN SFA12kx
        ○ Archive: 5 PB, 60 GB/s, 2x DDN SFA12kx
      ● 2.4 PFLOPS total performance (Rpeak)
      ● < 850 kW power (including cooling)
      ● TOP500: current 174th position, highest: 38th (November 2015)
  • 7. Project background
      ● Industrial partner
      ● Areas:
        ○ Data storage
          ■ POSIX
          ■ 10s of PBs
          ■ Incremental growth
        ○ HPC
        ○ Networking
        ○ Consulting
      ● PoC in 2017
      ● Infrastructure tests and design in 2018
      ● Production in Q1 2019
      Photo: wikipedia.org
  • 8. Challenges
      ● How to separate industrial and academic workloads?
        ○ Isolated storage platform
        ○ Dedicated network + dedicated IB partition
        ○ Custom compute OS image
        ○ Scheduler (SLURM) setup
        ○ Do not mix funding sources
      ● Which hardware platform to use?
        ○ ZFS JBOD vs RAID
        ○ InfiniBand vs Ethernet
        ○ Capacity/performance ratio
        ○ Single vs partitioned namespace
  • 9. Location
      ● Storage to compute distance: 14 km over fibre (81 µs)
      ● Data centres: DC Nawojki, DC Pychowice
      ● Links: dark fibre, MAN backup link
      Map: openstreetmap.org
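The 81 µs figure can be put in context with a rough estimate. The sketch below computes the pure propagation delay over 14 km of fibre and the bandwidth-delay product (the amount of data that has to be in flight to keep a 100 Gb/s link busy). The fibre group index of 1.468 and the reading of 81 µs as a one-way delay are assumptions, not values taken from the slides, and the function names are illustrative.

```python
# Rough estimate of propagation delay and bandwidth-delay product (BDP).
# Assumptions (not from the slides): a group index of 1.468 for the fibre,
# and the 81 µs figure being a one-way delay.

C_VACUUM_M_PER_S = 299_792_458
FIBRE_GROUP_INDEX = 1.468  # assumed, typical for single-mode fibre


def propagation_delay_us(distance_km: float, index: float = FIBRE_GROUP_INDEX) -> float:
    """One-way propagation delay over fibre, in microseconds."""
    speed_m_per_s = C_VACUUM_M_PER_S / index
    return distance_km * 1000 / speed_m_per_s * 1e6


def bdp_bytes(link_gbps: float, one_way_delay_us: float) -> float:
    """Bytes that must be in flight to keep the link busy for one round trip."""
    rtt_s = 2 * one_way_delay_us * 1e-6
    return link_gbps * 1e9 / 8 * rtt_s


if __name__ == "__main__":
    print(f"propagation only, 14 km: {propagation_delay_us(14):.1f} µs")
    print(f"BDP at 100 Gb/s, 81 µs one way: {bdp_bytes(100, 81) / 2**20:.2f} MiB")
```

Under these assumptions roughly 2 MiB must be in flight per 100 Gb/s link, which is one way to see why the long-range tests later in the deck need more concurrent RPCs than the local ones.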
  • 11. Solution
      ● DDN SFA200NV for Lustre MDT
        ○ 10x 1.5 TB NVMe + 1 spare
      ● DDN ES7990 building block for OST
        ○ > 4 PiB usable space
        ○ ~ 20 GB/s performance
        ○ 450x 14 TB NL SAS
        ○ 4x 100 Gb/s Ethernet
        ○ Embedded EXAScaler
      ● Juniper QFX10008
        ○ Deep buffers (100 ms)
      ● Vertiv DCM racks
        ○ 48 U, custom depth: 130 cm
        ○ 1500 kg static load
  • 12. Network: RDMA over Converged Ethernet
      RoCE v1:
      ● L2: Ethernet link layer protocol (Ethertype 0x8915)
      ● requires link-level flow control for lossless Ethernet (PAUSE frames or Priority Flow Control)
      ● not routable
      RoCE v2:
      ● L3: uses UDP/IP packets, port 4791
      ● link-level flow control optional
      ● can use ECN (Explicit Congestion Notification) for controlling flows on lossy networks
      ● routable
      Mellanox ConnectX HCAs implement hardware offload for RoCE protocols
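The difference between the two encapsulations can be illustrated with a small frame classifier. This is a minimal sketch, not a capture tool: it assumes untagged Ethernet frames (no VLAN header) and IPv4 only, and the function and constant names are made up for the example. The Ethertype 0x8915 and UDP port 4791 come from the slide.

```python
import struct

ETHERTYPE_ROCE_V1 = 0x8915   # RoCE v1: identified at the Ethernet layer
ETHERTYPE_IPV4 = 0x0800
IPPROTO_UDP = 17
ROCE_V2_UDP_PORT = 4791      # RoCE v2: identified by UDP destination port

ETH_HEADER_LEN = 14


def classify_frame(frame: bytes) -> str:
    """Return 'RoCE v1', 'RoCE v2' or 'other' for an untagged Ethernet frame."""
    if len(frame) < ETH_HEADER_LEN:
        return "other"
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype == ETHERTYPE_ROCE_V1:
        return "RoCE v1"
    if ethertype != ETHERTYPE_IPV4 or len(frame) < ETH_HEADER_LEN + 20:
        return "other"
    # IPv4 header length is stored in 32-bit words in the low nibble.
    ihl_bytes = (frame[ETH_HEADER_LEN] & 0x0F) * 4
    protocol = frame[ETH_HEADER_LEN + 9]
    udp_offset = ETH_HEADER_LEN + ihl_bytes
    if protocol != IPPROTO_UDP or len(frame) < udp_offset + 4:
        return "other"
    (dst_port,) = struct.unpack_from("!H", frame, udp_offset + 2)
    return "RoCE v2" if dst_port == ROCE_V2_UDP_PORT else "other"


if __name__ == "__main__":
    # Synthetic frames: zeroed MAC addresses and payloads, only the
    # fields inspected above are filled in.
    v1 = bytes(12) + struct.pack("!H", ETHERTYPE_ROCE_V1) + bytes(20)
    ip = bytes([0x45]) + bytes(8) + bytes([IPPROTO_UDP]) + bytes(10)
    udp = struct.pack("!HHHH", 49152, ROCE_V2_UDP_PORT, 8, 0)
    v2 = bytes(12) + struct.pack("!H", ETHERTYPE_IPV4) + ip + udp
    print(classify_frame(v1), "/", classify_frame(v2))
```

Because RoCE v2 is ordinary UDP/IP on the wire, any L3 device can route it, while a RoCE v1 frame is only meaningful within a single Ethernet broadcast domain.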
  • 13. LNET: TCP vs RoCE v2
      LNET selftest, default tuning for ksocklnd and ko2iblnd, Lustre 2.10.5, ConnectX-4 adapters, 100 GbE, congestion-free environment, MTU 9216 (RoCE uses 4k max)
      ● Local: max TCP 4114.7 MiB/s @ 4 RPCs vs max RoCE v2 10874.4 MiB/s @ 16 RPCs
      ● Remote: max TCP 3662.2 MiB/s @ 4 RPCs vs max RoCE v2 6805.7 MiB/s @ 32 RPCs
      ● Theoretical max: 11682 MiB/s (12250 MB/s)
  • 14. LNET: TCP vs RoCE v2, short summary
      TCP vs RoCE v2, point-to-point (no congestion)
      Short range test:
      ● RoCE v2 out-of-box LNET bandwidth 2.6x better than TCP
      ● link saturation: 93%
      Long range test (14 km):
      ● out-of-box LNET: RoCE v2 1.85x better than TCP
      ● link saturation: 58% (default settings)
      ● tuning required: ko2iblnd concurrent_sends=4, peer_credits=64 gives 11332.66 MiB/s (97% saturation)
      HW-offloaded RoCE allows full link utilization with low CPU usage.
      A single LNET router can easily saturate a 100 Gb/s link.
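The speed-up and saturation figures in the summary follow directly from the LNET selftest numbers. The sketch below recomputes them, taking the 12250 MB/s theoretical maximum quoted on the slide as the reference; the dictionary layout and function names are illustrative only.

```python
MIB = 2**20

# Theoretical maximum from the slide: 12250 MB/s, expressed in MiB/s.
THEORETICAL_MAX_MIB_S = 12250e6 / MIB

# Measured LNET selftest maxima from the slides, in MiB/s.
RESULTS = {
    "local": {"tcp": 4114.7, "roce_v2": 10874.4},
    "remote": {"tcp": 3662.2, "roce_v2": 6805.7},
    "remote, tuned": {"roce_v2": 11332.66},
}


def saturation(mib_per_s: float) -> float:
    """Fraction of the theoretical 100 GbE maximum that was reached."""
    return mib_per_s / THEORETICAL_MAX_MIB_S


def speedup(roce_mib_s: float, tcp_mib_s: float) -> float:
    """How many times faster RoCE v2 was than TCP."""
    return roce_mib_s / tcp_mib_s


if __name__ == "__main__":
    for name, r in RESULTS.items():
        line = f"{name}: RoCE v2 saturation {saturation(r['roce_v2']):.1%}"
        if "tcp" in r:
            line += f", {speedup(r['roce_v2'], r['tcp']):.2f}x TCP"
        print(line)
```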
  • 15. Explicit Congestion Notification
      ● RoCEv2 can be used over lossy links
      ● Packet drops == retransmissions == bandwidth hiccups
      ● Enabling ECN effectively reduces packet drops on congested ports
      ● ECN must be enabled on all devices along the path
      ● If the HCA sees an ECN mark on a received packet:
        ○ 1. A CNP (Congestion Notification Packet) is sent back to the sender
        ○ 2. The sender reduces its transmission speed in reaction to the CNP
  • 16. ECN how-to
      1. Use ECN-capable switches
      2. Use RoCE-capable host adapters (CX4 and CX5 were tested)
      3. Use the DSCP field in the IP header to tag RDMA and CNP packets on the host (cma_roce_tos)
      4. Enable ECN for RoCE traffic on switches
      5. Prioritize CNP packets to assure proper congestion signaling
      6. Enjoy stable transfers and significantly reduced frame drops
      7. Optionally use L3 and OSPF or BGP to handle backup routes
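Step 3 relies on how DSCP and ECN share one byte of the IP header: DSCP occupies the upper six bits of the former ToS byte and ECN the lower two. The sketch below shows that packing. The DSCP value 26 is purely illustrative and not taken from the slides, and the sketch assumes that host-side tooling expects the full ToS byte rather than the bare DSCP value.

```python
# ECN codepoints (two low-order bits of the ToS / Traffic Class byte).
ECN_NOT_ECT = 0b00  # sender is not ECN-capable
ECN_ECT_1 = 0b01    # ECN-capable transport, codepoint 1
ECN_ECT_0 = 0b10    # ECN-capable transport, codepoint 0
ECN_CE = 0b11       # congestion experienced (set by a switch or router)


def tos_byte(dscp: int, ecn: int = ECN_ECT_0) -> int:
    """Pack a 6-bit DSCP value and a 2-bit ECN codepoint into one byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in 6 bits")
    if not 0 <= ecn <= 3:
        raise ValueError("ECN must fit in 2 bits")
    return (dscp << 2) | ecn


def split_tos(tos: int) -> tuple[int, int]:
    """Return (dscp, ecn) for a ToS byte."""
    return tos >> 2, tos & 0b11


if __name__ == "__main__":
    example_dscp = 26  # hypothetical value, chosen only for illustration
    tos = tos_byte(example_dscp)
    print(f"DSCP {example_dscp} with ECT(0) -> ToS {tos}")
    # A congested switch rewrites only the ECN bits, leaving DSCP intact.
    marked = tos_byte(example_dscp, ECN_CE)
    print(f"after congestion marking -> ToS {marked}, fields {split_tos(marked)}")
```

Keeping the DSCP bits stable while only the ECN bits change is what lets switches classify RoCE and CNP traffic into the intended queues and still signal congestion on the same packets.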
  • 17. LNET: congested long link
      Lustre 2.10.5, DC1 to DC2 2x 100 GbE, test: write 4:2
      Congestion appears on the DC1 to DC2 link due to the 4:2 link reduction
      Results (throughput, link saturation):
      ● RoCEv2, no FC: 12818.9 MiB/s (54.86%)
      ● TCP, no FC: 15368.3 MiB/s (65.78%)
      ● RoCEv2, ECN: 19426.8 MiB/s (83.14%)
  • 18. RoCEv2: ECN vs no ECN
      Effects of disabling ECN (figure)
  • 19. Real life test
      2x DDN ES7990 (4 OSS), 4 LNET routers (RoCE <-> IB FDR), 14 km
      Bandwidth: IOR, 112 tasks @ 28 client nodes
      ● Max write: 29872.21 MiB/s (31323.28 MB/s)
      ● Max read: 34368.27 MiB/s (36037.74 MB/s)
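The IOR figures are quoted in both binary (MiB/s) and decimal (MB/s) units. As a small worked check, the sketch below reproduces the conversion and breaks the aggregate bandwidth down per client node and per task; the per-node and per-task averages are derived here and do not appear on the slide.

```python
MIB = 2**20   # bytes in one mebibyte
MB = 10**6    # bytes in one megabyte

CLIENT_NODES = 28
TASKS = 112

# Aggregate IOR results from the slide, in MiB/s.
IOR_MIB_S = {"write": 29872.21, "read": 34368.27}


def mib_to_mb(value_mib: float) -> float:
    """Convert MiB (or MiB/s) to MB (or MB/s)."""
    return value_mib * MIB / MB


if __name__ == "__main__":
    for op, mib_s in IOR_MIB_S.items():
        print(
            f"{op}: {mib_s:.2f} MiB/s = {mib_to_mb(mib_s):.2f} MB/s, "
            f"{mib_s / CLIENT_NODES:.1f} MiB/s per node, "
            f"{mib_s / TASKS:.1f} MiB/s per task"
        )
```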
  • 20. Conclusions
      ● For bandwidth-bound workloads, latency over MAN distances is not an issue
      ● ECN mechanisms for RoCE need to be enabled to significantly reduce packet drops during congestion
      ● Aggregating links (LACP + Adaptive Load Balancing, or ECMP for L3) scales bandwidth linearly by utilizing the available links evenly
      ● RoCE allows more flexibility in transport links than IB, i.e. backup routing and a cheaper, more scalable infrastructure
  • 21. Acknowledgements
      Thanks for the test infrastructure and support
  • 22. Thank you!
      Visit us at booth H-710! (and taste some krówka)