Supermicro designed and implemented a rack-level cluster solution for the San Diego Supercomputer Center (SDSC), optimized for their custom and experimental AI training and inference workloads and meeting their environmental and TCO requirements. The project team will discuss the journey of designing and deploying our Rack Plug and Play cluster, and Shawn Strande, Deputy Director, SDSC, will share his experience of partnering with the Supermicro team to solve his challenges in HPC and AI.
The team will also share the technology that powers the SDSC Voyager Supercomputer, the Habana Gaudi AI system with 3rd Gen Intel® Xeon® Scalable processors for Deep Learning Training, and Habana Goya for Inferencing.
Watch the webinar: https://www.brighttalk.com/webcast/17278/517013
2. SDSC: Thirty-Five Years of Excellence in High-Performance and Data-Intensive Computing
• Established as a national supercomputer resource center in 1985 by NSF
• Serves the national, UC San Diego, UC System, and State of California research communities
• Supports research in all domains, including life sciences, physics, materials science, social sciences, and others
• Design, deployment, and operation of large-scale, innovative supercomputer and data resources
• Operates a state-of-the-art data center on the UC San Diego campus
• Strong connections to the local tech sector
3. NSF Award 2005369
PI: Amit Majumdar; co-PIs: Rommie Amaro, Javier Duarte, Mai Nguyen, Bob Sinkovits (SDSC/UCSD)
Speaker: Shawn Strande, SDSC Deputy Director and Voyager Project Manager
4. Voyager deployment is underway now at SDSC!
• Supermicro handover to SDSC complete
• SDSC performing systems and application installs
• Early access in Jan ’22
• Formal operations est. Feb ’22
• 3 years as a focused testbed
• 2 years with wider access offered through NSF allocations
• Opportunities for access and collaboration with and by industry
5. Voyager System and Software
• 42x training nodes, each with 8 Habana Gaudi processors (336 total); 3rd Generation Intel® Xeon® Scalable processors; 6 TB node-local NVMe
• 2x inference nodes, each with 8 Habana Goya processors (16 total); 2nd Generation Intel® Xeon® Scalable processors; 3 TB node-local NVMe storage
• 36x Intel x86 two-socket compute nodes
• Gaudi network: 400GbE Arista switches; RDMA over Converged Ethernet (RoCE) integrated on-chip
• 2 PB storage system, with the potential to experiment with various parallel file systems (Ceph, Lustre); connectivity to compute via 25GbE
• 200 TB HFS; connectivity to compute via 25GbE
• DL frameworks: TensorFlow and PyTorch, plus the Habana SynapseAI software development tools (a minimal PyTorch sketch follows this list)
• https://habana.ai/ (white papers on Gaudi and Goya and other information)
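Since the slide names both TensorFlow and PyTorch as supported frameworks, a minimal PyTorch sketch may help make the porting story concrete. It follows the pattern in Habana's public examples; the module path (habana_frameworks.torch.core), the "hpu" device string, and the mark_step calls are assumptions that may vary across SynapseAI releases, and the model and data are placeholders.

import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device (assumed module path)

device = torch.device("hpu")

# Placeholder model and synthetic data; any standard nn.Module follows the same pattern.
model = torch.nn.Linear(784, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

x = torch.randn(128, 784, device=device)
y = torch.randint(0, 10, (128,), device=device)

for step in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    htcore.mark_step()  # flush the accumulated graph in lazy-execution mode
    optimizer.step()
    htcore.mark_step()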
6. Science application characteristics
Application domain | AI techniques | ML frameworks | Training vs. inference
Astronomy | NN | TensorFlow | Mostly T
Atmospheric science | NN | TensorFlow | Mostly T
Chemistry, biophysics | NN | Custom, PyTorch | Both T & I
Chemistry, materials | NN | Custom, PyTorch | Mostly I
Computer science | Reinforcement learning, RNN | TensorFlow | Mostly T
Human microbiome | mmvec, GAN | TensorFlow, PyTorch | Mostly T
Particle physics | CNN, GAN, GNN, RNN, NN, VAE | TensorFlow, PyTorch | Both T & I
Population genetics | CNN | TensorFlow | Mostly T
Satellite image analysis | U-Net, CNN, GAN, cluster analysis, PCA | TensorFlow | Mostly T
Systems biology | CNN, SVM | TensorFlow, PyTorch | Both T & I
Key: CNN = Convolutional Neural Network; GAN = Generative Adversarial Network; GNN = Graph Neural Network; I = Inference; NN = Dense Neural Network; PCA = Principal Components Analysis; RNN = Recurrent Neural Network; SVM = Support Vector Machine; T = Training; VAE = Variational Autoencoder
7. High energy physics application – Javier Duarte, UCSD
Data processing pipeline for Higgs-boson-to-bottom-quark event processing can benefit from Voyager's inference processors to filter data coming out of the detector, and from Voyager's training processors to process data that passes the high-level trigger. Credit: Javier Duarte
• The LHC at CERN, whose data led to the discovery of the Higgs boson, generates massive amounts of data; more than 99% of events are discarded immediately
• The remaining petabytes of data are further analyzed
• Duarte and collaborators use ML for triggering, event reconstruction, and data analysis
• For triggering, ML improves signal selection efficiency
• For data analysis, various ML algorithms (including dense, convolutional, recurrent, and graph neural networks) are used to classify each event as signal or background and to identify particle signatures, such as Higgs boson decay candidates (a toy sketch follows this list)
• GNNs on Gaudi to improve particle identification and event reconstruction
• Goya to test the software-based triggering step of the data processing pipeline
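To make the signal-versus-background step concrete, here is a toy dense-network event classifier in TensorFlow, the kind of model named above. The feature count, layer sizes, and synthetic data are illustrative placeholders, not Duarte's actual pipeline.

import numpy as np
import tensorflow as tf

# Each event is summarized by a fixed-length vector of reconstructed
# quantities (e.g., kinematics); 32 features is an arbitrary choice here.
model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(32,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(event is signal)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Synthetic stand-in data; real training would use labeled simulation.
x = np.random.randn(1024, 32).astype("float32")
y = np.random.randint(0, 2, size=(1024, 1)).astype("float32")
model.fit(x, y, epochs=3, batch_size=128)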
8. Satellite image analysis – Mai Nguyen, Ilkay Altintas, and collaborators, SDSC
• Applying DL to image analysis, disaster management, NLP, and other areas
• A Voyager project: DL algorithms on satellite images to determine land cover across different areas in the context of wildfire management
• WIFIRE: WIFIRE HOME | WIFIRE (ucsd.edu)
• Goal is to combine AI models with fire science models and fire science expertise
• Study and simulate fire behavior under different conditions
• Algorithms developed on the TensorFlow framework will be ported to Voyager
• Easy transition of DL models to Habana expected
Satellite image processing pipeline, showing the data preparation steps (crop, pan-sharpen, reproject, create RGB, and downsample satellite imagery tiles) and the machine learning steps (CNN feature extraction, PCA, clustering, and sorting into ordered clusters, then histogram and map). ML model training and inference will be accomplished using Voyager's processors.
• For land-cover map generation, U-Nets and CNNs will be trained on Gaudi processors for segmentation and classification (a minimal sketch appears after this list)
• DL models are used to extract features from satellite images for:
  o Region-of-interest detection to locate schools in rural areas
  o Demographic analysis to understand the organization of a city and refugee camp formation
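As a flavor of the segmentation side, below is a minimal U-Net-style encoder-decoder in Keras for per-pixel land-cover labeling. The tile size, channel widths, and number of classes are assumptions made for illustration only.

import tensorflow as tf

NUM_CLASSES = 8  # assumed number of land-cover classes

inputs = tf.keras.layers.Input(shape=(128, 128, 3))  # assumed RGB tile size

# Encoder: downsample spatially while growing the channel count.
c1 = tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
p1 = tf.keras.layers.MaxPooling2D()(c1)
c2 = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(p1)
p2 = tf.keras.layers.MaxPooling2D()(c2)

# Bottleneck.
b = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(p2)

# Decoder: upsample and reuse encoder features via skip connections.
u2 = tf.keras.layers.Concatenate()([tf.keras.layers.UpSampling2D()(b), c2])
c3 = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(u2)
u1 = tf.keras.layers.Concatenate()([tf.keras.layers.UpSampling2D()(c3), c1])
c4 = tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu")(u1)

# Per-pixel class probabilities.
outputs = tf.keras.layers.Conv2D(NUM_CLASSES, 1, activation="softmax")(c4)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")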
9. Our partnership with Supermicro and Intel Habana Labs is allowing us to deploy a cutting-edge AI supercomputer for research
• Ability and willingness to engage in a project with innovative technology for advanced computing and AI in science and engineering research
• Deep technical collaboration with Supermicro and Intel Habana Labs on advanced AI processors, high-performance networking, and systems integration
• Rigorous pre-delivery testing for reliability and performance
• Onsite installation; 5 years of support
10. A little about Habana
• Founded in 2016 to develop purpose-built AI processors
• Launched inference processor in 2018, training processor in 2019
• Acquired by Intel in late 2019
• Fully leveraging Intel’s scale, resources, and infrastructure
• Accessing Intel ecosystem and customer partnerships
• Gaudi AI processor is now available on AWS; DL1 is the first non-GPU deep learning training instance on AWS
• Continuing with our mission to build AI processors optimized for data center and cloud performance and efficiency
12. Gaudi: architected for efficiency
Designed to optimize AI performance, delivering higher efficiency than traditional CPUs and GPUs
• Heterogeneous compute architecture
  - Configurable centralized GEMM engine (MME)
  - Fully programmable, AI-customized Tensor Processing Cores
• Software-managed memory architecture
  - 32 GB of HBM2 memory
• Natively integrated 10 x 100Gb Ethernet RoCE for scaling
13. Designed for flexible and easy model migration
Ease of use: Integrated with TensorFlow and PyTorch, with minimal code changes to get started. SynapseAI maps the model topology onto Gaudi devices, so developers can enjoy the same abstraction they are accustomed to today.
Customization: The SynapseAI TPC SDK facilitates development of custom kernels, so developers can customize models to extract the best performance.
Balanced compute & memory: 32 GB of HBM2 memory, similar to GPUs, means existing DL models will fit into Gaudi memory, so developers can spend less effort porting their models to Gaudi.
14. Designed for Scaling Efficiency
The industry’s FIRST: native integration of 10 x 100 Gigabit Ethernet RoCE ports onto every Gaudi
• Eliminates network bottlenecks
• Standard Ethernet inside the server and across nodes
• Eliminates lock-in with proprietary interfaces
• Lowers total system cost and power by reducing discrete components
15. Scaling Within a Gaudi Server
• 8 Gaudi OCP OAM cards
• 24 x 100GbE RDMA RoCE for scale-out
• Non-blocking, all-to-all internal interconnect across Gaudi AI processors
• Separate PCIe ports for external host CPU traffic
Example of an integrated server with eight Gaudi AI processors, two Xeon CPUs, and multiple Ethernet interfaces.
16. Rack and Pod Level Scaling
Easily build rack- and pod-scale training systems with off-the-shelf standard Ethernet switches.
Example of a rack configuration with four Gaudi servers (eight Gaudi processors per server) connected to a single Ethernet switch.
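To suggest what data-parallel training across Gaudi cards can look like in code, here is a rough sketch. It assumes the HPUStrategy class and module layout (habana_frameworks.tensorflow) from Habana's distributed-training documentation, which may differ by SynapseAI release; one worker process is typically launched per Gaudi card.

import tensorflow as tf
# Assumed import paths from Habana's TensorFlow integration; they have
# changed across SynapseAI releases.
from habana_frameworks.tensorflow import load_habana_module
from habana_frameworks.tensorflow.distribute import HPUStrategy

load_habana_module()
strategy = HPUStrategy()  # gradients sync over the RoCE/Ethernet fabric

with strategy.scope():
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
model.fit(x_train / 255.0, y_train, epochs=5, batch_size=128)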
17. SynapseAI® Software Suite: designed for performance and ease of use
Driving end-user efficiency for model build and migration
• Train deep learning models on Gaudi with minimal code changes
• Integrated with TensorFlow & PyTorch
• Habana Developer Site & GitHub
• Support with reference models, kernel libraries, documentation, and “how-tos”
• Advanced users can write their own custom kernels
Stack diagram components: Framework Integration Layer; Graph Compiler; Habana Communication Libraries; Habana Kernel Library; Customer Kernel Library; User Mode Driver; Kernel Mode Driver; Debugging & Profiling Tools; TPC Programming Tools.
18. Getting Started with TensorFlow on Gaudi

import tensorflow as tf
from TensorFlow.common.library_loader import load_habana_module

# Load the Habana libraries that register the Gaudi (HPU) device with TensorFlow.
load_habana_module()

# Standard Keras MNIST workflow; no Gaudi-specific model changes are needed.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10),
])
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, batch_size=128)
model.evaluate(x_test, y_test)

Load the Habana libraries needed to use the Gaudi (aka HPU) device. Once loaded, the HPU device is registered in TensorFlow and prioritized over the CPU: when an op is available for both CPU and HPU, it is assigned to the HPU; when an op is not supported on the HPU, it runs on the CPU.
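To observe the op-to-device assignment described above, TensorFlow's stock device-placement logging can be enabled before any ops are created; nothing here beyond the load_habana_module call is Habana-specific.

import tensorflow as tf
from TensorFlow.common.library_loader import load_habana_module

load_habana_module()
tf.debugging.set_log_device_placement(True)  # print the device chosen for each op

# Each op now logs whether it was placed on the HPU or fell back to the CPU.
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.matmul(a, a)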
20. SDSC Voyager Supercomputer powered by Supermicro and Habana AI processors
• Supermicro X12 8-Gaudi server powering Voyager
• 16 Goya inference processors in 8-card servers from Supermicro