Supermicro designed and implemented a rack-level cluster solution for the San Diego Supercomputer Center (SDSC), optimized for their custom and experimental AI training and inference workloads and meeting their environmental and TCO requirements. The project team will discuss the journey of designing and deploying our Rack Plug and Play cluster, and Shawn Strande, Deputy Director, SDSC, will share his experience of partnering with the Supermicro team to solve his challenges in HPC and AI.
The team will also share the technology that powers the SDSC Voyager Supercomputer, the Habana Gaudi AI system with 3rd Gen Intel® Xeon® Scalable processors for Deep Learning Training, and Habana Goya for Inferencing.
Watch the webinar: https://www.brighttalk.com/webcast/17278/517013
2. SDSC: Thirty-Five Years of Excellence in High-Performance and Data-Intensive Computing
• Established as a national supercomputer resource center in 1985 by NSF
• Serves the national, UC San Diego, UC System, and State of California research communities
• Supports research in all domains, including life sciences, physics, materials science, social sciences, and others
• Design, deployment, and operation of large-scale, innovative supercomputer and data resources
• Operates a state-of-the-art data center on the UC San Diego campus
• Strong connections to the local tech sector
3. NSF Award 2005369
PI: Amit Majumdar; co-PIs: Rommie Amaro, Javier Duarte, Mai Nguyen, Bob Sinkovits (SDSC/UCSD)
Speaker: Shawn Strande, SDSC Deputy Director and Voyager Project Manager
4. Voyager deployment is underway now at SDSC!
• Supermicro handover to SDSC complete
• SDSC performing systems and application installs
• Early access in Jan ’22
• Formal operations est. Feb ’22
• 3 years as a focused testbed
• 2 years with wider access offered through NSF allocations
• Opportunities for access and collaboration with and by industry
5. Voyager System and Software
• 42x training nodes, each with 8 Habana Gaudi processors (336 total); 3rd Generation Intel® Xeon® Scalable processors; 6 TB node-local NVMe
• 2x inference nodes, each with 8 Habana Goya processors (16 total); 2nd Generation Intel® Xeon® Scalable processors; 3 TB node-local NVMe storage
• 36x Intel x86 two-socket compute nodes
• Gaudi network: 400GbE Arista switches; RDMA over Converged Ethernet (RoCE) integrated on-chip
• 2 PB storage system, with the potential to experiment with various parallel file systems (Ceph, Lustre); connectivity to compute via 25GbE
• 200 TB HFS; connectivity to compute via 25GbE
• DL frameworks: TensorFlow and PyTorch, plus the Habana SynapseAI software development tools (a minimal PyTorch sketch follows this list)
• https://habana.ai/ (white papers on Gaudi and Goya and other information)
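Since the slide names both TensorFlow and PyTorch as supported frameworks, a minimal PyTorch sketch may help make the porting story concrete. It follows the pattern in Habana's public examples; the module path (habana_frameworks.torch.core), the "hpu" device string, and the mark_step calls are assumptions that may vary across SynapseAI releases, and the model and data are placeholders.

import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device (assumed module path)

device = torch.device("hpu")

# Placeholder model and synthetic data; any standard nn.Module follows the same pattern.
model = torch.nn.Linear(784, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

x = torch.randn(128, 784, device=device)
y = torch.randint(0, 10, (128,), device=device)

for step in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    htcore.mark_step()  # flush the accumulated graph in lazy-execution mode
    optimizer.step()
    htcore.mark_step()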
6. Science application characteristics
Application domain | AI techniques | ML frameworks | Training vs. inference
Astronomy | NN | TensorFlow | Mostly T
Atmospheric science | NN | TensorFlow | Mostly T
Chemistry, biophysics | NN | Custom, PyTorch | Both T & I
Chemistry, materials | NN | Custom, PyTorch | Mostly I
Computer science | Reinforcement learning, RNN | TensorFlow | Mostly T
Human microbiome | mmvec, GAN | TensorFlow, PyTorch | Mostly T
Particle physics | CNN, GAN, GNN, RNN, NN, VAE | TensorFlow, PyTorch | Both T & I
Population genetics | CNN | TensorFlow | Mostly T
Satellite image analysis | U-Net, CNN, GAN, cluster analysis, PCA | TensorFlow | Mostly T
Systems biology | CNN, SVM | TensorFlow, PyTorch | Both T & I
Key: CNN = Convolutional Neural Network; GAN = Generative Adversarial Network; GNN = Graph Neural Network; I = Inference; NN = Dense Neural Network; PCA = Principal Components Analysis; RNN = Recurrent Neural Network; SVM = Support Vector Machine; T = Training; VAE = Variational Autoencoder
7. High energy physics application – Javier Duarte, UCSD
Data processing pipeline for Higgs-boson-to-bottom-quark event processing can benefit from Voyager's inference processors to filter data coming out of the detector, and from Voyager's training processors to process data that passes the high-level trigger. Credit: Javier Duarte
• The LHC at CERN, whose data led to the discovery of the Higgs boson, generates massive amounts of data; more than 99% of events are discarded immediately
• The remaining petabytes of data are further analyzed
• Duarte and collaborators use ML for triggering, event reconstruction, and data analysis
• For triggering, ML improves signal selection efficiency
• For data analysis, various ML algorithms (including dense, convolutional, recurrent, and graph neural networks) are used to classify each event as signal or background and to identify particle signatures, such as Higgs boson decay candidates (a toy sketch follows this list)
• GNNs on Gaudi to improve particle identification and event reconstruction
• Goya to test the software-based triggering step of the data processing pipeline
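To make the signal-versus-background step concrete, here is a toy dense-network event classifier in TensorFlow, the kind of model named above. The feature count, layer sizes, and synthetic data are illustrative placeholders, not Duarte's actual pipeline.

import numpy as np
import tensorflow as tf

# Each event is summarized by a fixed-length vector of reconstructed
# quantities (e.g., kinematics); 32 features is an arbitrary choice here.
model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(32,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(event is signal)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Synthetic stand-in data; real training would use labeled simulation.
x = np.random.randn(1024, 32).astype("float32")
y = np.random.randint(0, 2, size=(1024, 1)).astype("float32")
model.fit(x, y, epochs=3, batch_size=128)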
8. Satellite image analysis – Mai Nguyen, Ilkay Altintas, and collaborators, SDSC
• Applying DL to image analysis, disaster management, NLP, and other areas
• A Voyager project: DL algorithms on satellite images to determine land cover across different areas in the context of wildfire management
• WIFIRE: WIFIRE HOME | WIFIRE (ucsd.edu)
• Goal is to combine AI models with fire science models and fire science expertise
• Study and simulate fire behavior under different conditions
• Algorithms developed on the TensorFlow framework will be ported to Voyager
• Easy transition of DL models to Habana expected
Satellite image processing pipeline, showing the data preparation steps (crop, pan-sharpen, reproject, create RGB, and downsample satellite imagery tiles) and the machine learning steps (CNN feature extraction, PCA, clustering, and sorting into ordered clusters, then histogram and map). ML model training and inference will be accomplished using Voyager's processors.
• For land-cover map generation, U-Nets and CNNs will be trained on Gaudi processors for segmentation and classification (a minimal sketch appears after this list)
• DL models are used to extract features from satellite images for:
  o Region-of-interest detection to locate schools in rural areas
  o Demographic analysis to understand the organization of a city and refugee camp formation
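As a flavor of the segmentation side, below is a minimal U-Net-style encoder-decoder in Keras for per-pixel land-cover labeling. The tile size, channel widths, and number of classes are assumptions made for illustration only.

import tensorflow as tf

NUM_CLASSES = 8  # assumed number of land-cover classes

inputs = tf.keras.layers.Input(shape=(128, 128, 3))  # assumed RGB tile size

# Encoder: downsample spatially while growing the channel count.
c1 = tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
p1 = tf.keras.layers.MaxPooling2D()(c1)
c2 = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(p1)
p2 = tf.keras.layers.MaxPooling2D()(c2)

# Bottleneck.
b = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(p2)

# Decoder: upsample and reuse encoder features via skip connections.
u2 = tf.keras.layers.Concatenate()([tf.keras.layers.UpSampling2D()(b), c2])
c3 = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(u2)
u1 = tf.keras.layers.Concatenate()([tf.keras.layers.UpSampling2D()(c3), c1])
c4 = tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu")(u1)

# Per-pixel class probabilities.
outputs = tf.keras.layers.Conv2D(NUM_CLASSES, 1, activation="softmax")(c4)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")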
9. Our partnership with Supermicro and Intel Habana Labs is allowing us to deploy a cutting-edge AI supercomputer for research
• Ability and willingness to engage in a project with innovative technology for advanced computing and AI in science and engineering research
• Deep technical collaboration with Supermicro and Intel Habana Labs on advanced AI processors, high-performance networking, and systems integration
• Rigorous pre-delivery testing for reliability and performance
• Onsite installation; 5 years of support
10. A little about Habana
• Founded in 2016 to develop purpose-built AI processors
• Launched inference processor in 2018, training processor in 2019
• Acquired by Intel in late 2019
• Fully leveraging Intel’s scale, resources, and infrastructure
• Accessing Intel ecosystem and customer partnerships
• Gaudi AI processor is now available on AWS; DL1 is the first non-GPU deep learning training instance on AWS
• Continuing with our mission to build AI processors optimized for data center and cloud performance and efficiency
12. Gaudi: architected for efficiency
Designed to optimize AI performance, delivering higher efficiency than traditional CPUs and GPUs
• Heterogeneous compute architecture
  - Configurable centralized GEMM engine (MME)
  - Fully programmable, AI-customized Tensor Processing Cores
• Software-managed memory architecture
  - 32 GB of HBM2 memory
• Natively integrated 10 x 100Gb Ethernet RoCE for scaling
13. Designed for flexible and easy model migration
Ease of use: Integrated with TensorFlow and PyTorch, with minimal code changes to get started. SynapseAI maps the model topology onto Gaudi devices, so developers can enjoy the same abstraction they are accustomed to today.
Customization: The SynapseAI TPC SDK facilitates development of custom kernels, so developers can customize models to extract the best performance.
Balanced compute & memory: 32 GB of HBM2 memory, similar to GPUs, means existing DL models will fit into Gaudi memory, so developers can spend less effort porting their models to Gaudi.
14. Designed for Scaling Efficiency
The industry’s FIRST: native integration of 10 x 100 Gigabit Ethernet RoCE ports onto every Gaudi
• Eliminates network bottlenecks
• Standard Ethernet inside the server and across nodes
• Eliminates lock-in with proprietary interfaces
• Lowers total system cost and power by reducing discrete components
15. Scaling Within a Gaudi Server
• 8 Gaudi OCP OAM cards
• 24 x 100GbE RDMA RoCE for scale-out
• Non-blocking, all-to-all internal interconnect across Gaudi AI processors
• Separate PCIe ports for external host CPU traffic
Example of an integrated server with eight Gaudi AI processors, two Xeon CPUs, and multiple Ethernet interfaces.
16. Rack and Pod Level Scaling
Easily build rack- and pod-scale training systems with off-the-shelf standard Ethernet switches.
Example of a rack configuration with four Gaudi servers (eight Gaudi processors per server) connected to a single Ethernet switch.
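To suggest what data-parallel training across Gaudi cards can look like in code, here is a rough sketch. It assumes the HPUStrategy class and module layout (habana_frameworks.tensorflow) from Habana's distributed-training documentation, which may differ by SynapseAI release; one worker process is typically launched per Gaudi card.

import tensorflow as tf
# Assumed import paths from Habana's TensorFlow integration; they have
# changed across SynapseAI releases.
from habana_frameworks.tensorflow import load_habana_module
from habana_frameworks.tensorflow.distribute import HPUStrategy

load_habana_module()
strategy = HPUStrategy()  # gradients sync over the RoCE/Ethernet fabric

with strategy.scope():
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
model.fit(x_train / 255.0, y_train, epochs=5, batch_size=128)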
17. SynapseAI® Software Suite: designed for performance and ease of use
Driving end-user efficiency for model build and migration
• Train deep learning models on Gaudi with minimal code changes
• Integrated with TensorFlow & PyTorch
• Habana Developer Site & GitHub
• Support with reference models, kernel libraries, documentation, and “how-tos”
• Advanced users can write their own custom kernels
Stack diagram components: Framework Integration Layer; Graph Compiler; Habana Communication Libraries; Habana Kernel Library; Customer Kernel Library; User Mode Driver; Kernel Mode Driver; Debugging & Profiling Tools; TPC Programming Tools.
18. Getting Started with TensorFlow on Gaudi

import tensorflow as tf
from TensorFlow.common.library_loader import load_habana_module

# Load the Habana libraries that register the Gaudi (HPU) device with TensorFlow.
load_habana_module()

# Standard Keras MNIST workflow; no Gaudi-specific model changes are needed.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10),
])
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, batch_size=128)
model.evaluate(x_test, y_test)

Load the Habana libraries needed to use the Gaudi (aka HPU) device. Once loaded, the HPU device is registered in TensorFlow and prioritized over the CPU: when an op is available for both CPU and HPU, it is assigned to the HPU; when an op is not supported on the HPU, it runs on the CPU.
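To observe the op-to-device assignment described above, TensorFlow's stock device-placement logging can be enabled before any ops are created; nothing here beyond the load_habana_module call is Habana-specific.

import tensorflow as tf
from TensorFlow.common.library_loader import load_habana_module

load_habana_module()
tf.debugging.set_log_device_placement(True)  # print the device chosen for each op

# Each op now logs whether it was placed on the HPU or fell back to the CPU.
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.matmul(a, a)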
20. SDSC Voyager Supercomputer powered by Supermicro and Habana AI processors
• Supermicro X12 8-Gaudi server powering Voyager
• 16 Goya inference processors in 8-card servers from Supermicro