Kotsalos Christos
Scientific and Parallel Computing Group (SPC)
μCluster +
Supercomputing/Cluster Technologies
μCluster
• 2 NVIDIA Jetson Nano boards
• 3 Raspberry Pi boards
• 1 NVIDIA Jetson TX2 board
• 1 network switch: 16 ports, supports 1 Gbit/s
• Cat. 6 network cables: support 1 Gbit/s
NVIDIA Jetson Nano
• GPU: 128-core NVIDIA Maxwell (CUDA cores)
• CPU: Quad-core ARM A57 @ 1.43 GHz
• Memory: 4 GB 64-bit LPDDR4, 25.6 GB/s
• Storage: microSD
• Connectivity: Gigabit Ethernet
• Display: HDMI and DisplayPort
• USB: 4x USB 3.0, USB 2.0 Micro-B
μCluster components
https://developer.nvidia.com/embedded/jetson-nano-developer-kit
Technical Specifications
JETSON TX2 MODULE
• NVIDIA Pascal Architecture GPU
• 2 Denver 64-bit CPUs + Quad-Core A57
• 8 GB 128-bit LPDDR4 Memory
• 32 GB eMMC 5.1 Flash Storage
• Connectivity to 802.11ac Wi-Fi and Bluetooth-Enabled Devices
• 10/100/1000BASE-T Ethernet
I/O
• USB 3.0 Type A
• USB 2.0 Micro AB
• HDMI
• Gigabit Ethernet
• Full-Size SD
• SATA Data and Power
POWER OPTIONS
• External 19V AC Adapter
μCluster components
https://developer.nvidia.com/embedded/jetson-tx2-developer-kit
Developer Kits
Modules
Jetson family
Marketed for AI apps
(Tensor cores)
https://developer.nvidia.com/embedded/jetson-modules
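Modules compared: Jetson Nano, Jetson TX2 Series (TX2 4GB, TX2, TX2i), Jetson Xavier NX, Jetson AGX Xavier Series
AI Performance
• Jetson Nano: 472 GFLOPs (FP16)
• Jetson TX2 4GB, TX2: 1.33 TFLOPs (FP16)
• Jetson TX2i: 1.26 TFLOPs (FP16)
• Jetson Xavier NX: 21 TOPs (INT8)
• Jetson AGX Xavier Series: 32 TOPs (INT8)
GPU
• Jetson Nano: 128-core NVIDIA Maxwell GPU
• Jetson TX2 Series: 256-core NVIDIA Pascal GPU
• Jetson Xavier NX: 384-core NVIDIA Volta GPU with 48 Tensor Cores
• Jetson AGX Xavier Series: 512-core NVIDIA Volta GPU with 64 Tensor Cores
CPU
• Jetson Nano: Quad-Core ARM Cortex-A57
• Jetson TX2 Series: Dual-Core NVIDIA Denver 1.5 64-bit CPU and Quad-Core ARM Cortex-A57
• Jetson Xavier NX: 6-core NVIDIA Carmel ARM v8.2 64-bit CPU, 6 MB L2 + 4 MB L3
• Jetson AGX Xavier Series: 8-core NVIDIA Carmel ARM v8.2 64-bit CPU, 8 MB L2 + 4 MB L3
Memory
• Jetson Nano: 4 GB 64-bit LPDDR4, 25.6 GB/s
• Jetson TX2 4GB: 4 GB 128-bit LPDDR4, 51.2 GB/s
• Jetson TX2: 8 GB 128-bit LPDDR4, 59.7 GB/s
• Jetson TX2i: 8 GB 128-bit LPDDR4 (ECC support), 51.2 GB/s
• Jetson Xavier NX: 8 GB 128-bit LPDDR4x, 51.2 GB/s
• Jetson AGX Xavier Series: 32 GB 256-bit LPDDR4x, 136.5 GB/s
Storage
• Jetson Nano: 16 GB eMMC 5.1
• Jetson TX2 4GB: 16 GB eMMC 5.1
• Jetson TX2, TX2i: 32 GB eMMC 5.1
• Jetson Xavier NX: 16 GB eMMC 5.1
• Jetson AGX Xavier Series: 32 GB eMMC 5.1
Power
• Jetson Nano: 5W / 10W
• Jetson TX2 4GB, TX2: 7.5W / 15W
• Jetson TX2i: 10W / 20W
• Jetson Xavier NX: 10W / 15W
• Jetson AGX Xavier Series: 10W / 15W / 30W
Networking
• Jetson Nano, TX2 4GB: 10/100/1000 BASE-T Ethernet
• Jetson TX2: 10/100/1000 BASE-T Ethernet, WLAN
• Jetson TX2i, Xavier NX, AGX Xavier Series: 10/100/1000 BASE-T Ethernet
For comparison: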
Tesla P100-PCIE-16GB: 16GB RAM, 3584 CUDA Cores
Jetson family Specs
TOPS: Tera-Operations per Second
https://developer.nvidia.com/embedded/jetson-modules
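Two of the spec-sheet figures above can be reproduced with simple arithmetic. The sketch below is illustrative only: the memory transfer rates (MT/s) and GPU clocks (GHz) are assumptions that do not appear on the slides, chosen because they match the published bandwidth and FP16 numbers; the function names are hypothetical.

```python
def peak_bandwidth_gb_s(bus_width_bits, transfers_mt_s):
    """Peak DRAM bandwidth in GB/s: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * transfers_mt_s / 1000


def peak_fp16_gflops(cuda_cores, clock_ghz):
    """Peak FP16 rate, assuming one FMA (2 FLOPs) on 2 packed halves per core per clock."""
    return cuda_cores * 2 * 2 * clock_ghz


# Assumed transfer rates and clocks (not stated on the slides).
print(round(peak_bandwidth_gb_s(64, 3200), 1))    # Jetson Nano, 64-bit LPDDR4
print(round(peak_bandwidth_gb_s(128, 3733), 1))   # Jetson TX2, 128-bit LPDDR4
print(round(peak_bandwidth_gb_s(256, 4266), 1))   # Jetson AGX Xavier, 256-bit LPDDR4x
print(round(peak_fp16_gflops(128, 0.9216)))       # Jetson Nano GPU
print(round(peak_fp16_gflops(256, 1.3)))          # Jetson TX2 GPU
```

Under these assumptions the results line up with the 25.6, 59.7 and 136.5 GB/s memory figures and with the 472 GFLOPs and 1.33 TFLOPs FP16 figures listed for the modules.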
Raspberry Pi 3 Model B
• Quad Core 1.2GHz Broadcom BCM2837 64bit CPU
• 1GB RAM
• BCM43438 wireless LAN and Bluetooth Low Energy (BLE) on board
• 100 Base Ethernet
• Micro SD port for loading your operating system and storing data
μCluster components
https://www.raspberrypi.org/products/raspberry-pi-3-model-b-plus/
Raspberry Pi latest version
https://www.raspberrypi.org/products/raspberry-pi-4-model-b/
Turing Pi (alternative path)
https://turingpi.com/v1/
Raspberry Pi
Compute Module
Turing Pi (alternative path)
https://turingpi.com/v1/
Steps to build the μCluster
1. Install the OS (on just one board)
2. Install the software (on just one board)
3. Copy the image (OS + software) to all the other boards
4. Set up static IPs
5. Set up NFS (Network File System)
6. Run your MPI applications using a hostfile
mpirun -np <N> --hostfile myHostFile ./executable <arguments>
myHostFile:
j01 slots=4 max-slots=4 # Jetson TX2
j02 slots=4 max-slots=4 # Jetson Nano
j03 slots=4 max-slots=4 # Jetson Nano
j04 slots=4 max-slots=4 # Raspberry Pi
j05 slots=4 max-slots=4 # Raspberry Pi
j06 slots=4 max-slots=4 # Raspberry Pi
j01 to j06: hostnames of the available nodes
(mapped to their static IPs in /etc/hosts; each board's own name is set in /etc/hostname)
https://selkie-macalester.org/csinparallel/modules/RosieCluster/build/html/#
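As a rough illustration of how the hostfile determines the process count passed to mpirun, the following sketch parses the hostfile text and sums the slots. The parser is a hypothetical helper written for this example, not part of any MPI distribution, and defaulting to one slot when `slots=` is omitted is a simplifying assumption.

```python
def parse_hostfile(text):
    """Return {hostname: slots} from hostfile text, ignoring '#' comments."""
    hosts = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, *options = line.split()
        opts = dict(opt.split("=", 1) for opt in options)
        # Assumption for this sketch: one slot when 'slots=' is omitted.
        hosts[name] = int(opts.get("slots", 1))
    return hosts


HOSTFILE = """\
j01 slots=4 max-slots=4 # Jetson TX2
j02 slots=4 max-slots=4 # Jetson Nano
j03 slots=4 max-slots=4 # Jetson Nano
j04 slots=4 max-slots=4 # Raspberry Pi
j05 slots=4 max-slots=4 # Raspberry Pi
j06 slots=4 max-slots=4 # Raspberry Pi
"""

hosts = parse_hostfile(HOSTFILE)
total_slots = sum(hosts.values())
print(f"mpirun -np {total_slots} --hostfile myHostFile ./executable")
```

With the six boards at four slots each, the largest run that does not oversubscribe the cluster uses 24 processes.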
SLURM
as an alternative
(Resource Manager)
NFS vs parallel FS
[Diagram] NFS: nodes n0, n1, …, nm attached to a single Storage unit (NOT SCALABLE)
[Diagram] PFS: nodes n0, n1, …, nk, …, nm attached to multiple Storage units (SCALABLE)
Experiences on File Systems Which is the best file system for you? Jakob Blomer CERN PH/SFT 2015
https://developer.ibm.com/tutorials/l-network-filesystems/
Hotspot
No Stress for the Network!
parallel FS
Metadata services store namespace metadata, such as filenames, directories, access permissions, and file layout.
Object storage contains the actual file data.
Clients pull the location of files and directories from the metadata services, then access file storage directly.
Popular PFS:
Lustre
BeeGFS
GlusterFS
https://azure.microsoft.com/en-ca/resources/parallel-virtual-file-systems-on-microsoft-azure/
https://techcommunity.microsoft.com/t5/azure-global/parallel-file-systems-for-hpc-storage-on-azure/ba-p/306223
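The split between metadata services and object storage can be pictured with a toy in-memory model. This is a sketch only: the class names, the round-robin striping policy and the stripe size are assumptions made for illustration and do not describe how Lustre, BeeGFS or GlusterFS are implemented.

```python
class MetadataService:
    """Stores only the namespace and the file layout, never file data."""

    def __init__(self, n_targets, stripe_size):
        self.n_targets = n_targets
        self.stripe_size = stripe_size
        self.layouts = {}  # path -> list of target indices, one per stripe

    def create(self, path, size):
        n_stripes = max(1, -(-size // self.stripe_size))  # ceiling division
        self.layouts[path] = [i % self.n_targets for i in range(n_stripes)]
        return self.layouts[path]

    def lookup(self, path):
        return self.layouts[path]


class Client:
    def __init__(self, mds, targets):
        self.mds = mds
        self.targets = targets  # list of dicts acting as object storage targets

    def write(self, path, data):
        layout = self.mds.create(path, len(data))
        size = self.mds.stripe_size
        for i, target in enumerate(layout):
            self.targets[target][(path, i)] = data[i * size:(i + 1) * size]

    def read(self, path):
        layout = self.mds.lookup(path)  # one metadata request for the layout
        # The data itself comes straight from the object storage targets.
        return b"".join(self.targets[t][(path, i)] for i, t in enumerate(layout))


targets = [{} for _ in range(3)]
mds = MetadataService(n_targets=3, stripe_size=4)
client = Client(mds, targets)
client.write("/data/cells.bin", b"0123456789abcdef")
print(mds.lookup("/data/cells.bin"))
print(client.read("/data/cells.bin"))
```

Because the stripes are spread over several targets, reads and writes from many clients are not funnelled through one server, which is the scalability argument made for a PFS over NFS.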
parallel FS
Lustre
Metadata services Object storage
IOPs: I/O operations per second
https://azure.microsoft.com/en-ca/resources/parallel-virtual-file-systems-on-microsoft-azure/
https://techcommunity.microsoft.com/t5/azure-global/parallel-file-systems-for-hpc-storage-on-azure/ba-p/306223
No Stress for the Network!
μCluster benchmarks:
Palabos-npFEM
Red Blood Cells (red bodies) : 272
Platelets (yellow bodies) : 27
Reference case study
Hematocrit 20%, box 50x50x50 μm³
https://gitlab.com/unigespc/palabos.git (coupledSimulators/npFEM)
μCluster benchmarks:
Palabos-npFEM
x6.5 x8.1
Promising alternative considering:
• Energy consumption
• Low cost of components
μCluster benchmarks:
Palabos-npFEM
Intra-node
communication
2 nodes -
Gbit Ethernet
μCluster benchmarks:
Network Performance
OSU Micro-Benchmarks
Testing 2 nodes
osu_get_bw:
Bandwidth Test
MPI_Isend & MPI_Irecv
μCluster benchmarks:
Network Performance
OSU Micro-Benchmarks
Testing 2 nodes
osu_get_latency:
Latency Test
MPI_Send & MPI_Recv
ping-pong test
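To see how latency and bandwidth interact across message sizes, a simple latency-bandwidth model can help. This is a sketch, not what the OSU benchmarks compute internally: the 50 µs latency is a hypothetical value picked for illustration, and only the 1 Gbit/s link rate comes from the cluster description.

```python
def transfer_time(size_bytes, latency_s, bandwidth_bytes_s):
    """Time to deliver one message: fixed latency plus size over bandwidth."""
    return latency_s + size_bytes / bandwidth_bytes_s


def effective_bandwidth(size_bytes, latency_s, bandwidth_bytes_s):
    """Bandwidth actually observed for a message of the given size."""
    return size_bytes / transfer_time(size_bytes, latency_s, bandwidth_bytes_s)


LINK = 1e9 / 8     # 1 Gbit/s Ethernet expressed in bytes/s
LATENCY = 50e-6    # hypothetical one-way latency in seconds (assumption)

for size in (1, 1024, 1024 ** 2):
    bw = effective_bandwidth(size, LATENCY, LINK)
    print(f"{size:>8} B -> {bw / 1e6:8.2f} MB/s")
```

Small messages are dominated by latency and large ones by the link rate; in this model half of the peak bandwidth is reached at a message size of latency times bandwidth (6250 bytes for the assumed values).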
μCluster benchmarks: Network Performance mpiGraph
[Figure] Panels: Piz Daint, Baobab, μCluster (bandwidth vs. packet size)
μCluster benchmarks:
Network Performance
OSU Micro-Benchmarks
https://ulhpc-tutorials.readthedocs.io/en/latest/parallel/mpi/OSU_MicroBenchmarks/
Importance of:
• MPI implementation
• Network capabilities
Interconnect/Network Technologies
InfiniBand
Ethernet
Omni-Path
TOP500 June 2020
https://www.top500.org/
InfiniBand generations: theoretical effective throughput (Gbit/s)
Generation | SDR | DDR | QDR | FDR10 | FDR | EDR | HDR | NDR | XDR
1 link | 2 | 4 | 8 | 10 | 13.64 | 25 | 50 | 100 | 250
4 links | 8 | 16 | 32 | 40 | 54.54 | 100 | 200 | 400 | 1000
8 links | 16 | 32 | 64 | 80 | 109.08 | 200 | 400 | 800 | 2000
12 links | 24 | 48 | 96 | 120 | 163.64 | 300 | 600 | 1200 | 3000
Adapter latency (µs) | 5 | 2.5 | 1.3 | 0.7 | 0.7 | 0.5 | less? | t.b.d. | t.b.d.
Year | 2001, 2003 | 2005 | 2007 | 2011 | 2011 | 2014 | 2017 | after 2020 | after 2023?
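The multi-link rows of the InfiniBand table follow from the single-link row by multiplication. A minimal sketch, using the per-link values from the table (the dictionary and function names are hypothetical):

```python
PER_LINK_GBIT_S = {
    "SDR": 2, "DDR": 4, "QDR": 8, "FDR10": 10, "FDR": 13.64,
    "EDR": 25, "HDR": 50, "NDR": 100, "XDR": 250,
}


def aggregate_throughput(generation, links):
    """Theoretical effective throughput in Gbit/s for a port with `links` lanes."""
    return PER_LINK_GBIT_S[generation] * links


for links in (1, 4, 8, 12):
    row = [round(aggregate_throughput(g, links), 2) for g in PER_LINK_GBIT_S]
    print(f"{links:>2} links: {row}")
```

The FDR column differs from the table in the second decimal because 13.64 is itself a rounded per-link value.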
Interconnect/Network Technologies
InfiniBand (IB)
Omni-Path (Intel)
Aries (Cray)
copper or fiber optic wires
Each interconnect technology
comes with its own hardware:
• NIC: Network Interface Card
• Switch: Create Subnets
• Router: Link Subnets
• Bridge/Gateway: Bridge different
networks (IB & Ethernet)
Ethernet
https://en.wikipedia.org/wiki/InfiniBand
Interconnect/Network Technologies
NVLink (NVIDIA)
GPU-to-GPU data transfers at up to 160 Gbytes/s of bidirectional bandwidth, 5x the bandwidth of PCIe
https://prace-ri.eu/training-support/best-practice-guides/
Interconnect/Network Technologies
Absolute priorities:
• High bandwidth
• Low latency
But not only these:
• RDMA, combined with smartNICs/FPGAs (to take load off the CPUs)
RDMA over Converged Ethernet (RoCE) is a network protocol that allows remote direct memory access over an Ethernet network. It does this by encapsulating an IB transport packet over Ethernet.
https://prace-ri.eu/training-support/best-practice-guides/
Ethernet uses a hierarchical topology that demands more processing from the CPU. In contrast, IB uses a flat fabric topology in which the network card moves the data directly using RDMA requests, reducing CPU involvement.
Interconnect/Network Technologies
GPU-aware MPI - NVIDIA GPUDirect (family of technologies)
GPU-aware MPI implementations can automatically handle MPI transactions with pointers to GPU memory
• GPUDirect Peer to Peer among GPUs on the same node (native support through drivers & CUDA toolkit, over PCIe or NVLink)
• GPUDirect RDMA among GPUs on different nodes (specialized hardware with RDMA support)
• MVAPICH, Open MPI, Cray MPI, ...
How it works (@CSCS)
▪ Set the environment variable: export MPICH_RDMA_ENABLED_CUDA=1
▪ With the variable set, each pointer passed to MPI is checked to see whether it is in host or device memory. If the variable is not set, MPI assumes that all pointers point to host memory, and your application will probably crash with segmentation faults
https://www.cscs.ch/
Interconnect/Network Technologies
https://spcl.inf.ethz.ch/ (slide from Torsten Hoefler)
Interconnect/Network Technologies
Route to Exascale: Network Acceleration
sPIN: High-performance streaming Processing in the Network
Torsten Hoefler, Salvatore Di Girolamo, Konstantin Taranov, Ryan E. Grant, Ron Brightwell
• Today’s network cards contain rather powerful processors optimized for data movement
• Offload simple packet processing functions to the network card
• Portable packet-processing network acceleration model similar to compute acceleration with CUDA or OpenCL
https://spcl.inf.ethz.ch/ (slide from Torsten Hoefler)
Interconnect/Network Technologies
Route to Exascale & HPC network topologies
https://prace-ri.eu/wp-content/uploads/Best-Practice-Guide_Modern-Interconnects.pdf
Fat trees
Torus
Dragonfly
Good luck with the scalability on exascale machines without smart components/accelerators at the hotspots (compute, network, storage)
NVIDIA GPU Micro-architecture
(chronologically)
1. Tesla
2. Fermi
3. Kepler
4. Maxwell
5. Pascal (CSCS)
6. Volta (Summit ORNL)
7. Turing
8. Ampere
GPU Architecture
GPU Architecture
https://www.nvidia.com/
GPU Architecture
Evolution
Pascal Streaming Multiprocessor
Ampere Streaming Multiprocessor
https://www.nvidia.com/
Supported Tensor Core Precisions
• NVIDIA Ampere: FP64, TF32, bfloat16, FP16, INT8, INT4, INT1
• NVIDIA Turing: FP16, INT8, INT4, INT1
• NVIDIA Volta: FP16
Supported CUDA Core Precisions
• NVIDIA Ampere: FP64, FP32, FP16, bfloat16, INT8
• NVIDIA Turing: FP64, FP32, FP16, INT8
• NVIDIA Volta: FP64, FP32, FP16, INT8
GPU Architecture: Tensor Cores
Each Tensor Core operates on 4x4 matrices and performs the following operation: D = A × B + C
In Volta, each Tensor Core performs 64 floating-point FMA* operations per clock, and the eight Tensor Cores in an SM perform a total of 512 FMA operations per clock
*FMA (Fused Multiply-Add)
https://www.nvidia.com/
Vectorization like in CPUs
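The figure of 64 FMA operations per Tensor Core can be checked by counting the multiply-adds in one 4x4 matrix multiply-accumulate, D = A × B + C. The sketch below is a plain software model written for this purpose; it says nothing about how the hardware schedules the work, and the function name is hypothetical.

```python
def tensor_core_mma(a, b, c):
    """Compute D = A x B + C for 4x4 matrices and count the FMAs performed."""
    n = 4
    d = [row[:] for row in c]  # start from the accumulator C
    fma_count = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                d[i][j] += a[i][k] * b[k][j]  # one fused multiply-add
                fma_count += 1
    return d, fma_count


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b = [[float(4 * i + j) for j in range(4)] for i in range(4)]
c = [[1.0] * 4 for _ in range(4)]

d, fmas = tensor_core_mma(identity, b, c)
print(fmas)       # FMAs for one 4x4x4 operation
print(8 * fmas)   # eight Tensor Cores per Volta SM
```

Each of the 16 output elements needs 4 multiply-adds, giving 64 per Tensor Core and 512 for the eight Tensor Cores of an SM.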
GPU Architecture: Tensor Cores
How to use/activate them:
▪ The code simply needs to set a flag telling the API and drivers that you want to use Tensor Cores, the data type must be one supported by the cores, and the dimensions of the matrices must be multiples of 8. After that, the hardware handles everything else.
▪ The easiest way is through cuBLAS or other NVIDIA HPC libraries
▪ For now, avoid the low-level API for accessing them directly (ugly, like CPU vectorization)
Multiples of 8 for Tensor Cores & multiples of 32 for vanilla CUDA cores (warp scheduling)
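One common way to meet the multiple-of-8 rule is to pad the matrix dimensions up to the next multiple. A minimal sketch with hypothetical helper names; whether padding is the right choice depends on the application, since the padded entries must be zero-filled to leave the product unchanged.

```python
def round_up(n, multiple):
    """Smallest multiple of `multiple` that is >= n."""
    return -(-n // multiple) * multiple


def pad_gemm_dims(m, n, k, multiple=8):
    """Pad GEMM dimensions so each is a multiple of `multiple` (8 for Tensor Cores)."""
    return tuple(round_up(x, multiple) for x in (m, n, k))


print(pad_gemm_dims(1000, 1001, 50))  # Tensor Core friendly dimensions
print(round_up(100, 32))              # thread count rounded to whole warps
```

The same helper covers the warp-size rule for ordinary CUDA kernels, where thread counts are usually rounded up to a multiple of 32.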
// First, create a cuBLAS handle:
cublasHandle_t handle;
cublasStatus_t cublasStat = cublasCreate(&handle);
// Set the math mode to allow cuBLAS to use Tensor Cores:
cublasStat = cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
// Allocate and initialize your matrices (only the A matrix is shown):
size_t matrixSizeA = (size_t)rowsA * colsA;
T_ELEM_IN *devPtrA = NULL; // device copy of A
cudaMalloc((void**)&devPtrA, matrixSizeA * sizeof(T_ELEM_IN));
T_ELEM_IN *A = (T_ELEM_IN *)malloc(matrixSizeA * sizeof(T_ELEM_IN)); // host copy of A
memset(A, 0xFF, matrixSizeA * sizeof(T_ELEM_IN));
// Copy A from host to device (column-major, leading dimension rowsA):
cublasStat = cublasSetMatrix(rowsA, colsA, sizeof(T_ELEM_IN), A, rowsA, devPtrA, rowsA);
// ... allocate and initialize B and C the same way (devPtrB, devPtrC; not shown) ...
// Invoke the GEMM, ensuring k, lda, ldb, and ldc are all multiples of 8,
// and m is a multiple of 4. The matrix arguments are device pointers;
// alpha and beta point to scalars of the compute type (here FP32):
cublasStat = cublasGemmEx(handle, transa, transb, m, n, k, alpha,
                          devPtrA, CUDA_R_16F, lda,
                          devPtrB, CUDA_R_16F, ldb,
                          beta, devPtrC, CUDA_R_16F, ldc, CUDA_R_32F, algo);
Few Rules
• The routine must be a GEMM; currently,
only GEMMs support Tensor Core
execution.
• GEMMs that do not satisfy the rules will
fall back to a non-Tensor Core
implementation.
https://www.nvidia.com/
https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/
GPU Architecture: Tensor Cores (cuBLAS example)
Demo Time!
Future considerations:
• Hardware Homogeneity
• Fast Network
• Parallel Storage

More Related Content

What's hot

DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitchDPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitchJim St. Leger
 
DPDK Summit - 08 Sept 2014 - Futurewei - Jun Xu - Revisit the IP Stack in Lin...
DPDK Summit - 08 Sept 2014 - Futurewei - Jun Xu - Revisit the IP Stack in Lin...DPDK Summit - 08 Sept 2014 - Futurewei - Jun Xu - Revisit the IP Stack in Lin...
DPDK Summit - 08 Sept 2014 - Futurewei - Jun Xu - Revisit the IP Stack in Lin...Jim St. Leger
 
Cellular technology with Embedded Linux - COSCUP 2016
Cellular technology with Embedded Linux - COSCUP 2016Cellular technology with Embedded Linux - COSCUP 2016
Cellular technology with Embedded Linux - COSCUP 2016SZ Lin
 
In-Network Acceleration with FPGA (MEMO)
In-Network Acceleration with FPGA (MEMO)In-Network Acceleration with FPGA (MEMO)
In-Network Acceleration with FPGA (MEMO)Naoto MATSUMOTO
 
Lagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus presentation on 14th Annual ON*VECTOR International Photonics WorkshopLagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus presentation on 14th Annual ON*VECTOR International Photonics WorkshopLagopus SDN/OpenFlow switch
 
DPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet ProcessingDPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet ProcessingMichelle Holley
 
mSwitch: A Highly-Scalable, Modular Software Switch
mSwitch: A Highly-Scalable, Modular Software SwitchmSwitch: A Highly-Scalable, Modular Software Switch
mSwitch: A Highly-Scalable, Modular Software Switchmicchie
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLinside-BigData.com
 
Symmetric Crypto for DPDK - Declan Doherty
Symmetric Crypto for DPDK - Declan DohertySymmetric Crypto for DPDK - Declan Doherty
Symmetric Crypto for DPDK - Declan Dohertyharryvanhaaren
 
OVS and DPDK - T.F. Herbert, K. Traynor, M. Gray
OVS and DPDK - T.F. Herbert, K. Traynor, M. GrayOVS and DPDK - T.F. Herbert, K. Traynor, M. Gray
OVS and DPDK - T.F. Herbert, K. Traynor, M. Grayharryvanhaaren
 
Intel® RDT Hands-on Lab
Intel® RDT Hands-on LabIntel® RDT Hands-on Lab
Intel® RDT Hands-on LabMichelle Holley
 
Using VPP and SRIO-V with Clear Containers
Using VPP and SRIO-V with Clear ContainersUsing VPP and SRIO-V with Clear Containers
Using VPP and SRIO-V with Clear ContainersMichelle Holley
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuninginside-BigData.com
 
DPDK Summit 2015 - Intel - Keith Wiles
DPDK Summit 2015 - Intel - Keith WilesDPDK Summit 2015 - Intel - Keith Wiles
DPDK Summit 2015 - Intel - Keith WilesJim St. Leger
 
SDVIs and In-Situ Visualization on TACC's Stampede
SDVIs and In-Situ Visualization on TACC's StampedeSDVIs and In-Situ Visualization on TACC's Stampede
SDVIs and In-Situ Visualization on TACC's StampedeIntel® Software
 
Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Andriy Berestovskyy
 
Netsft2017 day in_life_of_nfv
Netsft2017 day in_life_of_nfvNetsft2017 day in_life_of_nfv
Netsft2017 day in_life_of_nfvIntel
 
Accelerate Big Data Processing with High-Performance Computing Technologies
Accelerate Big Data Processing with High-Performance Computing TechnologiesAccelerate Big Data Processing with High-Performance Computing Technologies
Accelerate Big Data Processing with High-Performance Computing TechnologiesIntel® Software
 

What's hot (20)

DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitchDPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
 
DPDK Summit - 08 Sept 2014 - Futurewei - Jun Xu - Revisit the IP Stack in Lin...
DPDK Summit - 08 Sept 2014 - Futurewei - Jun Xu - Revisit the IP Stack in Lin...DPDK Summit - 08 Sept 2014 - Futurewei - Jun Xu - Revisit the IP Stack in Lin...
DPDK Summit - 08 Sept 2014 - Futurewei - Jun Xu - Revisit the IP Stack in Lin...
 
Cellular technology with Embedded Linux - COSCUP 2016
Cellular technology with Embedded Linux - COSCUP 2016Cellular technology with Embedded Linux - COSCUP 2016
Cellular technology with Embedded Linux - COSCUP 2016
 
In-Network Acceleration with FPGA (MEMO)
In-Network Acceleration with FPGA (MEMO)In-Network Acceleration with FPGA (MEMO)
In-Network Acceleration with FPGA (MEMO)
 
Lagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus presentation on 14th Annual ON*VECTOR International Photonics WorkshopLagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
 
DPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet ProcessingDPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet Processing
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 
mSwitch: A Highly-Scalable, Modular Software Switch
mSwitch: A Highly-Scalable, Modular Software SwitchmSwitch: A Highly-Scalable, Modular Software Switch
mSwitch: A Highly-Scalable, Modular Software Switch
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 
Symmetric Crypto for DPDK - Declan Doherty
Symmetric Crypto for DPDK - Declan DohertySymmetric Crypto for DPDK - Declan Doherty
Symmetric Crypto for DPDK - Declan Doherty
 
OVS and DPDK - T.F. Herbert, K. Traynor, M. Gray
OVS and DPDK - T.F. Herbert, K. Traynor, M. GrayOVS and DPDK - T.F. Herbert, K. Traynor, M. Gray
OVS and DPDK - T.F. Herbert, K. Traynor, M. Gray
 
Intel® RDT Hands-on Lab
Intel® RDT Hands-on LabIntel® RDT Hands-on Lab
Intel® RDT Hands-on Lab
 
Intel dpdk Tutorial
Intel dpdk TutorialIntel dpdk Tutorial
Intel dpdk Tutorial
 
Using VPP and SRIO-V with Clear Containers
Using VPP and SRIO-V with Clear ContainersUsing VPP and SRIO-V with Clear Containers
Using VPP and SRIO-V with Clear Containers
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
DPDK Summit 2015 - Intel - Keith Wiles
DPDK Summit 2015 - Intel - Keith WilesDPDK Summit 2015 - Intel - Keith Wiles
DPDK Summit 2015 - Intel - Keith Wiles
 
SDVIs and In-Situ Visualization on TACC's Stampede
SDVIs and In-Situ Visualization on TACC's StampedeSDVIs and In-Situ Visualization on TACC's Stampede
SDVIs and In-Situ Visualization on TACC's Stampede
 
Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)
 
Netsft2017 day in_life_of_nfv
Netsft2017 day in_life_of_nfvNetsft2017 day in_life_of_nfv
Netsft2017 day in_life_of_nfv
 
Accelerate Big Data Processing with High-Performance Computing Technologies
Accelerate Big Data Processing with High-Performance Computing TechnologiesAccelerate Big Data Processing with High-Performance Computing Technologies
Accelerate Big Data Processing with High-Performance Computing Technologies
 

Similar to uCluster

Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)Ontico
 
PLNOG16: Obsługa 100M pps na platformie PC , Przemysław Frasunek, Paweł Mała...
PLNOG16: Obsługa 100M pps na platformie PC, Przemysław Frasunek, Paweł Mała...PLNOG16: Obsługa 100M pps na platformie PC, Przemysław Frasunek, Paweł Mała...
PLNOG16: Obsługa 100M pps na platformie PC , Przemysław Frasunek, Paweł Mała...PROIDEA
 
Software Network Data Plane - Satisfying the need for speed - FD.io - VPP and...
Software Network Data Plane - Satisfying the need for speed - FD.io - VPP and...Software Network Data Plane - Satisfying the need for speed - FD.io - VPP and...
Software Network Data Plane - Satisfying the need for speed - FD.io - VPP and...Haidee McMahon
 
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-PremiseTackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-PremiseDatabricks
 
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mão
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mãoWebinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mão
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mãoEmbarcados
 
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」PC Cluster Consortium
 
HPC Infrastructure To Solve The CFD Grand Challenge
HPC Infrastructure To Solve The CFD Grand ChallengeHPC Infrastructure To Solve The CFD Grand Challenge
HPC Infrastructure To Solve The CFD Grand ChallengeAnand Haridass
 
[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene
[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene
[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-GeneOpenStack Korea Community
 
HiPEAC Computing Systems Week 2022_Mario Porrmann presentation
HiPEAC Computing Systems Week 2022_Mario Porrmann presentationHiPEAC Computing Systems Week 2022_Mario Porrmann presentation
HiPEAC Computing Systems Week 2022_Mario Porrmann presentationVEDLIoT Project
 
AI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersAI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersCastLabKAIST
 
Parallel Rendering of Webpages
Parallel Rendering of WebpagesParallel Rendering of Webpages
Parallel Rendering of WebpagesLangtech
 
組み込みから HPC まで ARM コアで実現するエコシステム
組み込みから HPC まで ARM コアで実現するエコシステム組み込みから HPC まで ARM コアで実現するエコシステム
組み込みから HPC まで ARM コアで実現するエコシステムShinnosuke Furuya
 
CIF16: Building the Superfluid Cloud with Unikernels (Simon Kuenzer, NEC Europe)
CIF16: Building the Superfluid Cloud with Unikernels (Simon Kuenzer, NEC Europe)CIF16: Building the Superfluid Cloud with Unikernels (Simon Kuenzer, NEC Europe)
CIF16: Building the Superfluid Cloud with Unikernels (Simon Kuenzer, NEC Europe)The Linux Foundation
 
DATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe ConferenceDATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe ConferenceLEGATO project
 

Similar to uCluster (20)

Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
 
PLNOG16: Obsługa 100M pps na platformie PC , Przemysław Frasunek, Paweł Mała...
PLNOG16: Obsługa 100M pps na platformie PC, Przemysław Frasunek, Paweł Mała...PLNOG16: Obsługa 100M pps na platformie PC, Przemysław Frasunek, Paweł Mała...
PLNOG16: Obsługa 100M pps na platformie PC , Przemysław Frasunek, Paweł Mała...
 
Software Network Data Plane - Satisfying the need for speed - FD.io - VPP and...
Software Network Data Plane - Satisfying the need for speed - FD.io - VPP and...Software Network Data Plane - Satisfying the need for speed - FD.io - VPP and...
Software Network Data Plane - Satisfying the need for speed - FD.io - VPP and...
 
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-PremiseTackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
 
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mão
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mãoWebinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mão
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mão
 
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
 
HPC Infrastructure To Solve The CFD Grand Challenge
HPC Infrastructure To Solve The CFD Grand ChallengeHPC Infrastructure To Solve The CFD Grand Challenge
HPC Infrastructure To Solve The CFD Grand Challenge
 
[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene
[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene
[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene
 
HiPEAC Computing Systems Week 2022_Mario Porrmann presentation
HiPEAC Computing Systems Week 2022_Mario Porrmann presentationHiPEAC Computing Systems Week 2022_Mario Porrmann presentation
HiPEAC Computing Systems Week 2022_Mario Porrmann presentation
 
Cluster computing
Cluster computingCluster computing
Cluster computing
 
NWU and HPC
NWU and HPCNWU and HPC
NWU and HPC
 
AI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersAI Accelerators for Cloud Datacenters
AI Accelerators for Cloud Datacenters
 
Parallel Rendering of Webpages
Parallel Rendering of WebpagesParallel Rendering of Webpages
Parallel Rendering of Webpages
 
組み込みから HPC まで ARM コアで実現するエコシステム
組み込みから HPC まで ARM コアで実現するエコシステム組み込みから HPC まで ARM コアで実現するエコシステム
組み込みから HPC まで ARM コアで実現するエコシステム
 
100 M pps on PC.
100 M pps on PC.100 M pps on PC.
100 M pps on PC.
 
The Universal Dataplane
The Universal DataplaneThe Universal Dataplane
The Universal Dataplane
 
CIF16: Building the Superfluid Cloud with Unikernels (Simon Kuenzer, NEC Europe)
CIF16: Building the Superfluid Cloud with Unikernels (Simon Kuenzer, NEC Europe)CIF16: Building the Superfluid Cloud with Unikernels (Simon Kuenzer, NEC Europe)
CIF16: Building the Superfluid Cloud with Unikernels (Simon Kuenzer, NEC Europe)
 
Tos tutorial
Tos tutorialTos tutorial
Tos tutorial
 
DATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe ConferenceDATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe Conference
 
Current Trends in HPC
Current Trends in HPCCurrent Trends in HPC
Current Trends in HPC
 

Recently uploaded

TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...Nguyen Thanh Tu Collection
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...Poonam Aher Patil
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docxPoojaSen20
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxVishalSingh1417
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptxMaritesTamaniVerdade
 
Magic bus Group work1and 2 (Team 3).pptx
Magic bus Group work1and 2 (Team 3).pptxMagic bus Group work1and 2 (Team 3).pptx
Magic bus Group work1and 2 (Team 3).pptxdhanalakshmis0310
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 
uCluster

  • 1. Kotsalos Christos Scientific and Parallel Computing Group (SPC) μCluster + Supercomputing/Cluster Technologies
  • 2. μCluster: 2 NVIDIA Jetson Nano boards; 3 Raspberry Pi boards; 1 NVIDIA Jetson TX2 board; 1 network switch (16 ports, 1 Gbit/s); Cat. 6 network cables (1 Gbit/s)
  • 3. μCluster components: NVIDIA Jetson Nano Developer Kit. GPU: 128-core Maxwell (CUDA). CPU: quad-core ARM A57 @ 1.43 GHz. Memory: 4 GB 64-bit LPDDR4, 25.6 GB/s. Storage: microSD. Connectivity: Gigabit Ethernet. Display: HDMI and DisplayPort. USB: 4x USB 3.0, USB 2.0 Micro-B. https://developer.nvidia.com/embedded/jetson-nano-developer-kit
  • 4. μCluster components: NVIDIA Jetson TX2 Developer Kit, technical specifications. JETSON TX2 MODULE: • NVIDIA Pascal architecture GPU • 2 Denver 64-bit CPUs + quad-core A57 • 8 GB 128-bit LPDDR4 memory • 32 GB eMMC 5.1 flash storage • Connectivity to 802.11ac Wi-Fi and Bluetooth-enabled devices • 10/100/1000BASE-T Ethernet. I/O: • USB 3.0 Type A • USB 2.0 Micro AB • HDMI • Gigabit Ethernet • Full-size SD • SATA data and power. POWER OPTIONS: • External 19V AC adapter. https://developer.nvidia.com/embedded/jetson-tx2-developer-kit
  • 5. Jetson family: Developer Kits and Modules. Marketed for AI apps (Tensor Cores). https://developer.nvidia.com/embedded/jetson-modules
  • 6. Jetson family specs (https://developer.nvidia.com/embedded/jetson-modules)
      Modules: Jetson Nano; Jetson TX2 series (TX2 4GB, TX2, TX2i); Jetson Xavier NX; Jetson AGX Xavier series
      AI performance: Nano 472 GFLOPs (FP16); TX2 4GB and TX2 1.33 TFLOPs (FP16); TX2i 1.26 TFLOPs (FP16); Xavier NX 21 TOPs (INT8); AGX Xavier 32 TOPs (INT8)
      GPU: Nano 128-core NVIDIA Maxwell GPU; TX2 series 256-core NVIDIA Pascal GPU; Xavier NX 384-core NVIDIA Volta GPU with 48 Tensor Cores; AGX Xavier 512-core NVIDIA Volta GPU with 64 Tensor Cores
      CPU: Nano quad-core ARM Cortex-A57; TX2 series dual-core NVIDIA Denver 1.5 64-bit CPU and quad-core ARM Cortex-A57; Xavier NX 6-core NVIDIA Carmel ARM v8.2 64-bit CPU, 6 MB L2 + 4 MB L3; AGX Xavier 8-core NVIDIA Carmel ARM v8.2 64-bit CPU, 8 MB L2 + 4 MB L3
      Memory: Nano 4 GB 64-bit LPDDR4, 25.6 GB/s; TX2 4GB 4 GB 128-bit LPDDR4, 51.2 GB/s; TX2 8 GB 128-bit LPDDR4, 59.7 GB/s; TX2i 8 GB 128-bit LPDDR4 (ECC support), 51.2 GB/s; Xavier NX 8 GB 128-bit LPDDR4x, 51.2 GB/s; AGX Xavier 32 GB 256-bit LPDDR4x, 136.5 GB/s
      Storage: Nano 16 GB eMMC 5.1; TX2 4GB 16 GB eMMC 5.1; TX2 and TX2i 32 GB eMMC 5.1; Xavier NX 16 GB eMMC 5.1; AGX Xavier 32 GB eMMC 5.1
      Power: Nano 5W / 10W; TX2 4GB and TX2 7.5W / 15W; TX2i 10W / 20W; Xavier NX 10W / 15W; AGX Xavier 10W / 15W / 30W
      Networking: 10/100/1000 BASE-T Ethernet on all modules; the TX2 additionally has WLAN
      Tesla P100-PCIE-16GB: 16 GB RAM, 3584 CUDA cores
      TOPS: tera-operations per second
  • 7. μCluster components: Raspberry Pi 3 Model B • Quad-core 1.2 GHz Broadcom BCM2837 64-bit CPU • 1 GB RAM • BCM43438 wireless LAN and Bluetooth Low Energy (BLE) on board • 100 Base Ethernet • microSD port for loading your operating system and storing data. https://www.raspberrypi.org/products/raspberry-pi-3-model-b-plus/
  • 8. Raspberry Pi latest version https://www.raspberrypi.org/products/raspberry-pi-4-model-b/
  • 9. Turing Pi (alternative path) https://turingpi.com/v1/ Raspberry Pi Compute Module
  • 10. Turing Pi (alternative path) https://turingpi.com/v1/
  • 11. Steps to build the μCluster
      1. Install the OS (on just one board)
      2. Install the software (on just one board)
      3. Copy the image (OS + software) to all the other boards
      4. Set up static IPs
      5. Set up NFS (Network File System)
      6. Run your MPI applications using a hostfile, where # is the number of processes:
         mpirun -np # --hostfile myHostFile ./executable $(arguments)
      Contents of myHostFile (IPs of the available infrastructure, which can be found in /etc/hosts & /etc/hostname):
         j01 slots=4 max-slots=4 # Jetson TX2
         j02 slots=4 max-slots=4 # Jetson Nano
         j03 slots=4 max-slots=4 # Jetson Nano
         j04 slots=4 max-slots=4 # Raspberry Pi
         j05 slots=4 max-slots=4 # Raspberry Pi
         j06 slots=4 max-slots=4 # Raspberry Pi
      SLURM is an alternative (resource manager).
      https://selkie-macalester.org/csinparallel/modules/RosieCluster/build/html/#
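The hostfile from step 6 is easy to generate from a node inventory, which helps when boards are added or removed. The following is a minimal sketch, not part of the original slides: the function names `build_hostfile` and `total_slots` and the `NODES` list are hypothetical, and the line format simply mirrors the `slots=`/`max-slots=` entries shown on the slide.

```python
from typing import List, Tuple

# (hostname, slots, label) for each board; values copied from the slide's hostfile.
NODES: List[Tuple[str, int, str]] = [
    ("j01", 4, "Jetson TX2"),
    ("j02", 4, "Jetson Nano"),
    ("j03", 4, "Jetson Nano"),
    ("j04", 4, "Raspberry Pi"),
    ("j05", 4, "Raspberry Pi"),
    ("j06", 4, "Raspberry Pi"),
]


def build_hostfile(nodes: List[Tuple[str, int, str]]) -> str:
    """Return hostfile text with one 'host slots=N max-slots=N # label' line per node."""
    lines = [
        f"{host} slots={slots} max-slots={slots} # {label}"
        for host, slots, label in nodes
    ]
    return "\n".join(lines) + "\n"


def total_slots(nodes: List[Tuple[str, int, str]]) -> int:
    """Largest process count for -np that does not oversubscribe any node."""
    return sum(slots for _, slots, _ in nodes)


if __name__ == "__main__":
    print(build_hostfile(NODES), end="")
    print(f"mpirun -np {total_slots(NODES)} --hostfile myHostFile ./executable")
```

Writing the returned text to `myHostFile` and using the total slot count as the `-np` value fills every core of the six boards without oversubscription.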
  • 12. NFS vs parallel FS. NFS: all nodes (n0, n1, …, nm) access a single storage server, which becomes a hotspot: NOT SCALABLE. PFS: the nodes (n0, n1, …, nm, …, nk) access several storage servers: SCALABLE, no stress for the network! Jakob Blomer, "Experiences on File Systems: Which is the best file system for you?", CERN PH/SFT, 2015. https://developer.ibm.com/tutorials/l-network-filesystems/
  • 13. parallel FS. Metadata services store namespace metadata, such as filenames, directories, access permissions, and file layout. Object storage contains the actual file data. Clients pull the location of files and directories from the metadata services, then access file storage directly. Popular PFS: Lustre, BeeGFS, GlusterFS. https://azure.microsoft.com/en-ca/resources/parallel-virtual-file-systems-on-microsoft-azure/ https://techcommunity.microsoft.com/t5/azure-global/parallel-file-systems-for-hpc-storage-on-azure/ba-p/306223
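The split between metadata services and object storage can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: the class and function names (`MetadataService`, `ObjectServer`, `write_file`, `read_file`) and the round-robin striping policy are hypothetical and do not correspond to the API of Lustre, BeeGFS, or GlusterFS.

```python
from typing import Dict, List, Tuple

# A layout is the ordered list of (server index, object id) pairs that make up a file.
Layout = List[Tuple[int, int]]


class MetadataService:
    """Holds only namespace metadata (path -> layout), never file data."""

    def __init__(self) -> None:
        self.layouts: Dict[str, Layout] = {}

    def set_layout(self, path: str, layout: Layout) -> None:
        self.layouts[path] = layout

    def lookup(self, path: str) -> Layout:
        return self.layouts[path]


class ObjectServer:
    """Holds the actual file data as numbered objects."""

    def __init__(self) -> None:
        self.objects: Dict[int, bytes] = {}

    def put(self, data: bytes) -> int:
        object_id = len(self.objects)
        self.objects[object_id] = data
        return object_id

    def get(self, object_id: int) -> bytes:
        return self.objects[object_id]


def write_file(path: str, data: bytes, mds: MetadataService,
               servers: List[ObjectServer], stripe_size: int = 4) -> None:
    """Stripe the data round-robin over the servers, then record the layout."""
    layout: Layout = []
    for i, offset in enumerate(range(0, len(data), stripe_size)):
        server_index = i % len(servers)
        chunk = data[offset:offset + stripe_size]
        layout.append((server_index, servers[server_index].put(chunk)))
    mds.set_layout(path, layout)


def read_file(path: str, mds: MetadataService, servers: List[ObjectServer]) -> bytes:
    """One metadata lookup, then direct access to the object servers."""
    layout = mds.lookup(path)
    return b"".join(servers[s].get(obj) for s, obj in layout)
```

Because the data traffic goes straight to several object servers and the metadata service is contacted only once per file, no single server carries the whole load, which is the scalability argument made on the slide.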
  • 14. parallel FS Lustre Metadata services Object storage IOPs: I/O operations per second https://azure.microsoft.com/en-ca/resources/parallel-virtual-file-systems-on-microsoft-azure/ https://techcommunity.microsoft.com/t5/azure-global/parallel-file-systems-for-hpc-storage-on-azure/ba-p/306223 No Stress for the Network!
  • 15. μCluster benchmarks: Palabos-npFEM. Reference case study: hematocrit 20%, box 50x50x50 μm³. Red blood cells (red bodies): 272. Platelets (yellow bodies): 27. https://gitlab.com/unigespc/palabos.git (coupledSimulators/npFEM)
  • 16. μCluster benchmarks: Palabos-npFEM x6.5 x8.1 Promising alternative considering: • Energy consumption • Low cost of components
  • 18. μCluster benchmarks: Network Performance OSU Micro-Benchmarks Testing 2 nodes osu_get_bw: Bandwidth Test MPI_Isend & MPI_Irecv
  • 19. μCluster benchmarks: Network Performance OSU Micro-Benchmarks Testing 2 nodes osu_get_latency: Latency Test MPI_Send & MPI_Recv ping-pong test
  • 20. μCluster benchmarks: Network Performance. mpiGraph: Piz Daint, Baobab, μCluster
  • 21. μCluster benchmarks: Network Performance. OSU Micro-Benchmarks (plot: bandwidth vs. packet size). https://ulhpc-tutorials.readthedocs.io/en/latest/parallel/mpi/OSU_MicroBenchmarks/ Importance of: • MPI implementation • Network capabilities
  • 23. Interconnect/Network Technologies: InfiniBand (IB), Omni-Path (Intel), Aries (Cray), Ethernet. Copper or fiber optic wires.
      Each interconnect technology comes with its own hardware:
      • NIC: Network Interface Card
      • Switch: creates subnets
      • Router: links subnets
      • Bridge/Gateway: bridges different networks (IB & Ethernet)
      InfiniBand generations: theoretical effective throughput in Gbit/s (for 1 / 4 / 8 / 12 links); adapter latency; year
      SDR: 2 / 8 / 16 / 24; 5 µs; 2001, 2003
      DDR: 4 / 16 / 32 / 48; 2.5 µs; 2005
      QDR: 8 / 32 / 64 / 96; 1.3 µs; 2007
      FDR10: 10 / 40 / 80 / 120; 0.7 µs; 2011
      FDR: 13.64 / 54.54 / 109.08 / 163.64; 0.7 µs; 2011
      EDR: 25 / 100 / 200 / 300; 0.5 µs; 2014
      HDR: 50 / 200 / 400 / 600; less?; 2017
      NDR: 100 / 400 / 800 / 1200; t.b.d.; after 2020
      XDR: 250 / 1000 / 2000 / 3000; t.b.d.; after 2023?
      https://en.wikipedia.org/wiki/InfiniBand
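The effective throughput figures for the earlier InfiniBand generations follow from the per-lane signaling rate, the line-encoding overhead, and the number of links. The sketch below reproduces that arithmetic for SDR through EDR. The signaling rates and encodings in the two dictionaries are not on the slide; they are assumptions taken from commonly cited InfiniBand figures, and the function name `effective_throughput` is hypothetical.

```python
# Per-lane signaling rate in Gbit/s (assumed values, not stated on the slide).
SIGNALING_GBITS = {
    "SDR": 2.5,
    "DDR": 5.0,
    "QDR": 10.0,
    "FDR10": 10.3125,
    "FDR": 14.0625,
    "EDR": 25.78125,
}

# Line encoding as (payload bits, transmitted bits); also an assumption.
ENCODING = {
    "SDR": (8, 10),
    "DDR": (8, 10),
    "QDR": (8, 10),
    "FDR10": (64, 66),
    "FDR": (64, 66),
    "EDR": (64, 66),
}


def effective_throughput(generation: str, links: int = 1) -> float:
    """Theoretical effective throughput in Gbit/s for the given number of links."""
    payload_bits, line_bits = ENCODING[generation]
    return SIGNALING_GBITS[generation] * payload_bits / line_bits * links


if __name__ == "__main__":
    for generation in SIGNALING_GBITS:
        row = [round(effective_throughput(generation, n), 2) for n in (1, 4, 8, 12)]
        print(generation, row)
```

For FDR the exact per-link value is about 13.636 Gbit/s, so multi-link results can differ from the slide's table in the last digit depending on whether values are rounded or truncated.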
  • 24. Interconnect/Network Technologies NVLink (NVIDIA) GPU-to-GPU data transfers at up to 160 Gbytes/s of bidirectional bandwidth, 5x the bandwidth of PCIe https://prace-ri.eu/training-support/best-practice-guides/
  • 25. Interconnect/Network Technologies. Absolute priorities: high bandwidth and low latency. But not only: RDMA, combined with smartNICs/FPGAs (to reduce the stress on CPUs). RDMA over Converged Ethernet (RoCE) is a network protocol that allows remote direct memory access over an Ethernet network. It does this by encapsulating an IB transport packet over Ethernet. Ethernet uses a hierarchical topology, which demands more computing power from the CPU; in contrast, in the flat fabric topology of IB the data is moved directly by the network card using RDMA requests, reducing CPU involvement. https://prace-ri.eu/training-support/best-practice-guides/
  • 26. Interconnect/Network Technologies: GPU-aware MPI - NVIDIA GPUDirect (family of technologies). GPU-aware MPI implementations can automatically handle MPI transactions with pointers to GPU memory.
      • GPUDirect Peer to Peer: among GPUs on the same node (native support through drivers & CUDA toolkit, through PCIe or NVLink)
      • GPUDirect RDMA: among GPUs on different nodes (specialized hardware with RDMA support)
      • Implementations: MVAPICH, Open MPI, Cray MPI, ...
      How it works (@CSCS):
      ▪ Set the environment variable: export MPICH_RDMA_ENABLED_CUDA=1
      ▪ Each pointer passed to MPI is then checked to see if it is in host or device memory. If the variable is not set, MPI assumes that all pointers point at host memory, and your application will probably crash with seg faults.
      https://www.cscs.ch/
  • 28. Interconnect/Network Technologies Route to Exascale: Network Acceleration sPIN: High-performance streaming Processing in the Network Torsten Hoefler, Salvatore Di Girolamo, Konstantin Taranov, Ryan E. Grant, Ron Brightwell • Today’s network cards contain rather powerful processors optimized for data movement • Offload simple packet processing functions to the network card • Portable packet-processing network acceleration model similar to compute acceleration with CUDA or OpenCL https://spcl.inf.ethz.ch/ (slide from Torsten Hoefler)
  • 29. Interconnect/Network Technologies Route to Exascale & HPC network topologies https://prace-ri.eu/wp-content/uploads/Best-Practice-Guide_Modern-Interconnects.pdf Fat trees Torus Dragonfly Good luck with the scalability on exascale machines without smart components/accelerators at the hotspots (compute, network, storage)
  • 30. GPU Architecture. NVIDIA GPU micro-architectures (chronologically): 1. Tesla 2. Fermi 3. Kepler 4. Maxwell 5. Pascal (CSCS) 6. Volta (Summit ORNL) 7. Turing 8. Ampere
  • 32. GPU Architecture Evolution Pascal Streaming Multiprocessor Ampere Streaming Multiprocessor https://www.nvidia.com/
  • 33. GPU Architecture: Tensor Cores
      Supported Tensor Core precisions:
        NVIDIA Ampere: FP64, TF32, bfloat16, FP16, INT8, INT4, INT1
        NVIDIA Turing: FP16, INT8, INT4, INT1
        NVIDIA Volta: FP16
      Supported CUDA Core precisions:
        NVIDIA Ampere: FP64, FP32, FP16, bfloat16, INT8
        NVIDIA Turing: FP64, FP32, FP16, INT8
        NVIDIA Volta: FP64, FP32, FP16, INT8
      Each Tensor Core operates on 4x4 matrices and performs the operation D = A × B + C. In Volta, each Tensor Core performs 64 floating point FMA* operations per clock, and eight Tensor Cores in an SM perform a total of 512 FMA operations per clock. Vectorization like in CPUs.
      *FMA (Fused Multiply-Accumulate)
      https://www.nvidia.com/
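The figure of 64 FMA operations per Tensor Core follows directly from the 4x4 matrix shape. The sketch below emulates the multiply-accumulate in plain Python and counts the fused operations, assuming the operation is D = A × B + C on 4x4 matrices as documented by NVIDIA. The function name `tensor_core_mma` is hypothetical, and the code models only the arithmetic, not the hardware or its mixed-precision behaviour.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def tensor_core_mma(a: Matrix, b: Matrix, c: Matrix) -> Tuple[Matrix, int]:
    """Compute D = A x B + C for 4x4 matrices and count the FMA operations used."""
    n = 4
    d = [row[:] for row in c]  # start from C so every update is multiply-then-accumulate
    fma_count = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                d[i][j] += a[i][k] * b[k][j]  # one fused multiply-accumulate
                fma_count += 1
    return d, fma_count


if __name__ == "__main__":
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    ones = [[1.0] * 4 for _ in range(4)]
    d, count = tensor_core_mma(identity, ones, ones)
    print(d[0], count, 8 * count)
```

Each of the 16 output elements needs 4 multiply-accumulate steps, giving 4 × 4 × 4 = 64 FMAs per Tensor Core, and eight Tensor Cores per SM give 8 × 64 = 512.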
  • 34. GPU Architecture: Tensor Cores. How to use/activate them:
      ▪ The code simply needs to use a flag to tell the API and drivers that you want to use Tensor Cores, the data type needs to be one supported by the cores, and the dimensions of the matrices need to be a multiple of 8. After that, the hardware will handle everything else.
      ▪ The easiest way is through cuBLAS or other NVIDIA HPC libraries.
      ▪ For now, avoid the low-level API to access them (ugly like CPU vectorization).
      Multiple of 8 for Tensor Cores & multiple of 32 for vanilla CUDA cores (warp scheduling).
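One practical consequence of the multiple-of-8 rule is that matrix dimensions are often padded before calling the library. The helper below is a small sketch of that rounding step; the names `round_up` and `tensor_core_friendly` are hypothetical, and whether zero-padding is acceptable depends on the application.

```python
from typing import Tuple


def round_up(value: int, multiple: int) -> int:
    """Smallest integer >= value that is divisible by multiple."""
    return ((value + multiple - 1) // multiple) * multiple


def tensor_core_friendly(m: int, n: int, k: int, multiple: int = 8) -> Tuple[int, int, int]:
    """Pad GEMM dimensions (m, n, k) up to the next multiple (8 by default)."""
    return (round_up(m, multiple), round_up(n, multiple), round_up(k, multiple))


if __name__ == "__main__":
    print(tensor_core_friendly(1000, 30, 7))
    print(round_up(100, 32))  # the same rounding applied to a warp-sized multiple
```

The same rounding with a multiple of 32 gives sizes aligned with warp scheduling on ordinary CUDA cores.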
  • 35. GPU Architecture: Tensor Cores (cuBLAS example)
      // First, create a cuBLAS handle:
      cublasHandle_t handle;
      cublasStatus_t cublasStat = cublasCreate(&handle);
      // Set the math mode to allow cuBLAS to use Tensor Cores:
      cublasStat = cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
      // Allocate and initialize your matrices (only the A matrix is shown):
      size_t matrixSizeA = (size_t)rowsA * colsA;
      T_ELEM_IN *devPtrA = NULL;  // device copy of A
      cudaMalloc((void**)&devPtrA, matrixSizeA * sizeof(devPtrA[0]));
      T_ELEM_IN *A = (T_ELEM_IN *)malloc(matrixSizeA * sizeof(A[0]));  // host copy of A
      memset(A, 0xFF, matrixSizeA * sizeof(A[0]));
      // Copy the host matrix to the device:
      cublasStat = cublasSetMatrix(rowsA, colsA, sizeof(A[0]), A, rowsA, devPtrA, rowsA);
      // ... allocate and initialize B and C matrices (not shown);
      // devPtrB and devPtrC below stand for their device copies ...
      // Invoke the GEMM, ensuring k, lda, ldb, and ldc are all multiples of 8,
      // and m is a multiple of 4:
      cublasStat = cublasGemmEx(handle, transa, transb, m, n, k, alpha,
                                devPtrA, CUDA_R_16F, lda,
                                devPtrB, CUDA_R_16F, ldb,
                                beta, devPtrC, CUDA_R_16F, ldc, CUDA_R_32F, algo);
      Few rules:
      • The routine must be a GEMM; currently, only GEMMs support Tensor Core execution.
      • GEMMs that do not satisfy the rules will fall back to a non-Tensor Core implementation.
      https://www.nvidia.com/ https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/
  • 36. Demo Time! Future considerations: • Hardware Homogeneity • Fast Network • Parallel Storage