
Top 10 Supercomputers With Descriptive Information & Analysis




Submitted by: NOMAN SIDDIQUI
Sec: A (Evening)
Seat No.: EB21102087
3rd Semester (BSCS)

Assignment Report: Top 10 Supercomputers With Descriptive Information & Analysis

Submitted to: SIR KHALID AHMED
Department of Computer Science (UBIT)
UNIVERSITY OF KARACHI
Top 10 Supercomputers Report

What is a Supercomputer?
A supercomputer is a computer with a high level of performance compared to a general-purpose computer. The performance of a supercomputer is commonly measured in floating-point operations per second (FLOPS) instead of million instructions per second (MIPS). Since 2017, there have been supercomputers that can perform over 10^17 FLOPS (a hundred quadrillion FLOPS, i.e. 100 petaFLOPS or 100 PFLOPS).

Supercomputers play an important role in the field of computational science, and are used for a wide range of computationally intensive tasks in various fields, including quantum mechanics, weather forecasting, climate research, oil and gas exploration, molecular modeling (computing the structures and properties of chemical compounds, biological macromolecules, polymers, and crystals), and physical simulations (such as simulations of the early moments of the universe, airplane and spacecraft aerodynamics, the detonation of nuclear weapons, and nuclear fusion). They have also been essential in the field of cryptanalysis.

1. The Fugaku Supercomputer

Introduction:
Fugaku is a supercomputer at the RIKEN Center for Computational Science in Kobe, Japan; it is a petascale machine on the mainstream double-precision benchmark, while reaching exascale on reduced-precision benchmarks. Its development started in 2014 as the successor to the K computer, and it began operating in 2021. Fugaku made its debut in 2020 and became the fastest supercomputer in the world on the June 2020 TOP500 list, the first ARM architecture-based computer to achieve this. In June 2020 it achieved 1.42 exaFLOPS on the HPL-AI benchmark, making it the first supercomputer ever to reach 1 exaFLOPS. As of November 2021, Fugaku is the fastest supercomputer in the world. It is named after an alternative name for Mount Fuji.
Block Diagram:

Functional Units:
Functional Units, Co-Design and System for the Supercomputer “Fugaku”

1. Performance estimation tool: This tool takes Fujitsu FX100 execution profile data as input (the FX100 is the previous Fujitsu supercomputer) and enables performance projection for a given set of architecture parameters. The projection is modeled on the Fujitsu microarchitecture. The tool can also estimate power consumption based on the architecture model.

2. Fujitsu in-house processor simulator: An extended FX100 SPARC instruction-set simulator and compiler, developed by Fujitsu, was used for preliminary studies in the initial phase, and an Armv8+SVE simulator and compiler afterward.

3. Gem5 simulator for the Post-K processor: The Post-K processor simulator, based on the open-source system-level processor simulator Gem5, was developed by RIKEN during the co-design for architecture verification and performance tuning.

A fundamental problem is the scale of the scientific applications expected to run on Post-K. Even the target applications are thousands of lines of code and use complex algorithms and data structures. Although the processor simulators are capable of providing very accurate performance results at the cycle level, they are very slow and are limited to execution on a single processor without MPI communications between nodes. The performance estimation tool is therefore useful, since it enables performance analysis based on the execution profile taken from an actual run on the FX100 hardware. It has a rich set of performance counters, including busy cycles for read/write memory access, busy cycles for L1/L2 cache access, busy cycles of floating-point arithmetic, and cycles for instruction commit. These features enable the performance projection for a new set of hardware parameters by changing the busy cycles of the functional blocks. The breakdown of the execution time (in cycles) can be calculated by summing the busy cycles of each functional block in the pipeline according to the processor microarchitecture. Since the execution time is estimated by a simple formula modeling the pipeline, it can be applied to a region of uniform behavior such as a kernel loop.
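The cycle-accounting idea described above can be illustrated with a minimal Python sketch. This is not the actual Fujitsu/RIKEN tool: the block names, the profile values, and the scaling factors below are hypothetical, chosen only to show how per-block busy cycles can be summed and then rescaled for a new set of architecture parameters.

# Minimal sketch of a busy-cycle performance projection (illustrative only).
# A kernel profile holds busy-cycle counts per functional block, as measured
# on a baseline machine (the numbers below are made up).
baseline_profile = {
    "memory_access": 4.0e9,   # busy cycles for read/write memory access
    "l1_l2_cache":   1.5e9,   # busy cycles for L1/L2 cache access
    "fp_arithmetic": 6.0e9,   # busy cycles of floating-point arithmetic
    "commit":        0.5e9,   # cycles for instruction commit
}

# Assumed per-block speed-up factors of a candidate architecture, e.g. wider
# SIMD halves floating-point busy cycles, faster memory reduces memory cycles.
scaling = {
    "memory_access": 1 / 1.5,
    "l1_l2_cache":   1 / 1.2,
    "fp_arithmetic": 1 / 2.0,
    "commit":        1.0,
}

def projected_cycles(profile, scale):
    # Sum the rescaled busy cycles of every functional block.
    return sum(cycles * scale[block] for block, cycles in profile.items())

baseline_cycles = sum(baseline_profile.values())
new_cycles = projected_cycles(baseline_profile, scaling)
print(f"baseline kernel cycles : {baseline_cycles:.3e}")
print(f"projected kernel cycles: {new_cycles:.3e}")
print(f"projected speed-up     : {baseline_cycles / new_cycles:.2f}x")

In the real workflow the same estimate is made kernel by kernel, and the total execution time is the sum of the per-kernel estimates, as described below.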
The first step of the performance analysis is to identify the kernels in each target application and insert library calls to collect the execution profile. The total execution time is calculated by summing the estimated execution time of each kernel using the performance estimation tool with a given set of architecture parameters. This process was repeated while changing several architecture parameters for design-space exploration. Some important kernels were extracted as independent programs; these kernels can be executed by the cycle-level processor simulators for more accurate analysis. Since the performance estimation tool cannot take the impact of the out-of-order (O3) resources into account, the Fujitsu in-house processor simulator was used to analyze the new instruction set and the effect of changing the O3 resources. The same kernels were also used with the processor emulator for logic-design verification.

Co-Design of the Manycore Processor:
Prior to the FLAGSHIP 2020 project, feasibility study projects were carried out from 2012 to 2013 to investigate the basic design. The basic architecture suggested by the feasibility study was a large-scale system using a general-purpose manycore processor with wide single-instruction/multiple-data (SIMD) arithmetic units. The choice of the instruction set architecture was an important decision. Fujitsu offered the Armv8 instruction set with the Arm SIMD instruction set called the Scalable Vector Extension (SVE). The Arm instruction-set architecture has been widely accepted by software developers and users, not only for mobile processors but recently also for HPC; for example, Cavium ThunderX2 is an Arm processor designed for servers and HPC and has been used for several supercomputer systems, including Astra and Isambard. SVE is an extended SIMD instruction set whose most significant feature is vector-length-agnostic programming; as the name suggests, code does not depend on the vector length. Two 512-bit-wide SIMD arithmetic units were chosen, as suggested by the feasibility study. The processor is custom designed by Fujitsu using their microarchitecture as the backend of the processor core. Fujitsu proposed the basic structure of the manycore processor architecture according to their microarchitecture: each core has an L1 cache, and a cluster of cores shares an L2 cache and a memory controller. This cluster of cores is called a core-memory group (CMG). While other high-performance processors, such as those of Intel and AMD, have L1 and L2 caches in the core and share an L3 cache as the last-level cache, the core of this processor has only an L1 cache, to reduce the die size of the core.

The technology target for silicon fabrication was 7-nm FinFET technology. The die size of the chip is the most dominant factor in terms of cost: the cost of a chip increases in proportion to its size, increases significantly beyond a certain size, and the yield becomes worse as the size grows. One possible configuration is to use small chips and connect them with multichip module (MCM) technology; recently, AMD has used this "chiplet" approach successfully. The advantage is that a small chip can be relatively cheap with a good yield. However, at the time of the basic design the cost of MCM was deemed too high, and a separate chip for the interconnect and I/O would have been needed, resulting in even higher costs; the connections between chips on the MCM would also increase power consumption. Thus, the decision was to use a single large die containing the CMGs, the network interface for the interconnect, and PCIe for I/O, all connected by a network-on-chip. As a result, 48 cores (plus four assistant cores) were chosen, organized as 12 cores/CMG in 4 CMGs. The die size fitted within about 400 mm², which was reasonable in terms of cost for 7-nm FinFET technology.

As the peak floating-point performance of the CPU chip was expected to reach a few TFLOPS, the memory bandwidth of DDR4 was too low compared to the arithmetic performance. Thus, high-speed memory technologies such as HBM and the Hybrid Memory Cube were examined to balance memory bandwidth and arithmetic performance. HBM is a stacked memory chip connected via TSVs on a silicon interposer. HBM2 provides a bandwidth of 256 GB/s per module, but its capacity is only up to 8 GiB, and the cost is high because a silicon interposer is required. As a memory technology available around 2019, HBM2 was chosen for its power efficiency and high memory bandwidth, and it was decided not to use any additional DDR memory in order to reduce cost. As described previously, four HBM2 modules are attached per chip, so the main memory capacity is 32 GiB. Although this seems small for certain applications, many scalable applications had already been developed for the K computer, and such applications can increase the problem size by increasing the number of nodes used.

The key to designing the cache architecture is to provide a high hit rate for many applications and to avoid a bottleneck when data are supplied at full bandwidth from memory. Various parameters, such as the line size, the number of ways, and the capacity, were examined to optimize cache performance under the constraints of die area and power consumption. To decide the cache structure and size, the impact of the cache configuration on performance was examined by running kernels extracted from the target applications on the simulator for a single CMG. The cache was also designed to save power when accessing data in a set-associative cache: data reads from the ways and the tag search may be done in parallel to reduce latency, but this can waste power because the data are discarded when the tag does not match. In this design, data access is performed after a tag match. While this causes a longer latency, it has little impact on performance for throughput-intensive HPC applications. This design was applied to the L1 cache for vector accesses and to the L2 cache, resulting in a 10% power reduction in HPL with almost no performance degradation. The microarchitecture is an out-of-order (O3) design by Fujitsu. The amount of O3 resources was decided by the trade-off between performance and impact on die size, evaluated with kernels extracted from the target applications.

OVERVIEW OF THE FUGAKU SYSTEM
In 2019 the name of the system was decided as "Fugaku," and the installation was completed in May 2020. The global file system, a Lustre-based parallel file system developed by Fujitsu, forms one layer of the storage system. A Linux kernel runs on each node, and all system daemons run on two or four assistant cores. The CPU chip with two assistant cores is used on compute-only nodes; the chip with four assistant cores is used on compute-and-I/O nodes, because such nodes service I/O functions requiring more CPU resources.

Final specification of the architecture parameters from the co-design:

Chip:
- CMGs/chip: 4
- Cores/chip: 48 (+4 assistant cores)
Memory (per chip):
- Technology: HBM2
- Memory size: 32 GiB
- Memory BW: 1024 GB/s
CMG:
- Cores/CMG: 12 (+1 assistant core)
L2 cache (per CMG):
- Size: 8 MiB
- Associativity: 16-way
- Load BW to L1: 128 GB/s
- Store BW from L1: 64 GB/s
- Line size: 256 bytes
Core:
- SIMD width: 512 bits
- SIMD units: 2
L1D cache (per core):
- Size: 64 KiB
- Associativity: 4-way
- Load BW: 256 GB/s
- Store BW: 128 GB/s
Out-of-order resources (per core):
- Reorder buffer: 128 entries
- Reservation stations: 60 entries
- Physical SIMD registers: 128
- Load buffer: 40 entries
- Store buffer: 24 entries
(Cache bandwidths are quoted at a CPU clock speed of 2 GHz.)
Software Used:
Fugaku uses a "light-weight multi-kernel operating system" named IHK/McKernel. The operating system runs both Linux and the McKernel light-weight kernel simultaneously, side by side. The infrastructure that both kernels run on is termed the Interface for Heterogeneous Kernels (IHK). High-performance simulations run on McKernel, with Linux available for all other POSIX-compatible services.

2. Summit Supercomputer

Introduction:
Summit, or OLCF-4, is a supercomputer developed by IBM for use at Oak Ridge National Laboratory. It is capable of 200 petaFLOPS, making it the second fastest supercomputer in the world (it held the number 1 position from November 2018 to June 2020). Its current LINPACK benchmark result is 148.6 petaFLOPS. As of November 2019, it ranked as the 5th most energy-efficient supercomputer in the world, with a measured power efficiency of 14.668 gigaFLOPS/watt. Summit was also the first supercomputer to reach the exaflop mark (a quintillion operations per second), on a mixed-precision calculation.

Block Diagram:

Software Used:
Red Hat Enterprise Linux is widely deployed in national labs and research centers around the globe and is a proven platform for large-scale computing across multiple hardware architectures. The total system design of Summit, consisting of 4,608 IBM computer servers, aims to make it easier to bring research applications to this behemoth; part of this is the consistent environment provided by Red Hat Enterprise Linux.

Functional Units:
System Overview & Specifications: Summit is an IBM system located at the Oak Ridge Leadership Computing Facility. With a theoretical peak double-precision performance of approximately 200 PF, it is one of the most capable systems in the world for a wide range of traditional computational science applications. It is also one of the "smartest" computers in the world for deep learning applications, with a mixed-precision capability in excess of 3 EF.
Core Pipeline
NVIDIA Tesla V100 GPU Architecture

3. Sierra Supercomputer

Introduction:
Sierra, or ATS-2, is a supercomputer built for the Lawrence Livermore National Laboratory (LLNL) for use by the National Nuclear Security Administration (NNSA) as the second Advanced Technology System. It is primarily used for predictive applications in stockpile stewardship, helping to assure the safety, reliability, and effectiveness of the United States' nuclear weapons. Sierra is very similar in architecture to the Summit supercomputer built for Oak Ridge National Laboratory. The Sierra system uses IBM POWER9 CPUs in conjunction with Nvidia Tesla V100 GPUs. The nodes in Sierra are Witherspoon IBM S922LC OpenPOWER servers with two GPUs per CPU and four GPUs per node, connected with EDR InfiniBand. In 2019 Sierra was upgraded with IBM Power System AC922 nodes.

Block Diagram:

Software Used:
The Summit and Sierra supercomputer cores are IBM POWER9 central processing units (CPUs) and NVIDIA V100 graphics processing units (GPUs). NVIDIA claims that its GPUs deliver 95% of Summit's performance. Both supercomputers use a Linux operating system.

Functional Units:
Sierra boasts a peak performance of 125 petaFLOPS, that is, 125 quadrillion floating-point operations per second. Early indications using existing codes and benchmark tests are promising, demonstrating as predicted that Sierra can perform most required calculations far more efficiently, in terms of cost and power consumption, than computers consisting of CPUs alone. Depending on the application, Sierra is expected to be six to ten times more capable than LLNL's 20-petaFLOP Sequoia, then the world's eighth-fastest supercomputer.

To prepare for this architecture, LLNL partnered with IBM and NVIDIA to rapidly develop codes and prepare applications to effectively use the CPU/GPU nodes. IBM and NVIDIA personnel worked closely with LLNL, both on-site and remotely, on code development and restructuring to achieve maximum performance, while LLNL personnel provided feedback on system design and the software stack to the vendors. LLNL selected the IBM/NVIDIA system due to its energy and cost efficiency, as well as its potential to effectively run NNSA applications. Sierra's IBM POWER9 processors feature CPU-to-GPU connections via the NVIDIA NVLink interconnect, enabling greater memory bandwidth within each node so Sierra can move data throughout the system for maximum performance and efficiency.

Backing Sierra is 154 petabytes of IBM Spectrum Scale, a software-defined parallel file system, deployed across 24 racks of Elastic Storage Servers (ESS). To meet the scaling demands of the heterogeneous system, ESS delivers 1.54 terabytes per second in both read and write bandwidth and can manage 100 billion files per file system.

"The next frontier of supercomputing lies in artificial intelligence," said John Kelly, senior vice president, Cognitive Solutions and IBM Research. "IBM's decades-long partnership with LLNL has allowed us to build Sierra from the ground up with the unique design and architecture needed for applying AI to massive data sets. The tremendous insights researchers are seeing will only accelerate high-performance computing for research and business."

As the first NNSA production supercomputer backed by a GPU-accelerated architecture, Sierra's acquisition required a fundamental shift in how scientists at the three NNSA laboratories program their codes to take advantage of the GPUs. The system's NVIDIA GPUs also present scientists with an opportunity to investigate the use of machine learning and deep learning to accelerate the time-to-solution of physics codes. It is expected that simulation, accelerated by artificial intelligence technology, will be increasingly employed over the coming decade. In addition to critical national security applications, a companion unclassified system, called Lassen, has also been installed in the Livermore Computing Center. This institutionally focused supercomputer will play a role in projects aimed at speeding cancer drug discovery, precision medicine, research on traumatic brain injury, seismology, climate, astrophysics, materials science, and other basic science benefiting society.

Sierra continues the long lineage of world-class LLNL supercomputers and represents the penultimate step on NNSA's road to exascale computing, which is expected to start by 2023 with an LLNL system called "El Capitan." Funded by the NNSA's Advanced Simulation and Computing (ASC) program, El Capitan will be NNSA's first exascale supercomputer, capable of more than a quintillion calculations per second, about ten times the performance of Sierra. Such computing power will be readily absorbed by NNSA for its mission, which has long required the most advanced computing capabilities and deep partnerships with American industry.

4. Sunway TaihuLight Supercomputer

Introduction:
The Sunway TaihuLight is a Chinese supercomputer which, as of November 2021, is ranked fourth in the TOP500 list, with a LINPACK benchmark rating of 93 petaFLOPS. The name translates as "divine power, the light of Taihu Lake." It is nearly three times as fast as the previous Tianhe-2, which ran at 34 petaFLOPS. As of June 2017, it is ranked as the 16th most energy-efficient supercomputer in the Green500, with an efficiency of 6.051 GFlops/watt. It was designed by the National Research Center of Parallel Computer Engineering & Technology (NRCPC) and is located at the National Supercomputing Center in Wuxi, Jiangsu province, China.

Block Diagram:
Software Used:
The system runs its own operating system, Sunway RaiseOS 2.0.5, which is based on Linux. The system also has its own customized implementation of OpenACC 2.0 to aid the parallelization of code.

Functional Units:
The Sunway TaihuLight supercomputer: an overview. The Sunway TaihuLight supercomputer is hosted at the National Supercomputing Center in Wuxi (NSCC-Wuxi), which operates as a collaboration center between the City of Wuxi, Jiangsu Province, and Tsinghua University. NSCC-Wuxi focuses on the development needs of technological innovation and industrial upgrading around Jiangsu Province and the Yangtze River Delta economic circle, as well as the demands of national key strategies on science and technology development.

The SW26010 many-core processor:
One major technological innovation of the Sunway TaihuLight supercomputer is the homegrown SW26010 many-core processor. The processor includes four core-groups (CGs). Each CG includes one management processing element (MPE), one computing processing element (CPE) cluster with eight-by-eight CPEs, and one memory controller (MC). The four CGs are connected via a network-on-chip (NoC). Each CG has its own memory space, which is connected to the MPE and the CPE cluster through the MC, and the processor connects to outside devices through a system interface (SI).

The MPE is a complete 64-bit RISC core which can run in both user and system modes. It fully supports interrupt functions, memory management, superscalar processing, and out-of-order execution, which makes it an ideal core for handling management and communication functions. In contrast, the CPE is also a 64-bit RISC core, but with limited functions: it can only run in user mode and does not support interrupt functions. The design goal of this element is to achieve the maximum aggregate computing power while minimizing the complexity of the microarchitecture. The CPE cluster is organized as an eight-by-eight mesh, with a mesh network to achieve low-latency register data communication among the CPEs; the mesh also includes a mesh controller that handles interrupt and synchronization controls. Both the MPE and the CPE support 256-bit vector instructions.

Subcomponent systems of the Sunway TaihuLight:
The following gives more detail about the various subcomponent systems of the Sunway TaihuLight, specifically the computing, network, peripheral, maintenance and diagnostic, power and cooling, and software systems.

The computing system:
Aiming for a peak performance of 125 PFlops, the computing system of the Sunway TaihuLight is built using a fully customized integration approach with a number of different levels: (1) computing node (one CPU per computing node); (2) super node (256 computing nodes per super node); (3) cabinet (4 super nodes per cabinet); and (4) the entire computing system (40 cabinets). The computing nodes are the basic units of the computing system and include one SW26010 processor, 32 GB of memory, a node management controller, power supply, interface circuits, etc. Groups of 256 computing nodes are integrated into a tightly coupled super node using a fully connected crossing switch, so as to support computationally intensive, communication-intensive, and I/O-intensive computing jobs.
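The integration levels above, together with the SW26010 layout (four CGs per chip, each with one MPE and an 8x8 CPE cluster), fix the node and core counts of the full system. A short back-of-the-envelope check in Python (an illustrative sketch; the variable names are ours, the figures are the ones quoted above):

# Node and core count of the full Sunway TaihuLight, derived from the figures above.
cpes_per_cg = 8 * 8              # one 8x8 CPE cluster per core-group
mpes_per_cg = 1                  # one management processing element per core-group
cgs_per_cpu = 4                  # four core-groups per SW26010 processor

cores_per_node = cgs_per_cpu * (cpes_per_cg + mpes_per_cg)   # 260 cores, one CPU per node

nodes_per_supernode = 256
supernodes_per_cabinet = 4
cabinets = 40

nodes = nodes_per_supernode * supernodes_per_cabinet * cabinets
print(cores_per_node)                                    # 260
print(nodes)                                             # 40,960 computing nodes
print(cores_per_node * nodes)                            # 10,649,600 cores in total
print(nodes_per_supernode * cgs_per_cpu * cpes_per_cg)   # 65,536 CPEs per super node, as quoted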
The network system:
The network system consists of three different levels, with the central switching network at the top, the super node network in the middle, and the resource-sharing network at the bottom. The bisection network bandwidth is 70 TB/s, with a network diameter of 7. Each super node includes 256 Sunway processors that are fully connected by the super node network, which achieves both high bandwidth and low latency for all-to-all communications among its 65,536 computing processing elements. The central switching network is responsible for building connections and enabling data exchange between different super nodes. The resource-sharing network connects the shared resources to the super nodes and provides services for I/O communication and fault tolerance of the computing nodes.

The peripheral system:
The peripheral system consists of the network storage system and the peripheral management system. The network storage system includes both the storage network and the storage disk array, providing a total storage of 20 PB and a high-speed, reliable data storage service for the computing nodes. The peripheral management system includes the system console, management server, and management network, which enable system management and service.

The power supply and cooling systems:
The TaihuLight supercomputer uses a mutual-backup power input of 2 × 35 kV. The cabinets of the system use a three-level (300 V - 12 V - 0.9 V) DC power supply mode: the front-end power supply output is 300 V, which is directly linked to the cabinet; the main power supply of the cabinet converts 300 V DC to 12 V DC; and the CPU power supply converts 12 V into the voltage that the CPU needs. The cabinets of the computing and network systems use indirect water cooling, the peripheral devices use air/water exchange, and the power system uses forced air cooling. The cabinets use closed-loop, static-hydraulic-pressure, indirect parallel-flow water cooling technology, which provides effective cooling for the full-scale LINPACK run.

5. Tianhe-2A Supercomputer

Introduction:
Tianhe-2A is the upgraded form of Tianhe-2, a supercomputer developed by China's National University of Defense Technology (NUDT) and located at the National Supercomputer Center in Guangzhou. Tianhe-2 was the world's fastest supercomputer according to the TOP500 lists for June 2013, November 2013, June 2014, November 2014, June 2015, and November 2015. The record was surpassed in June 2016 by the Sunway TaihuLight. In 2015, plans of Sun Yat-sen University, in collaboration with the Guangzhou district and city administration, to double the machine's computing capacity were stopped by a U.S. government rejection of Intel's application for an export license for the CPUs and coprocessor boards. In response to the U.S. sanction, China introduced the Sunway TaihuLight supercomputer in 2016, which substantially outperforms the Tianhe-2 (the sanction also drove the upgrade of Tianhe-2 to Tianhe-2A, replacing the US technology); the TaihuLight now ranks fourth in the TOP500 list while using completely domestic technology, including the Sunway manycore microprocessor.

Block Diagram:

Software Used:
Tianhe-2 ran on Kylin Linux, a version of the operating system developed by NUDT.

Functional Unit:
System Architecture & Compute Blade: The original TH-2 compute blade consisted of two nodes split into two modules: (1) the Computer Processor Module (CPM) and (2) the Accelerator Processor Unit (APU) module. The CPM contained four Ivy Bridge CPUs, memory, and one Xeon Phi KNC accelerator, and the APU contained five Xeon Phi KNC accelerators. Connections from the Ivy Bridge CPUs to each of the KNC accelerators are made through a ×16 PCI Express 2.0 multiboard with 10 Gbps of bandwidth. The actual design and implementation of the board supports PCI Express 3.0, but the Xeon Phi KNC accelerator only supports PCI Express 2.0. There was also a PCI Express connection for the network interface controller (NIC).

With the upgraded TH-2A, the Intel Xeon Phi KNC accelerators have been replaced. The CPM module still has four Ivy Bridge CPUs but no longer houses an accelerator, and the APU now houses four Matrix-2000 accelerators instead of the five Intel Xeon Phi KNC accelerators. So, in the TH-2A, the compute blade has two heterogeneous compute nodes, and each compute node is equipped with two Intel Ivy Bridge CPUs and two proprietary Matrix-2000 accelerators. Each node has 192 GB of memory and a peak performance of 5.3376 Tflop/s. The Intel Ivy Bridge processors have not been changed and are the same as in the original TH-2. Each of an Ivy Bridge CPU's 12 compute cores can perform 8 FLOPs per cycle, which results in 211.2 Gflop/s peak performance per socket (12 cores × 8 FLOPs per cycle × 2.2 GHz clock).

The two Intel Ivy Bridge CPUs in a node are linked using two Intel QuickPath Interconnects (QPI). Each CPU has four memory channels with eight dual in-line memory module (DIMM) slots. CPU0 expands its I/O devices using Intel's Platform Controller Hub (PCH) chipset and connects with a 14G proprietary NIC through a ×16 PCI Express 3.0 connection. Each CPU also uses a ×16 PCI Express 3.0 connection to access the Matrix-2000 accelerators, and each accelerator has eight memory channels. In a compute node, the CPUs are equipped with 64 GB of DDR3 memory, while the accelerators are equipped with 128 GB of DDR4 memory. With 17,792 compute nodes, the total memory capacity of the whole system is 3.4 PB.

The TH-2A compute blade is composed of two parts: the CPM, which integrates four Ivy Bridge CPUs, and the APU, which integrates four Matrix-2000 accelerators. Each compute blade contains two heterogeneous compute nodes. As stated earlier, the peak performance of each Ivy Bridge CPU is 211.2 Gflop/s, and the peak performance of each Matrix-2000 accelerator is 2.4576 Tflop/s. Thus, the peak performance of each compute node is (0.2112 Tflop/s × 2) + (2.4576 Tflop/s × 2) = 5.3376 Tflop/s, and with 17,792 compute nodes the peak performance of the whole system is 5.3376 Tflop/s × 17,792 nodes ≈ 94.97 Pflop/s.
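The per-socket, per-node, and system peak figures quoted above can be reproduced with a few lines of arithmetic. This is only a sketch that re-derives the numbers already given in the text:

# Tianhe-2A peak performance, recomputed from the figures in the text.
ivy_bridge_peak = 12 * 8 * 2.2          # cores x FLOPs/cycle x GHz = 211.2 GFlop/s per socket
matrix2000_peak = 2457.6                # GFlop/s per Matrix-2000 accelerator (given above)

node_peak = 2 * ivy_bridge_peak + 2 * matrix2000_peak   # two CPUs + two accelerators per node
system_peak = node_peak * 17_792                        # 17,792 compute nodes

print(f"CPU socket peak : {ivy_bridge_peak:.1f} GFlop/s")    # 211.2
print(f"node peak       : {node_peak / 1000:.4f} TFlop/s")   # 5.3376
print(f"system peak     : {system_peak / 1e6:.2f} PFlop/s")  # ~94.97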
6. Frontera Supercomputer

Introduction:
In August 2018, Dell EMC and Intel announced intentions to jointly design Frontera, an academic supercomputer funded by a $60 million grant from the National Science Foundation that would replace Stampede2 at the University of Texas at Austin's Texas Advanced Computing Center (TACC). Those plans came to fruition in June 2019, when the two companies deployed Frontera, which was formally unveiled later that year. Intel claims that Frontera can achieve a peak performance of 38.7 quadrillion floating-point operations per second (38.7 petaFLOPS), making it the world's fastest computer designed for academic workloads like modeling and simulation, big data, and machine learning. (That is compared with Stampede2's peak performance of 18 petaFLOPS.) In June 2019, Frontera earned the fifth spot on the twice-annual TOP500 list, which ranks the world's most powerful non-distributed computer systems, with 23.5 petaFLOPS on the LINPACK benchmark.

Block Diagram:

Software Used:
With a peak-performance rating of 38.7 petaFLOPS, the supercomputer is about twice as powerful as TACC's Stampede2 system, which was then the 19th fastest supercomputer in the world. Dell EMC provided the primary computing system for Frontera, based on Dell EMC PowerEdge C6420 servers.

Functional Unit:
The Frontera system provides academic researchers with the ability to handle artificial-intelligence-related jobs of extremely high, previously unattainable complexity. "With the integration of many Intel-exclusive technologies, this supercomputer opens up a wealth of new possibilities in the field of scientific and technical research in general, thereby fostering deeper understanding of complex, scholarly issues related to space research, cures, energy needs, and artificial intelligence," said Trish Damkroger, Intel vice president and general manager.

Hundreds of 2nd-generation Xeon Scalable ("Cascade Lake") processors with up to 28 cores, housed in Dell EMC PowerEdge servers, handle Frontera's heavy computing tasks, alongside Nvidia GPU nodes for single-precision calculation. Frontera's processor architecture builds on Intel's Advanced Vector Extensions 512 (AVX-512), an instruction set that doubles the number of FLOPS per clock cycle compared to the previous generation. The cooling system is another critical part of the machine: Frontera uses liquid cooling for most of its nodes, with water- and oil-based cooling provided by CoolIT and Green Revolution Cooling (GRC) systems. The supercomputer uses Mellanox HDR and HDR-100 links to transfer data at up to 200 Gbps per link between the switches that connect the 8,008 nodes across the system. Each rack is expected to draw about 65 kilowatts of electricity, and roughly one third of the power TACC uses comes from wind and solar sources to save costs.

In terms of storage, Frontera has four different environments designed and built by DataDirect Networks, totaling more than 50 petabytes, paired with 3 petabytes of NAND flash (plus about 480 GB of SSD storage on each node), and very fast connectivity, with a speed of up to 1.5 terabytes per second. Finally, Frontera also makes use of Intel Optane DC persistent memory, the non-volatile memory technology developed by Intel and Micron Technology, which is DDR4-compatible and acts as a large additional memory tier alongside a smaller DRAM pool (192 GB per node), thereby improving performance. Combined with the latest-generation Xeon Scalable processors, Intel Optane DC on Frontera delivers up to 287,000 operations per second, compared to 3,116 operations per second for conventional DRAM systems. With such equipment, Frontera's reboot time takes only 17 seconds.

Basic specifications of the Frontera supercomputer

Basic computing system (the configuration of each of Frontera's 8,008 nodes):
- Processor: Intel Xeon Platinum 8280 ("Cascade Lake")
- Number of cores: 28 per socket, 56 per node
- Clock rate: 2.7 GHz ("base frequency")
- Maximum node performance: 4.8 TF, double precision
- RAM: DDR4, 192 GB/node
- Local drive: 480 GB SSD/node
- Network: Mellanox InfiniBand, HDR-100

Subsystems

Liquid submerged system:
- Processor: 360 NVIDIA Quadro RTX 5000 GPUs
- RAM: 128 GB/node
- Cooling: GRC ICEraQ system
- Network: Mellanox InfiniBand, HDR-100
- Maximum performance: 4 PF single precision

Longhorn:
- Processor: IBM POWER9-hosted system with 448 NVIDIA V100 GPUs
- RAM: 256 GB/node
- Storage: 5-petabyte filesystem
- Network: InfiniBand EDR
- Maximum performance: 3.5 PF double precision; 7.0 PF single precision
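The 4.8 TF double-precision figure quoted above for each main Cascade Lake node is consistent with the AVX-512 description. A rough check (a sketch only; the 32 FLOPs-per-cycle figure assumes two AVX-512 FMA units per core, which is our assumption about this CPU and is not stated in the text):

# Rough check of Frontera's quoted 4.8 TF double-precision node peak.
cores_per_socket = 28
sockets_per_node = 2
clock_ghz = 2.7              # "base frequency" from the spec list above
flops_per_cycle = 32         # assumed: 2 AVX-512 FMA units x 8 doubles x 2 ops (mul+add)

node_peak_gflops = cores_per_socket * sockets_per_node * clock_ghz * flops_per_cycle
print(f"{node_peak_gflops / 1000:.2f} TFlop/s per node")   # ~4.84, close to the quoted 4.8 TF

In practice the sustained clock under AVX-512 load is lower than the base frequency, so the quoted 4.8 TF should be read as a nominal figure.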
7. Piz Daint

Introduction:
Piz Daint is a supercomputer in the Swiss National Supercomputing Centre, named after the mountain Piz Daint in the Swiss Alps. It was ranked 8th on the TOP500 ranking of supercomputers until the end of 2015, higher than any other supercomputer in Europe. At the end of 2016, the computing performance of Piz Daint was tripled to reach 25 petaFLOPS, making it at the time the third most powerful supercomputer in the world. As of November 2021, Piz Daint is ranked 20th on the TOP500. The original Piz Daint Cray XC30 system was installed in December 2012. This system was extended with Piz Dora, a Cray XC40 with 1,256 compute nodes, in 2013.[9] In October 2016, Piz Daint and Piz Dora were upgraded and combined into the current Cray XC50/XC40 system featuring Nvidia Tesla P100 GPUs.

Block Diagram:

Software Used:
Architecture: Intel Xeon E5-26xx (various), Nvidia Tesla P100
Operating system: Linux (CLE)

8. Trinity Supercomputer

Introduction:
Trinity (or ATS-1) is a United States supercomputer built by the National Nuclear Security Administration (NNSA) for the Advanced Simulation and Computing (ASC) program.[2] The aim of the ASC program is to simulate, test, and maintain the United States nuclear stockpile.

Block Diagram:

Software Used:
Trinity uses a Sonexion-based Lustre file system with a total capacity of 78 PB. Throughput on this tier is about 1.8 TB/s (1.6 TiB/s). It is used to stage data in preparation for HPC operations, and data typically reside in this tier for several weeks.

Functional Unit:
Trinity is a Cray XC40 supercomputer delivered over two phases; phase 1 is based on Intel Xeon Haswell compute nodes, and phase 2 adds Intel Xeon Phi Knights Landing (KNL) compute nodes. Phase 1 was delivered and accepted in the latter part of 2016 and consists of 54 cabinets, including multiple node types. Foremost are 9,436 Haswell-based compute nodes, delivering ~1 PiB of memory capacity and ~11 PF/s of peak performance. Each Haswell compute node features two 16-core Haswell processors operating at 2.3 GHz, along with 128 GiB of DDR4-2133 memory spread across 8 channels (4 per CPU). Phase 1 also includes 114 Lustre router nodes and 300 burst buffer nodes. Trinity utilizes a Sonexion-based Lustre filesystem with 78 PB of usable storage and approximately 1.6 TB/s of bandwidth; however, due to the limited number of Lustre router nodes in phase 1, only about half of this bandwidth is currently achievable. Phase 1 also includes all of the other typical service nodes: 2 boot, 2 SDB, 2 UDSL, 6 DVS, 12 MOM, and 10 RSIP. Additionally, Trinity utilizes 6 external login nodes.

Phase 2 is scheduled to begin delivery in mid-2016. It adds more than 9,500 Xeon Phi Knights Landing (KNL) based compute nodes. Each KNL compute node consists of a single KNL with 16 GiB of on-package memory and 96 GiB of DDR4-2400 memory, and has a peak performance of approximately 3 TF/s. In total, the KNL nodes add ~1 PiB of memory capacity and ~29 PF/s of peak performance. In addition to the KNLs, phase 2 also adds the balance of the Lustre router nodes (108 additional, for a total of 222) and burst buffer nodes (276 additional, for a total of 576). When all burst buffer nodes are installed, they will provide 3.69 PB of raw storage capacity and 3.28 TB/s of bandwidth.

BURST BUFFER INTEGRATION AND PERFORMANCE

1. Design:
Trinity includes the first large-scale instance of on-platform burst buffers using the Cray DataWarp product. The Trinity burst buffer is provided in two phases along with the two phases of Trinity: the phase 1 burst buffer consists of 300 DataWarp nodes, which is expanded to 576 DataWarp nodes by phase 2. In what follows, unless otherwise noted, the phase 1 burst buffer is described. The 300 DataWarp nodes are built from Cray service nodes, each with a 16-core Intel Sandy Bridge processor and 64 gigabytes of memory. Storage on each DataWarp node is provided by two Intel P3608 solid-state drive (SSD) cards. The DataWarp nodes use the Aries high-speed network for communications with the Trinity compute nodes and, via the LNET router nodes, with the Lustre parallel file system (PFS). Each SSD card has 4 TB of capacity and is attached to the service node via a PCI-E x4 interface. The SSD cards are overprovisioned to improve their endurance from the normal 3 drive writes per day (DWPD) over 5 years to 10 DWPD over 5 years; this reduces the available capacity of each card. The total usable capacity of the 300 DataWarp nodes is 1.7 PiB.

The DataWarp nodes run a Cray-provided version of Linux together with a DataWarp-specific software stack consisting of an enhanced Data Virtualization Service (DVS) server and various configuration and management services. The DataWarp nodes also provide a staging function that can be used to asynchronously move data between the PFS and DataWarp. A centralized DataWarp registration service runs on one of the Cray system management nodes, and compute nodes run a DVS client that is enhanced to support DataWarp. The DataWarp resources can be examined and controlled via several DataWarp-specific command line interface (CLI) utilities that run on any of the system's nodes. DataWarp can be configured to operate in a number of different modes. The primary use case at ACES is to support checkpoint and analysis files; these are supported by the striped scratch mode of DataWarp. Striped scratch provides a single file namespace that is visible to multiple compute nodes, with the file data striped across one or more DataWarp nodes. A striped private mode is additionally available, and paging space and cache modes may be provided in the future. The discussion here covers LANL's experience with striped scratch mode.

A DataWarp allocation is normally configured by job script directives. Trinity uses the Moab Workload Manager (WLM). The WLM reads the job script at job submission time and records the DataWarp directives for future use. When the requested DataWarp capacity is available, the WLM starts the job. Prior to the job starting, the WLM uses DataWarp CLI utilities to request instantiation of a DataWarp allocation and any requested stage-in of data from the PFS. After the job completes, the WLM requests stage-out of data and then frees the DataWarp allocation. The stage-in and stage-out happen without any allocated compute nodes or any compute node involvement. The DataWarp allocation is made accessible via a mount on only the compute nodes of the requesting job. Unix file permissions are effective for files in DataWarp and are preserved by stage-in and stage-out operations. A DataWarp allocation is normally only available for the life of the requesting job, with the exception of a persistent DataWarp allocation that may be accessed by multiple jobs, possibly simultaneously; simultaneous access by multiple jobs is used to support in-transit data visualization and analysis use cases.

2. Integration:
Correct operation of DataWarp in conjunction with the WLM was achieved after several months of extended integration testing on-site at LANL. Numerous fixes and functional enhancements have improved the stability and usability of the DataWarp feature on Trinity. During this effort, production use of DataWarp has been limited as of late April 2016.

3. Performance:
All performance measurements were conducted with IOR. The runs were made with:
- one reader or writer process per node
- 32 GiB total data read or written per node
- 256, 512, or 1024 KiB block size
- node counts from 512 to 4096
- the DataWarp allocation striped across all 300 DataWarp nodes
These characteristics were selected to approximate the I/O patterns expected when applications use the HIO library. Additional investigation and optimization of the I/O characteristics is needed.

9. AI Bridging Cloud Infrastructure
Introduction:
AI Bridging Cloud Infrastructure (ABCI) is a supercomputer built at the University of Tokyo for use in artificial intelligence, machine learning, and deep learning. It is being built by Japan's National Institute of Advanced Industrial Science and Technology (AIST). ABCI was expected to be completed in the first quarter of 2018 with a planned performance of 130 petaFLOPS, a target power consumption of 3 megawatts, and a planned power usage effectiveness of 1.1. If performance met expectations, ABCI would be the second most powerful supercomputer built, surpassing the then-leading Sunway TaihuLight's 93 petaFLOPS but still behind Summit.

Block Diagram:

Software Used:
Along with Docker, Singularity, and other tools, Univa Grid Engine plays a key role in ABCI's software stack, ensuring that workloads run as efficiently as possible.

Functional Units:
The ABCI prototype, installed in March, consisted of 50 two-socket "Broadwell" Xeon E5 servers, each equipped with 256 GB of main memory, 480 GB of SSD flash memory, and eight Tesla P100 GPU accelerators in the SXM2 form factor hooked to each other using the NVLink 1.0 interconnect. Another 68 nodes were plain vanilla servers, plus two nodes for interactive management and sixteen nodes for other functions on the cluster. The system was configured with 4 PB of SF14K clustered disk storage from DataDirect Networks running the GRIDScaler implementation of IBM's GPFS parallel file system, and the whole system was clustered together using 100 Gb/sec EDR InfiniBand from Mellanox Technologies, specifically 216 of its CS7250 director switches. Among the many workloads running on this cluster was the Apache Spark in-memory processing framework.

The goal with the real ABCI system was to deliver a machine with somewhere between 130 petaFLOPS and 200 petaFLOPS of AI processing power, meaning mostly half-precision and single-precision work, with a power usage effectiveness (PUE) of somewhere under 1.1; PUE is the ratio of the energy consumed by the datacenter to that consumed by the compute complex that does actual work. (This is about as good as most hyperscale datacenters.) The system was supposed to have about 20 PB of parallel file storage and, with the compute, storage, and switching combined, burn under 3 megawatts of power. The plan was to get the full ABCI system operational by the fourth quarter of 2017 or the first quarter of 2018, which obviously depended on the availability of the compute and networking components.

Compared with the K supercomputer at the RIKEN research lab in Japan and the Tsubame 3.0 machine at the Tokyo Institute of Technology: the K machine, which is based on the SPARC64 architecture and which was the first machine in the world to break the 10 petaFLOPS barrier, will eventually be replaced by a massively parallel ARM system using the Tofu interconnect made for the K system and subsequently enhanced. The Oakforest-PACS machine, built by the University of Tokyo and the University of Tsukuba, is based on a mix of "Knights Landing" Xeon Phi processors and Omni-Path interconnect from Intel, and weighs in at 25 petaFLOPS peak double precision. While the Tsubame 3.0 machine is said to focus on double-precision performance, the big difference is really that the Omni-Path network hooking all of the nodes together in Tsubame 3.0 was configured to maximize injection bandwidth and to have very high bisection bandwidth across the network. The machine learning workloads expected to run on ABCI are not as sensitive to these factors and, importantly, the idea here is to build something that looks more like a high-performance cloud datacenter that can be replicated in other facilities, using standard 19-inch equipment rather than the specialized HPE and SGI gear that TiTech has used in the Tsubame line to date. In the case of both Tsubame 3.0 and ABCI, the thermal density of the compute and switching is in the range of 50 to 60 kilowatts per rack, which is a lot higher than the 3 to 6 kilowatts per rack in a service provider datacenter, and the PUE of under 1.1 is a lot lower than the 1.5 to 3.0 rating of a typical service provider datacenter. (The hyperscalers do a lot better than this average.)

AIST then awarded the job of building the full ABCI system to Fujitsu and nailed down the specs. The system will be installed at a new datacenter at the Kashiwa II campus of the University of Tokyo and is now scheduled to start operations in Fujitsu's fiscal 2018, which begins in April 2018. The ABCI system will comprise 1,088 of Fujitsu's Primergy CX2570 server nodes, which are half-width server sleds that slide into the Primergy CX400 2U chassis. Each sled can accommodate two Intel "Skylake" Xeon SP processors, and in this case AIST is using a Xeon SP Gold variant, presumably with a large (but not extreme) number of cores. Each node is equipped with four Volta SXM2 GPU accelerators, so the entire machine has 2,176 CPU sockets and 4,352 GPU sockets. The use of the SXM2 variants of the Volta GPU accelerators requires liquid cooling because they run a little hotter, but the system has an air-cooled option for the Volta accelerators that hook into the system over the PCI-Express bus. The off-the-shelf models of the CX2570 server sleds also support the lower-grade Silver and Bronze Xeon SP processors as well as the high-end Platinum chips, so AIST is going in the middle of the road. There are Intel DC 4600 flash SSDs for local storage on the machine. It is not clear who won the deal for the GPFS file system for this machine, or whether it came in at 20 PB as expected.

Fujitsu says that the resulting ABCI system will have 37 petaFLOPS of aggregate peak double-precision floating-point performance and will be rated at 550 petaFLOPS overall, of which 525 petaFLOPS comes from using the 16-bit Tensor Core units that were created explicitly to speed up machine learning workloads. That is a lot more deep learning performance than was planned. AIST has amassed $172 million to fund the prototype and full ABCI machines as well as to build the new datacenter that will house the system; about $10 million of that funding is for the datacenter, which had already broken ground. The initial datacenter setup has a maximum power draw of 3.25 megawatts and 3.2 megawatts of cooling capacity, of which 3 megawatts come from a free cooling tower assembly and another 200 kilowatts from a chilling unit. The datacenter has a single concrete slab floor, which is cheap and easy, and will start out with 90 racks of capacity (18 for storage and 72 for compute), with room for expansion.
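The jump from the planned 130-200 petaFLOPS of "AI" performance to the final 550 petaFLOPS rating follows roughly from the node count and the per-GPU ratings of the Volta parts. A rough sketch (the per-GPU figures of about 7.8 TF double precision and 125 TF Tensor Core are published NVIDIA numbers for the V100 SXM2, not taken from this text, so treat them as assumptions):

# Rough check of ABCI's quoted aggregate performance (illustrative only).
nodes = 1088
gpus_per_node = 4
gpus = nodes * gpus_per_node              # 4,352 Volta GPUs, as stated above

v100_dp_tf = 7.8          # approx. double-precision TFlop/s per V100 SXM2 (assumed)
v100_tensor_tf = 125.0    # approx. Tensor Core TFlop/s per V100 SXM2 (assumed)

print(f"{gpus * v100_dp_tf / 1000:.1f} PF double precision from the GPUs")   # ~33.9 PF
print(f"{gpus * v100_tensor_tf / 1000:.0f} PF from the Tensor Cores")        # ~544 PF
# Adding the Skylake Xeon CPUs brings these close to the quoted ~37 PF DP and ~550 PF totals.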
10. SuperMUC-NG Supercomputer

Introduction:
SuperMUC was a supercomputer of the Leibniz Supercomputing Centre (LRZ) of the Bavarian Academy of Sciences, housed in the LRZ's data centre in Garching near Munich. It was decommissioned in January 2020, having been superseded by the more powerful SuperMUC-NG. SuperMUC was the fastest European supercomputer when it entered operation in the summer of 2012, and its successor is currently ranked #20 in the TOP500 list of the world's fastest supercomputers. SuperMUC served European researchers in many fields, including medicine, astrophysics, quantum chromodynamics, computational fluid dynamics, computational chemistry, life sciences, genome analysis, and earthquake simulations.

Block Diagram:

Software Used:
Operating system (compute/thin nodes): SUSE Linux Enterprise Server (SLES)
Batch scheduling system: SLURM
High-performance parallel filesystem: IBM Spectrum Scale (GPFS)
Programming environment: Intel Parallel Studio XE, GNU compilers

Functional Unit:
Just like the CoolMUC-2, the SuperMUC-NG is located at the Leibniz Supercomputing Centre in Germany and was built by Lenovo. The system has 311,040 physical cores and a main memory of 719 TB, resulting in a peak performance of 26.9 PFlop/s. A fat tree is used as the network topology, with a bandwidth of 100 Gb/s over Intel's Omni-Path interconnect [25]. The CPUs used in this system are Intel's Skylake Xeon Platinum 8174, with 24 cores clocked at 3.1 GHz [24]. SuperMUC-NG was designed as a general-purpose supercomputer to support applications from all scientific domains, such as life sciences, meteorology, geophysics, and climatology. The most dominant scientific domain using LRZ's supercomputers is astrophysics, and recently LRZ also made its resources available for COVID-19 related research.

Levels of Parallelism:
Fine grain (single processor core):
• instruction parallelism
• multiple floating-point units
• SIMD-style parallelism: single instruction, multiple data
Medium grain (multi-core / multi-socket system):
• independent processes or threads perform calculations
• on a shared memory area
Coarse grain (interconnected, independent systems):
• explicitly programmed data transfers between nodes of the system
• fulfill high memory requirements

Examples of levels of parallelism:
● Node level (e.g. SuperMUC has approx. 10,000 nodes)
● Accelerator level (e.g. SuperMIC has 2 CPUs and 2 Xeon Phi accelerators)
● Socket level (e.g. fat nodes have 4 CPU sockets)
● Core level (e.g. SuperMUC Phase 2 has 14 cores per CPU)
● Vector level (e.g. AVX2 has 16 vector registers per core)
● Pipeline level (how many simultaneous pipelines)
● Instruction level (instructions per cycle)

Latency analogy (getting data from / getting some food from):
CPU register    1 ns          fridge                 10 s
L2 cache        10 ns         microwave              100 s
Memory          80 ns         pizza service          800 s
Network (IB)    200 ns        city mall              2,000 s
GPU (PCIe)      50,000 ns     mum sends cake         500,000 s
Hard disk       500,000 ns    grown in own garden    5,000,000 s

Fine-grain parallelism:
• On the CPU level, operations are composed from elementary instructions, with each instruction sent to one of the ports of the execution unit:
  • load operand(s)
  • perform arithmetic operations
  • store result
  • increment loop count / check for loop exit
  • branching / jumping
• Efficient if:
  • good mix of instructions
  • arguments are available from memory
  • low data dependencies

Comparison Analysis of Supercomputers:
Rank  System                              Memory (GB)   Cores        Rmax (PFlop/s)  Rpeak (PFlop/s)  Power (kW)
1     Fugaku                              5,087,232     7,630,848    442.01          537.21           29,899
2     Summit                              2,801,664     2,414,592    148.60          200.79           10,096
3     Sierra                              1,382,400     1,572,480     94.64          125.71            7,438
4     Sunway TaihuLight                   1,310,720    10,649,600     93.01          125.44           15,371
5     Tianhe-2A                           2,277,376     4,981,760     61.44          100.68           18,482
6     Frontera                            1,537,536       448,448     23.50           38.70            6,000
7     Piz Daint                             365,056       387,872     21.23           27.15            2,272
8     Trinity                             N/A             979,072     20.16           41.46            8,000
9     AI Bridging Cloud Infrastructure      417,792       391,680     19.80           32.58            3,000
10    SuperMUC-NG                            75,840       305,856     19.40           26.83            2,000
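A useful derived column is energy efficiency (Rmax per watt). A short sketch that recomputes it from the table above, using only values transcribed from that table:

# Energy efficiency (GFlop/s per watt) derived from the comparison table above.
# (system, Rmax in PFlop/s, power in kW), as listed in the table.
systems = [
    ("Fugaku",            442.01, 29899),
    ("Summit",            148.60, 10096),
    ("Sierra",             94.64,  7438),
    ("Sunway TaihuLight",  93.01, 15371),
    ("Tianhe-2A",          61.44, 18482),
]

for name, rmax_pf, power_kw in systems:
    # PFlop/s -> GFlop/s is x1e6; kW -> W is x1e3; the ratio simplifies to x1000.
    gflops_per_watt = rmax_pf * 1000 / power_kw
    print(f"{name:18s} {gflops_per_watt:5.2f} GFlop/s per watt")
# Sunway TaihuLight, for example, comes out at about 6.05 GFlop/s per watt,
# matching the Green500 figure quoted in its section above.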
Thank You
