Fugaku, the Successes and the Lessons Learned
Satoshi Matsuoka, Director, RIKEN R-CCS
R-CCS Symposium 2022, 7 Feb 2022
Fugaku Crowned World #1 in All HPC Benchmarks, 2020 & 2021
Four editions in a row (Jun & Nov 2020, Jun & Nov 2021)
 TOP500 (PetaFLOPS) — performance on floating-point computation: 415.5 (Jun 2020) → 442.0 (Nov 2021), 2.97x the #2 system (Summit, 148.6)
 HPCG (PetaFLOPS) — performance running an actual application: 13.4 (Jun 2020) → 16.0 (Nov 2021), 5.48x the #2 system (Summit, 2.93)
 HPL-AI (ExaFLOPS) — performance on AI processing: 1.42 (Jun 2020, 2.58x the #2) → 2.00 (Nov 2021), 1.42x the #2 system (1.41)
 Graph500 (GigaTEPS) — performance on big-data processing: 70,980 (Jun 2020) → 102,950 (Nov 2021), 4.33x the #2 system (23,756)
High performance not just in Linpack, but across application proxies
 TOP500/HPL — simulation (dense proxy): classical “dense” HPC applications; runner-up DoE ORNL Summit
 HPCG — simulation (sparse proxy): modern “sparse” HPC applications; runner-up DoE ORNL Summit
 HPL-AI — AI (CNN proxy): convolution in deep learning; runner-up DoE ORNL Summit
 Graph500 — big data (graph proxy): graph processing in big data; runner-up TaihuLight
“Applications First” Exascale R&D
Fugaku Target Applications – Priority Research Areas
 Advanced Applications Co-Design Program ran in parallel with Fugaku R&D
 One representative app selected from each of 9 priority areas: health & medicine, environment & disaster, energy, materials & manufacturing, basic sciences
 Target: up to 100x speedup vs. the K computer => achieved!
 Measured speedups: 131x (GENESIS), 127x (NICAM+LETKF), 70x (NTChem), 63x (GAMERA), 63x (Adventure), 51x (FFB), 38x (RSDFT), 38x (LQCD), 23x (Genomon) — average ~70x
Fugaku HPC + Big Data + AI + Cloud “Converged” Software Stack
 OS: Red Hat Enterprise Linux 8 kernel + optional lightweight kernel (McKernel)
 System management: batch job and management system; open-source management tools; Spack and other DoE ECP software
 Compilers and script languages: Fortran, C/C++, OpenMP, Java, Python, ... (multiple compilers supported: Fujitsu, Arm, GNU, LLVM/Clang, PGI, ...)
 Math libraries — Fujitsu: BLAS, LAPACK, ScaLAPACK, SSL II; RIKEN: EigenEXA, KMATH_FFT3D, Batched BLAS, ...
 High-level programming language: XMP; domain-specific language: FDPS
 Tuning and debugging tools — Fujitsu: profiler, debugger, GUI
 Communication: Fujitsu MPI, RIKEN MPI; low-level communication: uTofu, LLC
 File I/O: DTF; hierarchical file system and I/O for hierarchical storage: Lustre/LLIO
 Process/thread: PIP; virtualization & containers: KVM, Singularity
 Fugaku AI (DL4Fugaku) — RIKEN: Chainer, PyTorch, TensorFlow, DNNL, ...
 Cloud software stack: OpenStack, Kubernetes, NEWT, ...; live data analytics: Apache Flink, Kibana, ...; object store: S3-compatible
 ~3,000 apps supported by Spack: most applications work with a simple recompile from the x86/RHEL environment to the Arm processor, and LLNL’s Spack automates this
 Positioning: converges traditional HPC systems (e.g. the K computer) and traditional clouds (e.g. EC2) into Fugaku and future HPCI systems
CUBE / COVID-19 Aerosol Sim Performance
 Compared with Xeon (Skylake/Cascade Lake), Fugaku/A64FX delivers:
 Superior efficiency relative to peak (7.03% vs. 2.20%/3.96%)
 Superior per-socket time to solution (11.19 sec vs. 20.61 / 8.29 x 2 sec)
MLPerf HPC v1.0 Result (CosmoFlow)
 Fugaku took first place in MLPerf HPC v1.0 (CosmoFlow, weak-scaling metric)
 82,944 CPUs used (128 CPUs per model instance)
 Trained 637 models in 495.66 minutes
 CosmoFlow throughput [models/min]: Fugaku (82,944 CPUs, BS 128): 1.29; Perlmutter (5,120 GPUs, BS 512): 0.73; NVIDIA (4,096 GPUs, BS 128): 0.68 — 1.77x over the runner-up
© 2021 Fujitsu Limited
Why Do Fugaku/A64FX Performance Problems Arise: Analysis
 Many applications, including the co-design apps, achieve their designed performance
 However, there are also scattered cases where performance falls short despite high expectations
 Possible causes:
 1. Insufficient parallelism: the application cannot extract enough parallelism per node; in that case it falls behind Xeon, whose sequential performance is superior
 2. Insufficient I/O-system scalability: data-processing-centric apps that used to run on small machines are now run on Fugaku; with more CPUs, I/O (especially the number of operations) grows proportionally, Fugaku’s I/O system fails to scale, and it becomes the bottleneck (e.g. genomics apps)
 3. Compiler vectorization failures: high performance on A64FX hinges on optimization via SVE vectorization, whether the code is compute-bound or memory-bound; the Fujitsu compiler handles the target apps, but for some other apps the vectorization that ought to happen does not
 4. Insufficient CPU hardware execution resources: A64FX was designed against a small set of target apps as the performance baseline, and to achieve far superior (3x) power efficiency over Intel its internal hardware resources are provisioned tightly; the result is peakier behavior than Intel — apps that fit within the resources are fast, those that fall outside lose performance
 Estimated frequency of actual occurrence: cause 3 >> causes 2 and 4 > cause 1
 R-CCS has begun technical countermeasures
Outline of Countermeasures (1)
 1. Insufficient parallelism: fundamentally handled per app and per user, with tuning support from RIST and others to improve parallelization
 For apps with inherently low parallel efficiency, real production use on Fugaku should mean massive ensemble runs, e.g. parameter sweeps; debating the efficiency of a single run is, in that case, technically beside the point
 2. Insufficient I/O-system scalability:
 Fugaku is equipped with NVMe storage for I/O scaling; promote its use, and extend the software to support such I/O-intensive (more precisely, I/O-transaction-intensive) workloads
 Work has already started: adopting BeeGFS in place of LLIO, and evaluating transaction-intensive apps such as genomics codes with tools like Darshan
 For genomics applications, establish collaborations inside and outside RIKEN, with a view to founding a new team
Outline of Countermeasures (2)
 3. Compiler vectorization failures:
 Fundamental overhaul of the compiler development structure: from next fiscal year, invest money and people in an OSS-based compiler ecosystem that does not depend on Fujitsu; the creation of an operations technology unit for compiler development is also under consideration
 Comprehensive evaluation of applications
 Establishment of a performance research team
 4. Insufficient CPU hardware execution resources:
 Comprehensive, fine-grained evaluation to pinpoint exactly which resources fall short
 Based on that, new optimization algorithms that reduce the impact and improve robustness
 Program-side fixes, worked out together with users, are also in scope
Lessons Learned from Fugaku
 Positives: proper project vision and management
 General-purpose low-power CPU with good FLOPS and high bandwidth
 Arm ecosystem extremely important for programming, tools, & apps
 Aggressive R&D + adoption of new (risky) technologies: on-die HBM2, embedded ~400 Gbps partially optical switchless interconnect, mainframe RAS, low power, etc.
 Co-design and co-working, nationally and internationally
 Some evolution to cope with massive parallelism (K => Fugaku)
 Addition of modern architecture features, e.g. FP16
 Shortcomings: lack of widespread commercial adoption
 Co-design focused too much on target-app optimization (only)
 Immaturity of software stack, esp. compilers & libraries w/SVE
 Still too focused on classic HPC for industry & cloud adoption
 Failed to look at modern apps: data, AI, entertainment, mobility
 Failed to “deprecate” classes of algorithms towards Post-Moore
 What elements can we learn for sustained performance improvement?
2028: Post-Moore Era
 ~2015: 25 years of sustained scaling, through the manycore period (post-Dennard scaling)
 2016~: difficulty in advancing Moore’s law
 2025~: Post-Moore era — the end of transistor-power advancement
 Challenge: exploration of computer architectures that enable performance improvement even around the year 2028
 Key to sustained performance improvement: FLOPS to Bytes, “data movement-centric architecture”
 Reconfigurable, data-driven, vector computing
 Ultra-deep and ultra-wide-bandwidth memory architectures
 Optical networks
 System software, programming, and algorithms that match the new architectures
Performance Projection of Many-Core CPU Systems Based on the IRDS Roadmap
 Prediction based on the IRDS Roadmap (2020 ed.): extrapolating traditional many-core architectures, relying merely on advances in semiconductor technology, achieves only 1.8 EFLOPS peak (3.37x Fugaku), if a machine with broad applicability is to be built
 Methodology (CPU part) — assumptions from the IRDS Roadmap, Systems and Architectures:
 Cores/socket = 70
 SIMD width = 2048-bit x 2
 Clock frequency = 3.9 GHz
 Socket TDP = 351 W
 System assumptions:
 System power = 30, 40, 50 MW
 PUE = 1.1
 CPU share of power = 60, 70, 80%
 From the NGACI white paper: https://sites.google.com/view/ngaci/home
 Even in the most aggressive system configuration (50 MW power budget, 80% of power consumed by CPUs), the predicted performance is only about 1.8 EF
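As a sanity check, a back-of-envelope calculation using only the slide's numbers; the helper function and the inference about implied per-socket peak are mine, not the white paper's:

```python
# Back-of-envelope check of the IRDS-based projection, from the slide's numbers.
def socket_count(system_mw: float, pue: float, cpu_share: float, tdp_w: float) -> float:
    """Sockets that fit the power budget after PUE and non-CPU overheads."""
    return system_mw * 1e6 / pue * cpu_share / tdp_w

# Most aggressive configuration: 50 MW system power, PUE 1.1, 80% of power in CPUs.
sockets = socket_count(50, 1.1, 0.80, 351)
print(f"sockets: ~{sockets:,.0f}")                                        # ~103,600
# The quoted 1.8 EFLOPS peak then implies roughly this per-socket peak:
print(f"implied per-socket peak: ~{1.8e18 / sockets / 1e12:.1f} TFLOPS")  # ~17.4
```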
(Charts: projected socket cores, injection bandwidth (Tb/s), and storage (EBytes).)
Application Kernel Categorization & Supercomputer Architecture
 Kernel categories: compute bound (aka Top500), bandwidth bound (aka HPCG), latency bound (aka Graph500)
 Classic vector machines (e.g. Earth Simulator), ~’90s
 COTS-CPU-based clusters, late ’90s to late 2000s (ASCI XXX, Tsubame1/T2K, Jaguar, K): standard memory technologies (DDR DRAM), massively parallel
 GPU-based “heterogeneous” machines — high compute, bandwidth, and latency performance on the GPU: Tsubame2/3, ABCI, Summit, Piz Daint, Frontier, Aurora, ...
 Fugaku/A64FX, Sapphire Rapids: aggressive technology & good SW ecosystem
 GPU/Matrix + CPU: unexplored but possibly the best (programmability, performance, industry adoption, ...)
BLAS / GEMM Utilization in HPC Applications [Domke et al. @ R-CCS, IPDPS 2020]
 Analyzed various data sources:
 Historical data from the K computer: only 53.4% of node-hours (in FY18) were consumed by applications with GEMM functions in their symbol tables
 Library dependencies: only 9% of Spack packages have a direct BLAS library dependency (51.5% have an indirect dependency)
 TensorCore benefit for DL: up to 7.6x speedup for MLPerf kernels
 GEMM utilization in HPC: sampled across 77 HPC benchmarks (ECP proxy, RIKEN fiber, TOP500, SPEC CPU/OMP/MPI), measured/profiled via Score-P and VTune
How to Achieve Our Performance Target for Dominant Memory-Bound HPC Applications?
 Real application performance on current machines (baseline): FP64 performance 1.6 Tflop/s, memory bandwidth 0.35 TB/s (for reference — GPU V100: 7.45 Tflop/s, 0.9 TB/s; A64FX: 2.7 Tflop/s, 1 TB/s)
 Achieving 100x by hardware alone:
 Scenario 1 — 100x on all benchmarks: requires FP64 110 Tflop/s and memory bandwidth 34.5 TB/s
 Scenario 2 — 100x on all benchmarks except HPL: requires FP64 30.8 Tflop/s and memory bandwidth 34.5 TB/s
 Achieving 100x with hardware plus application tuning:
 Scenario 3 — if low-precision arithmetic improves F/B fourfold (except HPL and MODYLAS): requires FP64 15 Tflop/s and memory bandwidth 8.6 TB/s
 Larger capacity and higher performance in on-chip memory, SDRAM, etc. may make these targets easier to reach
 Compute density can also be raised via dataflow, massive cores on the memory side, and algorithmic improvements
 A broader design-space exploration taking these into account is planned
FLOPS to BYTES for Future Acceleration? (1)
 Increasing FLOPS by increasing the number of ALUs is no longer viable
 Compute power = ALU logic switching power + data movement between ALUs and registers/memory
 ALU logic power saturates faster than lithography saturates
 No more acceleration of pure FLOPS
 The only way to increase performance at the low level is logic simplification, e.g., lower precision and alternative numerical formats
 At higher levels, decreasing the number of numerical operations is very effective => sparse (iterative) methods (general HPC), network compaction (AI), algorithmic pruning (HPC & AI) — see the CG sketch below
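As an illustration of the "sparse (iterative) methods" point, a minimal conjugate-gradient sketch (mine, not from the deck): CG touches only the nonzeros of A per iteration, so a sparse SPD solve costs O(iterations x nnz) operations instead of the O(n^3) of a dense factorization.

```python
import numpy as np

def cg(matvec, b, tol=1e-8, max_iter=1000):
    """Conjugate gradient for an SPD operator given only as a matvec."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Matrix-free 1-D Poisson operator: tridiagonal [-1, 2, -1], ~3 nonzeros/row,
# so each iteration moves O(n) data rather than O(n^2).
n = 200
def poisson_matvec(v):
    out = 2.0 * v
    out[:-1] -= v[1:]
    out[1:] -= v[:-1]
    return out

b = np.ones(n)
x = cg(poisson_matvec, b)
print("residual norm:", np.linalg.norm(b - poisson_matvec(x)))
```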
FLOPS to BYTES for Future Acceleration? (2)
 Data movement has its own problems, but it is promising with new device and packaging technology, plus architectures & algorithms to exploit them
 Devices & packaging:
 3-D stacking of memory + logic
 Photonic interconnects
 Dense and fast memory devices, from SRAM to MRAM
 Architecture:
 Large, high-bandwidth local-memory processors (very large L1/L2)
 Customized datapaths for frequent compute patterns — stencils/convolution, matrix, FFT, tensor operations, ... => can they be generalized? Micro-dataflow in a core?
 Coarse-grained dataflow (CGRA)? => optimize data movement in general, beyond standard CPU/GPU (SMT vector)
 Near-memory processing
 FLOPS to BYTES! Same motivation as embedded computing
Post-Moore Algorithmic Development
 Towards the 2030 Post-Moore era:
 End of ALU compute (FLOPS) advancement
 Disruptive reduction in data-movement cost with new devices and packaging
 Algorithm advances that reduce the computational order (+ more reliance on data movement)
 Unification of BD/AI/simulation towards a data-centric view
 Diagram (2030 outlook), organized by domain, algorithm, and architecture along an axis of computational complexity:
 Machine learning & HPC simulations — bandwidth-centric, O(n) algorithms (sparse NN, SVM, FFT, CG, O(n) QM, H-Matrix, graph), with data-movement reduction
 DL & quantum chemistry — advanced lower-order algorithms
 Search & optimization (NP-hard or HSP) — O(2^n), quantum algorithms as the new paradigm
 Architectures: CPU and/or GPU + α (data-movement acceleration, e.g. CGRA?) for the data-movement (BYTES) centric regime; quantum machines for the new paradigm

Categorization of Algorithms and Their Domains (2021, present day) — © 2021 Fujitsu Limited
 “New problem domains require new computing accelerators”
 In practice challenging, due to algorithms & programming
 Diagram (2021), same axes:
 Traditional but important ML & HPC simulations — data-movement (bandwidth) bound, O(n) to O(n log n) algorithms (HF, SVM, FFT, CG, H-Matrix, graph) on CPU or GPU w/HBM etc.
 Deep learning — compute-bound / FLOPS-centric CNN on GPU + matrix engines
 Quantum systems, combinatorial optimization, and “innovation challenge” areas (new DL, vision, Ising model, crypto, etc.) — NP-hard search & optimization, O(2^n) quantum algorithms on the new paradigm of quantum & digital annealers and quantum gates; latency-centric neuromorphic approaches
All Is Not Rosy: Modernizing & Downselecting Application & Algorithm Types
 Compute bound: served by matrix/tensor hardware
 Fairly low utilization
 Low memory capacity (O(n^k))
 Easy to encapsulate in libraries etc.
 Latency bound: served by standard localization & hiding techniques
 Good single-thread / low-latency communication hardware
 Multithreading / latency hiding
 Latency-avoiding / localization algorithms
 Bandwidth bound: served by 3D-stacked near memory & photonics
 Tiered memory — extremely high-bandwidth memory is capacity-limited relative to FLOPS (see figure)
 Requires algorithmic changes and innovations: generic (e.g. temporal blocking), customized, ...
 Some apps/algorithms may not survive the change (e.g. traditional unstructured mesh, ...)
Are Domain-Specific Accelerators Useful for HPC?
 On-chip integration (SoC)
 Accelerator on the same die as the CPU, or even embedded within a CPU (e.g. vector/matrix engines within CPU cores)
 Shares various resources with the CPUs, e.g. on-chip cache
 Low data-movement energy; homogeneous across nodes
 Multi-chip packaging
 Accelerator chiplets interconnected with CPU chiplets using interposers etc.
 Shared main memory; medium data-movement energy
 On-node accelerators + CPUs
 Accelerator–CPU connection via standard chip-to-chip interconnect, e.g. PCIe, CXL, CAPI
 Low bandwidth, higher data-movement energy
 Scalable if homogeneous and the workload is exclusive to the accelerator or the CPU
 Specific accelerated nodes/machines, via LAN or even WAN
 Expensive data movement; workload largely confined to each
 Limited utility, high cost of heterogeneous management, not scalable
 Only makes sense if the workload is well known and largely fixed
 Accelerators are a means to an end, not a purpose in themselves
Reality of Accelerated Computing
 From the user’s point of view, the computing system should be uniform, with heterogeneity, distribution, etc. hidden under the hood
 The success of clouds was achieved with this principle
 Modern IT involves a massive software ecosystem; heterogeneity hinders its use => integration with the CPU (or GPU) is most sensible
 Fugaku / A64FX was designed exactly on this principle
 Performance is always governed by Amdahl’s law (strong scaling) and Gustafson’s law (weak scaling)
 Employing multiple heterogeneous accelerators in one app => bad idea
 “Homogeneous” parallelization, with the workload exclusively confined to a SINGLE accelerator type (or the CPU) per node and good load balancing, is the ONLY way to overcome Amdahl’s law
 Successful applications on large GPU machines follow this principle
 “Balanced” use of GPU and CPU is a myth => EITHER GPU or CPU
Smartphones NOT Extrapolatable to HPC
 Smartphone SoCs are subject only to Amdahl speedup (law)
 Supercomputers are subject to both Amdahl and Gustafson speedup
 Apple A15 SoC (source: https://semianalysis.com/apple-a15-die-shot-and-annotation-ip-block-area-analysis/)
 References: Peter Bermel, ECE 695NS Lecture 3: Practical Assessment of Code Performance, Purdue University, https://nanohub.org/resources/20560/watch?resid=25763; Gustafson, J.L. (2011) “Gustafson’s Law”, in Padua, D. (ed.), Encyclopedia of Parallel Computing, Springer, Boston, MA, https://doi.org/10.1007/978-0-387-09766-4_78
Accelerators vs. Amdahl’s Law & Gustafson’s Law (1)
 Accelerators are subject to Amdahl’s law (strong scaling)
 If a fraction a of the time-to-solution t is acceleratable, accelerating it shrinks the total time towards (1-a)t, an asymptotic speedup of 1/(1-a)
 For accelerators to work, the non-accelerated portion must be as small as possible; e.g. with GPU+CPU, CPU processing must be minimized
 Large-scale parallel computing is subject to Gustafson’s law (weak scaling)
 Keep the non-parallel time constant and grow the parallelizable part with parallelism P: problem size xP, performance ~= xP for large P, if the non-parallelizable overhead is minimized, e.g. with load balancing, communication minimization, etc. => entails uniform, well-balanced processing on every node
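For reference, a compact statement of the two laws the diagrams depict, in the slide's notation (a = acceleratable / parallelizable fraction, s = accelerator speedup, P = parallelism); this rendering is mine, not from the deck:

```latex
% Amdahl (strong scaling): accelerate a fraction a of a fixed workload by s
S_{\text{Amdahl}} = \frac{1}{(1-a) + a/s}
  \;\xrightarrow[s \to \infty]{}\; \frac{1}{1-a}

% Gustafson (weak scaling): hold the non-parallel time fixed and grow the
% parallelizable part with parallelism P (problem size xP on the slide)
S_{\text{Gustafson}} = (1-a) + aP \;\approx\; aP \quad \text{for large } P
```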
Accelerators vs. Amdahl’s Law & Gustafson’s Law (2)
 Combining Amdahl’s law and Gustafson’s law in a supercomputer: with the non-accelerated (~= non-parallel) portion minimized, node performance gains ~1/(1-a), and across parallelism P the theoretical asymptotic performance gain is ~P/(1-a) — BUT this is extremely susceptible to overhead, e.g. load imbalance, communication overhead, etc.
 Principles of an accelerated supercomputer:
 Maximize acceleration under Amdahl => dominant processing done on the same accelerator on every node. BAD: “intra-node” heterogeneous processing
 Extremely uniform load balancing => SPMD over uniform accelerators is best. BAD: heterogeneous task parallelism over multiple types of accelerators
 Minimize parallelization overhead, e.g. communication => tight communication coupling of accelerated components: on-chip > on-package > on-node > different machines. BAD: any segregation entailing data movement, poor interconnect, etc.
Accelerators vs. Amdahl’s Law & Gustafson’s Law (3)
 It is no accident that every successful large-scale accelerated supercomputer (esp. the GPU machines) is
 built with a single node configuration across the entire machine
 tightly coupled, with a robust interconnect (& I/O) to sustain maximum bandwidth in/out of the accelerator processor
 doing its dominant processing on the GPU for maximum performance
 SPMD with very good load balancing (incl. data-parallel DNN training)
 Tsubame, Tianhe-2A, Titan/Summit, Piz Daint, ABCI, Fugaku, Frontier, LUMI, Aurora, ...
 ... and this is a consequence of physical laws, so it will continue to apply to future machines (no extreme heterogeneity, asynchrony, ...)
(Photo: Tsubame3)
“Multiple Heterogeneous Domain-Specific Accelerators” Considered Harmful
 Amdahl’s law also hits communication time and energy/power consumption
 Even if we achieve considerable speedup at low energy on one accelerator, moving the data around to be processed by other accelerators is hit by Amdahl’s law in communication time and power/energy consumption
 Neither can be brought down; the farther the signal travels — from on-chip towards inter-rack or inter-datacenter — the larger the overall overhead factor becomes
 Thus the right approach to minimizing the effect of Amdahl’s law is SoC or even in-CPU integration of acceleration features, NOT SEPARATE HETEROGENEOUS ACCELERATOR CHIPS & SYSTEMS
 Again, Fugaku / A64FX was designed on this principle
Quantum Future vs. Non-Quantum Future
Investigating the Quantum Future with HPC
What Are the Applications for Which We Desire a Quantum Accelerator?
 Applications infeasible on conventional computers due to high complexity => impractical time-to-solution
 Materials science: first-principles simulations (wave functions)
 Higher-order problems difficult to solve by conventional means due to high complexity: O(n^k) with k > 4
 Other examples: cryptography
 Applications solvable on existing computers but beneficial on quantum computers due to cheaper OPEX/CAPEX
 Optimization problems: TSP and variants
 Some classes of AI/DL: quantum learning
 These are important applications; OTOH the list is unfortunately not very long (and likely will not be...)
Quantum Computers: Only Possible as Amdahl Accelerators
 Accelerators are subject to Amdahl’s law (strong scaling)
 Possible (polynomial?) speedup for NP-hard problems, exponential speedup for HSP problems
 As with any accelerator: of the time-to-solution t, only the quantum-amenable portion enjoys the quantum advantage; for the accelerator to pay off, the non-QC portion must be as small as possible
 As problem complexity increases, the quantum-advantage portion comes to dominate the work, and the QC accelerator becomes increasingly worthwhile
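A tiny numeric illustration (mine, not the slide's) of the point: even an enormous quantum speedup on the quantum-amenable fraction a of an application is capped, end to end, by Amdahl's law at 1/(1-a).

```python
def end_to_end_speedup(a: float, qc_speedup: float) -> float:
    """Amdahl's law, with the quantum-amenable fraction a accelerated."""
    return 1.0 / ((1.0 - a) + a / qc_speedup)

for a in (0.5, 0.9, 0.99):
    print(f"a = {a:4.2f}: 1e6x QC speedup -> "
          f"{end_to_end_speedup(a, 1e6):6.1f}x overall (cap {1 / (1 - a):.0f}x)")
```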
Research for Quantum Computing/Computers (QC) @ R-CCS
① Development of large-scale QC simulators using Fugaku
 “Bracket” simulator (R-CCS Ito team): large-to-medium scale (#qubits < 50)
 Qulacs simulator (RQC Fujii team): medium-to-small scale (#qubits < 30)
 QC simulator designed using tensor networks (R-CCS Yunoki team)
 Development supported by the program “The Enhancement of Fugaku Usability” for Fugaku CPU resources
② Design of a hybrid programming environment integrating QC (simulators) and classic HPC supercomputers
 Workflow and task-parallel programming model for offloading to QC (R-CCS Sato team)
 Design and implementation of a common framework such as Qibo (IHPC, Singapore)
③ Design of QC algorithms and development of QC applications for QC + HPC hybrid computing
 Target: materials simulation optimizing the ground state of molecules by the VQE method, using a QC simulator with more than 40 qubits on Fugaku
 Expected to carry over to the real QC developed by RQC
 Supported by a special program under “The Enhancement of Fugaku Usability” for Fugaku CPU resources
 Access to external real QCs (IBM Q, D-Wave)
④ Research on architectures to accelerate QC simulation
 High-speed QC simulation technology is important for accelerating QC research (R-CCS Kondo team); acquisition of a GPU-based system (NVIDIA A100); outcomes expected to feed into Fugaku Next
⑤ International collaborations: IHPC (Singapore), CEA (France)
 Close collaboration with the RIKEN Center for Quantum Computing (RQC): integrated execution, technology transfer
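A rough sizing sketch (my arithmetic, not the slide's) of why full state-vector simulation tops out around the "#qubits < 50" quoted above: the state vector holds 2^n double-precision complex amplitudes, 16 bytes each.

```python
def state_vector_gib(n_qubits: int) -> float:
    """Memory for a full complex128 state vector, in GiB."""
    return (1 << n_qubits) * 16 / 2**30

for n in (30, 40, 45, 50):
    print(f"{n} qubits: {state_vector_gib(n):>12,.0f} GiB")
# 30 qubits: ~16 GiB (a workstation); 40: ~16 TiB; 45: ~512 TiB, already a
# sizable fraction of Fugaku's roughly 5 PiB aggregate memory; 50: ~16 PiB.
```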
Investigating the Non-Quantum Future
(Reprise of the two “FLOPS to BYTES for future acceleration?” slides above: the end of pure-FLOPS scaling, and data movement as the promising avenue via new devices, packaging, architectures, and algorithms.)
A64FX Bandwidth Numbers
 Extremely high bandwidth in caches and memory
 A64FX has out-of-order mechanisms in its cores, caches, and memory controllers, maximizing the usable bandwidth of each layer
 Chip performance: >2.7 TFLOPS (FP64)
 L1 caches: >11.0 TB/s aggregate (B/F ratio = 4); per core: >230 GB/s / >115 GB/s (load/store); L1D 64 KiB, 4-way; 512-bit-wide SIMD, 2x FMAs
 L2 cache: >3.6 TB/s aggregate (B/F ratio = 1.3); per core: >115 GB/s / >57 GB/s (load/store); 8 MiB, 16-way per CMG
 Memory: HBM2, 8 GiB per CMG; 1024 GB/s total (B/F ratio ~= 0.37); 256 GB/s per CMG
 Each CMG: 12 computing cores + 1 assistant core
All Rights Reserved. Copyright © FUJITSU LIMITED 2018
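The B/F (bytes-per-flop) ratios on the slide follow directly from its own bandwidth and peak numbers; a trivial check (mine):

```python
# B/F = bandwidth / peak FLOPS at each level of the A64FX memory hierarchy.
PEAK_TFLOPS = 2.7  # >2.7 TFLOPS double precision, per the slide

for level, tb_per_s in [("L1 caches (aggregate)", 11.0),
                        ("L2 cache", 3.6),
                        ("HBM2 memory", 1.024)]:
    print(f"{level}: {tb_per_s / PEAK_TFLOPS:.2f} B/F")
# -> ~4.07, ~1.33, ~0.38, matching the B/F ratios printed on the slide.
```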
Our Project: Exploring Versatile HPC Architecture and System Software Technologies to Achieve 100x Performance by 2028
Problems to be solved and goals to be achieved:
 General-purpose computer architectures that will accelerate a wide range of applications in the post-Moore era have not yet been established
 What is a feasible approach for versatile HPC systems based on bandwidth improvement?
 Goal: explore architectures that can achieve 100x performance on a wide range of applications around 2028
Exemplar FLOPS-to-BYTES architecture: general-purpose many-core CPUs augmented with near-memory processing, new memory devices, and new memory hierarchies and core interconnects, plus other approaches (like CGRA)
Approaches and subtasks — exploration of future CPU node architectures and the necessary technologies:
 Subtask 1.1: performance characterization and modeling with benchmarks, to identify directions for exploration and improvement (RIKEN R-CCS)
 Subtask 1.2: exploring a reconfigurable vector dataflow architecture (CGRA) that can exploit increased data-transfer capability (RIKEN R-CCS)
 Subtask 2: exploring innovative memory architectures with ultra-deep hierarchy and ultra-wide bandwidth (Tokyo Tech)
 Subtask 3: exploring near-memory computing for high effective bandwidth and cooling efficiency in general-purpose computing (U-Tokyo)
 Subtask 4: exploration of node architectures extending existing many-core CPUs with non-von-Neumann methods (unnamed company)
Plan: 2018–2022 — explore individual technologies; 2023~ — integrate promising technologies for a target node architecture (development stage, planned)
Advantages of Our Project
Approach of exploration:
 Target general-purpose architectures instead of special-purpose accelerators, so that conventional codes remain usable
 Explore a 100x-performance node around 2028; discover the node architectures to be developed
Novelty and innovativeness:
 FLOPS-to-BYTES architecture that improves performance effectively using increased bandwidth
 Performance modeling and analysis to guide development of node architectures
 Exploration of fundamental technologies to achieve the target node architecture: innovative memory systems, reconfigurable vector dataflow architecture (CGRA), near-memory computing
Specs of an HPC CPU around 2018 — Intel Skylake Xeon Platinum 8180:
 Process: Intel 14 nm CMOS
 CPU: Intel Xeon, 28 cores
 Frequency: 1.9 GHz (for vector computation with AVX)
 Peak FP64 performance: 1.7 TF (3.4 TF for FP32)
 Memory: DDR4-2666 DRAM, 6 channels, 130 GB/s
 NW injection bandwidth: 100 Gbps (InfiniBand EDR)
 Power: 300 W (including memory and NW)
Specs of HPC CPUs and systems expected in the post-Moore era around 2028:
 Process: 3 nm CMOS (EUV lithography); PCRAM and other non-volatile memories as novel devices
 CPU: normal cores & vector dataflow, near-memory cores
 Frequency: 2 GHz+
 Peak FP64 performance: 20 TF (200 TF for FP32, w/ or w/o matrix engine)
 Memory: highly dense TSV / silicon-on-silicon 3D-stacked devices
 1st layer: 3D TSV-stacked SRAM, >10 TB/s, capacity >8 GByte
 2nd layer: 2.5D TSV-stacked DRAM, >3 TB/s, capacity >64 GByte
 3rd layer: 2.5D TSV Flash/NVM, ~30 GB/s, capacity >1 TB
 NW injection bandwidth: 12 Tbps (VCSEL CWDM micro-optics)
 Power: 300 W+ (including memory and NW)
Towards Strong Scaling (1)
 Assume constant memory per core: #cores ~ total problem (total machine memory) size n ~ aggregate core performance
 Modern massively parallel architectures: per-core performance is constant, performance gains come from increasing #cores in the system, and runtime T ~ problem complexity / core performance
 Compute-bound codes of O(n^k) complexity, k > 1: runtime T ~ n^(k-1), so increasing total machine size increases T, even with constant memory per core
 Memory-bound codes of O(n) complexity: runtime T ~ 1 / #memory-controllers
 At the core level, #memory controllers (e.g. access paths to cache) ~ #cores, so runtime remains constant as cores increase (weak scaling works)
 At the chip level (external memory access), however, memory controllers stay constant even as #cores grows, so T ~ #cores (no scaling)
 Increasing memory size per core further is meaningless, since T ~ n
 Maintaining memory size per core, let alone increasing it, will not lead to effective performance gains — Gustafson’s law diminishes (restated in symbols below)
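The argument above in symbols (my restatement, under the slide's assumptions: memory per core constant, so problem size n is proportional to core count P; fixed per-core speed; c cores per chip sharing m external memory controllers):

```latex
% Compute-bound kernel, O(n^k) work, perfectly parallelized over P ~ n cores:
T_{\text{compute}} \;\propto\; \frac{n^{k}}{P} \;\propto\; n^{\,k-1}
\qquad (k > 1 \;\Rightarrow\; T \text{ grows as the machine grows})

% Memory-bound kernel, O(n) traffic: per chip, the data volume grows with the
% core count c, but the m external memory controllers do not, so
T_{\text{memory}} \;\propto\; \frac{c}{m}
\qquad (\text{constant } m \;\Rightarrow\; \text{no weak scaling at chip level})
```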
Towards Strong Scaling (2)
 Some applications are inherently strong-scaling, e.g. molecular dynamics (cf. Anton)
 Even traditional weak-scaling codes need to strong-scale
 Architectural requirement: high-bandwidth / low-latency memory forces capacity to be small
 Science requirements: moving from demonstrative big runs to real R&D — ensembles of multiple smaller problem sizes; time-to-solution matters far more than problem size
 If Gustafson’s law is well satisfied (e.g. well load-balanced), then strong scaling will work up to the point where load imbalance and/or the non-parallel region become significant
 Applications must be prepared to strong-scale, or at least to deal with hierarchical memory
 Advanced localization, e.g. temporal blocking (sketched below), placing only bandwidth-sensitive data in fast memory, ...
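A minimal sketch (mine, not from the deck) of the temporal blocking mentioned above, for a 1-D 3-point Jacobi stencil: each tile is loaded once with a halo of width T and advanced T time steps before moving on, so the working set stays in fast memory across steps.

```python
import numpy as np

def step(u):
    """One Jacobi smoothing step with fixed boundary values."""
    v = u.copy()
    v[1:-1] = 0.25 * u[:-2] + 0.5 * u[1:-1] + 0.25 * u[2:]
    return v

def naive(u, steps):
    """Reference: sweep the whole array once per time step."""
    for _ in range(steps):
        u = step(u)
    return u

def temporally_blocked(u, steps, tile=64):
    """Advance `steps` time steps tile by tile, using a halo of width `steps`."""
    out = u.copy()
    n = len(u)
    for lo in range(1, n - 1, tile):
        hi = min(lo + tile, n - 1)
        # Halo wide enough that the tile interior is exact after `steps` steps.
        a, b = max(lo - steps, 0), min(hi + steps, n)
        patch = u[a:b].copy()
        for _ in range(steps):
            patch = step(patch)
        out[lo:hi] = patch[lo - a:hi - a]
    return out

u0 = np.random.rand(1024)
print(np.allclose(naive(u0, 8), temporally_blocked(u0, 8)))  # True
```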
New Efforts at R-CCS towards the Non-Quantum Future
New!: Comprehensive benchmarking platform effort
 Collect benchmarks and machines, incl. Fugaku as well as x86 & GPU systems
 Construct a platform to run all benchmarks x all machines
 Make all benchmarks repeatably executable, so that new instrumentation can be added easily
 Couple with architectural simulators to conduct what-if analyses
New!: Enhancing system-software robustness; contributing to compilers and other performance OSS tools
 Make Fugaku performance-robust, not focused only on the co-design apps
 OSS as the future development platform, e.g. LLVM, contributing results back to the community
 New optimizations for new architectures, before the actual hardware exists
If You Are Excited about the Future of HPC...
 “Feasibility Study 2.0” starting Apr 2022, ~$4M USD/year
 We are hiring!
 Team/unit leaders (digital twins, possibly more in the future)
 Various researcher and post-doc positions
 For details, ‘google’ the R-CCS home page

  • 7. Why do Fugaku/A64FX performance problems arise: an analysis
 Many applications, including the co-design apps, achieve their designed performance
 However, cases are also seen where performance falls short despite high expectations
 Possible causes:
 1. Insufficient parallelism: the application cannot extract enough parallelism per node, in which case it falls behind Xeon, whose sequential performance is superior
 2. Insufficient I/O system scalability: data-processing-centric applications formerly run on small machines are now run on Fugaku; as CPU counts grow, I/O (especially the number of transactions) grows proportionally, Fugaku's I/O system fails to scale, and this becomes the bottleneck (e.g., genomics applications)
 3. Compiler vectorization failures: high performance on A64FX hinges on SVE vectorization, whether a code is compute bound or memory bound. The Fujitsu compiler handles the target applications well, but for some other applications the vectorization that should happen does not
 4. Insufficient CPU execution resources: A64FX was designed against a small set of target applications as its performance baseline, and to reach far better (3x) power efficiency than Intel its internal hardware resources have tight margins. The result is peakier behavior than Intel: applications that fit within the resources are fast, but those that fall outside lose performance
 The estimated frequency of occurrence is 3 >> 2, 4 > 1
 R-CCS has started technical countermeasures
  • 8. Outline of countermeasures (1)
 1. Insufficient parallelism: fundamentally, per-application/per-user support by RIST and others, improving parallelization through tuning
 Moreover, when an application with inherently low parallel efficiency is run on Fugaku, the realistic use case is massive ensemble execution such as parameter sweeps; debating the efficiency of a single run is, in that case, technically beside the point
 2. Insufficient I/O system scalability:
 Fugaku is equipped with NVMe devices for scaling I/O; promote their use, and adapt the software stack to such I/O-intensive (more precisely, I/O-transaction-intensive) workloads
 Work has already started: adopting BeeGFS in place of LLIO, and evaluating transaction-intensive applications such as genomics with tools like Darshan
 For genomics applications, establish collaborations inside and outside RIKEN, with a view to founding a new team
  • 9. Outline of countermeasures (2)
 3. Compiler vectorization failures:
 Fundamental overhaul of the compiler development structure: financial and human investment, starting next fiscal year, in an OSS-based compiler ecosystem that does not depend on Fujitsu; the creation of an operational technology unit for compiler development is also envisioned
 Comprehensive evaluation of applications
 Establishment of a performance research team
 4. Insufficient CPU execution resources:
 Clarify the exact resource shortfalls through comprehensive, fine-grained evaluation
 Based on that, new optimization algorithms that mitigate the impact and improve robustness
 Together with users, program-side remedies are also in scope
  • 10. Lessons Learned from Fugaku
 Positives: proper project vision and management
 General-purpose low-power CPU w/good FLOPS and high BW
 Arm ecosystem extremely important, for programming, tools & apps
 Aggressive R&D + adoption of new (risky) technologies: on-die HBM2, embedded ~400Gbps partially optical switchless interconnect, mainframe RAS, low power, etc.
 Co-design and co-working at national and international levels
 Some evolutions to cope with massive parallelism (K => Fugaku)
 Addition of modern architecture features, e.g. FP16
 Shortcomings: lack of widespread commercial adoption
 Co-design: focused too much on target app optimization (only)
 Immaturity of software stack, esp. compilers & libraries w/SVE
 Still too focused on classic HPC for industry & cloud adoption
 Failed to look at modern apps: data, AI, entertainment, mobility
 Failed to 'deprecate' classes of algorithms towards Post-Moore
 What elements can we learn for sustained performance improvements?
  • 11. Performance projection of many-core CPU systems based on IRDS roadmap
 2028: Post-Moore Era
 ~2015: 25 years of sustained scaling in the manycore period (post-Dennard scaling); 2016~: difficulty in advancing Moore's law; 2025~: Post-Moore era, the end of transistor-power advancement
 Challenge: exploration of computer architectures that will enable performance improvement even around the year 2028
 Key to sustained performance improvement: FLOPS to Bytes, "data movement-centric architecture"
 Reconfigurable, data-driven, vector computing
 Ultra-deep and ultra-wide bandwidth memory architectures
 Optical networks
 System software, programming, and algorithms that correspond to new architectures
  • 12. Performance projection of many-core CPU systems based on IRDS roadmap
 Predictions based on the IRDS Roadmap (2020 ed.): extrapolation of traditional many-core architectures relying merely on advances of semiconductor technologies will achieve only 1.8 EFLOPS peak (3.37x c.f. Fugaku), if a machine with broad applicability is to be built
 Methodology (CPU part), assumptions from IRDS Roadmap Systems and Architectures: cores/socket = 70, SIMD width = 2048-bit x 2, clock frequency = 3.9GHz, socket TDP = 351W
 System assumptions: system power = 30, 40, 50MW; PUE = 1.1; CPU share of power = 60, 70, 80%
 Even for the most aggressive system configuration (50MW power budget, 80% of power consumed by CPUs), predicted performance is only about 1.8EF
 NGACI white paper https://sites.google.com/view/ngaci/home (charts of socket cores, injection BW (Tb/s), and storage (EBytes) from the NGACI white paper omitted)
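A small Python sketch of the peak arithmetic behind this projection, under the assumptions listed above. How "SIMD width = 2048-bit x 2" translates into flops per cycle is my own reading (two 1024-bit FP64 FMA pipes, an FMA counted as two flops), chosen because it lands near the ~1.8 EF figure quoted above, so treat that constant as an assumption:

    # Peak-FLOPS projection sketch from the slide's assumptions.
    CORES = 70            # cores per socket
    CLOCK = 3.9e9         # Hz
    FLOPS_PER_CYCLE = 64  # assumption: 2 x 1024-bit FP64 FMA pipes
                          # = 2 pipes * 16 lanes * 2 flops (FMA)
    SOCKET_TDP = 351.0    # W
    PUE = 1.1

    def system_peak(power_mw, cpu_share):
        it_power = power_mw * 1e6 / PUE        # power left after facility overhead
        sockets = it_power * cpu_share / SOCKET_TDP
        return sockets * CORES * CLOCK * FLOPS_PER_CYCLE

    for mw in (30, 40, 50):
        for share in (0.6, 0.7, 0.8):
            print(f"{mw} MW, {share:.0%} CPU: "
                  f"{system_peak(mw, share) / 1e18:.2f} EFLOPS")
    # the most aggressive case (50 MW, 80% CPU) comes out near 1.8 EFLOPS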
  • 13. Application Kernel Categorization & SC Architecture
 Kernel classes: compute bound (aka Top500), bandwidth bound (aka HPCG), latency bound (aka Graph500)
 Classic vector machines (e.g. Earth Simulator), ~90s
 COTS-CPU based clusters with standard memory technologies (DDR DRAM), late 90s~late 2000s (ASCI machines, Tsubame1/T2K, Jaguar, K)
 GPU-based 'heterogeneous' machines, massively parallel, high compute & BW & latency performance for the GPU: Tsubame2/3, ABCI, Summit, Piz-Daint, Frontier, Aurora, …
 Fugaku/A64FX, Sapphire Rapids: aggressive technology & good SW ecosystem
 GPU/Matrix CPU: unexplored but best? (programmability, performance, industry adoption, …)
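A toy roofline model in Python illustrating this three-way categorization; the peak and bandwidth numbers are placeholders, not measurements of any machine above:

    # Roofline classifier: a kernel with arithmetic intensity AI (flops/byte)
    # is compute bound when AI exceeds the ridge point peak/bw, else
    # bandwidth bound; attainable performance is min(peak, AI * bw).
    def attainable(ai, peak_flops, bw):
        return min(peak_flops, ai * bw)

    PEAK = 3.0e12   # placeholder peak, flop/s
    BW   = 1.0e12   # placeholder memory bandwidth, byte/s
    for name, ai in [("GEMM", 50.0), ("stencil", 0.3), ("graph traversal", 0.05)]:
        bound = "compute" if ai > PEAK / BW else "bandwidth"
        print(f"{name}: AI={ai}, attainable "
              f"{attainable(ai, PEAK, BW) / 1e12:.2f} Tflop/s, {bound} bound")
    # latency-bound kernels (e.g. Graph500) fall below even this line, since
    # the roofline assumes perfectly streamed memory traffic
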
  • 14. BLAS / GEMM utilization in HPC Applications [Domke et al., R-CCS, IPDPS 2020]
 Analyzed various data sources:
 Historical data from the K computer: only 53.4% of node-hours (in FY18) were consumed by applications which had GEMM functions in the symbol table
 Library dependencies: only 9% of Spack packages have a direct BLAS lib dependency (51.5% have an indirect dependency)
 TensorCore benefit for DL: up to 7.6x speedup for MLPerf kernels
 GEMM utilization in HPC: sampled across 77 HPC benchmarks (ECP proxy, RIKEN fiber, TOP500, SPEC CPU/OMP/MPI) and measured/profiled via Score-P and VTune
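A sketch in Python of the symbol-table test mentioned above, checking a binary for BLAS GEMM symbols via the standard nm tool; the study's actual methodology was more involved, and the binary path here is a hypothetical placeholder:

    import re
    import subprocess
    import sys

    GEMM = re.compile(r"[scdz]gemm", re.IGNORECASE)

    def has_gemm(binary):
        # list symbols (static, then dynamic) and grep for *gemm entries
        for cmd in (["nm", binary], ["nm", "-D", binary]):
            out = subprocess.run(cmd, capture_output=True, text=True)
            if out.returncode == 0 and GEMM.search(out.stdout):
                return True
        return False

    if __name__ == "__main__":
        path = sys.argv[1] if len(sys.argv) > 1 else "./a.out"  # hypothetical
        print(path, "links GEMM" if has_gemm(path) else "no GEMM symbols found")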
  • 15. How to achieve our performance target for dominant memory-bound HPC applications?
 Application performance on current machines: FP64 performance 1.6 Tflop/s, memory bandwidth 0.35 TB/s (c.f. GPU V100: 7.45 Tflop/s, 0.9 TB/s; A64FX: 2.7 Tflop/s, 1 TB/s)
 Achieving 100x by hardware alone:
 Scenario 1, 100x on all benchmarks: required FP64 performance 110 Tflop/s, required memory bandwidth 34.5 TB/s
 Scenario 2, 100x on all benchmarks except HPL: required FP64 performance 30.8 Tflop/s, required memory bandwidth 34.5 TB/s
 Achieving 100x by hardware plus application tuning:
 Scenario 3, low-precision arithmetic improves F/B by 4x (except HPL, MODYLAS): required FP64 performance 15 Tflop/s, required memory bandwidth 8.6 TB/s
 Capacity and performance gains in on-chip memory, SDRAM, etc. may make these targets easier to reach
 Dataflow, massive cores on the memory side, and algorithmic improvements in arithmetic intensity: a broader exploration considering these will follow
  • 16. FLOPS to BYTES for future acceleration? (1)
 Increasing FLOPS via increasing the number of ALUs no longer viable
 Compute power = ALU logic switching power + data movement between ALUs and registers/memory
 ALU logic power saturation faster than lithography saturation
 No more acceleration of pure FLOPS
 Only way to increase performance at low level is logic simplification, e.g., lower precision, alternative numerical formats
 At higher levels, decreasing the # of numerical operations very effective => sparse (iterative) methods (general HPC), network compaction (AI), algorithmic pruning (HPC & AI)
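One concrete instance of "logic simplification via lower precision" is mixed-precision iterative refinement; a minimal Python/NumPy sketch of the general idea, not taken from the slide itself:

    import numpy as np

    # Mixed-precision iterative refinement: solve cheaply in FP32,
    # then recover FP64 accuracy by iterating on the FP64 residual.
    rng = np.random.default_rng(0)
    n = 200
    A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned
    b = rng.standard_normal(n)

    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(5):
        r = b - A @ x                                  # residual in FP64
        x += np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)

    print("residual norm:", np.linalg.norm(b - A @ x))  # near FP64 accuracy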
  • 17. FLOPS to BYTES for future acceleration? (2)
 Data movement has its own problems but promising w/ new device and packaging tech + architectures & algorithms to exploit them
 Devices & packaging: 3-D stacking of memory + logic; photonic interconnect; dense and fast memory devices from SRAM to MRAM
 Architecture:
 Large & high-bandwidth local memory processor (very large L1/L2)
 Customized datapaths for frequent compute patterns: stencils/convolution, matrix, FFT, tensor operations, ... => can they be generalized? Micro dataflow in a core?
 Coarse-grained dataflow (CGRA)? => optimize data movement in general over standard CPU/GPU (SMT vector)
 Near-memory processing
 FLOPS to BYTES! Same motivation as embedded computing
  • 18. Post-Moore Algorithmic Development: Categorization of Algorithms and Their Domains
 Towards the 2030 Post-Moore era: end of ALU compute (FLOPS) advance; disruptive reduction in data movement cost with new devices and packaging; algorithm advances to reduce the computational order (+ more reliance on data movement); unification of BD/AI/simulation towards the data-centric view
 "New problem domains require new computing accelerators": in practice challenging, due to algorithms & programming
 (Figure, © 2021 Fujitsu Limited: algorithms plotted by computational complexity, from O(n) through O(n log n) to O(2^n), against architectures and domains. FLOPS-centric architectures (CPU or GPU w/HBM, GPU + matrix engines) serve compute-bound kernels such as HF, CNN, SVM; data-movement (BYTES)-centric architectures (CPU and/or GPU + α, e.g. CGRA) serve bandwidth-bound kernels such as FFT, CG, sparse NN, H-matrix, and graph; latency-centric (neuromorphic) and new-paradigm architectures (quantum gates, quantum & digital annealers) serve search & optimization, NP-hard problems, Ising models, crypto, etc. Domains span machine learning, HPC simulations, deep learning, quantum systems, and combinatorial optimization, from the 2021 present day towards 2030.)
  • 19. All is not Rosy: Modernizing & Downselecting Application & Algorithm Types
 Compute bound via matrix/tensor HW: fairly low utilization; low memory capacity (O(n^k)); easy to encapsulate in libraries etc.
 Latency bound via standard localization & hiding techniques: good single-thread / low-latency communication HW; multithreading/latency hiding; latency-avoiding / localization algorithms
 BW bound via 3D-stacked near memory & photonics: tiered memory, since extreme high-BW memory is capacity limited c.f. FLOPS (see figure); requires algorithmic changes and innovations, generic (e.g. temporal blocking) or customized, …
 Some apps/algorithms may not survive the change (e.g. traditional unstructured mesh…)
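As an illustration of the "generic algorithmic changes" named above, a minimal ghost-zone temporal-blocking sketch for a 1-D periodic stencil (Python/NumPy); the sizes and step counts are arbitrary:

    import numpy as np

    def naive(u, t):
        # t sweeps of a 3-point averaging stencil, periodic boundaries
        for _ in range(t):
            u = (np.roll(u, 1) + u + np.roll(u, -1)) / 3.0
        return u

    def ghost_zone(u, t, block):
        # Temporal blocking: load each tile plus t halo cells per side, then
        # take all t time steps on the tile before touching main memory
        # again. The valid region shrinks by one cell per side per step, so
        # after t steps exactly the tile interior remains.
        n, out = len(u), np.empty_like(u)
        for start in range(0, n, block):
            tile = u[np.arange(start - t, start + block + t) % n]
            for _ in range(t):
                tile = (tile[:-2] + tile[1:-1] + tile[2:]) / 3.0
            out[start:start + block] = tile
        return out

    u = np.random.default_rng(1).standard_normal(64)
    assert np.allclose(naive(u, 3), ghost_zone(u, 3, block=16))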
  • 20. Are Domain-Specific Accelerators Useful for HPC?
• On-chip integration (SoC): accelerator on the same die as the CPU, or even embedded within a CPU (e.g. vector/matrix engines within CPU cores); shares various resources with the CPUs, e.g. on-chip cache; low energy of data movement, homogeneous across nodes.
• Multi-chip packaging: interconnect accelerator chiplets with CPU chiplets using interposers etc.; shared main memory, medium-energy data movement.
• On-node accelerators + CPUs: accelerator–CPU connection via standard chip-to-chip interconnect, e.g. PCI-E, CXL, CAPI; low bandwidth, higher energy of data movement; scalable if homogeneous and the workload is exclusive to ACC or CPU.
• Specific accelerated nodes/machines, via LAN or even WAN: expensive data movement, workload largely confined to each; limited utility, high cost of heterogeneous management, not scalable; only makes sense if the workload is well known and largely fixed.
 Accelerators are a means to an end, not a purpose by themselves
  • 21. Reality of Accelerated Computing
 From the user's point of view, the computing system should be uniform, with heterogeneity, distribution etc. hidden under the hood
 The success of clouds was achieved with this principle
 Modern IT involves a massive software ecosystem; heterogeneity hinders its use => integration with the CPU (or GPU) is most sensible
 Fugaku / A64FX was designed exactly with this principle
 Performance always governed by Amdahl's law (strong scaling) and Gustafson's law (weak scaling)
 Employing multiple heterogeneous accelerators in an app => bad idea
 "Homogeneous" parallelization of workloads exclusively confined to a SINGLE accelerator type (or CPU) per node with good load balancing is the ONLY way to overcome Amdahl's law
 Successful applications on large GPU machines follow this principle
 "Balanced" use of GPU and CPU is a myth => EITHER GPU or CPU
  • 22. Smartphones NOT extrapolatable to HPC
 Smartphone SoCs subject to Amdahl speedup (law)
 Supercomputers subject to Amdahl & Gustafson speedup
 (Figure: Apple A15 SoC die shot, source https://semianalysis.com/apple-a15-die-shot-and-annotation-ip-block-area-analysis/)
 References: ECE 695NS Lecture 3: Practical Assessment of Code Performance, by Peter Bermel, https://nanohub.org/resources/20560/watch?resid=25763; Gustafson J.L. (2011) Gustafson's Law. In: Padua D. (eds) Encyclopedia of Parallel Computing. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-09766-4_78
  • 23. Accelerators vs. Amdahl's Law & Gustafson's Law (1)
 Accelerators are subject to Amdahl's law (strong scaling): with an acceleratable fraction a of a baseline time-to-solution t, time shrinks toward (1-a)t, i.e., a speedup bound of 1/(1-a). For accelerators to work, the non-accelerated portion must be as small as possible; e.g. for GPU+CPU, the CPU processing must be minimized
 Large-scale parallel computing is subject to Gustafson's law (weak scaling): with parallelism P and problem size xP, performance ~= xP for large P, if non-parallelizable overhead is minimized, e.g. w/load balancing, communication minimization, etc. => entails uniform, well-balanced processing on every node
  • 24. Accelerators vs. Amdahl's Law & Gustafson's Law (2)
 Combining Amdahl's law and Gustafson's law in a supercomputer: with non-accelerated (~= non-parallel) node fraction (1-a), there is a theoretical asymptotic performance gain over parallelism P, BUT it is extremely susceptible to overhead, e.g. load imbalance, communication overhead, etc.
 Principles of an accelerated supercomputer:
 • Maximizing acceleration under Amdahl => dominant processing done on the same accelerator on every node. BAD: "intra-node" heterogeneous processing
 • Extremely uniform load balancing => SPMD over uniform accelerators is best. BAD: heterogeneous task parallelism over multiple types of accelerators
 • Minimize parallelization overhead, e.g. communication => tight communication coupling of accelerated components: on-chip > on-package > on-node > different machines. BAD: any segregation entailing data movement, poor interconnect, etc.
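The standard forms of the two laws referenced above, as a small Python sketch (the fractions and parallelism counts are illustrative):

    # Amdahl (strong scaling): fixed problem, fraction a is acceleratable
    def amdahl_speedup(a, s):
        return 1.0 / ((1.0 - a) + a / s)

    # Gustafson (weak scaling): problem grows with parallelism P
    def gustafson_speedup(a, P):
        return (1.0 - a) + a * P

    for a in (0.9, 0.99, 0.999):
        print(f"a={a}: Amdahl cap {amdahl_speedup(a, float('inf')):.0f}x, "
              f"Gustafson @P=10000 {gustafson_speedup(a, 10_000):.0f}x")
    # Amdahl caps speedup at 1/(1-a) no matter how fast the accelerator;
    # Gustafson keeps scaling with P provided the serial fraction stays small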
  • 25. Accelerators vs. Amdahl's Law & Gustafson's Law (3)
 It is no accident that every successful large-scale accelerated supercomputer (esp. GPU machines) is
 built with a singular node configuration across the entire machine
 tightly coupled, with a robust interconnect (& I/O) to sustain maximum bandwidth in/out of the accelerator processor
 running its dominant processing on the GPU for maximum performance
 SPMD with very good load balancing (incl. data-parallel DNN training)
 Tsubame, Tianhe-2A, Titan/Summit, Piz-Daint, ABCI, Fugaku, Frontier, Lumi, Aurora, …
 … and this is the consequence of physical laws, so it will continue to apply to future machines (no extreme heterogeneity, asynchrony, …)
  • 26. "Multiple Heterogeneous Domain-Specific Accelerators" Considered Harmful
 Amdahl's law will also hit communication time and energy/power consumption
 Even if we achieve considerable speedup with low energy on one accelerator, moving the data around to be processed by other accelerators gets hit by Amdahl's law in communication time and power/energy consumption
 Neither can be brought down; the farther the signal travels, from on-chip towards inter-rack or inter-IDC, the more it becomes the overall overhead factor
 Thus the right approach to minimize the effect of Amdahl's law is SoC or even in-CPU integration of acceleration features, NOT SEPARATE HETEROGENEOUS ACCELERATOR CHIPS & SYSTEMS
 Again, Fugaku / A64FX was designed with this principle
  • 27. Categorization of Algorithms and Their Domains: Quantum Future vs. Non-Quantum Future
 (Same figure and bullets as slide 18, © 2021 Fujitsu Limited, now split into the two branches investigated next: a "Quantum Future" around the new-paradigm quantum architectures, and a "Non-Quantum Future" around data-movement-centric classical architectures.)
  • 28. Investigating the Quantum Future with HPC
  • 29. What are the applications for which we desire a quantum accelerator?
 Applications that are infeasible to solve on conventional computers due to high complexity => impractical time-to-solution
 Material science: first-principles simulations (wave functions)
 Higher-order problems difficult to solve with conventional means due to high complexity: O(n^k) where k > 4
 Other examples: cryptography
 Applications that can be solved with existing computers but are beneficial on quantum computers due to cheaper OPEX/CAPEX
 Optimization problems: TSP and variants
 Some classes of AI/DL: quantum learning
 These are important applications; OTOH the list is unfortunately not very long (and likely will not be…)
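To make the "high complexity" point concrete: simulating an n-qubit state classically needs 2^n complex amplitudes (16 bytes each at complex128), which is why the simulator work on the next slides caps out below ~50 qubits. A quick Python check:

    # Memory needed to hold an n-qubit statevector at complex128 (16 bytes)
    for n in (30, 40, 50):
        gib = 16 * 2**n / 2**30
        print(f"{n} qubits: {gib:,.0f} GiB")
    # 30 qubits: 16 GiB; 40 qubits: 16,384 GiB; 50 qubits: 16,777,216 GiB (16 PiB)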
  • 30. Accelerators are subject to Amdahl's law (strong scaling)
 Possible (polynomial?) speedup for NP-hard problems, exponential speedup for HSP problems
 For accelerators to work, the non-accelerated (non-QC) portion of the time-to-solution must be as small as possible
 Quantum computers are only possible as Amdahl accelerators
 (Figure: time-to-solution bars in which the quantum-advantage (QC) share grows relative to the non-QC share as problem complexity increases)
  • 31. Research for Quantum Computing/Computers (QC) @ R-CCS
① Development of large-scale QC simulators using Fugaku: the Bracket simulator (R-CCS Ito Team), large-medium scale (#qubits < 50); the Qulacs simulator (RQC Fujii Team), medium-small scale (#qubits < 30); a QC simulator designed using tensor networks (R-CCS Yunoki Team). Development supported by the program "The enhancement of Fugaku usability" for Fugaku CPU resources.
② Design of a hybrid programming environment for integration of QC (simulators) and classic HPC supercomputers: workflow and task-parallel programming model for offloading to QC (R-CCS Sato Team); design and implementation of a common framework such as Qibo (IHPC, Singapore).
③ Design of QC algorithms and development of QC applications for QC+HPC hybrid computing. Target: material simulation optimizing molecular ground states by the VQE method, using more than 40 qubits of the QC simulator on Fugaku; expected to be used for the real QC developed by RQC. Supported by a special program under "The enhancement of Fugaku usability" for Fugaku CPU resources; access to external real QC (IBM Q, D-WAVE).
④ Research on architectures to accelerate QC simulation: to accelerate QC research, technology for high-speed QC simulation is important (R-CCS Kondo Team); acquisition of a GPU-based system (NVIDIA A100); outcomes expected to feed into Fugaku Next.
⑤ International collaborations: IHPC (Singapore), CEA (France). Collaboration with the RIKEN Center for Quantum Computing (RQC).
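To show where the <50-qubit limits above come from, a minimal statevector-simulation sketch (Python/NumPy): every gate touches all 2^n amplitudes, so memory and time both grow as O(2^n). This is an illustration of the general technique, not the design of any simulator named above:

    import numpy as np

    def apply_1q(state, gate, target, n):
        # apply a 2x2 gate to qubit `target` of an n-qubit statevector;
        # reshaping exposes the target axis so the gate is one tensordot
        psi = state.reshape([2] * n)
        psi = np.tensordot(gate, psi, axes=([1], [target]))
        psi = np.moveaxis(psi, 0, target)
        return psi.reshape(-1)

    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    n = 20                       # 2^20 amplitudes = 16 MiB at complex128
    state = np.zeros(2**n, dtype=np.complex128)
    state[0] = 1.0
    for q in range(n):           # Hadamard on every qubit
        state = apply_1q(state, H, q, n)
    # uniform superposition: every amplitude equals 2^(-n/2)
    assert np.allclose(state, 2 ** (-n / 2))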
  • 32. Investigating the non-Quantum Future: FLOPS to BYTES for future acceleration? (1) (same content as slide 16, repeated under this heading)
  • 33. Investigating the non-Quantum Future: FLOPS to BYTES for future acceleration? (2) (same content as slide 17, repeated under this heading)
  • 34. A64FX Bandwidth Numbers
 Extremely high bandwidth in caches and memory: A64FX has out-of-order mechanisms in cores, caches and memory controllers, maximizing the capability of each layer's bandwidth
 Performance: >2.7 TFLOPS
 Core: 512-bit wide SIMD, 2x FMAs; L1D 64KiB, 4-way; >230 GB/s / >115 GB/s per core
 L1 cache total: >11.0 TB/s (BF ratio = 4)
 L2 cache (8MiB, 16-way, per CMG of 12 computing cores + 1 assistant core): >3.6 TB/s total (BF ratio = 1.3); >115 GB/s / >57 GB/s per core
 Memory: HBM2, 8GiB and 256 GB/s per CMG; 1024 GB/s total (BF ratio =~0.37)
 All Rights Reserved. Copyright © FUJITSU LIMITED 2018
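A quick Python consistency check of the byte-per-flop (BF) ratios quoted above against the >2.7 TFLOPS peak:

    # BF ratio = bytes/s of bandwidth available per flop/s of peak
    PEAK = 2.7e12                     # flop/s
    for layer, bw in [("L1", 11.0e12), ("L2", 3.6e12), ("HBM2", 1.024e12)]:
        print(f"{layer}: {bw / PEAK:.2f} B/F")
    # L1 ~4.07, L2 ~1.33, HBM2 ~0.38 -- matching the slide's 4 / 1.3 / ~0.37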
  • 35. Our Project: Exploring versatile HPC architecture and system software technologies to achieve 100x performance by 2028
 Problems to be solved and goals to be achieved:
 • General-purpose computer architectures that will accelerate a wide range of applications in the post-Moore era have not yet been established
 • What is a feasible approach for versatile HPC systems based on bandwidth improvement?
 • Goal: to explore architectures that can achieve 100x performance in a wide range of applications around 2028
 Exemplar FLOPS to BYTES architecture: general-purpose many-core CPUs plus other approaches (like CGRA), near-memory processing, new memory devices, and new memory hierarchies and connections for cores
 Approaches and subtasks (exploration of future CPU node architectures and necessary technologies):
 Subtask 1.1: Performance characterization and modeling with benchmarks to identify directions for exploration and improvement (Riken R-CCS)
 Subtask 1.2: Exploring a reconfigurable vector data-flow architecture (CGRA) that can exploit increased data transfer capability (Riken R-CCS)
 Subtask 2: Exploring innovative memory architectures with ultra-deep and ultra-wide bandwidth (Tokyo Tech)
 Subtask 3: Exploring near-memory computing for highly effective bandwidth and cooling efficiency for general-purpose computing (U-Tokyo)
 Subtask 4: Exploration of node architectures as extensions of existing many-core CPUs with non-von-Neumann methods (unnamed company)
 Plan: 2018–2022, explore individual technologies, then integrate promising technologies for a target node architecture; developing stage from 2023~ (planned)
  • 36. Advantages of our project
 Approach of exploration:
 • Target general-purpose architectures instead of special-purpose accelerators, so that conventional codes remain usable
 • Explore x100-performance nodes around 2028
 • Discover node architectures to be developed
 Novelty and innovativeness:
 • FLOPS to BYTES architecture to improve performance effectively using increased bandwidth
 • Performance modeling and analysis to guide development of node architectures
 • Exploration of fundamental technologies to achieve the target node architecture: innovative memory systems, reconfigurable vector data-flow architecture (CGRA), near-memory computing
 Specs of an HPC CPU around 2018 (Intel Skylake Xeon Platinum 8180): process Intel 14nm CMOS; 28 cores; frequency 1.9GHz (for vector computation with AVX); peak FP64 perf 1.7 TF (3.4 TF for FP32); memory DDR4-2666 DRAM, 6ch, 130GB/s; NW injection BW 100Gbps (InfiniBand EDR); power 300W (including memories and NW)
 Specs of HPC CPUs and systems expected in the Post-Moore era around 2028: process 3nm CMOS (EUV lithography), with PCRAM and other non-volatile memory as novel devices; CPU with normal cores, vector data-flow, and near-memory cores; frequency 2GHz+; peak FP64 perf 20 TF (200 TF for FP32, w/ or w/o matrix engine); memory as highly dense TSV/silicon-on-silicon 3D-stacked devices: 1st layer 3D TSV stacked SRAM >10TB/s, capacity >8GByte; 2nd layer 2.5D TSV stacked DRAM >3TB/s, capacity >64GByte; 3rd layer 2.5D TSV Flash/NVM ~30GB/s, capacity >1TB; NW injection BW 12Tbps (VCSEL CWDM micro-optics); power 300W+ (including memories and NW)
  • 37. Towards Strong Scaling (1)
 Assume constant memory per core, so #cores ~ total problem (i.e., total machine memory) size n, with core performance fixed
 Modern massively parallel architectures: core performance constant, performance gains ~ increasing #cores in the system, runtime T ~ problem complexity / core performance
 Compute-bound codes of O(n^k) complexity where k > 1: runtime T ~ n^(k-1), so increasing total machine size increases T, even w/ constant memory per core
 Memory-bound codes of O(n) complexity: runtime T ~ # memory controllers
 At core level, # memory controllers (e.g. access to cache) ~ #cores, so runtime remains constant with increasing cores (weak scaling)
 However, at chip level (external memory access), memory controllers stay constant even as #cores increases, so T ~ #cores (no scaling)
 Increasing memory size per core further is meaningless, since T ~ n
 Maintaining memory size per core, let alone increasing it, will not lead to effective performance gains: diminishing Gustafson's law
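A tiny Python rendering of the argument above: under weak scaling with constant memory per core (n ~ P), runtime goes as T ~ P^(k-1) for O(n^k) work spread over P cores:

    # Weak-scaling runtime model: problem size n grows with core count P
    # (constant memory per core); O(n^k) work is divided across P cores.
    def runtime(P, k):
        n = P                      # n ~ P under constant memory per core
        return n**k / P            # work / aggregate compute rate = P^(k-1)

    for P in (1, 10, 100, 1000):
        print(f"P={P:>4}: O(n) code T={runtime(P, 1):g}, "
              f"O(n^2) code T={runtime(P, 2):g}")
    # O(n) codes hold T constant; for k>1, T grows with machine size,
    # which is the 'diminishing Gustafson' effect described above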
  • 38. Towards Strong Scaling (2)
 Some applications are inherently strong scaling, e.g. molecular dynamics, c.f. Anton
 Even traditional weak-scaling codes need to strong scale
 Architectural requirements: high-BW / low-latency memory requires capacity to be small
 Science requirements: from demonstrative big runs to real R&D
 Ensembles of multiple smaller problem sizes
 Time to solution >> problem size
 If Gustafson's law is well satisfied (e.g., well load balanced), then strong scaling will work up to the point where bad load balance and/or the non-parallel region become significant
 Applications must be prepared to strong scale, or at least to deal with hierarchical memory
 Advanced localization, e.g. temporal blocking, putting only BW-sensitive data in fast memory, …
  • 39. New Efforts at R-CCS towards the Non-Quantum Future
 New!: Comprehensive benchmarking platform effort
 Collect benchmarks and machines, incl. Fugaku as well as x86 & GPU
 Construct a platform to do all-benchmarks x all-machines benchmarking
 Make all benchmarks repeatedly executable so that new instrumentations can be done easily
 Couple with architectural simulators to conduct what-if analysis
 New!: Enhancing system software robustness, contributing to compilers and other performance OSS tools
 Make Fugaku performance-robust, not focused only on the co-design apps
 OSS as the future dev platform, e.g. LLVM, and contribute results back to the community
 New optimizations for new architectures before the actual HW exists
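A minimal sketch (Python) of what such an all-benchmarks x all-machines sweep could look like; the benchmark commands, machine names, and ssh-based launcher are hypothetical placeholders, not the actual R-CCS platform:

    import csv
    import itertools
    import subprocess

    # Hypothetical inventory: a real platform would hold launch commands,
    # environments, and instrumentation hooks per machine.
    BENCHMARKS = {"stream": ["./stream"], "hpcg": ["./xhpcg"]}          # placeholders
    MACHINES = {"fugaku": ["ssh", "fugaku"], "x86": ["ssh", "x86box"]}  # placeholders

    with open("results.csv", "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["machine", "benchmark", "returncode"])
        for (m, launch), (b, cmd) in itertools.product(MACHINES.items(),
                                                       BENCHMARKS.items()):
            # repeatable execution: the same launcher + command every time,
            # so profilers or simulators can be slotted in at this one spot
            r = subprocess.run(launch + cmd, capture_output=True)
            w.writerow([m, b, r.returncode])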
  • 40. If you are excited about the future of HPC…
 "Feasibility Study 2.0" starting Apr 2022, ~$4M USD/y
 We are hiring!
 Team/unit leaders (digital twins, possibly more in the future)
 Various researcher and post-doc positions
 For details, 'google' the R-CCS home page