NNSA Explorations: ARM for Supercomputing
UNCLASSIFIED UNLIMITED RELEASE
• Simon Hammond – Sandia National Laboratories (sdhammo@sandia.gov)
• Howard Pritchard – Los Alamos National Laboratory (howardp@lanl.gov)
Exciting Time to be in HPC…
• Exascale Computing
• Adoption of ML/AI for HPC
• New Hardware/Software
What we’ll cover today
• Why Arm – what’s so interesting about it?
• Marvell ThunderX2 overview and comparison with x86_64
• Astra/Vanguard Program
• ASC mini-app and application performance
• Porting to ARM
7/30/19 Unclassified
What’s sooooooo Interesting About Arm?
• In many ways, not much…
• It’s just an instruction set
• As long as it can run Fortran, C and C++ we are good, right?
• In other ways, quite a lot is interesting
• Different business model, consortium of implementations
• Open for partners to suggest new instructions
• Broad range of intellectual property opportunities
• Broad(er) range of implementations than, say, x86, POWER, SPARC, etc.
What’s sooooooo Interesting About Arm?
• DOE invests more than $100M in the hardware of a typical supercomputer (often substantially more than this when the final bill comes in)
• Competition helps to drive down prices and increase innovation
• We want to optimize price/performance for our machines – get the absolute best workload performance we can at the best price at which we can buy hardware
• The future is interesting – Arm is an IP company, not an implementation
• What if we could blend existing Arm IP blocks with our own DOE-inspired accelerators?
• Build workload-optimized processors and computers that benefit DOE scientists?
• e.g. a machine just for designing new materials, but one which is 100X faster than today?
• Arm is an opportunity to engage with a broad range of suppliers and an ecosystem
• Not the only way to do this; we can also partner with traditional vendors like Intel, IBM, AMD, etc.
7/30/19 Unclassified
Arm is Growing in HPC…
7/30/19 Unclassified
NNSA/ASC Vanguard Program
A proving ground for next-generation HPC technologies in support of the
NNSA mission
http://vanguard.sandia.gov
Astra – the First Petascale Arm-based Supercomputer
7/30/19 Unclassified
Where Vanguard Fits in our Program Strategy

ASC Test Beds
• Small testbeds (~10-100 nodes)
• Breadth of architectures
• Brave users

Vanguard
• Larger-scale experimental systems
• Focused efforts to mature new technologies
• Broader user-base
• Not Production
• Tri-lab resource, but not for ATCC runs

ATS/CTS Platforms
• Leadership-class systems (Petascale, Exascale, ...)
• Advanced technologies, sometimes first-of-kind
• Broad user-base
• Production Use

Key: toward ATS/CTS – Greater Scalability, Larger Scale, Focus on Production; toward Test Beds – Higher Risk, Greater Architectural Diversity
7/30/19 Unclassified
NNSA/ASC Advanced Trilab Software Environment (ATSE) Project
• Advanced Tri-lab Software Environment
• Sandia leading development with input from Tri-lab Arm team
• Will be the user programming environment for Vanguard-Astra
• Partnership across the NNSA/ASC Labs and with HPE
• Lasting value
• Documented specification of:
• Software components needed for HPC production applications
• How they are configured (i.e., what features and capabilities are enabled) and interact
• User interfaces and conventions
• Reference implementation:
• Deployable on multiple ASC systems and architectures with common look and feel
• Tested against real ASC workloads
• Community inspired, focused and supported
ATSE is an integrated software environment for ASC workloads
[Figure: ATSE software stack diagram]
7/30/19 Unclassified
HPE’s HPC Software Stack
• HPE:
  • HPE MPI (+ XPMEM)
  • HPE Cluster Manager
• Arm:
  • Arm HPC Compilers
  • Arm Math Libraries
  • Allinea Tools
• Mellanox-OFED & HPC-X
• RedHat 7.x for aarch64
ATSE Collaboration with HPE’s HPC Software Stack
[Figure: ATSE software stack diagram]
7/30/19 Unclassified
SVE Enablement – Next Generation of SIMD/Vector Instructions
• SVE work is underway
• SVE = Scalable Vector Extension
• Length-agnostic vector instructions at the ISA level
• Using ArmIE (fast emulation) and the RIKEN GEM5 simulator
• GCC and Arm toolchains
• Collaboration with RIKEN
• Visited Sandia (participants from SNL, LANL, LLNL, RIKEN)
• Discussion of performance and simulation techniques
• Deep-dive on SVE (GEM5)
• Short-term plan
• Use of SVE intrinsics for Kokkos-Kernels SIMD C++/data parallel types (a minimal SVE sketch follows below)
• Underpins a number of key performance routines for Trilinos libraries
• Seen large (6X) speedups for AVX512 on KNL and Skylake
• Expect to see similar gains for SVE vector units
• Critical performance enablement for Sandia production codes
7/30/19 Unclassified
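To make the length-agnostic idea concrete, here is a minimal vector-length-agnostic SAXPY using the Arm C Language Extensions for SVE. This is our own illustrative sketch (compile with -march=armv8-a+sve), not code from Kokkos-Kernels:

    #include <arm_sve.h>
    #include <stdint.h>

    /* Vector-length-agnostic SAXPY: y[i] += a * x[i].
       The same binary runs on any SVE vector length (128b-2048b);
       svcntw() reports how many 32-bit lanes the hardware has. */
    void saxpy_sve(float a, const float *x, float *y, int64_t n) {
        for (int64_t i = 0; i < n; i += (int64_t)svcntw()) {
            svbool_t pg = svwhilelt_b32_s64(i, n);  /* mask off the tail */
            svfloat32_t vx = svld1_f32(pg, &x[i]);  /* predicated load   */
            svfloat32_t vy = svld1_f32(pg, &y[i]);
            vy = svmla_n_f32_x(pg, vy, vx, a);      /* vy += vx * a      */
            svst1_f32(pg, &y[i], vy);               /* predicated store  */
        }
    }

Because the predicate handles the loop tail, there is no remainder loop to maintain per vector width – the property that makes SVE attractive for templated SIMD types.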
• Workflows leveraging containers and virtual machines
• Support for machine learning frameworks
• ARMv8.1 includes new virtualization extensions, SR-IOV
• Evaluating parallel filesystems + I/O systems @ scale
• GlusterFS, Ceph, BeeGFS, Sandia Data Warehouse, …
• Resilience studies over Astra lifetime
• Improved MPI thread support, matching acceleration
• OS optimizations for HPC @ scale
• Exploring spectrum from stock distro Linux kernel to HPC-tuned Linux
kernels to non-Linux lightweight kernels and multi-kernels
• Arm-specific optimizations
[Figure: ATSE software stack diagram]
ATSE R&D Efforts – Developing Next-Generation NNSA Workflows
7/30/19 Unclassified
Marvell ThunderX2
7/30/19 Unclassified
ThunderX2 - Second Generation High-End Armv8-A Server SoC
7/30/19 Unclassified
Up to 32 custom Armv8.1 cores, up to 2.5GHz
Full OoO, 1, 2, 4 threads per core
1S and 2S Configuration
Up to 8 DDR4-2667 Memory Controllers, 1 & 2 DPC
Up to 56 lanes of PCIe, 14 PCIe controllers
Full SoC: Integrated SATAv3, USB3 and GPIOs
Server class RAS & Virtualization
Extensive Power Management
LGA and BGA for most flexibility
40+ SKUs (75W – 180W)
7/30/19 Unclassified
ThunderX2 Comparison with Xeon Processors

                          Marvell ThunderX2   Haswell E5-2698 v3   Broadwell E5-2695   Skylake Gold 6152
Cores/Socket              32 (max 4 HT)       16 (2 HT)            22 (2 HT)           22 (2 HT)
L1 Cache/Core             32KB I/D (8-way)    32KB I/D (8-way)     32KB I/D (8-way)    32KB I/D (8-way)
L2 Cache/Core             256KB (8-way)       256KB (8-way)        256KB (8-way)       1MB (16-way)
L3 Cache/Socket           32MB                40MB                 33MB                30.25MB
#Memory Channels/Socket   8 DDR4              4 DDR4               4 DDR4              6 DDR4
Base Clock Rate           2.2GHz              2.3GHz               2.2GHz              2.1GHz
Vector/SIMD Length        128b (NEON)         256b (AVX2)          256b (AVX2)         512b (AVX512)
7/30/19 Unclassified
Roofline Comparison

[Figure: log-log roofline plots, Attainable (GFlop/s) vs. Arithmetic Intensity (Flop/Byte), built up one processor at a time across slides. Series shown, with theoretical peak GFlops and, where labeled, sustained memory bandwidth:
• Marvell TX2-CN9980: 560 GFlops, ~170 GB/s
• Intel Skylake-8168: 2.08 TFlops, ~127 GB/s
• Fujitsu A64FX: 2.99 TFlops, ~1024 GB/s
• Huawei Kunpeng920: 1.33 TFlops
• AMD EPYC Naples/7601: 1.13 TFlops
• AMD EPYC Rome: 2.41 TFlops
• Amazon Graviton: 294 GFlops, ~42 GB/s
• Intel KNL-7250: 3.04 TFlops, ~400 GB/s
• Nvidia Tesla V100: 7.5 TFlops, ~900 GB/s
Application kernels marked by arithmetic intensity: FDTD-Elastic-4thorder, FDTD-Acoustic(ISO)-8thorder, FDTD-Acoustic(TTI)-8thorder, FDTD-Acoustic(VTI)-8thorder, SEM-Elastic-4thorder.]
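The ceilings in these plots follow the standard roofline bound, attainable = min(peak, AI × bandwidth). A minimal sketch in C, plugging in the TX2 and Skylake-8168 values labeled above (the roofline helper is ours):

    #include <stdio.h>

    /* Roofline bound: attainable GFlop/s is capped either by peak flops
       or by memory traffic (arithmetic intensity x bandwidth). */
    static double roofline(double peak_gflops, double bw_gbs, double ai) {
        double mem_bound = ai * bw_gbs;
        return mem_bound < peak_gflops ? mem_bound : peak_gflops;
    }

    int main(void) {
        /* Figure values: TX2-CN9980 560 GFlops, ~170 GB/s;
           Skylake-8168 2080 GFlops, ~127 GB/s. */
        for (double ai = 0.25; ai <= 64.0; ai *= 4.0)
            printf("AI=%6.2f  TX2=%7.1f  SKL=%7.1f GFlop/s\n",
                   ai, roofline(560.0, 170.0, ai), roofline(2080.0, 127.0, ai));
        return 0;
    }

At AI = 1 Flop/Byte both parts are bandwidth-bound and TX2's extra memory bandwidth wins; by AI ≈ 16 both are at or near their compute ceilings and Skylake pulls ahead.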
STREAM Triad Bandwidth
• ThunderX2 provides the highest bandwidth of all processors tested
• Vectorization makes no discernible difference to performance at large core counts
• Around 10% higher with NEON at smaller core counts (5 – 14)
• A significant number of kernels in HPC are bound by the rate at which they can load/store to memory (“memory bandwidth bound”)
• Makes high memory bandwidth desirable
• Ideally we want to reach these bandwidths without needing to vectorize (see the triad kernel after the figure below)
7/30/19 Unclassified
[Figure: STREAM Triad measured bandwidth (GB/s) vs. processor cores for ThunderX2 (NEON / No Vec), Skylake (AVX512 / No Vec), and Haswell (AVX2 / No Vec); higher is better.]
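For context, the measured kernel is essentially the triad loop below; this is a minimal STREAM-style sketch, not the benchmark source:

    #include <stddef.h>

    /* STREAM-style triad: a[i] = b[i] + scalar * c[i]. Three double
       streams => ~24 bytes of traffic per 2 flops, so the loop saturates
       the memory controllers long before the FPUs - which is why vector
       width barely matters here. */
    void triad(double *restrict a, const double *restrict b,
               const double *restrict c, double scalar, size_t n) {
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + scalar * c[i];
    }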
[Figure: measured per-core read and write bandwidth (GB/s) vs. data array size (up to 2.5x10^6 elements) for Haswell, Skylake, and ThunderX2; higher is better. Skylake's larger L2 capacity is visible as a later drop-off.]
Cache Performance
• Haswell has the highest per-core bandwidth (read and write) at L1, slower at L2
• Skylake’s redesigned cache sizes (larger L2, smaller L3) show up in the graph
• Higher performance for certain working-set sizes (typical for unstructured codes)
• TX2 shows more uniform bandwidth at larger scale (less asymmetry between read and write)
7/30/19 Unclassified
[Figure: DGEMM measured performance (GF/s) vs. processor cores for Skylake, Haswell, and ThunderX2; higher is better.]
DGEMM Compute Performance
• ThunderX2 has similar performance at scale to Haswell
• Roughly twice as many cores (TX2)
• Half the vector width (TX2 vs. HSW)
• See strata in the Intel MKL results, usually a result of matrix-size kernel optimization
• Arm PL provides smoother performance results (essentially linear growth); a sketch of the measured BLAS call follows below
7/30/19 Unclassified
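The measured operation is a standard BLAS DGEMM call, as in the sketch below (the wrapper is ours; the CBLAS header name varies by vendor: cblas.h for netlib/OpenBLAS, mkl.h for MKL, armpl.h for Arm Performance Libraries):

    #include <cblas.h>  /* or mkl.h / armpl.h depending on the library */

    /* C = A * B for square n x n row-major matrices (alpha=1, beta=0).
       The library's internal blocking, not this call, produces the
       matrix-size strata seen in the MKL curves. */
    void dgemm_square(int n, const double *A, const double *B, double *C) {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    }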
Floating Point Performance Sanity Check: HPL
7/30/19 Unclassified
• ThunderX2 has about half the floating point capacity of comparable Xeon CPUs
• Xeon 8180 vs. ThunderX2
• HPL.dat:

    163840   Ns
    256      NBs
    0        PMAP process mapping (0=Row-,1=Column-major)
    7        Ps
    8        Qs
    1        PFACTs (0=left, 1=Crout, 2=Right)
    2        RFACTs (0=left, 1=Crout, 2=Right)
    0        BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
    0        DEPTHs (>=0)
    2        SWAP (0=bin-exch,1=long,2=mix)
    64       swapping threshold
[Figure: HPL GFLOPS, N=163840 (200GB): Xeon 8180 ≈ 2.00E+03; ThunderX2 SMT=2 + Turbo ≈ 8.82E+02; ThunderX2 SMT=4 w/o Turbo ≈ 4.99E+02.]
Applications
Results from using Astra and other TX2 Platforms
7/30/19 Unclassified
[Figure: GUPS, Giga-Updates/Second (GUP/s) vs. processor cores for ThunderX2, Skylake, and Haswell, all without vectorization; higher is better.]
GUPS Random Access
• Running all processors in SMT-1 mode; SMT(>1) usually gives better performance
• Expect SMT2/4 on TX2 to give better numbers
• Usually more cores give higher performance (more load/store units driving requests)
• Typical for TLB performance to be a limiter
• Need to consider larger pages for future runs (an update-loop sketch follows below)
7/30/19 Unclassified
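The update loop being measured looks like the sketch below, modeled on the HPCC RandomAccess kernel (0x7 is the LFSR polynomial HPCC uses; table_size must be a power of two). Each iteration is one dependent random store, which is why TLB reach and page size dominate:

    #include <stdint.h>
    #include <stddef.h>

    /* RandomAccess-style GUPS loop: XOR pseudo-random values into random
       table locations. Throughput is limited by TLB misses and the number
       of outstanding memory operations, not by arithmetic. */
    void gups_update(uint64_t *table, size_t table_size, size_t n_updates) {
        uint64_t ran = 1;
        for (size_t i = 0; i < n_updates; i++) {
            ran = (ran << 1) ^ ((int64_t)ran < 0 ? 0x7ULL : 0ULL);
            table[ran & (table_size - 1)] ^= ran;  /* random read-modify-write */
        }
    }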
[Figure: LULESH Figure of Merit (Zones/s) vs. processor cores for Skylake (AVX512 / No Vec), Haswell (AVX2 / No Vec), and ThunderX2 (NEON / No Vec); higher is better.]
LULESH Hydrodynamics Mini-App
• Typically fairly intensive L2 accesses for an unstructured mesh (although LULESH is a regular structure stored in an unstructured format)
• Expect slightly higher performance with SMT(>1) modes for all processors
7/30/19 Unclassified
[Figure: XSBench Figure of Merit (Lookups/s) vs. processor cores for Skylake (AVX512 / No Vec), Haswell (AVX2 / No Vec), and ThunderX2 (NEON / No Vec); higher is better.]
XSBench Cross-Section Lookup Mini-App
• Two-level random-like access into memory: look up in a first table, then use indirection to reach the second lookup (sketch below)
• Means random access, but it is more like a search, so vectors can help
• See gains on Haswell and Skylake, which both have vector-gather support
• No support for gather in NEON
• XSBench is mostly read-only (gather)
7/30/19 Unclassified
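A sketch of the access pattern described above, with illustrative names rather than XSBench source: a search over a unionized energy grid, then an indirect read per nuclide. The indirect reads in the second step are what hardware gather accelerates:

    /* Two-level lookup: binary search the unionized energy grid, then
       follow an index to the per-nuclide cross-section value. */
    double xs_lookup(const double *egrid, const int *index_grid,
                     const double *xs_table, int n_gridpoints,
                     int n_nuclides, int nuclide, double energy) {
        int lo = 0, hi = n_gridpoints - 1;
        while (hi - lo > 1) {                    /* level 1: search  */
            int mid = lo + (hi - lo) / 2;
            if (egrid[mid] < energy) lo = mid; else hi = mid;
        }
        int idx = index_grid[lo * n_nuclides + nuclide];
        return xs_table[idx];                    /* level 2: gather  */
    }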
Branson Mini-App and Benchmark
7/30/19 Unclassified
• Monte Carlo based radiation transport mini-app
• Lots of time spent in math intrinsics (exp, log, sin, cos); benefits from Arm-optimized math intrinsics (a sampling sketch follows the figure below)
• Poor memory locality; benefits somewhat from large pages
• Doesn’t vectorize
• Random number generator not yet optimized for ARM
• On a per-node basis, TX2 is on par with SKL-gold
• Need to improve vectorizability
[Figure: Branson relative performance w.r.t. SKL-gold vs. MPI processes (1–64) for TX2, TX2+armpl, and SKL-vec.]
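To show where those intrinsics come from, here is a sketch of the classic Monte Carlo free-path sampling step (illustrative code, not Branson source; the RNG here is a placeholder for the generator the slide notes is not yet Arm-optimized):

    #include <math.h>
    #include <stdlib.h>

    /* Exponential free-path sampling: s = -ln(u) / sigma_t. Each particle
       step costs a log() call (plus sin/cos for scattering angles), which
       is why an optimized libm matters more here than vectorization. */
    double sample_path_length(double sigma_t) {
        double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0); /* u in (0,1) */
        return -log(u) / sigma_t;
    }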
EMPIRE on Astra
Trinity HSW: 32 MPI x 1 OMP; Astra TX2: 56 MPI x 1 OMP
Strong and weak scaling studies of EMPIRE-PIC for the awesome blob test case
Missing Trinity XL mesh 512- and 4096-node results because of a MueLu FPE
Missing Astra XL mesh 2048-node results because of a MueLu FPE
Work by Paul Lin 7/30/19 Unclassified
EMPIRE on Astra
• TX2 node has ~2x the memory bandwidth and 1.75x the cores (56 vs. 32) of a Trinity HSW node
• (HSW time)/(TX2 time) > 1 means TX2 is faster
• Strong scaling for the awesome blob small mesh (1-8 nodes), medium mesh (8-64 nodes), and large mesh (64-512 nodes)
• (HSW time)/(TX2 time) for the linear solve is not great – a low computation/communication regime
(Good)
7/30/19 Unclassified Work by Paul Lin
• TX2 node has ~2x the memory bandwidth and 1.75x the cores (56 vs. 32) of a Trinity HSW node
• (HSW time)/(TX2 time) > 1 means TX2 is faster
• Strong scaling for the awesome blob medium mesh (1-8 nodes) and large mesh (8-64 nodes)
• (HSW time)/(TX2 time) for the linear solve is definitely better than on the previous slide, due to increased computation relative to communication
EMPIRE on Astra
(Good)
7/30/19 Unclassified Work by Paul Lin
xRAGE
7/30/19 Unclassified
• Eulerian-based hydrodynamics/radiation transport application
• Uses adaptive mesh refinement
• Significant amount of gather/scatter
• Does not currently benefit from AVX2/512 vectorization
• Memory bound
[Figure: xRAGE walltime (secs) vs. #nodes (8, 16, 32) for TX2 (50ppn), BWL (48ppn), and SKL (56ppn); lower is better. Results from a Cray XC50 using the Cray CCE9 compiler.]
PARTISN
7/30/19 Unclassified
• Neutron transport code – deterministic SN method
• Sensitive to cache performance; not typically memory bound
• Vectorizes well for AVX512, NEON
• Can be run mixed MPI/OpenMP
• Limited by cache BW on TX2 and by front-end stalls
[Figure: PARTISN relative performance to BWL-vec vs. MPI processes (1-32) for TX2, SKL-novec, and SKL-vec; higher is better.]
PARTISN can benefit from 4 SMTs/core
7/30/19 Unclassified
• Example of code with significant front-end stalls
• TAU Commander indicates a high rate of branch misprediction in the sweep kernel
Cray XC50 - CCE 9.0 compiler
RIKEN Fiber Benchmarks – Compiler Performance Comparison
7/30/19 Unclassified
• Comparison of Cray 8/9 compilers against Allinea 19 using the RIKEN Fiber benchmarks
• Results are mixed; no clear winner among the compilers
• Takeaway: try to build your app with several compilers
Cray XC50 - CCE 9.0 compiler. Lower is better.
Early Results from Astra
7/30/19 Unclassified
The system has been online for around two weeks; an incredible team working round the clock is already running full application ports and many of our key frameworks.
Baseline: Trinity ASC Platform (Current Production), dual-socket Haswell
• CFD Models: 1.60X
• Hydrodynamics: 1.45X
• Molecular Dynamics: 1.30X
• Monte Carlo: 1.42X
• Linear Solvers: 1.87X
Porting to ARM
7/30/19 Unclassified
Sanity Checks
• See if your software has already been ported to aarch64:
• www.gitlab.com/arm-hpc/packages/wikis
• See if it’s available via Spack: https://github.com/spack/spack
• Don’t use old compilers:
• GCC 8.2 or newer; 9.1 is better
• Allinea armflang/armclang 19.0 or newer
• If your package relies on system packages in performance-critical areas, you may want to build your own versions; libraries that come with the base release are not optimized for ThunderX2
• If your application has lots of dependencies, this may be a good time to learn how to use Spack
• Check out the training material at https://gitlab.com/arm-hpc/training
7/30/19 Unclassified
7/30/19 Unclassified
Porting Cheat Sheet
Ensure all dependencies have been ported.
• Arm HPC Packages Wiki: https://gitlab.com/arm-hpc/packages/wikis/categories/allPackages
Update or patch autotools and libtool as needed
• wget 'http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.guess;hb=HEAD' -O config.guess
• wget 'http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.sub;hb=HEAD' -O config.sub
• sed -i -e 's#wl=""#wl="-Wl,"#g' libtool
• sed -i -e 's#pic_flag=""#pic_flag=" -fPIC -DPIC"#g' libtool
Update the build system to use the right compiler and architecture
• Check #ifdef in Makefiles. Use other architectures as a template.
Use the right compiler flags
• Start with -mcpu=native -Ofast
Avoid non-standard compiler extensions and language features
• The Arm compiler team is actively adding new “unique” features, but it’s best to stick to the standard.
Update hard-wired intrinsics for other architectures (see the sketch after this list)
• https://developer.arm.com/technologies/neon/intrinsics
• Worst case: default to a slow code path.
Update, and possibly fix, your test suite
• Regression tests are a porter’s best friend.
• Beware of tests that expect exactly the same answer on all architectures!
Know architectural features and what they mean for your code
• Arm’s weak memory model.
• Integer division by zero is silently zero on Arm.
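A sketch of the intrinsics item above (illustrative function, not from any particular package): guard the hard-wired x86 intrinsics, add a NEON equivalent, and keep plain C as the slow default:

    #include <stddef.h>
    #if defined(__ARM_NEON)
    #include <arm_neon.h>
    #elif defined(__SSE2__)
    #include <emmintrin.h>
    #endif

    /* Scale an array in place, 4 floats at a time where SIMD is available. */
    void scale(float *x, float s, size_t n) {
        size_t i = 0;
    #if defined(__ARM_NEON)
        float32x4_t vs = vdupq_n_f32(s);
        for (; i + 4 <= n; i += 4)
            vst1q_f32(&x[i], vmulq_f32(vld1q_f32(&x[i]), vs));
    #elif defined(__SSE2__)
        __m128 vs = _mm_set1_ps(s);
        for (; i + 4 <= n; i += 4)
            _mm_storeu_ps(&x[i], _mm_mul_ps(_mm_loadu_ps(&x[i]), vs));
    #endif
        for (; i < n; i++)  /* worst case: slow scalar code */
            x[i] *= s;
    }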
Questions?
7/30/19 Unclassified

Más contenido relacionado

La actualidad más candente

dCUDA: Distributed GPU Computing with Hardware Overlap
 dCUDA: Distributed GPU Computing with Hardware Overlap dCUDA: Distributed GPU Computing with Hardware Overlap
dCUDA: Distributed GPU Computing with Hardware Overlapinside-BigData.com
 
Mellanox Announces HDR 200 Gb/s InfiniBand Solutions
Mellanox Announces HDR 200 Gb/s InfiniBand SolutionsMellanox Announces HDR 200 Gb/s InfiniBand Solutions
Mellanox Announces HDR 200 Gb/s InfiniBand Solutionsinside-BigData.com
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...Ganesan Narayanasamy
 
TAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformTAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformGanesan Narayanasamy
 
BXI: Bull eXascale Interconnect
BXI: Bull eXascale InterconnectBXI: Bull eXascale Interconnect
BXI: Bull eXascale Interconnectinside-BigData.com
 
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...inside-BigData.com
 
Trends in Systems and How to Get Efficient Performance
Trends in Systems and How to Get Efficient PerformanceTrends in Systems and How to Get Efficient Performance
Trends in Systems and How to Get Efficient Performanceinside-BigData.com
 
A Fresh Look at HPC from Huawei Enterprise
A Fresh Look at HPC from Huawei EnterpriseA Fresh Look at HPC from Huawei Enterprise
A Fresh Look at HPC from Huawei Enterpriseinside-BigData.com
 
OpenHPC: A Comprehensive System Software Stack
OpenHPC: A Comprehensive System Software StackOpenHPC: A Comprehensive System Software Stack
OpenHPC: A Comprehensive System Software Stackinside-BigData.com
 
IBM Data Centric Systems & OpenPOWER
IBM Data Centric Systems & OpenPOWERIBM Data Centric Systems & OpenPOWER
IBM Data Centric Systems & OpenPOWERinside-BigData.com
 
Using a Field Programmable Gate Array to Accelerate Application Performance
Using a Field Programmable Gate Array to Accelerate Application PerformanceUsing a Field Programmable Gate Array to Accelerate Application Performance
Using a Field Programmable Gate Array to Accelerate Application PerformanceOdinot Stanislas
 
High Performance Interconnects: Assessment & Rankings
High Performance Interconnects: Assessment & RankingsHigh Performance Interconnects: Assessment & Rankings
High Performance Interconnects: Assessment & Rankingsinside-BigData.com
 

La actualidad más candente (20)

dCUDA: Distributed GPU Computing with Hardware Overlap
 dCUDA: Distributed GPU Computing with Hardware Overlap dCUDA: Distributed GPU Computing with Hardware Overlap
dCUDA: Distributed GPU Computing with Hardware Overlap
 
Mellanox Announces HDR 200 Gb/s InfiniBand Solutions
Mellanox Announces HDR 200 Gb/s InfiniBand SolutionsMellanox Announces HDR 200 Gb/s InfiniBand Solutions
Mellanox Announces HDR 200 Gb/s InfiniBand Solutions
 
DOME 64-bit μDataCenter
DOME 64-bit μDataCenterDOME 64-bit μDataCenter
DOME 64-bit μDataCenter
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
 
TAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformTAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platform
 
BXI: Bull eXascale Interconnect
BXI: Bull eXascale InterconnectBXI: Bull eXascale Interconnect
BXI: Bull eXascale Interconnect
 
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
 
An Update on Arm HPC
An Update on Arm HPCAn Update on Arm HPC
An Update on Arm HPC
 
Trends in Systems and How to Get Efficient Performance
Trends in Systems and How to Get Efficient PerformanceTrends in Systems and How to Get Efficient Performance
Trends in Systems and How to Get Efficient Performance
 
A Fresh Look at HPC from Huawei Enterprise
A Fresh Look at HPC from Huawei EnterpriseA Fresh Look at HPC from Huawei Enterprise
A Fresh Look at HPC from Huawei Enterprise
 
POWER10 innovations for HPC
POWER10 innovations for HPCPOWER10 innovations for HPC
POWER10 innovations for HPC
 
OpenHPC: A Comprehensive System Software Stack
OpenHPC: A Comprehensive System Software StackOpenHPC: A Comprehensive System Software Stack
OpenHPC: A Comprehensive System Software Stack
 
IBM Data Centric Systems & OpenPOWER
IBM Data Centric Systems & OpenPOWERIBM Data Centric Systems & OpenPOWER
IBM Data Centric Systems & OpenPOWER
 
Arm in HPC
Arm in HPCArm in HPC
Arm in HPC
 
Lenovo HPC Strategy Update
Lenovo HPC Strategy UpdateLenovo HPC Strategy Update
Lenovo HPC Strategy Update
 
Using a Field Programmable Gate Array to Accelerate Application Performance
Using a Field Programmable Gate Array to Accelerate Application PerformanceUsing a Field Programmable Gate Array to Accelerate Application Performance
Using a Field Programmable Gate Array to Accelerate Application Performance
 
ARM HPC Ecosystem
ARM HPC EcosystemARM HPC Ecosystem
ARM HPC Ecosystem
 
High Performance Interconnects: Assessment & Rankings
High Performance Interconnects: Assessment & RankingsHigh Performance Interconnects: Assessment & Rankings
High Performance Interconnects: Assessment & Rankings
 
Summit workshop thompto
Summit workshop thomptoSummit workshop thompto
Summit workshop thompto
 
Intel dpdk Tutorial
Intel dpdk TutorialIntel dpdk Tutorial
Intel dpdk Tutorial
 

Similar a NNSA Explorations: ARM for Supercomputing

Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...Linaro
 
Arm as a Viable Architecture for HPC and AI
Arm as a Viable Architecture for HPC and AIArm as a Viable Architecture for HPC and AI
Arm as a Viable Architecture for HPC and AIinside-BigData.com
 
Maxwell siuc hpc_description_tutorial
Maxwell siuc hpc_description_tutorialMaxwell siuc hpc_description_tutorial
Maxwell siuc hpc_description_tutorialmadhuinturi
 
AI Hardware Landscape 2021
AI Hardware Landscape 2021AI Hardware Landscape 2021
AI Hardware Landscape 2021Grigory Sapunov
 
Implementing AI: High Performance Architectures: Large scale HPC hardware in ...
Implementing AI: High Performance Architectures: Large scale HPC hardware in ...Implementing AI: High Performance Architectures: Large scale HPC hardware in ...
Implementing AI: High Performance Architectures: Large scale HPC hardware in ...KTN
 
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...Eric Van Hensbergen
 
Design installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttuDesign installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttuAlan Sill
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLinaro
 
12.) fabric (your next data center)
12.) fabric (your next data center)12.) fabric (your next data center)
12.) fabric (your next data center)Jeff Green
 
Evolution of Supermicro GPU Server Solution
Evolution of Supermicro GPU Server SolutionEvolution of Supermicro GPU Server Solution
Evolution of Supermicro GPU Server SolutionNVIDIA Taiwan
 
Deep learning: Hardware Landscape
Deep learning: Hardware LandscapeDeep learning: Hardware Landscape
Deep learning: Hardware LandscapeGrigory Sapunov
 
Barcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaBarcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaFacultad de Informática UCM
 
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...BigDataEverywhere
 
Infrastructure optimization for seismic processing (eng)
Infrastructure optimization for seismic processing (eng)Infrastructure optimization for seismic processing (eng)
Infrastructure optimization for seismic processing (eng)Vsevolod Shabad
 
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...inside-BigData.com
 
Assisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated ArchitectureAssisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated Architectureinside-BigData.com
 

Similar a NNSA Explorations: ARM for Supercomputing (20)

Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
 
Arm as a Viable Architecture for HPC and AI
Arm as a Viable Architecture for HPC and AIArm as a Viable Architecture for HPC and AI
Arm as a Viable Architecture for HPC and AI
 
Current Trends in HPC
Current Trends in HPCCurrent Trends in HPC
Current Trends in HPC
 
Maxwell siuc hpc_description_tutorial
Maxwell siuc hpc_description_tutorialMaxwell siuc hpc_description_tutorial
Maxwell siuc hpc_description_tutorial
 
AI Hardware Landscape 2021
AI Hardware Landscape 2021AI Hardware Landscape 2021
AI Hardware Landscape 2021
 
Available HPC resources at CSUC
Available HPC resources at CSUCAvailable HPC resources at CSUC
Available HPC resources at CSUC
 
Available HPC resources at CSUC
Available HPC resources at CSUCAvailable HPC resources at CSUC
Available HPC resources at CSUC
 
Implementing AI: High Performance Architectures: Large scale HPC hardware in ...
Implementing AI: High Performance Architectures: Large scale HPC hardware in ...Implementing AI: High Performance Architectures: Large scale HPC hardware in ...
Implementing AI: High Performance Architectures: Large scale HPC hardware in ...
 
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...
 
Design installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttuDesign installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttu
 
TensorFlow for HPC?
TensorFlow for HPC?TensorFlow for HPC?
TensorFlow for HPC?
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience Report
 
12.) fabric (your next data center)
12.) fabric (your next data center)12.) fabric (your next data center)
12.) fabric (your next data center)
 
Evolution of Supermicro GPU Server Solution
Evolution of Supermicro GPU Server SolutionEvolution of Supermicro GPU Server Solution
Evolution of Supermicro GPU Server Solution
 
Deep learning: Hardware Landscape
Deep learning: Hardware LandscapeDeep learning: Hardware Landscape
Deep learning: Hardware Landscape
 
Barcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaBarcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de Riqueza
 
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
 
Infrastructure optimization for seismic processing (eng)
Infrastructure optimization for seismic processing (eng)Infrastructure optimization for seismic processing (eng)
Infrastructure optimization for seismic processing (eng)
 
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
 
Assisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated ArchitectureAssisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated Architecture
 

Más de inside-BigData.com

Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networksinside-BigData.com
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...inside-BigData.com
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...inside-BigData.com
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...inside-BigData.com
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networksinside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoringinside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecastsinside-BigData.com
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Updateinside-BigData.com
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19inside-BigData.com
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODinside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Accelerationinside-BigData.com
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficientlyinside-BigData.com
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Erainside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Clusterinside-BigData.com
 
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...inside-BigData.com
 

Más de inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
 
Data Parallel Deep Learning
Data Parallel Deep LearningData Parallel Deep Learning
Data Parallel Deep Learning
 
Making Supernovae with Jets
Making Supernovae with JetsMaking Supernovae with Jets
Making Supernovae with Jets
 

Último

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 

Último (20)

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 

NNSA Explorations: ARM for Supercomputing

  • 1. • Simon Hammond – Sandia National Laboratories (sdhammo@sandia.gov) • Howard Pritchard – Los Alamos National Laboratory (howardp@lanl.gov) UNCLASSIFIED UNLIMITED RELEASE NNSA Explorations: ARM for Supercomputing
  • 2. Exciting Time to be in HPC… Exascale Computing Adoption of ML/AI for HPC New Hardware/Software
  • 3. What we’ll cover today • Why Arm – what’s so interesting about it? • Marvell Thunder TX2 overview and comparison with x86_64 • Astra/Vanguard Program • ASC mini-app and application performance • Porting to ARM 7/30/19 Unclassified
  • 4. What’s sooooooo Interesting About Arm? • In many ways not much… • Its just an instruction set • As long as it can run Fortan, C and C++ we are good right? • In others ways quite a lot is interesting • Different business model, consortium of implementations • Open for partners to suggest new instructions • Broad range of intellectual property opportunities • Broad(er) range if implementations than say X86, POWER, SPARC etc
  • 5. What’s sooooooo Interesting About Arm? • DOE invests more than $100M in the hardware of a typical supercomputer (often substantially more than this when the final bill comes in) • Competition helps to drive down prices and increase innovation • We want to optimize price/perf for our machines – get the absolute best workload performance we can for the best price we can buy hardware • The future is interesting – Arm is an IP company, not an implementation • What if we could blend existing Arm IP blocks with our own DOE inspired accelerators? • Build workload optimized processors and computers that benefit DOE scientists? • e.g. a machine just for designing new materials but one which is 100X faster than today? • Arm is an opportunity to engage with a broad range of suppliers and an ecosystem • Not the only way to do this, can partner with traditional vendors like Intel, IBM, AMD etc 7/30/19 Unclassified
  • 6. Arm is Growing in HPC… 7/30/19 Unclassified
  • 7. NNSA/ASC Vanguard Program A proving ground for next-generation HPC technologies in support of the NNSA mission http://vanguard.sandia.gov
  • 8. Astra – the First Petscale Arm based Supercomputer 7/30/19 Unclassified
  • 9. Test Beds • Small testbeds (~10-100 nodes) • Breadth of architectures Key • Brave users Vanguard • Larger-scale experimental systems • Focused efforts to mature new technologies • Broader user-base • Not Production • Tri-lab resource but not for ATCC runs ATS/CTS Platforms • Leadership-class systems (Petascale, Exascale, ...) • Advanced technologies, sometimes first-of-kind • Broad user-base • Production Use ASC Test Beds Vanguard ATS and CTS Platforms Greater Scalability, Larger Scale, Focus on Production Higher Risk, Greater Architectural Diversity Where Vanguard Fits in our Program Strategy 7/30/19 Unclassified
  • 10. NNSA/ASC Advanced Trilab Software Environment (ATSE) Project • Advanced Tri-lab Software Environment • Sandia leading development with input from Tri-lab Arm team • Will be the user programming environment for Vanguard-Astra • Partnership across the NNSA/ASC Labs and with HPE • Lasting value • Documented specification of: • Software components needed for HPC production applications • How they are configured (i.e., what features and capabilities are enabled) and interact • User interfaces and conventions • Reference implementation: • Deployable on multiple ASC systems and architectures with common look and feel • Tested against real ASC workloads • Community inspired, focused and supported ATSE is an integrated software environment for ASC workloads ATSE stack 7/30/19 Unclassified
  • 11. ATSE Collaboration with HPE's HPC Software Stack
  • HPE:
  • HPE MPI (+ XPMEM)
  • HPE Cluster Manager
  • Arm:
  • Arm HPC Compilers
  • Arm Math Libraries
  • Allinea Tools
  • Mellanox-OFED & HPC-X
  • RedHat 7.x for aarch64
  7/30/19 Unclassified
  • 12. SVE Enablement – Next Generation of SIMD/Vector Instructions
  • SVE work is underway
  • SVE = Scalable Vector Extension: length-agnostic vector instructions at the ISA level (see the sketch after this slide)
  • Using ArmIE (fast emulation) and the RIKEN GEM5 simulator
  • GCC and Arm toolchains
  • Collaboration with RIKEN
  • Visited Sandia (participants from SNL, LANL, LLNL, RIKEN)
  • Discussion of performance and simulation techniques
  • Deep-dive on SVE (GEM5)
  • Short-term plan
  • Use SVE intrinsics for the Kokkos-Kernels SIMD C++/data-parallel types
  • Underpins a number of key performance routines in the Trilinos libraries
  • Have seen large (6X) speedups for AVX512 on KNL and Skylake; expect similar gains for SVE vector units
  • Critical performance enablement for Sandia production codes
  7/30/19 Unclassified
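  To make "length-agnostic" concrete, here is a minimal sketch (ours, not from the deck) of a daxpy loop written with the Arm ACLE SVE intrinsics. The function name is hypothetical; the point is that the predicate and svcntd() adapt to the hardware vector width at run time, so the same binary runs on any SVE implementation with no scalar tail loop:

```c++
// Minimal sketch of a vector-length-agnostic daxpy using the Arm C
// Language Extensions (ACLE) for SVE. Compile with e.g. -march=armv8-a+sve.
#include <arm_sve.h>
#include <cstdint>

void daxpy_sve(int64_t n, double a, const double *x, double *y) {
    // svcntd() = number of 64-bit lanes per vector, discovered at run time.
    for (int64_t i = 0; i < n; i += (int64_t)svcntd()) {
        // Predicate covers only the remaining elements, so there is no
        // separate remainder loop.
        svbool_t pg = svwhilelt_b64(i, n);
        svfloat64_t vx = svld1(pg, &x[i]);
        svfloat64_t vy = svld1(pg, &y[i]);
        // y[i] = a * x[i] + y[i]
        svst1(pg, &y[i], svmla_x(pg, vy, vx, a));
    }
}
```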
  • 13. ATSE R&D Efforts – Developing Next-Generation NNSA Workflows
  • Workflows leveraging containers and virtual machines
  • Support for machine learning frameworks
  • ARMv8.1 includes new virtualization extensions, SR-IOV
  • Evaluating parallel filesystems + I/O systems @ scale
  • GlusterFS, Ceph, BeeGFS, Sandia Data Warehouse, …
  • Resilience studies over the Astra lifetime
  • Improved MPI thread support, matching acceleration
  • OS optimizations for HPC @ scale
  • Exploring the spectrum from stock distro Linux kernels, to HPC-tuned Linux kernels, to non-Linux lightweight kernels and multi-kernels
  • Arm-specific optimizations
  7/30/19 Unclassified
  • 15. ThunderX2 – Second-Generation High-End Armv8-A Server SoC
  • Up to 32 custom Armv8.1 cores, up to 2.5 GHz; full OoO; 1, 2, or 4 threads per core
  • 1S and 2S configurations
  • Up to 8 DDR4-2667 memory controllers, 1 & 2 DPC
  • Up to 56 lanes of PCIe, 14 PCIe controllers
  • Full SoC: integrated SATAv3, USB3 and GPIOs
  • Server-class RAS & virtualization
  • Extensive power management
  • LGA and BGA packaging for most flexibility
  • 40+ SKUs (75W – 180W)
  7/30/19 Unclassified
  • 16. ThunderX2 Comparison with Xeon Processors

                            Marvell ThunderX2   Haswell E5-2698 v3   Broadwell E5-2695   Skylake Gold 6152
  Cores/Socket              32 (max 4 HT)       16 (2 HT)            22 (2 HT)           22 (2 HT)
  L1 Cache/Core             32KB I/D (8-way)    32KB I/D (8-way)     32KB I/D (8-way)    32KB I/D (8-way)
  L2 Cache/Core             256KB (8-way)       256KB (8-way)        256KB (8-way)       1MB (16-way)
  L3 Cache/Socket           32 MB               40 MB                33 MB               30.25 MB
  Memory Channels/Socket    8 DDR4              4 DDR4               4 DDR4              6 DDR4
  Base Clock Rate           2.2 GHz             2.3 GHz              2.2 GHz             2.1 GHz
  Vector/SIMD Length        128b (NEON)         256b (AVX2)          256b (AVX2)         512b (AVX512)

  7/30/19 Unclassified
  • 17. Roofline Comparison
  [Figure: log-log roofline plot of attainable GFlop/s vs. arithmetic intensity (Flop/Byte). The FDTD-Elastic-4thorder, FDTD-Acoustic(ISO)-8thorder, FDTD-Acoustic(TTI)-8thorder, FDTD-Acoustic(VTI)-8thorder and SEM-Elastic-4thorder kernels are plotted against the following processor ceilings (theoretical peak flops; memory-bandwidth slope where labeled):]
  • Marvell TX2-CN9980: 560 GFlops, ~170 GB/s
  • Intel Skylake-8168: 2.08 TFlops, ~127 GB/s
  • Fujitsu A64FX: 2.99 TFlops, ~1024 GB/s
  • Huawei Kunpeng920: 1.33 TFlops
  • AMD EPYC Naples/7601: 1.13 TFlops
  • AMD EPYC Rome: 2.41 TFlops
  • Amazon Graviton: 294 GFlops, ~42 GB/s
  • Intel KNL-7250: 3.04 TFlops, ~400 GB/s
  • NVIDIA Tesla V100: 7.5 TFlops, ~900 GB/s
  7/30/19 Unclassified
  • 18. STREAM Triad Bandwidth
  • ThunderX2 provides the highest bandwidth of all processors tested
  • Vectorization makes no discernible difference to performance at large core counts
  • Around 10% higher with NEON at smaller core counts (5-14)
  • A significant number of HPC kernels are bound by the rate at which they can load/store to memory ("memory bandwidth bound"), which makes high memory bandwidth desirable
  • Ideally we want to reach these bandwidths without needing to vectorize (a minimal triad-kernel sketch follows below)
  [Figure: measured bandwidth (GB/s) vs. processor cores for ThunderX2 (NEON / no-vec), Skylake (AVX512 / no-vec) and Haswell (AVX2 / no-vec); higher is better.]
  7/30/19 Unclassified
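  For reference, the Triad kernel at the heart of STREAM is just a scaled vector add; a minimal sketch (ours, not the benchmark's tuned source, and with an illustrative array size) looks like this:

```c++
// Sketch of the STREAM Triad kernel: a[i] = b[i] + q * c[i].
// Arrays must be much larger than the last-level cache so that
// DRAM bandwidth, not cache bandwidth, is what gets measured.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 1 << 26;   // 512 MiB per double array
    const double q = 3.0;
    std::vector<double> a(n, 0.0), b(n, 1.0), c(n, 2.0);

    auto t0 = std::chrono::steady_clock::now();
    #pragma omp parallel for          // compile with -fopenmp to thread
    for (std::size_t i = 0; i < n; ++i)
        a[i] = b[i] + q * c[i];
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    // Triad moves three doubles per iteration (two loads, one store).
    std::printf("Triad: %.1f GB/s\n",
                3.0 * n * sizeof(double) / secs / 1e9);
    return 0;
}
```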
  • 19. Cache Performance
  • Haswell has the highest per-core bandwidth (read and write) at L1, but is slower at L2
  • Skylake's redesigned cache sizes (larger L2, smaller L3) show up clearly in the graph
  • Higher performance for certain working-set sizes (typical for unstructured codes)
  • TX2 shows more uniform bandwidth at larger working sets (less asymmetry between read and write)
  • (A working-set sweep sketch follows below.)
  [Figure: measured bandwidth (GB/s) vs. data array size for Haswell, Skylake and ThunderX2, read and write; higher is better. Skylake's larger L2 capacity is visible as a shifted knee in its curve.]
  7/30/19 Unclassified
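  Curves like these are produced by sweeping the working-set size and timing a simple kernel at each size; bandwidth drops at each cache-level boundary. A single-threaded, purely illustrative sketch (our own, under the assumption that a read-only sum is a reasonable proxy):

```c++
// Working-set sweep: time a read kernel at doubling array sizes and
// watch measured bandwidth fall off at the L1 -> L2 -> L3 -> DRAM
// boundaries. Build with optimization (e.g. -Ofast) so the loop
// vectorizes; results are illustrative, not a calibrated benchmark.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    for (std::size_t n = 1 << 10; n <= (1 << 24); n <<= 1) {
        std::vector<double> data(n, 1.0);
        double sum = 0.0;
        const int reps = 100;
        auto t0 = std::chrono::steady_clock::now();
        for (int r = 0; r < reps; ++r)
            for (std::size_t i = 0; i < n; ++i)
                sum += data[i];
        auto t1 = std::chrono::steady_clock::now();
        double secs = std::chrono::duration<double>(t1 - t0).count();
        // Print the sum so the compiler cannot eliminate the loop.
        std::printf("%10zu doubles: %8.1f GB/s (sum=%g)\n",
                    n, reps * n * sizeof(double) / secs / 1e9, sum);
    }
    return 0;
}
```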
  • 20. DGEMM Compute Performance
  • ThunderX2 has performance similar to Haswell at scale
  • Roughly twice as many cores (TX2), at half the vector width (TX2 vs. HSW)
  • The strata visible in the Intel MKL results are usually the result of matrix-size-specific kernel optimization
  • Arm Performance Libraries give smoother results (essentially linear growth)
  • (A sketch of how DGEMM throughput is measured follows below.)
  [Figure: measured performance (GF/s) vs. processor cores for Skylake, Haswell and ThunderX2; higher is better.]
  7/30/19 Unclassified
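  DGEMM throughput is typically measured by calling the vendor BLAS (Intel MKL on Xeon, Arm Performance Libraries on ThunderX2) through the standard CBLAS interface and converting time to GFLOP/s. A minimal sketch, with an illustrative matrix size and assuming your BLAS ships a cblas.h header:

```c++
// Time one large square DGEMM and report GFLOP/s.
// Link against the vendor BLAS (Arm PL, MKL, OpenBLAS, ...).
#include <cblas.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int n = 4096;
    std::vector<double> A(std::size_t(n) * n, 1.0);
    std::vector<double> B(std::size_t(n) * n, 2.0);
    std::vector<double> C(std::size_t(n) * n, 0.0);

    auto t0 = std::chrono::steady_clock::now();
    // C = 1.0 * A * B + 0.0 * C
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A.data(), n, B.data(), n, 0.0, C.data(), n);
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    // Dense matrix-matrix multiply performs 2*n^3 floating-point operations.
    std::printf("DGEMM: %.1f GFLOP/s\n", 2.0 * n * n * double(n) / secs / 1e9);
    return 0;
}
```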
  • 21. Floating Point Performance Sanity Check: HPL
  • ThunderX2 has about half the floating-point capacity of comparable Xeon CPUs (Xeon 8180 vs. ThunderX2)
  • HPL, N=163840 (200GB), GFLOPS: Xeon 8180 ≈ 2.00E+03; TX2 SMT=2 + Turbo ≈ 8.82E+02; TX2 SMT=4 w/o Turbo ≈ 4.99E+02
  • HPL.dat used:
  163840 Ns
  256 NBs
  0 PMAP process mapping (0=Row-,1=Column-major)
  7 Ps
  8 Qs
  1 PFACTs (0=left, 1=Crout, 2=Right)
  2 RFACTs (0=left, 1=Crout, 2=Right)
  0 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
  0 DEPTHs (>=0)
  2 SWAP (0=bin-exch,1=long,2=mix)
  64 swapping threshold
  7/30/19 Unclassified
  • 22. Applications: Results from using Astra and other TX2 Platforms
  7/30/19 Unclassified
  • 23. GUPS Random Access
  • All processors were run in SMT-1 mode here; SMT > 1 usually gives better performance
  • Expect SMT-2/4 on TX2 to give better numbers
  • More cores usually gives higher performance (more load/store units driving requests)
  • TLB behaviour is typically a limiter; need to consider larger pages for future runs
  • (A sketch of the GUPS update pattern follows below.)
  [Figure: giga-updates/second (GUP/s) vs. processor cores for ThunderX2, Skylake and Haswell, all without vectorization; higher is better.]
  7/30/19 Unclassified
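  The GUPS access pattern is a read-modify-write to pseudo-random locations in a table far larger than cache, which is why TLB reach and large pages matter. A minimal sketch of the pattern (the real HPCC benchmark uses a specific GF(2) random stream; the generator and sizes here are illustrative only):

```c++
// Sketch of the GUPS (Giga-Updates Per Second) kernel: XOR-update
// pseudo-random table entries. Performance is dominated by memory
// latency and TLB misses rather than bandwidth or flops.
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t table_size = 1ull << 26;        // 512 MiB of uint64
    std::vector<std::uint64_t> table(table_size);
    for (std::size_t i = 0; i < table_size; ++i) table[i] = i;

    std::uint64_t ran = 1;
    const std::size_t updates = 4 * table_size;
    for (std::size_t i = 0; i < updates; ++i) {
        // LCG-style stream, illustrative only (not HPCC's polynomial).
        ran = ran * 6364136223846793005ull + 1442695040888963407ull;
        table[ran & (table_size - 1)] ^= ran;          // random update
    }
    // Touch a result so the loop cannot be optimized away.
    std::printf("table[42] = %llu\n", (unsigned long long)table[42]);
    return 0;
}
```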
  • 24. LULESH Hydrodynamics Mini-App
  • Typically fairly intensive L2 access for unstructured meshes (although LULESH is a regular structure stored in an unstructured format)
  • Expect slightly higher performance with SMT > 1 modes for all processors
  [Figure: figure of merit (zones/s) vs. processor cores for Skylake (AVX512 / no-vec), Haswell (AVX2 / no-vec) and ThunderX2 (NEON / no-vec); higher is better.]
  7/30/19 Unclassified
  • 25. XSBench Cross-Section Lookup Mini-App
  • Two-level random access into memory: look up in a first table, then follow an indirection to a second lookup
  • Random access, but structured like a search, so vector hardware can help
  • See gains on Haswell and Skylake, which both have vector-gather support; NEON has no gather instruction
  • XSBench is mostly read-only (gather)
  • (A sketch of the indirection pattern follows below.)
  [Figure: figure of merit (lookups/s) vs. processor cores for Skylake (AVX512 / no-vec), Haswell (AVX2 / no-vec) and ThunderX2 (NEON / no-vec); higher is better.]
  7/30/19 Unclassified
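  The pattern in question is a dependent, two-level indirect load. On AVX2/AVX-512 the compiler can turn the inner loads into hardware gather instructions; on NEON they remain scalar. A minimal sketch (names are illustrative, not XSBench's actual data structures):

```c++
// Two-level indirect lookup: the first random access yields an index
// into a second table. This is the read-only gather pattern that
// vector-gather hardware (AVX2/AVX-512) accelerates and NEON cannot.
#include <cstddef>
#include <vector>

double lookup_sum(const std::vector<std::size_t> &index_grid,
                  const std::vector<double> &xs_data,
                  const std::vector<std::size_t> &queries) {
    double total = 0.0;
    for (std::size_t q : queries) {
        std::size_t where = index_grid[q];   // first random-access lookup
        total += xs_data[where];             // second, dependent lookup
    }
    return total;
}
```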
  • 26. Branson Mini-App and Benchmark
  • Monte Carlo radiation-transport mini-app
  • Lots of time spent in math intrinsics (exp, log, sin, cos); benefits from Arm-optimized math routines
  • Poor memory locality; benefits somewhat from large pages
  • Doesn't vectorize; random number generator not yet optimized for Arm
  • On a per-node basis, TX2 is on par with SKL Gold
  • Need to improve vectorizability
  [Figure: relative performance w.r.t. SKL Gold vs. MPI processes (1-64) for TX2, TX2+armpl and SKL-vec.]
  7/30/19 Unclassified
  • 27. EMPIRE on Astra
  • Strong and weak scaling studies for EMPIRE-PIC on the "awesome blob" test case
  • Trinity HSW: 32 MPI x 1 OMP; Astra TX2: 56 MPI x 1 OMP
  • Missing Trinity XL-mesh 512- and 4096-node results because of a MueLu FPE
  • Missing Astra XL-mesh 2048-node results because of a MueLu FPE
  • Work by Paul Lin
  7/30/19 Unclassified
  • 28. EMPIRE on Astra
  • A TX2 node has ~2x the memory bandwidth and 1.75x the cores (56 vs. 32) of a Trinity HSW node
  • (HSW time)/(TX2 time) > 1 means TX2 is faster (good)
  • Strong scaling for the awesome blob small mesh (1-8 nodes), medium mesh (8-64 nodes) and large mesh (64-512 nodes)
  • The (HSW time)/(TX2 time) ratio for the linear solve is not great: this is a low computation-to-communication regime
  • Work by Paul Lin
  7/30/19 Unclassified
  • 29. EMPIRE on Astra
  • A TX2 node has ~2x the memory bandwidth and 1.75x the cores (56 vs. 32) of a Trinity HSW node
  • (HSW time)/(TX2 time) > 1 means TX2 is faster (good)
  • Strong scaling for the awesome blob medium mesh (1-8 nodes) and large mesh (8-64 nodes)
  • The (HSW time)/(TX2 time) ratio for the linear solve is definitely better than on the previous slide, due to increased computation relative to communication
  • Work by Paul Lin
  7/30/19 Unclassified
  • 30. xRAGE
  • Eulerian hydrodynamics/radiation-transport application using adaptive mesh refinement
  • Significant amount of gather/scatter
  • Does not currently benefit from AVX2/AVX512 vectorization
  • Memory bound
  [Figure: walltime (secs) vs. node count (8, 16, 32) for TX2 (50 ppn), BWL (48 ppn) and SKL (56 ppn); lower is better. Results from a Cray XC50 using the Cray CCE9 compiler.]
  7/30/19 Unclassified
  • 31. PARTISN
  • Deterministic SN neutron-transport code
  • Sensitive to cache performance; not typically memory-bandwidth bound
  • Vectorizes well for AVX512 and NEON
  • Can be run mixed MPI/OpenMP
  • Limited by cache bandwidth and front-end stalls on TX2
  [Figure: relative performance w.r.t. BWL-vec vs. MPI processes (1-32) for TX2, SKL-novec and SKL-vec; higher is better.]
  7/30/19 Unclassified
  • 32. PARTISN can benefit from 4 SMT threads/core
  • An example of a code with significant front-end stalls
  • TAU Commander indicates a high rate of branch misprediction in the sweep kernel
  • Cray XC50 – CCE 9.0 compiler
  7/30/19 Unclassified
  • 33. RIKEN Fiber Benchmarks – Compiler Performance Comparison
  • Comparison of the Cray CCE 8/9 compilers against Allinea 19 using the RIKEN Fiber benchmarks (Cray XC50; lower is better)
  • Results are mixed: no clear winner among the compilers
  • Takeaway: try building your application with several compilers
  7/30/19 Unclassified
  • 34. Early Results from Astra
  • The system has been online for around two weeks; an incredible team working around the clock is already running full application ports and many of our key frameworks
  • Speedups vs. baseline (Trinity ASC platform, current production, dual-socket Haswell):
  • CFD models: 1.60X
  • Hydrodynamics: 1.45X
  • Molecular dynamics: 1.30X
  • Monte Carlo: 1.42X
  • Linear solvers: 1.87X
  7/30/19 Unclassified
  • 35. Porting to ARM 7/30/19 Unclassified
  • 36. Sanity Checks
  • See if your software has already been ported to aarch64: www.gitlab.com/arm-hpc/packages/wikis
  • See if it's available via Spack: https://github.com/spack/spack
  • Don't use old compilers:
  • GCC 8.2 or newer (9.1 is better)
  • Allinea armflang/armclang 19.0 or newer
  • If your package relies on system packages in performance-critical areas, you may want to build your own versions; the libraries that come with the base release are not optimized for ThunderX2
  • If your application has lots of dependencies, this may be a good time to learn how to use Spack
  • Check out the training material at https://gitlab.com/arm-hpc/training
  7/30/19 Unclassified
  • 37. Porting Cheat Sheet
  Ensure all dependencies have been ported
  • Arm HPC Packages Wiki: https://gitlab.com/arm-hpc/packages/wikis/categories/allPackages
  Update or patch autotools and libtool as needed
  • wget 'http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.guess;hb=HEAD' -O config.guess
  • wget 'http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.sub;hb=HEAD' -O config.sub
  • sed -i -e 's#wl=""#wl="-Wl,"#g' libtool
  • sed -i -e 's#pic_flag=""#pic_flag=" -fPIC -DPIC"#g' libtool
  Update the build system to use the right compiler and architecture
  • Check #ifdef blocks in Makefiles; use other architectures as a template
  Use the right compiler flags
  • Start with -mcpu=native -Ofast
  Avoid non-standard compiler extensions and language features
  • The Arm compiler team is actively adding new "unique" features, but it's best to stick to the standard
  Update hard-wired intrinsics for other architectures (see the sketch after this list)
  • https://developer.arm.com/technologies/neon/intrinsics
  • Worst case: default to a slow, portable code path
  Update, and possibly fix, your test suite
  • Regression tests are a porter's best friend
  • Beware of tests that expect exactly the same answer on all architectures!
  Know architectural features and what they mean for your code
  • Arm's weak memory model
  • Integer division by zero is silently zero on Arm
  7/30/19 Unclassified
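  An illustrative pattern for the "update hard-wired intrinsics" item: guard each architecture's intrinsics behind its feature macro and always keep a plain scalar fallback, so the code still builds (if slowly) on targets you have not tuned for yet. This sketch and the function name are our own, not from the deck:

```c++
// Guarded intrinsics with a portable fallback: x[i] *= a.
#include <cstddef>

#if defined(__AVX2__)
  #include <immintrin.h>
#elif defined(__ARM_NEON)
  #include <arm_neon.h>
#endif

void scale(float *x, std::size_t n, float a) {
#if defined(__AVX2__)
    const __m256 va = _mm256_set1_ps(a);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)   // 8 floats per 256-bit AVX2 vector
        _mm256_storeu_ps(&x[i], _mm256_mul_ps(_mm256_loadu_ps(&x[i]), va));
    for (; i < n; ++i) x[i] *= a;                    // scalar remainder
#elif defined(__ARM_NEON)
    const float32x4_t va = vdupq_n_f32(a);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)   // 4 floats per 128-bit NEON vector
        vst1q_f32(&x[i], vmulq_f32(vld1q_f32(&x[i]), va));
    for (; i < n; ++i) x[i] *= a;                    // scalar remainder
#else
    for (std::size_t i = 0; i < n; ++i) x[i] *= a;   // portable fallback
#endif
}
```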