In this deck from the GoingARM workshop at SC17, Filippo Mantovani describes the contributions of the Barcelona Supercomputing Center to the European Mont-Blanc project.
"Since 2011, Mont-Blanc has pushed the adoption of Arm technology in High Performance Computing, deploying Arm-based prototypes, enhancing the system software ecosystem, and projecting the performance of current systems in order to develop new, more powerful and less power-hungry HPC platforms based on Arm SoCs. In this talk, Filippo introduces the latest Mont-Blanc system, called Dibona, designed and integrated by the coordinator and industrial partner of the project, Bull/ATOS. He also talks about tests performed at BSC of the Arm software tools (HPC compiler and mathematical libraries), as well as the Dynamic Load Balancing (DLB) technique and the Multiscale Simulator Architecture (MUSA)."
Watch the video: https://wp.me/p3RLHQ-i6o
Learn more: http://www.goingarm.com/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Update on the Mont-Blanc Project for ARM-based HPC
1. montblanc-project.eu | @MontBlanc_EU
This project has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement n° 671697
The Mont-Blanc project
Updates from the Barcelona Supercomputing Center
Filippo Mantovani
2. Mont-Blanc
The “legacy” Mont-Blanc vision
Denver, Nov 13th 2017 | Arm HPC User Group
Vision: to leverage the fast-growing market of mobile technology for
scientific computation, HPC and data centers.
2012 2013 2014 2015 2016 2017 2018
Mont-Blanc 2
Mont-Blanc 3
3. Mont-Blanc
The “legacy” Mont-Blanc vision
Phases share a common structure
Experiment with real hardware
Android dev-kits, mini-clusters, prototypes, production ready systems
Push software development
System software, HPC benchmarks/mini-apps/production codes
Study next generation architectures
Learn from hardware deployment and evaluation for planning new systems
Vision: to leverage the fast-growing market of mobile technology for
scientific computation, HPC and data centers.
2012 2013 2014 2015 2016 2017 2018
Mont-Blanc 2
Mont-Blanc 3
4. We started here → we ended up here
Hardware platforms
N. Rajovic et al., “The Mont-Blanc Prototype: An Alternative Approach
for HPC Systems,” in Proceedings of SC’16, pp. 38:1–38:12.
5. We started here → we ended up here
Different OS flavors
Arm HPC Compiler
Arm Performance Libraries
Allinea tools
…
All well packaged and distributed
through OpenHPC
Several complex HPC production
codes have run on Mont-Blanc
Alya
AVL codes
WRF
FEniCS
System Software and Use Cases
[Diagram: the Mont-Blanc software stack]
Source files (C, C++, Fortran, Python, …)
Compilers: GNU, Arm HPC, Mercurium
Scientific libraries: LAPACK, Boost, PETSc, Arm PL, FFTW, HDF5, ATLAS, clBLAS
Developer tools: Scalasca, Perf, Extrae, Allinea
Cluster management: SLURM, Ganglia, NTP, OpenLDAP, Nagios, Puppet
Runtime libraries: Nanos++, OpenCL, CUDA, MPI
OS: Linux / Ubuntu, with OpenCL and network drivers
Hardware support / storage: DVFS, power monitors, NFS, Lustre
Hardware: CPUs, GPU, network
6. We started here → we ended up here
A Multi-level Simulation
Approach (MUSA) allows us:
To gather performance traces on
any current HPC architecture
To replay them using almost any
architecture configuration
To study scalability and
performance figures at scale,
changing the number of MPI
processes simulated
Study of Next-Generation Architectures
Credits: N. Rajovic
Credits: MUSA team @ BSC
7. Where is BSC contributing today?
Evaluation of solutions
Hardware solutions
• Mini-clusters deployed liaising with SoC providers and system integrators
Software solutions
• Arm Performance Libraries, Arm HPC Compiler
Use cases
Alya: a finite element code where we experiment with atomics-avoiding techniques
• GOAL: test new runtime features to be pushed into OpenMP
HPCG: benchmark where we started looking at vectorization
• GOAL: explore techniques for exploitation of the Arm Scalable Vector Extension
Simulation of next generation large clusters
MUSA: combining detailed trace-driven simulation with sampling strategies
to explore how architectural parameters affect performance at scale.
T. Grass et al., “MUSA: A Multi-Level Simulation Approach for
Next-Generation HPC Machines,” in Proceedings of SC’16, pp. 526–537.
F. Banchelli et al., “Is Arm software ecosystem
ready for HPC?”, poster at SC17.
8. Evaluation of Arm Performance Libraries
Goal
Test an HPC code making use of arithmetic and FFT libraries
Method
Quantum Espresso pwscf input
Compiled with GCC 7.1.0
Platform configuration #1 (poster SC17)
AMD Seattle
Arm PL 2.2
ATLAS 3.11.39
OpenBLAS 0.2.20
FFTW 3.3.6
Platform configuration #2
Cavium ThunderX2
Arm PL v18.0
OpenBLAS 0.2.20
FFTW 3.3.7
9. Evaluation of the Arm HPC Compiler
Goal
Evaluate the Arm HPC Compilers v18.0 vs v1.4
Method
Run the Polybench benchmark suite
Including 30 benchmarks, by Ohio State University
Run on Cavium ThunderX2
Execution time increment v18.0 vs v1.4
SIMD instructions v18.0 vs v1.4
10. High Performance Conjugate Gradient
Problem
Scalability of HPCG is very limited
OpenMP parallelization of the reference HPCG version is poor
Goals
1. Improve OpenMP parallelization of HPCG
2. Study current auto-vectorization for leveraging SVE
3. Analyze other performance limitations (e.g. cache effects)
[Charts: HPCG speedup vs. number of OpenMP threads (1–28), Arm HPC Compiler 1.4 vs GCC 7.1.0]
On Cavium ThunderX2
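The scaling measured above is driven by HPCG's main kernels. As an illustrative sketch (not the actual HPCG source; `spmv_csr` is a made-up helper), this is the kind of sparse matrix-vector loop whose row-level OpenMP parallelization determines the speedup curves:

```c
#include <assert.h>
#include <math.h>

/* Illustrative sketch, not the HPCG source: a CSR sparse matrix-vector
 * product, the dominant kernel pattern in a conjugate-gradient solver.
 * Parallelizing the independent row loop with OpenMP is the kind of
 * change whose scaling is measured on 1-28 threads above. */
static void spmv_csr(int nrows, const int *row_ptr, const int *col_idx,
                     const double *val, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;   /* each row writes a distinct y[i]: no races */
    }
}
```

Rows are fully independent here; the symmetric Gauss-Seidel smoother in HPCG is harder to parallelize because its rows depend on one another, which is where the coloring techniques on the following slides come in.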
11. High Performance Conjugate Gradient
Problem
Scalability of HPCG is very limited
OpenMP parallelization of the reference HPCG version is poor
Goals
1. Improve OpenMP parallelization of HPCG
2. Study current auto-vectorization for leveraging SVE
3. Analyze other performance limitations (e.g. cache effects)
On Cavium ThunderX2
12. HPCG - SIMD parallelization
First approach
Check auto-vectorization in current platforms
Method
Count SIMD instructions in the “ComputeSYMGS” region
On Cavium ThunderX2 using Arm HPC Compiler v18.0
On Intel Xeon Platinum 8160 (Skylake) using ICC supporting AVX512
[Chart: SIMD instruction counts (×10⁶) in the ComputeSYMGS region]
13. HPCG - SVE emulation
First approach
Check auto-vectorization when SVE is enabled
Method
Evaluate auto-vectorization in a whole execution of HPCG (one iteration)
Generate binary using Arm HPC Compiler v1.4 enabling SVE
Emulate SVE instructions using the Arm Instruction Emulator on Cavium ThunderX2
[Chart: increase in SIMD instructions relative to NEON for SVE vector lengths of 128b, 256b, 512b, 1024b and 2048b]
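A sketch of the kind of trivially auto-vectorizable loop behind these counts (illustrative names, not HPCG code): because SVE code is vector-length agnostic, a single binary can be emulated at every width listed above.

```c
#include <assert.h>
#include <math.h>

/* Illustrative sketch: an axpy-style loop that compilers auto-vectorize.
 * When built for SVE the generated code is vector-length agnostic, so
 * the same binary exploits 128b to 2048b registers, which is exactly
 * the parameter varied in the emulation above. */
static void axpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```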
14. HPCG - Memory access evaluation
Cache hit ratio degraded when using multi-coloring approaches
Data related to ComputeSYMGS
Gathered on Cavium ThunderX2
Compiled with GCC
Next steps
Optimize data access patterns in memory
Simulate “SVE gather load” instructions in order to quantify the benefits
[Chart: ~13% L1D miss ratio, ~35% L2D miss ratio]
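The locality problem shows up in a sketch of a multi-colored sweep (illustrative code, not the HPCG source; `colored_sweep` and its trivial row update are made up): rows of one color are mutually independent and can be updated in parallel, but visiting them color by color strides through memory, consistent with the degraded miss ratios above.

```c
#include <assert.h>

/* Illustrative sketch of a multi-colored Gauss-Seidel-style sweep.
 * Rows sharing a color have no mutual dependencies, so each color's
 * loop parallelizes; the price is that consecutive iterations touch
 * non-adjacent rows, scattering cache accesses. */
static void colored_sweep(int nrows, int ncolors, const int *color_of,
                          const double *b, double *x)
{
    for (int c = 0; c < ncolors; c++) {          /* colors run in order */
        #pragma omp parallel for
        for (int i = 0; i < nrows; i++)
            if (color_of[i] == c)
                x[i] = b[i];   /* stand-in for the real smoother update */
    }
}
```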
15. Alya: BSC code for multi-physics problems
Analysis with Paraver: reductions with indirect accesses on large arrays, using:
No coloring: the use of atomic operations harms performance
Coloring: coloring harms data locality
Commutative multidependences (an OmpSs feature that will hopefully be included in OpenMP)
Parallelization of a finite element code
Credits: M. Garcia, J. Labarta
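The no-coloring case can be sketched as an OpenMP reduction with indirect accesses (illustrative code, not Alya itself; `assemble_atomic` and its arrays are made up): element contributions accumulate into shared nodes through an index array, so correctness requires one atomic per update, which is what the Paraver analysis found costly. OmpSs's commutative multidependences instead let the runtime reorder and serialize conflicting updates without atomics or coloring.

```c
#include <assert.h>
#include <math.h>

/* Illustrative sketch of a finite-element assembly pattern: element
 * contributions accumulate into shared node values through an index
 * array. Without coloring, two iterations may hit the same node, so
 * each update must be atomic, which harms performance at scale. */
static void assemble_atomic(int nelem, const int *node_of,
                            const double *contrib, double *node_val)
{
    #pragma omp parallel for
    for (int e = 0; e < nelem; e++) {
        #pragma omp atomic
        node_val[node_of[e]] += contrib[e];
    }
}
```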
16. Alya: taskification and dynamic load balancing
Goal
Quantify the effect of commutative dependences and DLB on an HPC code
Method
Run the “Assembly phase” of Alya (containing atomics)
On MareNostrum 3, 2x Intel Xeon SandyBridge-EP E5-2670
On Cavium ThunderX, 2x CN8890
16 nodes x P processes/node x T threads/process
Assembly phase
Credits: M. Josep, M. Garcia, J. Labarta
17. Multi-Level Simulation Approach
Level 1: Trace generation
[Diagram: an HPC application executes while instrumentation (Pintool / DynamoRIO) and plugins in the OpenMP runtime and the MPI layer capture task/chunk creation events and dependencies, MPI calls, and dynamic instructions into a trace]
Credits: T. Grass, C. Gomez, M. Casas, M. Moreto
18. Multi-Level Simulation Approach
Level 2: Network simulation (Dimemas)
Level 3: Multi-core simulation (TaskSim + Ramulator + McPAT)
[Diagram: the trace feeds the network simulator (a timeline per MPI rank) and the multi-core simulator (a timeline per thread)]
Credits: T. Grass, C. Gomez, M. Casas, M. Moreto
19. Multi-Level Parameters
Architectural
CPU architecture
Number of cores
Core frequency
Threads per core
Reorder buffer size
SIMD width
Micro-architectural
L1/2/3 Cache size/latency
Main memory
Memory technology
Capacity
Bandwidth
Latency
Problem: simulation time grows prohibitively with detail
Solution: we support different modes (Burst, Detailed, Sampling), trading accuracy for speed
Credits: T. Grass, C. Gomez, M. Casas, M. Moreto
20. MUSA: status
SC’16 paper
Validation of the methodology
with 5 applications
• BT-MZ, SP-MZ, LU-MZ, HYDRO, SPECFEM3D
Proven performance figures at scale, up to 16k MPI ranks
Status update
Added parameter sets for state-of-the-art architectures
Support for power consumption modeling
• Including CPU, NoC and memory hierarchy
Extended the set of applications
Expanded trace database
• Including traces gathered on
MareNostrum4 (Intel Skylake + OmniPath)
Included support for DynamoRIO
Credits: T. Grass, C. Gomez, M. Casas, M. Moreto
21. Student Cluster Competition
Rules
12 teams of 6 undergraduate students
1 cluster operating within 3 kW power budget
3 HPC applications + 2 benchmarks
One team from Universitat Politècnica
de Catalunya (UPC, Spain)
Participating with
Mont-Blanc technology
3 awards to win
Best HPL
1st, 2nd, 3rd overall places
Fan favorite
We are looking for
an Arm-based
cluster for 2018!!!
22. Interested in any of the topics presented?
Follow us!
montblanc-project.eu @MontBlanc_EU filippo.mantovani@bsc.es
Visit our booths @ SC17!
booth #1694
booth #1925
booth #1975