LLNL-PRES-746388
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore
National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
The Sierra Supercomputer:
Science and Technology on a
Mission
Adam Bertsch
Sierra Integration Project Lead
Lawrence Livermore National Laboratory
February 20th, 2018
Overview
§ The Sierra Supercomputer
§ Collaboration
— CORAL and OpenPower
§ Science and Technology on a Mission
Where did Sierra start for me?
§ On a spring bicycle ride in Livermore, five years ago.
Sierra will be the next ASC ATS platform
[Roadmap chart, FY '13-'23: Advanced Technology Systems (ATS) and Commodity Technology Systems (CTS) each move through procure & deploy, use, and retire phases. ATS line: Sequoia (LLNL), ATS 1 – Trinity (LANL/SNL), ATS 2 – Sierra (LLNL), ATS 3 – Crossroads (LANL/SNL), ATS 4 – El Capitan (LLNL), ATS 5 – (LANL/SNL). CTS line: Tri-lab Linux Capacity Cluster II (TLCC II), CTS 1, CTS 2.]
Sequoia and Sierra are the current and next-generation Advanced Technology Systems at LLNL.
The Sierra system that will replace Sequoia features a GPU-accelerated architecture

Components
• Mellanox interconnect: single-plane EDR InfiniBand, 2-to-1 tapered fat tree
• IBM POWER9: Gen2 NVLink
• NVIDIA Volta: 7 TFLOP/s, HBM2, Gen2 NVLink

Compute node: 2 IBM POWER9 CPUs, 4 NVIDIA Volta GPUs, NVMe-compatible PCIe 1.6 TB SSD, 256 GiB DDR4, 16 GiB globally addressable HBM2 associated with each GPU, coherent shared memory.

Compute rack: standard 19", warm-water cooling.

Compute system: 4,320 nodes, 1.29 PB memory, 240 compute racks, 125 PFLOPS, ~12 MW.

Spectrum Scale file system: 154 PB usable storage, 1.54 TB/s read/write bandwidth.
Sierra Represents a Major Shift For Our Mission-Critical Applications

2004 - present (BG/L, BG/P, BG/Q):
• IBM BlueGene, along with copious amounts of Linux-based cluster technology, dominates the HPC landscape at LLNL.
• Apps scale to an unprecedented 1.6M MPI ranks, but are still largely MPI-everywhere on "lean" cores.

2014:
• Heterogeneous GPGPU-based computing begins to look attractive with NVLINK and Unified Memory (P9/Volta).
• Apps must undertake a 180° turn from BlueGene toward heterogeneity and large-memory nodes.

2015 – 20??:
• LLNL believes heterogeneous computing is here to stay, and Sierra gives us a solid leg up in preparing for exascale and beyond.
Cost-neutral architectural tradeoffs
§ DRAM cost came in at double the expectation
§ Evaluated two options to close the cost gap: buy half as much memory, or taper the network
§ Selected both options!
§ Full bandwidth from dual-ported Mellanox EDR HCAs to TOR switches
§ Half bandwidth from TOR switches to director switches
§ An economic trade-off that provides approximately 5% more nodes
This decision, counter to prevailing wisdom for system design, benefits Sierra's planned UQ workload: 5% more UQ simulations at a performance loss of < 1%.
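Read as UQ throughput, the trade nets out positive; a back-of-the-envelope check using only the figures above: $1.05 \times (1 - 0.01) \approx 1.04$, i.e., roughly 4% more UQ work per unit time despite the tapered network and halved DRAM.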
Sierra system architecture details

                                    Sierra        uSierra       rzSierra
Nodes                               4,320         684           54
POWER9 processors per node          2             2             2
GV100 (Volta) GPUs per node         4             4             4
Node peak (TFLOP/s)                 29.1          29.1          29.1
System peak (PFLOP/s)               125           19.9          1.57
Node memory (GiB)                   320 (256+64)  320 (256+64)  320 (256+64)
System memory (PiB)                 1.29          0.209         0.017
Interconnect                        2x IB EDR     2x IB EDR     2x IB EDR
Off-node aggregate b/w (GB/s)       45.5          45.5          45.5
Compute racks                       240           38            3
Network and infrastructure racks    13            4             0.5
Storage racks                       24            4             0.5
Total racks                         277           46            4
Peak power (MW)                     ~12           ~1.8          ~0.14
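As a sanity check on the peak figures (an illustrative decomposition; the per-CPU share is inferred from the totals, not quoted in the deck): $4 \times 7~\mathrm{TF/s}$ from the GPUs plus roughly $2 \times 0.55~\mathrm{TF/s}$ from the POWER9 CPUs gives the $29.1~\mathrm{TF/s}$ node peak, and $4320 \times 29.1~\mathrm{TF/s} \approx 125.7~\mathrm{PF/s}$, consistent with the quoted 125 PFLOPS system peak.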
CORAL Targets compared to Sierra
§ Target speedup over current systems of 4x on Scalable benchmarks …
§ … and 6x on Throughput benchmarks
§ Aggregate memory of 4 PB and ≥ 1 GB per MPI task (2 GB preferred)
§ Max power of system and peripherals ≤ 20 MW
§ Mean Time Between Application Failure that requires human
intervention ≥ 6 days
§ Architectural Diversity
§ Delivery in 2017 with acceptance in 2018
§ Peak Performance ≥ 100 PF
Collaboration
Sierra is part of CORAL, the Collaboration of Oak Ridge, Argonne and Livermore

CORAL is the next major phase in the U.S. Department of Energy's scientific computing roadmap and path to exascale computing.
• Modeled on the successful LLNL/ANL/IBM Blue Gene partnership (Sequoia/Mira), building on LLNL's IBM Blue Gene systems: BG/L, BG/P, Dawn, and BG/Q Sequoia
• Long-term contractual partnership with 2 vendors
• 2 awardees for 3 platform acquisition contracts (ANL Aurora, ORNL Summit, LLNL Sierra) and 2 nonrecurring engineering (NRE) contracts, all flowing from a single RFP
OpenPOWER collaboration enabled technology behind Sierra
CORAL NRE will provide significant benefit to the final systems
§ Motherboard design
§ Water-cooled compute nodes
§ GPU resilience
§ Switch-based collectives, FlashDirect
§ Open-source compiler infrastructure
§ System diagnostics
§ System scheduling
§ Burst buffer
§ GPFS performance and scalability
§ Cluster management
§ Open-source tools
§ Center of Excellence
Eight joint lab/contractor working groups are actively working on all the above and more.
CORAL NRE drove Spectrum Scale advances that will vastly improve the Sierra file system

Original plan of record:
§ 120 PB delivered capacity
§ Conservative drive capacity estimates
§ 1 TB/s write, 1.2 TB/s read
§ Concern about metadata targets

Modified (cost neutral), enabled by NRE + hardware advancements:
§ 154 PB delivered capacity
§ Substantially increased drive capacities
§ 1.54 TB/s write/read
§ Enhanced Spectrum Scale metadata performance (many optimizations already released)

Close partnerships lead to ecosystem improvements.
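To put the delivered bandwidth in context (an illustrative calculation, not a quoted requirement): draining all 1.29 PB of Sierra's node memory to the file system at 1.54 TB/s takes $1290~\mathrm{TB} / 1.54~\mathrm{TB/s} \approx 840~\mathrm{s}$, i.e., a full-system checkpoint on the order of 14 minutes.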
Outstanding benchmark analysis by IBM and NVIDIA demonstrates the system's usability. Projections included code changes and showed that a tractable annotation-based approach (i.e., OpenMP) will be competitive.
[Figure: CORAL application performance projections. Two bar charts of relative performance, CPU-only vs. CPU + GPU: Scalable Science Benchmarks (QBOX, LSMS, HACC, NEKbone) and Throughput Benchmarks (CAM-SE, UMT2013, AMG2013, MCB). Caption: CORAL benchmark projections show the GPU-accelerated system is expected to deliver substantial performance gains at the system level compared to a CPU-only configuration.]
3-4 Years to Prepare Applications for Sierra
2014 – CORAL contracts at LLNL and ORNL go to IBM/NVIDIA: 3-4 years of "runway".
Today – System delivery underway. Initial open science use about to begin. Transition to mission use in ~10 months.
Goal: take off flying when the machine "hits the floor".
Application enablement through the Center of Excellence – the original (and enduring) concept

Inputs: vendor application/platform expertise; staff (code developers); applications and key libraries.
Activities: hands-on collaborations, on-site vendor expertise, customized training, early delivery platform access, application bottleneck analysis, and both portable and vendor-specific solutions. Technical areas: algorithms, compilers, programming models, resilience, tools, ...
Outputs: codes tuned for next-gen platforms; rapid feedback to vendors on early issues; publications and early science; improved software tools and environment; performance portability.
A steering committee guides the effort. The LLNL/Sierra CoE is jointly funded by the ASC Program and the LLNL institution.
RAJA is our performance-portability solution that uses standard C++ idioms to target multiple back-end programming models

§ Decouple loop traversal and iterate (body)
§ An iterate is a "task" (aggregate, (re)order, ...)
§ IndexSet and execution policy abstractions simplify exploration of implementation/tuning options without disrupting source code

RAJA concepts:
• Pattern (forall, reduction, scan, etc.)
• Execution policy (how the loop runs: programming-model back-end, etc.)
• Index (index sets, segments to partition, order, ... iterations)

C-style for-loop:

    double* x ; double* y ;
    double a, tsum = 0, tmin = MYMAX;
    for ( int i = begin; i < end; ++i ) {
      y[i] += a * x[i] ;
      tsum += y[i] ;
      if ( y[i] < tmin ) tmin = y[i];
    }

RAJA transformation of the same loop:

    double* x ; double* y ;
    double a ;
    RAJA::SumReduction<reduce_policy, double> tsum(0);
    RAJA::MinReduction<reduce_policy, double> tmin(MYMAX);
    RAJA::forall< exec_policy > ( index_set , [=] (int i) {
      y[i] += a * x[i] ;
      tsum += y[i];
      tmin.min( y[i] );
    } );

RAJA allows us to write once and target multiple back-ends.
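To make the traversal/body decoupling concrete, here is a minimal, self-contained C++ sketch of the pattern. The policy tags and hand-rolled forall below are illustrative stand-ins, not RAJA's actual API:

    #include <cstdio>
    #include <vector>

    // Policy tags: "how the loop runs", kept separate from the loop body.
    struct seq_exec {};   // run iterations in order
    struct omp_exec {};   // run iterations in parallel (if built with -fopenmp)

    // Sequential back-end.
    template <typename Body>
    void forall(seq_exec, int begin, int end, Body body) {
      for (int i = begin; i < end; ++i) body(i);
    }

    // OpenMP back-end: identical body, different traversal.
    template <typename Body>
    void forall(omp_exec, int begin, int end, Body body) {
      #pragma omp parallel for
      for (int i = begin; i < end; ++i) body(i);
    }

    int main() {
      const int N = 1000;
      std::vector<double> x(N, 1.0), y(N, 2.0);
      const double a = 0.5;

      // The daxpy body is written once; only the policy tag changes.
      forall(seq_exec{}, 0, N, [&](int i) { y[i] += a * x[i]; });
      // forall(omp_exec{}, 0, N, [&](int i) { y[i] += a * x[i]; });

      std::printf("y[0] = %.2f\n", y[0]);  // prints 2.50
      return 0;
    }

Swapping the policy tag retargets the same loop body without touching it; in RAJA proper, reductions additionally go through reducer objects (as in the slide above) so they stay correct when the body executes on a GPU.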
Early Access systems provide critical pre-Sierra generation resources

Early Access to CORAL software technology, based on IBM's current technical programming offerings with enhancements for new function and scalability. Includes:
• Initial messaging stack implementation
• Initial compute node kernel
• Compilers: open-source LLVM, PGI compiler, XL compiler

Early Access compute system: 18 compute nodes, EDR IB switch fabric, Ethernet management, air-cooled.
Early Access compute node: "Minsky" HPC server, 2x POWER8+ processors, 4x NVIDIA GP100 with NVLINK, 256 GB SMP (32x 8 GB DDR4).
Initial Early Access GPFS: GPFS management servers, 2 POWER8 storage controllers, 1 GL-2 (2x58 6 TB SAS drives) or 1 GL-4 (4x58 6 TB SAS drives), EDR InfiniBand switch fabric; since enhanced to GL-6 (6x58 6 TB SAS drives).
Three LLNL Early Access systems support ASC and institutional preparations for Sierra
• Ray: 896.4 TF (compute, storage, and infrastructure racks; compute nodes on Ray have 1.6 TB NVMe drives)
• Shark: 597.6 TF (compute, storage, infrastructure, ESS)
• Manta: 597.6 TF (compute, storage, infrastructure, ESS)
Sierra COE focus over time, and how it has necessarily shifted as technology and code readiness advance

Relative levels of effort shift over ~5 years:

Early:
• Training; topical "deep dives"
• Development of portability abstractions (RAJA)
• Early explorations on key kernels; exploration with proxy apps
• Identify algorithmic and data-structure changes
• Hack-a-thons; use of available non-IBM GPU resources

Middle:
• Transition to early delivery hardware
• Focus on production apps
• Performance prediction and modeling
• Test portability

At delivery of Sierra:
• Transition to Sierra
• Performance optimization
• Deliver "early science" results
The COE has broadened to encompass most Sierra preparation activities

§ The Sierra COE began with vendor proposals, as part of the Sierra procurement, for proposed NRE (non-recurring engineering) activities
— Inspired by ORNL Titan activities
— IBM Research and NVIDIA DevTech are staffed with a cadre of top-notch application experts
§ Has become an organizing principle beyond vendor contributions
— Ties to Livermore Computing (LC) CORAL working groups (compilers and tools)
— ASQ division staff continue to provide key CS support to ASC code teams
— CASC leading several projects (RAJA, ICOE Tools and Libraries)
§ The LLNL Advanced Architecture and Portability Specialists (AAPS) team has become an integral and enduring part of our COE strategy
— Providing targeted expertise to both ASC and ICOE teams

The Sierra COE combines the Sierra NRE subcontract with IBM, the Institutional COE, and LLNL staff (code teams + AAPS).
Preliminary ASC application results on the EA system (single-node results)

• Hydro: most of code ported to RAJA. Observed GPU speedup*: 7-10x (package dependent).
• Equation of State: ported to RAJA; still porting setup/initialization and some less-used interpolation routines. Speedup: 11x.
• CG solver: internal (i.e., non-hypre) solver. Speedup: 10x.
• Hypre: setup not ported to GPU, still a potential bottleneck. Speedup: 9x.
• Unstructured Deterministic Transport: basing work on UMT benchmark results; performance limited by NVLINK bandwidth (streaming algorithm). Speedup with UMT: 2-2.5x streaming, 4-8x in memory.
• Structured Deterministic Transport: excellent results with the Kripke mini-app via RAJA; porting the main application and expecting similar results. Kripke (mini-app) speedup: 14x.
• Monte Carlo Transport: using a "big kernel" approach, and exploring via the Quicksilver mini-app. Speedup: ~1-2x.

* 4 GPUs vs. 2 P8s. Little or no tuning has been done for the Power CPU yet, and compilers are improving. Most codes currently see a 1.1-1.4x speedup over comparable Xeon CPUs (e.g., Haswell).

(On the original slide, each status was color-coded on a scale from "optimistic" through "optimistic, with caveats" and "unknown, but promising" to "may not be GPU-friendly" and "GPUs bad".)
Science and Technology on a Mission
Intelligence, Biosecurity, Nonproliferation, Defense, Science, Energy, Weapons, Counterterrorism
Stockpile Stewardship and the ASC Program
§ Annual certification of the U.S. nuclear stockpile without nuclear testing
Advanced Simulation and Computing (ASC) Program is LLNL's main HPC funding source

§ ASC delivers the leading-edge, high-end simulation capabilities needed to meet U.S. nuclear weapons assessment and certification requirements, including: weapon codes, weapon science, computing platforms, and supporting infrastructure.

§ ASC modeling and simulation provides:
— a computational surrogate for nuclear testing, which is central to U.S. national security.
— the ability to model the extraordinary complexity of nuclear weapons systems, which is essential to establish confidence in the performance of our aging stockpile.
— the necessary tools for nuclear weapons scientists and engineers to gain a comprehensive understanding of the entire weapons lifecycle, from design to safe processes for dismantlement.

§ Through close coordination with other government agencies, ASC tools also play an important role in supporting nonproliferation efforts, emergency response, and nuclear forensics.
Extreme-scale computing is needed across the spectrum of NNSA's national security missions
[Diagram: model complexity vs. simulation scale. Materials span ab initio scales from (100 nm)^3 to (10 µ)^3 across transition metals, noble metals, and actinide metals; simulations span 2D weapons' science to 3D systems. Applications: the nuclear stockpile (safety, surety, reliability, robustness) and nonproliferation and nuclear counterterrorism.]
Advanced technology insertion will establish new directions for high-performance computing
[Roadmap graphic, 2015-2025: CORAL technology leading to exascale technology; memory-intensive systems; data-analytics systems (e.g., Catalyst); machine learning and neuromorphic technologies; quantum technologies; cognitive simulation systems; and analytics/simulation convergence through workflow steering, active learning for design optimization and UQ, and hybrid accelerated simulation. Background image: clipping from "Quantum Computers," Photonics Spectra, June 2014.]
Integrated machine learning and simulation could enable dynamic validation of complex models
[Diagram: a dynamic validation loop powered by CORAL computing architectures. High-fidelity simulation over high-dimensional model parameters feeds ensembles of simulation sets; hypothesis generation uses the ML model to predict parameters for experimental data, producing distributions of predicted properties and requests for new experiments and data.]
Sierra and its EA systems are beginning an accelerator-based computing era at LLNL

§ The advantages that led us to select Sierra generalize
— Power efficiency
— Network advantages of "fat nodes"
— Clear path for performance portability
— Balancing capabilities/costs implies complex memory hierarchies
§ Broad collaboration is essential to successful deployment of increasingly complex HPC systems
— Computing centers collaborate on requirements
— Vendors collaborate on solutions

Heterogeneous computing on Sierra represents a viable path to exascale.
LLNL-PRES-746388
31
Sierra Supercomputer Powers Scientific Discovery

Más contenido relacionado

Similar a Sierra Supercomputer Powers Scientific Discovery

TeraGrid Communication and Computation
TeraGrid Communication and ComputationTeraGrid Communication and Computation
TeraGrid Communication and ComputationTal Lavian Ph.D.
 
Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Larry Smarr
 
Update on Trinity System Procurement and Plans
Update on Trinity System Procurement and PlansUpdate on Trinity System Procurement and Plans
Update on Trinity System Procurement and Plansinside-BigData.com
 
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facilityinside-BigData.com
 
Irati goals and achievements - 3rd RINA Workshop
Irati goals and achievements - 3rd RINA WorkshopIrati goals and achievements - 3rd RINA Workshop
Irati goals and achievements - 3rd RINA WorkshopEleni Trouva
 
Anna Vergeles, Nataliia Manakova "Unsupervised Real-Time Stream-Based Novelty...
Anna Vergeles, Nataliia Manakova "Unsupervised Real-Time Stream-Based Novelty...Anna Vergeles, Nataliia Manakova "Unsupervised Real-Time Stream-Based Novelty...
Anna Vergeles, Nataliia Manakova "Unsupervised Real-Time Stream-Based Novelty...Fwdays
 
Network experiences with Public Cloud Services @ TNC2017
Network experiences with Public Cloud Services @ TNC2017Network experiences with Public Cloud Services @ TNC2017
Network experiences with Public Cloud Services @ TNC2017Helix Nebula The Science Cloud
 
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...Srivatsan Ramanujam
 
Advanced Networking: The Critical Path for HPC, Cloud, Machine Learning and more
Advanced Networking: The Critical Path for HPC, Cloud, Machine Learning and moreAdvanced Networking: The Critical Path for HPC, Cloud, Machine Learning and more
Advanced Networking: The Critical Path for HPC, Cloud, Machine Learning and moreinside-BigData.com
 
Telefonica: Automatización de la gestión de redes mediante grafos
Telefonica: Automatización de la gestión de redes mediante grafosTelefonica: Automatización de la gestión de redes mediante grafos
Telefonica: Automatización de la gestión de redes mediante grafosNeo4j
 
Task allocation on many core-multi processor distributed system
Task allocation on many core-multi processor distributed systemTask allocation on many core-multi processor distributed system
Task allocation on many core-multi processor distributed systemDeepak Shankar
 
P4_tutorial.pdf
P4_tutorial.pdfP4_tutorial.pdf
P4_tutorial.pdfPramodhN3
 
LEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGATO project
 
First in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationFirst in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationInside Analysis
 
SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...
SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...
SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...LEGATO project
 
Trends in Systems and How to Get Efficient Performance
Trends in Systems and How to Get Efficient PerformanceTrends in Systems and How to Get Efficient Performance
Trends in Systems and How to Get Efficient Performanceinside-BigData.com
 
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...DataWorks Summit
 
Applying Cloud Techniques to Address Complexity in HPC System Integrations
Applying Cloud Techniques to Address Complexity in HPC System IntegrationsApplying Cloud Techniques to Address Complexity in HPC System Integrations
Applying Cloud Techniques to Address Complexity in HPC System Integrationsinside-BigData.com
 

Similar a Sierra Supercomputer Powers Scientific Discovery (20)

TeraGrid Communication and Computation
TeraGrid Communication and ComputationTeraGrid Communication and Computation
TeraGrid Communication and Computation
 
Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Panel: NRP Science Impacts​
Panel: NRP Science Impacts​
 
Update on Trinity System Procurement and Plans
Update on Trinity System Procurement and PlansUpdate on Trinity System Procurement and Plans
Update on Trinity System Procurement and Plans
 
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
 
TransPAC3/ACE Measurement & PerfSONAR Update
TransPAC3/ACE Measurement & PerfSONAR UpdateTransPAC3/ACE Measurement & PerfSONAR Update
TransPAC3/ACE Measurement & PerfSONAR Update
 
Irati goals and achievements - 3rd RINA Workshop
Irati goals and achievements - 3rd RINA WorkshopIrati goals and achievements - 3rd RINA Workshop
Irati goals and achievements - 3rd RINA Workshop
 
Anna Vergeles, Nataliia Manakova "Unsupervised Real-Time Stream-Based Novelty...
Anna Vergeles, Nataliia Manakova "Unsupervised Real-Time Stream-Based Novelty...Anna Vergeles, Nataliia Manakova "Unsupervised Real-Time Stream-Based Novelty...
Anna Vergeles, Nataliia Manakova "Unsupervised Real-Time Stream-Based Novelty...
 
Network experiences with Public Cloud Services @ TNC2017
Network experiences with Public Cloud Services @ TNC2017Network experiences with Public Cloud Services @ TNC2017
Network experiences with Public Cloud Services @ TNC2017
 
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
 
Advanced Networking: The Critical Path for HPC, Cloud, Machine Learning and more
Advanced Networking: The Critical Path for HPC, Cloud, Machine Learning and moreAdvanced Networking: The Critical Path for HPC, Cloud, Machine Learning and more
Advanced Networking: The Critical Path for HPC, Cloud, Machine Learning and more
 
Interconnect your future
Interconnect your futureInterconnect your future
Interconnect your future
 
Telefonica: Automatización de la gestión de redes mediante grafos
Telefonica: Automatización de la gestión de redes mediante grafosTelefonica: Automatización de la gestión de redes mediante grafos
Telefonica: Automatización de la gestión de redes mediante grafos
 
Task allocation on many core-multi processor distributed system
Task allocation on many core-multi processor distributed systemTask allocation on many core-multi processor distributed system
Task allocation on many core-multi processor distributed system
 
P4_tutorial.pdf
P4_tutorial.pdfP4_tutorial.pdf
P4_tutorial.pdf
 
LEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGaTO: Software Stack Runtimes
LEGaTO: Software Stack Runtimes
 
First in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationFirst in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter Integration
 
SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...
SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...
SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...
 
Trends in Systems and How to Get Efficient Performance
Trends in Systems and How to Get Efficient PerformanceTrends in Systems and How to Get Efficient Performance
Trends in Systems and How to Get Efficient Performance
 
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...
 
Applying Cloud Techniques to Address Complexity in HPC System Integrations
Applying Cloud Techniques to Address Complexity in HPC System IntegrationsApplying Cloud Techniques to Address Complexity in HPC System Integrations
Applying Cloud Techniques to Address Complexity in HPC System Integrations
 

Más de inside-BigData.com

Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...inside-BigData.com
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networksinside-BigData.com
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...inside-BigData.com
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...inside-BigData.com
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...inside-BigData.com
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networksinside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoringinside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecastsinside-BigData.com
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Updateinside-BigData.com
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuninginside-BigData.com
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODinside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Accelerationinside-BigData.com
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficientlyinside-BigData.com
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Erainside-BigData.com
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computinginside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Clusterinside-BigData.com
 

Más de inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Último

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 

Último (20)

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 

Sierra Supercomputer Powers Scientific Discovery

  • 1. LLNL-PRES-746388 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC The Sierra Supercomputer: Science and Technology on a Mission Adam Bertsch Sierra Integration Project Lead Lawrence Livermore National Laboratory February 20th, 2018
  • 2. LLNL-PRES-746388 2 § The Sierra Supercomputer § Collaboration — CORAL and OpenPower § Science and Technology on a Mission Overview
  • 3. LLNL-PRES-746388 3 § On a spring bicycle ride in Livermore five years ago. Where did Sierra start for me?
  • 4. LLNL-PRES-746388 4 AdvancedTechnology Systems(ATS) Fiscal Year ‘13 ‘14 ‘15 ‘16 ‘17 ‘18 Use Retire ‘20‘19 ‘21 Commodity Technology Systems(CTS) Procure& Deploy Sequoia (LLNL) ATS 1 – Trinity (LANL/SNL) Tri-lab Linux Capacity Cluster II (TLCC II) CTS 1 CTS 2 ‘22 System Delivery ATS 3 – Crossroads (LANL/SNL) ATS 4 – El Capitan (LLNL) ‘23 ATS 5 – (LANL/SNL) ATS 2 – Sierra (LLNL) Sequoia and Sierra are the current and next-generation Advanced Technology Systems at LLNL Sierra will be the next ASC ATS platform
  • 5. LLNL-PRES-746388 5 The Sierra system that will replace Sequoia features a GPU-accelerated architecture Mellanox Interconnect Single Plane EDR InfiniBand 2 to 1 Tapered Fat Tree IBM POWER9 • Gen2 NVLink NVIDIA Volta • 7 TFlop/s • HBM2 • Gen2 NVLink Components Compute Node 2 IBM POWER9 CPUs 4 NVIDIA Volta GPUs NVMe-compatible PCIe 1.6 TB SSD 256 GiB DDR4 16 GiB Globally addressable HBM2 associated with each GPU Coherent Shared Memory Compute Rack Standard 19” Warm water cooling Compute System 4320 nodes 1.29 PB Memory 240 Compute Racks 125 PFLOPS ~12 MW Spectrum Scale File System 154 PB usable storage 1.54 TB/s R/W bandwidth
  • 6. LLNL-PRES-746388 6 Sierra Represents a Major Shift For Our Mission- Critical Applications 2004 - present • IBM BlueGene, along with copious amounts of linux-based cluster technology dominate the HPC landscape at LLNL. • Apps scale to unprecedented 1.6M MPI ranks, but are still largely MPI-everywhere on ”lean” cores 2014 BG/L BG/P BG/Q • Heterogeneous GPGPU-based computing begins to look attractive with NVLINK and Unified Memory (P9/Volta) • Apps must undertake a 180 turn from BlueGene toward heterogeneity and large-memory nodes 2015 – 20?? • LLNL believes heterogeneous computing is here to stay, and Sierra gives us a solid leg-up in preparing for exascale and beyond
  • 7. LLNL-PRES-746388 7 Cost-neutral architectural tradeoffs This decision, counter to prevailing wisdom for system design, benefits Sierra’s planned UQ workload: 5% more UQ simulations at performance loss of < 1% § Full bandwidth from dual ported Mellanox EDR HCAs to TOR switches § Half bandwidth from TOR switches to director switches § An economic trade-off that provides approximately 5% more nodes § DRAM cost double expectation § Evaluated two options to close the cost gap § Buy half as much memory § Taper the network § Selected both options!
  • 8. LLNL-PRES-746388 8 Sierra system architecture details Sierra uSierra rzSierra Nodes 4,320 684 54 POWER9 processors per node 2 2 2 GV100 (Volta) GPUs per node 4 4 4 Node Peak (TFLOP/s) 29.1 29.1 29.1 System Peak (PFLOP/s) 125 19.9 1.57 Node Memory (GiB) 320(256+64) 320(256+64) 320(256+64) System Memory (PiB) 1.29 0.209 0.017 Interconnect 2x IB EDR 2x IB EDR 2x IB EDR Off-Node Aggregate b/w (GB/s) 45.5 45.5 45.5 Compute racks 240 38 3 Network and Infrastructure racks 13 4 0.5 Storage Racks 24 4 0.5 Total racks 277 46 4 Peak Power (MW) ~12 ~1.8 ~0.14
  • 9. LLNL-PRES-746388 9 CORAL Targets compared to Sierra § Target speedup over current systems of 4x on Scalable benchmarks … § … and 6x on Throughput benchmarks § Aggregate memory of 4 PB and ≥ 1 GB per MPI task (2 GB preferred) § Max power of system and peripherals ≤ 20MW § Mean Time Between Application Failure that requires human intervention ≥ 6 days § Architectural Diversity § Delivery in 2017 with acceptance in 2018 § Peak Performance ≥ 100 PF
  • 11. LLNL-PRES-746388 11 Sierra is part of CORAL, the Collaboration of Oak Ridge, Argonne and Livermore CORAL is the next major phase in the U.S. Department of Energy’s scientific computing roadmap and path to exascale computing Modeled on successful LLNL/ANL/IBM Blue Gene partnership (Sequoia/Mira) Long-term contractual partnership with 2 vendors 2 awardees for 3 platform acquisition contracts 2 nonrecurring eng. contracts NRE contract ANL Aurora contract NRE contract ORNL Summit contract LLNL Sierra contract RFP BG/Q Sequoia BG/L BG/P Dawn LLNL’s IBM Blue Gene Systems
  • 13. LLNL-PRES-XXXXXX 13 LLNL-PRES-683853 21& !  Motherboard'design' !  Water'cooled'compute'nodes'' !  GPU'resilience'' !  Switch'based'collectives,' FlashDirect' !  Open'source'compiler' infrastructure' CORAL&NRE&will&provide&significant&benefit& to&the&final&systems& !  System'diagnostics' !  System'scheduling'' !  Burst'buffer'' !  GPFS'performance'and' scalability' !  Cluster'management' !  Open'source'tools' !  Center'of'Excellence' Eight''joint'lab/contractor'working'groups'are'actively' working'on'all'the'above'and'more'''
  • 14. LLNL-PRES-746388 14 CORAL NRE drove Spectrum Scale advances that will vastly improve Sierra file system Original Plan of Record Modified (Cost Neutral) § 120PB delivered capacity § Conservative drive capacity estimates § 1TB/s write, 1.2TB/s read § Concern about metadata targets NRE + Hardware Advancements Close partnerships lead to ecosystem improvements § 154PB delivered capacity § Substantially increased drive capacities § 1.54 TB/s write/read § Enhanced Spectrum Scale metadata perf. (many optimizations already released)
  • 15. LLNL-PRES-746388 15 Outstanding benchmark analysis by IBM and NVIDIA demonstrates the system’s usability Projections included code changes that showed tractable annotation-based approach (i.e., OpenMP) will be competitive Figure 5: CORAL benchmark projections show GPU-accelerated system is expected to deliver substan performance at the system level compared to CPU-only configuration. The demonstration of compelling, scalable performance at the system level across a 13 CORAL APPLICATION PERFORMANCE PROJECTIONS 0x 2x 4x 6x 8x 10x 12x 14x QBOX LSMS HACC NEKbone RelativePerformance Scalable Science Benchmarks CPU-only CPU + GPU 0x 2x 4x 6x 8x 10x 12x 14x CAM-SE UMT2013 AMG2013 MCB RelativePerformance Throughput Benchmarks CPU-only CPU + GPU
  • 16. LLNL-PRES-746388 16 3-4 Years to Prepare Applications for Sierra 2014 – CORAL contracts at LLNL and ORNL go to IBM/NVIDIA. 3-4 years of “runway” Today – System Delivery Underway. Initial open science use about to begin. Transition to mission use in ~10 months. Goal: Take off flying when machine “hits the floor”
  • 17. LLNL-PRES-746388 17 Application enablement through the Center of Excellence – the original (and enduring) concept Vendor Application / Platform Expertise Staff (code developers) Applications and Key Libraries Hands-on collaborations On-site vendor expertise Customized training Early delivery platform access Application bottleneck analysis Both portable and vendor-specific solutions Technical areas: algorithms, compilers, programming models, resilience, tools, … Codes tuned for next-gen platforms Rapid feedback to vendors on early issues Publications, early science Steering committee The LLNL/Sierra CoE is jointly funded by the ASC Program and the LLNL Institution Improved Software Tools and Environment Performance Portability
  • 18. LLNL-PRES-746388 18 § Decouple loop traversal and iterate (body) § An iterate is a “task” (aggregate, (re)order, …) § IndexSet and execution policy abstractions simplify exploration of implementation/tuning options without disrupting source code RAJA is our performance-portability solution that uses standard C++ idioms to target multiple back-end programming models double* x ; double* y ; double a, tsum = 0, tmin = MYMAX; for ( int i = begin; i < end; ++i ) { y[i] += a * x[i] ; tsum += y[i] ; if ( y[i] < tmin ) tmin = y[i]; } C-style for-loop Execution Policy (how loop runs: PM backend, etc.) Index (index sets, segments to partition, order, …. iterations) Pattern (forall, reduction, scan, etc.) RAJA Concepts RAJA-style loop double* x ; double* y ; double a ; RAJA::SumReduction<reduce_policy, double> tsum(0); RAJA::MinReduction<reduce_policy, double> tmin(MYMAX); RAJA::forall< exec_policy > ( IndexSet , [=] (int i) { y[i] += a * x[i] ; tsum += y[i]; tmin.min( y[i] ); } ); double* x ; double* y ; double a ; RAJA::SumReduction<reduce_policy, double> tsum(0); RAJA::MinReduction<reduce_policy, double> tmin(MYMAX); RAJA::forall< exec_policy > ( index_set , [=] (int i) { y[i] += a * x[i] ; tsum += y[i]; tmin.min( y[i] ); } ); RAJA Transformation RAJA allows us to write-once, target multiple back-ends
  • 19. LLNL-PRES-746388 19 Early Access to CORAL Software Technology Based on IBM’s current technical programming offerings with enhancements for new function and scalability. Includes: • Initial messaging stack implementation • Initial Compute node kernel • Compilers: o Open Source LLVM o PGI compiler o XL compiler Early Access systems provide critical pre-Sierra generation resources Early Access Compute System • 18 Compute Nodes • EDR IB switch fabric • Ethernet mgmt • Air-cooled Early Access Compute Node • “Minsky” HPC Server • 2xPOWER8+ processors • 4xNVIDIA GP100; NVLINK • 256GB SMP (32x8GB DDR4) Initial Early Access GPFS • GPFS Management servers • 2 POWER8 Storage controllers • 1 GL-2; 2x58 6TB SAS drives or • 1 GL-4; 4x58 6TB SAS drives • EDR InfiniBand switch fabric • Enhanced to GL-6, 6x58 6TB SAS drives
  • 20. LLNL-PRES-746388 20 Three LLNL Early Access systems support ASC and institutional preparations for Sierra Shark 597.6TF Compute Compute Storage Infrastructure ESS Ray: 896.4 TF Compute Compute StorageCompute Infrastructure Compute nodes on Ray have 1.6TB NVMe drives Compute Compute Storage Infrastructure ESS Manta 597.6 TF
  • 21. LLNL-PRES-746388 21 Sierra COE focus over time and how it necessarily has shifted as technology and code readiness advances Time (~ 5 years) Relativelevelsofeffort • Training • Topical “deep dives” • Develop of portability abstractions (RAJA) • Early explorations on key kernels • Exploration with proxy apps • Identify algorithmic and data structures changes • Hack-a-thons • Use of available non-IBM GPU resources DeliveryofSierra • Transition to Sierra • Performance optimization • Deliver “early science” results • Transition to early delivery hardware • Focus on production apps • Performance prediction and modeling • Test portability
  • 22. LLNL-PRES-746388 22 The COE has broadened to encompass most Sierra preparation activities § The Sierra COE began with vendor proposals as part of the Sierra procurement for proposed NRE (non-recurring engineering) activities — Inspired by ORNL Titan activities — IBM Research and NVIDIA Dev Tech’s are staffed with a cadre of top-notch application experts § Has become an organizing principle beyond vendor contributions — Ties to Livermore Computing (LC) CORAL working groups (compilers and tools) — ASQ division staff continue to provide key CS support to ASC code teams — CASC leading several projects (RAJA, ICOE Tools and Libraries) § The LLNL Advance Architecture and Portability Specialists (AAPS) team has become an integral and enduring part of our COE strategy — Providing targeted expertise to both ASC and ICOE teams Sierra COE Sierra NRE subcontract w/ IBM Institutional COE LLNL Staff (Code Teams + AAPS)
  • 23. LLNL-PRES-746388 23 Preliminary ASC application results on the EA system (single-node results)
Application/package/library | Status and notes | Observed GPU speedup*
Hydro | Most of the code ported to RAJA | 7–10x (package dependent)
Equation of state | Ported to RAJA; still porting setup/initialization and some less-used interpolation routines | 11x
CG solver | Internal (i.e., non-hypre) solver | 10x
hypre | Setup not ported to GPU; still a potential bottleneck | 9x
Unstructured deterministic transport | Basing work on UMT benchmark results; performance limited by NVLINK bandwidth (streaming algorithm) | With UMT: 2–2.5x streaming, 4–8x in memory
Structured deterministic transport | Excellent results with the Kripke mini-app via RAJA; porting the main application and expecting similar results | Kripke (mini-app): 14x
Monte Carlo transport | Using a “big kernel” approach; exploring via the Quicksilver mini-app | ~1–2x
* 4 GPUs vs. 2 POWER8 CPUs
(The slide color-codes each row’s outlook on a scale: Optimistic; Optimistic, with caveats; Unknown, but promising; May not be GPU-friendly; GPUs bad.)
Little or no tuning has been done for the POWER CPU yet, and compilers are improving. Most codes currently see a 1.1–1.4x speedup over comparable Xeon CPUs (e.g., Haswell).
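For context on how speedup figures like these are typically produced, the sketch below shows a generic best-of-N wall-clock comparison between two variants of the same kernel. It is purely illustrative; the table’s numbers came from the full ASC packages and mini-apps on EA hardware, not from this harness, and the two lambdas are hypothetical stand-ins.

  #include <chrono>
  #include <cstdio>
  #include <functional>

  // Best-of-N wall-clock timing to damp run-to-run noise.
  double best_seconds(const std::function<void()>& work, int trials = 5) {
    double best = 1.0e30;
    for (int t = 0; t < trials; ++t) {
      auto start = std::chrono::steady_clock::now();
      work();
      std::chrono::duration<double> dt = std::chrono::steady_clock::now() - start;
      if (dt.count() < best) best = dt.count();
    }
    return best;
  }

  int main() {
    // Hypothetical stand-ins: in practice these would run the same physics
    // kernel under a CPU execution policy and a GPU execution policy.
    auto cpu_variant = [] { /* ... */ };
    auto gpu_variant = [] { /* ... */ };

    const double t_cpu = best_seconds(cpu_variant);
    const double t_gpu = best_seconds(gpu_variant);
    std::printf("speedup = %.2fx\n", t_cpu / t_gpu);
    return 0;
  }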
  • 24. LLNL-PRES-746388 24 Science and Technology on a Mission
Mission areas: Intelligence; Biosecurity; Nonproliferation; Defense; Science; Energy; Weapons; Counterterrorism
  • 25. LLNL-PRES-746388 25 Stockpile Stewardship and the ASC Program
§ Annual certification of the U.S. nuclear stockpile without nuclear testing
  • 26. LLNL-PRES-683853 26 The Advanced Simulation and Computing (ASC) Program is LLNL’s main HPC funding source
§ ASC delivers the leading-edge, high-end simulation capabilities needed to meet U.S. nuclear weapons assessment and certification requirements, including: weapon codes, weapon science, computing platforms, and supporting infrastructure.
§ ASC modeling and simulation provides:
— a computational surrogate for nuclear testing, which is central to U.S. national security.
— the ability to model the extraordinary complexity of nuclear weapons systems, which is essential to establish confidence in the performance of our aging stockpile.
— the necessary tools for nuclear weapons scientists and engineers to gain a comprehensive understanding of the entire weapons lifecycle, from design to safe processes for dismantlement.
§ Through close coordination with other government agencies, ASC tools also play an important role in supporting nonproliferation efforts, emergency response, and nuclear forensics.
  • 28. LLNL-PRES-746388 28 Advanced technology insertion will establish new directions for high-performance computing
Technology timeline (2015, 2018, 2021, 2025): CORAL technology; exascale technology; memory-intensive systems; machine learning and neuromorphic technologies; quantum technologies; data-analytics systems (e.g., Catalyst); cognitive simulation systems
Emerging capabilities: workflow steering; active learning for design optimization and UQ; hybrid accelerated simulation; analytics and simulation convergence
[Image: excerpt from a Photonics Spectra (June 2014) article on quantum computing platforms — NV− centers, photonic buffers, and superconducting Xmon qubits]
  • 29. LLNL-PRES-746388 29 Integrated machine learning and simulation could enable dynamic validation of complex models
The validation loop: high-fidelity simulation → ensembles of simulation sets over high-dimensional model parameters → distribution of predicted properties → hypothesis generation (use the ML model to predict parameters for experimental data) → requests for new experiments and data → back to simulation
CORAL computing architectures power the dynamic validation loop. A hedged code sketch of the loop’s structure appears below.
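The following is a structural sketch of that loop under stated assumptions: the slide names the stages but not an API, so every type and function here is a hypothetical stand-in with a trivial placeholder body.

  #include <cstdio>
  #include <vector>

  struct Params { std::vector<double> theta; };      // high-dimensional model parameters
  struct Result { std::vector<double> predicted; };  // distribution of predicted properties

  // Hypothetical stand-ins for the loop's stages (not a real API).
  std::vector<Params> propose_ensemble(int n) { return std::vector<Params>(n); }
  Result run_simulation(const Params&) { return Result{}; }   // high-fidelity simulation
  void train_surrogate(const std::vector<Params>&,
                       const std::vector<Result>&) {}         // update the ML model
  Params infer_from_experiment() { return Params{}; }         // hypothesis generation
  void request_new_experiments(const Params&) {}              // close the loop
  bool validated(int iter) { return iter >= 3; }              // placeholder stop test

  int main() {
    for (int iter = 0; !validated(iter); ++iter) {
      // 1. Run an ensemble of forward simulations across parameter space.
      std::vector<Params> ensemble = propose_ensemble(64);
      std::vector<Result> results;
      for (const Params& p : ensemble) results.push_back(run_simulation(p));

      // 2. Train or update the ML surrogate on the simulation inputs/outputs.
      train_surrogate(ensemble, results);

      // 3. Use the surrogate to predict parameters that explain experimental
      //    data, then request the experiments that would test the hypothesis.
      request_new_experiments(infer_from_experiment());
    }
    std::printf("dynamic validation loop converged\n");
    return 0;
  }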
  • 30. LLNL-PRES-746388 30 Sierra and its EA systems begin an accelerator-based computing era at LLNL
§ The advantages that led us to select Sierra generalize
— Power efficiency
— Network advantages of “fat” nodes
— A clear path to performance portability
— Balancing capability against cost implies complex memory hierarchies
§ Broad collaboration is essential to the successful deployment of increasingly complex HPC systems
— Computing centers collaborate on requirements
— Vendors collaborate on solutions
Heterogeneous computing on Sierra represents a viable path to exascale