2. Towards faster airplane design

Boeing: number of wing prototypes prepared for wind-tunnel testing

  Date                1980        1995   2005
  Airplane            B757/B767   B777   B787
  # wing prototypes   77          11     11

Plateau due to RANS limitations.
Further decrease expected from LES at the exaflop scale.
Design of the Airbus A380
2. Towards faster airplane design (cont.)

Airbus: "More simulation, less tests"

From the A380 to the A350:
- 40% fewer wind-tunnel days
- 25% saving in aerodynamics development time
- 20% saving in wind-tunnel test cost
thanks to HPC-enabled CFD runs, especially in the high-speed regime, providing an even better representation of aerodynamic phenomena that turned into better design choices.

Acknowledgements: E. Chaput (Airbus)
2. Oil industry
Design of ITER
TOKAMAK (JET)
Fundamental Sciences
Materials: a new path to competitiveness

On-demand materials for effective commercial use:
- Conductivity: reduction of energy losses
- Lifetime: corrosion protection, e.g. chrome
- Fissures: safety assurance from molecular design
- Optimisation of materials / lubricants: less friction, longer lifetime, lower energy losses

Industrial need to speed up simulation from months to days.
From all-atom to multi-scale models: exascale enables simulation of larger and more realistic systems and devices.
Life Sciences and Health
Scales of simulation: population – organ – tissue – cell – macromolecule – small molecule – atom
Supercomputing, theory and experimentation
Courtesy of IBM
Holistic approach … towards the exaflop

- Applications: computational complexity, asynchronous algorithms
- Job scheduling: moldability, resource awareness, load balancing, user satisfaction
- Programming model: address space, dependencies, work generation
- Run time: locality optimization, concurrency extraction
- Interconnection: topology and routing, external contention
- Processor/node architecture: NIC design, run-time support, HW counters, memory subsystem, core structure
10+ Pflop/s systems planned
● Fujitsu Kei (K computer)
  ● 80,000 8-core SPARC64 VIIIfx processors at 2 GHz
    (16 Gflop/s per core, 58 watts, 3.2 Gflop/s per watt),
    16 GB/node (1 PB total memory), 6D mesh-torus interconnect, 10 Pflop/s
● Cray's Titan at DOE, Oak Ridge National Laboratory
  ● Hybrid system with NVIDIA GPUs: 1 Pflop/s in 2011,
    20 Pflop/s in 2012, late-2011 prototype
  ● $100 million
10+ Pflop/s systems planned (cont.)
● IBM Blue Waters at Illinois
  ● 40,000 8-core Power7 processors, 1 PB memory,
    18 PB disk, 500 PB archival storage,
    10 Pflop/s, 2012, $200 million
● IBM Blue Gene/Q systems:
  ● Mira at DOE, Argonne National Laboratory: 49,000 nodes,
    16-core Power A2 processors (1.6-3 GHz),
    750 K cores, 750 TB memory, 70 PB disk,
    5D torus, 10 Pflop/s
  ● Sequoia at Lawrence Livermore National Laboratory:
    98,304 nodes (96 racks), 16-core A2 processors,
    1.6 M cores (1 GB/core), 1.6 PB memory, 6 MW,
    3 Gflop/s per watt, 20 Pflop/s, 2012
Japan plan for exascale
Heterogeneous, distributed-memory, GigaHz × KiloCore × MegaNode system

  2012:        K Machine      10 PF
  2015:        10K Machine    100 PF
  2018-2020:   100K Machine   ExaFlops

Feasibility study (2012-2013), Exascale project (2014-2020), post-petascale projects
[Two slides courtesy of S. Borkar, Intel]
NVIDIA: chip for the exaflop computer (courtesy of Bill Dally)

NVIDIA: node for the exaflop computer (courtesy of Bill Dally)
Exascale supercomputer (courtesy of Bill Dally)
BSC-CNS: international initiatives (IESP)

Improve the world's simulation and modeling capability by improving the coordination and development of the HPC software environment.

Build an international plan for developing the next generation of open-source software for scientific high-performance computing.
Back to Babel?

Book of Genesis:
"Now the whole earth had one language and the same words" …
… "Come, let us make bricks, and burn them thoroughly." …
… "Come, let us build ourselves a city, and a tower with its top in the heavens, and let us make a name for ourselves" …
And the LORD said, "Look, they are one people, and they have all one language; and this is only the beginning of what they will do; nothing that they propose to do will now be impossible for them. Come, let us go down, and confuse their language there, so that they will not understand one another's speech."

The computer age:
Fortran & MPI … and then Cilk++, Fortress, X10, CUDA, Sisal, HPF, StarSs, RapidMind, Sequoia, CAF, ALF, OpenMP, UPC, SDK, Chapel, MPI, …

(Thanks to Jesus Labarta)
"You will see… 400 years from now, people will get crazy"

A new generation of programmers sits at the intersection of parallel programming, multicore/manycore architectures and new usage models.

Source: Picasso, Don Quixote
Dr. Avi Mendelson (Microsoft), keynote at ISC 2007
Different models of computation …

● The dream of automatically parallelizing compilers has not come true …
● … so the programmer needs to express opportunities for parallel execution in the application
  SPMD (OpenMP 2.5)  →  nested fork-join (OpenMP 3.0)  →  DAG / data flow
  Huge lookahead & reuse …  latency / EBW / scheduling
● And … asynchrony (MPI and OpenMP are too synchronous):
  ● Collectives/barriers multiply the effects of microscopic load imbalance, OS noise, … (see the sketch below)
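A minimal sketch (mine, not from the slides) of why per-iteration collectives amplify microscopic load imbalance: every rank must wait for the slowest one at every MPI_Allreduce, so small random delays accumulate over the run instead of averaging out. The iteration count and the artificial jitter injected with usleep are illustrative assumptions.

/* Sketch: a per-iteration collective amplifies per-rank jitter. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    srand(rank + 1);

    double t0 = MPI_Wtime();
    double local = 0.0, global = 0.0;
    for (int it = 0; it < 1000; it++) {
        usleep(100 + rand() % 50);          /* "computation" with microscopic jitter */
        local += 1.0;
        /* Synchronizing collective: the iteration finishes only when the slowest
           rank arrives, so the worst-case jitter of all ranks is paid every time. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }
    if (rank == 0)
        printf("elapsed %.3f s vs. roughly 0.125 s of purely local work\n",
               MPI_Wtime() - t0);
    MPI_Finalize();
    return 0;
}

An asynchronous, dependence-driven formulation removes these global synchronization points, which is the motivation for the StarSs model on the following slides.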
StarSs: … generates the task graph at run time …

#pragma css task input(A, B) output(C)
void vadd3 (float A[BS], float B[BS], float C[BS]);

#pragma css task input(sum, A) output(B)
void scale_add (float sum, float A[BS], float B[BS]);

#pragma css task input(A) inout(sum)
void accum (float A[BS], float *sum);

for (i=0; i<N; i+=BS)      // C = A + B
   vadd3 (&A[i], &B[i], &C[i]);
...
for (i=0; i<N; i+=BS)      // sum(C[i])
   accum (&C[i], &sum);
...
for (i=0; i<N; i+=BS)      // B = sum * E
   scale_add (sum, &E[i], &B[i]);
...
for (i=0; i<N; i+=BS)      // A = C + D
   vadd3 (&C[i], &D[i], &A[i]);
...
for (i=0; i<N; i+=BS)      // E = C + F
   vadd3 (&C[i], &F[i], &E[i]);

Task graph generation: each loop instantiates four tasks (nodes 1-4, 5-8, 9-12, 13-16 and 17-20 in the slide's graph), linked by the dependences the runtime derives from the input/output/inout clauses. A sequential equivalent is sketched below.
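To make the dataflow concrete, here is a self-contained sketch (mine, not from the slides) of the same computation written as plain sequential C; the task bodies, BS, N and the input values are illustrative assumptions. The runtime recovers exactly these read/write relations from the clauses above.

#include <stdio.h>

#define BS 4
#define N  16      /* N/BS = 4 blocks, matching the four tasks per loop on the slide */

/* Illustrative plain-C bodies for the three tasks declared above. */
void vadd3(const float A[BS], const float B[BS], float C[BS]) {
    for (int j = 0; j < BS; j++) C[j] = A[j] + B[j];
}
void accum(const float A[BS], float *sum) {
    for (int j = 0; j < BS; j++) *sum += A[j];
}
void scale_add(float sum, const float A[BS], float B[BS]) {
    for (int j = 0; j < BS; j++) B[j] = sum * A[j];
}

int main(void) {
    float A[N], B[N], C[N], D[N], E[N], F[N], sum = 0.0f;
    for (int i = 0; i < N; i++) { A[i] = 1; B[i] = 2; D[i] = 3; E[i] = 4; F[i] = 5; }

    for (int i = 0; i < N; i += BS) vadd3(&A[i], &B[i], &C[i]);    /* C = A + B   (writes C)              */
    for (int i = 0; i < N; i += BS) accum(&C[i], &sum);            /* sum(C[i])   (reads C, updates sum)  */
    for (int i = 0; i < N; i += BS) scale_add(sum, &E[i], &B[i]);  /* B = sum*E   (reads sum/E, overwrites B) */
    for (int i = 0; i < N; i += BS) vadd3(&C[i], &D[i], &A[i]);    /* A = C + D   (reads C, overwrites A) */
    for (int i = 0; i < N; i += BS) vadd3(&C[i], &F[i], &E[i]);    /* E = C + F   (reads C, overwrites E) */

    printf("sum = %f, A[0] = %f, B[0] = %f, E[0] = %f\n", sum, A[0], B[0], E[0]);
    return 0;
}

Note the anti-dependences this sequential order hides: the third loop overwrites B, which the first loop read, and the last two loops overwrite A and E. The renaming discussed two slides later is what allows the runtime to start such writers early.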
StarSs: … and executes as efficiently as possible …

(Same code and task graph as on the previous slide; the numbers annotating the graph now indicate the time step at which the runtime schedules each task instance, showing that independent tasks from different loops execute concurrently as soon as their inputs are ready.)
StarSs: … benefiting from data-access information

● Flat global address space seen by the programmer
● Flexibility to dynamically traverse the dataflow graph, "optimizing"
  ● Concurrency, critical path
  ● Memory access
● Opportunities for
  ● Prefetch
  ● Reuse
  ● Eliminating anti-dependences (renaming; see the sketch below)
  ● Replication management
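A minimal sketch (my illustration; rt_rename, rt_commit and block_t are invented names, not the StarSs API) of how renaming eliminates a write-after-read anti-dependence: the writer task gets a fresh instance of the block, so it does not have to wait for earlier readers of the old one.

/* Hypothetical renaming sketch (invented helper names, not the StarSs API). */
#include <stdlib.h>

#define BS 4

typedef struct {
    float *current;                       /* storage the block name maps to now */
} block_t;

static float *rt_rename(void)
{
    /* Fresh storage for a writer task; the old instance stays valid for any
       reader tasks still in flight, so the writer does not have to wait. */
    return malloc(BS * sizeof(float));
}

static void rt_commit(block_t *blk, float *renamed)
{
    /* Once every reader of the old instance has finished, retire it and make
       the new instance the one the program name refers to. */
    free(blk->current);
    blk->current = renamed;
}

int main(void)
{
    block_t E = { .current = calloc(BS, sizeof(float)) };

    /* A reader task (e.g. scale_add reading E) may still be using E.current. */
    float *E_new = rt_rename();           /* writer task (E = C + F) gets its own instance */
    for (int j = 0; j < BS; j++)
        E_new[j] = 1.0f + 5.0f;
    rt_commit(&E, E_new);                 /* swap instances once the readers are done */

    free(E.current);
    return 0;
}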
StarSs: enabler for exascale

Can exploit very unstructured parallelism
  Not just loop/data parallelism
  Easy to change structure
Supports large amounts of lookahead
  Not stalling for dependence satisfaction
  Allows locality optimizations to tolerate latency
  Overlap of data transfers, prefetch, reuse
Nicely hybridizes into MPI/StarSs (see the sketch below)
  Propagates the node-level dataflow characteristics to large scale
  Overlaps communication and computation
  A chance against Amdahl's law

Support for heterogeneity
  Any number and combination of CPUs and GPUs
  Including autotuning
Malleability: decouples the program from the resources
  Allowing dynamic resource allocation and load balance
  Tolerates noise
Data flow; asynchrony
  Potential is there; can blame the runtime
Compatible with proprietary low-level technologies
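A minimal sketch (mine, in the StarSs pragma style shown earlier, not code from the slides) of the MPI/StarSs hybridization and the communication/computation overlap: the halo exchange is itself a task, so only the block that actually needs the incoming data waits for it. The block layout, neighbour setup, NB, BS and the update kernel are illustrative assumptions.

/* Hypothetical MPI/StarSs hybrid sketch: communication expressed as a task. */
#include <mpi.h>

#define BS 1024
#define NB 64                           /* local blocks per rank (illustrative) */

#pragma css task input(right, left) inout(halo)
void exchange_halo(float halo[BS], int right, int left)
{
    /* Ring exchange of the boundary block with the neighbours. */
    MPI_Sendrecv_replace(halo, BS, MPI_FLOAT, right, 0, left, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

#pragma css task input(in) output(out)
void update_block(float in[BS], float out[BS])
{
    for (int j = 0; j < BS; j++) out[j] = 0.5f * in[j];
}

void iteration(float blocks[NB][BS], float next[NB][BS], int right, int left)
{
    exchange_halo(blocks[NB - 1], right, left);   /* communication task */
    for (int b = 0; b < NB; b++)
        update_block(blocks[b], next[b]);
    /* Only the task reading blocks[NB-1] depends on exchange_halo; the runtime
       can overlap the message with the updates of all the other blocks. */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    static float blocks[NB][BS], next[NB][BS];
    iteration(blocks, next, (rank + 1) % size, (rank + size - 1) % size);

    MPI_Finalize();
    return 0;
}

Compiled without a StarSs compiler the pragmas are ignored and the code simply runs sequentially; with it, the directionality clauses give the runtime the dependences it needs to perform the overlap.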
StarSs: history / strategy / versions

Basic SMPSs
  Must provide directionality for every argument
  Contiguous, non-partially-overlapping arguments
  Renaming
  Several schedulers (priority, locality, …)
  No nesting
  C / Fortran
  MPI/SMPSs optimizations

SMPSs regions
  C, no Fortran
  Must provide directionality for every argument
  Overlapping and strided arguments

OmpSs
  Reshaping of strided accesses
  Priority- and locality-aware scheduling
  C/C++; Fortran under development
  OpenMP compatibility (~)
  Dependences based only on arguments with directionality
  Contiguous arguments (addresses used as sentinels)
  Separate dependences/transfers
  Inlined/outlined pragmas (see the fragment below)
  Nesting
  SMP/GPU/Cluster
  No renaming
  Several schedulers: "simple" locality-aware scheduler, …
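For concreteness, a small OmpSs-style fragment (my sketch, not from the slides) using the inlined pragma form with in/inout directionality clauses; the exact clause spelling accepted by a given OmpSs release is an assumption to check against its manual.

/* Hypothetical OmpSs-style fragment: inlined task pragma with directionality. */
#define BS 256

void saxpy_blocks(int nblocks, float a,
                  float x[nblocks][BS], float y[nblocks][BS])
{
    for (int b = 0; b < nblocks; b++) {
        /* One task per block; dependences come only from the in/inout clauses. */
        #pragma omp task in(x[b]) inout(y[b])
        for (int j = 0; j < BS; j++)
            y[b][j] += a * x[b][j];
    }
    #pragma omp taskwait      /* wait for all block tasks before returning */
}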
Multidisciplinary top-down approach

Investigate solutions to these and other problems: applications and algorithms, programming models, performance analysis and prediction tools, load balancing, power, processor, interconnect and node.

[Chart: computer-center power projections, 2005-2011 — power (MW, 0 to 90) split into computers and cooling, with annual cost annotations rising from $3M to $9M, $17M, $23M and $31M.]
Green/Top 500, November 2011

Green500 | Top500 | Mflops/W | Power (kW) | Site | Computer
1   | 64  | 2026.48 | 85.12    | IBM - Rochester | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
2   | 65  | 2026.48 | 85.12    | IBM Thomas J. Watson Research Center | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
3   | 29  | 1996.09 | 170.25   | IBM - Rochester | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
4   | 17  | 1988.56 | 340.5    | DOE/NNSA/LLNL | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
5   | 284 | 1689.86 | 38.67    | IBM Thomas J. Watson Research Center | NNSA/SC Blue Gene/Q Prototype 1
6   | 328 | 1378.32 | 47.05    | Nagasaki University | DEGIMA Cluster, Intel i5, ATI Radeon GPU, Infiniband QDR
7   | 114 | 1266.26 | 81.5     | Barcelona Supercomputing Center | Bullx B505, Xeon E5649 6C 2.53 GHz, Infiniband QDR, NVIDIA 2090
8   | 102 | 1010.11 | 108.8    | TGCC / GENCI | Curie Hybrid Nodes - Bullx B505, Xeon E5640 2.67 GHz, Infiniband QDR
9   | 21  | 963.70  | 515.2    | Institute of Process Engineering, Chinese Academy of Sciences | Mole-8.5 Cluster, Xeon X5520 4C 2.27 GHz, Infiniband QDR, NVIDIA 2050
10  | 5   | 958.35  | 1243.8   | GSIC Center, Tokyo Institute of Technology | HP ProLiant SL390s G7, Xeon 6C X5670, NVIDIA GPU, Linux/Windows
11  | 96  | 928.96  | 126.27   | Virginia Tech | SuperServer 2026GT-TRF, Xeon E5645 6C 2.40 GHz, Infiniband QDR, NVIDIA 2050
12  | 111 | 901.54  | 117.91   | Georgia Institute of Technology | HP ProLiant SL390s G7, Xeon 6C X5660 2.8 GHz, NVIDIA Fermi, Infiniband QDR
13  | 82  | 891.88  | 160      | CINECA / SCS - SuperComputing Solution | iDataPlex DX360M3, Xeon E5645 6C 2.40 GHz, Infiniband QDR, NVIDIA 2070
14  | 256 | 891.87  | 76.25    | Forschungszentrum Juelich (FZJ) | iDataPlex DX360M3, Xeon X5650 6C 2.66 GHz, Infiniband QDR, NVIDIA 2070
15  | 61  | 889.19  | 198.72   | Sandia National Laboratories | Xtreme-X GreenBlade GB512X, Xeon E5 (Sandy Bridge-EP) 8C 2.60 GHz, Infiniband QDR
32  | 1   | 830.18  | 12659.89 | RIKEN Advanced Institute for Computational Science (AICS) | K computer, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect
47  | 2   | 635.15  | 4040     | National Supercomputing Center in Tianjin | NUDT YH MPP, Xeon X5670 6C 2.93 GHz, NVIDIA 2050
149 | 3   | 253.09  | 6950     | DOE/SC/Oak Ridge National Laboratory | Cray XT5-HE, Opteron 6-core 2.6 GHz
56  | 4   | 492.64  | 2580     | National Supercomputing Centre in Shenzhen (NSCS) | Dawning TC3600 Blade System, Xeon X5650 6C 2.66 GHz, Infiniband QDR, NVIDIA 2050
Green/Top 500, November 2011 — Mflops/watt vs. Top500 rank

Call-outs on the chart (conversion sketched below):
- IBM and NNSA, Blue Gene/Q:              2026.48 Mflops/watt → 493 MW per exaflop
- NNSA/SC Blue Gene/Q Prototype:          1689.86 Mflops/watt → 592 MW per exaflop
- Nagasaki U., Intel i5, ATI Radeon GPU:  1378.32 Mflops/watt → 726 MW per exaflop
- BSC, Xeon 6C, NVIDIA 2090 GPU:          1266.26 Mflops/watt
Efficiency bands: >1 Gflop/watt, 500-1000 Mflops/watt, 100-500 Mflops/watt
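The MW-per-exaflop figures follow directly from the efficiencies: the power needed to sustain one exaflop is 10^6 divided by the Mflops/watt value. A small sketch (mine, not from the slides) of the conversion:

/* Convert Green500 efficiency (Mflops/watt) into MW needed for 1 exaflop:
   power_W = 1e18 flop/s / (eff * 1e6 flop/s per W)  =>  power_MW = 1e6 / eff. */
#include <stdio.h>

int main(void)
{
    const double eff[] = { 2026.48, 1689.86, 1378.32, 1266.26 };   /* Mflops/watt */
    for (int i = 0; i < 4; i++)
        printf("%8.2f Mflops/watt -> %4.0f MW per exaflop\n", eff[i], 1e6 / eff[i]);
    return 0;   /* prints about 493, 592, 726 and 790 MW */
}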