2. Towards faster airplane design

Boeing: number of wing prototypes prepared for wind-tunnel testing

  Date                1980        1995   2005
  Airplane            B757/B767   B777   B787
  # wing prototypes   77          11     11

Plateau due to RANS limitations.
Further decrease expected from LES at the exaflop scale.
Design of the Airbus A380
2. Towards faster airplane design (cont.)

Airbus: "More simulation, less tests"

From the A380 to the A350:
- 40% fewer wind-tunnel days
- 25% saving in aerodynamics development time
- 20% saving in wind-tunnel test cost
thanks to HPC-enabled CFD runs, especially in the high-speed regime, providing an even better representation of aerodynamic phenomena that turned into better design choices.

Acknowledgements: E. Chaput (Airbus)
2. Oil industry
Design of ITER
TOKAMAK (JET)
Fundamental Sciences
Materials: a new path to competitiveness

On-demand materials for effective commercial use:
- Conductivity: reduction of energy losses
- Lifetime: corrosion protection, e.g. chrome
- Fissures: safety assurance from molecular design
- Optimisation of materials / lubricants: less friction, longer lifetime, lower energy losses

Industrial need to speed up simulation from months to days.
From all-atom to multi-scale models: exascale enables simulation of larger and more realistic systems and devices.
Life Sciences and Health
Scales of simulation: population – organ – tissue – cell – macromolecule – small molecule – atom
Supercomputing, theory and experimentation
Courtesy of IBM
Holistic approach … towards the exaflop

- Applications: computational complexity, asynchronous algorithms
- Job scheduling: moldability, resource awareness, load balancing, user satisfaction
- Programming model: address space, dependencies, work generation
- Run time: locality optimization, concurrency extraction
- Interconnection: topology and routing, external contention
- Processor/node architecture: NIC design, run-time support, HW counters, memory subsystem, core structure
10+ Pflop/s systems planned
● Fujitsu Kei (K computer)
  ● 80,000 8-core SPARC64 VIIIfx processors at 2 GHz
    (16 Gflop/s per core, 58 watts, 3.2 Gflop/s per watt),
    16 GB/node (1 PB total memory), 6D mesh-torus interconnect, 10 Pflop/s
● Cray's Titan at DOE, Oak Ridge National Laboratory
  ● Hybrid system with NVIDIA GPUs: 1 Pflop/s in 2011,
    20 Pflop/s in 2012, late-2011 prototype
  ● $100 million
10+ Pflop/s systems planned (cont.)
● IBM Blue Waters at Illinois
  ● 40,000 8-core Power7 processors, 1 PB memory,
    18 PB disk, 500 PB archival storage,
    10 Pflop/s, 2012, $200 million
● IBM Blue Gene/Q systems:
  ● Mira at DOE, Argonne National Laboratory: 49,000 nodes,
    16-core Power A2 processors (1.6-3 GHz),
    750 K cores, 750 TB memory, 70 PB disk,
    5D torus, 10 Pflop/s
  ● Sequoia at Lawrence Livermore National Laboratory:
    98,304 nodes (96 racks), 16-core A2 processors,
    1.6 M cores (1 GB/core), 1.6 PB memory, 6 MW,
    3 Gflop/s per watt, 20 Pflop/s, 2012
Japan plan for exascale
Heterogeneous, distributed-memory, GigaHz × KiloCore × MegaNode system

  2012:        K Machine      10 PF
  2015:        10K Machine    100 PF
  2018-2020:   100K Machine   ExaFlops

Feasibility study (2012-2013), Exascale project (2014-2020), post-petascale projects
[Two slides courtesy of S. Borkar, Intel]
NVIDIA: chip for the exaflop computer (courtesy of Bill Dally)

NVIDIA: node for the exaflop computer (courtesy of Bill Dally)
Exascale supercomputer (courtesy of Bill Dally)
BSC-CNS: international initiatives (IESP)

Improve the world's simulation and modeling capability by improving the coordination and development of the HPC software environment.

Build an international plan for developing the next generation of open-source software for scientific high-performance computing.
Back to Babel?

Book of Genesis:
"Now the whole earth had one language and the same words" …
… "Come, let us make bricks, and burn them thoroughly." …
… "Come, let us build ourselves a city, and a tower with its top in the heavens, and let us make a name for ourselves" …
And the LORD said, "Look, they are one people, and they have all one language; and this is only the beginning of what they will do; nothing that they propose to do will now be impossible for them. Come, let us go down, and confuse their language there, so that they will not understand one another's speech."

The computer age:
Fortran & MPI … and then Cilk++, Fortress, X10, CUDA, Sisal, HPF, StarSs, RapidMind, Sequoia, CAF, ALF, OpenMP, UPC, SDK, Chapel, MPI, …

(Thanks to Jesus Labarta)
"You will see… 400 years from now, people will get crazy"

A new generation of programmers sits at the intersection of parallel programming, multicore/manycore architectures and new usage models.

Source: Picasso, Don Quixote
Dr. Avi Mendelson (Microsoft), keynote at ISC 2007
Different models of computation …

● The dream of automatically parallelizing compilers has not come true …
● … so the programmer needs to express opportunities for parallel execution in the application
  SPMD (OpenMP 2.5)  →  nested fork-join (OpenMP 3.0)  →  DAG / data flow
  Huge lookahead & reuse …  latency / EBW / scheduling
● And … asynchrony (MPI and OpenMP are too synchronous):
  ● Collectives/barriers multiply the effects of microscopic load imbalance, OS noise, … (see the sketch below)
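A minimal sketch (mine, not from the slides) of why per-iteration collectives amplify microscopic load imbalance: every rank must wait for the slowest one at every MPI_Allreduce, so small random delays accumulate over the run instead of averaging out. The iteration count and the artificial jitter injected with usleep are illustrative assumptions.

/* Sketch: a per-iteration collective amplifies per-rank jitter. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    srand(rank + 1);

    double t0 = MPI_Wtime();
    double local = 0.0, global = 0.0;
    for (int it = 0; it < 1000; it++) {
        usleep(100 + rand() % 50);          /* "computation" with microscopic jitter */
        local += 1.0;
        /* Synchronizing collective: the iteration finishes only when the slowest
           rank arrives, so the worst-case jitter of all ranks is paid every time. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }
    if (rank == 0)
        printf("elapsed %.3f s vs. roughly 0.125 s of purely local work\n",
               MPI_Wtime() - t0);
    MPI_Finalize();
    return 0;
}

An asynchronous, dependence-driven formulation removes these global synchronization points, which is the motivation for the StarSs model on the following slides.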
StarSs: … generates the task graph at run time …

#pragma css task input(A, B) output(C)
void vadd3 (float A[BS], float B[BS], float C[BS]);

#pragma css task input(sum, A) output(B)
void scale_add (float sum, float A[BS], float B[BS]);

#pragma css task input(A) inout(sum)
void accum (float A[BS], float *sum);

for (i=0; i<N; i+=BS)      // C = A + B
   vadd3 (&A[i], &B[i], &C[i]);
...
for (i=0; i<N; i+=BS)      // sum(C[i])
   accum (&C[i], &sum);
...
for (i=0; i<N; i+=BS)      // B = sum * E
   scale_add (sum, &E[i], &B[i]);
...
for (i=0; i<N; i+=BS)      // A = C + D
   vadd3 (&C[i], &D[i], &A[i]);
...
for (i=0; i<N; i+=BS)      // E = C + F
   vadd3 (&C[i], &F[i], &E[i]);

Task graph generation: each loop instantiates four tasks (nodes 1-4, 5-8, 9-12, 13-16 and 17-20 in the slide's graph), linked by the dependences the runtime derives from the input/output/inout clauses. A sequential equivalent is sketched below.
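To make the dataflow concrete, here is a self-contained sketch (mine, not from the slides) of the same computation written as plain sequential C; the task bodies, BS, N and the input values are illustrative assumptions. The runtime recovers exactly these read/write relations from the clauses above.

#include <stdio.h>

#define BS 4
#define N  16      /* N/BS = 4 blocks, matching the four tasks per loop on the slide */

/* Illustrative plain-C bodies for the three tasks declared above. */
void vadd3(const float A[BS], const float B[BS], float C[BS]) {
    for (int j = 0; j < BS; j++) C[j] = A[j] + B[j];
}
void accum(const float A[BS], float *sum) {
    for (int j = 0; j < BS; j++) *sum += A[j];
}
void scale_add(float sum, const float A[BS], float B[BS]) {
    for (int j = 0; j < BS; j++) B[j] = sum * A[j];
}

int main(void) {
    float A[N], B[N], C[N], D[N], E[N], F[N], sum = 0.0f;
    for (int i = 0; i < N; i++) { A[i] = 1; B[i] = 2; D[i] = 3; E[i] = 4; F[i] = 5; }

    for (int i = 0; i < N; i += BS) vadd3(&A[i], &B[i], &C[i]);    /* C = A + B   (writes C)              */
    for (int i = 0; i < N; i += BS) accum(&C[i], &sum);            /* sum(C[i])   (reads C, updates sum)  */
    for (int i = 0; i < N; i += BS) scale_add(sum, &E[i], &B[i]);  /* B = sum*E   (reads sum/E, overwrites B) */
    for (int i = 0; i < N; i += BS) vadd3(&C[i], &D[i], &A[i]);    /* A = C + D   (reads C, overwrites A) */
    for (int i = 0; i < N; i += BS) vadd3(&C[i], &F[i], &E[i]);    /* E = C + F   (reads C, overwrites E) */

    printf("sum = %f, A[0] = %f, B[0] = %f, E[0] = %f\n", sum, A[0], B[0], E[0]);
    return 0;
}

Note the anti-dependences this sequential order hides: the third loop overwrites B, which the first loop read, and the last two loops overwrite A and E. The renaming discussed two slides later is what allows the runtime to start such writers early.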
StarSs: … and executes as efficiently as possible …

(Same code and task graph as on the previous slide; the numbers annotating the graph now indicate the time step at which the runtime schedules each task instance, showing that independent tasks from different loops execute concurrently as soon as their inputs are ready.)
StarSs: … benefiting from data-access information

● Flat global address space seen by the programmer
● Flexibility to dynamically traverse the dataflow graph, "optimizing"
  ● Concurrency, critical path
  ● Memory access
● Opportunities for
  ● Prefetch
  ● Reuse
  ● Eliminating anti-dependences (renaming; see the sketch below)
  ● Replication management
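A minimal sketch (my illustration; rt_rename, rt_commit and block_t are invented names, not the StarSs API) of how renaming eliminates a write-after-read anti-dependence: the writer task gets a fresh instance of the block, so it does not have to wait for earlier readers of the old one.

/* Hypothetical renaming sketch (invented helper names, not the StarSs API). */
#include <stdlib.h>

#define BS 4

typedef struct {
    float *current;                       /* storage the block name maps to now */
} block_t;

static float *rt_rename(void)
{
    /* Fresh storage for a writer task; the old instance stays valid for any
       reader tasks still in flight, so the writer does not have to wait. */
    return malloc(BS * sizeof(float));
}

static void rt_commit(block_t *blk, float *renamed)
{
    /* Once every reader of the old instance has finished, retire it and make
       the new instance the one the program name refers to. */
    free(blk->current);
    blk->current = renamed;
}

int main(void)
{
    block_t E = { .current = calloc(BS, sizeof(float)) };

    /* A reader task (e.g. scale_add reading E) may still be using E.current. */
    float *E_new = rt_rename();           /* writer task (E = C + F) gets its own instance */
    for (int j = 0; j < BS; j++)
        E_new[j] = 1.0f + 5.0f;
    rt_commit(&E, E_new);                 /* swap instances once the readers are done */

    free(E.current);
    return 0;
}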
StarSs: enabler for exascale

Can exploit very unstructured parallelism
  Not just loop/data parallelism
  Easy to change structure
Supports large amounts of lookahead
  Not stalling for dependence satisfaction
  Allows locality optimizations to tolerate latency
  Overlap of data transfers, prefetch, reuse
Nicely hybridizes into MPI/StarSs (see the sketch below)
  Propagates the node-level dataflow characteristics to large scale
  Overlaps communication and computation
  A chance against Amdahl's law

Support for heterogeneity
  Any number and combination of CPUs and GPUs
  Including autotuning
Malleability: decouples the program from the resources
  Allowing dynamic resource allocation and load balance
  Tolerates noise
Data flow; asynchrony
  Potential is there; can blame the runtime
Compatible with proprietary low-level technologies
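A minimal sketch (mine, in the StarSs pragma style shown earlier, not code from the slides) of the MPI/StarSs hybridization and the communication/computation overlap: the halo exchange is itself a task, so only the block that actually needs the incoming data waits for it. The block layout, neighbour setup, NB, BS and the update kernel are illustrative assumptions.

/* Hypothetical MPI/StarSs hybrid sketch: communication expressed as a task. */
#include <mpi.h>

#define BS 1024
#define NB 64                           /* local blocks per rank (illustrative) */

#pragma css task input(right, left) inout(halo)
void exchange_halo(float halo[BS], int right, int left)
{
    /* Ring exchange of the boundary block with the neighbours. */
    MPI_Sendrecv_replace(halo, BS, MPI_FLOAT, right, 0, left, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

#pragma css task input(in) output(out)
void update_block(float in[BS], float out[BS])
{
    for (int j = 0; j < BS; j++) out[j] = 0.5f * in[j];
}

void iteration(float blocks[NB][BS], float next[NB][BS], int right, int left)
{
    exchange_halo(blocks[NB - 1], right, left);   /* communication task */
    for (int b = 0; b < NB; b++)
        update_block(blocks[b], next[b]);
    /* Only the task reading blocks[NB-1] depends on exchange_halo; the runtime
       can overlap the message with the updates of all the other blocks. */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    static float blocks[NB][BS], next[NB][BS];
    iteration(blocks, next, (rank + 1) % size, (rank + size - 1) % size);

    MPI_Finalize();
    return 0;
}

Compiled without a StarSs compiler the pragmas are ignored and the code simply runs sequentially; with it, the directionality clauses give the runtime the dependences it needs to perform the overlap.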
StarSs: history / strategy / versions

Basic SMPSs
  Must provide directionality for every argument
  Contiguous, non-partially-overlapping arguments
  Renaming
  Several schedulers (priority, locality, …)
  No nesting
  C / Fortran
  MPI/SMPSs optimizations

SMPSs regions
  C, no Fortran
  Must provide directionality for every argument
  Overlapping and strided arguments

OmpSs
  Reshaping of strided accesses
  Priority- and locality-aware scheduling
  C/C++; Fortran under development
  OpenMP compatibility (~)
  Dependences based only on arguments with directionality
  Contiguous arguments (addresses used as sentinels)
  Separate dependences/transfers
  Inlined/outlined pragmas (see the fragment below)
  Nesting
  SMP/GPU/Cluster
  No renaming
  Several schedulers: "simple" locality-aware scheduler, …
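For concreteness, a small OmpSs-style fragment (my sketch, not from the slides) using the inlined pragma form with in/inout directionality clauses; the exact clause spelling accepted by a given OmpSs release is an assumption to check against its manual.

/* Hypothetical OmpSs-style fragment: inlined task pragma with directionality. */
#define BS 256

void saxpy_blocks(int nblocks, float a,
                  float x[nblocks][BS], float y[nblocks][BS])
{
    for (int b = 0; b < nblocks; b++) {
        /* One task per block; dependences come only from the in/inout clauses. */
        #pragma omp task in(x[b]) inout(y[b])
        for (int j = 0; j < BS; j++)
            y[b][j] += a * x[b][j];
    }
    #pragma omp taskwait      /* wait for all block tasks before returning */
}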
Multidisciplinary top-down approach

Investigate solutions to these and other problems: applications and algorithms, programming models, performance analysis and prediction tools, load balancing, power, processor, interconnect and node.

[Chart: computer-center power projections, 2005-2011 — power (MW, 0 to 90) split into computers and cooling, with annual cost annotations rising from $3M to $9M, $17M, $23M and $31M.]
Green/Top 500, November 2011

Green500 | Top500 | Mflops/W | Power (kW) | Site | Computer
1   | 64  | 2026.48 | 85.12    | IBM - Rochester | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
2   | 65  | 2026.48 | 85.12    | IBM Thomas J. Watson Research Center | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
3   | 29  | 1996.09 | 170.25   | IBM - Rochester | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
4   | 17  | 1988.56 | 340.5    | DOE/NNSA/LLNL | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
5   | 284 | 1689.86 | 38.67    | IBM Thomas J. Watson Research Center | NNSA/SC Blue Gene/Q Prototype 1
6   | 328 | 1378.32 | 47.05    | Nagasaki University | DEGIMA Cluster, Intel i5, ATI Radeon GPU, Infiniband QDR
7   | 114 | 1266.26 | 81.5     | Barcelona Supercomputing Center | Bullx B505, Xeon E5649 6C 2.53 GHz, Infiniband QDR, NVIDIA 2090
8   | 102 | 1010.11 | 108.8    | TGCC / GENCI | Curie Hybrid Nodes - Bullx B505, Xeon E5640 2.67 GHz, Infiniband QDR
9   | 21  | 963.70  | 515.2    | Institute of Process Engineering, Chinese Academy of Sciences | Mole-8.5 Cluster, Xeon X5520 4C 2.27 GHz, Infiniband QDR, NVIDIA 2050
10  | 5   | 958.35  | 1243.8   | GSIC Center, Tokyo Institute of Technology | HP ProLiant SL390s G7, Xeon 6C X5670, NVIDIA GPU, Linux/Windows
11  | 96  | 928.96  | 126.27   | Virginia Tech | SuperServer 2026GT-TRF, Xeon E5645 6C 2.40 GHz, Infiniband QDR, NVIDIA 2050
12  | 111 | 901.54  | 117.91   | Georgia Institute of Technology | HP ProLiant SL390s G7, Xeon 6C X5660 2.8 GHz, NVIDIA Fermi, Infiniband QDR
13  | 82  | 891.88  | 160      | CINECA / SCS - SuperComputing Solution | iDataPlex DX360M3, Xeon E5645 6C 2.40 GHz, Infiniband QDR, NVIDIA 2070
14  | 256 | 891.87  | 76.25    | Forschungszentrum Juelich (FZJ) | iDataPlex DX360M3, Xeon X5650 6C 2.66 GHz, Infiniband QDR, NVIDIA 2070
15  | 61  | 889.19  | 198.72   | Sandia National Laboratories | Xtreme-X GreenBlade GB512X, Xeon E5 (Sandy Bridge-EP) 8C 2.60 GHz, Infiniband QDR
32  | 1   | 830.18  | 12659.89 | RIKEN Advanced Institute for Computational Science (AICS) | K computer, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect
47  | 2   | 635.15  | 4040     | National Supercomputing Center in Tianjin | NUDT YH MPP, Xeon X5670 6C 2.93 GHz, NVIDIA 2050
149 | 3   | 253.09  | 6950     | DOE/SC/Oak Ridge National Laboratory | Cray XT5-HE, Opteron 6-core 2.6 GHz
56  | 4   | 492.64  | 2580     | National Supercomputing Centre in Shenzhen (NSCS) | Dawning TC3600 Blade System, Xeon X5650 6C 2.66 GHz, Infiniband QDR, NVIDIA 2050
Green/Top 500, November 2011 — Mflops/watt vs. Top500 rank

Call-outs on the chart (conversion sketched below):
- IBM and NNSA, Blue Gene/Q:              2026.48 Mflops/watt → 493 MW per exaflop
- NNSA/SC Blue Gene/Q Prototype:          1689.86 Mflops/watt → 592 MW per exaflop
- Nagasaki U., Intel i5, ATI Radeon GPU:  1378.32 Mflops/watt → 726 MW per exaflop
- BSC, Xeon 6C, NVIDIA 2090 GPU:          1266.26 Mflops/watt
Efficiency bands: >1 Gflop/watt, 500-1000 Mflops/watt, 100-500 Mflops/watt
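The MW-per-exaflop figures follow directly from the efficiencies: the power needed to sustain one exaflop is 10^6 divided by the Mflops/watt value. A small sketch (mine, not from the slides) of the conversion:

/* Convert Green500 efficiency (Mflops/watt) into MW needed for 1 exaflop:
   power_W = 1e18 flop/s / (eff * 1e6 flop/s per W)  =>  power_MW = 1e6 / eff. */
#include <stdio.h>

int main(void)
{
    const double eff[] = { 2026.48, 1689.86, 1378.32, 1266.26 };   /* Mflops/watt */
    for (int i = 0; i < 4; i++)
        printf("%8.2f Mflops/watt -> %4.0f MW per exaflop\n", eff[i], 1e6 / eff[i]);
    return 0;   /* prints about 493, 592, 726 and 790 MW */
}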