“Future Exascale Supercomputers”




        Mexico DF, November, 2011                                                      Prof. Mateo Valero
                                                                                       Director




Top10
Rank  Site                                          Computer                                                   Procs            Rmax (Gflop/s)  Rpeak (Gflop/s)
  1   RIKEN Advanced Institute for                  Fujitsu K computer, SPARC64 VIIIfx 2.0 GHz,                705024           10510000        11280384
      Computational Science (AICS)                  Tofu interconnect
  2   Tianjin, China                                Xeon X5670 + NVIDIA                                        186368 / 100352  2566000         4701000
  3   Oak Ridge Nat. Lab.                           Cray XT5, 6 cores                                          224162           1759000         2331000
  4   Shenzhen, China                               Xeon X5670 + NVIDIA                                        120640           1271000         2984300
  5   GSIC Center, Tokyo                            Xeon X5670 + NVIDIA                                        73278 / 56994    1192000         2287630
  6   DOE/NNSA/LANL/SNL                             Cray XE6, 8-core 2.4 GHz                                   142272           1110000         1365811
  7   NASA/Ames Research Center/NAS                 SGI Altix ICE 8200EX/8400EX, Xeon HT QC 3.0 /              111104           1088000         1315328
                                                    Xeon 5570/5670 2.93 GHz, Infiniband
  8   DOE/SC/LBNL/NERSC                             Cray XE6, 12 cores                                         153408           1054000         1288627
  9   Commissariat a l'Energie Atomique (CEA)       Bull bullx super-node S6010/S6030                          138368           1050000         1254550
 10   DOE/NNSA/LANL                                 QS22/LS21 Cluster, PowerXCell 8i / Opteron, Infiniband     122400           1042000         1375776




Mexico DF, November, 2011                                                          2




Parallel Systems


                      Interconnect (Myrinet, IB, GE, 3D torus, tree, …)

     A parallel system is a collection of nodes joined by the interconnect. Each node is an
     SMP: memory plus one or more multicore chips attached through an interconnection
     network (IN). The multicore itself can be:
         - a homogeneous multicore (BlueGene/Q chip)
         - a heterogeneous multicore, pairing general-purpose cores with accelerators:
               general-purpose accelerator (e.g. Cell), GPU, FPGA, ASIC (e.g. Anton for MD)
     Within the chip, cores communicate over a network-on-chip (bus, ring, direct, …).



 Mexico DF, November, 2011                                                        3




Riken’s Fujitsu K with SPARC64 VIIIfx

 ●    Homogeneous architecture:
       ● Compute node:
               ● One SPARC64 VIIIfx processor
                 2 GHz, 8 cores per chip
                 128 Gigaflops per chip

               ● 16 GB memory per node


 ●    Number of nodes and cores:
       ● 864 cabinets * 102 compute nodes/cabinet * (1 socket * 8 CPU cores) = 705024
         cores …. 50 by 60 meters

 ●    Peak performance (DP):
       ● 705024 cores * 16 GFLOPS per core = 11280384 GFLOPS (11.28 PFLOPS)

 ●    Linpack: 10510 TFLOPS (10.51 PFLOPS), 93% efficiency. Matrix: more than 13725120 rows !!!
              29 hours and 28 minutes

 ●    Power consumption 12.6 MWatt, 0.8 Gigaflops/W (these figures are checked in the sketch below)
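
 These figures can be cross-checked with a few lines of arithmetic (a minimal sketch; the
 constants are simply the numbers quoted above):

    #include <stdio.h>

    int main(void) {
        const double cores           = 705024.0;    /* 864 cabinets * 102 nodes * 8 cores */
        const double gflops_per_core = 16.0;        /* SPARC64 VIIIfx, 2 GHz, DP          */
        const double rmax_gflops     = 10510000.0;  /* Linpack result, in GFLOPS          */
        const double power_mw        = 12.6;        /* total power, MW                    */

        double rpeak_gflops = cores * gflops_per_core;           /* 11,280,384 GFLOPS = 11.28 PFLOPS */
        double efficiency   = rmax_gflops / rpeak_gflops;        /* ~0.93                            */
        double gflops_per_w = rmax_gflops / (power_mw * 1.0e6);  /* ~0.83 GFLOPS/W                   */

        printf("Rpeak      = %.0f GFLOPS (%.2f PFLOPS)\n", rpeak_gflops, rpeak_gflops / 1.0e6);
        printf("Efficiency = %.1f %%\n", efficiency * 100.0);
        printf("Energy eff = %.2f GFLOPS/W\n", gflops_per_w);
        return 0;
    }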

 Mexico DF, November, 2011                                                        4




Looking at the Gordon Bell Prize


  ● 1 GFlop/s; 1988; Cray Y-MP; 8 Processors
        ● Static finite element analysis
  ● 1 TFlop/s; 1998; Cray T3E; 1024 Processors
        ● Modeling of metallic magnet atoms, using a
          variation of the locally self-consistent multiple
          scattering method.
  ● 1 PFlop/s; 2008; Cray XT5; 1.5x10^5 Processors
        ● Superconductive materials


  ● 1 EFlop/s; ~2018; ?; 1x10^8 Processors?? (10^9 threads); see the sketch below
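
  The same milestones show that per-processor performance improves only modestly while total
  performance grows by nine orders of magnitude, so nearly all of the gain has to come from
  concurrency. A minimal sketch of that arithmetic (the exaflop row uses the projected numbers
  above):

    #include <stdio.h>

    int main(void) {
        /* Gordon Bell milestones quoted above: total performance and processor count */
        const char  *year[]  = { "1988", "1998", "2008", "~2018" };
        const double flops[] = { 1.0e9,  1.0e12, 1.0e15, 1.0e18 };
        const double procs[] = { 8.0,    1024.0, 1.5e5,  1.0e8  };

        for (int i = 0; i < 4; i++) {
            /* per-processor share grows by less than two orders of magnitude,
               while the totals grow by 10^9 */
            printf("%-6s %8.1e flop/s on %8.1e procs -> %6.2f Gflop/s per proc\n",
                   year[i], flops[i], procs[i], flops[i] / procs[i] / 1.0e9);
        }
        return 0;
    }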


              Jack Dongarra
 Mexico DF, November, 2011                                    5




 Mexico DF, November, 2011                                    6




Nvidia GPU instruction execution




 [Figure: instruction issue timeline across multiprocessors MP1-MP4, showing instruction1,
  instruction2, a long-latency instruction3 and instruction4 issued on each multiprocessor.]

 SBAC-PAD, Vitoria, October 28th, 2011
 Mexico DF, November, 2011                                                               7




Potential System Architecture
for Exascale Supercomputers

   System attributes        2010         "2015"                    "2018"                    Difference 2010-18
   System peak              2 Pflop/s    200 Pflop/s               1 Eflop/s                 O(1000)
   Power                    6 MW         15 MW                     ~20 MW
   System memory            0.3 PB       5 PB                      32-64 PB                  O(100)
   Node performance         125 GF       0.5 TF or 7 TF            1 TF or 10 TF             O(10) - O(100)
   Node memory BW           25 GB/s      0.1 TB/s or 1 TB/s        0.4 TB/s or 4 TB/s        O(100)
   Node concurrency         12           O(100) or O(1,000)        O(1,000) or O(10,000)     O(100) - O(1000)
   Total concurrency        225,000      O(10^8)                   O(10^9)                   O(10,000)
   Total node
   interconnect BW          1.5 GB/s     20 GB/s                   200 GB/s                  O(100)
   MTTI                     days         O(1 day)                  O(1 day)                  -O(10)
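
   As a quick consistency check on the projected total concurrency, dividing system peak by
   node performance gives the implied node count, and multiplying by the per-node concurrency
   recovers the table's total (a minimal sketch using the round numbers of the high
   node-performance "2018" design point):

    #include <stdio.h>

    int main(void) {
        const double system_peak_flops = 1.0e18;   /* 1 Eflop/s                       */
        const double node_perf_flops   = 10.0e12;  /* 10 TF per node                  */
        const double node_concurrency  = 10000.0;  /* O(10,000) threads per node      */

        double nodes             = system_peak_flops / node_perf_flops;  /* ~1e5 nodes          */
        double total_concurrency = nodes * node_concurrency;             /* ~1e9, i.e. O(10^9)  */

        printf("nodes             ~ %.0e\n", nodes);
        printf("total concurrency ~ %.0e\n", total_concurrency);
        return 0;
    }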
     EESI Final Conference
     10-11 Oct. 2011, Barcelona
 Mexico DF, November, 2011                                                              8




2. Towards faster airplane design

   Boeing: Number of wing prototypes prepared for wind-tunnel testing


              Date                  1980            1995            2005

            Airplane             B757/B767          B777            B787

      # wing prototypes              77              11              11   (a further drop to 5 is expected; see note below)




     Plateau due to RANS limitations.
     Further decrease expected from LES with ExaFlop

     EESI Final Conference
     10-11 Oct. 2011, Barcelona
 Mexico DF, November, 2011                                         9




Airbus 380 design




 Mexico DF, November, 2011                                         10




2. Towards faster airplane design


    Airbus: "More simulation, less tests
             More                  tests“
    From                          A380       to    A350




    - 40% less wind-tunnel days
    - 25% saving in aerodynamics development time
    - 20% saving on wind-tunnel tests cost
    th k t HPC
    thanks to HPC-enabled CFD runs, especially i hi h
                       bl d                i ll in high-speed regime, providing
                                                            d    i        idi
    even better representation of aerodynamics phenomenon turned into better
    design choices.

      Acknowledgements: E. CHAPUT (AIRBUS)



     EESI Final Conference
     10-11 Oct. 2011, Barcelona
 Mexico DF, November, 2011                                         11




     2. Oil industry




     EESI Final Conference
     10-11 Oct. 2011, Barcelona
 Mexico DF, November, 2011                                         12




ITER design




                                                 TOKAMAK (JET)

Mexico DF, November, 2011                                        13




    Fundamental Sciences




                            EESI Final Conference
Mexico DF, November, 2011   10-11 Oct. 2011, Barcelona           14




Materials: a new path to competitiveness

     On-demand materials for effective commercial use
     Conductivity: energy loss reduction
     Lifetime: corrosion protection, e.g. chrome
     Fissures: safety insurance from molecular design
     Optimisation of materials / lubricants:
     less friction, longer lifetime, less energy losses
 Industrial need to speed up simulation from months to days




                                                                    All atom        Multi-scale
                                                   Exascale enables simulation of larger
                                                    and more realistic systems and devices
    EESI Final Conference, 10-11
    Oct. 2011, Barcelona
Mexico DF, November, 2011                                                      15




    Life Sciences and Health




                                   Population
                                   Organ
                                   Tissue
                                   Cell
                                   Macromolecule
                                   Small Molecule
                                   Atom
    EESI Final Conference, 10-11
    Oct. 2011, Barcelona
Mexico DF, November, 2011                                                      16




Supercomputing, theory and experimentation (Supercomputación, teoría y experimentación)




 Mexico DF, November, 2011               17
                                              Courtesy of IBM




Supercomputing, theory and experimentation




 Mexico DF, November, 2011               18
                                              Courtesy of IBM




Holistic approach … towards exaflop


     Applications:                  computational complexity, asynchronous algorithms
     Job scheduling:                moldability, resource awareness, load balancing, user satisfaction
     Programming model:             address space, dependencies, work generation
     Run time:                      locality optimization, concurrency extraction
     Interconnection:               topology and routing, external contention
     Processor/node architecture:   NIC design, run-time support, HW counters, memory subsystem, core structure



 Mexico DF, November, 2011                                                                          19




10+ Pflop/s systems planned

  ● Fujitsu Kei
         ● 80,000 8-core Sparc64 VIIIfx processors at 2 GHz
           (16 Gflops/core, 58 watts, 3.2 Gflops/watt),
           16 GB/node, 1 PB memory, 6D mesh-torus,
           10 Pflops




  ● Cray's Titan at DOE, Oak Ridge National Laboratory
         ● Hybrid system with Nvidia GPUs, 1 Pflop/s in 2011,
           20 Pflop/s in 2012, late 2011 prototype
         ● $100 million

 Mexico DF, November, 2011                                                                          20




10+ Pflop/s systems planned

  ● IBM Blue Waters at Illinois
         ● 40,000 8-core Power7, 1 PB memory,
           18 PB disk, 500 PB archival storage,
           10 Pflop/s, 2012, $200 million


  ● IBM Blue Gene/Q systems:
         ● Mira to DOE, Argonne National Lab with 49,000 nodes,
           16-core Power A2 processor (1.6-3 GHz),
           750 K cores, 750 TB memory, 70 PB disk,
           5D torus, 10 Pflop/s
         ● Sequoia to Lawrence Livermore National Lab with
           98304 nodes (96 racks), 16-core A2 processor,
           1.6 M cores (1 GB/core), 1.6 Petabytes memory, 6 Mwatt,
           3 Gflops/watt, 20 Pflop/s, 2012

 Mexico DF, November, 2011                                 21




Japan Plan for Exascale

                       Heterogeneous, Distributed Memory
                       GigaHz KiloCore MegaNode system

        2012                       2015                2018-2020




      K Machine                   10K Machine         100K Machine
       10 PF                        100 PF               ExaFlops

   Feasibility Study (2012-2013)          Exascale Project (2014-2020)
   Post-Petascale Projects

 Mexico DF, November, 2011                                 22




Mexico DF, November, 2011   Thanks to S. Borkar, Intel   23




Mexico DF, November, 2011   Thanks to S. Borkar, Intel   24




Nvidia: Chip for the Exaflop
Computer




 Mexico DF, November, 2011   Thanks Bill Dally   25




Nvidia: Node for the Exaflop
Computer




                             Thanks Bill Dally
 Mexico DF, November, 2011                       26




Exascale Supercomputer




 Mexico DF, November, 2011   Thanks Bill Dally   27




BSC-CNS: International Initiatives (IESP)




          Improve the world’s simulation and modeling
          capability by improving the coordination and
          development of the HPC software environment

            Build an international plan for developing
            the next generation of open source software
            for scientific high-performance computing

 Mexico DF, November, 2011                       28




Back to Babel?

  Book of Genesis:

 "Now the whole earth had one language and the same words" …

 …"Come, let us make bricks, and burn them thoroughly."…

 …"Come, let us build ourselves a city, and a tower with its top in the heavens,
 and let us make a name for ourselves"…

 And the LORD said, "Look, they are one people, and they have all one language; and
 this is only the beginning of what they will do; nothing that they propose to do will now be
 impossible for them. Come, let us go down, and confuse their language there, so that they will
 not understand one another's speech."

  The computer age: it began with one shared language, Fortran & MPI, and has since scattered
  into a Babel of models: Cilk++, Fortress, X10, CUDA, Sisal, HPF, StarSs, RapidMind, Sequoia,
  CAF, ALF, OpenMP, UPC, SDK, Chapel, MPI, …


 Mexico DF, November, 2011              Thanks to Jesus Labarta                          29




You will see… 400 years from now people will go crazy

                                 New generation of programmers


                                                                         Parallel
             Multicore/manycore                                          Programming
             Architectures




                                                                                       New Usage models


                                    Source: Picasso -- Don Quixote


                             Dr. Avi Mendelson (Microsoft). Keynote at ISC-2007


 Mexico DF, November, 2011                                                               30




Different models of computation …….

  ● The dream of automatic parallelizing compilers has not come true …
  ● … so the programmer needs to express opportunities for parallel execution
    in the application


           SPMD              OpenMP 2.5   Nested fork-join   OpenMP 3.0                       DAG – data flow




                                                                                          Huge lookahead & reuse …
                                                                                           Latency / EBW / Scheduling

  ● And … asynchrony (MPI and OpenMP are too synchronous):
        ● Collectives/barriers multiply the effects of microscopic load
          imbalance, OS noise, … (see the sketch below)
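
  To make that last point concrete, a toy model helps (a minimal sketch with made-up noise
  parameters, not a measurement): with a barrier at the end of every iteration each step costs
  the maximum over all ranks, so with many ranks some rank is almost always delayed and the
  whole machine waits; with asynchronous, dataflow-style execution only the average delay
  matters.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const int    P = 100000, iters = 100;   /* hypothetical rank and iteration counts */
        const double p_noise = 0.001;           /* chance a rank is hit by "OS noise"     */
        const double delay   = 0.5;             /* extra cost of one noise event          */
        srand(42);

        double sync_time = 0.0, async_time = 0.0;
        for (int it = 0; it < iters; it++) {
            double max_t = 0.0, sum_t = 0.0;
            for (int r = 0; r < P; r++) {
                double t = 1.0;                                   /* nominal work per rank   */
                if (rand() / (double)RAND_MAX < p_noise) t += delay;
                if (t > max_t) max_t = t;
                sum_t += t;
            }
            sync_time  += max_t;      /* barrier: everyone waits for the slowest rank */
            async_time += sum_t / P;  /* asynchrony: the noise averages out           */
        }
        printf("with barriers : %.1f time units\n", sync_time);   /* ~150 */
        printf("asynchronous  : %.1f time units\n", async_time);  /* ~100 */
        return 0;
    }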
 Mexico DF, November, 2011                                                                  31




StarSs: … generates task graph at run time …
#pragma css task input(A, B) output(C)
void vadd3 (float A[BS], float B[BS],
            float C[BS]);
#pragma css task input(sum, A) output(B)
void scale_add (float sum, float A[BS],
                float B[BS]);
#pragma css task input(A) inout(sum)
void accum (float A[BS], float *sum);


for (i=0; i<N; i+=BS)               // C=A+B
    vadd3 (&A[i], &B[i], &C[i]);
...
for (i=0; i<N; i+=BS)               // sum(C[i])
    accum (&C[i], &sum);
...
for (i=0; i<N; i+=BS)               // B=sum*E
    scale_add (sum, &E[i], &B[i]);
...
for (i=0; i<N; i+=BS)               // A=C+D
    vadd3 (&C[i], &D[i], &A[i]);
...
for (i=0; i<N; i+=BS)               // E=C+F
    vadd3 (&C[i], &F[i], &E[i]);

[Figure: Task Graph Generation - the dependence graph built at run time from the code above, with tasks numbered 1-20.]
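
 For readers more familiar with OpenMP, the same dataflow idea can be sketched with task
 dependences. This is only an illustrative analogue (OpenMP gained depend clauses in version
 4.0, after this talk; it is not the StarSs/OmpSs toolchain itself), covering the first three
 loops of the example; each block is identified by its first element, which suffices because
 blocks do not overlap.

    #include <stdio.h>

    #define N  1024
    #define BS 256

    float A[N], B[N], C[N], E[N], sum = 0.0f;

    void vadd3(const float *a, const float *b, float *c) {
        for (int j = 0; j < BS; j++) c[j] = a[j] + b[j];
    }
    void accum(const float *c, float *s) {
        for (int j = 0; j < BS; j++) *s += c[j];
    }
    void scale_add(float s, const float *e, float *b) {
        for (int j = 0; j < BS; j++) b[j] = s * e[j];
    }

    int main(void) {
        #pragma omp parallel
        #pragma omp single
        {
            for (int i = 0; i < N; i += BS) {                 /* C = A + B   */
                #pragma omp task depend(in: A[i], B[i]) depend(out: C[i])
                vadd3(&A[i], &B[i], &C[i]);
            }
            for (int i = 0; i < N; i += BS) {                 /* sum(C[i])   */
                #pragma omp task depend(in: C[i]) depend(inout: sum)
                accum(&C[i], &sum);
            }
            for (int i = 0; i < N; i += BS) {                 /* B = sum * E */
                #pragma omp task depend(in: sum) depend(in: E[i]) depend(out: B[i])
                scale_add(sum, &E[i], &B[i]);
            }
        }   /* implicit barrier: all tasks have completed here */
        printf("sum = %f\n", sum);
        return 0;
    }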




 Mexico DF, November, 2011                                                                  32




StarSs: … and executes as efficient as possible …
#pragma css task input(A, B) output(C)
void vadd3 (float A[BS], float B[BS],
            float C[BS]);
#pragma css task input(sum, A) output(B)
void scale_add (float sum, float A[BS],
                float B[BS]);
#pragma css task input(A) inout(sum)
void accum (float A[BS], float *sum);


for (i=0; i<N; i+=BS)               // C=A+B
    vadd3 (&A[i], &B[i], &C[i]);
...
for (i=0; i<N; i+=BS)               // sum(C[i])
    accum (&C[i], &sum);
...
for (i=0; i<N; i+=BS)               // B=sum*E
    scale_add (sum, &E[i], &B[i]);
...
for (i=0; i<N; i+=BS)               // A=C+D
    vadd3 (&C[i], &D[i], &A[i]);
...
for (i=0; i<N; i+=BS)               // E=C+F
    vadd3 (&C[i], &F[i], &E[i]);

[Figure: Task Graph Execution - the same graph scheduled onto the available resources; the annotations indicate the order chosen by the runtime.]




 Mexico DF, November, 2011                                                33




StarSs: … benefiting from data access information

  ● Flat global address space seen
    by programmer
  ● Flexibility to dynamically traverse
    dataflow graph “optimizing”
     ● Concurrency. Critical path
     ● Memory access

  ● Opportunities for
     ● Prefetch
     ● Reuse
     ● Eliminate antidependences (rename) - see the sketch below
     ● Replication management
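
  For instance, renaming removes write-after-read (anti) dependences by giving the writer a
  fresh copy of the data. A minimal hand-written sketch of the idea (an illustration only, not
  the StarSs runtime itself):

    #include <stdio.h>

    enum { BS = 4 };

    void reader(const float B[BS], float *sum) {       /* reads B  */
        for (int i = 0; i < BS; i++) *sum += B[i];
    }
    void writer(float B[BS], float value) {            /* writes B */
        for (int i = 0; i < BS; i++) B[i] = value;
    }

    int main(void) {
        float B[BS] = {1, 2, 3, 4};
        float B2[BS];                 /* renamed instance of B, created by the runtime */
        float sum = 0.0f;

        /* Without renaming, writer(B, ...) must wait until reader(B, ...) is done.
           With renaming, the writer targets the fresh buffer B2 and later readers are
           redirected to B2, so the two tasks are independent and can run concurrently. */
        reader(B, &sum);
        writer(B2, 9.0f);

        printf("sum = %.1f, B2[0] = %.1f\n", sum, B2[0]);
        return 0;
    }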




 Mexico DF, November, 2011                                                34




StarSs: Enabler for exascale
         Can exploit very unstructured parallelism
              Not just loop/data parallelism
              Easy to change structure
         Supports large amounts of lookahead
              Not stalling for dependence satisfaction
         Allows for locality optimizations to tolerate latency
              Overlap data transfers, prefetch
              Reuse
         Nicely hybridizes into MPI/StarSs
              Propagates the node-level dataflow characteristics to large scale
              Overlap communication and computation
              A chance against Amdahl's law

         Support for heterogeneity
              Any # and combination of CPUs, GPUs
              Including autotuning
         Malleability: decouple program from resources
              Allowing dynamic resource allocation and load balance
              Tolerate noise
         Data-flow; asynchrony
              The potential is there; can blame the runtime
         Compatible with proprietary low-level technologies

 Mexico DF, November, 2011                                                       35




StarSs: history/strategy/versions

Basic SMPSs
     Must provide directionality for every argument
     Contiguous, non partially overlapped
     Renaming
     Several schedulers (priority, locality, …)
     No nesting
     C/Fortran
     MPI/SMPSs optims.

 SMPSs regions
     C, no Fortran
     Must provide directionality for every argument
     Overlapping & strided
     Reshaping strided accesses
     Priority and locality aware scheduling

 OMPSs
     C/C++, Fortran under development
     OpenMP compatibility (~)
     Dependences based only on args. with directionality
     Contiguous args. (address used as sentinels)
     Separate dependences/transfers
     Inlined/outlined pragmas
     Nesting
     SMP/GPU/Cluster
     No renaming
     Several schedulers: "simple" locality-aware sched., …

 Mexico DF, November, 2011                                                       36




Multidisciplinary top-down approach



     A top-down, multidisciplinary effort spanning applications and algorithms, programming
     models, performance analysis and prediction tools, the interconnect, and the processor
     and node, to investigate solutions to power, load balancing and other problems.

 [Figure: Computer Center Power Projections, 2005-2011 - stacked bars of computer and cooling
  power (MW), with cost annotations rising from $3M through $9M, $17M and $23M to $31M.]

 Mexico DF, November, 2011                                                                                  37




 Mexico DF, November, 2011                                                                                  38




Green/Top 500 November 2011
Green500  Top500  Mflops/W  Power (kW)  Site                                                Computer
rank      rank
    1        64    2026.48       85.12  IBM - Rochester                                     BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
    2        65    2026.48       85.12  IBM Thomas J. Watson Research Center                BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
    3        29    1996.09      170.25  IBM - Rochester                                     BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
    4        17    1988.56      340.50  DOE/NNSA/LLNL                                       BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
    5       284    1689.86       38.67  IBM Thomas J. Watson Research Center                NNSA/SC Blue Gene/Q Prototype 1
    6       328    1378.32       47.05  Nagasaki University                                 DEGIMA Cluster, Intel i5, ATI Radeon GPU, Infiniband QDR
    7       114    1266.26       81.50  Barcelona Supercomputing Center                     Bullx B505, Xeon E5649 6C 2.53 GHz, Infiniband QDR, NVIDIA 2090
    8       102    1010.11      108.80  TGCC / GENCI                                        Curie Hybrid Nodes - Bullx B505, Xeon E5640 2.67 GHz, Infiniband QDR
    9        21     963.70      515.20  Institute of Process Engineering,                   Mole-8.5 Cluster, Xeon X5520 4C 2.27 GHz, Infiniband QDR, NVIDIA 2050
                                        Chinese Academy of Sciences
   10         5     958.35     1243.80  GSIC Center, Tokyo Institute of Technology          HP ProLiant SL390s G7 Xeon 6C X5670, Nvidia GPU, Linux/Windows
   11        96     928.96      126.27  Virginia Tech                                       SuperServer 2026GT-TRF, Xeon E5645 6C 2.40 GHz, Infiniband QDR, NVIDIA 2050
   12       111     901.54      117.91  Georgia Institute of Technology                     HP ProLiant SL390s G7 Xeon 6C X5660 2.8 GHz, nVidia Fermi, Infiniband QDR
   13        82     891.88      160.00  CINECA / SCS - SuperComputing Solution              iDataPlex DX360M3, Xeon E5645 6C 2.40 GHz, Infiniband QDR, NVIDIA 2070
   14       256     891.87       76.25  Forschungszentrum Juelich (FZJ)                     iDataPlex DX360M3, Xeon X5650 6C 2.66 GHz, Infiniband QDR, NVIDIA 2070
   15        61     889.19      198.72  Sandia National Laboratories                        Xtreme-X GreenBlade GB512X, Xeon E5 (Sandy Bridge-EP) 8C 2.60 GHz, Infiniband QDR
   32         1     830.18    12659.89  RIKEN Advanced Institute for Computational          K computer, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect
                                        Science (AICS)
   47         2     635.15     4040.00  National Supercomputing Center in Tianjin           NUDT YH MPP, Xeon X5670 6C 2.93 GHz, NVIDIA 2050
  149         3     253.09     6950.00  DOE/SC/Oak Ridge National Laboratory                Cray XT5-HE Opteron 6-core 2.6 GHz
   56         4     492.64     2580.00  National Supercomputing Centre in Shenzhen (NSCS)   Dawning TC3600 Blade System, Xeon X5650 6C 2.66 GHz, Infiniband QDR, NVIDIA 2050


  Mexico DF, November, 2011
  SBAC-PAD, Vitoria, October 28th, 2011                                                                39




Green/Top 500 November 2011
 [Figure: Mflops/watt versus Top500 rank for selected November 2011 systems - BSC (Xeon 6C +
  NVIDIA 2090 GPU), Nagasaki U. (Intel i5 + ATI Radeon GPU) and the IBM/NNSA Blue Gene/Q
  machines - with bands marking >1 GF/watt, 500-1000 MF/watt and 100-500 MF/watt.]

    Mflops/watt          Mwatts/Exaflop
      2026.48                  493
      1689.86                  592
      1378.32                  726
      1266.26                  790
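
   The Mwatts/Exaflop column is just a unit conversion of the measured efficiency: one
   Exaflop/s is 10^12 Mflop/s, so the power in MW is 10^6 divided by the Mflops/watt figure.
   A minimal sketch of the arithmetic:

    #include <stdio.h>

    int main(void) {
        /* Green500 efficiencies quoted above, in Mflops per watt */
        const double mflops_per_watt[] = { 2026.48, 1689.86, 1378.32, 1266.26 };
        const int n = sizeof mflops_per_watt / sizeof mflops_per_watt[0];

        for (int i = 0; i < n; i++) {
            /* 1 Exaflop/s = 1e12 Mflop/s, so power in MW = 1e6 / (Mflops/W) */
            double mw_per_exaflop = 1.0e6 / mflops_per_watt[i];
            printf("%8.2f Mflops/W -> %6.0f MW per Exaflop\n",
                   mflops_per_watt[i], mw_per_exaflop);
        }
        return 0;   /* prints ~493, ~592, ~726 and ~790 MW */
    }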

  Mexico DF, November, 2011                                                                                           40





More Related Content

What's hot

LAMMPS Molecular Dynamics on GPU
LAMMPS Molecular Dynamics on GPULAMMPS Molecular Dynamics on GPU
LAMMPS Molecular Dynamics on GPUDevang Sachdev
 
20121205 open stack_accelerating_science_v3
20121205 open stack_accelerating_science_v320121205 open stack_accelerating_science_v3
20121205 open stack_accelerating_science_v3Tim Bell
 
Cybertron pc slayer ii gaming pc (blue)
Cybertron pc slayer ii gaming pc (blue)Cybertron pc slayer ii gaming pc (blue)
Cybertron pc slayer ii gaming pc (blue)LilianaSuri
 
MSI N480GTX Lightning Infokit
MSI N480GTX Lightning InfokitMSI N480GTX Lightning Infokit
MSI N480GTX Lightning InfokitMSI
 
NAMD Molecular Dynamics on GPU
NAMD Molecular Dynamics on GPUNAMD Molecular Dynamics on GPU
NAMD Molecular Dynamics on GPUDevang Sachdev
 
Accelerating science with Puppet
Accelerating science with PuppetAccelerating science with Puppet
Accelerating science with PuppetTim Bell
 
Accelerating Science with OpenStack.pptx
Accelerating Science with OpenStack.pptxAccelerating Science with OpenStack.pptx
Accelerating Science with OpenStack.pptxOpenStack Foundation
 
20121017 OpenStack CERN Accelerating Science
20121017 OpenStack CERN Accelerating Science20121017 OpenStack CERN Accelerating Science
20121017 OpenStack CERN Accelerating ScienceTim Bell
 
Placas base evolucion
Placas base evolucionPlacas base evolucion
Placas base evoluciongatarufo
 
PowerColor PCS+ Vortex II sales kit
PowerColor PCS+ Vortex II sales kitPowerColor PCS+ Vortex II sales kit
PowerColor PCS+ Vortex II sales kitPowerColor
 
Real-time Systems Design (part I)
Real-time Systems Design (part I)Real-time Systems Design (part I)
Real-time Systems Design (part I)Rob Williams
 

What's hot (18)

LAMMPS Molecular Dynamics on GPU
LAMMPS Molecular Dynamics on GPULAMMPS Molecular Dynamics on GPU
LAMMPS Molecular Dynamics on GPU
 
20121205 open stack_accelerating_science_v3
20121205 open stack_accelerating_science_v320121205 open stack_accelerating_science_v3
20121205 open stack_accelerating_science_v3
 
Sponge v2
Sponge v2Sponge v2
Sponge v2
 
Cybertron pc slayer ii gaming pc (blue)
Cybertron pc slayer ii gaming pc (blue)Cybertron pc slayer ii gaming pc (blue)
Cybertron pc slayer ii gaming pc (blue)
 
MSI N480GTX Lightning Infokit
MSI N480GTX Lightning InfokitMSI N480GTX Lightning Infokit
MSI N480GTX Lightning Infokit
 
NAMD Molecular Dynamics on GPU
NAMD Molecular Dynamics on GPUNAMD Molecular Dynamics on GPU
NAMD Molecular Dynamics on GPU
 
Accelerating science with Puppet
Accelerating science with PuppetAccelerating science with Puppet
Accelerating science with Puppet
 
Accelerating Science with OpenStack.pptx
Accelerating Science with OpenStack.pptxAccelerating Science with OpenStack.pptx
Accelerating Science with OpenStack.pptx
 
20121017 OpenStack CERN Accelerating Science
20121017 OpenStack CERN Accelerating Science20121017 OpenStack CERN Accelerating Science
20121017 OpenStack CERN Accelerating Science
 
Vigor Ex
Vigor ExVigor Ex
Vigor Ex
 
Brochure NAS LG
Brochure NAS LGBrochure NAS LG
Brochure NAS LG
 
Placas base evolucion
Placas base evolucionPlacas base evolucion
Placas base evolucion
 
PowerColor PCS+ Vortex II sales kit
PowerColor PCS+ Vortex II sales kitPowerColor PCS+ Vortex II sales kit
PowerColor PCS+ Vortex II sales kit
 
Sahara Net Slate
Sahara Net SlateSahara Net Slate
Sahara Net Slate
 
Real-time Systems Design (part I)
Real-time Systems Design (part I)Real-time Systems Design (part I)
Real-time Systems Design (part I)
 
PostgreSQL 8.3 Update
PostgreSQL 8.3 UpdatePostgreSQL 8.3 Update
PostgreSQL 8.3 Update
 
Tao zhang
Tao zhangTao zhang
Tao zhang
 
ppt
pptppt
ppt
 

Viewers also liked (8)

Index-Thumb
Index-ThumbIndex-Thumb
Index-Thumb
 
Gerardo zavala guzman
Gerardo zavala guzmanGerardo zavala guzman
Gerardo zavala guzman
 
Policy Integration
Policy IntegrationPolicy Integration
Policy Integration
 
Celso garrido
Celso garridoCelso garrido
Celso garrido
 
Ron perrot
Ron perrotRon perrot
Ron perrot
 
Ele detef
Ele detefEle detef
Ele detef
 
Leonid sheremetov
Leonid sheremetovLeonid sheremetov
Leonid sheremetov
 
Hector duran limon
Hector duran limonHector duran limon
Hector duran limon
 

Similar to Mateo valero p1

[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Bi...
[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Bi...[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Bi...
[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Bi...Rakuten Group, Inc.
 
Valladolid final-septiembre-2010
Valladolid final-septiembre-2010Valladolid final-septiembre-2010
Valladolid final-septiembre-2010TELECOM I+D
 
In-Network Acceleration with FPGA (MEMO)
In-Network Acceleration with FPGA (MEMO)In-Network Acceleration with FPGA (MEMO)
In-Network Acceleration with FPGA (MEMO)Naoto MATSUMOTO
 
Report to the NAC
Report to the NACReport to the NAC
Report to the NACLarry Smarr
 
Intel Theater Presentation - SC11
Intel Theater Presentation - SC11Intel Theater Presentation - SC11
Intel Theater Presentation - SC11Deepak Singh
 
MARC ONERA Toulouse2012 Altreonic
MARC ONERA Toulouse2012 AltreonicMARC ONERA Toulouse2012 Altreonic
MARC ONERA Toulouse2012 AltreonicEric Verhulst
 
Exaflop In 2018 Hardware
Exaflop In 2018   HardwareExaflop In 2018   Hardware
Exaflop In 2018 HardwareJacob Wu
 
High performance computing - building blocks, production & perspective
High performance computing - building blocks, production & perspectiveHigh performance computing - building blocks, production & perspective
High performance computing - building blocks, production & perspectiveJason Shih
 
Presentation of the 40th TOP500 List
Presentation of the 40th TOP500 ListPresentation of the 40th TOP500 List
Presentation of the 40th TOP500 Listtop500
 
Sites Making the List the First Time
Sites Making the List the First TimeSites Making the List the First Time
Sites Making the List the First Timetop500
 
Top500 11/2011 BOF Slides
Top500 11/2011 BOF SlidesTop500 11/2011 BOF Slides
Top500 11/2011 BOF Slidestop500
 
NVIDIA GPUs Power HPC & AI Workloads in Cloud with Univa
NVIDIA GPUs Power HPC & AI Workloads in Cloud with UnivaNVIDIA GPUs Power HPC & AI Workloads in Cloud with Univa
NVIDIA GPUs Power HPC & AI Workloads in Cloud with Univainside-BigData.com
 
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrPG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrKohei KaiGai
 
20121017 OpenStack Accelerating Science
20121017 OpenStack Accelerating Science20121017 OpenStack Accelerating Science
20121017 OpenStack Accelerating ScienceTim Bell
 
Top500 november 2017
Top500 november 2017Top500 november 2017
Top500 november 2017top500
 
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Jeff Larkin
 
한컴MDS_NVIDIA Jetson Platform
한컴MDS_NVIDIA Jetson Platform한컴MDS_NVIDIA Jetson Platform
한컴MDS_NVIDIA Jetson PlatformHANCOM MDS
 

Similar to Mateo valero p1 (20)

[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Bi...
[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Bi...[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Bi...
[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Bi...
 
Valladolid final-septiembre-2010
Valladolid final-septiembre-2010Valladolid final-septiembre-2010
Valladolid final-septiembre-2010
 
In-Network Acceleration with FPGA (MEMO)
In-Network Acceleration with FPGA (MEMO)In-Network Acceleration with FPGA (MEMO)
In-Network Acceleration with FPGA (MEMO)
 
Report to the NAC
Report to the NACReport to the NAC
Report to the NAC
 
Intel Theater Presentation - SC11
Intel Theater Presentation - SC11Intel Theater Presentation - SC11
Intel Theater Presentation - SC11
 
MARC ONERA Toulouse2012 Altreonic
MARC ONERA Toulouse2012 AltreonicMARC ONERA Toulouse2012 Altreonic
MARC ONERA Toulouse2012 Altreonic
 
Exaflop In 2018 Hardware
Exaflop In 2018   HardwareExaflop In 2018   Hardware
Exaflop In 2018 Hardware
 
High performance computing - building blocks, production & perspective
High performance computing - building blocks, production & perspectiveHigh performance computing - building blocks, production & perspective
High performance computing - building blocks, production & perspective
 
Presentation of the 40th TOP500 List
Presentation of the 40th TOP500 ListPresentation of the 40th TOP500 List
Presentation of the 40th TOP500 List
 
Sites Making the List the First Time
Sites Making the List the First TimeSites Making the List the First Time
Sites Making the List the First Time
 
SGI HPC DAY 2011 Kiev
SGI HPC DAY 2011 KievSGI HPC DAY 2011 Kiev
SGI HPC DAY 2011 Kiev
 
Top500 11/2011 BOF Slides
Top500 11/2011 BOF SlidesTop500 11/2011 BOF Slides
Top500 11/2011 BOF Slides
 
NVIDIA GPUs Power HPC & AI Workloads in Cloud with Univa
NVIDIA GPUs Power HPC & AI Workloads in Cloud with UnivaNVIDIA GPUs Power HPC & AI Workloads in Cloud with Univa
NVIDIA GPUs Power HPC & AI Workloads in Cloud with Univa
 
Latest HPC News from NVIDIA
Latest HPC News from NVIDIALatest HPC News from NVIDIA
Latest HPC News from NVIDIA
 
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrPG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated Asyncr
 
20121017 OpenStack Accelerating Science
20121017 OpenStack Accelerating Science20121017 OpenStack Accelerating Science
20121017 OpenStack Accelerating Science
 
LUG 2014
LUG 2014LUG 2014
LUG 2014
 
Top500 november 2017
Top500 november 2017Top500 november 2017
Top500 november 2017
 
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
 
한컴MDS_NVIDIA Jetson Platform
한컴MDS_NVIDIA Jetson Platform한컴MDS_NVIDIA Jetson Platform
한컴MDS_NVIDIA Jetson Platform
 

Recently uploaded

How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 

Recently uploaded (20)

How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 

Mateo valero p1

  • 1. “Future Exascale Supercomputers” Mexico DF, November, 2011 Prof. Mateo Valero Director Top10 Rank Site Computer Procs Rmax Rpeak RIKEN Advanced Institute Fujitsu, K computer, SPARC64 1 for Computational Science 705024 10510000 11280384 VIIIfx 2.0GHz, Tofu interconnect (AICS) 186368 2 Tianjin, China XeonX5670+NVIDIA 2566000 4701000 100352 3 Oak Ridge Nat. Lab. Crat XT5,6 cores 224162 1759000 2331000 4 Shenzhen, China XeonX5670+NVIDIA 120640 1271000 2984300 73278 5 GSIC Center, Tokyo XeonX5670+NVIDIA 1192000 2287630 56994 6 DOE/NNSA/LANL/SNL Cray XE6 8-core 2.4 GHz 142272 1110000 1365811 SGI Altix ICE 8200EX/8400EX, NASA/Ames Research 7 Xeon HT QC 3.0/Xeon 111104 1088000 1315328 Center/NAS 5570/5670 2.93 Ghz, Infiniband 8 DOE/SC/LBNL/NERSC Cray XE6 12 cores 153408 1054000 1288627 Commissariat a l'Energie Bull bullx super-node 9 138368 1050000 1254550 Atomique (CEA) S6010/S6030 QS22/LS21 Cluster, PowerXCell 10 DOE/NNSA/LANL 122400 1042000 1375776 8i / Opteron Infiniband Mexico DF, November, 2011 2 1
  • 2. Parallel Systems Interconnect (Myrinet, IB, Ge, 3D torus, tree, …) Node Node* Node** Node Node* * Node** ** Node Node Node Node Node SMP Memory homogeneous multicore (BlueGene-Q chip) IN heterogenous multicore multicore general-purpose accelerator (e.g. Cell) multicore multicore GPU multicore FPGA ASIC (e.g. Anton for MD) Network-on-chip (bus, ring, direct, …) Mexico DF, November, 2011 3 Riken’s Fujitsu K with SPARC64 VIIIfx ● Homogeneous architecture: ● Compute node: ● One SPARC64 VIIIfx processor 2 GHz, 8 cores per chip 128 Gigaflops per chip ● 16 GB memory per node ● Number of nodes and cores: ● 864 cabinets * 102 compute nodes/cabinet * (1 socket * 8 CPU cores) = 705024 cores …. 50 by 60 meters ● Peak performance (DP): p ( ) ● 705024 cores * 16 GFLOPS per core = 11280384 PFLOPS ● Linpack: 10510 PF 93% efficiency. Matrix: more than 13725120 rows !!! 29 hours and 28 minutes ● Power consumption 12.6 MWatt, 0.8 Gigaflops/W Mexico DF, November, 2011 4 2
  • 3. Looking at the Gordon Bell Prize ● 1 GFlop/s; 1988; Cray Y-MP; 8 Processors ● Static finite element analysis ● 1 TFlop/s; 1998; Cray T3E; 1024 Processors ● Modeling of metallic magnet atoms, using a variation of the locally self-consistent multiple scattering method. ● 1 PFlop/s; 2008; Cray XT5; 1.5x105 Processors ● Superconductive materials ● 1 EFlop/s; ~2018; ?; 1x108 Processors?? (109 threads) Jack Dongarra Mexico DF, November, 2011 5 Mexico DF, November, 2011 6 3
  • 4. Nvidia GPU instruction execution MP1 MP2 MP3 MP4 instruction1 instruction2 Long latency3 Instruction4 SBAC-PAD, Vitoria October Mexico DF, November, 201128th, 2011 7 Potential System Architecture for Exascale Supercomputers System 2010 “2015” “2018” Difference attributes 2010-18 System peak 2 Pflop/s 200 Pflop/s 1 Eflop/sec O(1000) Power 6 MW 15 MW ~20 MW 20 System memory 0.3 PB 5 PB 32-64 PB O(100) Node 125 GF 0.5 TF 7 TF 1 TF 10 TF O(10) – performance O(100) Node memory 25 GB/s 0.1 1 TB/sec 0.4 TB/sec 4 TB/sec O(100) BW TB/sec Node 12 O(100) O(1,000) O(1,000) O(10,000) O(100) – concurrency O(1000) Total 225,000 O(108) O(109) O(10,000) Concurrency Total Node 1.5 GB/s 20 GB/sec 200 GB/sec O(100) Interconnect BW MTTI days O(1day) O(1 day) - O(10) EESI Final Conference 10-11 Oct. 2011, Barcelona Mexico DF, November, 2011 8 4
  • 5. 2. Towards faster airplane design
    Boeing: number of wing prototypes prepared for wind-tunnel testing
    Date              | 1980      | 1995 | 2005
    Airplane          | B757/B767 | B777 | B787
    # wing prototypes | 77        | 11   | 11
    Plateau due to RANS limitations. Further decrease expected from LES with ExaFlop.
    EESI Final Conference, 10-11 Oct. 2011, Barcelona
    Design of the Airbus 380
  • 6. 2. Towards faster airplane design
    Airbus: "More simulation, less tests"
    From A380 to A350:
    - 40% fewer wind-tunnel days
    - 25% saving in aerodynamics development time
    - 20% saving on wind-tunnel test cost
    thanks to HPC-enabled CFD runs, especially in the high-speed regime, providing an even better representation of aerodynamic phenomena turned into better design choices.
    Acknowledgements: E. CHAPUT (AIRBUS)
    EESI Final Conference, 10-11 Oct. 2011, Barcelona
    2. Oil industry
  • 7. Design of the ITER TOKAMAK (JET)
    Fundamental Sciences
    EESI Final Conference, 10-11 Oct. 2011, Barcelona
  • 8. Materials: a new path to competitiveness
    ● On-demand materials for effective commercial use
      ● Conductivity: energy loss reduction
      ● Lifetime: corrosion protection, e.g. chrome
      ● Fissures: safety insurance from molecular design
    ● Optimisation of materials / lubricants: less friction, longer lifetime, less energy loss
    ● Industrial need to speed up simulation from months to days
    ● All atom to multi-scale: exascale enables simulation of larger and more realistic systems and devices
    Life Sciences and Health: Population - Organ - Tissue - Cell - Macromolecule - Small Molecule - Atom
    EESI Final Conference, 10-11 Oct. 2011, Barcelona
  • 9. Supercomputing, theory and experimentation (two slides)
    Courtesy of IBM
  • 10. Holistic approach … towards exaflop
    ● Applications: computational complexity, asynchronous algorithms
    ● Job scheduling: moldability, resource awareness, load balancing, user satisfaction
    ● Programming model: address space, dependencies, work generation
    ● Run time: locality optimization, concurrency extraction
    ● Interconnection: topology and routing, external contention
    ● Processor/node architecture: NIC design, run time support, hardware counters, memory subsystem, core structure
    10+ Pflop/s systems planned
    ● Fujitsu Kei
      ● 80,000 8-core Sparc64 VIIIfx processors at 2 GHz (16 Gflops/core, 58 watts, 3.2 Gflops/watt), 16 GB/node, 1 PB memory, 6D mesh-torus, 10 Pflops
    ● Cray's Titan at DOE, Oak Ridge National Laboratory
      ● Hybrid system with Nvidia GPUs, 1 Pflop/s in 2011, 20 Pflop/s in 2012, late 2011 prototype
      ● $100 million
  • 11. 10+ Pflop/s systems planned
    ● IBM Blue Waters at Illinois
      ● 40,000 8-core Power7, 1 PB memory, 18 PB disk, 500 PB archival storage, 10 Pflop/s, 2012, $200 million
    ● IBM Blue Gene/Q systems:
      ● Mira to DOE, Argonne National Lab with 49,000 nodes, 16-core Power A2 processor (1.6-3 GHz), 750 K cores, 750 TB memory, 70 PB disk, 5D torus, 10 Pflop/s
      ● Sequoia to Lawrence Livermore National Lab with 98304 nodes (96 racks), 16-core A2 processor, 1.6 M cores (1 GB/core), 1.6 Petabytes memory, 6 Mwatt, 3 Gflops/watt, 20 Pflop/s, 2012
    Japan Plan for Exascale: heterogeneous, distributed memory; GigaHz KiloCore MegaNode system
    ● 2012: K Machine, 10 PF
    ● 2015: 10K Machine, 100 PF
    ● 2018-2020: 100K Machine, ExaFlops
    ● Feasibility Study (2012-2013); Exascale Project (2014-2020); Post-Petascale Projects
  • 12. (Two figure slides — thanks to S. Borkar, Intel)
  • 13. Nvidia: Chip for the Exaflop Computer (thanks Bill Dally)
    Nvidia: Node for the Exaflop Computer (thanks Bill Dally)
  • 14. Exascale Supercomputer (thanks Bill Dally)
    BSC-CNS: International Initiatives (IESP)
    ● Improve the world’s simulation and modeling capability by improving the coordination and development of the HPC software environment
    ● Build an international plan for developing the next generation open source software for scientific high-performance computing
  • 15. Back to Babel?
    Book of Genesis: “Now the whole earth had one language and the same words” … ”Come, let us make bricks, and burn them thoroughly.” … "Come, let us build ourselves a city, and a tower with its top in the heavens, and let us make a name for ourselves” … And the LORD said, "Look, they are one people, and they have all one language; and this is only the beginning of what they will do; nothing that they propose to do will now be impossible for them. Come, let us go down, and confuse their language there, so that they will not understand one another's speech."
    The computer age: Fortran & MPI … then Cilk++, Fortress, X10, CUDA, Sisal, HPF, StarSs, RapidMind, Sequoia, CAF, ALF, OpenMP, UPC, SDK, Chapel, MPI
    Thanks to Jesus Labarta
    You will see… in 400 years from now people will get crazy
    ● New generation of programmers
    ● Parallel programming
    ● Multicore/manycore architectures
    ● New usage models
    Source: Picasso -- Don Quixote. Dr. Avi Mendelson (Microsoft), keynote at ISC-2007
  • 16. Different models of computation …
    ● The dream of automatic parallelizing compilers has not come true …
    ● … so the programmer needs to express opportunities for parallel execution in the application
      ● SPMD (OpenMP 2.5)
      ● Nested fork-join (OpenMP 3.0)
      ● DAG – data flow: huge lookahead & reuse … latency/EBW/scheduling
    ● And … asynchrony (MPI and OpenMP are too synchronous):
      ● Collectives/barriers multiply effects of microscopic load imbalance, OS noise, …
    StarSs: … generates task graph at run time …
    #pragma css task input(A, B) output(C)
    void vadd3 (float A[BS], float B[BS], float C[BS]);
    #pragma css task input(sum, A) output(B)
    void scale_add (float sum, float A[BS], float B[BS]);
    #pragma css task input(A) inout(sum)
    void accum (float A[BS], float *sum);

    for (i = 0; i < N; i += BS)   // C = A + B
        vadd3 (&A[i], &B[i], &C[i]);
    for (i = 0; i < N; i += BS)   // sum(C[i])
        accum (&C[i], &sum);
    for (i = 0; i < N; i += BS)   // B = sum * E
        scale_add (sum, &E[i], &B[i]);
    for (i = 0; i < N; i += BS)   // A = C + D
        vadd3 (&C[i], &D[i], &A[i]);
    for (i = 0; i < N; i += BS)   // E = C + F
        vadd3 (&C[i], &F[i], &E[i]);
    (Figure: task graph generation — the loops above instantiate tasks 1-20, linked by their data dependences)
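    For readers without a StarSs/OmpSs compiler at hand, roughly the same block-wise dataflow can be written with the task dependences later standardized in OpenMP 4.0. The sketch below is only an approximation under that assumption: the array sizes, block size, and initial values are invented for illustration, and a plain OpenMP runtime lacks StarSs features such as renaming. It uses the first element of each block as the dependence sentinel, much as OmpSs uses contiguous argument addresses.

    /* Approximate OpenMP 4.0+ rendering of the StarSs example above
       (a sketch, not the StarSs implementation). Build with: cc -fopenmp sketch.c */
    #include <stdio.h>

    #define N  1024
    #define BS 256

    static float A[N], B[N], C[N], E[N];

    int main(void)
    {
        float sum = 0.0f;
        for (int j = 0; j < N; j++) { A[j] = 1.0f; B[j] = 2.0f; E[j] = 1.0f; }

        #pragma omp parallel
        #pragma omp single
        {
            for (int i = 0; i < N; i += BS) {            /* C = A + B, one task per block */
                #pragma omp task depend(in: A[i], B[i]) depend(out: C[i])
                { for (int j = i; j < i + BS; j++) C[j] = A[j] + B[j]; }
            }
            for (int i = 0; i < N; i += BS) {            /* sum += sum(C[i])              */
                #pragma omp task depend(in: C[i]) depend(inout: sum)
                { for (int j = i; j < i + BS; j++) sum += C[j]; }
            }
            for (int i = 0; i < N; i += BS) {            /* B = sum * E                   */
                #pragma omp task depend(in: sum, E[i]) depend(out: B[i])
                { for (int j = i; j < i + BS; j++) B[j] = sum * E[j]; }
            }
            /* The remaining loops of the example (A = C + D, E = C + F) follow the same pattern. */
        }
        printf("sum = %.1f\n", sum);   /* 3.0 per element * 1024 elements = 3072.0 */
        return 0;
    }

    The dependences on C[i] and sum reproduce the shape of the task graph on the slide: the scale_add tasks cannot start until every accum task has updated sum, while the vadd3 tasks of different blocks run concurrently.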
  • 17. StarSs: … and executes as efficiently as possible …
    (Same code as on the previous slide; the figure shows the resulting task graph execution schedule)
    StarSs: … benefiting from data access information
    ● Flat global address space seen by programmer
    ● Flexibility to dynamically traverse the dataflow graph, “optimizing”:
      ● Concurrency, critical path
      ● Memory access
    ● Opportunities for:
      ● Prefetch
      ● Reuse
      ● Eliminating antidependences (rename)
      ● Replication management
  • 18. StarSs: enabler for exascale
    ● Can exploit very unstructured parallelism
      ● Not just loop/data parallelism
      ● Easy to change structure
    ● Supports large amounts of lookahead
      ● Not stalling for dependence satisfaction
    ● Allow for locality optimizations to tolerate latency
      ● Overlap data transfers, prefetch
      ● Reuse
    ● Data-flow; asynchrony
      ● Propagates the node-level dataflow characteristics to large scale
      ● Overlap communication and computation (see the sketch below)
      ● A chance against Amdahl's law
    ● Support for heterogeneity
      ● Any number and combination of CPUs and GPUs
      ● Including autotuning
    ● Malleability: decouple program from resources
      ● Allowing dynamic resource allocation and load balance
      ● Tolerate noise
    ● Nicely hybridizes into MPI/StarSs
      ● Potential is there; can blame runtime
      ● Compatible with proprietary low-level technologies
    StarSs: history/strategy/versions
    ● Basic SMPSs: must provide directionality for every argument; contiguous, non partially overlapped arguments; renaming; several schedulers (priority, locality, …); no nesting; C/Fortran
    ● MPI/SMPSs optimizations
    ● SMPSs regions: C, no Fortran; must provide directionality for every argument; overlapping & strided arguments; reshaping of strided accesses; priority- and locality-aware scheduling
    ● OMPSs: C/C++, Fortran under development; OpenMP compatibility (~); dependences based only on arguments with directionality; contiguous arguments (address used as sentinels); separate dependences/transfers; inlined/outlined pragmas; nesting; SMP/GPU/Cluster; no renaming; several schedulers ("simple" locality-aware scheduler, …)
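    The "overlap communication and computation" point is the key property a hybrid MPI + task-based code tries to exploit. The sketch below is a generic illustration using plain MPI nonblocking calls and OpenMP tasks rather than the actual MPI/SMPSs machinery; the halo size, neighbour pattern, and kernels are invented for the example.

    /* Generic sketch of communication/computation overlap with MPI + tasks
       (not the MPI/SMPSs runtime; buffer sizes and kernels are placeholders). */
    #include <mpi.h>
    #include <stdio.h>

    #define NI 1000000     /* interior points */
    #define NH 1024        /* halo width      */

    static double interior[NI], halo_send[NH], halo_recv[NH];

    int main(int argc, char **argv)
    {
        int provided, rank, size;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int right = (rank + 1) % size, left = (rank + size - 1) % size;

        #pragma omp parallel
        #pragma omp single
        {
            /* Task 1: nonblocking halo exchange with the neighbours. */
            #pragma omp task depend(out: halo_recv)
            {
                MPI_Request reqs[2];
                MPI_Irecv(halo_recv, NH, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
                MPI_Isend(halo_send, NH, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);
                MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
            }

            /* Task 2: interior update, independent of the halo, so it can
               run while the messages are in flight.                       */
            #pragma omp task depend(inout: interior)
            { for (long i = 0; i < NI; i++) interior[i] = 0.5 * interior[i] + 1.0; }

            /* Task 3: boundary update, released only when the halo data
               has arrived and the interior update is done.                */
            #pragma omp task depend(in: halo_recv) depend(inout: interior)
            { for (int i = 0; i < NH; i++) interior[i] += halo_recv[i]; }
        }

        if (rank == 0) printf("rank 0 done\n");
        MPI_Finalize();
        return 0;
    }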
  • 19. Multidisciplinary top-down approach
    Investigate solutions to these and other problems across: applications and algorithms, programming models, performance analysis and prediction tools, load balancing, power, interconnect and node, processor, run time
    (Figure: Computer Center Power Projections — power (MW) on a 0-90 scale versus year, 2005-2011, split into cooling and computers, with cost annotations growing from $3M to $31M)
  • 20. Green/Top 500, November 2011
    Green500 rank | Top500 rank | Mflops/Watt | Power (kW) | Site | Computer
    1   | 64  | 2026.48 | 85.12    | IBM - Rochester | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
    2   | 65  | 2026.48 | 85.12    | IBM Thomas J. Watson Research Center | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
    3   | 29  | 1996.09 | 170.25   | IBM - Rochester | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
    4   | 17  | 1988.56 | 340.5    | DOE/NNSA/LLNL | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
    5   | 284 | 1689.86 | 38.67    | IBM Thomas J. Watson Research Center | NNSA/SC Blue Gene/Q Prototype 1
    6   | 328 | 1378.32 | 47.05    | Nagasaki University | DEGIMA Cluster, Intel i5, ATI Radeon GPU, Infiniband QDR
    7   | 114 | 1266.26 | 81.5     | Barcelona Supercomputing Center | Bullx B505, Xeon E5649 6C 2.53 GHz, Infiniband QDR, NVIDIA 2090
    8   | 102 | 1010.11 | 108.8    | TGCC / GENCI | Curie Hybrid Nodes - Bullx B505, Xeon E5640 2.67 GHz, Infiniband QDR
    9   | 21  | 963.7   | 515.2    | Institute of Process Engineering, Chinese Academy of Sciences | Mole-8.5 Cluster, Xeon X5520 4C 2.27 GHz, Infiniband QDR, NVIDIA 2050
    10  | 5   | 958.35  | 1243.8   | GSIC Center, Tokyo Institute of Technology | HP ProLiant SL390s G7 Xeon 6C X5670, Nvidia GPU, Linux/Windows
    11  | 96  | 928.96  | 126.27   | Virginia Tech | SuperServer 2026GT-TRF, Xeon E5645 6C 2.40 GHz, Infiniband QDR, NVIDIA 2050
    12  | 111 | 901.54  | 117.91   | Georgia Institute of Technology | HP ProLiant SL390s G7 Xeon 6C X5660 2.8 GHz, nVidia Fermi, Infiniband QDR
    13  | 82  | 891.88  | 160      | CINECA / SCS - SuperComputing Solution | iDataPlex DX360M3, Xeon E5645 6C 2.40 GHz, Infiniband QDR, NVIDIA 2070
    14  | 256 | 891.87  | 76.25    | Forschungszentrum Juelich (FZJ) | iDataPlex DX360M3, Xeon X5650 6C 2.66 GHz, Infiniband QDR, NVIDIA 2070
    15  | 61  | 889.19  | 198.72   | Sandia National Laboratories | Xtreme-X GreenBlade GB512X, Xeon E5 (Sandy Bridge - EP) 8C 2.60 GHz, Infiniband QDR
    32  | 1   | 830.18  | 12659.89 | RIKEN Advanced Institute for Computational Science (AICS) | K computer, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect
    47  | 2   | 635.15  | 4040     | National Supercomputing Center in Tianjin | NUDT YH MPP, Xeon X5670 6C 2.93 GHz, NVIDIA 2050
    149 | 3   | 253.09  | 6950     | DOE/SC/Oak Ridge National Laboratory | Cray XT5-HE Opteron 6-core 2.6 GHz
    56  | 4   | 492.64  | 2580     | National Supercomputing Centre in Shenzhen (NSCS) | Dawning TC3600 Blade System, Xeon X5650 6C 2.66 GHz, Infiniband QDR, NVIDIA 2050
    (Figure: Green/Top 500 November 2011 — Top500 rank versus Mflops/watt, with efficiency bands >1 GF/watt, 500-1000 MF/watt and 100-500 MF/watt; highlighted systems: BSC (Xeon 6C + NVIDIA 2090 GPU), Nagasaki U. (Intel i5 + ATI Radeon GPU), and the IBM/NNSA Blue Gene/Q machines. Mflops/watt versus MWatts/Exaflop pairs as annotated: 2026.48 / 493, 1689.86 / 592, 1378.32 / 726, 1266.26 / 726.)
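    The MWatts/Exaflop figures in the chart are pure unit arithmetic on the Green500 efficiency: MW per Eflop/s = 10^6 / (Mflops/W). A minimal sketch using two efficiencies taken from the table above (the printed values are just the result of that formula):

    /* Illustrative sketch: power needed for 1 Eflop/s at a given Green500
       efficiency. 1 Eflop/s = 1e12 Mflop/s and 1 MW = 1e6 W, hence
       MW per Eflop/s = 1e6 / (Mflops per watt).                            */
    #include <stdio.h>

    static double mw_per_exaflop(double mflops_per_watt)
    {
        return 1e6 / mflops_per_watt;
    }

    int main(void)
    {
        double bluegene_q = 2026.48;  /* Green500 #1 (BlueGene/Q), from the table */
        double k_computer = 830.18;   /* Top500 #1 (K computer), from the table   */

        printf("BlueGene/Q efficiency: %.0f MW per Eflop/s\n", mw_per_exaflop(bluegene_q)); /* ~493  */
        printf("K computer efficiency: %.0f MW per Eflop/s\n", mw_per_exaflop(k_computer)); /* ~1205 */
        return 0;
    }

    Seen through this formula, the ~20 MW exascale power target in the earlier projection table implies roughly 50 Gflops/W, well beyond even the Blue Gene/Q prototypes that top this 2011 list.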