SlideShare una empresa de Scribd logo
1 de 16
Critical Issues at Exascale
 for Algorithm and Software Design
SC12, Salt Lake City, Utah, Nov 2012
    Jack Dongarra, University of Tennessee, Tennessee, USA
Performance Development in Top500
    1E+11
    1E+10
   1 Eflop/s
   1E+09
10000000
 100 Pflop/s
  10 Pflop/s
10000000
 1000000
   1 Pflop/s                 N=1
  100000
 100 Tflop/s

   10000
  10 Tflop/s

   1 1000
     Tflop/s                 N=500
      100
 100 Gflop/s
        10
  10 Gflop/s
         1
   1 Gflop/s
               1996




                      2002




                                     2008




                                            2014




                                                   2020
       0.1
Potential System Architecture

Systems                      2012                 2022            Difference
                          Titan Computer                         Today & 2022
System peak                27 Pflop/s           1 Eflop/s            O(100)

Power                       8.3 MW              ~20 MW
                           (2 Gflops/W)        (50 Gflops/W)

System memory                710 TB            32 - 64 PB             O(10)
                            (38*18688)

Node performance            1,452 GF/s        1.2 or 15TF/s       O(10) – O(100)
                              (1311+141)


Node memory BW              232 GB/s            2 - 4TB/s            O(1000)
                             (52+180)

Node concurrency          16 cores CPU        O(1k) or 10k       O(100) – O(1000)
                           2688 CUDA
                              cores
Total Node Interconnect      8 GB/s           200-400GB/s             O(10)
BW
System size (nodes)          18,688        O(100,000) or O(1M)   O(100) – O(1000)

Total concurrency             50 M              O(billion)          O(1,000)

MTTI                       ?? unknown           O(<1 day)            - O(10)
Potential System Architecture
    with a cap of $200M and 20MW
Systems                      2012                 2022            Difference
                          Titan Computer                         Today & 2022
System peak                27 Pflop/s           1 Eflop/s            O(100)

Power                       8.3 MW              ~20 MW
                           (2 Gflops/W)        (50 Gflops/W)

System memory                710 TB            32 - 64 PB             O(10)
                            (38*18688)

Node performance            1,452 GF/s        1.2 or 15TF/s       O(10) – O(100)
                              (1311+141)


Node memory BW              232 GB/s            2 - 4TB/s            O(1000)
                             (52+180)

Node concurrency          16 cores CPU        O(1k) or 10k       O(100) – O(1000)
                           2688 CUDA
                              cores
Total Node Interconnect      8 GB/s           200-400GB/s             O(10)
BW
System size (nodes)          18,688        O(100,000) or O(1M)   O(100) – O(1000)

Total concurrency             50 M              O(billion)          O(1,000)

MTTI                       ?? unknown           O(<1 day)            - O(10)
Critical Issues at Peta & Exascale for
Algorithm and Software Design
 Synchronization-reducing algorithms
   Break Fork-Join model
 Communication-reducing algorithms
   Use methods which have lower bound on communication
 Mixed precision methods
   2x speed of ops and 2x speed for data movement
 Autotuning
   Today’s machines are too complicated, build “smarts” into
    software to adapt to the hardware
 Fault resilient algorithms
   Implement algorithms that can recover from failures/bit
    flips
 Reproducibility of results
   Today we can’t guarantee this. We understand the issues,
                            5
    but some of our “colleagues” have a hard time with this.
Major Changes to Algorithms/Software
• Must rethink the design of our
 algorithms and software
  Manycore and Hybrid architectures are
   disruptive technology
    Similar to what happened with cluster
     computing and message passing


  Rethink and rewrite the applications,
   algorithms, and software

  Data movement is expensive
  Flops are cheap 6
Fork-Join Parallelization of LU and QR.
Parallelize the update:                             dgemm
    • Easy and done in any reasonable software.
    • This is the 2/3n3 term in the FLOPs count.               -
    • Can be done efficiently with LAPACK+multithreaded BLAS




                       Cores
Synchronization (in LAPACK LU)



    Step 1             Step 2              Step 3             Step 4   ...




                     synchronous processing
    • Fork-join, bulk fork join                                         27

                     bulk synchronous processing


                                     8
Allowing for delayed update, out of order, asynchronous, dataflow execution
PLASMA/MAGMA: Parallel Linear Algebra
 s/w for Multicore/Hybrid Architectures
Objectives
  High utilization of each core
  Scaling to large number of cores
  Synchronization reducing algorithms
Methodology
    Dynamic DAG scheduling (QUARK)
    Explicit parallelism
    Implicit communication
    Fine granularity / block data layout
Arbitrary DAG with dynamic scheduling
                                               Fork-join
                                               parallelism

                               DAG scheduled
                               parallelism
                                                             9
Communication Avoiding QR
     Example
                                                                 R0                  R0   R0R R
                            D0            Domain_Tile_QR
                                                                                            D0

                                                                 R1
                            D1             Domain_Tile_QR                                   D1


                                                                 R2                  R2
                            D2            Domain_Tile_QR                                    D2


                                                                 R3
                            D3             Domain_Tile_QR                                       D3




A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd
Conference on Hypercube Concurrent Computers and Applications, volume II, Applications,
pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State.


      11/20/2012                                                                           10
Communication Avoiding QR
     Example
                                                                 R0                  R0   R0R R
                            D0            Domain_Tile_QR
                                                                                            D0

                                                                 R1
                            D1             Domain_Tile_QR                                   D1


                                                                 R2                  R2
                            D2            Domain_Tile_QR                                    D2


                                                                 R3
                            D3             Domain_Tile_QR                                       D3




A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd
Conference on Hypercube Concurrent Computers and Applications, volume II, Applications,
pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State.


      11/20/2012                                                                           11
Communication Avoiding QR
     Example
                                                                 R0                  R0   R0R R
                            D0            Domain_Tile_QR
                                                                                            D0

                                                                 R1
                            D1             Domain_Tile_QR                                   D1


                                                                 R2                  R2
                            D2            Domain_Tile_QR                                    D2


                                                                 R3
                            D3             Domain_Tile_QR                                       D3




A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd
Conference on Hypercube Concurrent Computers and Applications, volume II, Applications,
pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State.


      11/20/2012                                                                           12
Communication Avoiding QR
     Example
                                                                 R0                  R0   R0R R
                            D0            Domain_Tile_QR
                                                                                            D0

                                                                 R1
                            D1             Domain_Tile_QR                                   D1


                                                                 R2                  R2
                            D2            Domain_Tile_QR                                    D2


                                                                 R3
                            D3             Domain_Tile_QR                                       D3




A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd
Conference on Hypercube Concurrent Computers and Applications, volume II, Applications,
pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State.


      11/20/2012                                                                           13
Communication Avoiding QR
     Example
                                                                 R0                  R0   R0R R
                            D0            Domain_Tile_QR
                                                                                            D0

                                                                 R1
                            D1             Domain_Tile_QR                                   D1


                                                                 R2                  R2
                            D2            Domain_Tile_QR                                    D2


                                                                 R3
                            D3             Domain_Tile_QR                                       D3




A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd
Conference on Hypercube Concurrent Computers and Applications, volume II, Applications,
pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State.


      11/20/2012                                                                           14
PowerPack 2.0




The PowerPack platform consists of software and hardware instrumentation.
                                                                       15
Kirk Cameron, Virginia Tech; http://scape.cs.vt.edu/software/powerpack-2-0/
Power for QR Factorization
                                LAPACK’s QR Factorization
                                Fork-join based


                                MKL’s QR Factorization
                                Fork-join based

                                PLASMA’s Conventional
                                QR Factorization
                                DAG based
                                PLASMA’s Communication
                                Reducing QR Factorization
                                DAG based
dual-socket quad-core Intel Xeon E5462 (Harpertown) processor
@ 2.80GHz (8 cores total) w / MLK BLAS                          16
matrix size is very tall and skinny (mxn is 1,152,000 by 288)

Más contenido relacionado

La actualidad más candente

Report to the NAC
Report to the NACReport to the NAC
Report to the NACLarry Smarr
 
Top500 november 2017
Top500 november 2017Top500 november 2017
Top500 november 2017top500
 
Snapdragon SoC and ARMv7 Architecture
Snapdragon SoC and ARMv7 ArchitectureSnapdragon SoC and ARMv7 Architecture
Snapdragon SoC and ARMv7 ArchitectureSantosh Verma
 
Top500 SC17 invited talk
Top500 SC17 invited talkTop500 SC17 invited talk
Top500 SC17 invited talktop500
 
33C3: Code BROWN in the Air
33C3: Code BROWN in the Air33C3: Code BROWN in the Air
33C3: Code BROWN in the AirPhilippe Lin
 
Run Simulations and Then Become An Inventor (Best Paper Award in CDNLive Taiw...
Run Simulations and Then Become An Inventor (Best Paper Award in CDNLive Taiw...Run Simulations and Then Become An Inventor (Best Paper Award in CDNLive Taiw...
Run Simulations and Then Become An Inventor (Best Paper Award in CDNLive Taiw...Nansen Chen
 
Universal Reconfigurable Processing Platform For Space Rev Voice4
Universal Reconfigurable Processing Platform For Space Rev Voice4Universal Reconfigurable Processing Platform For Space Rev Voice4
Universal Reconfigurable Processing Platform For Space Rev Voice4dseagrave
 
Top500 11/2011 BOF Slides
Top500 11/2011 BOF SlidesTop500 11/2011 BOF Slides
Top500 11/2011 BOF Slidestop500
 
CELC - Архитектура коммутаторов Catalyst 4500
CELC - Архитектура коммутаторов Catalyst 4500CELC - Архитектура коммутаторов Catalyst 4500
CELC - Архитектура коммутаторов Catalyst 4500Cisco Russia
 
Jetson AGX Xavier and the New Era of Autonomous Machines
Jetson AGX Xavier and the New Era of Autonomous MachinesJetson AGX Xavier and the New Era of Autonomous Machines
Jetson AGX Xavier and the New Era of Autonomous MachinesDustin Franklin
 
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mão
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mãoWebinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mão
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mãoEmbarcados
 
VR-Zone Tech News for the Geeks Oct 2011 Issue 2
VR-Zone Tech News for the Geeks Oct 2011 Issue 2VR-Zone Tech News for the Geeks Oct 2011 Issue 2
VR-Zone Tech News for the Geeks Oct 2011 Issue 2VR-Zone .com
 
Amat p5000 etcher semi star
Amat p5000 etcher   semi starAmat p5000 etcher   semi star
Amat p5000 etcher semi starEmily Tan
 
Do SI Simulation, Also for Carbon Reduction
Do SI Simulation, Also for Carbon ReductionDo SI Simulation, Also for Carbon Reduction
Do SI Simulation, Also for Carbon ReductionNansen Chen
 
Acer aspire 4252_4552_4552g_quanta_zqa_rev_1a_sch
Acer aspire 4252_4552_4552g_quanta_zqa_rev_1a_schAcer aspire 4252_4552_4552g_quanta_zqa_rev_1a_sch
Acer aspire 4252_4552_4552g_quanta_zqa_rev_1a_schjuan carlos pintos
 
Вопросы балансировки трафика
Вопросы балансировки трафикаВопросы балансировки трафика
Вопросы балансировки трафикаSkillFactory
 

La actualidad más candente (20)

No[1][1]
No[1][1]No[1][1]
No[1][1]
 
Supercomputers and Cloud Games
Supercomputers and Cloud GamesSupercomputers and Cloud Games
Supercomputers and Cloud Games
 
Sponge v2
Sponge v2Sponge v2
Sponge v2
 
Report to the NAC
Report to the NACReport to the NAC
Report to the NAC
 
Top500 november 2017
Top500 november 2017Top500 november 2017
Top500 november 2017
 
Snapdragon SoC and ARMv7 Architecture
Snapdragon SoC and ARMv7 ArchitectureSnapdragon SoC and ARMv7 Architecture
Snapdragon SoC and ARMv7 Architecture
 
Top500 SC17 invited talk
Top500 SC17 invited talkTop500 SC17 invited talk
Top500 SC17 invited talk
 
33C3: Code BROWN in the Air
33C3: Code BROWN in the Air33C3: Code BROWN in the Air
33C3: Code BROWN in the Air
 
Run Simulations and Then Become An Inventor (Best Paper Award in CDNLive Taiw...
Run Simulations and Then Become An Inventor (Best Paper Award in CDNLive Taiw...Run Simulations and Then Become An Inventor (Best Paper Award in CDNLive Taiw...
Run Simulations and Then Become An Inventor (Best Paper Award in CDNLive Taiw...
 
Universal Reconfigurable Processing Platform For Space Rev Voice4
Universal Reconfigurable Processing Platform For Space Rev Voice4Universal Reconfigurable Processing Platform For Space Rev Voice4
Universal Reconfigurable Processing Platform For Space Rev Voice4
 
Top500 11/2011 BOF Slides
Top500 11/2011 BOF SlidesTop500 11/2011 BOF Slides
Top500 11/2011 BOF Slides
 
CELC - Архитектура коммутаторов Catalyst 4500
CELC - Архитектура коммутаторов Catalyst 4500CELC - Архитектура коммутаторов Catalyst 4500
CELC - Архитектура коммутаторов Catalyst 4500
 
Jetson AGX Xavier and the New Era of Autonomous Machines
Jetson AGX Xavier and the New Era of Autonomous MachinesJetson AGX Xavier and the New Era of Autonomous Machines
Jetson AGX Xavier and the New Era of Autonomous Machines
 
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mão
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mãoWebinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mão
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mão
 
Zynq ultrascale
Zynq ultrascaleZynq ultrascale
Zynq ultrascale
 
VR-Zone Tech News for the Geeks Oct 2011 Issue 2
VR-Zone Tech News for the Geeks Oct 2011 Issue 2VR-Zone Tech News for the Geeks Oct 2011 Issue 2
VR-Zone Tech News for the Geeks Oct 2011 Issue 2
 
Amat p5000 etcher semi star
Amat p5000 etcher   semi starAmat p5000 etcher   semi star
Amat p5000 etcher semi star
 
Do SI Simulation, Also for Carbon Reduction
Do SI Simulation, Also for Carbon ReductionDo SI Simulation, Also for Carbon Reduction
Do SI Simulation, Also for Carbon Reduction
 
Acer aspire 4252_4552_4552g_quanta_zqa_rev_1a_sch
Acer aspire 4252_4552_4552g_quanta_zqa_rev_1a_schAcer aspire 4252_4552_4552g_quanta_zqa_rev_1a_sch
Acer aspire 4252_4552_4552g_quanta_zqa_rev_1a_sch
 
Вопросы балансировки трафика
Вопросы балансировки трафикаВопросы балансировки трафика
Вопросы балансировки трафика
 

Destacado

L presentation
L presentationL presentation
L presentationj2chris
 
Presentation1
Presentation1Presentation1
Presentation1kerangps
 
Thunderbird Online GMAT & TOEFL Preparation
Thunderbird Online GMAT & TOEFL PreparationThunderbird Online GMAT & TOEFL Preparation
Thunderbird Online GMAT & TOEFL PreparationElyse Meyer
 
Magazine World Games 2013 en Cali Colombia
Magazine World Games 2013 en Cali ColombiaMagazine World Games 2013 en Cali Colombia
Magazine World Games 2013 en Cali ColombiaClara Luz Roldán
 
10 bagay na dapat malaman tungkol sa run
10 bagay na dapat malaman tungkol sa run10 bagay na dapat malaman tungkol sa run
10 bagay na dapat malaman tungkol sa runUlikblogRunner
 
Scheduling Updates nov 2013
Scheduling Updates nov 2013Scheduling Updates nov 2013
Scheduling Updates nov 2013ClubReady Inc
 
Didakticheskie materialy
Didakticheskie materialyDidakticheskie materialy
Didakticheskie materialyAlisha_Rum
 
áLbum nº 1 xadrez
áLbum nº 1   xadrezáLbum nº 1   xadrez
áLbum nº 1 xadrezcepmaio
 
Turma m8 noturno 2011.1
Turma m8    noturno 2011.1Turma m8    noturno 2011.1
Turma m8 noturno 2011.1cepmaio
 
Anziani chi-li-assiste-attach s299103
Anziani chi-li-assiste-attach s299103Anziani chi-li-assiste-attach s299103
Anziani chi-li-assiste-attach s299103Cagliostro Puntodue
 
øStgaard bokgren on_stress
øStgaard bokgren on_stressøStgaard bokgren on_stress
øStgaard bokgren on_stressJens östgaard
 
Supporting business decisions in the technological enterprise
Supporting business decisions in the technological enterpriseSupporting business decisions in the technological enterprise
Supporting business decisions in the technological enterpriseWilliam Hall
 
Dos and don'ts of job interview
Dos and don'ts of job interviewDos and don'ts of job interview
Dos and don'ts of job interviewPedro Daniel Cosme
 
1993 de f a z ok
1993 de f a z ok1993 de f a z ok
1993 de f a z okcepmaio
 

Destacado (20)

L presentation
L presentationL presentation
L presentation
 
Presentation1
Presentation1Presentation1
Presentation1
 
Thunderbird Online GMAT & TOEFL Preparation
Thunderbird Online GMAT & TOEFL PreparationThunderbird Online GMAT & TOEFL Preparation
Thunderbird Online GMAT & TOEFL Preparation
 
Magazine World Games 2013 en Cali Colombia
Magazine World Games 2013 en Cali ColombiaMagazine World Games 2013 en Cali Colombia
Magazine World Games 2013 en Cali Colombia
 
10 bagay na dapat malaman tungkol sa run
10 bagay na dapat malaman tungkol sa run10 bagay na dapat malaman tungkol sa run
10 bagay na dapat malaman tungkol sa run
 
Scheduling Updates nov 2013
Scheduling Updates nov 2013Scheduling Updates nov 2013
Scheduling Updates nov 2013
 
Transferencia
TransferenciaTransferencia
Transferencia
 
Didakticheskie materialy
Didakticheskie materialyDidakticheskie materialy
Didakticheskie materialy
 
김민경
김민경김민경
김민경
 
Class 1. ss
Class 1. ssClass 1. ss
Class 1. ss
 
áLbum nº 1 xadrez
áLbum nº 1   xadrezáLbum nº 1   xadrez
áLbum nº 1 xadrez
 
Turma m8 noturno 2011.1
Turma m8    noturno 2011.1Turma m8    noturno 2011.1
Turma m8 noturno 2011.1
 
Automatic enrolment
Automatic enrolmentAutomatic enrolment
Automatic enrolment
 
Anziani chi-li-assiste-attach s299103
Anziani chi-li-assiste-attach s299103Anziani chi-li-assiste-attach s299103
Anziani chi-li-assiste-attach s299103
 
øStgaard bokgren on_stress
øStgaard bokgren on_stressøStgaard bokgren on_stress
øStgaard bokgren on_stress
 
Pilttekstid
PilttekstidPilttekstid
Pilttekstid
 
Supporting business decisions in the technological enterprise
Supporting business decisions in the technological enterpriseSupporting business decisions in the technological enterprise
Supporting business decisions in the technological enterprise
 
Canjs
CanjsCanjs
Canjs
 
Dos and don'ts of job interview
Dos and don'ts of job interviewDos and don'ts of job interview
Dos and don'ts of job interview
 
1993 de f a z ok
1993 de f a z ok1993 de f a z ok
1993 de f a z ok
 

Similar a Critical Issues at Exascale for Algorithm and Software Design

POLYTEDA PowerDRC/LVS overview
POLYTEDA PowerDRC/LVS overviewPOLYTEDA PowerDRC/LVS overview
POLYTEDA PowerDRC/LVS overviewAlexander Grudanov
 
LEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGATO project
 
New idc architecture
New idc architectureNew idc architecture
New idc architectureMason Mei
 
MARC ONERA Toulouse2012 Altreonic
MARC ONERA Toulouse2012 AltreonicMARC ONERA Toulouse2012 Altreonic
MARC ONERA Toulouse2012 AltreonicEric Verhulst
 
Numascale Product IBM
Numascale Product IBMNumascale Product IBM
Numascale Product IBMIBM Danmark
 
Optical Switching in the Datacenter
Optical Switching in the DatacenterOptical Switching in the Datacenter
Optical Switching in the DatacenterKostas Katrinis
 
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitchDPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitchJim St. Leger
 
DPDK summit 2015: It's kind of fun to do the impossible with DPDK
DPDK summit 2015: It's kind of fun  to do the impossible with DPDKDPDK summit 2015: It's kind of fun  to do the impossible with DPDK
DPDK summit 2015: It's kind of fun to do the impossible with DPDKLagopus SDN/OpenFlow switch
 
DPDK Summit 2015 - NTT - Yoshihiro Nakajima
DPDK Summit 2015 - NTT - Yoshihiro NakajimaDPDK Summit 2015 - NTT - Yoshihiro Nakajima
DPDK Summit 2015 - NTT - Yoshihiro NakajimaJim St. Leger
 
Data flow super computing valentina balas
Data flow super computing   valentina balasData flow super computing   valentina balas
Data flow super computing valentina balasValentina Emilia Balas
 
Final presentation [dissertation project], 20192 esv0002
Final presentation [dissertation project], 20192 esv0002Final presentation [dissertation project], 20192 esv0002
Final presentation [dissertation project], 20192 esv0002MOHAMMED FURQHAN
 
underground cable fault location using aruino,gsm&gps
underground cable fault location using aruino,gsm&gps underground cable fault location using aruino,gsm&gps
underground cable fault location using aruino,gsm&gps Mohd Sohail
 
SDNs: hot topics, evolution & research opportunities
SDNs: hot topics, evolution & research opportunitiesSDNs: hot topics, evolution & research opportunities
SDNs: hot topics, evolution & research opportunitiesDiego Kreutz
 
XT Best Practices
XT Best PracticesXT Best Practices
XT Best PracticesJeff Larkin
 
Lagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus presentation on 14th Annual ON*VECTOR International Photonics WorkshopLagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus presentation on 14th Annual ON*VECTOR International Photonics WorkshopLagopus SDN/OpenFlow switch
 
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facilityinside-BigData.com
 
Hummingbird - Open Source for Small Satellites - GSAW 2012
Hummingbird - Open Source for Small Satellites - GSAW 2012Hummingbird - Open Source for Small Satellites - GSAW 2012
Hummingbird - Open Source for Small Satellites - GSAW 2012Logica_hummingbird
 
Study and Emulation of 10G-EPON with Triple Play
Study and Emulation of 10G-EPON with Triple PlayStudy and Emulation of 10G-EPON with Triple Play
Study and Emulation of 10G-EPON with Triple PlaySatya Prakash Rout
 

Similar a Critical Issues at Exascale for Algorithm and Software Design (20)

POLYTEDA PowerDRC/LVS overview
POLYTEDA PowerDRC/LVS overviewPOLYTEDA PowerDRC/LVS overview
POLYTEDA PowerDRC/LVS overview
 
LEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGaTO: Software Stack Runtimes
LEGaTO: Software Stack Runtimes
 
New idc architecture
New idc architectureNew idc architecture
New idc architecture
 
MARC ONERA Toulouse2012 Altreonic
MARC ONERA Toulouse2012 AltreonicMARC ONERA Toulouse2012 Altreonic
MARC ONERA Toulouse2012 Altreonic
 
Numascale Product IBM
Numascale Product IBMNumascale Product IBM
Numascale Product IBM
 
Optical Switching in the Datacenter
Optical Switching in the DatacenterOptical Switching in the Datacenter
Optical Switching in the Datacenter
 
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitchDPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
 
DPDK summit 2015: It's kind of fun to do the impossible with DPDK
DPDK summit 2015: It's kind of fun  to do the impossible with DPDKDPDK summit 2015: It's kind of fun  to do the impossible with DPDK
DPDK summit 2015: It's kind of fun to do the impossible with DPDK
 
DPDK Summit 2015 - NTT - Yoshihiro Nakajima
DPDK Summit 2015 - NTT - Yoshihiro NakajimaDPDK Summit 2015 - NTT - Yoshihiro Nakajima
DPDK Summit 2015 - NTT - Yoshihiro Nakajima
 
Data flow super computing valentina balas
Data flow super computing   valentina balasData flow super computing   valentina balas
Data flow super computing valentina balas
 
Final presentation [dissertation project], 20192 esv0002
Final presentation [dissertation project], 20192 esv0002Final presentation [dissertation project], 20192 esv0002
Final presentation [dissertation project], 20192 esv0002
 
Kubernetes-DX-5G-session
Kubernetes-DX-5G-sessionKubernetes-DX-5G-session
Kubernetes-DX-5G-session
 
underground cable fault location using aruino,gsm&gps
underground cable fault location using aruino,gsm&gps underground cable fault location using aruino,gsm&gps
underground cable fault location using aruino,gsm&gps
 
SDNs: hot topics, evolution & research opportunities
SDNs: hot topics, evolution & research opportunitiesSDNs: hot topics, evolution & research opportunities
SDNs: hot topics, evolution & research opportunities
 
0507036
05070360507036
0507036
 
XT Best Practices
XT Best PracticesXT Best Practices
XT Best Practices
 
Lagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus presentation on 14th Annual ON*VECTOR International Photonics WorkshopLagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
 
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
 
Hummingbird - Open Source for Small Satellites - GSAW 2012
Hummingbird - Open Source for Small Satellites - GSAW 2012Hummingbird - Open Source for Small Satellites - GSAW 2012
Hummingbird - Open Source for Small Satellites - GSAW 2012
 
Study and Emulation of 10G-EPON with Triple Play
Study and Emulation of 10G-EPON with Triple PlayStudy and Emulation of 10G-EPON with Triple Play
Study and Emulation of 10G-EPON with Triple Play
 

Último

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 

Último (20)

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 

Critical Issues at Exascale for Algorithm and Software Design

  • 1. Critical Issues at Exascale for Algorithm and Software Design SC12, Salt Lake City, Utah, Nov 2012 Jack Dongarra, University of Tennessee, Tennessee, USA
  • 2. Performance Development in Top500 1E+11 1E+10 1 Eflop/s 1E+09 10000000 100 Pflop/s 10 Pflop/s 10000000 1000000 1 Pflop/s N=1 100000 100 Tflop/s 10000 10 Tflop/s 1 1000 Tflop/s N=500 100 100 Gflop/s 10 10 Gflop/s 1 1 Gflop/s 1996 2002 2008 2014 2020 0.1
  • 3. Potential System Architecture Systems 2012 2022 Difference Titan Computer Today & 2022 System peak 27 Pflop/s 1 Eflop/s O(100) Power 8.3 MW ~20 MW (2 Gflops/W) (50 Gflops/W) System memory 710 TB 32 - 64 PB O(10) (38*18688) Node performance 1,452 GF/s 1.2 or 15TF/s O(10) – O(100) (1311+141) Node memory BW 232 GB/s 2 - 4TB/s O(1000) (52+180) Node concurrency 16 cores CPU O(1k) or 10k O(100) – O(1000) 2688 CUDA cores Total Node Interconnect 8 GB/s 200-400GB/s O(10) BW System size (nodes) 18,688 O(100,000) or O(1M) O(100) – O(1000) Total concurrency 50 M O(billion) O(1,000) MTTI ?? unknown O(<1 day) - O(10)
  • 4. Potential System Architecture with a cap of $200M and 20MW Systems 2012 2022 Difference Titan Computer Today & 2022 System peak 27 Pflop/s 1 Eflop/s O(100) Power 8.3 MW ~20 MW (2 Gflops/W) (50 Gflops/W) System memory 710 TB 32 - 64 PB O(10) (38*18688) Node performance 1,452 GF/s 1.2 or 15TF/s O(10) – O(100) (1311+141) Node memory BW 232 GB/s 2 - 4TB/s O(1000) (52+180) Node concurrency 16 cores CPU O(1k) or 10k O(100) – O(1000) 2688 CUDA cores Total Node Interconnect 8 GB/s 200-400GB/s O(10) BW System size (nodes) 18,688 O(100,000) or O(1M) O(100) – O(1000) Total concurrency 50 M O(billion) O(1,000) MTTI ?? unknown O(<1 day) - O(10)
  • 5. Critical Issues at Peta & Exascale for Algorithm and Software Design Synchronization-reducing algorithms  Break Fork-Join model Communication-reducing algorithms  Use methods which have lower bound on communication Mixed precision methods  2x speed of ops and 2x speed for data movement Autotuning  Today’s machines are too complicated, build “smarts” into software to adapt to the hardware Fault resilient algorithms  Implement algorithms that can recover from failures/bit flips Reproducibility of results  Today we can’t guarantee this. We understand the issues, 5 but some of our “colleagues” have a hard time with this.
  • 6. Major Changes to Algorithms/Software • Must rethink the design of our algorithms and software Manycore and Hybrid architectures are disruptive technology Similar to what happened with cluster computing and message passing Rethink and rewrite the applications, algorithms, and software Data movement is expensive Flops are cheap 6
  • 7. Fork-Join Parallelization of LU and QR. Parallelize the update: dgemm • Easy and done in any reasonable software. • This is the 2/3n3 term in the FLOPs count. - • Can be done efficiently with LAPACK+multithreaded BLAS Cores
  • 8. Synchronization (in LAPACK LU) Step 1 Step 2 Step 3 Step 4 ...  synchronous processing • Fork-join, bulk fork join 27  bulk synchronous processing 8 Allowing for delayed update, out of order, asynchronous, dataflow execution
  • 9. PLASMA/MAGMA: Parallel Linear Algebra s/w for Multicore/Hybrid Architectures Objectives  High utilization of each core  Scaling to large number of cores  Synchronization reducing algorithms Methodology  Dynamic DAG scheduling (QUARK)  Explicit parallelism  Implicit communication  Fine granularity / block data layout Arbitrary DAG with dynamic scheduling Fork-join parallelism DAG scheduled parallelism 9
  • 10. Communication Avoiding QR Example R0 R0 R0R R D0 Domain_Tile_QR D0 R1 D1 Domain_Tile_QR D1 R2 R2 D2 Domain_Tile_QR D2 R3 D3 Domain_Tile_QR D3 A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd Conference on Hypercube Concurrent Computers and Applications, volume II, Applications, pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State. 11/20/2012 10
  • 11. Communication Avoiding QR Example R0 R0 R0R R D0 Domain_Tile_QR D0 R1 D1 Domain_Tile_QR D1 R2 R2 D2 Domain_Tile_QR D2 R3 D3 Domain_Tile_QR D3 A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd Conference on Hypercube Concurrent Computers and Applications, volume II, Applications, pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State. 11/20/2012 11
  • 12. Communication Avoiding QR Example R0 R0 R0R R D0 Domain_Tile_QR D0 R1 D1 Domain_Tile_QR D1 R2 R2 D2 Domain_Tile_QR D2 R3 D3 Domain_Tile_QR D3 A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd Conference on Hypercube Concurrent Computers and Applications, volume II, Applications, pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State. 11/20/2012 12
  • 13. Communication Avoiding QR Example R0 R0 R0R R D0 Domain_Tile_QR D0 R1 D1 Domain_Tile_QR D1 R2 R2 D2 Domain_Tile_QR D2 R3 D3 Domain_Tile_QR D3 A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd Conference on Hypercube Concurrent Computers and Applications, volume II, Applications, pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State. 11/20/2012 13
  • 14. Communication Avoiding QR Example R0 R0 R0R R D0 Domain_Tile_QR D0 R1 D1 Domain_Tile_QR D1 R2 R2 D2 Domain_Tile_QR D2 R3 D3 Domain_Tile_QR D3 A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd Conference on Hypercube Concurrent Computers and Applications, volume II, Applications, pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State. 11/20/2012 14
  • 15. PowerPack 2.0 The PowerPack platform consists of software and hardware instrumentation. 15 Kirk Cameron, Virginia Tech; http://scape.cs.vt.edu/software/powerpack-2-0/
  • 16. Power for QR Factorization LAPACK’s QR Factorization Fork-join based MKL’s QR Factorization Fork-join based PLASMA’s Conventional QR Factorization DAG based PLASMA’s Communication Reducing QR Factorization DAG based dual-socket quad-core Intel Xeon E5462 (Harpertown) processor @ 2.80GHz (8 cores total) w / MLK BLAS 16 matrix size is very tall and skinny (mxn is 1,152,000 by 288)

Notas del editor

  1. Bi-section bandwidth is calculated by knowing the partition sizes and the link bandwidth. For Sequoia, 96 racks will be arranged to16x12x16x16x2.The minimal surface is 12x16x16x2.The bandwidth between the two minimal surfaces is (12x16x16x2) x 2 x 2 x 2  = 49.152 TB/sHere, 2x2 reflects bi-directional 2 GB/s links, another 2 comes from the torus connections that brings a pair of links in each dimension.* Node memory bw for Q is 42.6 gb/s advertised and 28.6 GB/s on STREAM triad.  That&apos;s **NODE**, not per core...For bg/q bisection bw?Missing LatenciesMessage injection ratesFlops/watt cost money and bytes/flop that costs moneyJoule/op in 2009, 2015 and 2018: 2015: 100 pj/opCapacity doesn’t cost as much power as bandwidth how many joules to move a bit 2 picojoule/bit 75pj/bit for accessing DRAM32 Petabytes: with system memory at same fraction of system Need $ numberBest machine for 20MW and best machine for $200MMemory op is 64 bit word of memory 75 picojoule bit for (multiply by 64) (DDR 3 spec) 50 pj/ for an entire 64 bit opMemory technology in 5pj/bit by 2015 if we invest soonAnything more aggressive than 4pj/bit is close to the limit (will not sign up for 2pj/bit)2015 10 pj/flop 5pj/flop in 2018So we are talking 30:1 ratio of memory reference per flop 10pj/operation to bring a byte in8 terabits * 1pj -&gt; 8 wattsJEDEC is fundamentally broken (DDR4 is the end) Low swing differential insertion of known technology20GB/s per component to 1 order of magnitude more 10-12 Gigabits/second per wire16-64 using courant limited scaling of hydro codesCost per DRAM in that timeframe and how much to spend# outstanding memory references per cycle- bandwidth * latency above based on memory reference size memory concurrency 200 cycles from DRAM (2GHz) is 100ns (40ns for memory alone). With queues will be 100ns O(1000) references per node to memory O(10k) for 64 byte cache lines?Need to add system bisection: 2015: whatever local node bandwidth: factor of 4-8 or 2-4 against per-node interconnect bandwidth 2018:Occupancy vs latency: zero occupancy (1 slot for message launch) 5ns per 2-4 in 20152-4 in 201810^4vs 10^9th
  2. Bi-section bandwidth is calculated by knowing the partition sizes and the link bandwidth. For Sequoia, 96 racks will be arranged to16x12x16x16x2.The minimal surface is 12x16x16x2.The bandwidth between the two minimal surfaces is (12x16x16x2) x 2 x 2 x 2  = 49.152 TB/sHere, 2x2 reflects bi-directional 2 GB/s links, another 2 comes from the torus connections that brings a pair of links in each dimension.* Node memory bw for Q is 42.6 gb/s advertised and 28.6 GB/s on STREAM triad.  That&apos;s **NODE**, not per core...For bg/q bisection bw?Missing LatenciesMessage injection ratesFlops/watt cost money and bytes/flop that costs moneyJoule/op in 2009, 2015 and 2018: 2015: 100 pj/opCapacity doesn’t cost as much power as bandwidth how many joules to move a bit 2 picojoule/bit 75pj/bit for accessing DRAM32 Petabytes: with system memory at same fraction of system Need $ numberBest machine for 20MW and best machine for $200MMemory op is 64 bit word of memory 75 picojoule bit for (multiply by 64) (DDR 3 spec) 50 pj/ for an entire 64 bit opMemory technology in 5pj/bit by 2015 if we invest soonAnything more aggressive than 4pj/bit is close to the limit (will not sign up for 2pj/bit)2015 10 pj/flop 5pj/flop in 2018So we are talking 30:1 ratio of memory reference per flop 10pj/operation to bring a byte in8 terabits * 1pj -&gt; 8 wattsJEDEC is fundamentally broken (DDR4 is the end) Low swing differential insertion of known technology20GB/s per component to 1 order of magnitude more 10-12 Gigabits/second per wire16-64 using courant limited scaling of hydro codesCost per DRAM in that timeframe and how much to spend# outstanding memory references per cycle- bandwidth * latency above based on memory reference size memory concurrency 200 cycles from DRAM (2GHz) is 100ns (40ns for memory alone). With queues will be 100ns O(1000) references per node to memory O(10k) for 64 byte cache lines?Need to add system bisection: 2015: whatever local node bandwidth: factor of 4-8 or 2-4 against per-node interconnect bandwidth 2018:Occupancy vs latency: zero occupancy (1 slot for message launch) 5ns per 2-4 in 20152-4 in 201810^4vs 10^9th
  3. The PowerPack platform consists of software and hardware instrumentation. Hardware tools include a WattsUp Pro meter and a NI meter, which connect to the monitored system to measure system-level power and component-level power streams.
  4. The system power rate is measured prior to the AC/DC converter, which has an efficiency of 80-85%So during this process, you are loosing power... Which makes the numbers not adding upIntel Xeon E5462 (Harpertown) processor with 8 DIMMs or 16 GB memory (each DIMM is 2 GB DDR2).