Exascale Computer in 2018?
     A Hardware View
       jkwu@cs.hku.hk
• Hardware Evolution

• Exascale Challenges

• My View

• Industrial & international movements
Hardware Evolution
• Processor/Node Architecture: Multi-core ->
  Many Core
• Acceleration/Coprocessors
   - SIMD units (GPGPUs)
   - FPGAs (field-programmable gate arrays)
• Memory/ I/O Considerations
• Interconnection
Processor/Node Architecture
• Intel Xeon E5-2600 processor: Sandy Bridge microarchitecture
• Released: March 2012
   - Up to 8 cores (16 threads), up to 3.8 GHz (Turbo Boost)
   - DDR3-1600 memory at 51 GB/s
   - 64 KB L1 (3 cycles), 256 KB L2 (8 cycles), 20 MB L3
   - Core-memory: ring-topology interconnect
   - CPU-CPU: QPI interconnect
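As a rough balance check (a sketch: the all-core 3.8 GHz clock and 8 double-precision flops/cycle with AVX are assumed upper bounds, not figures from the slide), peak arithmetic far outruns the listed 51 GB/s of memory bandwidth:

```python
# Peak-vs-bandwidth balance for a Xeon E5-2600 class part (sketch).
# Assumed, not from the slide: all 8 cores at the 3.8 GHz turbo clock (an upper
# bound) and 8 double-precision flops/cycle/core with AVX.
cores, clock_hz, flops_per_cycle = 8, 3.8e9, 8
peak_gflops = cores * clock_hz * flops_per_cycle / 1e9   # ~243 GFLOPS
mem_bw_gbs = 51.0                                        # from the slide
print(f"~{peak_gflops:.0f} GFLOPS peak, ~{mem_bw_gbs / peak_gflops:.2f} bytes/flop")
```

Well under one byte of bandwidth per flop is why memory and I/O get their own bullet in the outline above.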
Processor/Node Architecture
• Intel Knights Corner: Many Integrated Cores (MIC)
• Early Version: Nov, 2011



   - Over 50 simple in-order x86 cores, each running at 1.2 GHz with a 512-bit
     vector processing unit and 4 threads per core; 8 MB of cache
   - ~1 TFLOPS
   - Can be coupled with up to 2 GB of GDDR5 memory; manufactured on a 22 nm
     process with 3D tri-gate transistors
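A quick sanity check of the 1 TFLOPS figure, assuming (not stated on the slide) that each 512-bit vector unit sustains one fused multiply-add per cycle, i.e. 16 double-precision flops/cycle/core:

```python
# Knights Corner peak estimate (sketch).
# Assumption: 512-bit vectors = 8 doubles, one FMA per cycle => 16 DP flops/cycle/core.
cores, clock_hz, dp_flops_per_cycle = 50, 1.2e9, 16
print(f"~{cores * clock_hz * dp_flops_per_cycle / 1e12:.2f} TFLOPS")  # ~0.96 TFLOPS
```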
Processor/Node Architecture
• AMD Opteron 6200 processor: Bulldozer core
• Released: Nov, 2011


   - 4 cores, up to 3.3 GHz
   - 128 KB L1 (4), 1024 KB L2 (4), 16 MB L3
   - 115 W
Processor/Node Architecture
• AMD Llano APU A8-3870K: Fusion
• Released: Dec, 2011



   - 4 x86 cores (Stars architecture), 1 MB L2 per core
   - On-chip GPU with 480 stream processors
Processor/Node Architecture
• IBM Power 7: Power Architecture, multi-core
• Released: Feb, 2010

   - 8 cores, up to 4.25 GHz, 32 threads
   - 32 KB L1 (2 cycles), 256 KB L2 (8 cycles), 32 MB L3 (embedded DRAM)
   - 100 GB/s of memory bandwidth
Coprocessor/GPU Architecture
• NVIDIA Fermi (GeForce GTX 590)/Kepler/Maxwell
• Released: March 2011
   - 16 streaming multiprocessors (SMs), each with 32 stream processors
     (512 CUDA cores)
   - 48 KB of memory per SM (true cache hierarchy + on-chip shared RAM), 768 KB L2
   - 772 MHz core clock
   - 3 GB GDDR5 at 288 GB/s
   - 1.6 TFLOPS peak
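The 1.6 TFLOPS number can be reproduced as a single-precision peak, assuming (not stated on the slide) that the Fermi shader clock runs at twice the 772 MHz core clock and each CUDA core issues one fused multiply-add per shader cycle:

```python
# Fermi single-precision peak (sketch; the 2x shader clock and 1 FMA/core/cycle are assumptions).
cuda_cores, shader_clock_hz, flops_per_cycle = 512, 2 * 772e6, 2
print(f"~{cuda_cores * shader_clock_hz * flops_per_cycle / 1e12:.2f} TFLOPS SP")  # ~1.58
```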
Coprocessor/FPGA Architecture
• Xilinx/Altera/Lattice Semiconductor FPGAs typically interface to PCI/PCIe
  buses and can accelerate compute-intensive applications by orders of magnitude.
Petascale Parallel Architectures: Blue Waters
Petascale Parallel Architectures: XT6
Current Petascale Parallel Platforms
Heterogeneous Platforms: Tianhe-1A
• 14,336 Intel Xeon X5670 processors and 7,168 NVIDIA Tesla M2050
  general-purpose GPUs

• Theoretical peak performance of 4.701 PFLOPS

• 2 PB disk and 262 TB RAM

• The Arch interconnect links the server nodes together with optical-electric
  cables in a hybrid fat-tree configuration
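The 4.701 PFLOPS figure follows directly from the part counts, assuming roughly 70 GFLOPS of double-precision peak per Xeon X5670 (6 cores × 2.93 GHz × 4 flops/cycle) and the Tesla M2050's nominal 515 GFLOPS; these per-part peaks are assumptions, not on the slide:

```python
# Tianhe-1A theoretical peak (sketch; per-part peaks are assumed, not slide figures).
xeon_x5670 = 6 * 2.93e9 * 4    # ~70.3 GFLOPS DP per processor
tesla_m2050 = 515e9            # nominal DP peak per GPU
total = 14336 * xeon_x5670 + 7168 * tesla_m2050
print(f"~{total / 1e15:.3f} PFLOPS")   # ~4.700 PFLOPS, matching the quoted figure
```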
Heterogeneous Platforms: RoadRunner
From 10 to 1000 PFLOPS
Several critical issues must be addressed:
• Power (GFLOPS/W; see the sketch after this list)
• Fault Tolerance (MTBF and high component count)
• Node Performance (esp. in view of limited memory)
• I/O (esp. in view of limited I/O bandwidth)
• Heterogeneity (regarding application composition)
• (and many more to come)
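To put the GFLOPS/W bullet in perspective, here is a minimal sketch of the wall power a 1 EFLOPS machine would draw at different efficiencies; the ~20 MW reference point in the comment is an assumption, not from the slide:

```python
# Wall power for 1 EFLOPS at various efficiencies (sketch; ~20 MW budget is an assumed reference).
exaflops = 1e18
for gflops_per_watt in (0.5, 2, 10, 50):
    megawatts = exaflops / (gflops_per_watt * 1e9) / 1e6
    print(f"{gflops_per_watt:4.1f} GFLOPS/W -> {megawatts:6.0f} MW")
# Only around 50 GFLOPS/W brings the machine near a ~20 MW budget.
```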
Exascale Hardware Challenges
•   Power consumption
•   Concurrency
•   Scalability
•   Resiliency
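Why resiliency makes the list: with independent failures, system MTBF shrinks roughly linearly with component count. A sketch with an illustrative, assumed 5-year per-component MTBF:

```python
# System MTBF vs. component count under independent failures (sketch; the 5-year
# per-component MTBF is an illustrative assumption).
component_mtbf_h = 5 * 365 * 24
for parts in (1e4, 1e6, 1e8):
    print(f"{parts:.0e} components -> system MTBF ~{component_mtbf_h / parts * 60:.2f} minutes")
```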
Architectures Considered
• Evolutionary Strawmen
 – “Heavyweight” Strawman based on commodity-derived microprocessors
 – “Lightweight” Strawman based on custom microprocessors


• Aggressive Strawmen
 – “Clean Sheet of Paper” CMOS Silicon
Evolutionary Scaling Assumptions
• Applications will demand same DRAM/Flops ratio as today
• Ignore any changes needed in disk capacity
• Processor die size will remain constant
• Continued reduction in device area => multi-core chips
• Vdd and max power dissipation will flatten as forecast
  – Thus clock rates limited as before
• On a per-core basis, micro-architecture will improve from 2 flops/cycle to 4
  in 2008, and 8 in 2015
• Max # of sockets per board will double roughly every 5 years
• Max # of boards per rack will increase once by 33%
• Max power per rack will double every 3 years
• Allow growth in system configuration by 50 racks each year
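A toy projection under the assumptions above; the 4-core 2008 starting point, the core count doubling every two years, and the flat 3 GHz clock are added assumptions, not the DARPA report's actual model:

```python
# Toy per-socket peak projection (sketch, not the DARPA model).
# Added assumptions: 4 cores in 2008 doubling every 2 years, flat 3 GHz clock.
clock_hz = 3e9
for year in range(2008, 2019, 2):
    cores = 4 * 2 ** ((year - 2008) // 2)
    flops_per_cycle = 8 if year >= 2015 else 4   # per the micro-architecture assumption above
    peak_gflops = cores * clock_hz * flops_per_cycle / 1e9
    print(f"{year}: {cores:4d} cores, ~{peak_gflops:5.0f} GFLOPS per socket")
```

Even the optimistic ~3 TFLOPS per socket this yields for 2018 implies on the order of 300,000 sockets for a peak exaflop, before memory, network, and I/O are counted; the power models below show where that leads.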
The Power Models
• Simplistic: A highly optimistic model
  – Max power per die grows as per ITRS
  – Power for memory grows only linearly with # of chips
    • Power per memory chip remains constant
  – Power for routers and common logic remains constant
    • Regardless of obvious need to increase bandwidth
  – True if energy per bit moved/accessed decreases as fast as “flops per
    second” increase

• Fully Scaled: A pessimistic model
  – Same as Simplistic, except memory & router power grow with peak flops
    per chip
  – True if energy per bit moved/accessed remains constant
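A minimal sketch contrasting the two models. The per-node baseline watts are assumed placeholders and chip power is held flat for simplicity; only the scaling behavior matters:

```python
# Contrast of the two power models (sketch; baseline watts are assumed placeholders).
def node_power_w(flops_growth, model, chip_w=150.0, mem_w=60.0, net_w=40.0):
    # Simplistic: memory and router power stay constant per node.
    # Fully Scaled: memory and router power grow with peak flops per chip.
    scale = flops_growth if model == "fully_scaled" else 1.0
    return chip_w + scale * (mem_w + net_w)

for growth in (1, 4, 16, 64):   # peak flops per chip relative to today
    print(f"{growth:2d}x flops/chip: "
          f"Simplistic ~{node_power_w(growth, 'simplistic'):.0f} W/node, "
          f"Fully Scaled ~{node_power_w(growth, 'fully_scaled'):.0f} W/node")
```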
The Prediction: Heavyweight
The Prediction: Lightweight
Architectures Considered
• Evolutionary Strawmen: NOT FEASIBLE
 – “Heavyweight” Strawman based on commodity-derived microprocessors
 – “Lightweight” Strawman based on custom microprocessors


• Aggressive Strawmen
 – “Clean Sheet of Paper” CMOS Silicon
The Prediction: Aggressive
A Whole Picture
Why?
Supply voltages are unlikely to decrease significantly.
Processor clocks are unlikely to increase significantly.

Die power consumption flattens.
Clock rates decrease as power is constrained.

Power consumption per flop flattens.
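The relation behind these observations is dynamic power P ~= C * Vdd^2 * f: with supply voltage no longer dropping and die power capped, frequency has nowhere to go. A sketch with illustrative, assumed values for switched capacitance and the power cap:

```python
# Why flat Vdd plus a capped die power pins the clock: P ~= C * Vdd^2 * f (sketch;
# c_eff and the power cap are illustrative assumptions).
c_eff, power_cap_w = 30e-9, 100.0
for vdd in (1.3, 1.0, 0.9):
    max_clock_ghz = power_cap_w / (c_eff * vdd ** 2) / 1e9
    print(f"Vdd = {vdd:.1f} V -> affordable clock ~{max_clock_ghz:.1f} GHz")
# With Vdd stuck near 0.9-1.0 V, the affordable clock saturates in the low GHz range.
```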
Fault Tolerance
My View (based on DARPA report)
• Power is a major consideration
• Faults and fault tolerance are major issues
• Constraints on power density constrain processor speed –
   thus emphasizing concurrency
• Levels of concurrency needed to reach exascale are projected
   to be over 10^9 cores
• For these reasons, an evolutionary path to an exaflop is unlikely
   to succeed by 2018; at best, around 2020
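The 10^9 figure is a direct consequence of power-constrained clocks; a one-loop check, assuming each core sustains only a few flops per cycle at 1-2 GHz (assumed rates, consistent with the argument above):

```python
# Cores needed for 1 EFLOPS at power-constrained clocks (sketch; per-core rates are assumed).
exaflops = 1e18
for clock_hz, flops_per_cycle in ((2e9, 4), (1.5e9, 2), (1e9, 1)):
    cores = exaflops / (clock_hz * flops_per_cycle)
    print(f"{clock_hz / 1e9:.1f} GHz x {flops_per_cycle} flops/cycle -> ~{cores:.1e} cores")
```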
Intel
NVIDIA Echelon Project: Extreme-scale Computer
    Hierarchies with Efficient Locality-Optimized Nodes
                                               • 64 NoC (Network on Chip), each with
                                                 4 SMs, each SM with 8 SM Lanes
                                               • 8 LOC (latency optimized core)
                                               • 2.5GHz
                                               • 10nm



                                                         Chip Floorplan
               Node and System
Objectives: 16 TFLOPS (double precision) per chip in 2018, in the best case
• 100X better application energy efficiency than today’s CPU systems
• Improved programmer productivity
• Strong scaling for many applications
• High AMTT
• Machines resilient to attack
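Scaling the 16 TFLOPS/chip objective up to an exaflop shows why the 100X energy-efficiency goal matters; the 20 MW system budget below is an assumption, not from the slide:

```python
# From 16 TFLOPS chips to an exaflop system (sketch; the 20 MW budget is assumed).
chip_tflops, budget_mw = 16, 20
chips = 1e18 / (chip_tflops * 1e12)        # ~62,500 chips
watts_per_chip = budget_mw * 1e6 / chips   # ~320 W per chip, incl. its share of memory and network
print(f"~{chips:.0f} chips, ~{watts_per_chip:.0f} W/chip, "
      f"~{chip_tflops * 1e3 / watts_per_chip:.0f} GFLOPS/W required")
```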
DOE’s View
DOE’s points on Exascale System
• Voltage scaling to reduce power and energy
   - Explodes parallelism
   - Cost of communication vs. computation: a critical balance
• It’s not about the FLOPS; it’s about data movement.
   - Algorithms should be designed to perform more work per unit of data movement.
   - Programming systems should further optimize this data movement.
   - Architectures should facilitate this by providing an exposed hierarchy
     and efficient communication.
• System software to orchestrate all of the above
   - Self-aware operating system
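An illustration of why data movement, not flops, dominates; the picojoule figures are assumed order-of-magnitude placeholders, not slide or DOE numbers:

```python
# Energy of a flop vs. moving its operands (sketch; pJ values are assumed placeholders).
pj_per_dp_flop = 20          # one double-precision operation
pj_per_byte_off_chip = 200   # fetching one byte from off-chip DRAM
bytes_per_operand = 8
ratio = pj_per_byte_off_chip * bytes_per_operand / pj_per_dp_flop
print(f"One off-chip operand costs ~{ratio:.0f}x the energy of the flop that consumes it")
```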
DOE’s Timeline
European: Dynamical Exascale Entry
         Platform (DEEP)

                            Start: 1st Dec 2011
                            Duration: 3 years
                            Budget: 18.5 M€
DEEP System: A fusion of general purpose
and high scalability supercomputers
China Exascale Plans
• 12th Five-Year Plan (2011-2015)
  - Seven petascale HPC systems
  - At least one at 50-100 PFLOPS
  - Budget: CNY 4 billion

• 13th Five-Year Plan (2016-2020)
  - A 1-10 EFLOPS HPC system
Thank you!
