Exascale Computer in 2018?
     A Hardware View
       jkwu@cs.hku.hk
• Hardware Evolution

• Exascale Challenges

• My View

• Industrial & international movements
Hardware Evolution
• Processor/Node Architecture: Multi-core ->
  Many Core
• Acceleration/Coprocessors
   - SIMD units (GPGPUs)
   - FPGAs (field-programmable gate arrays)
• Memory/ I/O Considerations
• Interconnection
Processor/Node Architecture
• Intel Xeon E5-2600 processor: Sandy Bridge microarchitecture
• Released: March 2012
   - Up to 8 cores (16 threads), up to 3.8 GHz (Turbo Boost)
   - DDR3-1600 memory at 51 GB/s
   - 64 KB L1 (3 cycles), 256 KB L2 (8 cycles), 20 MB L3
   - Core-memory: ring-topology interconnect
   - CPU-CPU: QPI interconnect
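As a rough balance check (a sketch: the all-core 3.8 GHz clock and 8 double-precision flops/cycle with AVX are assumed upper bounds, not figures from the slide), peak arithmetic far outruns the listed 51 GB/s of memory bandwidth:

```python
# Peak-vs-bandwidth balance for a Xeon E5-2600 class part (sketch).
# Assumed, not from the slide: all 8 cores at the 3.8 GHz turbo clock (an upper
# bound) and 8 double-precision flops/cycle/core with AVX.
cores, clock_hz, flops_per_cycle = 8, 3.8e9, 8
peak_gflops = cores * clock_hz * flops_per_cycle / 1e9   # ~243 GFLOPS
mem_bw_gbs = 51.0                                        # from the slide
print(f"~{peak_gflops:.0f} GFLOPS peak, ~{mem_bw_gbs / peak_gflops:.2f} bytes/flop")
```

Well under one byte of bandwidth per flop is why memory and I/O get their own bullet in the outline above.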
Processor/Node Architecture
• Intel Knights Corner: Many Integrated Cores (MIC)
• Early Version: Nov, 2011



   - Over 50 simple in-order x86 cores, each running at 1.2 GHz with a 512-bit
     vector processing unit and 4 threads per core; 8 MB of cache
   - ~1 TFLOPS
   - Can be coupled with up to 2 GB of GDDR5 memory; manufactured on a 22 nm
     process with 3D tri-gate transistors
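A quick sanity check of the 1 TFLOPS figure, assuming (not stated on the slide) that each 512-bit vector unit sustains one fused multiply-add per cycle, i.e. 16 double-precision flops/cycle/core:

```python
# Knights Corner peak estimate (sketch).
# Assumption: 512-bit vectors = 8 doubles, one FMA per cycle => 16 DP flops/cycle/core.
cores, clock_hz, dp_flops_per_cycle = 50, 1.2e9, 16
print(f"~{cores * clock_hz * dp_flops_per_cycle / 1e12:.2f} TFLOPS")  # ~0.96 TFLOPS
```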
Processor/Node Architecture
• AMD Opteron 6200 processor: Bulldozer core
• Released: Nov, 2011


   - 4 cores, up to 3.3 GHz
   - 128 KB L1 (4), 1024 KB L2 (4), 16 MB L3
   - 115 W
Processor/Node Architecture
• AMD Llano APU A8-3870K: Fusion
• Released: Dec, 2011



   - 4 x86 cores (Stars architecture), 1 MB L2 per core
   - On-chip GPU with 480 stream processors
Processor/Node Architecture
• IBM Power 7: Power Architecture, multi-core
• Released: Feb, 2010

   - 8 cores, up to 4.25 GHz, 32 threads
   - 32 KB L1 (2 cycles), 256 KB L2 (8 cycles), 32 MB L3 (embedded DRAM)
   - 100 GB/s of memory bandwidth
Coprocessor/GPU Architecture
• NVIDIA Fermi (GeForce GTX 590)/Kepler/Maxwell
• Released: March 2011
   - 16 streaming multiprocessors (SMs), each with 32 stream processors
     (512 CUDA cores)
   - 48 KB of memory per SM (true cache hierarchy + on-chip shared RAM), 768 KB L2
   - 772 MHz core clock
   - 3 GB GDDR5 at 288 GB/s
   - 1.6 TFLOPS peak
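The 1.6 TFLOPS number can be reproduced as a single-precision peak, assuming (not stated on the slide) that the Fermi shader clock runs at twice the 772 MHz core clock and each CUDA core issues one fused multiply-add per shader cycle:

```python
# Fermi single-precision peak (sketch; the 2x shader clock and 1 FMA/core/cycle are assumptions).
cuda_cores, shader_clock_hz, flops_per_cycle = 512, 2 * 772e6, 2
print(f"~{cuda_cores * shader_clock_hz * flops_per_cycle / 1e12:.2f} TFLOPS SP")  # ~1.58
```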
Coprocessor/FPGA Architecture
• Xilinx/Altera/Lattice Semiconductor FPGAs typically interface to PCI/PCIe
  buses and can accelerate compute-intensive applications by orders of magnitude.
Petascale Parallel Architectures: Blue Waters
Petascale Parallel Architectures: XT6
Current Petascale Parallel Platforms
Heterogeneous Platforms: Tianhe-1A
• 14,336 Intel Xeon X5670 processors and 7,168 NVIDIA Tesla M2050
  general-purpose GPUs

• Theoretical peak performance of 4.701 PFLOPS

• 2 PB disk and 262 TB RAM

• The Arch interconnect links the server nodes together with optical-electric
  cables in a hybrid fat-tree configuration
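The 4.701 PFLOPS figure follows directly from the part counts, assuming roughly 70 GFLOPS of double-precision peak per Xeon X5670 (6 cores × 2.93 GHz × 4 flops/cycle) and the Tesla M2050's nominal 515 GFLOPS; these per-part peaks are assumptions, not on the slide:

```python
# Tianhe-1A theoretical peak (sketch; per-part peaks are assumed, not slide figures).
xeon_x5670 = 6 * 2.93e9 * 4    # ~70.3 GFLOPS DP per processor
tesla_m2050 = 515e9            # nominal DP peak per GPU
total = 14336 * xeon_x5670 + 7168 * tesla_m2050
print(f"~{total / 1e15:.3f} PFLOPS")   # ~4.700 PFLOPS, matching the quoted figure
```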
Heterogeneous Platforms: RoadRunner
From 10 to 1000 PFLOPS
Several critical issues must be addressed:
• Power (GFLOPS/W; see the sketch after this list)
• Fault Tolerance (MTBF and high component count)
• Node Performance (esp. in view of limited memory)
• I/O (esp. in view of limited I/O bandwidth)
• Heterogeneity (regarding application composition)
• (and many more to come)
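To put the GFLOPS/W bullet in perspective, here is a minimal sketch of the wall power a 1 EFLOPS machine would draw at different efficiencies; the ~20 MW reference point in the comment is an assumption, not from the slide:

```python
# Wall power for 1 EFLOPS at various efficiencies (sketch; ~20 MW budget is an assumed reference).
exaflops = 1e18
for gflops_per_watt in (0.5, 2, 10, 50):
    megawatts = exaflops / (gflops_per_watt * 1e9) / 1e6
    print(f"{gflops_per_watt:4.1f} GFLOPS/W -> {megawatts:6.0f} MW")
# Only around 50 GFLOPS/W brings the machine near a ~20 MW budget.
```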
Exascale Hardware Challenges
•   Power consumption
•   Concurrency
•   Scalability
•   Resiliency
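Why resiliency makes the list: with independent failures, system MTBF shrinks roughly linearly with component count. A sketch with an illustrative, assumed 5-year per-component MTBF:

```python
# System MTBF vs. component count under independent failures (sketch; the 5-year
# per-component MTBF is an illustrative assumption).
component_mtbf_h = 5 * 365 * 24
for parts in (1e4, 1e6, 1e8):
    print(f"{parts:.0e} components -> system MTBF ~{component_mtbf_h / parts * 60:.2f} minutes")
```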
Architectures Considered
• Evolutionary Strawmen
 – “Heavyweight” Strawman based on commodity-derived microprocessors
 – “Lightweight” Strawman based on custom microprocessors


• Aggressive Strawmen
 – “Clean Sheet of Paper” CMOS Silicon
Evolutionary Scaling Assumptions
• Applications will demand same DRAM/Flops ratio as today
• Ignore any changes needed in disk capacity
• Processor die size will remain constant
• Continued reduction in device area => multi-core chips
• Vdd and max power dissipation will flatten as forecast
  – Thus clock rates limited as before
• On a per-core basis, micro-architecture will improve from 2 flops/cycle to 4
  in 2008, and 8 in 2015
• Max # of sockets per board will double roughly every 5 years
• Max # of boards per rack will increase once by 33%
• Max power per rack will double every 3 years
• Allow growth in system configuration by 50 racks each year
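A toy projection under the assumptions above; the 4-core 2008 starting point, the core count doubling every two years, and the flat 3 GHz clock are added assumptions, not the DARPA report's actual model:

```python
# Toy per-socket peak projection (sketch, not the DARPA model).
# Added assumptions: 4 cores in 2008 doubling every 2 years, flat 3 GHz clock.
clock_hz = 3e9
for year in range(2008, 2019, 2):
    cores = 4 * 2 ** ((year - 2008) // 2)
    flops_per_cycle = 8 if year >= 2015 else 4   # per the micro-architecture assumption above
    peak_gflops = cores * clock_hz * flops_per_cycle / 1e9
    print(f"{year}: {cores:4d} cores, ~{peak_gflops:5.0f} GFLOPS per socket")
```

Even the optimistic ~3 TFLOPS per socket this yields for 2018 implies on the order of 300,000 sockets for a peak exaflop, before memory, network, and I/O are counted; the power models below show where that leads.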
The Power Models
• Simplistic: A highly optimistic model
  – Max power per die grows as per ITRS
  – Power for memory grows only linearly with # of chips
    • Power per memory chip remains constant
  – Power for routers and common logic remains constant
    • Regardless of obvious need to increase bandwidth
  – True if energy per bit moved/accessed decreases as fast as “flops per
    second” increase

• Fully Scaled: A pessimistic model
  – Same as Simplistic, except memory & router power grow with peak flops
    per chip
  – True if energy per bit moved/accessed remains constant
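A minimal sketch contrasting the two models. The per-node baseline watts are assumed placeholders and chip power is held flat for simplicity; only the scaling behavior matters:

```python
# Contrast of the two power models (sketch; baseline watts are assumed placeholders).
def node_power_w(flops_growth, model, chip_w=150.0, mem_w=60.0, net_w=40.0):
    # Simplistic: memory and router power stay constant per node.
    # Fully Scaled: memory and router power grow with peak flops per chip.
    scale = flops_growth if model == "fully_scaled" else 1.0
    return chip_w + scale * (mem_w + net_w)

for growth in (1, 4, 16, 64):   # peak flops per chip relative to today
    print(f"{growth:2d}x flops/chip: "
          f"Simplistic ~{node_power_w(growth, 'simplistic'):.0f} W/node, "
          f"Fully Scaled ~{node_power_w(growth, 'fully_scaled'):.0f} W/node")
```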
The Prediction: Heavyweight
The Prediction: Lightweight
Architectures Considered
• Evolutionary Strawmen: NOT FEASIBLE
 – “Heavyweight” Strawman based on commodity-derived microprocessors
 – “Lightweight” Strawman based on custom microprocessors


• Aggressive Strawmen
 – “Clean Sheet of Paper” CMOS Silicon
The Prediction: Aggressive
A Whole Picture
Why?
Supply voltages are unlikely to decrease significantly.
Processor clocks are unlikely to increase significantly.

Die power consumption flattens.
Clock rates decrease as power is constrained.

Power consumption per flop flattens.
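The relation behind these observations is dynamic power P ~= C * Vdd^2 * f: with supply voltage no longer dropping and die power capped, frequency has nowhere to go. A sketch with illustrative, assumed values for switched capacitance and the power cap:

```python
# Why flat Vdd plus a capped die power pins the clock: P ~= C * Vdd^2 * f (sketch;
# c_eff and the power cap are illustrative assumptions).
c_eff, power_cap_w = 30e-9, 100.0
for vdd in (1.3, 1.0, 0.9):
    max_clock_ghz = power_cap_w / (c_eff * vdd ** 2) / 1e9
    print(f"Vdd = {vdd:.1f} V -> affordable clock ~{max_clock_ghz:.1f} GHz")
# With Vdd stuck near 0.9-1.0 V, the affordable clock saturates in the low GHz range.
```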
Fault Tolerance
My View (based on DARPA report)
• Power is a major consideration
• Faults and fault tolerance are major issues
• Constraints on power density constrain processor speed –
   thus emphasizing concurrency
• Levels of concurrency needed to reach exascale are projected
   to be over 10^9 cores
• For these reasons, an evolutionary path to an exaflop is unlikely
   to succeed by 2018; at best, around 2020
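The 10^9 figure is a direct consequence of power-constrained clocks; a one-loop check, assuming each core sustains only a few flops per cycle at 1-2 GHz (assumed rates, consistent with the argument above):

```python
# Cores needed for 1 EFLOPS at power-constrained clocks (sketch; per-core rates are assumed).
exaflops = 1e18
for clock_hz, flops_per_cycle in ((2e9, 4), (1.5e9, 2), (1e9, 1)):
    cores = exaflops / (clock_hz * flops_per_cycle)
    print(f"{clock_hz / 1e9:.1f} GHz x {flops_per_cycle} flops/cycle -> ~{cores:.1e} cores")
```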
Intel
NVIDIA Echelon Project: Extreme-scale Computer
    Hierarchies with Efficient Locality-Optimized Nodes
                                               • 64 NoC (Network on Chip), each with
                                                 4 SMs, each SM with 8 SM Lanes
                                               • 8 LOC (latency optimized core)
                                               • 2.5GHz
                                               • 10nm



                                                         Chip Floorplan
               Node and System
Objectives: 16 TFLOPS (double precision) per chip in 2018, in the best case
• 100X better application energy efficiency than today’s CPU systems
• Improved programmer productivity
• Strong scaling for many applications
• High AMTT
• Machines resilient to attack
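Scaling the 16 TFLOPS/chip objective up to an exaflop shows why the 100X energy-efficiency goal matters; the 20 MW system budget below is an assumption, not from the slide:

```python
# From 16 TFLOPS chips to an exaflop system (sketch; the 20 MW budget is assumed).
chip_tflops, budget_mw = 16, 20
chips = 1e18 / (chip_tflops * 1e12)        # ~62,500 chips
watts_per_chip = budget_mw * 1e6 / chips   # ~320 W per chip, incl. its share of memory and network
print(f"~{chips:.0f} chips, ~{watts_per_chip:.0f} W/chip, "
      f"~{chip_tflops * 1e3 / watts_per_chip:.0f} GFLOPS/W required")
```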
DOE’s View
DOE’s points on Exascale System
• Voltage scaling to reduce power and energy
   - Explodes parallelism
   - Cost of communication vs. computation: a critical balance
• It’s not about the FLOPS; it’s about data movement.
   - Algorithms should be designed to perform more work per unit of data movement.
   - Programming systems should further optimize this data movement.
   - Architectures should facilitate this by providing an exposed hierarchy
     and efficient communication.
• System software to orchestrate all of the above
   - Self-aware operating system
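An illustration of why data movement, not flops, dominates; the picojoule figures are assumed order-of-magnitude placeholders, not slide or DOE numbers:

```python
# Energy of a flop vs. moving its operands (sketch; pJ values are assumed placeholders).
pj_per_dp_flop = 20          # one double-precision operation
pj_per_byte_off_chip = 200   # fetching one byte from off-chip DRAM
bytes_per_operand = 8
ratio = pj_per_byte_off_chip * bytes_per_operand / pj_per_dp_flop
print(f"One off-chip operand costs ~{ratio:.0f}x the energy of the flop that consumes it")
```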
DOE’s Timeline
European: Dynamical Exascale Entry
         Platform (DEEP)

                            Start: 1st Dec 2011
                            Duration: 3 years
                            Budget: 18.5 M€
DEEP System: A fusion of general purpose
and high scalability supercomputers
China Exascale Plans
• 12th Five-Year Plan (2011-2015)
  - Seven petascale HPC systems
  - At least one at 50-100 PFLOPS
  - Budget: CNY 4 billion

• 13th Five-Year Plan (2016-2020)
  - A 1-10 EFLOPS HPC system
Thank you!
