2. Clusters vs. Shared Memory Architecture
Small Node x86 Clusters (Commodity Interconnect)
[Diagram: several systems, each with its own memory and its own OS]
• Each system has own memory and OS
• Batch, not interactive user interface
• Coding required for parallel code execution
• Great for capacity workflows
• SGI® Altix XE x86-64 clusters, Rackable BTO
SGI® Altix™ 4000 Family, UV (SGI® NUMAflex™ Interconnect)
[Diagram: many systems attached to one global shared memory under a single OS]
• All nodes operate on one large shared memory space
• Cache Coherency
• Eliminates data passing between nodes
• Big data sets fit entirely in memory
• Less memory per node required
• Simpler to program
• High Performance, Low Cost, Easy to Deploy
Company Confidential 2
3. InfiniBand vs. NUMAlink™ Interconnect
Interconnect Type             Bandwidth (each direction)
InfiniBand 4xDDR              2.0 GBytes/s
InfiniBand 4xQDR              4.0 GBytes/s
NUMAlink4 (Altix 4700/450)    3.2 GBytes/s
NUMAlink5 (UV)                7.5 GBytes/s
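To put the table in perspective, the sketch below estimates the best-case one-way time to move a payload over each link, using only the per-direction bandwidths listed above. It ignores latency, protocol overhead, and contention, and the 1 GB payload is an arbitrary size chosen for illustration, not a figure from the slides.

```python
# Per-direction bandwidths from the slide, in GBytes/s.
BANDWIDTH_GB_PER_S = {
    "InfiniBand 4xDDR": 2.0,
    "InfiniBand 4xQDR": 4.0,
    "NUMAlink4 (Altix 4700/450)": 3.2,
    "NUMAlink5 (UV)": 7.5,
}


def transfer_time_ms(payload_gb: float, bandwidth_gb_per_s: float) -> float:
    """Best-case one-way transfer time in milliseconds (bandwidth term only)."""
    return payload_gb / bandwidth_gb_per_s * 1000.0


if __name__ == "__main__":
    payload_gb = 1.0  # assumption: illustrative payload size
    for name, bw in BANDWIDTH_GB_PER_S.items():
        print(f"{name:28s} {transfer_time_ms(payload_gb, bw):7.1f} ms")
```

For small messages, latency rather than bandwidth usually dominates, so this estimate is only meaningful for large transfers.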
6. SGI Scalable ccNUMA Architecture
Basic Node Structure and Interconnect
[Diagram: two nodes, each with two CPUs and their caches, an Interface Chip, and Physical Memory; the NUMAlink Interconnect joins the nodes so that their physical memories form one Shared Memory]
7. SGI Scalable ccNUMA Architecture
Scaling to Large Node Counts
[Diagram: a row of nodes, each with CPUs and their caches, an Interface Chip, and (Local) Physical Memory, connected by NUMAlink and routers]
• Shared Memory within an SSI: OpenMP
• Globally Addressable Memory (GAM) within a NUMAlinked system: MPI
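The two programming styles named on this slide can be contrasted with a small sketch. It is not OpenMP or MPI code: it models the idea with Python threads, where one function lets every worker read the same data in place (shared-memory style) and the other hands each worker an explicit copy through a queue (message-passing style). The function names and the worker count are hypothetical.

```python
import queue
import threading


def shared_memory_sum(data, n_workers=4):
    """Shared-memory style: every worker reads the same list in place."""
    partial = [0] * n_workers

    def work(rank):
        # Direct loads from the shared data; nothing is copied or sent.
        partial[rank] = sum(data[i] for i in range(rank, len(data), n_workers))

    threads = [threading.Thread(target=work, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)


def message_passing_sum(data, n_workers=4):
    """Message-passing style: each worker gets its chunk as an explicit message."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    results = queue.Queue()

    def work(rank):
        chunk = inboxes[rank].get()   # explicit receive
        results.put(sum(chunk))       # explicit send of the partial result

    threads = [threading.Thread(target=work, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for rank in range(n_workers):
        inboxes[rank].put(list(data[rank::n_workers]))  # explicit copy and send
    for t in threads:
        t.join()
    return sum(results.get() for _ in range(n_workers))
```

Both functions return the same sum; the difference is that the second one has to partition, copy, and send the data before any work can start.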
8. Communication vs. Computation
Applications on Altix 3000
[Chart: stacked bars from 0% to 100% showing the Computation vs. Communication split for each application: Nastran/4, Pam-Crash/32, Ls-Dyna/48p, Radioss/96, Fluent/64, StarHPC/32, Fire/32, Gamess/32, Amber/8, CASTEP/128, ADF/32, HOMME/1944, MM5/96, HIRLAM/128, CCM3/64, IFS/120, GeoDepth, Eclipse/52, VIP/32; domain groups: CSM, CFD, CCM, BIO, CWO, SPI, RE]
9. Why MOE (MPI Offload Engine)?
10. SGI® Project Ultraviolet – Overview
Extraordinary Capability in an x86 Architecture
• Performance and Productivity for Demanding Workloads
• Highly Data-Efficient – up to Many Terabytes of Data in Memory
• Scales to 2048 Cores and 16TB in a Single x86 System
• Scales IO to >1TB/s
• Advanced Reliability
• Hardware-enabled Fault Detection, Prevention, Containment
• Enhanced Monitoring and Serviceability
• Low TCO
• X86-64 and Linux Economics
• Industry Leading Rack-level Energy Efficiency
• Easiest System to Administer and Productively Use
11. UV Architectural Scalability
16,384 Nodes (scaling supported by NUMAlink5 node ID)
– 16,384 UV_HUBs
– 32,768 Sockets / 262,144 Cores (with 8 cores per socket)
– >2 pflop
Coherent shared memory
– Xeon: 16TB (44 bits socket PA)
8PB coherent get/put memory (53 bits PA w/GRU)
16 DIMMs per node (2 DIMMs per channel)
Intel coherence scheme within node
SGI coherence scheme between nodes
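The scaling figures on this slide follow from simple arithmetic, which the sketch below reproduces. The two-sockets-per-node value is an assumption inferred from the 262,144-core total at 8 cores per socket across 16,384 nodes; the address-space sizes are just powers of two of the stated physical address widths.

```python
NODES = 16_384
SOCKETS_PER_NODE = 2   # assumption: inferred from the core total, not stated on the slide
CORES_PER_SOCKET = 8

TIB = 1 << 40  # bytes in one binary terabyte
PIB = 1 << 50  # bytes in one binary petabyte


def addressable_bytes(address_bits: int) -> int:
    """Bytes reachable with a physical address of the given width."""
    return 1 << address_bits


sockets = NODES * SOCKETS_PER_NODE
cores = sockets * CORES_PER_SOCKET

if __name__ == "__main__":
    print(f"sockets: {sockets:,}  cores: {cores:,}")
    print(f"44-bit PA: {addressable_bytes(44) // TIB} TB")
    print(f"53-bit PA: {addressable_bytes(53) // PIB} PB")
```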
12. UV Accelerated Performance
For Distributed or Shared Memory Programming
MPI Offload Engine (MOE) frees cpu from MPI activity
- MPI Reductions 2-3X faster than competitive clusters/MPPs
- barriers up to 80X+ faster
NUMAlink Advances – industry’s most efficient interconnect
Massively Memory-mapped I/O
- Big speedup for I/O bound apps
Hold massive datasets in memory
- to 16TB per OS system image, to petascale across systems
13. UV Accelerated Performance
For Distributed or Shared Memory Programming
MPI Offload Engine (MOE) frees CPU from MPI activity
- MPI Reductions 2-3X faster than competitive clusters/MPPs
- Barriers up to 80X+ faster
NUMAlink Advances
- 2-3X MPI latency improvement
Massively Memory-mapped I/O
- Big speedup for I/O bound apps
Hold massive datasets in memory
- to 16TB per OS image, to petascale across systems
- Up to 10X+ speedup for data-intensive applications
[Chart: Longest Path MPI Latency (y-axis 0 to 6) vs. Destination CPU (x-axis 0 to 2000) for Altix 4700, Altix ICE, and UV]
14. UV Low TCO
Economical to own and operate
Excellent Price/performance
– x86 economics plus UV performance advantages
– 3-5X compared to today’s Altix
– Can take the place of multiple systems
Leading Rack-level Power Efficiency
– UV stretch goal = 80%
Most Economical System
– to administer and use
[Chart: Delivered Rack-Level Power Efficiency, y-axis 55% to 80%, for Origin 2000, Origin 3000, Altix 3000, Altix 4000, Ultraviolet, and Carlsbad; data labels 60%, 65%, 70%, 75%, 78%]
15. Project Ultraviolet Product Design
• Bladed Node Package
• Memory or compute-dense blades
• Variety of IO expansion options
• Mix/match resources
• Expand or reconfigure when needed
• Industry-leading Scalability
• Run standard Linux Distros
• Red Hat, SLES
16. IRU (Chassis) Packaging and Topology
• 16 blade IRU for 24” rack
• 2 blade IRU for 19” rack
• Compute node with IO expansion capability
[Diagram: IRU packaging for 24” EIA rack; callouts: 18U, 3U, N+1 PS, 1+1 PS, Blowers]
24” IRU Topology
• (8) NUMAlink 5 Ports per Router Cabled to Network
• (8) NUMAlink 5 Fan-In Ports per Router
• Paired Nodes (Dual NUMAlink 5 Cross-Linked)
17. Ultraviolet Rack
• (64) Intel® Xeon® Sockets
• (512) Intel Xeon Cores
• (512) DDR3 RDIMMs
• 128GB / node (w/ 8GB DIMMs)
• 4TB / rack (w/ 8GB DIMMs)
• Integrated BaseIO & Boot HDDs
• Integrated or External IO Expansion
• SGI® NUMAlink™ 5 network
• (1) System Management Node per up to 4-racks
• Blade-based packaging
• Air-Cooled electronics
• N+1 12VDC Power Supplies
• N+1 Axial Fans
• (2) 60A 200VAC-240VAC 3-Φ IEC 60309 plugs provide 17.3 kVA each
• Rack Nameplate 34.5 kVA max
• Optional water-cooling
• Leverages SGI® Altix® ICE 8200
• IO Expansion for higher power or larger form factor cards
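The per-rack totals above are mutually consistent, as the short check below shows. The 8-cores-per-socket value is an assumption implied by 512 cores on 64 sockets; the node count per rack is derived here and does not appear on the slide.

```python
SOCKETS_PER_RACK = 64
CORES_PER_SOCKET = 8      # assumption: implied by 512 cores / 64 sockets
DIMMS_PER_RACK = 512
DIMM_GB = 8
NODE_MEMORY_GB = 128

cores_per_rack = SOCKETS_PER_RACK * CORES_PER_SOCKET
rack_memory_tb = DIMMS_PER_RACK * DIMM_GB / 1024
dimms_per_node = NODE_MEMORY_GB // DIMM_GB
nodes_per_rack = DIMMS_PER_RACK // dimms_per_node   # derived, not stated on the slide

if __name__ == "__main__":
    print(cores_per_rack, rack_memory_tb, dimms_per_node, nodes_per_rack)
```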
18. UV System Packaging Options
High Performance / Price-performance / Midrange Capability
[Diagram: rack layouts labeled Quad Router, Short Rack, and 19” rack, showing Admin Node, Storage, IO Expansion, 16 blade chassis, and 2 blade 24/32 core chassis]
• 42U, 24 inch rack: 64 skts, 512c per rack; 4TB memory (8GB DIMM); up to 4.65 tflop
• 42U, 24 inch rack, routerless: 64 skts, 512c; 4TB memory (8GB DIMM); up to 4.65 tflop
• 20U, 19 inch rack: 24 skts, 192c; 3TB memory (8GB DIMM); up to 1.8 tflop per short rack
• 40U, 19 inch rack: up to 50 skts, 400c; 3TB memory (8GB DIMM); up to 3.5 tflop per rack
Interconnect and scaling
• Fat Tree, 7.5GB/s/skt bisection
• 2D Torus, 1.25GB/s/skt bisection
• NL scalable to 16K sockets
• Can be clustered with IB, Gig-e
• Up to 2048 core SSI supported
19. Capability Comparisons
UV-Midrange Offers More Headroom
[Chart: System Scale (sockets), Max Memory (TB), and Max IO (PCIe slots/system) for four classes of system]
• UV-Midrange: 96 sockets SSI, 6 TB max memory, 64+ IO slots
• Scalable x86 (IBM, Bull, Unisys)
• 8S Glueless
• IBM P6 570, 575, HP Integrity
20. UV Nehalem-EX Node Board - Compute Blade
Each Blade:
• 8-16 Xeon cores
• Up to 145gflop
• Up to 128GB
[Diagram: Boxboro IOH with optional I/O riser, two Nehalem-EX sockets, and UV HUB connected by QPI links; (8) DDR3 RDIMMs & (4) Millbrook Memory Buffers per socket; UV HUB with RLDRAM (Snoop Acceleration), (2) Directory FB-DIMMs, and (4) NUMAlink 5 ports]
• SGI® NUMAlink™ 5 = 15.0 GB/s aggregate
• Intel® Quick Path Interconnect (QPI) = 25.6 GB/s aggregate (6.4GT/s)
• Directory FBD1 = 6.4GB/s Read + 3.2GB/s Write (800MHz DIMMs)
• Millbrook Memory Buffers = 8.53GB/s (1067MHz DDR3 DIMMs) x 4 channels = 34.1 GB/s Read / Socket
• Intel® Scalable Memory Interconnect (SMI) = 30 GB/s/socket
Single-Socket Memory/IO Expansion Blade also Available
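Several of the bandwidth figures on this slide can be re-derived from transfer rates and bus widths. The sketch below does so under stated assumptions that are not on the slide: an 8-byte data path per DDR3 channel, 2 bytes of payload per QPI transfer per direction, and a NUMAlink 5 aggregate equal to twice its 7.5 GB/s per-direction rate.

```python
DDR3_BYTES_PER_TRANSFER = 8   # assumption: 64-bit DDR3 data bus per channel
QPI_BYTES_PER_TRANSFER = 2    # assumption: 16 data bits per direction per transfer
NUMALINK5_GB_S_PER_DIRECTION = 7.5


def ddr3_channel_gb_s(mega_transfers_per_s: float) -> float:
    """Peak bandwidth of one DDR3 channel in GB/s."""
    return mega_transfers_per_s * DDR3_BYTES_PER_TRANSFER / 1000.0


def qpi_aggregate_gb_s(giga_transfers_per_s: float) -> float:
    """Peak QPI bandwidth summed over both directions, in GB/s."""
    return giga_transfers_per_s * QPI_BYTES_PER_TRANSFER * 2


channel = ddr3_channel_gb_s(1066.67)                  # "1067MHz" DDR3 DIMMs
socket_read = 4 * channel                             # x 4 channels per socket
qpi = qpi_aggregate_gb_s(6.4)
numalink5_aggregate = 2 * NUMALINK5_GB_S_PER_DIRECTION

if __name__ == "__main__":
    print(f"DDR3 channel: {channel:.2f} GB/s, per socket: {socket_read:.1f} GB/s")
    print(f"QPI aggregate: {qpi:.1f} GB/s, NUMAlink 5 aggregate: {numalink5_aggregate:.1f} GB/s")
```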
24. UV IO Expansion Chassis in Development
For Full-height and High-Power Card Support
• 1U
• One x16 PCIe G2.0 input connector
• Each unit supports up to 4 slots, either PCI-X or PCIe
29. SGI Altix ICE 8200 Water-Cooled Coils
• (4) Individual Coils
• Condensate Drain Pan
• Branch Feed to Individual Coil
• Swivel Coupling to Supply Hose
• 3/4” (1.91 cm) Coupling
Target Heat Rejection
• 95% water / 5% air
Chilled-Water Supply
• 45°F to 60°F (7.2°C to 15.6°C)
• 14.4 gpm (3.3 m3/hr) Max.
30. UV Rack w/ Top-Feed Water-Cooled Coil
Target Heat Rejection
• 95% water / 5% air
Chilled-Water Supply
• 45°F to 65°F (7.2°C to 18.3°C)
• 16.0 gpm (3.6 m3/hr) Max.
• 1” (2.54 cm) Coupling
UV Enhancements:
- Reduce water-side pressure drop
- Increase allowable water supply temp to 65°F (18.3°C)
- Enable top-feed water
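The paired imperial and metric values on the two water-cooling slides can be cross-checked with standard unit conversions. The sketch below assumes gpm means US liquid gallons per minute; the helper names are hypothetical.

```python
LITRES_PER_US_GALLON = 3.785411784  # assumption: gpm refers to US liquid gallons


def fahrenheit_to_celsius(deg_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32.0) * 5.0 / 9.0


def gpm_to_m3_per_hr(gpm: float) -> float:
    """Convert a flow rate from US gallons per minute to cubic metres per hour."""
    return gpm * LITRES_PER_US_GALLON * 60.0 / 1000.0


if __name__ == "__main__":
    for deg_f in (45, 60, 65):
        print(f"{deg_f} F = {fahrenheit_to_celsius(deg_f):.1f} C")
    for gpm in (14.4, 16.0):
        print(f"{gpm} gpm = {gpm_to_m3_per_hr(gpm):.1f} m3/hr")
```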
31. 80 Plus® Organization
Ultraviolet Power Supplies Planned to be Gold Certified
Mission
– Unique forum that is uniting electric utilities, the computer
industry and consumers in a groundbreaking effort to bring
energy efficient power supplies to desktop computers and
servers
N+0 desktop power supply certification available today
– SGI worked with 80 Plus to draft N+1 server power supply
specification
80 Plus          Bronze    Silver    Gold
CSCI             Year 1    Year 2    Year 3
                 July-07   July-08   July-09
20% PSU Load     81%       85%       88%
50% PSU Load     85%       89%       92%
100% PSU Load    81%       85%       88%
http://www.80plus.org/
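One way to read the efficiency table is in terms of input power: for a given DC output, a more efficient supply draws less from the wall and wastes less as heat. The sketch below applies that relationship to the table's values; the 1000 W output figure is a hypothetical load, not a UV specification.

```python
# Efficiency targets from the table, as fractions, keyed by level and % PSU load.
EFFICIENCY = {
    "Bronze": {20: 0.81, 50: 0.85, 100: 0.81},
    "Silver": {20: 0.85, 50: 0.89, 100: 0.85},
    "Gold": {20: 0.88, 50: 0.92, 100: 0.88},
}


def input_power_w(output_w: float, level: str, load_pct: int) -> float:
    """AC input power needed to deliver output_w at the given efficiency point."""
    return output_w / EFFICIENCY[level][load_pct]


def loss_w(output_w: float, level: str, load_pct: int) -> float:
    """Power dissipated in the supply itself."""
    return input_power_w(output_w, level, load_pct) - output_w


if __name__ == "__main__":
    output_w = 1000.0  # assumption: illustrative output load
    for level in EFFICIENCY:
        print(f"{level:6s} at 50% load: {loss_w(output_w, level, 50):6.1f} W lost")
```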
32. Energy Efficiency : Rack Level
Net (all-in) Rack Energy Efficiency Roadmap
(N.B. even higher efficiency if no water-coil)
[Chart: rack energy efficiency, y-axis 55% to 80%, for Origin 2000, Origin 3000, Altix 3000, Altix 4000, Ultraviolet, and Carlsbad; data labels 60%, 65%, 70%, 75%, 78%; stretch goal 80%]
33. UV Rack Power
34.5kVA Rack Nameplate
– Used for facilities wire-sizing
33.3kW Power Model Roll-Up
– 130W TDP sockets, full memory, fans at altitude with water-coil impedance
30.0kW Estimate Running Linpack
– 90% of Power Model
– “Maximum Measured”
22.5kW Estimate Running Applications
– ~75% of Linpack Power
– Used for energy consumption planning (kWh)
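The chain of estimates on this slide can be reproduced directly: Linpack power is about 90% of the power model, and application power is about 75% of Linpack power. The sketch below also turns the application figure into an annual energy estimate, assuming continuous operation for 8,760 hours a year, which is an illustrative assumption and not a figure from the slide.

```python
POWER_MODEL_KW = 33.3
LINPACK_FRACTION = 0.90        # Linpack as a share of the power model
APPLICATION_FRACTION = 0.75    # applications as a share of Linpack power
HOURS_PER_YEAR = 24 * 365      # assumption: the rack runs continuously

linpack_kw = round(POWER_MODEL_KW * LINPACK_FRACTION, 1)
application_kw = round(linpack_kw * APPLICATION_FRACTION, 1)
annual_energy_kwh = application_kw * HOURS_PER_YEAR

if __name__ == "__main__":
    print(f"Linpack: {linpack_kw} kW, applications: {application_kw} kW")
    print(f"Annual energy at application load: {annual_energy_kwh:,.0f} kWh")
```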
34. Projected UV Performance Advances
[Chart: MPI Bandwidth vs Message Size (x-axis: Bytes) for IB, NL4, and UV-NL5, with typical cluster systems for comparison; caption: Excellent BW/latency profile for large jobs]
[Chart: Longest Path MPI Latency (y-axis 0 to 6) vs. Destination CPU (x-axis 0 to 2000) for Altix 4700, Altix ICE, and UV]
[Chart: HPCC Benchmarks, speedups with GRU: MPI_Reduce, Random Access, FFTE, ptrans]
[Chart: Single element MPI_Reduce, Time for MPI_Reduce (us) (y-axis 0 to 25) vs. number of threads, for IB, UV with GRU, and UV no GRU; annotation: 3X]
MPI and HPCC, Barriers
Barrier Latency <1usec (4096 threads)
Source: Qlogic, Inc.
36. UV_HUB / Node Controller Technologies
Processor Interface
• Snoop Acceleration
• Large Number of In-Flight References
Active Memory Unit
• Rich set of Atomic Operations
• AMO cache at memory home
• Multicast
• Message Queues in Coherent Memory
• Page Initialization
Globally Addressable Memory
• Large Shared Address Space
• Extremely Large Coherent Get/Put Space
• AMOs in Coherent Memory
• Coherence Directory
GRU Global Reference Unit
• High-BW, Low-Latency Socket Communication
• Update Cache for many AMOs
• Scatter/Gather Operations
• BCOPY Operations
• External TLB with Large Page Support
RAS
• Redundant Real-Time Clock
• Built-In Debug and Performance Monitors
• Internal/External Datapath Protection
• Alpha-immune Flip-Flops
39. SGI Flagship Platform Evolution
SGI’s Flagship Product Line has 4 Characteristics:
1. GAM
2. SSI
3. x/core, where x={I/O, Memory}
4. SWAP (and cooling)
UV - 3 things to know:
1. Xeons into the Flagship Product Line WITHOUT COMPROMISE
2. MOE (MPI Offload Engine)
3. Topology Options:
- Selectable Fat-tree sizes
- Vertices within a Torus
- Paired Node Routerless or Routed
- Constellations
40. UV HUB/Node Controller Features
Extended Capability
• Enabling Enterprise-class scalability and reliability on x86-64
  • Cache-coherence across nodes
  • Fault resiliency – mirror thru block devices in memory – survive OS crash
  • Extensive fault isolation, datapath protection, monitoring/debug functions
• Accelerating Large-scale workloads
  • Fast Message-Passing (without CPU cache-line delays)
  • Extends CPU capability for load requests
  • System scale to 256+ sockets, 2048+ cores on standard Linux
• Accelerating Data-intensive applications
  • Extended physical memory address to peta-scale (8PB)
  • Extended “Super” TLB page size (1TB, map up to 4PB)
    • Avoid TLB misses for large, random data references
  • Very fast locking mechanism for highly contended data (no cache-line delay)
    • Off-load add, compare, swap instructions
  • HUB/Node controller directly exposed to user for easy utilization
    • No system calls
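The benefit of the 1TB "Super" TLB page size is easiest to see by counting how many page mappings it takes to cover the 4PB figure quoted above. The sketch below compares 1TB pages with two smaller page sizes; the 4KB and 2MB sizes are common x86-64 page sizes added here for comparison and are not taken from the slide.

```python
KIB = 1 << 10
MIB = 1 << 20
TIB = 1 << 40
PIB = 1 << 50


def pages_needed(mapped_bytes: int, page_bytes: int) -> int:
    """Number of pages (and hence translations) needed to map a region."""
    return -(-mapped_bytes // page_bytes)  # ceiling division


MAPPED = 4 * PIB  # "map up to 4PB" from the slide

if __name__ == "__main__":
    for label, size in (("4KB", 4 * KIB), ("2MB", 2 * MIB), ("1TB", TIB)):
        print(f"{label} pages: {pages_needed(MAPPED, size):,} mappings")
```

With 1TB pages the whole 4PB region needs only 4,096 mappings, which is why large, random references can avoid TLB misses.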
41. System Management
UV maintains the hierarchical system management approach.
– Origin/Altix: L1/L2/L3
– ICE/UV: BMC, CMC, Leader Node/SMN
– Command line interface at L2 & CMC very similar
Unified approach to system management wrapped into SGI Cluster Manager
SNMP used extensively across product lines including UV
– Hardware inventories & sensor values stored in MIB format
– SNMP data coalesced at SMN, available via SGI-provided RAS software or through SNMP queries by 3rd-party or customer-developed apps