This document provides an overview of high performance computing (HPC). It defines HPC as using supercomputers and computer clusters to solve advanced computation problems quickly and efficiently through parallel processing. The document discusses the building blocks of HPC systems including CPUs, memory, power consumption, and number of cores. It also outlines some common applications of HPC in fields like physics, engineering, and life sciences. Finally, it traces the evolution of HPC technologies over decades from early mainframes and supercomputers to today's clusters and parallel systems.
2. What is HPC?
HPC Definition:
14.9K hits from Google
Uses supercomputers and computer clusters to solve advanced
computation problems. Today, computer… (Wikipedia)
Use of parallel processing for running advanced application
programs efficiently, reliably and quickly. The term applies
especially to systems that function above a teraflop, or 10^12
floating-point operations per second. The term HPC is
occasionally used as a synonym for supercomputing, although…
(Techtarget)
A branch of computer science that concentrates on developing
supercomputers and software to run on supercomputers. A main
area of this discipline is developing parallel processing algorithms
and software (Webopedia)
And another “14.9K - 3” definitions…
3. So What is HPC Really?
My understanding
No clear definition!!
At least O(2) times (~100×) as powerful as a PC
Solving advanced computation problems? Online games!?
HPC ~ Supercomputer? & Supercomputer ~ Σ Cluster(s)
Possible Components:
CPU1, CPU2, CPU3, … CPU_N
~ O(1) tons of memory DIMMs…
~ O(2) kW power consumption
O(1) ~ O(3) K-cores
~ 1 system admin
Remember:
“640K ought to be enough for anybody.” Bill Gates, 1981
4. Why HPC?
Possible scenario:
// budget unit: MUSD
if (budget[Gov] >= O(2) && budget[Com] >= O(1)) {
    if (show_off == true || possible_run_on_PC == false)
        exec("HPC Implementation");
}
// …then got to wait another 6 months ~ 1 year…
Truth is:
Time-consuming operations
Tasks demanding huge memory
Mission-critical runs, e.g. with hard time limits
Large quantities of runs (across cores/CPUs, etc.)
Non-optimized programs…
5. Why HPC? (Story Cont’d)
Grand Challenge Application Requirements
Both the ’97 Top500 maximum and the ’05 Top500 minimum break a TFlop/s
LHC estimate: CPU/disk ~143 MSI2k / 56.3 PB (~10 PB of data), ~100K cores, PFlop/s scale
[Chart: Grand Challenge application requirements]
6. Who Needs HPC? HPC Domain Applications
Fluid dynamics & Heat Transfer
Physics & Astrophysics
Nanoscience
Chemistry & Biochemistry
Biophysics & Bioinformatics
Geophysics & Earth Imaging
Medical Physics & Drug Discovery
Databases & Data Mining
Financial Modeling
Signal & Image Processing
And more ....
7. HPC – Speed vs. Size
Size: can’t fit on a PC, usually because they need more than a few GB
of RAM, or more than a few hundred GB of disk.
Speed: take a very, very long time to run on a PC: months or even years.
But a problem that would take a month on a PC might take only a few
hours on a supercomputer.
8. HPC ~ Supercomputer ~ Σ Cluster(s)
What is cluster?
Again, 1.4K hits from Google…
A computer cluster is a group of linked computers, working
together closely thus in many respects forming a single
computer... (Wikipedia)
Single logical unit consisting of multiple computers that are
linked through a LAN. The networked computers essentially
act as a single, much more powerful machine.. (Techopedia)
And……
But the cluster is:
CPU1, CPU2, CPU3, … CPU_N
~ O() kg of memory DIMMs…
< O(1) kW power consumption
~ O(1) K-cores
Still ~ 1 system admin
9. HPC – Trend in Growth Potential & Ease of Use
[Chart: GFlops per processor (0.001-10) vs. number of processors (10-100K)
for systems circa 1995, 2000, 2005 and 2010]
18. HPC – Cost of Computing (1960s ~ 2011)
About “17 million IBM 1620 units”
costing $64,000 each
(the 1620’s multiplication operation takes 17.7 ms)
Milestones on the cost-of-computing curve:
Cray X-MP
Two 16-processor Beowulf clusters with Pentium Pro microprocessors
Bunyip Beowulf cluster: first sub-US$1/MFLOPS computing technology;
won the Gordon Bell Prize in 2000
KLAT2: first computing technology which scaled to large applications
while staying under US$1/MFLOPS
KASY0: first sub-US$100/GFLOPS computing technology
Microwulf: as of August 2007, this 26.25 GFLOPS “personal” Beowulf
cluster can be built for $1,256
HPU4Science: $30,000 cluster built using only commercially available
“gamer”-grade hardware
[Chart: cost (USD) per unit of computing vs. year, 1960s ~ 2011]
Ref: http://en.wikipedia.org/wiki/FLOPS
19. HPC - Interconnect, Proc Type, Speed & Threads
Ref: “Introduction to the HPC Challenge Benchmark Suite” by Piotr Luszczek et al.
22. HPC – Computing System Evolution (I)
1940s (Beginning)
ENIAC (Eckert & Mauchly, U. Penn)
Von Neumann Machine
Sperry Rand Corp
IBM Corp.
Vacuum tube
Thousands of instructions per second (0.002 MIPS)
1950s (Early Days)
IBM 704, 709x
CDC 1604
Transistor (Bell Lab, 1948)
Memory: Drum/Magnetic Core (32K words)
Performance: 1 MIPS
Separate I/O processor
IBM 704
23. HPC – Computing System Evolution (II)
1960s (System Concept)
IBM S/360 Model 85
IBM Stretch Machine (1st Pipeline machine)
IBM System 360 (Model 65, Model 91)
CDC 6600
GE, UNIVAC, RCA, Honeywell & Burrough etc
Integrated Circuits / Multi-layer Printed Circuit Boards
Memory: Semiconductor (3MB)
Cache (IBM 360 Model 85)
Performance: 10 MIPS (~1 MFLOPS)
1970s (Vector, Mini-Computer)
IBM System 370/M195, 308x
CDC 7600, Cyber Systems
DEC Minicomputer
FPS (Floating Point System)
Cray 1, XMP
Large Scale Integrated Circuit
Performance: 100 MIPS (~10 MFLOPS)
Multiprogramming, Time Sharing
Vector: Pipelined Data Stream
IBM S/370 Model 168
24. HPC – Computing System Evolution (III)
1980s (RISC, Micro-Processor)
CDC Cyber 205
Cray 2, YMP
IBM 3090 VF
Japan Inc. (Fujitsu’s VP, NEC’s SX)
Thinking Machine: CM2 (1st Large Scale Parallel)
RISC systems (Apollo, Sun, SGI, etc.)
Convex Vector Machine (mini-Cray)
Microprocessor: PC (Apple, IBM)
Memory: 100 MB
Connection Machine CM-2
RISC system:
Pipeline Instruction Stream
Multiple execution units in core
Vector: Multiple vector pipelines
Thinking Machine: kernel level parallelism
Performance: 100 MFLOPS
IBM 3090 Processor Complex
25. HPC – Computing System Evolution (IV)
1990s (Cluster, Parallel Computing)
IBM Power Series (1,2,3)
SGI NUMA System
Cray CMP, T3E, Cray 3
CDC ETA
DEC’s Alpha
SUN’s Internet machines
Intel Paragon
Clusters of PCs
[Images: IBM Power5 family, Power3, IBM Blue Gene]
Memory: 512MB per processor
Performance: 1 Teraflops
SMP node in Cluster System
2000s (Large Scale Parallel System)
IBM Power Series (4,5), Blue Gene
HP’s Superdome
Cray SV system
Intel’s Itanium, Xeon, Woodcrest, Westmere processors
Memory: 1-8 GB per processor
Performance: reaching 10 Teraflops
26. HPC – Programming Language (I)
Microcode, Machine Language
Assembly Language (1950s)
Mnemonic, based on machine instruction set
Fortran (Formula Translation) (John Backus, 1956)
IBM Fortran Mark I – IV (1950s, 1960s)
IBM Fortran G, H, HX (1970), VS Fortran
CDC, DEC, Cray, etc..., Fortran
Industrial Standardized - Fortran 77 (1978)
Industrial Standardized - Fortran (88), 90, 95 (1991,1996)
HPF (High Performance Fortran) (early 1990s)
Algol (Algorithm Language) (1958) (1960, Dijkstra et al.)
Based on Backus-Naur Form method
Considered the 1st block-structured language
COBOL (Common Business Oriented Language) (1960s)
IBM PL/1, PL/2 (Programming Language) (mid 60-70s)
Combined Fortran, COBOL, & Algol
Pointer function
Exception handling
27. HPC – Programming Language (II)
Applicative Languages
IBM APL (A Programming Language) (1970s)
LISP (List Processing Language) (1960s, MIT)
BASIC (Beginner’s All-Purpose Symbolic Instruction Code) (mid-1960s)
1st Interactive language via Interpreter
PASCAL (1975, Niklaus Wirth)
Derived from Wirth’s Algol-W
Well designed programming language
Call argument list by value
C & C++ (mid-1970s, Bell Labs)
Procedural languages
ADA (late 1980, U.S. DOD)
Prolog (Programming Logic) (mid 1970)
28. HPC – Computing Environment
Batch Processing (before 1970)
Multi-programming, Time Sharing (1970)
Remote Job Entry (RJE) (mid 1970)
Network Computing
ARPAnet (mother of the INTERNET)
IBM’s VNET (mid 1970)
Establishment of Community Computing Centers
1st Center: NCAR (1967)
U.S. National Supercomputer Centers (1980s)
Parallel Computing
Distributed Computing
Emergence of microprocessors
Grid Computing (2000s)
Volunteer Computing: @Home technology
31. HPC – Parallel Computing (II)
1st Attempt - ILLIAC IV, 64-way monster (mid-1970s)
U.S. Navy’s parallel weather-forecast program (1970s)
Early programming method - UNIX threads (late 1970s)
1st Viable Parallel Processing - Cray’s microtasking (’80s)
Many, many proposed methods in the 1980s: e.g. HPF
SGI’s NUMA system - a very successful one (1990s)
Oak Ridge NL’s PVM and Europe’s PARMACS (early ’90s): programming models
for distributed-memory systems
Adoption of MPI and OpenMP for parallel programming
MPI - the mainstream of parallel computing (late 1990s)
Well- and clearly-defined programming model
Success of cluster computing systems
Network/switch hardware performance
Scalability
Data decomposition allows for running large programs
Mixed MPI/OpenMP parallel programming model for SMP-node cluster
systems (2000); a minimal hybrid sketch follows
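To make the mixed model concrete, here is a minimal hybrid MPI/OpenMP sketch in C. This is an illustrative example of mine, not code from the original slides; it assumes an MPI implementation (e.g. MPICH or Open MPI) and an OpenMP-capable compiler:

```c
/* hybrid.c - minimal MPI + OpenMP "hello" sketch (illustrative only)
 * build (typical): mpicc -fopenmp hybrid.c -o hybrid
 * run   (typical): mpirun -np 4 ./hybrid
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Request thread support, since OpenMP threads live inside each rank. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED)
        fprintf(stderr, "warning: limited MPI thread support\n");
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* MPI handles the distributed-memory level (e.g. one rank per SMP node);
     * OpenMP handles the shared-memory level (threads within the node). */
    #pragma omp parallel
    printf("rank %d/%d, thread %d/%d\n",
           rank, nranks, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}
```

The mapping shown (MPI across nodes, OpenMP within a node) is the one the slide describes for SMP-node clusters.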
37. Constraints on Computing Solutions –
“Distributed Computing”
Opposing forces:
Commodity budgets push toward lower-cost computing solutions
At the expense of operating cost
Limitations on power & cooling are difficult to change on short time scales
Challenges:
Data Distribution & Data Management
Distributed Computing Model
Fault Tolerance, Scalability & Availability
[Diagram: spectrum of solutions from centralized (SMP) to distributed]
42. Cluster – Commercial x86 Architecture
Intel Dunnington 7400-series
The last CPU of the Penryn generation and Intel’s first six-core die:
a single-die six-core (hexa-core) design with three unified 3 MB L2 caches
55. Network Performance Throughput vs. Latency (III)
Interconnects:
GbE, 10G (FC), IB and IPoIB (DDR vs. QDR)
Max throughput does not reach 80% of IB DDR (~46%)
Peak of DDR IPoIB ~76% of IB peak (9.1 Gbps)
Over IP, QDR reaches only 54%,
while its max (native) throughput reaches 85% (34.8 Gbps)
No significant performance gain for IPoIB using RDMA
(by preloading SDP)
Possible performance degradation:
existing activities over the IB edge switch at the chassis
midplane performance limitation
Reaching 85% on a clean IB QDR interconnect:
redo the performance measurement on IBM QDR
56. Cluster – File Server Performance
Preload SDP provided by OFED
Sockets Direct Protocol (SDP)
Note: a network protocol which provides
an RDMA-accelerated alternative to TCP
over InfiniBand
58. Cluster – File Server IO Performance (II)
[Charts: file-server read and re-read performance]
59. Cluster I/O – Cluster filesystem options? (I)
OCFS2 (Oracle Cluster File System)
Once proprietary, now GPL
Available in Linux vanilla kernel
not widely used outside the database world
PVFS (Parallel Virtual File System)
Open source & easy to install
Userspace-only server
kernel module required only on clients
Optimized for MPI-IO
POSIX compatibility layer performance is sub-optimal
pNFS (Parallel NFS)
Extension of NFSv4
Proprietary solutions available: “Panasas”
Brings the benefits of parallel I/O to a standard solution (NFS)
60. Cluster I/O – Cluster filesystem options? (II)
GPFS (General Parallel File System)
Rock-solid, with a 10-year history
Available for AIX, Linux & Windows Server 2003
Proprietary license
Tightly integrated with IBM cluster management tools
Lustre
HA & LB implementation
highly scalable parallel filesystem: ~ 100K clients
Performance:
Client: ~1 GB/s & 1K Metadata Op/s
MDS: 3K ~ 15K Metadata Op/s
OSS: 500 ~ 2.5 GB/s
POSIX compatibility
Components:
single or dual Metadata Servers (MDS) w/ attached Metadata Target (MDT)
(dual if scalability & load balancing are a concern)
multiple (up to ~O(3)) Object Storage Servers (OSS) w/ attached Object
Storage Targets (OST)
61. Cluster I/O – Lustre Cluster Breakdown
[Diagram: Lustre cluster over an InfiniBand interconnect. Load-balanced OSS
nodes and a high-availability MDS pair (master MDS(M) / standby MDS(S)) serve
quad-CPU compute nodes plus admin and login nodes, with connectivity to all
nodes; a GigE network carries boot and system-control traffic; 10/100 Ethernet
provides out-of-band management (power on/off, etc.)]
62. Cluster I/O – Parallel Filesystem using Lustre
Typical Setup
MDS: ~O(1) servers with good CPU and RAM, high-seek-rate storage
OSS: ~O(3) servers requiring good bus bandwidth and storage
68. Hardware & system software features affecting scalability of parallel systems
[Diagram: “Totally Scalable Architecture” surrounded by the contributing factors:]
Reliability: Hardware, Software
Scalable Tools: User/Developer, Manager, Libraries
Machine Size: Proc. Performance, Num. Processors
Input/Output: Bandwidth, Capacity
Memory Size: Virtual, Physical
Memory Type: Distributed, Shared
Program Env.: Familiar Program Paradigm, Familiar Interface
Interconnect Network: Latency, Bandwidth
69. HPC – Demanded Features from diff. Roles
Def. Roles: Users, Developers, Managers
Features Users Developers Managers
Familiar User Interface ✔ ✔ ✔
Familiar Programming Paradigm ✔ ✔
Commercially Supported Applications ✔ ✔
Standards ✔ ✔
Scalable Libraries ✔ ✔
Development Tools ✔
Management Tools ✔
Total System Costs ✔
70. HPC –
Application Development Steps
[Diagram: application development cycle; step labels: SPEC, SA, Prep, Code, Par, Mod, Opt, Run]
71. HPC – Service Scopes
System Architecture Design
Various interconnects, e.g. GbE, IB, FC, etc.
Mission-specific, e.g. high performance or high throughput
Computational or data intensive
OMP vs. MPI
Parallel/Global filesystem
Cluster Implementation
Objectives:
High availability & Fault tolerance
Load Balancing Design & Validation
Distributed & Parallel Computing
Deployment, Configuration, Cluster Mgmt. & Monitoring
Service Automation and Event Mgmt.
KB & Helpdesk
Service Level:
Helpdesk & Onsite Inspection
System Reliability & Availability
1st / 2nd line Tech. Support
Automation & Alarm Handling
Architecture & Outreach?
72. High Performance Computing –
Performance Tuning & Optimization
Tuning Strategy & Best Practices
Profile & Bottleneck drilldown
System Optimization
Filesystem Improvement & (re-)Design
79. High Performance Computing
- How We Get to Today?
Moore’s Law, Heat/Energy/Power Density
Hardware Evolution
Datacenter & Green
HPC History Reminder:
1980s - 1st GFlop/s in a single vector processor
1996/97 - 1st TFlop/s (ASCI Red) via thousands of microprocessors
2008 - 1st PFlop/s (Roadrunner) via over a hundred thousand cores
80. Moore’s Law & Power Density
Dynamic Pwr ∝ CV²f
2X transistors/chip every 1.5 yr, as Gordon Moore (co-founder of Intel)
predicted in 1965
Cubic effect if frequency & supply voltage increase together (a short
derivation follows)
Eff ∝ capacitance ∝ cores (linear)
High-performance serial processors waste power
Spend transistors on more cores rather than faster serial processors
[Chart: Intel transistor count and MIPS vs. date of production, 1971-2011:
0.1, 1.0 and 25 MIPs through 7.5K-11K and 33K-38K MIPs; 1 billion transistors]
Ref: http://en.wikipedia.org/wiki/List_of_Intel_microprocessors
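Why the cubic effect? A short derivation from the dynamic-power relation above (a standard CMOS scaling argument, added here as annotation rather than taken from the slide):

```latex
% Dynamic power of CMOS logic (the relation quoted above):
P_{\mathrm{dyn}} \propto C V^{2} f
% Raising the clock frequency typically requires a roughly proportional
% rise in supply voltage (f \propto V); substituting V \propto f gives:
P_{\mathrm{dyn}} \propto C f \cdot f^{2} = C f^{3}
```

Doubling cores instead of frequency grows power only linearly (via capacitance), which is the slide's argument for spending transistors on cores.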
81. Moore’s Law – What we learn?
Transistor ∝ MIPs ∝ Watts ∝ BTUs
Rule of thumb: 1 watt of power consumed requires 3.413
BTU/hr of cooling to remove the associated heat (see the sketch after this list)
Inter-chip vs. Intra-chip parallelism
Challenges: millions of concurrent threads
HP: data center power density went from 2.1 kW/rack
in 1992 to 14 kW/rack in 2006
IDC: 3 Year Costs of Power and Cooling, Roughly Equal
to Initial Capital Equipment Cost of Data Center
NETWORKWORLD: 63% of 369 IT professionals said
that running out of space or power in their data centers
had already occurred
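As a quick check on the rule of thumb above, a trivial conversion sketch (the 14 kW rack reuses the 2006 HP data point; the program itself is illustrative, not from the slides):

```c
/* btu.c - convert IT load to required cooling using the 3.413 BTU/hr
 * per watt rule of thumb quoted above (illustrative sketch). */
#include <stdio.h>

int main(void)
{
    double rack_kw = 14.0;  /* e.g. a 14 kW rack (the 2006 figure above) */
    double btu_per_hr = rack_kw * 1000.0 * 3.413;
    printf("%.0f kW rack -> %.0f BTU/hr of cooling\n", rack_kw, btu_per_hr);
    return 0;
}
```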
82. HPC – Feature size, Clock & Die Shrink
Historical data
[Charts: main ITRS node / feature size (nm) vs. year, and max clock rate vs. year]
83. Trend: Cores per Socket
Top500, Nov 2011:
45.8% & 32% running 6- and quad-core processors
15.8% of systems >= 8 cores (2.4% with 16 cores),
more than a 2-fold increase vs. Nov 2010 (6.8%)
Trend: from quad-core (73% in ’10) to 6-core (46% in ’11)
[Chart: Top500 Nov 2011 cores-per-socket share]
84. HPC – Evolution of Processors
Transistors: Moore’s Law
Clock rate no longer serves as a proxy for Moore’s Law;
core counts may double instead.
Power literally under control.
[Chart: transistors and physical gate length over time]
Ref: “Scaling to Petascale and Beyond: Performance Analysis and Optimization of Applications” NERSC.
85. HPC – Comprehensive Approach
CPU Chips
Clock Frequency & Voltage Scaling
75% power savings at idle and 40-70% power savings for
utilization in the 20-80% range
Server
Chassis: 20-50% Pwr reduction.
Modular switches & routers
Server consolidation & virtualization
Storage Devices
Max. TB/Watt & Disk Capacity
Large Scale Tiered Storage
Max. power efficiency by minimizing storage over-provisioning
Cabling & Networking
Stackable & backplane capacity (inc. Pwr Eff)
Scaling & Density
86. HPC – Datacenter Power Projection
Case: ORNL/UTK, incl. DOE & NSF systems
Deploying 2 large petascale systems in the next 5 years
Current power consumption: 4 MW
Expected to reach 15 MW before year end (2011)
50 MW by 2012
Cost estimates based on $0.07 per kWh
87. HPC – Data Center Best Practices
Traditional Approach
Hot/Cold Aisle
Min. Leakage
Eff. Improvement (Cooling & Power)
DC input (UPS opt.), Cabling & Container
Liquid Cooling
Free Cooling
Leveraging Hydroelectric Power
Ref: http://www.google.com/about/datacenters/
http://www.google.com/about/datacenters/inside/efficiency/power-usage.html
88. HPC – DataCenter Growing Power Density
Total system efficiency comprises three main elements: the grid, the data
centre and the IT components. Each element has its own efficiency factor;
multiplied together, for 100 watts of power generated the CPU receives only
12 watts (i.e. three stages at roughly 50% each: 0.5 × 0.5 × 0.5 ≈ 12.5%).
[Chart: heat load per product footprint (Watt/ft²)]
Ref: Internet2 P&C Nov 2011, “Managing Data Center Power & Cooling” by Force10
89. HPC - Performance Benchmarking
CPU Arch., Scalability, SMT & Perf/Watt
Case study: Intel vs. AMD
90. HPC – Performance Strategy: Amdahl’s Law
Fixed-size model (Amdahl): Speedup = 1 / (s + p/N), with s + p = 1
Equivalently, with parallel fraction P: Speedup = 1 / ((1-P) + P/N) → 1/(1-P) as N → ∞
Scaled-size model: parallel & vector work scales with the problem size
s: Σ (I/O + serial bottleneck + vector startup + program loading)
[Plot: speedup vs. number of processors]
(A numeric sketch of the fixed-size model follows.)
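A small numeric illustration of the fixed-size model (a sketch of mine; the 5% serial fraction is a made-up example):

```c
/* amdahl.c - evaluate Speedup(N) = 1 / (s + (1 - s)/N) for a fixed-size problem */
#include <stdio.h>

int main(void)
{
    double s = 0.05;  /* hypothetical serial fraction: I/O, startup, etc. */
    for (int n = 1; n <= 1024; n *= 4) {
        double speedup = 1.0 / (s + (1.0 - s) / n);
        printf("N = %4d  speedup = %6.2f  (bound: %.1f)\n",
               n, speedup, 1.0 / s);
    }
    return 0;
}
```

Even 5% serial work caps the speedup at 20×, no matter how many processors are added.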
91. Price-Performance for Transaction-Processing
OLTP – One of the largest server markets is online
transaction processing
TPC-C – the standard industry benchmark for OLTP
Queries and updates exercise a database system
Significant factors in TPC-C’s usefulness:
Reasonable approximation to a real OLTP application
Predictive of real system performance:
total system performance, incl. the hardware, the operating
system, the I/O system, and the database system
Complete instruction and timing info for benchmarking
Metrics: TPM (transactions per minute) & price-performance
in dollars per TPM (a trivial sketch follows)
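Price-performance is just total system cost divided by measured throughput; a trivial sketch with hypothetical (non-TPC-published) figures:

```c
/* tpm_price.c - dollars per transaction-per-minute, hypothetical figures */
#include <stdio.h>

int main(void)
{
    double total_system_cost = 500000.0; /* hardware + OS + DB, USD (made up) */
    double tpm = 250000.0;               /* measured transactions/min (made up) */
    printf("price-performance: %.2f USD/tpm\n", total_system_cost / tpm);
    return 0;
}
```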
92. 20 SPEC benchmarks
1.9 GHz IBM Power5 processor vs. 3.8 GHz Intel Pentium 4
10 integer (LHS) & 10 floating-point (RHS)
Fallacy:
Processors with lower CPIs will always be faster.
Processors with faster clock rates will always be faster.
94. Cost of purchase split between processor, memory,
storage, and software
95. Pentium 4 Microarchitecture
Important characteristics of the recent Pentium 4 640 implementation
in 90 nm technology (code-named Prescott)
96. HPC – Performance Measurement (I)
Objective:
Baseline Performance
Performance Optimization
Confidence & verifiability
Measurement:
Open Std.: math kernel & application
MIPS (million instructions per second) (MIPS Tech. Inc.)
MFLOPS (million floating-point operations per second)
Characteristics:
Peak vs. Sustained
Speed-Up & Computing Efficiency (mainly for Parallel)
CPU Time vs. Elapsed Time
Program performance (HP) vs. System Throughput (HT)
Performance per Watt
Ref: http://www-03.ibm.com/systems/power/hardware/benchmarks/hpc.html
http://icl.cs.utk.edu/hpcc/
97. HPC – Performance Measurement (II)
Public Benchmark Utilities:
LINPACK (Jack Dongarra, Oak Ridge N.L.)
Single precision / double precision
n=100 (TPP), n=1000 (“paper & pencil” benchmark)
HPL, n = ‘undefined’ (mainly for parallel systems)
Synthetic: Dhrystone, Whetstone, Khornerstone
SPEC (Standard Performance Evaluation Corp.)
SPECint (CINT2006), SPECfp (CFP2006), SPEComp; source-code
modification not allowed
Livermore Loops (introduced MFLOPS)
Los Alamos Suite (vector computing)
STREAM (memory performance)
NPB (NASA Ames): NPB 1 and NPB 2 (Class A, B, C)
Application benchmarks (weather/materials/MD/statistics, etc.):
MM5, NAMD, ANSYS, WRF, VASP, etc. (a minimal triad sketch follows)
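For flavor, a minimal STREAM-style triad kernel (an illustrative sketch of mine, not the official STREAM benchmark; array size and timing method are arbitrary choices):

```c
/* triad.c - STREAM-style triad a[i] = b[i] + q*c[i], reporting MFLOPS and MB/s */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000L

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    double q = 3.0;
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    clock_t t0 = clock();
    for (long i = 0; i < N; i++)
        a[i] = b[i] + q * c[i];   /* 2 flops, 3 doubles moved per element */
    double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("triad: %.1f MFLOPS, %.1f MB/s (a[0]=%g)\n",
           2.0 * N / sec / 1e6,
           3.0 * N * sizeof(double) / sec / 1e6, a[0]);
    free(a); free(b); free(c);
    return 0;
}
```

Peak vs. sustained, CPU time vs. elapsed time, and per-watt figures (the characteristics listed on the previous slide) all apply to even a toy kernel like this.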
98. Target Processors (I) - AMD vs. Intel
AMD Magny-Cours Opteron (45nm, Rel. Mar. 2010)
Socket G34 multi-chip module
2 × 4-core or 6-core dies connected with HT 3.1
6172 (12-cores), 2.1GHz
L2: 12 x 512K, L3: 2 x 6M
HT: 3.2 GHz
ACP/TDP: 80W/115W
Streaming SIMD Extensions: SSE, SSE2, SSE3 and SSE4a
6128HE (8-cores), 2.0GHz
L2: 8 x 512K, L3: 2 x 6M
HT: 3.2 GHz
ACP/TDP: 80W/115W
Streaming SIMD Extensions: SSE, SSE2, SSE3 and SSE4a
99. Target Processors (II)
- AMD vs. Intel
Intel Woodcrest (Rel. Jun 2006), Harpertown and Westmere
Xeon 5150
2.66GHz, LGA-771
L2: 4M
TDP: 65W
Streaming SIMD Extension: SSE, SSE2, SSE3 and SSSE3
Harpertown, Quad-Cores, 45nm (Rel. Nov 2007)
E5430 2.66GHz
L2: 2 x 6M
TDP: 80W
Streaming SIMD Extension: SSE, SSE2, SSE3, SSSE3 and SSE4.1
Westmere-EP, 6-core, 32nm (Rel. Mar 2010)
X5650 2.67GHz, LGA-1366
L2/L3: 6x256K/12MB
I/O Bus: 2 x 6.4GT/s QPI
Streaming SIMD Extension: SSE, SSE2, SSE3, SSSE3, SSE4.1 and SSE4.2
100. SPEC2006 Performance Comparison
- SMT off, Turbo on
8 Cores Nehalem-EP vs. 12 Cores Westmere-EP
32% performance gain from 50% more CPU cores
Scalability 12% below ideal performance (1.32/1.50 ≈ 88%)
SMT advantage:
Nehalem-EP, 8 cores to 16 threads: “24.4%”
Westmere-EP, 12 cores to 24 threads: “23.7%”
Ref: CERN openlab Intel WEP Evaluation Report (2010)
101. Efficiency of Westmere-EP
- Performance per Watt
Extrapolated from 12 GB to 24 GB:
2 watts per additional GB of memory
Dual PSU (upper) vs. single PSU (lower)
SMT offers a 21% boost in terms of efficiency
Approx. 3% is consumed by SMT, comparing with the absolute
performance gain (23.7%)
Ref: CERN openlab Intel WEP Evaluation Report (2010)
102. Efficiency of Nehalem-EP Microarchitecture
With SMT Off
Most efficient: Nehalem-EP (L5520) vs. X5670
Westmere adds ~10% efficiency:
+9.75% using dual PSU
+23.4% using single PSU
Nehalem L5520 vs. Harpertown (E5410)
+35% performance boost
Ref: CERN openlab Intel WEP Evaluation Report (2010)
108. HPC – Performance of Countries
[Chart: Nov 2011 Top500 performance by country]
109. Top500 Analysis – Power Consumption & Efficiency
Top 4 in power efficiency: BlueGene/Q (Nov 2011)
Rochester > Thomas J. Watson > DOE/NNSA/LLNL
Eff: 2026 GF/kW (BlueGene/Q, Power BQC 16C 1.60 GHz, custom interconnect)
RIKEN Advanced Institute for Computational Science (AICS),
SPARC64 VIIIfx 2.0 GHz, Tofu interconnect: 11.87 MW
Tianhe-1A, National Supercomputing Center in Tianjin: 3.6 MW
[Chart: power consumption & efficiency, 2008-2011]
110. Top500 Analysis - Performance & Efficiency
20% of the top-performing clusters
contribute 60% of total computing power
(27.98 PF)
5 clusters have efficiency < 30%
111. Top500 Analysis - HPC Cluster Performance
272 (52%) of the world’s fastest clusters have efficiency lower
than 80% (Rmax/Rpeak)
Only 115 (18%) can drive over 90% of theoretical peak
[Chart: trend of cluster efficiency, 2005-2009, sampled from the Top500]
112. Top500 Analysis – HPC Cluster Interconnection
SDR, DDR and QDR in the Top500
Promising efficiency: >= 80%
The majority of IB-ready clusters adopt DDR (87%) (Nov 2009)
Contributing 44% of total computing power (~28 PFlops)
Avg efficiency ~78%
113. Impact Factor: Interconnectivity
- Capacity & Cluster Efficiency
Over 52% of clusters are based on GbE,
with efficiency around 50% only
InfiniBand is adopted by ~36% of HPC clusters
114. Common Semantics
Programmer productivity
Ease of deployment
HPC filesystems are more mature, with a wider feature set:
High concurrent read and write
In the comfort zone of programmers (vs cloudFS)
Wide support, adoption, acceptance possible
pNFS is working toward equivalence
Reuse standard data management tools
Backup, disaster recovery and tiering
116. Observation & Perspectives (I)
Pursuing another 1000× in performance will be tough
~20 PF Titan (the upgraded Jaguar) delivers in 2012
ExaFlops projected ~2016 (PFlops reached in 2008)
Still! IB & GbE are the most-used interconnect solutions
Multi-core continues Moore’s Law
high-level parallelism & software readiness
reduce bus traffic & exploit data locality
Storage is the fastest-growing product sector
Storage consolidation intensifies competition
Lustre roadmap stabilized for HPC
Computing paradigm
complicated systems vs. sophisticated computing tools
hybrid computing model
Major concern: power efficiency
energy in memory & interconnect increases, e.g. in data-search applications
exploit memory power efficiency: larger caches?
Scalability and Reliability
Performance key factor: data communication
consider: layout, management & reuse
117. Observation & Perspectives (II)
Vendor Support & User Readiness
No Moore’s Law for software, algorithms & applications?
Service Orientation
Standardization & KB
Automation & Expert systems
Emerging new possibilities
Cloud Infrastructure & Platform
currently 3% of spending (mostly private cloud)
Technology push & market/demand pull
growing opportunity in “Big Data”
datacenter, SMB & HPC solution providers
Rapid growth of accelerators
Tested by ~67% of users (20% in ’10)
NVIDIA possesses 90% of current usage (’11)
“I think there is a world market for maybe five computers”
Thomas Watson, chairman of IBM, 1943
“Computers in the future may weigh no more than 1.5 tons.”
Popular Mechanics, 1949
119. Reference - Mathematical & Numerical Lib. (I)
Open Source
Linpack - numerical linear algebra, intended for use on supercomputers
LAPACK - the successor to LINPACK (Netlib)
PLAPACK - Parallel Linear Algebra Package
BLAS - basic linear algebra subprograms (a minimal call sketch follows this list)
GotoBLAS - optimal BLAS performance via new algorithmic & memory
techniques
ScaLAPACK - high-performance linear algebra routines for distributed-memory
message-passing MIMD computers
FFTW - Fastest Fourier Transform in the West
HPC-Netlib - the high-performance branch of Netlib
PETSc - portable, extensible toolkit for scientific computation
Numerical Recipes
GNU Scientific Library
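As an illustration of how these libraries are called in practice, a minimal CBLAS DGEMM example (a sketch of mine; it assumes a CBLAS implementation such as ATLAS, GotoBLAS or Intel MKL providing the cblas.h header, linked with e.g. -lcblas):

```c
/* dgemm_demo.c - C = alpha*A*B + beta*C via the standard CBLAS interface */
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    /* 2x2 matrices in row-major order */
    double A[] = {1, 2,
                  3, 4};
    double B[] = {5, 6,
                  7, 8};
    double C[] = {0, 0,
                  0, 0};

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,    /* M, N, K       */
                1.0, A, 2,  /* alpha, A, lda */
                B, 2,       /* B, ldb        */
                0.0, C, 2); /* beta, C, ldc  */

    printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]); /* [19 22; 43 50] */
    return 0;
}
```

The same GEMM interface sits underneath LAPACK, ScaLAPACK and most tuned vendor kernels, which is why optimized BLAS implementations matter so much for HPC performance.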
120. Reference - Mathematical & Numerical Lib. (II)
Commercial
ESSL & pESSL (IBM/AIX) - Engineering & Scientific Subroutine
Library
MASS (IBM/AIX) - Mathematical Acceleration Subsystem
Intel Math Kernel Library - vector, linear algebra, specially tuned math kernels
NAG Numerical Libraries - Numerical Algorithms Group
IMSL - International Mathematical and Statistical Libraries
PV-WAVE - Workstation Analysis & Visualization Env.
JAMA - Java matrix package, developed by the MathWorks & NIST.
WSSMP - Watson Symmetric Sparse Matrix Package
121. Reference - Message Passing
PVM (Parallel Virtual Machine, ORNL/CSM)
OpenMPI
MVAPICH & MVAPICH2
MPICH & MPICH2
v1 channels:
ch_p4 - based on the older p4 project (Portable Programs for Parallel
Processors), TCP/IP
ch_p4mpd - p4 with MPD daemons for starting and managing processes
ch_shmem - shared-memory-only channel
globus2 - Globus 2
v2 channels:
Nemesis - universal
inter-node modules:
elan, GM, IB (InfiniBand), MX (Myrinet Express), NewMadeleine, TCP
intra-node variants of shared memory for large messages (LMT interface):
ssm - sockets and shared memory
shm - shared memory
sock - TCP/IP sockets
sctp - experimental channel over SCTP sockets
124. Reference - Book
Computer Architecture: A Quantitative Approach
2nd Ed., by David A. Patterson, John L. Hennessy, David Goldberg
Parallel Computer Architecture: A Hardware/Software Approach
by David Culler and J.P. Singh with Anoop Gupta
High-performance Computer Architecture
3rd Ed., by Harold Stone
High Performance Compilers for Parallel Computing
by Michael Wolfe (Addison Wesley, 1996)
Advanced Computer Architectures: A Design Space Approach
by Terence Fountain, Peter Kacsuk, Dezso Sima
Introduction to Parallel Computing: Design and Analysis of Parallel
Algorithms
by Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis
Parallel Computing Works!
by Geoffrey C. Fox, Roy D. Williams, Paul C. Messina
The Interaction of Compilation Technology and Computer Architecture
by David J. Lilja, Peter L. Bird (Editors)
125. National Laboratory Computing Facilities (I)
ANL, Argonne National Laboratory
http://www.lcrc.anl.gov/
ASC, Alabama Supercomputer Center
http://www.asc.edu/supercomputing/
BNL, Brookhaven National Laboratory, Computational Science Center
http://www.bnl.gov/csc/
CACR, Center for Advanced Computing Research
http://www.cacr.caltech.edu/main/
CAPP, Center for Applied Parallel Processing
http://www.ceng.metu.edu.tr/courses/ceng577/announces/
supercomputingfacilities.htm
CHPC, Center for High Performance Computing, University of Utah
http://www.chpc.utah.edu/
126. National Laboratory Computing Facilities (II)
CRPC, Center For Research on Parallel Computation
http://www.crpc.rice.edu/
LANL, Los Alamos National Lab
http://www.lanl.gov/roadrunner/
LBL, Lawrence Berkeley National Lab
http://crd.lbl.gov/
LLNL, Lawrence Livermore National Lab
https://computing.llnl.gov/
MHPCC, Maui High Performance Computing Center
http://www.mhpcc.edu/
NCAR, National Center for Atmospheric Research
http://ncar.ucar.edu/
NCCS, National Center for Computational Science
http://www.nccs.gov/computing-resources/systems-status
127. National Laboratory Computing Facilities (III)
NCSA, National Center for Supercomputing Application
http://www.ncsa.illinois.edu/
NERSC, National Energy Research Scientific Computing Center
http://www.nersc.gov/home-2/
NSCEE, National Supercomputing Center for Energy and the
Environment
http://www.nscee.edu/
NWSC, NCAR-Wyoming Supercomputing Center
http://nwsc.ucar.edu/
ORNL, Oak Ridge National Lab
http://www.ornl.gov/ornlhome/high_performance_computing.shtml
OSC, Ohio Supercomputer Center
http://www.osc.edu/
128. National Laboratory Computing Facilities (IV)
PSC, Pittsburgh Supercomputing Center
http://www.psc.edu/
SANDIA, Sandia National Laboratories
http://www.cs.sandia.gov/
SCRI, Supercomputer Computations Research Institute
http://www.sc.fsu.edu/
SDSC, San Diego Supercomputing Center
http://www.sdsc.edu/services/hpc.html
ARSC, Arctic Region Supercomputing Center
http://www.arsc.edu/
NASA, National Aeronautics and Space Admin
http://www.nas.nasa.gov/