High performance computing - building blocks, production & perspective

High Performance Computing
- Building blocks, Production & Perspective
Jason Shih
Feb, 2012

What is HPC?
 HPC Definition:
 14.9K hits from Google 
 Uses supercomputers and computer clusters to solve advanced
computation problems. Today, computer..... (Wikipedia)
 Use of parallel processing for running advanced application
programs efficiently, reliably and quickly. The term applies
especially to systems that function above a teraflop or 1012
floating-point operations per second. The term HPC is
occasionally used as a synonym for supercomputing, although.
(Techtarget)
 A branch of computer science that concentrates on developing
supercomputers and software to run on supercomputers. A main
area of this discipline is developing parallel processing algorithms
and software (Webopedia)
 And another “14.9k - 3” definitions……

2

So What is HPC Really?
 My understanding
 No clear definition!!
 At least O(2) time as powerful as PC
 Solving advanced computation problem? Online game!?
 HPC ~ Supercomputer? & Supercomputer ~ Σ Cluster(s)
 Possible Components:
 CPU1, CPU2, CPU3….. CPU”N”
 ~ O(1) tons of memory dimm….
 ~ O(2) kW power consumption
 O(1) - ~O(3) K-Cores
 ~ 1 system admin 

 Rmember:
 “640K ought to be enoughfor anybody. ” Bill Gates, 1981 3

Why HPC?
 Possible scenario:
header: budget_Unit = MUSD
if{budget[Gov.] >= O(2) && budget[Com.] >= O(1)}
else {Show_Off == “true”}
else if{Possible_Run_on_PC == “false”}
{Exec “HPC Implementation”}}
Got to Wait another 6M ~ 1Yr…….
 ruth is:
T
 Time consuming operations
 Huge memory demanded tasks
 Mission critical e.g. limit time duration
 Large quantities of run (cores/CPUs etc.)
 non-optimized programs….
4

Why HPC? Story Cont’
 rand Challenge Application Requirements
G
 Capable with 97’ MaxTop500 & 05’ MinTop500 break TFlops
10 PB
LHC Est.
CPU/DISK ~
143MSI2k/56.3PB
~100K-Cores

PFlops
5

How Need HPC? HPC Domain Applications

 Fluid dynamics & Heat Transfer
 Physics & Astrophysics
 Nanoscience
 Chemistry & Biochemistry
 Biophysics & Bioinformatics
 Geophysics & Earth Imaging
 Medical Physics & Drop Discovery
 Databases & Data Mining Financial
 Modeling Signal & Image Processing
 And more ....

6

HPC – Speed vs. Size

> Can’t fit on a PC –
usually because they > Take a very very long time to run
need more than a few on a PC: months or even years. But
GB of RAM, or more Size a problem that would take a month
than a few 100 GB of on a PC might take only a few hours
disk. on a supercomputer.

Speed

7

HPC ~ Supercomputer ~ Σ Cluster(s)
 What is cluster?
 Again, 1.4k hits from google….. 
 A computer cluster is a group of linked computers, working
together closely thus in many respects forming a single
computer... (Wikipedia)
 Single logical unit consisting of multiple computers that are
linked through a LAN. The networked computers essentially
act as a single, much more powerful machine.. (Techopedia)
 And……
 But the cluster is:
 CPU1, CPU2, CPU3….. CPU”N”
 ~ O() Kg of memory dimm….
 < O(1) kW power consumption
 ~ O(1) K-Cores
 Still ~ 1 system admin  8

HPC – Trend in Growth Potential & Easy of Use

10

1.0
GFlops per Processors

- 1995
- 2000
0.1 - 2005
- 2010

0.01

0.001
10 100 1000 10K 100K
Number of Processors (P)
9

HPC
Energy Projection

Strawmen Project

10

HPC – Numbers Before 2000s



11

HPC – Annual Performance Distribution
 Top500 projection in 2012:
 Cray Titan: est. 20PF, transformed from Jaguar @ORNL
 1st phase: replace Cray XT5 w/ Cray XK6 & Operon CPUs & Tesla GPUs
 2nd phase: 18K additional Tesla GPUs
 IBM Sequoia: est. 20PF base on Glue Gene/Q @LLNL
 ExaFlops of “world computing power” in 2016?
10.5 Pflops, K computer,
SPARC64 VIIIfx 2.0GHz

1PFlops 35.8 TF
NEC, Earth-Simulator Japan
GFlops

< 6 Years
1TFlops
8 Years
PC: 109 GFlops
Intel Core i7 980 XE

1GFlops

HPC – Performance Trend “Microprocessor”

13

Trend – Transistors per Processor Chip

14

HPC History (I)
1PFlop/s IBM RoadRunner
Cray Jaguar

Parallel
1TFlop/s
2008 PFlops

Vector
1GFlop/s
SuperScalar

1987 GFlops
1MFlop/s
Scalar

1KFlops/
Bit Level Parallelism Instruction Level Thread Level

1950 1960 1970 1980 1990 2000 15
2010

HPC History (II)

CPU Year/Clock Rate
/Instruction per sec.

16

HPC History (III)
Four Decades of Computing – Time Sharing Era

17

HPC – Cost of Computing (1960s ~ 2011)

About “17 million IBM 1620 units”
costing $64,000 each.
The 1620's multiplication
operation takes 17.7 ms
Cost (USD)

Two 16-processor Beowulf clusters
Cray X-MP with Pentium Pro microprocessors

First computing technology
which scaled to large
First sub-US$1/MFLOPS computing applications while staying
technology. It won the Gordon Bell Prize under US$1/MFLOPS
in 2000 Bunyip Beowulf cluster KLAT2
First sub-US$100/GFLOPS computing technologyKASY0
As of August 2007, this 26.25 GFLOPS "personal" Microwulf
Beowulf cluster can be built for $1256

Year HPU4Science
$30,000 cluster was built using only
commercially available "gamer" grade
hardware
18
Ref: http://en.wikipedia.org/wiki/FLOPS

HPC - Interconnect, Proc Type, Speed & Threads

19
Ref: “Introduction to the HPC Challenge Benchmark Suite” by Piotr et. al.

Power Processor Roadmap

20

IBM mainframe

Intel mainframe

21

HPC – Computing System Evolution (I)
ENIAC
 940s (Beginning)
1
 ENIAC (Eckart & Mauchly, U.Penn)
 Von Neumann Machine
 Sperry Rand Corp
 IBM Corp.
 Vacuum tube
 Thousands instruction per second (0.002 MIPS)
 1950s (Early Days)
 IBM 704, 709x
 CDC 1604
 Transistor (Bell Lab, 1948)
 Memory: Drum/Magnetic Core (32K words)
 Performance: 1 MIPS
 Separate I/O processor

IBM 704 22

HPC – Computing System Evolution (II)
 960s (System Concept)
1 IBM S/360 Model 85
 IBM Stretch Machine (1st Pipeline machine)
 IBM System 360 (Model 64, Model 91)
 CDC 6600
 GE, UNIVAC, RCA, Honeywell & Burrough etc
 Integrated Circuit/Mult-layer (Printed Circuit Board)
 Memory: Semiconductor (3MB)
 Cache (IBM 360 Model 85)
 Performance: 10 MIPS (~1 MFLOPS)
 1970s (Vector, Mini-Computer)
 IBM System 370/M195, 308x
 CDC 7600, Cyber Systems
 DEC Minicomputer
 FPS (Floating Point System)
 Cray 1, XMP
 Large Scale Integrated Circuit
 Performance: 100 MIPS (~10 MFLOPS)
 Multiprogrmming, Time Sharing IBM S/370 Model 168
 Vector: Pipeline Data Stream 23

HPC – Computing System Evolution (III)
 980s (RISC, Micro-Processor)
1
 CDC Cyber 205
 Cray 2, YMP
 IBM 3090 VF
 Japan Inc. (Fujitsu’s VP, NEC’s SX)
 Thinking Machine: CM2 (1st Large Scale Parallel)
 RISC system (Appolo, Sun, SGI, etc)
 CONVAX Vector Machine (mini Cray)
 Microprocessor: PC (Apple, IBM)
 Memory: 100MB Connect Machine: CM-2
 RISC system:
 Pipeline Instruction Stream
 Multiple execution units in core
 Vector: Multiple vector pipelines
 Thinking Machine: kernel level parallelism
 Performance: 100 Mflops
IBM 3090 Processor Complex 24

HPC – Computing System Evolution (IV)
 990s (Cluster, Parallel Computing)
1
 IBM Power Series (1,2,3)
 SGI NUMA System
 Cray CMP, T3E, Cray 3
 CDC ETA
 DEC’s Alpha IBM Power5 Family
 SUN’s Internet Machine
 Intel Parogon
 Cluster of PC Power3 IBM Blue Gene
 Memory: 512MB per processor
 Performance: 1 Teraflops
 SMP node in Cluster System
 000s (Large Scale Parallel System)
2
 IBM Power Series (4,5), Blue Gene
 HP’s Superdome
 Cray SV system
 Intel’s Itanium, Xeon, Woodcrest, Westsmear processor
 emory: 1-8 GB per processor
M
 erformance: Reach 10 Teraflops
P 25

HPC – Programming Language (I)
 Microcode, Machine Language
 Assembly Language (1950s)
 Mnemonic, based on machine instruction set
 Fortran (Formula Translation) (John Backus, 1956)
 IBM Fortran Mark I – IV (1950s, 1960s)
 IBM Fortran G, H, HX (1970), VS Fortran
 CDC, DEC, Cray, etc..., Fortran
 Industrial Standardized - Fortran 77 (1978)
 Industrial Standardized - Fortran (88), 90, 95 (1991,1996)
 HPF (High Performance Fortran) (late 1980)
 Algol (Algorithm Language) (1958) (1960, Dijksta, et. al.)
 Based on Backus-Naur Form method
 Considered as 1st Block Structure Language
 COBOL (Common Business Oriented Language) (1960s)
 IBM PL/1, PL/2 (Programming Language) (mid 60-70s)
 Combined Fortran, COBOL, & Algol
 Pointer function
26
 Exceptional handling

HPC – Programming Language (II)
 Applicative Languages
 IBM APL (A Programming Language) (1970s)
 LISP (List Processing Language) (1960s, MIT)
 BASIC (Beginner’s All-Purpose Symbolic Instruction Code) (mid 1960)
 1st Interactive language via Interpreter
 PASCAL (1975, Nicklass Wirth)
 Derived from Wirth’s Algol-W
 Well designed programming language
 Call argument list by value
 & C++ (mid 1970, Bell Lab)
C
 Procedure language
 ADA (late 1980, U.S. DOD)
 Prolog (Programming Logic) (mid 1970)
27

HPC – Computing Environment
 Batch Processing (before 1970)
 Multi-programming, Time Sharing (1970)
 Remote Job Entry (RJE) (mid 1970)
 Network Computing
 APARnet (mother of INTERNET)
 IBM’s VNET (mid 1970)
 Establishment Community Computing Center
 st Center: NCAR (1967)
1
 U.S. National Supercomputer Centers (1980)
 Parallel Computing
 Distribute Computing
 Emergence of microprocessors
 Grid Computing (2000s)
 Volunteer Computing @Home Technology 28

HPC – Computational Platform Pro & Con

29

HPC – Parallel Computing (I)
 Characteristics:
 Asynchronous Operation (Hardware)
 Multiple Execution Units/Pipelines (Hardware)
 Instruction Level
 Data Parallel
 Kernel Level
 Loop Parallel
 Domain Decomposition
 Functional Decomposition

30

HPC – Parallel Computing (II)
 1st Attempt - ILLIAC, 64-way monster (mid 1970)
 U.S. Navy’s parallel weather forecast program (1970s)
 Early programming method - UNIX thread (late 1970)
 1st Viable Parallel Processing - Cray’s Micro-Tasking (80s)
 Many, Many proposed methods in 1980s: e.g. HPF
 SGI’s NUMA System - A very successful one (1990s)
 Oakridge NL’s PVM and Europe’s PARMAC (Early 90s) programming model
for Distributing Memory System
 Adaption of MPI and OpenMP for parallel programming
 MPI - A main stream of parallel computing (late 1990)
 Well and Clear defined programming model
 Successful of Cluster Computing System
 Network/Switch hardware performance
 Scalability
 Data decomposition allows for running large program
 Mixed MPI/OpenMP parallel programming model for SMP node cluster
system (2000)
31

Env. Sci./Disaster Mitigation

Defense
Engineering

HPC Application Area

Finance &
Business
Science Research

32

HPC – Applications (I)
 Early Days
 Ballistic Table
 Signal Processing
 Cryptography
 Von Neumann’s Weather Simulation
 1950s-60s
 Operational Weather Forecasting
 Computational Fluid Dynamics (CFD, 2D problems)
 Seismic Processing and Oil Reservoir Simulation
 Particle Tracing
 Molecular Dynamics Simulation
 CAD/CAE, Circuit Analysis
 1970s (emergency of “package” program)
 Structural Analysis (FEM application)
 Spectral Weather Forecasting Model
 ab initio Chemistry Computation (Material modeling in quantum level)
 3D Computational Fluid 33

HPC – Applications (II)

 980s (Wide Spread of Commercial/Industrial Usage)
1
 Petroleum Industry: Western Geo, etc
 Computational Chemistry: Charmm, Amber, Gaussian,Gammes, Mopad,
Crystal etc
 Computational Fluid Dynamics: Fluent, Arc3D etc
 Structural Analysis: NASTRAN, Ansys, Abacus, Dyna3D
 Physics: QCD
 Emergence of Multi-Discipline Application Program
 1990s & 2000s
 Grand Challenge Problems
 Life Science
 Large Scale Parallel Program
 Coupling of Computational Models
 Data Intensive Analysis/Computation

34

– Cluster Inside & Insight
Types of Cluster Architectures
Multicores & Heterogeneous Architecture
Cluster Overview & Bottleneck/Latency
Global & Parallel Filesystem
Application Development Step
35

Computer Architecture – Flynn’s Taxonomy
 SISD (single instruction & single data)

 SIMD (single instruction & multiple data)

 MISD (multiple instruction & single data)
 IMD
M
(multiple instruction & multiple data)
> Message Passing
> Share Memory: UMA/NUMA/COMA
36

Constrain on Computing Solution –
“Distributed Computing”

 Opposing forces

Commodity
 Budgets push toward lower
cost computing solutions
 At the expense of operation
cost
 Limitations for power & cooling
 difficult to change on short time
scales
 Challenges:
 Data Distribution & Data
Management
 Distributed Computing Model
 Fault Tolerance, Scalability &
Availability
SMP

Centralized Distributed

37

Shared Memory Architecture
Hybrid Architecture
HPC Cluster
Architecture

Vector Architecture

38
Distributed Memory Architecture

HPC – Multi-core Architectural Spectrum
Heterogeneous Multi-core Platform

NVIDIA Heterogeneous Architecture (GeForce)

Cluster – Commercial x86 Architecture
 Intel Core2 Quad, 2006

41

 Intel Dunnington 7400-series
 last CPU of the Penryn generation and Intel's first multi-
core die & features a single-die six- (or hexa-) core design
with three unified 3 MB L2 caches

42

 Intel Nehalem
 Core i7 2009 Q1 Quadcores

43


 Intel: ”Nehalem-Ex” (i7)

44


 AMD Shanghai, 2007

45

Cluster Overview (I)

 System
 Security & Account Policy
 System Performance Optimization Parallel Computer Arch.
 Mission: HT vs. HP Abstraction Layers
 Benchmarking: Serial vs. Parallel
 NPB, HPL, BioPerf, HPCC & SPEC (2000 & 2006) etc.
 Memory/Cache: Stream, cachebench & BYTEMark etc.
 Data: iozone, iometer, xdd, dd & bonie++ etc.
 Network: NetPIPE, Netperf, Nettest, Netspec & iperf etc.
 load generator: cpuburn, dbench, stress & contest etc.
 Resource Mgmt: Scheduling
 Account policy & Mgmt.
 Hardware
 Regular maintenance: spare parts replacement
 Facility Relocation & Cabling
46

Cluster Overview (II)
 Software
 Compiler: e.g. Intel, PGI, xl*(IBM)
 Compilation, Porting & Debug
 ddressing: 32 vs. 64bit.
A
 Various: Sys. Arch. (IA64, RISC, SPARC etc.)
 Scientific/Numerical Libraries
 NetCDF, PETSC, GSL, CERNLIB (ROOT/PAW), GEANT etc.
 Lapack, Blas, gotoBlas, Scalapack, FFTW, Linpack, HPC-Netlib etc.
 End User Applications:
 VASP, Guassian, Wien, Abinit, PWSCF, WRF, Comcot, Truchas,
VORPAL etc.
 Others
 Documentation
 Functions: UG & AG
 System Design Arch., Account Policy & Mgmt. etc.
 Training 47

Cluster I/O – Latency & Bottleneck
 Modern CPU achieve ~
5GFlops/core/sec.
 ~ 2 8-Bytes words per OP.
 CPU overhead: 80 GB/sec.
 ~O(1) GB/core/sec. B/W
 Case: IBM P7 (755) (Stream)
 Copy: 105418.0 MB
 Scale: 104865.0 MB
 Add: 121341.0 MB
 Triad: 121360.0
 Latency: ~52.8ns (@2.5GHz,
DDR3/1666)
 ven worse: init. fetching data
E
 Cf. Cache:
 L1@2.5GHz: 3 cycles
 L2@2.5GHz: 20 cycles 48

Memory Access vs. Clock Cycles
Data Rate Performance
Memory vs. CPU

49

Cluster – Message Time Breakdown

 Source Overhead
 Network Time
 Destination Overheard

51

Cluster – MPI & Resource Mgr.

MPI Processes Mgmt. w/o
Resource Mgr.

MPI Processes Mgmt. w/
Resource Mgr.

52
Ref: HPC BAS4 UF.

Network Performance Throughput vs. Latency (I)

 Peak 10G 9.1Gbps ~ 877 usec (Msg Size: 1MB)
 IB QDR reach 31.1Gbps with same msg size
 Only 29% of 10G Latency (~256 usec)
 Peak IB QDR 34.8Gbps ~ 57 usec (Msg Size: 262KB)

Network Performance Throughput vs. Latency (II)

 GbE, 10G (FC), IB and IBoIP (DDR vs. QDR)

Network Performance Throughput vs. Latency (III)
 Interconnection:
 GbE, 10G (FC), IB and IBoIP (DDR vs. QDR)
 Max throughput not reach 80% of IB DDR (~46%)
 Peak of DDR IPoIB ~76% of IB peak (9.1Gbps)
 Over IP, QDR have only 54%
 While max throughput reach 85% (34.8Gbps)
 No significant performance gain for IPoIB using RDMA
(by preloading SDO)
 Possible performance degradation
 Existing activities over IB edge switch at the chassis
 Midplane performance limitation
 Reaching 85% on clean IB QDR interconnection:
 Redo performance measurement on IBM QDR

Cluster – File Server Performance

 Preload SDP provided by OFED
 Sockets Direct Protocol (SDP)
 Note: Network protocol which provides
an RDMA accelerated alternative to TCP
over InfiniBand

Cluster – File Server IO Performance (I)

Re-Write Performance

Write Performance

Cluster – File Server IO Performance (II)

Re-Read Performance

Read Performance

Cluster I/O – Cluster filesystem options? (I)
 OCFS2 (Oracle Cluster File System)
 Once proprietary, now GPL
 Available in Linux vanilla kernel
 not widely used outside the database world
 PVFS (Parallel Virtual File System)
 Open source & easy to install
 Userspace-only server
 kernel module required only on clients
 Optimized for MPI-IO
 POSIX compatibility layer performance is sub-optimal
 pNFS (Parallel NFS)
 Extension of NFSv4
 Proprietary solutions available: “Panasas”
 Put together benefits of parallel IO using standard solution (NFS)
59

Cluster I/O – Cluster filesystem options? (II)
 GPFS (General Parallel File System)
 Rock-solid w/ 10-years history
 Available for AIX, Linux & Windows Server 2003
 Proprietary license
 Tightly integrated with IBM cluster management tools
 Lustre
 HA & LB implementation
 highly scalable parallel filesystem: ~ 100K clients
 Performance:
 Client: ~1 GB/s & 1K Metadata Op/s
 MDS: 3K ~ 15K Metadata Op/s
 OSS: 500 ~ 2.5 GB/s
 POSIX compatibility
 Components:
 single or dual Metadata Server (MDS) w/ attached Metadata Target
(MDT) (if consider scalability & load balance)
 multiple “up to ~O(3)” Object Storage Server (OSS) w/ attached Object
60
Storage Targets (OST)

Cluster I/O – Lustre Cluster Breakdown

InfiniBand Interconnect

Lustre Cluster

OSS1 Compute Compute Compute
OSS2 Compute Compute Compute
… … …
OSS nodes
(Load Balanced) Compute Compute Compute
Compute Compute Compute
MDS nodes
(High Availability) Compute Compute Admin
MDS(M) Compute Compute Admin
Compute Compute Login
MDS(S)
Compute Compute Login

Lustre
Quad-CPU Compute Nodes

Connectivity to all Connectivity to all Connectivity to all Connectivity to all
nodes nodes nodes nodes

GigE ethernet for boot and system control traffic
Connectivity to all Connectivity to all Connectivity to all Connectivity to all
nodes nodes nodes nodes

10/100 Ethernet out-of-band management (power on/off, etc)
61

Cluster I/O – Parallel Filesystem using Lustre
 ypical Setup
T
 MDS: ~ O(1) servers with good CPU and RAM, high seek rate
 OSS: ~ O(3) server req. good bus bandwidth, storage

62

Cluster I/O – Lustre Performance (I)
 Interconnection:
 PoB, IB & Quadrics
I

63

Cluster I/O – Lustre Performance (II)

 Scalability
 Throughput/Transactions vs. Num of OSS

64

Cluster I/O – Parallel Filesystem in HPC

65

Cluster – Consolidation & Pursuing High Density

66

Typical Blade System Connectivity Breakdown

Fibre Channel Expansion
Card (CFFv)

Optical Pass-Through
Module and MPO Cables

BNT 1/10 Gb Uplink
Ethernet Switch Module

BladeServer Chassis – BCE

 Hardware & system software
features affecting scalability Reliability
- Hardware
of parallel systems
Scalable Tools - Software
Machine Size
- User/Developer
- Proc. Performance
- Manager
- Num. Processors
- Libraries

Input/Output Totally Scalable Memory Size
- Bandwidth Architecture - Virtual
- Capacity - Physical

Program Env. Interconnect Network
- Familiar Program Paradigm - Latency
- Familiar Interface Memory Type - Bandwidth
- Distributed
- Shared

68

HPC – Demanded Features from diff. Roles

 Def. Roles: Users, Developers, System Administrators
Features Users Developers Managers

Familiar User Interface ✔ ✔ ✔

Familiar Programming Paradigm ✔ ✔

Commercially Supported Applications ✔ ✔

Standards ✔ ✔

Scalable Libraries ✔ ✔

Development Tools ✔

Management Tools ✔

Total System Costs ✔
69

HPC –
Application Development Steps
Prep

SA
SPEC
Run

SA

SPEC
Code

Code

Opt

Par
Mod

Opt

Prep

Run
Par

Mod 70

HPC – Service Scopes
 System Architecture Design
 Various of interconnection e.g. GbE, IB, FC etc.
 Mission specific e.g. high performance or high throughput
 Computational or data intensive
 OMP vs. MPI
 Parallel/Global filesystem
 Cluster Implementation
 Objectives:
 High availability & Fault tolerance
 Load Balancing Design & Validation
 Distributed & Parallel Computing
 Deployment, Configuration, Cluster Mgmt. & Monitoring
 Service Automation and Event Mgmt.
 KB & Helpdesk
 Service Level:
 Helpdesk & Onsite Inspection
 System Reliability & Availability
 1st / 2nd line Tech. Support
 Automation & Alarm Handling
71
 Architecture & Outreach?

High Performance Computing –
Performance Tuning & Optimization

Tuning Strategy Best Practices
Profile & Bottleneck drilldown
System Optimization
Filesystem Improvement & (re-)Design

72

Cluster Management & Administration Tools
 Categories:
 System: OS, network, backup, filesystem & virtualization.
 Clustering: deployment, monitoring, management alarm/logging,
dashboard & automation
 Administration: UID, security, scheduling, accounting
 Application: library, compiler, message-passing & domain-specific.
 Development: debug, profile, toolkits, VC & PM.
 Services: helpdesk, event, KB & FAQ.

73

Cluster Implementation (I)
 Operating system
 Candidates: CentOS, Scientific Linux, RedHat, Fedora etc.
 Cluster Management
 Tools: Oscar, Rocks, uBuntu, xCAT etc.
 Deployment & Configurations
 Tools: cobbler, kickstart, puppet, cfgng, quattor(CERN),
DRBL
 Alarm, Probes & Automation
 Tools: nagios, IPMI, lm_sensors
 System & Service monitoring
 Tools: ganglia, openQRM
 Network Monitoring
 Tools: MRTG, RRD, smokeping, awstats, weathermap
74

Cluster Implementation (II)
 Filesystem
 Candidates: NFS, Lustre, openAFS, pNFS, GPFS etc.
 Performance Analysis & Profile:
 Tools: gprof, pgroup(PGI), VTune(intel), tprof(IBM), TotalView
etc.
 Compilers
 Packages: Intel, PGI, MPI, Pathscale, Absoft, NAG, GNU, Cuda etc.
 Message Passing Libraries (parallel):
 Packages: Intel MPI, OpenMPI, MPICH, MVAPICH, PVM(old),
openMP(POSIX threads) etc.
 Memory Profile & Debug (Threads)
 Tool: Valgrind, IDB, GNU(gdb) etc.
 Distributed computing
 Toolkits: Condor, Globus, gLite(LCG) etc. 75

Cluster Implementation (III)
 Resource Mgmt. & Scheduling
 Tools: Torque, Maui, Moab, Condor, Slurm, SGE(SunGrid
Engine), NQS(old), loadleveler(IBM), LSF(Platform/IBM) etc.
 Dashboard
 Tools: openQRM, openNMS, Ahatsup, OpenView, BigBrother
etc.
 Helpdesk & Trouble Tracking
 Tools: phpFAQ, OTRS, Request Tracker, osTIcket,
simpleTicket, eTicket etc.
 Logging & Events
 Tools: elog, syslogNG etc.
 Knowledge Base
 Tools: vimwiki, Media Wiki, Twiki, phpFAQ, moinmoin etc.
76

Cluster Implementation (IV)
 Security
 Functionality: scanning, intrusion detection, & vulnerability
 Tools: honeypot, snort, saint, snmp, nessus, rootkithunter &
chkrootkit etc.
 Revision Services
 Tools: git, cvs, svn etc.
 Collaborative Project Mgmt.
 Tools: bugzilla, OTRS, projectHQ,
 Accounting:
 Tools: SACCT, PACCT etc.
 Visualization: RRD G/W, Google Chart Tool etc.

77

Cluster Implementation (IV)
 Backup Services
 Tools: Tivoli(IBM), Bacula, rsync, VERITAS, TSM, Netvault,
Amanda, etc.
 Remote Console
 Tools: openNX (no machine), rdp compatible, Hummingbird
(XDMCP), VNC, Xwin32, Cygwin, IPMI v2 etc.
 Cloud & Virtualization
 Packages: openstack, opennebula, eucalyptus, CERNVM,
Vmware, Xen, Citrix, VirtualBox etc.

78

- How We Get to Today?
Moore’s Law, Heat/Energy/Power Density
Hardware Evolution
Datacenter & Green

HPC History Reminder:
1980s - 1st Gflops in single vector processor
1994 - 1st TFlop via thousands of microprocessors
2009 - 1st Pflop via several hundred thousand cores 79

Moore’s Law & Power Density
 Dynamic Pwr ∝ V2fC  2X Transistors/Chip every 1.5Yr
 Cubic effect if inc frequency & supply  Golden Moore (co-founder of
voltage Intel) predicted in 1965.
 Eff ∝ capacitance ∝ cores (linear)
 High performance serial processor 33K ~ 38K MIPs
waste power
7.5K ~ 11K MIPs
 More transistors rather serial

Transistor Count
1971-2011

1 Billion Transistors
processors

25 MIPs

1.0 MIPs

0.1 MIPs

Date of Production
80
Ref: http://en.wikipedia.org/wiki/List_of_Intel_microprocessors

Moore’s Law – What we learn?
Transistor ∝ MIPs ∝ Watts ∝ BTUs

 Rule of thumb: 1 watt of power consumed requires 3.413
BTU/hr of cooling to remove the associated heat
 Inter-chip vs. Intra-chip parallelism
 Challenges: millions of concurrent threads
 HP: Data Center Power Density Went from 2.1 kW/Rack
in 1992 to 14 kw/Rack in 2006
 IDC: 3 Year Costs of Power and Cooling, Roughly Equal
to Initial Capital Equipment Cost of Data Center
 NETWORKWORLD: 63% of 369 IT professionals said
that running out of space or power in their data centers
had already occurred

81

HPC – Feature size, Clock & Die Shrink

Historical data
TRTS Max Clock Rate
Main ITRS node (nm)

Year Feature size (nm)

Feature Size (nm)
Year

Trend: Cores per Socket
 Top500 Nov 2011:
 45.8% & 32% running 6 & quad cores proc.
 5.8% sys. >= 8 cores (2.4% with 16 cores)
1
 more than 2 fold inc. vs. 2010 Nov (6.8%) Top500 2011 Nov
 Trend: quad (73% in 10’) to 6 cores (46% in 11’)

83

HPC – Evolution of Processors

 Transistors: Moore’s Law
 Clock rate no longer as a
proxy for Moore’s Law & Cores
may double instead.
 Power literately under control.

Transistors Physical Gate Length

Ref: “Scaling to Petascale and Beyond: Performance Analysis and Optimization of Applications” NERSC.

HPC – Comprehensive Approach
 CPU Chips
 Clock Frequency & Voltage Scaling
 75% power savings at idle and 40-70% power savings for
utilization in the 20-80% range
 Server
 Chassis: 20-50% Pwr reduction.
 Modular switches & routers
 Server consolidation & virtualization
 Storage Devices
 Max. TB/Watt & Disk Capacity
 Large Scale Tiered Storage
 Max. Pwr Eff by Min. Storage over-provisioning
 Cabling & Networking
 Stackable & backplane capacity (inc. Pwr Eff)
 Scaling & Density 85

HPC – Datacenter Power Projection

 Case: ORNL/UTK inc.
DOE & NSF sys.
 Deploy 2 large Petascale
systems in next 5 years
 Current Power
Consumption 4 MW
 Exp to 15MW before year
end (2011)
 50MW by 2012.
 Cost estimates based on
$0.07 per KwH

86

HPC – Data Center Best Practices
 Traditional Approach
 Hot/Cold Aisle
 Min. Leakage
 Eff. Improvement (Coolig & Power)
 DC input (UPS opt.), Cabling & Container
 Liquid Cooling
 Free Cooling
 Leveraging Hydroelectric Power

Ref: http://www.google.com/about/datacenters/
87
http://www.google.com/about/datacenters/inside/efficiency/power-usage.html

HPC – DataCenter Growing Power Density
Total system efficiency comprises three main elements- the Grid, the Data
Centre and the IT Components. Each element has its own efficiency factor-
multiplied together for 100 watts of power generated, the CPU receives only
12 watts

Heat Load Product Footprint (Watt/ft2)

Ref: Internet2 P&C Nov 2011, “Managing Data Center Power Power & Cooling & Cooling” by Force10 88

HPC - Performance Benchmarking
CPU Arch., Scalability, SMT & Perf/Watt
Case study: Intel vs. AMD

89

HPC – Performance Strategy: “The Amdahl’s Law”
 Fixed-size Model : Speedup = 1 / (s + p/N)
 Scaled-size Model: Speedup = 1 / ((1-P) + P/N) ~ 1/(1-P)
 arallel & Vector scale w/ problem size
P
 s: Σ (I/O + serial bottleneck + vector startup + program loading)
SpeedUP

90
Numer of Processors

Price-Performance for Transaction-Processing

 OLTP – One of the largest server markets is online
transaction processing
 TPC-C – std. industry benchmark for OLTP is
 Queries and updates rely on database system
 Significant factors of performance in TPC-C:
 Reasonable approx. to a real OLTP app.
 Predictive of real system performance:
 total system performance, inc. the hardware, the operating
system, the I/O system, and the database system.
 Complete instruction and timing info for benchmarking
 TPM (measure transactions per minute) & price-
performance in dollars per TPM.

91

 20 SPEC benchmarks
 1.9 GHZ IBM Power5 processor vs. 3.8 GHz Intel Pentium 4
 10 Integer @LHS & 10 floating point @RHS
 Fallacy:
 Processors with lower CPIs will always be faster.
 Processors with faster clock rates will always be faster.

92

 Characteristics of 10 OLTP systems & TPC-C as the
benchmark

93

 Cost of purchase split between processor, memory,
storage, and software

94

Pentium 4 Microarchitecture &
Important characteristics of
the recent Pentium 4 640
implementation in 90 nm
technology (code named
Prescott)

95

HPC – Performance Measurement (I)
 Objective:
 Baseline Performance
 Performance Optimization
 Confident & Verifiable
 Measurement:
 Open Std.: math kernel & application
 MIPS (million instruction per second) (MIPS Tech. Inc.)
 MFLOPS (million floating point operation per second)
 Characteristics:
 Peak vs. Sustained
 Speed-Up & Computing Efficiency (mainly for Parallel)
 CPU Time vs. Elapsed Time
 Program performance (HP) vs. System Throughput (HT)
 Performance per Watt

Ref: http://www-03.ibm.com/systems/power/hardware/benchmarks/hpc.html
http://icl.cs.utk.edu/hpcc/ 96

HPC – Performance Measurement (II)
 Public Benchmark Utilities:
 LINPACK (Jack Dongara, Oak Ridge N.L.)
 Single Precision/Double Precision
 n=100 TPP, n=1000 (Paper& Pencil benchmark)
 HPL, n=’undefined’ (mainly for paraell system)
 Synthetic: Drystone, Whetstone, Khornstone
 SPEC (Standard Performance Evaluation Corp.)
 SPECint (CINT2006), SPECfp(CFP2006), SPEComp●Not allow
for source code modification
 Livermore Loops (introduction of MFLOPS)
 Los Alamos Suite (Vector Computing)
 Stream (Memory Performance)
 NPB (NASA Ames): NPB 1 and NPB 2 (A, B, C)
 Application (Weather/Material/MD/Statistics etc.):
 MM5, NAMD, ANSYS, WRF, VASP etc. 97

Target Processors (I) - AMD vs. Intel

 AMD Magny-Cours Opteron (45nm, Rel. Mar. 2010)
 Socket G34 multi-chip module
 2 x 4-cores or 6-cores dices connecting with HT 3.1
 6172 (12-cores), 2.1GHz
 L2: 8 x 512K, L3: 2 x 6M
 HT: 3.2 GHz
 ACP/TDP: 80W/115W
 Streaming SIMD Extension: SSE, SEE2, SEE3 and SSE4a
 6128HE (8-cores), 2.0GHz
 L2: 8 x 512K, L3: 2 x 6M
 HT: 3.2 GHz
 ACP/TDP: 80W/115W
 Streaming SIMD Extension: SSE, SEE2, SEE3 and SSE4a

Target Processors (II)
- AMD vs. Intel

 Intel Woodcrest, Harpertown and Westmear (Rel. Jun 2006)
 Xeon 5150
 2.66GHz, LGA-771
 L2: 4M
 TDP: 65W
 Streaming SIMD Extension: SSE, SSE2, SSE3 and SSSE3
 Harpertown, Quad-Cores, 45nm (Rel. Nov 2007)
 E5430 2.66GHz
 L2: 2 x 6M
 TDP: 80W
 Streaming SIMD Extension: SSE, SSE2, SSE3, SSSE3 and SSE4.1
 Westmear EP, 6-cores, 32nm (Re. Mar 2010)
 X5650 2.67GHz, LGA-1366
 L2/L3: 6x256K/12MB
 I/O Bus: 2 x 6.4GT/s QPI
 Streaming SIMD Extension: SSE, SSE2, SSE3, SSSE3, SSE4.1 and SSE4.2

SPEC2006 Performance Comparison
- SMT Off Turbo-on

 8 Cores Nehalem-EP vs. 12 Cores Westmere-EP
 32% performance gain by increase 50% of CPU-Cores
 Scalability 12% below Ideal Performance
 SMT Advantage:
 Nehalem-EP 8 Cores to 16 Cores: “24.4%”
 Westmere-EP 12 Cores to 24 Cores: “23.7”

Ref: CERN Openlan Intel WEP Evaluation Report (2010)

Efficiency of Westmere-EP
- Performance per Watt

 Extrapolated from 12G to 24G
 2 Watt per additional GB of Memory
 Dual PSU (Upper) vs. Single PSU
(Lower)
 SMT offer 21% boost in turns of
efficiency
 Approx. 3% consume by SMT
comparing with absolute
performance (23.7%)


Efficiency of Nehalem-EP Microarchitecture
With SMT Off

 Most efficiency Nehalem-EP L5520 vs. X5670
 Westmere add 10%
 With efficiency 9.75% using dual PSU
 +23.4% using single PSU
 Nehalem L5520 vs. Harpertown (E5410)
 +35% performance boost


Multi-Cores Performance Scaling
- AMD Magny-Cours vs. Intel Westmere (I)

Multi-Cores Performance Scaling
- AMD Magny-Cours vs. Intel Westmere (II)

Single Server Linpack Performance
- Intel X5650, 2.67GHz 12G DDR3 (6 cores)
HPL Optimal Performance
~108.7 GFlops per Node

Lesson from Top500
Statistics, Analysis & Future Trend
Processor Tech. & Cores/socket
Cluster Interconnect
power consumption & Efficiency
Regional performance & Trend
106

Top 500 – 2011 Nov.
Rmax(GFlops)

Cores

HPC – Performance of Countries

Nov 2011 Top500
Performance of
Countries

108

Top500 Analysis – Power Consumption & Efficiency

 Top 4 Power Eff.: GlueGene/Q (2011 Nov)
 Rochester > Thomas J. Watson > DOE/NNSA/LLNL

Eff: 2026 GF/kW
BlueGene/Q, Power BQC
16C 1.60 GHz, Custom
11.87MW
RIKEN Advanced Institute
for Computational Science
(AICS) - SPARC64 VIIIfx
2.0GHz, Tofu interconnect

3.6MW, Tianhe-1A
National
Supercomputing
Center in Tianjin

2008 2009 2010 2011 109

Top500 Analysis - Performance & Efficiency
 20% of Top-performed clusters
contribute 60% of Total Computing Power
(27.98PF)
 5 Clusters Eff. < 30

Top500 Analysis - HPC Cluster Performance
 272 (52%) of world fastest clusters have efficiency lower
than 80% (Rmax/Rpeak)
 Only 115 (18%) could drive over 90% of theoretical peak

 Sampling from Top500 HPC cluster

Trend of Cluster Efficiency 2005-2009

Top500 Analysis – HPC Cluster Interconnection
 SDR, DDR and QDR in Top500
 Promising efficiency >= 80%
 Majority of IB ready cluster adopt DDR
(87%) (2009 Nov)
 Contribute 44% of total computing
power
 ~28 Pflops
 Avg efficiency ~78%

Impact Factor: Interconnectivity
- Capacity & Cluster Efficiency

 Over 52% of Cluster base on GbE
 With efficiency around 50% only
 InfiniBand adopt by ~36% HPC Clusters

Common Semantics

 Programmer productivity
 Easy of deployment
 HPC filesystem are more mature, wider feature set:
 High concurrent read and write
 In the comfort zone of programmers (vs cloudFS)
 Wide support, adoption, acceptance possible
 pNFS working to be equivalent
 Reuse standard data management tools
 Backup, disaster recovery and tiering

IB Roadmap
Trend in HPC

74.2PF
10.5PF

50.9TF

Observation & Perspectives (I)
 Performance pursuing another 1000X would be tough
 ~20PF Titan and Jaguar deliver in 2012
 ExaFlops project ~ 2016 (PF in 2008)
 Stil! IB & GbE are the most used interconnect solutions
 multi-cores continue Moore’s Law
 high level parallelism & software readiness
 reduce bus traffic & data locality
 Storage is fastest-growing product sector
 Storage consolidation intensifies competition
 Lustre roadmap stabilized for HPC
 Computing paradigm
 Complicated system vs. supplicated computing tools
 hybrid computing model
 Major concern: power efficiency
 energy in memory & interconnect inc. data search application
 exploit memory power efficiency: large cache?
 Scalability and Reliability
 Performance key factor: data communication
 consider: layout, management & reuse 116

Observation & Perspectives (II)
 Vendor Support & User readiness
No Moore’s Law for software, algorithms &
 Service Orientation applications?
 Standardization & KB
 Automation & Expert system
 Emerging new possibility
 Cloud Infrastructure & Platform
 currently 3% of spending (mostly private cloud)
 Technology push & market/demand pull
 growing opportunity of “Big Data”
 datacenter, SMB & HPC solution providers
 Rapidly growth of accelerator
 Test by ~67% of users (20% in 10’)
 NVIDIA posses 90% of current usage (11’)

“I think there is a world market for maybe five computers”
Thomas Watson, chairman of IBM, 1943
“Computers in the future may weight no more than 1.5 tons. ”
Popular Mechanics, 1949 117

References
 Top500: http://top500.org
 Green Top500: http://www.green500.org
 HPC Advisory Council
 http://www.hpcadvisorycouncil.com/subgroups.php
 HPC Inside
 http://insidehpc.com/
 HPC Wiki
 http://en.wikipedia.org/wiki/High-performance_computing
 Supercomputing Conferences Series
 http://www.supercomp.org/
 Beowulf Cluster
 http://www.beowulf.org/
 MPI Forum:
 http://www.mpi-forum.org/docs/docs.html

118

Reference - Mathematical & Numerical Lib. (I)
 Open Source
 Linpack - numerical linear algebra intend to use on supercomputers
 LAPACK - the successor to LINPACK (Netlib)
 PLAPACK - Parallel Linear Algebra Package
 BLAS - basic linear algebra subprograms
 gotoBlas - optimal performance of Blas with new algorithm & memory
techniques
 Scalapack - high performance linear algebra routines or distributed
memory message passing MIMD computer
 FFTW - Fast Fourier Transform in the West
 HPC-Netlib - is the high performance branch of Netlib
 PETSc - portable, extensible toolkit for scientific computation
 Numerical Recipes
 GNU Scientific Libraries
119

Reference - Mathematical & Numerical Lib. (II)
 Commercial
 ESSL & pESSL (IBM/AIX) - Engineering & Scientific Subroutine
Library
 MASS (IBM/AIX) - Mathematical Acceleration Subsystem
 Intel Math Kernel - vector, linear algebra, special tuned math kernels
 NAG Numerical Libraries - Numerical Algorithms Group
 IMSL - International Mathematical and Statistical Libraries
 PV-WAVE - Workstation Analysis & Visualization Env.
 JAMA - Java matrix package, developed by the MathWorks & NIST.
 WSSMP - Watson Symmetric Sparse Matrix Package

120

Reference - Message Passing
 PVM (Parallel Virtual Machine, ORNL/CSM)
 OpenMPI
 MVAPICH & MVAPICH2
 MPICH & MPICH2
 v1 channels:
 ch_p4 - based on older p4 project (Portable Programs for Parallel
Processors), tcp/ip
 ch_p4mpd - p4 with mpd daemons to starting and managing processes
 ch_shmem - shared memory only channel
 globus2 – Globus2
 v2 channels:
 Nemesis – Universal
 inter-node modules:
elan, GM, IB (infiniband), MX (myrinet express), NewMadeleine, tcp
intra-node variants of shared memory for large messages (LMT interface).
 ssm - Sockets and Shared Memory
 shm - SHared memory
 sock - tcp/ip sockets
 sctp - experimental channel over SCTP sockets 121

Reference - Performance, Benchmark & Tools

 High performance tools & technologies:
 https://computing.llnl.gov/tutorials/performance_tools/
HighPerformanceToolsTechnologiesLC.pdf
 Linux Benchmarking Suite:
 http://lbs.sourceforge.net
 Linux Test Tools Matrix:
 http://ltp.sourceforge.net/tooltable.php
 Network Performance
 http://compnetworking.about.com/od/networkperformance/
TCPIP_Network_Performance_Benchmarks_and_Tools.htm
 http://tldp.org/HOWTO/Benchmarking-HOWTO-3.html
 http://bulk.fefe.de/scalability/
 http://linuxperf.sourceforge.net
122

Reference - Network Security
 Network Security
 Tools: http://sectools.org/ , http://www.yolinux.com/TUTORIALS/
LinuxSecurityTools.html & http://www.lids.org/ etc.
 packet sniffer, wrapper, firewall, scanner, services (MTA/BIND) etc.
 Online Org.:
 CERT http://www.us-cert.gov
 SANS http://www.sans.org
 Linux Network Security
 basic config/utility/profile, encryption & routing.
 (obsolete: http://www.drolez.com/secu/)
 Network Security Toolkit
 Audit, Intrusion Detection & Prevention
 Event Types:
 DDoS, Scanning, Worms, Policy violation & unexpected app. services
 Honeypots, Tripwire, Snort, Tiger, Nessus, Ethereal, nmap, tcpdump,
portscan, portsentry, chkrootkit, rootkithunter, AIDE(HIDE), LIDS etc.
 Ref: NIST “Guide to Intrusion Detection and Prevention Systems” 123

Reference - Book
 Computer Architecture: A Quantitative Approach
 2nd Ed., by David A. Patterson, John L. Hennessy, David Goldberg
 Parallel Computer Architecture: A Hardware/Software Approach
 by David Culler and J.P. Singh with Anoop Gupta
 High-performance Computer Architecture
 3rd Ed., by Harold Stone
 High Performance Compilers for Parallel Computing
 by Michael Wolfe (Addison Wesley, 1996)
 Advanced Computer Architectures: A Design Space Approach
 by Terence Fountain, Peter Kacsuk, Dezso Sima
 Introduction to Parallel Computing: Design and Analysis of Parallel
Algorithms
 by Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis
 Parallel Computing Works!
 by Geoffrey C. Fox, Roy D. Williams, Paul C. Messina
 The Interaction of Compilation Technology and Computer Architecture
 by David J. Lilja, Peter L. Bird (Editor) 124

National Laboratory Computing Facilities (I)

 ANL, Argonne National Laboratory
 http://www.lcrc.anl.gov/
 ASC, Alabama Supercomputer Center
 http://www.asc.edu/supercomputing/
 BNL, Brookhaven National Laboratory, Computational Science Center
 http://www.bnl.gov/csc/
 CACR, Center for Advanced Computing Researc
 http://www.cacr.caltech.edu/main/
 CAPP, Center for Applied Parallel Processing
 http://www.ceng.metu.edu.tr/courses/ceng577/announces/
supercomputingfacilities.htm
 CHPC, Center for High Performance Computing, University of Utah
 http://www.chpc.utah.edu/

125

National Laboratory Computing Facilities (II)

 CRPC, Center For Research on Parallel Computation
 http://www.crpc.rice.edu/
 LANL, Los Alamos National Lab
 http://www.lanl.gov/roadrunner/
 LBL, Lawrence Berkeley National Lab
 http://crd.lbl.gov/
 LLNL, Lawrence Livermore National Lab
 https://computing.llnl.gov/
 MHPCC, Maui High Performance Computing Center
 http://www.mhpcc.edu/
 NCAR, National Center for Atmospheric Research
 http://ncar.ucar.edu/
 NCCS, National Center for Computational Science
 http://www.nccs.gov/computing-resources/systems-status

126

National Laboratory Computing Facilities (III)

 NCSA, National Center for Supercomputing Application
 http://www.ncsa.illinois.edu/
 NERSC, National Energy Research Scientific Computing Center
 http://www.nersc.gov/home-2/
 NSCEE, National Supercomputing Center for Energy and the
Environment
 http://www.nscee.edu/
 NWSC, NCAR-Wyoming Supercomputing Center
 http://nwsc.ucar.edu/
 ORNL, Oak Ridge National Lab
 http://www.ornl.gov/ornlhome/high_performance_computing.shtml
 OSC, Ohio Supercomputer Center
 http://www.osc.edu/

127

National Laboratory Computing Facilities (IV)

 PSC, Pittsburgh Supercomputing Center
 http://www.psc.edu/
 SANDIA, Sandia National Laboratories
 http://www.cs.sandia.gov/
 SCRI, Supercomputer Computations Research Institute
 http://www.sc.fsu.edu/
 SDSC, San Diego Supercomputing Center
 http://www.sdsc.edu/services/hpc.html
 ARSC, Arctic Region Supercomputing Center
 http://nwsc.ucar.edu/
 NASA, National Aeronautics and Space Admin
 http://www.nas.nasa.gov/

128

High performance computing - building blocks, production & perspective

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Destacado

Destacado (16)

Similar a High performance computing - building blocks, production & perspective

Similar a High performance computing - building blocks, production & perspective (20)

Último

Último (20)

High performance computing - building blocks, production & perspective