High Performance Computing
    - Building blocks, Production & Perspective
              Jason Shih
               Feb, 2012
What is HPC?
 HPC Definition:
   14.9K hits from Google 
   Uses supercomputers and computer clusters to solve advanced
   computation problems. Today, computer..... (Wikipedia)
   Use of parallel processing for running advanced application
   programs efficiently, reliably and quickly. The term applies
   especially to systems that function above a teraflop or 10¹²
   floating-point operations per second. The term HPC is
   occasionally used as a synonym for supercomputing, although.
   (Techtarget)
   A branch of computer science that concentrates on developing
   supercomputers and software to run on supercomputers. A main
   area of this discipline is developing parallel processing algorithms
   and software (Webopedia)
   And another “14.9k - 3” definitions……

                                                                    2
So What is HPC Really?
 My understanding
   No clear definition!!
   At least O(2) times as powerful as a PC
   Solving advanced computation problems? Online games!?
   HPC ~ Supercomputer? & Supercomputer ~ Σ Cluster(s)
   Possible Components:
     CPU1, CPU2, CPU3….. CPU”N”
     ~ O(1) tons of memory DIMMs….
     ~ O(2) kW power consumption
     O(1) - ~O(3) K-Cores
     ~ 1 system admin 


 Remember:
   “640K ought to be enough for anybody.” Bill Gates, 1981   3
Why HPC?
 Possible scenario:
header: budget_Unit = MUSD
if (budget[Gov.] >= O(2) && budget[Com.] >= O(1)) {
   if (Show_Off == “true” || Possible_Run_on_PC == “false”)
      exec(“HPC Implementation”);
} else {
   wait(another 6M ~ 1Yr…….);
}
 Truth is:
   Time-consuming operations
   Huge memory-demanding tasks
   Mission critical, e.g. limited time duration
   Large quantities of runs (cores/CPUs etc.)
   Non-optimized programs….
                                                   4
Why HPC? Story Cont’
 Grand Challenge Application Requirements
    Cf. the ’97 Top500 max & the ’05 Top500 min, both breaking TFlops

 [Chart: grand challenge memory vs. compute requirements, spanning up to
  10 PB and PFlops; LHC est. CPU/DISK ~ 143MSI2k/56.3PB, ~100K-Cores]
                                                                    5
Who Needs HPC? HPC Domain Applications

 Fluid dynamics & Heat Transfer
 Physics & Astrophysics
 Nanoscience
 Chemistry & Biochemistry
 Biophysics & Bioinformatics
 Geophysics & Earth Imaging
 Medical Physics & Drug Discovery
 Databases & Data Mining
 Financial Modeling
 Signal & Image Processing
 And more ....


                                        6
HPC – Speed vs. Size


Size: problems that can’t fit on a PC, usually because they need more
than a few GB of RAM, or more than a few hundred GB of disk.

Speed: problems that take a very, very long time to run on a PC, months
or even years. A problem that would take a month on a PC might take
only a few hours on a supercomputer.




                                                               7
HPC ~ Supercomputer ~ Σ Cluster(s)
 What is cluster?
   Again, 1.4k hits from google….. 
   A computer cluster is a group of linked computers, working
   together closely thus in many respects forming a single
   computer... (Wikipedia)
   Single logical unit consisting of multiple computers that are
   linked through a LAN. The networked computers essentially
   act as a single, much more powerful machine.. (Techopedia)
   And……
 But the cluster is:
      CPU1, CPU2, CPU3….. CPU”N”
     ~ O() kg of memory DIMMs….
     < O(1) kW power consumption
     ~ O(1) K-Cores
     Still ~ 1 system admin                                 8
HPC – Trend in Growth Potential & Ease of Use

[Chart: GFlops per processor (0.001 to 10) vs. number of processors P
 (10 to 100K), for systems from 1995, 2000, 2005 and 2010]
                                                                               9
HPC – Energy Projection


         Strawmen Project




                            10
HPC – Numbers Before 2000s

 




                             11
HPC – Annual Performance Distribution
           Top500 projection in 2012:
    Cray Titan: est. 20PF, transformed from Jaguar @ORNL
        1st phase: replace Cray XT5 w/ Cray XK6, Opteron CPUs & Tesla GPUs
        2nd phase: 18K additional Tesla GPUs
    IBM Sequoia: est. 20PF based on Blue Gene/Q @LLNL
ExaFlops of “world computing power” in 2016?

[Chart: Top500 performance over time (GFlops scale). 1GFlops to 1TFlops
 in 8 years; 1TFlops to 1PFlops in < 6 years. Milestones: NEC
 Earth-Simulator, Japan, 35.8 TF; K computer, SPARC64 VIIIfx 2.0GHz,
 10.5 PFlops; PC today: 109 GFlops, Intel Core i7 980 XE]
HPC – Performance Trend “Microprocessor”




                                           13
Trend – Transistors per Processor Chip




                                         14
HPC History (I)
[Chart: performance growth 1950-2010, from scalar machines at 1KFlop/s
 through superscalar, vector and parallel architectures; GFlops reached
 in 1987, PFlops in 2008 (IBM RoadRunner, Cray Jaguar). Parallelism
 eras: bit-level, then instruction-level, then thread-level]
                                                                15
HPC History (II)




CPU: Year / Clock Rate / Instructions per sec.




                        16
HPC History (III)
Four Decades of Computing – Time Sharing Era




                                               17
HPC – Cost of Computing (1960s ~ 2011)


[Chart: cost of computing (USD per unit of performance), 1960s-2011:
 - About “17 million IBM 1620 units” costing $64,000 each; the 1620’s
   multiplication operation takes 17.7 ms
 - Cray X-MP
 - Two 16-processor Beowulf clusters with Pentium Pro microprocessors
 - Bunyip Beowulf cluster: first sub-US$1/MFLOPS computing technology;
   won the Gordon Bell Prize in 2000
 - KLAT2: first computing technology which scaled to large applications
   while staying under US$1/MFLOPS
 - KASY0: first sub-US$100/GFLOPS computing technology
 - Microwulf: as of August 2007, this 26.25 GFLOPS “personal” Beowulf
   cluster can be built for $1,256
 - HPU4Science: $30,000 cluster built using only commercially available
   “gamer” grade hardware]
                                                   18
Ref: http://en.wikipedia.org/wiki/FLOPS
HPC - Interconnect, Proc Type, Speed & Threads




                                                                            19
Ref: “Introduction to the HPC Challenge Benchmark Suite” by Piotr Luszczek et al.
Power Processor Roadmap




                          20
IBM mainframe




 Intel mainframe



                   21
HPC – Computing System Evolution (I)
                                               ENIAC
 1940s (Beginning)
   ENIAC (Eckert & Mauchly, U.Penn)
   Von Neumann Machine
   Sperry Rand Corp
   IBM Corp.
   Vacuum tube
   Thousands of instructions per second (0.002 MIPS)
 1950s (Early Days)
   IBM 704, 709x
   CDC 1604
   Transistor (Bell Lab, 1948)
   Memory: Drum/Magnetic Core (32K words)
   Performance: 1 MIPS
   Separate I/O processor

                                    IBM 704            22
HPC – Computing System Evolution (II)
 1960s (System Concept)                                           IBM S/360 Model 85
    IBM Stretch Machine (1st pipeline machine)
    IBM System 360 (Model 65, Model 91)
    CDC 6600
    GE, UNIVAC, RCA, Honeywell & Burroughs etc.
    Integrated Circuit/Multi-layer (Printed Circuit Board)
    Memory: Semiconductor (3MB)
    Cache (IBM 360 Model 85)
    Performance: 10 MIPS (~1 MFLOPS)
 1970s (Vector, Mini-Computer)
    IBM System 370/M195, 308x
    CDC 7600, Cyber Systems
    DEC Minicomputer
    FPS (Floating Point System)
    Cray 1, XMP
    Large Scale Integrated Circuit
    Performance: 100 MIPS (~10 MFLOPS)
    Multiprogramming, Time Sharing               IBM S/370 Model 168
    Vector: Pipeline Data Stream                                             23
HPC – Computing System Evolution (III)
 1980s (RISC, Micro-Processor)
   CDC Cyber 205
   Cray 2, YMP
   IBM 3090 VF
   Japan Inc. (Fujitsu’s VP, NEC’s SX)
   Thinking Machines: CM-2 (1st Large Scale Parallel)
   RISC systems (Apollo, Sun, SGI, etc.)
   Convex Vector Machine (mini Cray)
   Microprocessor: PC (Apple, IBM)
   Memory: 100MB                                    Connection Machine: CM-2
   RISC system:
      Pipeline Instruction Stream
      Multiple execution units in core
   Vector: Multiple vector pipelines
   Thinking Machine: kernel level parallelism
   Performance: 100 Mflops
                                             IBM 3090 Processor Complex   24
HPC – Computing System Evolution (IV)
 1990s (Cluster, Parallel Computing)
     IBM Power Series (1,2,3)
     SGI NUMA System
     Cray CMP, T3E, Cray 3
     CDC ETA
     DEC’s Alpha                                             IBM Power5 Family
     SUN’s Internet Machine
     Intel Paragon
     Cluster of PC                                Power3       IBM Blue Gene
     Memory: 512MB per processor
     Performance: 1 Teraflops
     SMP node in Cluster System
 2000s (Large Scale Parallel System)
     IBM Power Series (4,5), Blue Gene
     HP’s Superdome
     Cray SV system
     Intel’s Itanium, Xeon, Woodcrest, Westmere processors
     Memory: 1-8 GB per processor
     Performance: Reach 10 Teraflops                                      25
HPC – Programming Language (I)
 Microcode, Machine Language
 Assembly Language (1950s)
    Mnemonic, based on machine instruction set
 Fortran (Formula Translation) (John Backus, 1956)
    IBM Fortran Mark I – IV (1950s, 1960s)
    IBM Fortran G, H, HX (1970), VS Fortran
    CDC, DEC, Cray, etc..., Fortran
    Industrial Standardized - Fortran 77 (1978)
    Industrial Standardized - Fortran (88), 90, 95 (1991,1996)
    HPF (High Performance Fortran) (early 1990s)
 Algol (Algorithm Language) (1958) (1960, Dijkstra, et al.)
    Based on Backus-Naur Form method
    Considered as 1st Block Structure Language
    COBOL (Common Business Oriented Language) (1960s)
 IBM PL/1, PL/2 (Programming Language) (mid 60-70s)
    Combined Fortran, COBOL, & Algol
    Pointer function
    Exception handling
                                                                   26
HPC – Programming Language (II)
 Applicative Languages
   IBM APL (A Programming Language) (1970s)
   LISP (List Processing Language) (1960s, MIT)
 BASIC (Beginner’s All-Purpose Symbolic Instruction Code) (mid 1960s)
   1st interactive language via interpreter
 PASCAL (1970, Niklaus Wirth)
   Derived from Wirth’s Algol-W
   Well designed programming language
   Call argument list by value
 C & C++ (mid 1970s, Bell Labs)
   Procedural languages
 Ada (1980, U.S. DoD)
 Prolog (Programming Logic) (early 1970s)
                                                                         27
HPC – Computing Environment
 Batch Processing (before 1970)
 Multi-programming, Time Sharing (1970)
 Remote Job Entry (RJE) (mid 1970)
 Network Computing
   ARPAnet (mother of the INTERNET)
   IBM’s VNET (mid 1970s)
 Establishment of Community Computing Centers
   1st Center: NCAR (1967)
   U.S. National Supercomputer Centers (1980s)
 Parallel Computing
 Distributed Computing
   Emergence of microprocessors
   Grid Computing (2000s)
 Volunteer Computing @Home Technology           28
HPC – Computational Platform Pro & Con




                                         29
HPC – Parallel Computing (I)
 Characteristics:
   Asynchronous Operation (Hardware)
   Multiple Execution Units/Pipelines (Hardware)
   Instruction Level
   Data Parallel
   Kernel Level
   Loop Parallel
   Domain Decomposition
   Functional Decomposition




                                                    30
HPC – Parallel Computing (II)
 1st Attempt - ILLIAC IV, 64-way monster (mid 1970s)
 U.S. Navy’s parallel weather forecast program (1970s)
 Early programming method - UNIX threads (late 1970s)
 1st Viable Parallel Processing - Cray’s Micro-Tasking (80s)
 Many, many proposed methods in 1980s: e.g. HPF
 SGI’s NUMA System - a very successful one (1990s)
 Oak Ridge NL’s PVM and Europe’s PARMACS (early 90s) programming models
 for distributed memory systems
 Adoption of MPI and OpenMP for parallel programming
 MPI - the main stream of parallel computing (late 1990s)
    Well and clearly defined programming model
    Success of cluster computing systems
    Network/switch hardware performance
    Scalability
    Data decomposition allows for running large programs
 Mixed MPI/OpenMP parallel programming model for SMP-node cluster
 systems (2000s); a minimal sketch follows below
                                                                     31
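To make the mixed model concrete, here is a minimal hybrid MPI + OpenMP sketch in C. This is an illustration added to this write-up, not code from the original material; it assumes an MPI library (e.g. OpenMPI or MPICH) and an OpenMP-capable compiler, built with something like `mpicc -fopenmp hybrid_hello.c`.

```c
/* hybrid_hello.c - one MPI rank per SMP node, OpenMP threads within it */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* FUNNELED: only the master thread will make MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    #pragma omp parallel
    printf("rank %d/%d, thread %d/%d\n",
           rank, nranks, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}
```

Run with e.g. `mpirun -np 2 ./a.out` and `OMP_NUM_THREADS` set to the cores per node.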
HPC Application Areas

 Env. Sci./Disaster Mitigation
 Defense
 Engineering
 Finance & Business
 Science Research



                                                          32
HPC – Applications (I)
 Early Days
   Ballistic Table
   Signal Processing
   Cryptography
   Von Neumann’s Weather Simulation
 1950s-60s
   Operational Weather Forecasting
   Computational Fluid Dynamics (CFD, 2D problems)
   Seismic Processing and Oil Reservoir Simulation
   Particle Tracing
   Molecular Dynamics Simulation
   CAD/CAE, Circuit Analysis
 1970s (emergence of “package” programs)
   Structural Analysis (FEM applications)
   Spectral Weather Forecasting Model
   ab initio Chemistry Computation (Material modeling at quantum level)
   3D Computational Fluid Dynamics                                          33
HPC – Applications (II)

 1980s (Wide Spread of Commercial/Industrial Usage)
   Petroleum Industry: Western Geo, etc.
   Computational Chemistry: CHARMM, Amber, Gaussian, GAMESS, MOPAC,
   Crystal etc.
   Computational Fluid Dynamics: Fluent, ARC3D etc.
   Structural Analysis: NASTRAN, Ansys, Abaqus, Dyna3D
   Physics: QCD
   Emergence of Multi-Discipline Application Program
 1990s & 2000s
   Grand Challenge Problems
   Life Science
   Large Scale Parallel Program
   Coupling of Computational Models
   Data Intensive Analysis/Computation


                                                                 34
High Performance Computing
     – Cluster Inside & Insight
            Types of Cluster Architectures
            Multicores & Heterogeneous Architecture
            Cluster Overview & Bottleneck/Latency
            Global & Parallel Filesystem
            Application Development Step
                                                 35
Computer Architecture – Flynn’s Taxonomy
 SISD (single instruction & single data)




 SIMD (single instruction & multiple data)




 MISD (multiple instruction & single data)
 MIMD
 (multiple instruction & multiple data)
 > Message Passing (see the MPI sketch below)
 > Shared Memory: UMA/NUMA/COMA
                                                     36
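For MIMD via message passing, each process runs its own instruction stream and exchanges data explicitly. A minimal MPI point-to-point sketch in C (illustrative only, not from the original slides; run with at least 2 ranks, e.g. `mpirun -np 2`):

```c
/* mimd_ping.c - two independent processes exchanging one value (MIMD) */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double x = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {            /* each rank follows its own code path */
        x = 3.14159;
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", x);
    }

    MPI_Finalize();
    return 0;
}
```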
Constraint on Computing Solutions –
 “Distributed Computing”

 Opposing forces




                                    Commodity
   Budgets push toward lower
   cost computing solutions
      At the expense of operation
     cost
   Limitations for power & cooling
      difficult to change on short time
     scales
      Challenges:
         Data Distribution & Data
        Management
         Distributed Computing Model
         Fault Tolerance, Scalability &
        Availability
                                    SMP




                                       Centralized   Distributed

                                                                   37
Shared Memory Architecture
                                              Hybrid Architecture
                        HPC Cluster
                        Architecture




Vector Architecture


                                                            38
                              Distributed Memory Architecture
HPC – Multi-core Architectural Spectrum
Heterogeneous Multi-core Platform
NVIDIA Heterogeneous Architecture (GeForce)
Cluster – Commercial x86 Architecture
 Intel Core2 Quad, 2006




                                        41
Cluster – Commercial x86 Architecture
 Intel Dunnington 7400-series
   the last CPU of the Penryn generation and Intel's first multi-core
   (above two) die; features a single-die six- (or hexa-) core design
   with three unified 3 MB L2 caches




                                                                  42
Cluster – Commercial x86 Architecture
 Intel Nehalem
    Core i7, Q1 2009, quad-core




                                        43
Cluster – Commercial x86 Architecture

 Intel: ”Nehalem-EX” (i7)




                                        44
Cluster – Commercial x86 Architecture

 AMD Shanghai, 2008




                                        45
Cluster Overview (I)

 System
  Security & Account Policy
  System Performance Optimization                        Parallel Computer Arch.
    Mission: HT vs. HP                                   Abstraction Layers
    Benchmarking: Serial vs. Parallel
       NPB, HPL, BioPerf, HPCC & SPEC (2000 & 2006) etc.
       Memory/Cache: Stream, cachebench & BYTEMark etc.
       Data: iozone, iometer, xdd, dd & bonnie++ etc.
       Network: NetPIPE, Netperf, Nettest, Netspec & iperf etc.
       load generator: cpuburn, dbench, stress & contest etc.
  Resource Mgmt: Scheduling
  Account policy & Mgmt.
 Hardware
  Regular maintenance: spare parts replacement
  Facility Relocation & Cabling
                                                                            46
Cluster Overview (II)
 Software
   Compiler: e.g. Intel, PGI, xl*(IBM)
   Compilation, Porting & Debug
       Addressing: 32 vs. 64-bit
          Various sys. arch. (IA64, RISC, SPARC etc.)
   Scientific/Numerical Libraries
      NetCDF, PETSC, GSL, CERNLIB (ROOT/PAW), GEANT etc.
      Lapack, Blas, gotoBlas, Scalapack, FFTW, Linpack, HPC-Netlib etc.
   End User Applications:
     VASP, Gaussian, Wien, Abinit, PWSCF, WRF, Comcot, Truchas,
     VORPAL etc.
 Others
   Documentation
     Functions: UG & AG
     System Design Arch., Account Policy & Mgmt. etc.
   Training                                                               47
Cluster I/O – Latency & Bottleneck
 Modern CPUs achieve ~5 GFlops/core
  ~2 8-byte words per op → ~80 GB/sec demanded per core
  vs. only ~O(1) GB/core/sec memory B/W available
  Case: IBM P7 (755) (Stream; a triad-style kernel is sketched below)
     Copy: 105418.0 MB/s
     Scale: 104865.0 MB/s
     Add: 121341.0 MB/s
     Triad: 121360.0 MB/s
  Memory latency: ~52.8ns (@2.5GHz, DDR3/1666)
     Even worse: initial fetching of data
     Cf. Cache:
       L1@2.5GHz: 3 cycles
       L2@2.5GHz: 20 cycles            48
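The Stream numbers above come from McCalpin's STREAM benchmark; its triad kernel can be approximated in C as below. This is an illustrative sketch (array size and timing are simplified; it is not the official benchmark code):

```c
/* triad.c - STREAM-style triad: a[i] = b[i] + s*c[i]
 * bandwidth estimate = 3 arrays * 8 bytes * N / elapsed time */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 22)                  /* 4M doubles per array, >> cache */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    const double s = 3.0;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    clock_t t0 = clock();
    for (long i = 0; i < N; i++)
        a[i] = b[i] + s * c[i];       /* 2 loads + 1 store per element */
    double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("Triad: %.1f MB/s\n", 3.0 * N * sizeof(double) / sec / 1e6);
    free(a); free(b); free(c);
    return 0;
}
```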
Memory Access vs. Clock Cycles
                Data Rate Performance
                Memory vs. CPU




                                        49
50
Cluster – Message Time Breakdown

 Source Overhead
 Network Time
 Destination Overhead
 (a first-order time model is sketched below)




                                   51
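A common first-order model of this breakdown is t(msg) = send overhead + wire latency + size/bandwidth + receive overhead. A tiny calculator in C (the numbers are assumed examples, not measurements from these slides):

```c
/* msgtime.c - first-order point-to-point message time model */
#include <stdio.h>

/* t = overheads on both ends + wire latency + serialization time */
static double msg_time_us(double bytes, double latency_us,
                          double bw_GBps, double overhead_us)
{
    return 2.0 * overhead_us + latency_us + bytes / (bw_GBps * 1e3);
}

int main(void)
{
    /* assumed: 1 us overhead per side, 1.5 us latency, 3.2 GB/s link */
    for (double b = 1e3; b <= 1e6; b *= 10.0)
        printf("%8.0f bytes -> %8.1f us\n",
               b, msg_time_us(b, 1.5, 3.2, 1.0));
    return 0;
}
```

Small messages are dominated by latency and overhead; large ones by bandwidth, which is why the throughput curves on the following slides only saturate at large message sizes.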
Cluster – MPI & Resource Mgr.

                          MPI Processes Mgmt. w/o
                          Resource Mgr.




 MPI Processes Mgmt. w/
 Resource Mgr.



                                                    52
Ref: HPC BAS4 UF.
Network Performance Throughput vs. Latency (I)


 Peak 10G 9.1Gbps ~ 877 usec (Msg Size: 1MB)
 IB QDR reach 31.1Gbps with same msg size
   Only 29% of 10G Latency (~256 usec)
   Peak IB QDR 34.8Gbps ~ 57 usec (Msg Size: 262KB)
Network Performance Throughput vs. Latency (II)

 GbE, 10G (FC), IB and IPoIB (DDR vs. QDR)
Network Performance Throughput vs. Latency (III)
 Interconnection:
   GbE, 10G (FC), IB and IPoIB (DDR vs. QDR)
 Max throughput does not reach 80% of IB DDR (~46%)
 Peak of DDR IPoIB ~76% of IB peak (9.1Gbps)
   Over IP, QDR has only 54%
   While max throughput reaches 85% (34.8Gbps)
 No significant performance gain for IPoIB using RDMA
 (by preloading SDP)
 Possible performance degradation
   Existing activities over IB edge switch at the chassis
   Midplane performance limitation
 Reaching 85% on clean IB QDR interconnection:
   Redo performance measurement on IBM QDR
Cluster – File Server Performance

 Preload SDP provided by OFED
 Sockets Direct Protocol (SDP)
   Note: Network protocol which provides
   an RDMA accelerated alternative to TCP
   over InfiniBand
Cluster – File Server IO Performance (I)




                             Re-Write Performance



 Write Performance
Cluster – File Server IO Performance (II)




                                Re-Read Performance



 Read Performance
Cluster I/O – Cluster filesystem options? (I)
 OCFS2 (Oracle Cluster File System)
   Once proprietary, now GPL
   Available in Linux vanilla kernel
   not widely used outside the database world
 PVFS (Parallel Virtual File System)
   Open source & easy to install
   Userspace-only server
   kernel module required only on clients
   Optimized for MPI-IO
   POSIX compatibility layer performance is sub-optimal
 pNFS (Parallel NFS)
   Extension of NFSv4
   Proprietary solutions available: “Panasas”
   Put together benefits of parallel IO using standard solution (NFS)
                                                                    59
Cluster I/O – Cluster filesystem options? (II)
 GPFS (General Parallel File System)
   Rock-solid w/ 10-year history
   Available for AIX, Linux & Windows Server 2003
   Proprietary license
   Tightly integrated with IBM cluster management tools
 Lustre
   HA & LB implementation
   highly scalable parallel filesystem: ~ 100K clients
   Performance:
      Client: ~1 GB/s & 1K Metadata Op/s
      MDS: 3K ~ 15K Metadata Op/s
      OSS: 500 ~ 2.5 GB/s
   POSIX compatibility
   Components:
      single or dual Metadata Servers (MDS) w/ attached Metadata Target
      (MDT) (dual if considering scalability & load balance)
      multiple “up to ~O(3)” Object Storage Servers (OSS) w/ attached
      Object Storage Targets (OST)
                                                                          60
Cluster I/O – Lustre Cluster Breakdown

[Diagram: Lustre cluster over an InfiniBand interconnect. Load-balanced
 OSS nodes (OSS1, OSS2, …) and high-availability MDS nodes (master &
 standby) serve racks of quad-CPU compute nodes plus admin and login
 nodes. All nodes also connect to a GigE network for boot and system
 control traffic, and to a 10/100 Ethernet out-of-band management
 network (power on/off, etc.)]
                                                                                                                          61
Cluster I/O – Parallel Filesystem using Lustre
 Typical Setup
   MDS: ~O(1) servers with good CPU and RAM, high seek rate
   OSS: ~O(3) servers req. good bus bandwidth, storage




                                                                62
Cluster I/O – Lustre Performance (I)
 Interconnection:
   IPoIB, IB & Quadrics




                                       63
Cluster I/O – Lustre Performance (II)

 Scalability
   Throughput/Transactions vs. Num of OSS




                                             64
Cluster I/O – Parallel Filesystem in HPC




                                           65
Cluster – Consolidation & Pursuing High Density




                                              66
Typical Blade System Connectivity Breakdown

 Fibre Channel Expansion
 Card (CFFv)



                                             Optical Pass-Through
                                             Module and MPO Cables




                                              BNT 1/10 Gb Uplink
                                              Ethernet Switch Module




                 BladeServer Chassis – BCE
 Hardware & system software features affecting scalability of
 parallel systems:

[Diagram: “Totally Scalable Architecture” at the center, surrounded by:
 - Scalable Tools: user/developer, manager, libraries
 - Reliability: hardware, software
 - Machine Size: proc. performance, num. processors
 - Input/Output: bandwidth, capacity
 - Memory Size: virtual, physical
 - Memory Type: distributed, shared
 - Program Env.: familiar program paradigm, familiar interface
 - Interconnect Network: latency, bandwidth]
                                                                                      68
HPC – Demanded Features from diff. Roles

 Def. Roles: Users, Developers & Managers
                Features              Users   Developers   Managers

Familiar User Interface                ✔          ✔           ✔

Familiar Programming Paradigm          ✔          ✔

Commercially Supported Applications    ✔                      ✔

Standards                                         ✔           ✔

Scalable Libraries                     ✔          ✔

Development Tools                                 ✔

Management Tools                                              ✔

Total System Costs                                            ✔
                                                                  69
HPC –
 Application Development Steps

[Flow diagram: iterative development cycle through Prep(aration),
 SA, SPEC, Code, Opt(imization), Par(allelization), Mod(ification)
 and Run stages]
                                 70
HPC – Service Scopes
 System Architecture Design
     Various interconnects e.g. GbE, IB, FC etc.
     Mission specific e.g. high performance or high throughput
     Computational or data intensive
     OMP vs. MPI
     Parallel/Global filesystem
 Cluster Implementation
     Objectives:
        High availability & Fault tolerance
        Load Balancing Design & Validation
        Distributed & Parallel Computing
     Deployment, Configuration, Cluster Mgmt. & Monitoring
     Service Automation and Event Mgmt.
     KB & Helpdesk
 Service Level:
     Helpdesk & Onsite Inspection
     System Reliability & Availability
     1st / 2nd line Tech. Support
    Automation & Alarm Handling
                                                                  71
    Architecture & Outreach?
High Performance Computing –
     Performance Tuning & Optimization


   Tuning Strategy Best Practices
   Profile & Bottleneck drilldown
   System Optimization
   Filesystem Improvement & (re-)Design


                                          72
High Performance Computing
     Cluster Management & Administration Tools
      Categories:
        System: OS, network, backup, filesystem & virtualization.
        Clustering: deployment, monitoring, management alarm/logging,
        dashboard & automation
        Administration: UID, security, scheduling, accounting
        Application: library, compiler, message-passing & domain-specific.
        Development: debug, profile, toolkits, VC & PM.
        Services: helpdesk, event, KB & FAQ.

                                                                     73
Cluster Implementation (I)
 Operating system
   Candidates: CentOS, Scientific Linux, RedHat, Fedora etc.
 Cluster Management
   Tools: Oscar, Rocks, uBuntu, xCAT etc.
 Deployment & Configurations
   Tools: cobbler, kickstart, puppet, cfengine, quattor (CERN),
   DRBL
 Alarm, Probes & Automation
   Tools: nagios, IPMI, lm_sensors
 System & Service monitoring
   Tools: ganglia, openQRM
 Network Monitoring
   Tools: MRTG, RRD, smokeping, awstats, weathermap
                                                               74
Cluster Implementation (II)
 Filesystem
   Candidates: NFS, Lustre, openAFS, pNFS, GPFS etc.
 Performance Analysis & Profile:
   Tools: gprof, pgprof (PGI), VTune (Intel), tprof (IBM), TotalView
   etc.
 Compilers
   Packages: Intel, PGI, MPI, Pathscale, Absoft, NAG, GNU, Cuda etc.
 Message Passing Libraries (parallel):
   Packages: Intel MPI, OpenMPI, MPICH, MVAPICH, PVM(old),
   openMP(POSIX threads) etc.
 Memory Profile & Debug (Threads)
   Tool: Valgrind, IDB, GNU(gdb) etc.
 Distributed computing
   Toolkits: Condor, Globus, gLite(LCG) etc.                   75
Cluster Implementation (III)
 Resource Mgmt. & Scheduling
   Tools: Torque, Maui, Moab, Condor, Slurm, SGE(SunGrid
   Engine), NQS(old), loadleveler(IBM), LSF(Platform/IBM) etc.
 Dashboard
   Tools: openQRM, openNMS, Ahatsup, OpenView, BigBrother
   etc.
 Helpdesk & Trouble Tracking
   Tools: phpFAQ, OTRS, Request Tracker, osTicket,
   simpleTicket, eTicket etc.
 Logging & Events
   Tools: elog, syslogNG etc.
 Knowledge Base
   Tools: vimwiki, Media Wiki, Twiki, phpFAQ, moinmoin etc.
                                                            76
Cluster Implementation (IV)
 Security
   Functionality: scanning, intrusion detection, & vulnerability
   Tools: honeypot, snort, saint, snmp, nessus, rootkithunter &
   chkrootkit etc.
 Revision Services
   Tools: git, cvs, svn etc.
 Collaborative Project Mgmt.
   Tools: bugzilla, OTRS, projectHQ etc.
 Accounting:
   Tools: SACCT, PACCT etc.
     Visualization: RRD G/W, Google Chart Tool etc.



                                                             77
Cluster Implementation (V)
 Backup Services
   Tools: Tivoli(IBM), Bacula, rsync, VERITAS, TSM, Netvault,
   Amanda, etc.
 Remote Console
   Tools: openNX (no machine), rdp compatible, Hummingbird
   (XDMCP), VNC, Xwin32, Cygwin, IPMI v2 etc.
 Cloud & Virtualization
   Packages: openstack, opennebula, eucalyptus, CERNVM,
   Vmware, Xen, Citrix, VirtualBox etc.




                                                             78
High Performance Computing
          - How We Get to Today?
               Moore’s Law, Heat/Energy/Power Density
               Hardware Evolution
               Datacenter & Green

 HPC History Reminder:
 1980s - 1st GFlops in a single vector processor
 1997 - 1st TFlops via thousands of microprocessors
 2008 - 1st PFlops via several hundred thousand cores    79
Moore’s Law & Power Density
 Dynamic Pwr ∝ V²fC (written out below)   2X Transistors/Chip every 1.5Yr
   Cubic effect if increasing frequency      Gordon Moore (co-founder of
   & supply voltage                          Intel) predicted this in 1965.
   Eff ∝ capacitance ∝ cores (linear)
   High-performance serial processors
   waste power: spend transistors on
   more cores rather than serial speed

[Chart: Intel transistor count and MIPS by date of production,
 1971-2011, from 0.1 MIPS up to 33K~38K MIPS and 1 billion transistors]
                                                                             80
                             Ref: http://en.wikipedia.org/wiki/List_of_Intel_microprocessors
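Written out, the proportionality above is the standard CMOS dynamic (switching) power relation; the activity factor α is added here for completeness:

```latex
% CMOS dynamic power; \alpha = activity factor, C = switched capacitance
P_{\mathrm{dyn}} \propto \alpha\, C\, V^{2} f
% attainable clock scales roughly with supply voltage, f \propto V, so
P_{\mathrm{dyn}} \propto f^{3}
\quad\text{(the cubic effect of raising $f$ and $V$ together)}
```

Adding cores instead grows C (and thus power) only linearly at a fixed frequency, which is the argument for multi-core made on this slide.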
Moore’s Law – What we learn?
Transistor ∝ MIPs ∝ Watts ∝ BTUs

 Rule of thumb: 1 watt of power consumed requires 3.413
 BTU/hr of cooling to remove the associated heat (worked example below)
 Inter-chip vs. Intra-chip parallelism
 Challenges: millions of concurrent threads
 HP: Data Center Power Density went from 2.1 kW/Rack
 in 1992 to 14 kW/Rack in 2006
 IDC: 3 Year Costs of Power and Cooling, Roughly Equal
 to Initial Capital Equipment Cost of Data Center
 NETWORKWORLD: 63% of 369 IT professionals said
 that running out of space or power in their data centers
 had already occurred
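Applying the rule of thumb to HP's 2006 figure gives a feel for the numbers (a worked example added here, not from the slides):

```latex
% cooling load for one 14 kW rack, at 1\,\mathrm{W} \to 3.413\ \mathrm{BTU/hr}
14\,000\ \mathrm{W} \times 3.413\ \tfrac{\mathrm{BTU/hr}}{\mathrm{W}}
  \approx 47\,800\ \mathrm{BTU/hr}
  \approx 4\ \text{tons of cooling}\quad
  (1\ \text{ton} = 12\,000\ \mathrm{BTU/hr})
```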


                                                      81
HPC – Feature size, Clock & Die Shrink


[Chart: main ITRS node (nm) and historical/ITRS max clock rate vs.
 year; feature size shrinks steadily while clock rates flatten]
Trend: Cores per Socket
 Top500 Nov 2011:
   45.8% & 32% running 6- & quad-core proc.
   15.8% sys. >= 8 cores (2.4% with 16 cores)
     more than 2-fold inc. vs. 2010 Nov (6.8%)        Top500 2011 Nov
   Trend: quad (73% in ’10) to 6 cores (46% in ’11)




                                                                  83
HPC – Evolution of Processors

 Transistors: Moore’s Law
 Clock rate no longer serves as a
 proxy for Moore’s Law; cores may
 double instead.
 Power literally under control.




                                                                    Transistors Physical Gate Length

Ref: “Scaling to Petascale and Beyond: Performance Analysis and Optimization of Applications” NERSC.
HPC – Comprehensive Approach
 CPU Chips
   Clock Frequency & Voltage Scaling
   75% power savings at idle and 40-70% power savings for
   utilization in the 20-80% range
 Server
   Chassis: 20-50% Pwr reduction.
   Modular switches & routers
   Server consolidation & virtualization
 Storage Devices
   Max. TB/Watt & Disk Capacity
   Large Scale Tiered Storage
     Max. Pwr Eff by Min. Storage over-provisioning
 Cabling & Networking
   Stackable & backplane capacity (inc. Pwr Eff)
     Scaling & Density                                      85
HPC – Datacenter Power Projection

 Case: ORNL/UTK inc.
 DOE & NSF sys.
   Deploy 2 large Petascale
   systems in the next 5 years
   Current power
   consumption: 4 MW
   Expected 15MW before year
   end (2011)
   50MW by 2012.
   Cost estimates based on
   $0.07 per kWh



                                    86
HPC – Data Center Best Practices
 Traditional Approach
   Hot/Cold Aisle
   Min. Leakage
   Eff. Improvement (Cooling & Power)
      DC input (UPS opt.), Cabling & Container
      Liquid Cooling
      Free Cooling
      Leveraging Hydroelectric Power




 Ref: http://www.google.com/about/datacenters/
                                                                              87
 http://www.google.com/about/datacenters/inside/efficiency/power-usage.html
HPC – DataCenter Growing Power Density
 Total system efficiency comprises three main elements: the Grid, the Data
 Centre and the IT Components. Each element has its own efficiency factor;
 multiplied together, for 100 watts of power generated the CPU receives only
 12 watts




                                                  Heat Load Product Footprint (Watt/ft2)




Ref: Internet2 P&C Nov 2011, “Managing Data Center Power & Cooling” by Force10   88
HPC - Performance Benchmarking
    CPU Arch., Scalability, SMT & Perf/Watt
          Case study: Intel vs. AMD




                                         89
HPC – Performance Strategy: “Amdahl’s Law”
 Fixed-size Model: Speedup = 1 / (s + p/N)
 Scaled-size Model: Speedup = 1 / ((1-P) + P/N) → 1/(1-P) as N → ∞
   Parallel & Vector scale w/ problem size
   s: Σ (I/O + serial bottleneck + vector startup + program loading)
   (a numeric check follows below)

 [Chart: speedup vs. number of processors for different serial fractions]
                                                                        90
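A quick numeric check of the law (a small illustrative C program; the 0.95 parallel fraction is an assumed example):

```c
/* amdahl.c - speedup = 1 / ((1-P) + P/N) for parallel fraction P */
#include <stdio.h>

int main(void)
{
    const double P = 0.95;                  /* assumed parallel fraction */

    for (int N = 1; N <= 1024; N *= 4)
        printf("N = %4d   speedup = %6.2f\n",
               N, 1.0 / ((1.0 - P) + P / N));

    printf("limit (N -> infinity) = %.2f\n", 1.0 / (1.0 - P));
    return 0;
}
```

Even with 95% of the work parallelized, speedup saturates near 20x no matter how many processors are added, which is the point of the curve on this slide.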
Price-Performance for Transaction-Processing

 OLTP – one of the largest server markets is online
 transaction processing
   TPC-C – the std. industry benchmark for OLTP
   Queries and updates rely on the database system
 Significant factors of performance in TPC-C:
   Reasonable approx. to a real OLTP app.
   Predictive of real system performance:
     total system performance, inc. the hardware, the operating
     system, the I/O system, and the database system
     Complete instruction and timing info for benchmarking
   TPM (measures transactions per minute) & price-
   performance in dollars per TPM

                                                                   91
 20 SPEC benchmarks
  1.9 GHz IBM Power5 processor vs. 3.8 GHz Intel Pentium 4
  10 integer @LHS & 10 floating point @RHS
  Fallacy:
    Processors with lower CPIs will always be faster.
    Processors with faster clock rates will always be faster.




                                                                 92
 Characteristics of 10 OLTP systems & TPC-C as the
 benchmark




                                                      93
 Cost of purchase split between processor, memory,
 storage, and software




                                                      94
Pentium 4 Microarchitecture &
Important characteristics of
the recent Pentium 4 640
implementation in 90 nm
technology (code named
Prescott)




                                95
HPC – Performance Measurement (I)
 Objective:
   Baseline Performance
   Performance Optimization
   Confidence & Verifiability
 Measurement:
   Open Std.: math kernel & application
   MIPS (million instructions per second) (MIPS Tech. Inc.)
   MFLOPS (million floating-point operations per second; a naive
   measurement sketch follows below)
 Characteristics:
   Peak vs. Sustained
   Speed-Up & Computing Efficiency (mainly for Parallel)
   CPU Time vs. Elapsed Time
   Program performance (HP) vs. System Throughput (HT)
   Performance per Watt


Ref: http://www-03.ibm.com/systems/power/hardware/benchmarks/hpc.html
http://icl.cs.utk.edu/hpcc/                                             96
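To make MFLOPS concrete, a naive measurement loop in C might look like the following (an illustrative sketch added here, not one of the benchmarks cited; real suites such as LINPACK control the timing and the kernel far more carefully):

```c
/* mflops.c - naive MFLOPS estimate from a repeated DAXPY-style loop */
#include <stdio.h>
#include <time.h>

#define N    5000000L
#define REPS 10

static double x[N], y[N];

int main(void)
{
    const double a = 2.5;

    for (long i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    clock_t t0 = clock();
    for (int r = 0; r < REPS; r++)
        for (long i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];     /* 2 floating-point ops each */
    double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("%.1f MFLOPS (check: y[0] = %f)\n",
           2.0 * N * REPS / sec / 1e6, y[0]);
    return 0;
}
```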
HPC – Performance Measurement (II)
 Public Benchmark Utilities:
   LINPACK (Jack Dongarra, Oak Ridge N.L.)
     Single Precision/Double Precision
     n=100, n=1000 (TPP, paper & pencil benchmark)
     HPL, n=’undefined’ (mainly for parallel systems)
   Synthetic: Dhrystone, Whetstone, Khornerstone
   SPEC (Standard Performance Evaluation Corp.)
     SPECint (CINT2006), SPECfp (CFP2006), SPEComp
     Source code modification not allowed
   Livermore Loops (introduced MFLOPS)
   Los Alamos Suite (Vector Computing)
   Stream (Memory Performance)
   NPB (NASA Ames): NPB 1 and NPB 2 (classes A, B, C)
 Application (Weather/Material/MD/Statistics etc.):
   MM5, NAMD, ANSYS, WRF, VASP etc.                       97
Target Processors (I) - AMD vs. Intel


 AMD Magny-Cours Opteron (45nm, Rel. Mar. 2010)
   Socket G34 multi-chip module
      2 x 4-core or 6-core dies connected with HT 3.1
   6172 (12-cores), 2.1GHz
      L2: 8 x 512K, L3: 2 x 6M
      HT: 3.2 GHz
      ACP/TDP: 80W/115W
      Streaming SIMD Extensions: SSE, SSE2, SSE3 and SSE4a
   6128HE (8-cores), 2.0GHz
      L2: 8 x 512K, L3: 2 x 6M
      HT: 3.2 GHz
      ACP/TDP: 80W/115W
      Streaming SIMD Extensions: SSE, SSE2, SSE3 and SSE4a
Target Processors (II)
      - AMD vs. Intel

 Intel Woodcrest, Harpertown and Westmere (Rel. Jun 2006)
   Xeon 5150
      2.66GHz, LGA-771
      L2: 4M
      TDP: 65W
      Streaming SIMD Extension: SSE, SSE2, SSE3 and SSSE3
   Harpertown, Quad-Cores, 45nm (Rel. Nov 2007)
      E5430 2.66GHz
      L2: 2 x 6M
      TDP: 80W
      Streaming SIMD Extension: SSE, SSE2, SSE3, SSSE3 and SSE4.1
   Westmere-EP, 6-cores, 32nm (Rel. Mar 2010)
      X5650 2.67GHz, LGA-1366
      L2/L3: 6x256K/12MB
      I/O Bus: 2 x 6.4GT/s QPI
      Streaming SIMD Extension: SSE, SSE2, SSE3, SSSE3, SSE4.1 and SSE4.2
SPEC2006 Performance Comparison
    - SMT Off Turbo-on


 8-core Nehalem-EP vs. 12-core Westmere-EP
 32% performance gain from increasing CPU cores by 50%
 Scalability 12% below ideal performance
 SMT Advantage:
   Nehalem-EP 8 cores to 16 threads: “24.4%”
   Westmere-EP 12 cores to 24 threads: “23.7%”




Ref: CERN openlab Intel WEP Evaluation Report (2010)
Efficiency of Westmere-EP
      - Performance per Watt


 Extrapolated from 12G to 24G
   2 Watt per additional GB of Memory
 Dual PSU (Upper) vs. Single PSU
 (Lower)
 SMT offers a 21% boost in terms
 of efficiency
 Approx. 3% is consumed by SMT
 compared with the absolute
 performance gain (23.7%)




Ref: CERN openlab Intel WEP Evaluation Report (2010)
Efficiency of Nehalem-EP Microarchitecture
      With SMT Off

 Most efficient Nehalem-EP (L5520) vs. X5670
   Westmere adds 10%
     with efficiency +9.75% using dual PSU
   +23.4% using single PSU
 Nehalem L5520 vs. Harpertown (E5410)
   +35% performance boost




Ref: CERN openlab Intel WEP Evaluation Report (2010)
Multi-Cores Performance Scaling
     - AMD Magny-Cours vs. Intel Westmere (I)
Multi-Cores Performance Scaling
     - AMD Magny-Cours vs. Intel Westmere (II)
Single Server Linpack Performance
     - Intel X5650, 2.67GHz 12G DDR3 (6 cores)
     HPL Optimal Performance
     ~108.7 GFlops per Node
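As a sanity check on that figure (a back-of-envelope estimate added here, assuming a dual-socket node and 4 double-precision flops per cycle per Westmere core with SSE):

```latex
% theoretical peak of a 2-socket X5650 node (2 x 6 cores)
R_{\mathrm{peak}} = 12\ \text{cores} \times 2.67\ \mathrm{GHz}
                  \times 4\ \tfrac{\text{flops}}{\text{cycle}}
                 \approx 128.2\ \mathrm{GFlops}
% measured HPL result vs. peak
\text{efficiency} = \frac{R_{\max}}{R_{\mathrm{peak}}}
                  = \frac{108.7}{128.2} \approx 85\%
```

An HPL efficiency around 85% is in line with the Rmax/Rpeak figures discussed in the Top500 analysis later in this deck.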
Lesson from Top500
    Statistics, Analysis & Future Trend
        Processor Tech. & Cores/socket
        Cluster Interconnect
        power consumption & Efficiency
        Regional performance & Trend
                                          106
Top 500 – 2011 Nov.
[Chart: Rmax (GFlops) vs. core count for the Nov 2011 Top500 systems]
HPC – Performance of Countries

                      Nov 2011 Top500
                      Performance of
                      Countries




                                        108
Top500 Analysis – Power Consumption & Efficiency

    Top 4 Power Eff.: BlueGene/Q (2011 Nov)
      Rochester > Thomas J. Watson > DOE/NNSA/LLNL

 [Chart: power consumption & efficiency of leading systems, 2008-2011:
  BlueGene/Q, Power BQC 16C 1.60 GHz, custom interconnect: eff. 2026 GF/kW;
  K computer, RIKEN Advanced Institute for Computational Science (AICS),
  SPARC64 VIIIfx 2.0GHz, Tofu interconnect: 11.87MW;
  Tianhe-1A, National Supercomputing Center in Tianjin: 3.6MW]
                                                                 109
Top500 Analysis - Performance & Efficiency
 20% of the top-performing clusters
 contribute 60% of total computing power
 (27.98PF)
 5 clusters have eff. < 30%
Top500 Analysis - HPC Cluster Performance
 272 (52%) of the world’s fastest clusters have efficiency lower
 than 80% (Rmax/Rpeak)
 Only 115 (18%) could drive over 90% of theoretical peak

                             Sampling from Top500 HPC cluster




                   Trend of Cluster Efficiency 2005-2009
Top500 Analysis – HPC Cluster Interconnection
 SDR, DDR and QDR in Top500
  Promising efficiency >= 80%
  Majority of IB-ready clusters adopt DDR
  (87%) (2009 Nov)
  Contribute 44% of total computing
  power
     ~28 Pflops
  Avg efficiency ~78%
Impact Factor: Interconnectivity
    - Capacity & Cluster Efficiency

 Over 52% of clusters are based on GbE
   with efficiency around 50% only
 InfiniBand adopted by ~36% of HPC clusters
Common Semantics




 Programmer productivity
 Ease of deployment
 HPC filesystems are more mature, with a wider feature set:
   High concurrent read and write
   In the comfort zone of programmers (vs. cloudFS)
 Wide support, adoption, acceptance possible
   pNFS working to be equivalent
   Reuse standard data management tools
     Backup, disaster recovery and tiering
IB Roadmap
Trend in HPC

[Chart: Top500 performance trend; aggregate 74.2PF, #1 at 10.5PF,
 #500 at 50.9TF]
Observation & Perspectives (I)
 Performance pursuing another 1000X would be tough
     ~20PF Titan and Jaguar delivered in 2012
     ExaFlops project ~ 2016 (PF in 2008)
     Still! IB & GbE are the most used interconnect solutions
     multi-cores continue Moore’s Law
        high-level parallelism & software readiness
        reduced bus traffic & data locality
 Storage is the fastest-growing product sector
     Storage consolidation intensifies competition
     Lustre roadmap stabilized for HPC
 Computing paradigm
     Complicated systems vs. sophisticated computing tools
        hybrid computing model
     Major concern: power efficiency
        energy in memory & interconnect inc. for data-search applications
        exploit memory power efficiency: large cache?
     Scalability and Reliability
     Performance key factor: data communication
        consider: layout, management & reuse                           116
Observation & Perspectives (II)
 Vendor Support & User readiness
                                                          No Moore’s Law for software, algorithms &
   Service Orientation                                   applications?
   Standardization & KB
   Automation & Expert system
 Emerging new possibility
    Cloud Infrastructure & Platform
       currently 3% of spending (mostly private cloud)
   Technology push & market/demand pull
   growing opportunity of “Big Data”
       datacenter, SMB & HPC solution providers
 Rapid growth of accelerators
    Tested by ~67% of users (20% in ’10)
    NVIDIA possesses 90% of current usage (’11)


   “I think there is a world market for maybe five computers”
   Thomas Watson, chairman of IBM, 1943
           “Computers in the future may weigh no more than 1.5 tons.”
           Popular Mechanics, 1949                                                          117
References
 Top500: http://top500.org
 Green Top500: http://www.green500.org
 HPC Advisory Council
   http://www.hpcadvisorycouncil.com/subgroups.php
 HPC Inside
   http://insidehpc.com/
 HPC Wiki
   http://en.wikipedia.org/wiki/High-performance_computing
 Supercomputing Conferences Series
   http://www.supercomp.org/
 Beowulf Cluster
   http://www.beowulf.org/
 MPI Forum:
   http://www.mpi-forum.org/docs/docs.html

                                                              118
Reference - Mathematical & Numerical Lib. (I)
 Open Source
  Linpack - numerical linear algebra, intended for use on supercomputers
  LAPACK - the successor to LINPACK (Netlib)
  PLAPACK - Parallel Linear Algebra Package
  BLAS - basic linear algebra subprograms
  gotoBlas - optimal performance of Blas with new algorithm & memory
  techniques
  Scalapack - high performance linear algebra routines for distributed
  memory message-passing MIMD computers
  FFTW - Fast Fourier Transform in the West
  HPC-Netlib - is the high performance branch of Netlib
  PETSc - portable, extensible toolkit for scientific computation
  Numerical Recipes
  GNU Scientific Libraries
                                                                         119
Reference - Mathematical & Numerical Lib. (II)
 Commercial
  ESSL & pESSL (IBM/AIX) - Engineering & Scientific Subroutine
  Library
  MASS (IBM/AIX) - Mathematical Acceleration Subsystem
  Intel Math Kernel - vector, linear algebra, special tuned math kernels
  NAG Numerical Libraries - Numerical Algorithms Group
  IMSL - International Mathematical and Statistical Libraries
  PV-WAVE - Workstation Analysis & Visualization Env.
  JAMA - Java matrix package, developed by the MathWorks & NIST.
  WSSMP - Watson Symmetric Sparse Matrix Package




                                                                   120
Reference - Message Passing
 PVM (Parallel Virtual Machine, ORNL/CSM)
 OpenMPI
 MVAPICH & MVAPICH2
 MPICH & MPICH2
      v1 channels:
         ch_p4 - based on older p4 project (Portable Programs for Parallel
        Processors), tcp/ip
         ch_p4mpd - p4 with mpd daemons for starting and managing processes
         ch_shmem - shared memory only channel
         globus2 – Globus2
      v2 channels:
         Nemesis – Universal
         inter-node modules:
          elan, GM, IB (infiniband), MX (myrinet express), NewMadeleine, tcp
          intra-node variants of shared memory for large messages (LMT interface).
         ssm - Sockets and Shared Memory
         shm - SHared memory
         sock - tcp/ip sockets
         sctp - experimental channel over SCTP sockets                         121
Reference - Performance, Benchmark & Tools

 High performance tools & technologies:
   https://computing.llnl.gov/tutorials/performance_tools/
   HighPerformanceToolsTechnologiesLC.pdf
 Linux Benchmarking Suite:
   http://lbs.sourceforge.net
 Linux Test Tools Matrix:
   http://ltp.sourceforge.net/tooltable.php
 Network Performance
   http://compnetworking.about.com/od/networkperformance/
   TCPIP_Network_Performance_Benchmarks_and_Tools.htm
   http://tldp.org/HOWTO/Benchmarking-HOWTO-3.html
   http://bulk.fefe.de/scalability/
   http://linuxperf.sourceforge.net
                                                              122
Reference - Network Security
 Network Security
   Tools: http://sectools.org/ , http://www.yolinux.com/TUTORIALS/
   LinuxSecurityTools.html & http://www.lids.org/ etc.
      packet sniffer, wrapper, firewall, scanner, services (MTA/BIND) etc.
 Online Org.:
   CERT http://www.us-cert.gov
   SANS http://www.sans.org
 Linux Network Security
   basic config/utility/profile, encryption & routing.
   (obsolete: http://www.drolez.com/secu/)
 Network Security Toolkit
 Audit, Intrusion Detection & Prevention
   Event Types:
      DDoS, Scanning, Worms, Policy violation & unexpected app. services
   Honeypots, Tripwire, Snort, Tiger, Nessus, Ethereal, nmap, tcpdump,
   portscan, portsentry, chkrootkit, rootkithunter, AIDE(HIDE), LIDS etc.
   Ref: NIST “Guide to Intrusion Detection and Prevention Systems”       123
Reference - Book
 Computer Architecture: A Quantitative Approach
   2nd Ed., by David A. Patterson, John L. Hennessy, David Goldberg
 Parallel Computer Architecture: A Hardware/Software Approach
   by David Culler and J.P. Singh with Anoop Gupta
 High-performance Computer Architecture
   3rd Ed., by Harold Stone
 High Performance Compilers for Parallel Computing
   by Michael Wolfe (Addison Wesley, 1996)
 Advanced Computer Architectures: A Design Space Approach
   by Terence Fountain, Peter Kacsuk, Dezso Sima
 Introduction to Parallel Computing: Design and Analysis of Parallel
 Algorithms
     by Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis
 Parallel Computing Works!
     by Geoffrey C. Fox, Roy D. Williams, Paul C. Messina
 The Interaction of Compilation Technology and Computer Architecture
     by David J. Lilja, Peter L. Bird (Editor)                         124
National Laboratory Computing Facilities (I)

 ANL, Argonne National Laboratory
   http://www.lcrc.anl.gov/
 ASC, Alabama Supercomputer Center
   http://www.asc.edu/supercomputing/
 BNL, Brookhaven National Laboratory, Computational Science Center
   http://www.bnl.gov/csc/
 CACR, Center for Advanced Computing Research
   http://www.cacr.caltech.edu/main/
 CAPP, Center for Applied Parallel Processing
   http://www.ceng.metu.edu.tr/courses/ceng577/announces/
   supercomputingfacilities.htm
 CHPC, Center for High Performance Computing, University of Utah
   http://www.chpc.utah.edu/

                                                               125
National Laboratory Computing Facilities (II)

 CRPC, Center For Research on Parallel Computation
   http://www.crpc.rice.edu/
 LANL, Los Alamos National Lab
   http://www.lanl.gov/roadrunner/
 LBL, Lawrence Berkeley National Lab
   http://crd.lbl.gov/
 LLNL, Lawrence Livermore National Lab
   https://computing.llnl.gov/
 MHPCC, Maui High Performance Computing Center
   http://www.mhpcc.edu/
 NCAR, National Center for Atmospheric Research
   http://ncar.ucar.edu/
 NCCS, National Center for Computational Science
   http://www.nccs.gov/computing-resources/systems-status

                                                             126
National Laboratory Computing Facilities (III)
 • NCSA, National Center for Supercomputing Applications
     http://www.ncsa.illinois.edu/
 • NERSC, National Energy Research Scientific Computing Center
     http://www.nersc.gov/home-2/
 • NSCEE, National Supercomputing Center for Energy and the Environment
     http://www.nscee.edu/
 • NWSC, NCAR-Wyoming Supercomputing Center
     http://nwsc.ucar.edu/
 • ORNL, Oak Ridge National Lab
     http://www.ornl.gov/ornlhome/high_performance_computing.shtml
 • OSC, Ohio Supercomputer Center
     http://www.osc.edu/
National Laboratory Computing Facilities (IV)
 • PSC, Pittsburgh Supercomputing Center
     http://www.psc.edu/
 • SANDIA, Sandia National Laboratories
     http://www.cs.sandia.gov/
 • SCRI, Supercomputer Computations Research Institute
     http://www.sc.fsu.edu/
 • SDSC, San Diego Supercomputer Center
     http://www.sdsc.edu/services/hpc.html
 • ARSC, Arctic Region Supercomputing Center
     http://www.arsc.edu/
 • NAS, NASA Advanced Supercomputing Division
     http://www.nas.nasa.gov/

High performance computing - building blocks, production & perspective

  • 1. High Performance Computing - Building blocks, Production & Perspective Jason Shih Feb, 2012
  • 2. What is HPC?  HPC Definition:  14.9K hits from Google   Uses supercomputers and computer clusters to solve advanced computation problems. Today, computer..... (Wikipedia)  Use of parallel processing for running advanced application programs efficiently, reliably and quickly. The term applies especially to systems that function above a teraflop or 1012 floating-point operations per second. The term HPC is occasionally used as a synonym for supercomputing, although. (Techtarget)  A branch of computer science that concentrates on developing supercomputers and software to run on supercomputers. A main area of this discipline is developing parallel processing algorithms and software (Webopedia)  And another “14.9k - 3” definitions…… 2
  • 3. So What is HPC Really?  My understanding  No clear definition!!  At least O(2) time as powerful as PC  Solving advanced computation problem? Online game!?  HPC ~ Supercomputer? & Supercomputer ~ Σ Cluster(s)  Possible Components:  CPU1, CPU2, CPU3….. CPU”N”  ~ O(1) tons of memory dimm….  ~ O(2) kW power consumption  O(1) - ~O(3) K-Cores  ~ 1 system admin   Rmember:  “640K ought to be enoughfor anybody. ” Bill Gates, 1981 3
  • 4. Why HPC?  Possible scenario: header: budget_Unit = MUSD if{budget[Gov.] >= O(2) && budget[Com.] >= O(1)} else {Show_Off == “true”} else if{Possible_Run_on_PC == “false”} {Exec “HPC Implementation”}} Got to Wait another 6M ~ 1Yr…….   ruth is: T  Time consuming operations  Huge memory demanded tasks  Mission critical e.g. limit time duration  Large quantities of run (cores/CPUs etc.)  non-optimized programs…. 4
  • 5. Why HPC? Story Cont’   rand Challenge Application Requirements G  Capable with 97’ MaxTop500 & 05’ MinTop500 break TFlops 10 PB LHC Est. CPU/DISK ~ 143MSI2k/56.3PB ~100K-Cores PFlops 5
  • 6. How Need HPC? HPC Domain Applications  Fluid dynamics & Heat Transfer  Physics & Astrophysics  Nanoscience  Chemistry & Biochemistry  Biophysics & Bioinformatics  Geophysics & Earth Imaging  Medical Physics & Drop Discovery  Databases & Data Mining Financial  Modeling Signal & Image Processing  And more .... 6
  • 7. HPC – Speed vs. Size > Can’t fit on a PC – usually because they > Take a very very long time to run need more than a few on a PC: months or even years. But GB of RAM, or more Size a problem that would take a month than a few 100 GB of on a PC might take only a few hours disk. on a supercomputer. Speed 7
  • 8. HPC ~ Supercomputer ~ Σ Cluster(s)  What is cluster?  Again, 1.4k hits from google…..   A computer cluster is a group of linked computers, working together closely thus in many respects forming a single computer... (Wikipedia)  Single logical unit consisting of multiple computers that are linked through a LAN. The networked computers essentially act as a single, much more powerful machine.. (Techopedia)  And……  But the cluster is:   CPU1, CPU2, CPU3….. CPU”N”  ~ O() Kg of memory dimm….  < O(1) kW power consumption  ~ O(1) K-Cores  Still ~ 1 system admin  8
  • 9. HPC – Trend in Growth Potential & Easy of Use 10 1.0 GFlops per Processors - 1995 - 2000 0.1 - 2005 - 2010 0.01 0.001 10 100 1000 10K 100K Number of Processors (P) 9
  • 10. HPC Energy Projection Strawmen Project 10
  • 11. HPC – Numbers Before 2000s   11
  • 12. HPC – Annual Performance Distribution   Top500 projection in 2012:   Cray Titan: est. 20PF, transformed from Jaguar @ORNL  1st phase: replace Cray XT5 w/ Cray XK6 & Operon CPUs & Tesla GPUs  2nd phase: 18K additional Tesla GPUs   IBM Sequoia: est. 20PF base on Glue Gene/Q @LLNL   ExaFlops of “world computing power” in 2016? 10.5 Pflops, K computer, SPARC64 VIIIfx 2.0GHz 1PFlops 35.8 TF NEC, Earth-Simulator Japan GFlops < 6 Years 1TFlops 8 Years PC: 109 GFlops Intel Core i7 980 XE 1GFlops
  • 13. HPC – Performance Trend “Microprocessor” 13
  • 14. Trend – Transistors per Processor Chip 14
  • 15. HPC History (I) 1PFlop/s IBM RoadRunner Cray Jaguar Parallel 1TFlop/s 2008 PFlops Vector 1GFlop/s SuperScalar 1987 GFlops 1MFlop/s Scalar 1KFlops/ Bit Level Parallelism Instruction Level Thread Level 1950 1960 1970 1980 1990 2000 15 2010
  • 16. HPC History (II) CPU Year/Clock Rate /Instruction per sec. 16
  • 17. HPC History (III) Four Decades of Computing – Time Sharing Era 17
  • 18. HPC – Cost of Computing (1960s ~ 2011) About “17 million IBM 1620 units” costing $64,000 each. The 1620's multiplication operation takes 17.7 ms Cost (USD) Two 16-processor Beowulf clusters Cray X-MP with Pentium Pro microprocessors First computing technology which scaled to large First sub-US$1/MFLOPS computing applications while staying technology. It won the Gordon Bell Prize under US$1/MFLOPS in 2000 Bunyip Beowulf cluster KLAT2 First sub-US$100/GFLOPS computing technologyKASY0 As of August 2007, this 26.25 GFLOPS "personal" Microwulf Beowulf cluster can be built for $1256 Year HPU4Science $30,000 cluster was built using only commercially available "gamer" grade hardware 18 Ref: http://en.wikipedia.org/wiki/FLOPS
  • 19. HPC - Interconnect, Proc Type, Speed & Threads 19 Ref: “Introduction to the HPC Challenge Benchmark Suite” by Piotr et. al.
  • 21. IBM mainframe Intel mainframe 21
  • 22. HPC – Computing System Evolution (I) ENIAC   940s (Beginning) 1  ENIAC (Eckart & Mauchly, U.Penn)  Von Neumann Machine  Sperry Rand Corp  IBM Corp.  Vacuum tube  Thousands instruction per second (0.002 MIPS)  1950s (Early Days)  IBM 704, 709x  CDC 1604  Transistor (Bell Lab, 1948)  Memory: Drum/Magnetic Core (32K words)  Performance: 1 MIPS  Separate I/O processor IBM 704 22
  • 23. HPC – Computing System Evolution (II)   960s (System Concept) 1 IBM S/360 Model 85   IBM Stretch Machine (1st Pipeline machine)   IBM System 360 (Model 64, Model 91)   CDC 6600   GE, UNIVAC, RCA, Honeywell & Burrough etc   Integrated Circuit/Mult-layer (Printed Circuit Board)   Memory: Semiconductor (3MB)   Cache (IBM 360 Model 85)   Performance: 10 MIPS (~1 MFLOPS)  1970s (Vector, Mini-Computer)   IBM System 370/M195, 308x   CDC 7600, Cyber Systems   DEC Minicomputer   FPS (Floating Point System)   Cray 1, XMP   Large Scale Integrated Circuit   Performance: 100 MIPS (~10 MFLOPS)   Multiprogrmming, Time Sharing IBM S/370 Model 168   Vector: Pipeline Data Stream 23
  • 24. HPC – Computing System Evolution (III)   980s (RISC, Micro-Processor) 1  CDC Cyber 205  Cray 2, YMP  IBM 3090 VF  Japan Inc. (Fujitsu’s VP, NEC’s SX)  Thinking Machine: CM2 (1st Large Scale Parallel)  RISC system (Appolo, Sun, SGI, etc)  CONVAX Vector Machine (mini Cray)  Microprocessor: PC (Apple, IBM)  Memory: 100MB Connect Machine: CM-2  RISC system:  Pipeline Instruction Stream  Multiple execution units in core  Vector: Multiple vector pipelines  Thinking Machine: kernel level parallelism  Performance: 100 Mflops IBM 3090 Processor Complex 24
  • 25. HPC – Computing System Evolution (IV)   990s (Cluster, Parallel Computing) 1   IBM Power Series (1,2,3)   SGI NUMA System   Cray CMP, T3E, Cray 3   CDC ETA   DEC’s Alpha IBM Power5 Family   SUN’s Internet Machine   Intel Parogon   Cluster of PC Power3 IBM Blue Gene   Memory: 512MB per processor   Performance: 1 Teraflops   SMP node in Cluster System   000s (Large Scale Parallel System) 2   IBM Power Series (4,5), Blue Gene   HP’s Superdome   Cray SV system   Intel’s Itanium, Xeon, Woodcrest, Westsmear processor   emory: 1-8 GB per processor M   erformance: Reach 10 Teraflops P 25
  • 26. HPC – Programming Language (I)  Microcode, Machine Language  Assembly Language (1950s)  Mnemonic, based on machine instruction set  Fortran (Formula Translation) (John Backus, 1956)  IBM Fortran Mark I – IV (1950s, 1960s)  IBM Fortran G, H, HX (1970), VS Fortran  CDC, DEC, Cray, etc..., Fortran  Industrial Standardized - Fortran 77 (1978)  Industrial Standardized - Fortran (88), 90, 95 (1991,1996)  HPF (High Performance Fortran) (late 1980)  Algol (Algorithm Language) (1958) (1960, Dijksta, et. al.)  Based on Backus-Naur Form method  Considered as 1st Block Structure Language  COBOL (Common Business Oriented Language) (1960s)  IBM PL/1, PL/2 (Programming Language) (mid 60-70s)  Combined Fortran, COBOL, & Algol  Pointer function 26  Exceptional handling
  • 27. HPC – Programming Language (II)  Applicative Languages  IBM APL (A Programming Language) (1970s)  LISP (List Processing Language) (1960s, MIT)  BASIC (Beginner’s All-Purpose Symbolic Instruction Code) (mid 1960)  1st Interactive language via Interpreter  PASCAL (1975, Nicklass Wirth)  Derived from Wirth’s Algol-W  Well designed programming language  Call argument list by value   & C++ (mid 1970, Bell Lab) C  Procedure language  ADA (late 1980, U.S. DOD)  Prolog (Programming Logic) (mid 1970) 27
  • 28. HPC – Computing Environment  Batch Processing (before 1970)  Multi-programming, Time Sharing (1970)  Remote Job Entry (RJE) (mid 1970)  Network Computing  APARnet (mother of INTERNET)  IBM’s VNET (mid 1970)  Establishment Community Computing Center   st Center: NCAR (1967) 1  U.S. National Supercomputer Centers (1980)  Parallel Computing  Distribute Computing  Emergence of microprocessors  Grid Computing (2000s)  Volunteer Computing @Home Technology 28
  • 29. HPC – Computational Platform Pro & Con 29
  • 30. HPC – Parallel Computing (I)  Characteristics:  Asynchronous Operation (Hardware)  Multiple Execution Units/Pipelines (Hardware)  Instruction Level  Data Parallel  Kernel Level  Loop Parallel  Domain Decomposition  Functional Decomposition 30
  • 31. HPC – Parallel Computing (II)  1st Attempt - ILLIAC, 64-way monster (mid 1970)  U.S. Navy’s parallel weather forecast program (1970s)  Early programming method - UNIX thread (late 1970)  1st Viable Parallel Processing - Cray’s Micro-Tasking (80s)  Many, Many proposed methods in 1980s: e.g. HPF  SGI’s NUMA System - A very successful one (1990s)  Oakridge NL’s PVM and Europe’s PARMAC (Early 90s) programming model for Distributing Memory System  Adaption of MPI and OpenMP for parallel programming  MPI - A main stream of parallel computing (late 1990)   Well and Clear defined programming model   Successful of Cluster Computing System   Network/Switch hardware performance   Scalability   Data decomposition allows for running large program  Mixed MPI/OpenMP parallel programming model for SMP node cluster system (2000) 31
  • 32. Env. Sci./Disaster Mitigation Defense Engineering HPC Application Area Finance & Business Science Research 32
  • 33. HPC – Applications (I)  Early Days  Ballistic Table  Signal Processing  Cryptography  Von Neumann’s Weather Simulation  1950s-60s  Operational Weather Forecasting  Computational Fluid Dynamics (CFD, 2D problems)  Seismic Processing and Oil Reservoir Simulation  Particle Tracing  Molecular Dynamics Simulation  CAD/CAE, Circuit Analysis  1970s (emergency of “package” program)  Structural Analysis (FEM application)  Spectral Weather Forecasting Model  ab initio Chemistry Computation (Material modeling in quantum level)  3D Computational Fluid 33
  • 34. HPC – Applications (II)   980s (Wide Spread of Commercial/Industrial Usage) 1  Petroleum Industry: Western Geo, etc  Computational Chemistry: Charmm, Amber, Gaussian,Gammes, Mopad, Crystal etc  Computational Fluid Dynamics: Fluent, Arc3D etc  Structural Analysis: NASTRAN, Ansys, Abacus, Dyna3D  Physics: QCD  Emergence of Multi-Discipline Application Program  1990s & 2000s  Grand Challenge Problems  Life Science  Large Scale Parallel Program  Coupling of Computational Models  Data Intensive Analysis/Computation 34
  • 35. High Performance Computing – Cluster Inside & Insight Types of Cluster Architectures Multicores & Heterogeneous Architecture Cluster Overview & Bottleneck/Latency Global & Parallel Filesystem Application Development Step 35
  • 36. Computer Architecture – Flynn’s Taxonomy  SISD (single instruction & single data)  SIMD (single instruction & multiple data)  MISD (multiple instruction & single data)   IMD M (multiple instruction & multiple data) > Message Passing > Share Memory: UMA/NUMA/COMA 36
  • 37. Constrain on Computing Solution – “Distributed Computing”  Opposing forces Commodity  Budgets push toward lower cost computing solutions  At the expense of operation cost  Limitations for power & cooling  difficult to change on short time scales  Challenges:  Data Distribution & Data Management  Distributed Computing Model  Fault Tolerance, Scalability & Availability SMP Centralized Distributed 37
  • 38. Shared Memory Architecture Hybrid Architecture HPC Cluster Architecture Vector Architecture 38 Distributed Memory Architecture
  • 39. HPC – Multi-core Architectural Spectrum Heterogeneous Multi-core Platform
  • 41. Cluster – Commercial x86 Architecture  Intel Core2 Quad, 2006 41
  • 42. Cluster – Commercial x86 Architecture  Intel Dunnington 7400-series  last CPU of the Penryn generation and Intel's first multi- core die & features a single-die six- (or hexa-) core design with three unified 3 MB L2 caches 42
  • 43. Cluster – Commercial x86 Architecture  Intel Nehalem  Core i7 2009 Q1 Quadcores 43
  • 44. Cluster – Commercial x86 Architecture  Intel: ”Nehalem-Ex” (i7) 44
  • 45. Cluster – Commercial x86 Architecture  AMD Shanghai, 2007 45
  • 46. Cluster Overview (I)  System  Security & Account Policy  System Performance Optimization Parallel Computer Arch.  Mission: HT vs. HP Abstraction Layers  Benchmarking: Serial vs. Parallel  NPB, HPL, BioPerf, HPCC & SPEC (2000 & 2006) etc.  Memory/Cache: Stream, cachebench & BYTEMark etc.  Data: iozone, iometer, xdd, dd & bonie++ etc.  Network: NetPIPE, Netperf, Nettest, Netspec & iperf etc.  load generator: cpuburn, dbench, stress & contest etc.  Resource Mgmt: Scheduling  Account policy & Mgmt.  Hardware  Regular maintenance: spare parts replacement  Facility Relocation & Cabling 46
  • 47. Cluster Overview (II)  Software  Compiler: e.g. Intel, PGI, xl*(IBM)  Compilation, Porting & Debug   ddressing: 32 vs. 64bit. A  Various: Sys. Arch. (IA64, RISC, SPARC etc.)  Scientific/Numerical Libraries  NetCDF, PETSC, GSL, CERNLIB (ROOT/PAW), GEANT etc.  Lapack, Blas, gotoBlas, Scalapack, FFTW, Linpack, HPC-Netlib etc.  End User Applications:  VASP, Guassian, Wien, Abinit, PWSCF, WRF, Comcot, Truchas, VORPAL etc.  Others  Documentation  Functions: UG & AG  System Design Arch., Account Policy & Mgmt. etc.  Training 47
  • 48. Cluster I/O – Latency & Bottleneck  Modern CPU achieve ~ 5GFlops/core/sec.  ~ 2 8-Bytes words per OP.  CPU overhead: 80 GB/sec.  ~O(1) GB/core/sec. B/W  Case: IBM P7 (755) (Stream)  Copy: 105418.0 MB  Scale: 104865.0 MB  Add: 121341.0 MB  Triad: 121360.0  Latency: ~52.8ns (@2.5GHz, DDR3/1666)   ven worse: init. fetching data E  Cf. Cache:  L1@2.5GHz: 3 cycles  L2@2.5GHz: 20 cycles 48
  • 49. Memory Access vs. Clock Cycles Data Rate Performance Memory vs. CPU 49
  • 50. 50
  • 51. Cluster – Message Time Breakdown  Source Overhead  Network Time  Destination Overheard 51
  • 52. Cluster – MPI & Resource Mgr. MPI Processes Mgmt. w/o Resource Mgr. MPI Processes Mgmt. w/ Resource Mgr. 52 Ref: HPC BAS4 UF.
  • 53. Network Performance Throughput vs. Latency (I)  Peak 10G 9.1Gbps ~ 877 usec (Msg Size: 1MB)  IB QDR reach 31.1Gbps with same msg size  Only 29% of 10G Latency (~256 usec)  Peak IB QDR 34.8Gbps ~ 57 usec (Msg Size: 262KB)
  • 54. Network Performance Throughput vs. Latency (II)  GbE, 10G (FC), IB and IBoIP (DDR vs. QDR)
  • 55. Network Performance Throughput vs. Latency (III)  Interconnection:  GbE, 10G (FC), IB and IBoIP (DDR vs. QDR)  Max throughput not reach 80% of IB DDR (~46%)  Peak of DDR IPoIB ~76% of IB peak (9.1Gbps)  Over IP, QDR have only 54%  While max throughput reach 85% (34.8Gbps)  No significant performance gain for IPoIB using RDMA (by preloading SDO)  Possible performance degradation  Existing activities over IB edge switch at the chassis  Midplane performance limitation  Reaching 85% on clean IB QDR interconnection:  Redo performance measurement on IBM QDR
  • 56. Cluster – File Server Performance  Preload SDP provided by OFED  Sockets Direct Protocol (SDP)  Note: Network protocol which provides an RDMA accelerated alternative to TCP over InfiniBand
  • 57. Cluster – File Server IO Performance (I) Re-Write Performance Write Performance
  • 58. Cluster – File Server IO Performance (II) Re-Read Performance Read Performance
  • 59. Cluster I/O – Cluster filesystem options? (I)  OCFS2 (Oracle Cluster File System)  Once proprietary, now GPL  Available in Linux vanilla kernel  not widely used outside the database world  PVFS (Parallel Virtual File System)  Open source & easy to install  Userspace-only server  kernel module required only on clients  Optimized for MPI-IO  POSIX compatibility layer performance is sub-optimal  pNFS (Parallel NFS)  Extension of NFSv4  Proprietary solutions available: “Panasas”  Put together benefits of parallel IO using standard solution (NFS) 59
  • 60. Cluster I/O – Cluster filesystem options? (II)  GPFS (General Parallel File System)  Rock-solid w/ 10-years history  Available for AIX, Linux & Windows Server 2003  Proprietary license  Tightly integrated with IBM cluster management tools  Lustre  HA & LB implementation  highly scalable parallel filesystem: ~ 100K clients  Performance:  Client: ~1 GB/s & 1K Metadata Op/s  MDS: 3K ~ 15K Metadata Op/s  OSS: 500 ~ 2.5 GB/s  POSIX compatibility  Components:  single or dual Metadata Server (MDS) w/ attached Metadata Target (MDT) (if consider scalability & load balance)  multiple “up to ~O(3)” Object Storage Server (OSS) w/ attached Object 60 Storage Targets (OST)
  • 61. Cluster I/O – Lustre Cluster Breakdown InfiniBand Interconnect Lustre Cluster OSS1 Compute Compute Compute OSS2 Compute Compute Compute … … … OSS nodes (Load Balanced) Compute Compute Compute Compute Compute Compute MDS nodes (High Availability) Compute Compute Admin MDS(M) Compute Compute Admin Compute Compute Login MDS(S) Compute Compute Login Lustre Quad-CPU Compute Nodes Connectivity to all Connectivity to all Connectivity to all Connectivity to all nodes nodes nodes nodes GigE ethernet for boot and system control traffic Connectivity to all Connectivity to all Connectivity to all Connectivity to all nodes nodes nodes nodes 10/100 Ethernet out-of-band management (power on/off, etc) 61
  • 62. Cluster I/O – Parallel Filesystem using Lustre   ypical Setup T  MDS: ~ O(1) servers with good CPU and RAM, high seek rate  OSS: ~ O(3) server req. good bus bandwidth, storage 62
  • 63. Cluster I/O – Lustre Performance (I)  Interconnection:  PoB, IB & Quadrics I 63
  • 64. Cluster I/O – Lustre Performance (II)  Scalability  Throughput/Transactions vs. Num of OSS 64
  • 65. Cluster I/O – Parallel Filesystem in HPC 65
  • 66. Cluster – Consolidation & Pursuing High Density 66
  • 67. Typical Blade System Connectivity Breakdown Fibre Channel Expansion Card (CFFv) Optical Pass-Through Module and MPO Cables BNT 1/10 Gb Uplink Ethernet Switch Module BladeServer Chassis – BCE
  • 68.  Hardware & system software features affecting scalability Reliability - Hardware of parallel systems Scalable Tools - Software Machine Size - User/Developer - Proc. Performance - Manager - Num. Processors - Libraries Input/Output Totally Scalable Memory Size - Bandwidth Architecture - Virtual - Capacity - Physical Program Env. Interconnect Network - Familiar Program Paradigm - Latency - Familiar Interface Memory Type - Bandwidth - Distributed - Shared 68
  • 69. HPC – Demanded Features from diff. Roles  Def. Roles: Users, Developers, System Administrators Features Users Developers Managers Familiar User Interface ✔ ✔ ✔ Familiar Programming Paradigm ✔ ✔ Commercially Supported Applications ✔ ✔ Standards ✔ ✔ Scalable Libraries ✔ ✔ Development Tools ✔ Management Tools ✔ Total System Costs ✔ 69
  • 70. HPC – Application Development Steps Prep SA SPEC Run SA SPEC Code Code Opt Par Mod Opt Prep Run Par Mod 70
  • 71. HPC – Service Scopes  System Architecture Design   Various of interconnection e.g. GbE, IB, FC etc.   Mission specific e.g. high performance or high throughput   Computational or data intensive   OMP vs. MPI   Parallel/Global filesystem  Cluster Implementation   Objectives:  High availability & Fault tolerance  Load Balancing Design & Validation  Distributed & Parallel Computing   Deployment, Configuration, Cluster Mgmt. & Monitoring   Service Automation and Event Mgmt.   KB & Helpdesk  Service Level:   Helpdesk & Onsite Inspection   System Reliability & Availability   1st / 2nd line Tech. Support  Automation & Alarm Handling 71  Architecture & Outreach?
  • 72. High Performance Computing – Performance Tuning & Optimization Tuning Strategy Best Practices Profile & Bottleneck drilldown System Optimization Filesystem Improvement & (re-)Design 72
  • 73. High Performance Computing Cluster Management & Administration Tools  Categories:  System: OS, network, backup, filesystem & virtualization.  Clustering: deployment, monitoring, management alarm/logging, dashboard & automation  Administration: UID, security, scheduling, accounting  Application: library, compiler, message-passing & domain-specific.  Development: debug, profile, toolkits, VC & PM.  Services: helpdesk, event, KB & FAQ. 73
  • 74. Cluster Implementation (I)  Operating system  Candidates: CentOS, Scientific Linux, RedHat, Fedora etc.  Cluster Management  Tools: Oscar, Rocks, uBuntu, xCAT etc.  Deployment & Configurations  Tools: cobbler, kickstart, puppet, cfgng, quattor(CERN), DRBL  Alarm, Probes & Automation  Tools: nagios, IPMI, lm_sensors  System & Service monitoring  Tools: ganglia, openQRM  Network Monitoring  Tools: MRTG, RRD, smokeping, awstats, weathermap 74
  • 75. Cluster Implementation (II)  Filesystem  Candidates: NFS, Lustre, openAFS, pNFS, GPFS etc.  Performance Analysis & Profile:  Tools: gprof, pgroup(PGI), VTune(intel), tprof(IBM), TotalView etc.  Compilers  Packages: Intel, PGI, MPI, Pathscale, Absoft, NAG, GNU, Cuda etc.  Message Passing Libraries (parallel):  Packages: Intel MPI, OpenMPI, MPICH, MVAPICH, PVM(old), openMP(POSIX threads) etc.  Memory Profile & Debug (Threads)  Tool: Valgrind, IDB, GNU(gdb) etc.  Distributed computing  Toolkits: Condor, Globus, gLite(LCG) etc. 75
  • 76. Cluster Implementation (III)  Resource Mgmt. & Scheduling  Tools: Torque, Maui, Moab, Condor, Slurm, SGE(SunGrid Engine), NQS(old), loadleveler(IBM), LSF(Platform/IBM) etc.  Dashboard  Tools: openQRM, openNMS, Ahatsup, OpenView, BigBrother etc.  Helpdesk & Trouble Tracking  Tools: phpFAQ, OTRS, Request Tracker, osTIcket, simpleTicket, eTicket etc.  Logging & Events  Tools: elog, syslogNG etc.  Knowledge Base  Tools: vimwiki, Media Wiki, Twiki, phpFAQ, moinmoin etc. 76
  • 77. Cluster Implementation (IV)  Security  Functionality: scanning, intrusion detection, & vulnerability  Tools: honeypot, snort, saint, snmp, nessus, rootkithunter & chkrootkit etc.  Revision Services  Tools: git, cvs, svn etc.  Collaborative Project Mgmt.  Tools: bugzilla, OTRS, projectHQ,  Accounting:  Tools: SACCT, PACCT etc.  Visualization: RRD G/W, Google Chart Tool etc. 77
  • 78. Cluster Implementation (IV)  Backup Services  Tools: Tivoli(IBM), Bacula, rsync, VERITAS, TSM, Netvault, Amanda, etc.  Remote Console  Tools: openNX (no machine), rdp compatible, Hummingbird (XDMCP), VNC, Xwin32, Cygwin, IPMI v2 etc.  Cloud & Virtualization  Packages: openstack, opennebula, eucalyptus, CERNVM, Vmware, Xen, Citrix, VirtualBox etc. 78
  • 79. High Performance Computing - How We Get to Today? Moore’s Law, Heat/Energy/Power Density Hardware Evolution Datacenter & Green HPC History Reminder: 1980s - 1st Gflops in single vector processor 1994 - 1st TFlop via thousands of microprocessors 2009 - 1st Pflop via several hundred thousand cores 79
  • 80. Moore’s Law & Power Density  Dynamic Pwr ∝ V2fC  2X Transistors/Chip every 1.5Yr  Cubic effect if inc frequency & supply  Golden Moore (co-founder of voltage Intel) predicted in 1965.  Eff ∝ capacitance ∝ cores (linear)  High performance serial processor 33K ~ 38K MIPs waste power 7.5K ~ 11K MIPs  More transistors rather serial Transistor Count 1971-2011 1 Billion Transistors processors 25 MIPs 1.0 MIPs 0.1 MIPs Date of Production 80 Ref: http://en.wikipedia.org/wiki/List_of_Intel_microprocessors
  • 81. Moore’s Law – What we learn? Transistor ∝ MIPs ∝ Watts ∝ BTUs  Rule of thumb: 1 watt of power consumed requires 3.413 BTU/hr of cooling to remove the associated heat  Inter-chip vs. Intra-chip parallelism  Challenges: millions of concurrent threads  HP: Data Center Power Density Went from 2.1 kW/Rack in 1992 to 14 kw/Rack in 2006  IDC: 3 Year Costs of Power and Cooling, Roughly Equal to Initial Capital Equipment Cost of Data Center  NETWORKWORLD: 63% of 369 IT professionals said that running out of space or power in their data centers had already occurred 81
  • 82. HPC – Feature size, Clock & Die Shrink Historical data TRTS Max Clock Rate Main ITRS node (nm) Year Feature size (nm) Feature Size (nm) Year
  • 83. Trend: Cores per Socket  Top500 Nov 2011:  45.8% & 32% running 6 & quad cores proc.   5.8% sys. >= 8 cores (2.4% with 16 cores) 1  more than 2 fold inc. vs. 2010 Nov (6.8%) Top500 2011 Nov  Trend: quad (73% in 10’) to 6 cores (46% in 11’) 83
  • 84. HPC – Evolution of Processors  Transistors: Moore’s Law  Clock rate no longer as a proxy for Moore’s Law & Cores may double instead.  Power literately under control. Transistors Physical Gate Length Ref: “Scaling to Petascale and Beyond: Performance Analysis and Optimization of Applications” NERSC.
  • 85. HPC – Comprehensive Approach  CPU Chips  Clock Frequency & Voltage Scaling  75% power savings at idle and 40-70% power savings for utilization in the 20-80% range  Server  Chassis: 20-50% Pwr reduction.  Modular switches & routers  Server consolidation & virtualization  Storage Devices  Max. TB/Watt & Disk Capacity  Large Scale Tiered Storage  Max. Pwr Eff by Min. Storage over-provisioning  Cabling & Networking  Stackable & backplane capacity (inc. Pwr Eff)  Scaling & Density 85
  • 86. HPC – Datacenter Power Projection  Case: ORNL/UTK inc. DOE & NSF sys.  Deploy 2 large Petascale systems in next 5 years  Current Power Consumption 4 MW  Exp to 15MW before year end (2011)  50MW by 2012.  Cost estimates based on $0.07 per KwH 86
  • 87. HPC – Data Center Best Practices  Traditional Approach  Hot/Cold Aisle  Min. Leakage  Eff. Improvement (Coolig & Power)  DC input (UPS opt.), Cabling & Container  Liquid Cooling  Free Cooling  Leveraging Hydroelectric Power Ref: http://www.google.com/about/datacenters/ 87 http://www.google.com/about/datacenters/inside/efficiency/power-usage.html
  • 88. HPC – DataCenter Growing Power Density Total system efficiency comprises three main elements- the Grid, the Data Centre and the IT Components. Each element has its own efficiency factor- multiplied together for 100 watts of power generated, the CPU receives only 12 watts Heat Load Product Footprint (Watt/ft2) Ref: Internet2 P&C Nov 2011, “Managing Data Center Power Power & Cooling & Cooling” by Force10 88
  • 89. HPC - Performance Benchmarking CPU Arch., Scalability, SMT & Perf/Watt Case study: Intel vs. AMD 89
  • 90. HPC – Performance Strategy: “The Amdahl’s Law”  Fixed-size Model : Speedup = 1 / (s + p/N)  Scaled-size Model: Speedup = 1 / ((1-P) + P/N) ~ 1/(1-P)   arallel & Vector scale w/ problem size P  s: Σ (I/O + serial bottleneck + vector startup + program loading) SpeedUP 90 Numer of Processors
  • 91. Price-Performance for Transaction-Processing  OLTP – One of the largest server markets is online transaction processing  TPC-C – std. industry benchmark for OLTP is  Queries and updates rely on database system  Significant factors of performance in TPC-C:  Reasonable approx. to a real OLTP app.  Predictive of real system performance:  total system performance, inc. the hardware, the operating system, the I/O system, and the database system.   Complete instruction and timing info for benchmarking  TPM (measure transactions per minute) & price- performance in dollars per TPM. 91
• 92.  20 SPEC benchmarks  1.9 GHz IBM Power5 processor vs. 3.8 GHz Intel Pentium 4  10 integer (left-hand side) & 10 floating point (right-hand side)  Fallacies:  Processors with lower CPIs will always be faster  Processors with faster clock rates will always be faster
• 93.  Characteristics of 10 OLTP systems, with TPC-C as the benchmark
• 94.  Cost of purchase split between processor, memory, storage, and software
• 95. Pentium 4 microarchitecture & important characteristics of the recent Pentium 4 640 implementation in 90 nm technology (code-named Prescott)
• 96. HPC – Performance Measurement (I)  Objectives:  Baseline performance  Performance optimization  Confident & verifiable  Measurements:  Open standards: math kernels & applications  MIPS (million instructions per second; distinct from MIPS Technologies Inc.)  MFLOPS (million floating-point operations per second)  Characteristics:  Peak vs. sustained  Speed-up & computing efficiency (mainly for parallel systems)  CPU time vs. elapsed time  Program performance (HP) vs. system throughput (HT)  Performance per watt Ref: http://www-03.ibm.com/systems/power/hardware/benchmarks/hpc.html http://icl.cs.utk.edu/hpcc/
• 97. HPC – Performance Measurement (II)  Public benchmark utilities:  LINPACK (Jack Dongarra, Oak Ridge N.L.)  Single precision / double precision  n=100; n=1000 (TPP, “Toward Peak Performance”)  HPL, n unrestricted (“paper & pencil” benchmark, mainly for parallel systems)  Synthetic: Dhrystone, Whetstone, Khornerstone  SPEC (Standard Performance Evaluation Corp.)  SPECint (CINT2006), SPECfp (CFP2006), SPEComp  Source code modification not allowed  Livermore Loops (introduced the MFLOPS metric)  Los Alamos suite (vector computing)  STREAM (memory performance)  NPB (NASA Ames): NPB 1 and NPB 2 (classes A, B, C)  Applications (weather/materials/MD/statistics etc.):  MM5, NAMD, ANSYS, WRF, VASP etc.
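In the spirit of the STREAM entry above, here is a minimal triad-style memory-bandwidth probe in C (a sketch only, not the official STREAM benchmark; the array size is an arbitrary choice meant to overflow any cache):

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 24) /* 16M doubles per array, ~128 MB each, to defeat caching */

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b), *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    double t0 = now();
    for (long i = 0; i < N; i++) /* triad kernel: a = b + q*c */
        a[i] = b[i] + 3.0 * c[i];
    double t1 = now();

    /* three arrays of N doubles cross the memory bus once each */
    printf("triad: %.2f GB/s (check: a[0] = %g)\n",
           3.0 * N * sizeof(double) / (t1 - t0) / 1e9, a[0]);
    return 0;
}

Build with optimization (e.g. cc -O2) so that the kernel loop is what actually gets measured.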
• 98. Target Processors (I) - AMD vs. Intel  AMD Magny-Cours Opteron (45 nm, rel. Mar 2010)  Socket G34 multi-chip module  2 x 4-core or 6-core dies connected via HyperTransport 3.1  6172 (12 cores), 2.1 GHz  L2: 8 x 512K, L3: 2 x 6M  HT: 3.2 GHz  ACP/TDP: 80W/115W  Streaming SIMD Extensions: SSE, SSE2, SSE3 and SSE4a  6128 HE (8 cores), 2.0 GHz  L2: 8 x 512K, L3: 2 x 6M  HT: 3.2 GHz  ACP/TDP: 80W/115W  Streaming SIMD Extensions: SSE, SSE2, SSE3 and SSE4a
• 99. Target Processors (II) - AMD vs. Intel  Intel Woodcrest, Harpertown and Westmere  Woodcrest, dual-core, 65 nm (rel. Jun 2006)  Xeon 5150, 2.66 GHz, LGA-771  L2: 4M  TDP: 65W  Streaming SIMD Extensions: SSE, SSE2, SSE3 and SSSE3  Harpertown, quad-core, 45 nm (rel. Nov 2007)  E5430, 2.66 GHz  L2: 2 x 6M  TDP: 80W  Streaming SIMD Extensions: SSE, SSE2, SSE3, SSSE3 and SSE4.1  Westmere-EP, 6-core, 32 nm (rel. Mar 2010)  X5650, 2.67 GHz, LGA-1366  L2/L3: 6 x 256K / 12 MB  I/O bus: 2 x 6.4 GT/s QPI  Streaming SIMD Extensions: SSE, SSE2, SSE3, SSSE3, SSE4.1 and SSE4.2
• 100. SPEC2006 Performance Comparison - SMT Off, Turbo On  8-core Nehalem-EP vs. 12-core Westmere-EP  32% performance gain from a 50% increase in CPU cores  Scalability 12% below ideal  SMT advantage:  Nehalem-EP, 8 cores to 16 SMT threads: +24.4%  Westmere-EP, 12 cores to 24 SMT threads: +23.7% Ref: CERN openlab Intel WEP evaluation report (2010)
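The scalability figure follows directly from the slide’s numbers: the ideal gain for 8 → 12 cores is 1.50x, the measured gain is 1.32x, and 1.32 / 1.50 ≈ 0.88, i.e. 12% short of ideal.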
• 101. Efficiency of Westmere-EP - Performance per Watt  Memory power extrapolated from 12 GB to 24 GB  2 watts per additional GB of memory  Dual PSU (upper) vs. single PSU (lower)  SMT offers a 21% boost in terms of efficiency  SMT itself consumes approx. 3%, compared with its absolute performance gain of 23.7% Ref: CERN openlab Intel WEP evaluation report (2010)
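Under that assumption, the extrapolation simply adds (24 - 12) GB × 2 W/GB = 24 W to the measured wall power before the performance-per-watt ratio is recomputed.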
• 102. Efficiency of Nehalem-EP Microarchitecture With SMT Off  Most efficient Nehalem-EP: L5520 vs. X5670  Westmere adds 10%  Efficiency +9.75% using dual PSUs  +23.4% using a single PSU  Nehalem L5520 vs. Harpertown (E5410)  +35% performance boost Ref: CERN openlab Intel WEP evaluation report (2010)
• 103. Multi-Core Performance Scaling - AMD Magny-Cours vs. Intel Westmere (I)
• 104. Multi-Core Performance Scaling - AMD Magny-Cours vs. Intel Westmere (II)
• 105. Single Server Linpack Performance - Intel X5650, 2.67 GHz, 12 GB DDR3 (6 cores)  HPL optimal performance: ~108.7 GFlops per node
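As a cross-check (assuming a dual-socket node, i.e. 12 cores in total): Westmere-EP retires up to 4 double-precision flops per cycle per core, so Rpeak ≈ 12 × 2.67 GHz × 4 ≈ 128 GFlops, and 108.7 GFlops sustained corresponds to ~85% HPL efficiency.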
• 106. Lessons from Top500: Statistics, Analysis & Future Trends  Processor technology & cores per socket  Cluster interconnect  Power consumption & efficiency  Regional performance & trends
• 107. Top500 – Nov 2011 [Chart: Rmax (GFlops) vs. core count]
• 108. HPC – Performance of Countries [Chart: Nov 2011 Top500 performance by country]
• 109. Top500 Analysis – Power Consumption & Efficiency  Top 4 in power efficiency (Nov 2011) are BlueGene/Q systems: Rochester > Thomas J. Watson > DOE/NNSA/LLNL  BlueGene/Q, Power BQC 16C 1.60 GHz, custom interconnect: 2026 GFlops/kW  K computer, RIKEN Advanced Institute for Computational Science (AICS), SPARC64 VIIIfx 2.0 GHz, Tofu interconnect: 11.87 MW  Tianhe-1A, National Supercomputing Center in Tianjin: 3.6 MW [Chart: power consumption trend 2008-2011]
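Taking the slide’s figures at face value, efficiency is simply Rmax divided by facility power: the K computer’s ~10.5 PFlops at 11.87 MW works out to roughly 885 GFlops/kW, well under half of BlueGene/Q’s 2026 GFlops/kW.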
• 110. Top500 Analysis - Performance & Efficiency  The top-performing 20% of clusters contribute 60% of total computing power (27.98 PF)  5 clusters have efficiency < 30%
• 111. Top500 Analysis - HPC Cluster Performance  272 (52%) of the world’s fastest clusters have efficiency lower than 80% (Rmax/Rpeak)  Only 115 (18%) can drive over 90% of theoretical peak [Chart: sampling from Top500; trend of cluster efficiency, 2005-2009]
• 112. Top500 Analysis – HPC Cluster Interconnect  SDR, DDR and QDR InfiniBand in the Top500  Promising efficiency >= 80%  The majority of IB-equipped clusters adopt DDR (87%, Nov 2009)  Contributing 44% of total computing power, ~28 PFlops  Avg. efficiency ~78%
• 113. Impact Factor: Interconnect - Capacity & Cluster Efficiency  Over 52% of clusters are based on GbE  With efficiency of only around 50%  InfiniBand is adopted by ~36% of HPC clusters
• 114. Common Semantics  Programmer productivity  Ease of deployment  HPC filesystems are more mature, with a wider feature set:  Highly concurrent reads and writes  Within programmers’ comfort zone (vs. cloud filesystems)  Wide support, adoption & acceptance possible  pNFS working toward equivalence  Reuse of standard data management tools  Backup, disaster recovery and tiering
• 115. IB Roadmap Trend in HPC [Chart: InfiniBand trend in HPC; data points: 50.9 TFlops, 10.5 PFlops, 74.2 PFlops]
• 116. Observations & Perspectives (I)  Pursuing another 1000x in performance will be tough  ~20 PF Titan (the Jaguar upgrade) to be delivered in 2012  ExaFlops projects target ~2016 (PFlops was reached in 2008)  Still, IB & GbE remain the most used interconnect solutions  Multi-core continues Moore’s Law  High-level parallelism & software readiness  Reduce bus traffic & exploit data locality  Storage is the fastest-growing product sector  Storage consolidation intensifies competition  Lustre roadmap stabilized for HPC  Computing paradigm  Complicated systems vs. sophisticated computing tools  Hybrid computing models  Major concern: power efficiency  Energy in memory & interconnect, including data-search applications  Exploit memory power efficiency: large caches?  Scalability and reliability  Key performance factor: data communication  Consider: layout, management & reuse
• 117. Observations & Perspectives (II)  Vendor support & user readiness  Service orientation  Standardization & knowledge bases  Automation & expert systems  No Moore’s Law for software, algorithms & applications?  Emerging new possibilities  Cloud infrastructure & platforms  Currently 3% of spending (mostly private cloud)  Technology push & market/demand pull  Growing opportunity in “Big Data”  Datacenter, SMB & HPC solution providers  Rapid growth of accelerators  Tried by ~67% of users (20% in ’10)  NVIDIA holds 90% of current usage (’11) “I think there is a world market for maybe five computers” (Thomas Watson, chairman of IBM, 1943) “Computers in the future may weigh no more than 1.5 tons.” (Popular Mechanics, 1949)
• 118. References  Top500: http://top500.org  Green Top500: http://www.green500.org  HPC Advisory Council  http://www.hpcadvisorycouncil.com/subgroups.php  insideHPC  http://insidehpc.com/  HPC Wiki  http://en.wikipedia.org/wiki/High-performance_computing  Supercomputing Conference Series  http://www.supercomp.org/  Beowulf Cluster  http://www.beowulf.org/  MPI Forum:  http://www.mpi-forum.org/docs/docs.html
• 119. Reference - Mathematical & Numerical Lib. (I)  Open source  LINPACK - numerical linear algebra, intended for use on supercomputers  LAPACK - the successor to LINPACK (Netlib)  PLAPACK - Parallel Linear Algebra Package  BLAS - Basic Linear Algebra Subprograms  GotoBLAS - optimized BLAS using new algorithms & memory techniques  ScaLAPACK - high-performance linear algebra routines for distributed-memory message-passing MIMD computers  FFTW - Fastest Fourier Transform in the West  HPC-Netlib - the high-performance branch of Netlib  PETSc - Portable, Extensible Toolkit for Scientific Computation  Numerical Recipes  GNU Scientific Library
• 120. Reference - Mathematical & Numerical Lib. (II)  Commercial  ESSL & pESSL (IBM/AIX) - Engineering & Scientific Subroutine Library  MASS (IBM/AIX) - Mathematical Acceleration Subsystem  Intel Math Kernel Library (MKL) - vector, linear algebra & specially tuned math kernels  NAG Numerical Libraries - Numerical Algorithms Group  IMSL - International Mathematical and Statistical Libraries  PV-WAVE - Workstation Analysis & Visualization Environment  JAMA - Java matrix package, developed by The MathWorks & NIST  WSSMP - Watson Symmetric Sparse Matrix Package
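Whichever implementation above is used, dense linear algebra enters through the same BLAS interface. A minimal sketch of a double-precision matrix multiply through the C bindings (header name and link flags vary by vendor, e.g. cblas.h with -lopenblas for OpenBLAS, or MKL’s equivalents):

#include <stdio.h>
#include <cblas.h> /* C BLAS interface; assumed here as shipped by ATLAS/OpenBLAS */

int main(void) {
    /* C = alpha*A*B + beta*C with 2x2 row-major matrices */
    double A[] = { 1, 2,
                   3, 4 };
    double B[] = { 5, 6,
                   7, 8 };
    double C[] = { 0, 0,
                   0, 0 };
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,    /* M, N, K       */
                1.0, A, 2,  /* alpha, A, lda */
                B, 2,       /* B, ldb        */
                0.0, C, 2); /* beta, C, ldc  */
    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]); /* expect: 19 22 / 43 50 */
    return 0;
}

The same call, routed through GotoBLAS, ATLAS, MKL or ESSL, is where LINPACK/HPL spends nearly all of its time.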
• 121. Reference - Message Passing  PVM (Parallel Virtual Machine, ORNL/CSM)  OpenMPI  MVAPICH & MVAPICH2  MPICH & MPICH2  MPICH v1 channels:  ch_p4 - based on the older p4 project (Portable Programs for Parallel Processors), TCP/IP  ch_p4mpd - p4 with mpd daemons for starting and managing processes  ch_shmem - shared-memory-only channel  globus2 - Globus2  MPICH2 channels:  Nemesis - universal; inter-node modules: elan, GM, IB (InfiniBand), MX (Myrinet Express), NewMadeleine, TCP; intra-node shared-memory variants for large messages (LMT interface)  ssm - sockets and shared memory  shm - shared memory  sock - TCP/IP sockets  sctp - experimental channel over SCTP sockets
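All of the implementations above expose the same MPI API, so a minimal program is portable across them. A sketch (launcher details vary slightly per implementation):

#include <stdio.h>
#include <mpi.h>

/* Each rank reports in; rank 0 then gathers a sum via a collective.
   Typical build & run: mpicc hello.c -o hello && mpirun -np 4 ./hello */
int main(int argc, char **argv) {
    int rank, size, sum = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d\n", rank, size);
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of ranks = %d (expect %d)\n", sum, size * (size - 1) / 2);
    MPI_Finalize();
    return 0;
}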
• 122. Reference - Performance, Benchmarks & Tools  High-performance tools & technologies:  https://computing.llnl.gov/tutorials/performance_tools/HighPerformanceToolsTechnologiesLC.pdf  Linux Benchmarking Suite:  http://lbs.sourceforge.net  Linux Test Tools Matrix:  http://ltp.sourceforge.net/tooltable.php  Network performance  http://compnetworking.about.com/od/networkperformance/TCPIP_Network_Performance_Benchmarks_and_Tools.htm  http://tldp.org/HOWTO/Benchmarking-HOWTO-3.html  http://bulk.fefe.de/scalability/  http://linuxperf.sourceforge.net
• 123. Reference - Network Security  Network security  Tools: http://sectools.org/ , http://www.yolinux.com/TUTORIALS/LinuxSecurityTools.html & http://www.lids.org/ etc.  Packet sniffers, wrappers, firewalls, scanners, services (MTA/BIND) etc.  Online organizations:  CERT http://www.us-cert.gov  SANS http://www.sans.org  Linux network security  Basic config/utilities/profiles, encryption & routing  (obsolete: http://www.drolez.com/secu/)  Network security toolkit  Audit, intrusion detection & prevention  Event types:  DDoS, scanning, worms, policy violations & unexpected application services  Honeypots, Tripwire, Snort, Tiger, Nessus, Ethereal, nmap, tcpdump, portscan, portsentry, chkrootkit, Rootkit Hunter, AIDE (HIDS), LIDS etc.  Ref: NIST “Guide to Intrusion Detection and Prevention Systems”
• 124. Reference - Books  Computer Architecture: A Quantitative Approach  2nd Ed., by David A. Patterson, John L. Hennessy, David Goldberg  Parallel Computer Architecture: A Hardware/Software Approach  by David Culler and J.P. Singh with Anoop Gupta  High-Performance Computer Architecture  3rd Ed., by Harold Stone  High Performance Compilers for Parallel Computing  by Michael Wolfe (Addison Wesley, 1996)  Advanced Computer Architectures: A Design Space Approach  by Terence Fountain, Peter Kacsuk, Dezso Sima  Introduction to Parallel Computing: Design and Analysis of Parallel Algorithms  by Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis  Parallel Computing Works!  by Geoffrey C. Fox, Roy D. Williams, Paul C. Messina  The Interaction of Compilation Technology and Computer Architecture  by David J. Lilja, Peter L. Bird (Editors)
• 125. National Laboratory Computing Facilities (I)  ANL, Argonne National Laboratory  http://www.lcrc.anl.gov/  ASC, Alabama Supercomputer Center  http://www.asc.edu/supercomputing/  BNL, Brookhaven National Laboratory, Computational Science Center  http://www.bnl.gov/csc/  CACR, Center for Advanced Computing Research  http://www.cacr.caltech.edu/main/  CAPP, Center for Applied Parallel Processing  http://www.ceng.metu.edu.tr/courses/ceng577/announces/supercomputingfacilities.htm  CHPC, Center for High Performance Computing, University of Utah  http://www.chpc.utah.edu/
• 126. National Laboratory Computing Facilities (II)  CRPC, Center for Research on Parallel Computation  http://www.crpc.rice.edu/  LANL, Los Alamos National Lab  http://www.lanl.gov/roadrunner/  LBL, Lawrence Berkeley National Lab  http://crd.lbl.gov/  LLNL, Lawrence Livermore National Lab  https://computing.llnl.gov/  MHPCC, Maui High Performance Computing Center  http://www.mhpcc.edu/  NCAR, National Center for Atmospheric Research  http://ncar.ucar.edu/  NCCS, National Center for Computational Science  http://www.nccs.gov/computing-resources/systems-status
• 127. National Laboratory Computing Facilities (III)  NCSA, National Center for Supercomputing Applications  http://www.ncsa.illinois.edu/  NERSC, National Energy Research Scientific Computing Center  http://www.nersc.gov/home-2/  NSCEE, National Supercomputing Center for Energy and the Environment  http://www.nscee.edu/  NWSC, NCAR-Wyoming Supercomputing Center  http://nwsc.ucar.edu/  ORNL, Oak Ridge National Lab  http://www.ornl.gov/ornlhome/high_performance_computing.shtml  OSC, Ohio Supercomputer Center  http://www.osc.edu/
• 128. National Laboratory Computing Facilities (IV)  PSC, Pittsburgh Supercomputing Center  http://www.psc.edu/  SANDIA, Sandia National Laboratories  http://www.cs.sandia.gov/  SCRI, Supercomputer Computations Research Institute  http://www.sc.fsu.edu/  SDSC, San Diego Supercomputing Center  http://www.sdsc.edu/services/hpc.html  ARSC, Arctic Region Supercomputing Center  http://www.arsc.edu/  NASA, National Aeronautics and Space Administration  http://www.nas.nasa.gov/