This presentation was prepared by Abdussamad Muntahi for the Seminar on High Performance Computing on 11/7/13 (Thursday), organized by the BRAC University Computer Club (BUCC) in collaboration with the BRAC University Electronics and Electrical Club (BUEEC).
2. Introduction
• High‐speed computing, originally implemented only in
supercomputers for scientific research
• Tools and systems are now available to implement and create
high performance computing systems
• Used for scientific research and computational science
• The main focus of the discipline is developing parallel
processing algorithms and software, so that programs
can be divided into small independent parts and
executed simultaneously by separate processors
• HPC systems have shifted from supercomputers to
computing clusters
3. What is a Cluster?
• A cluster is a group of machines interconnected so that
they work together as a single system
• Terminology
o Node – individual machine in a cluster
o Head/Master node – connected to both the
private network of the cluster and a public
network, and used to access the
cluster. Responsible for providing users an
environment to work in and for distributing
tasks among the other nodes
o Compute nodes – connected to only the
private network of the cluster and are
generally used for running jobs assigned to
them by the head node(s)
4. What is a Cluster?
• Types of Cluster
o Storage
Storage clusters provide a consistent file system image,
allowing simultaneous reads and writes to a single shared file system
o High‐availability (HA)
Provide continuous availability of services by eliminating single points of
failure
o Load‐balancing
Sends network service requests to multiple cluster nodes to balance the
requested load among the cluster nodes
o High‐performance
Use cluster nodes to perform concurrent calculations
Allows applications to work in parallel to enhance the performance of the
applications
Also referred to as computational clusters or grid computing
5. Benefits of Clusters
• Reduced Cost
o The price of off‐the‐shelf consumer desktops has plummeted in recent
years, and this drop in price has corresponded with a vast increase in
their processing power and performance. The average desktop PC
today is many times more powerful than the first mainframe
computers.
• Processing Power
o The parallel processing power of a high‐performance cluster can, in
many cases, prove more cost effective than a mainframe with similar
power. This reduced price‐per‐unit of power enables enterprises to get
a greater ROI (Return On Investment).
• Scalability
o Perhaps the greatest advantage of computer clusters is the scalability
they offer. While mainframe computers have a fixed processing
capacity, computer clusters can be easily expanded as requirements
change by adding additional nodes to the network.
6. Benefits of Clusters
• Improved Network Technology
o In clusters, computers are typically connected via a single virtual local
area network (VLAN), and the network treats each computer as a
separate node. Information can be passed throughout these networks
with very little lag, ensuring that data doesn’t bottleneck between
nodes.
• Availability
o When a mainframe computer fails, the
entire system fails. However, if a node
in a computer cluster fails, its
operations can be simply transferred
to another node within the cluster,
ensuring that there is no interruption
in service.
8. Why do we need ever‐increasing
performance?
• Computational power is increasing, but so are our
computation problems and needs.
• Some Examples:
– Case 1: Complete a time‐consuming operation in less time
• I am an automotive engineer
• I need to design a new car that consumes less gasoline
• I’d rather have the design completed in 6 months than in 2 years
• I want to test my design using computer simulations rather than building
very expensive prototypes and crashing them
– Case 2: Complete an operation under a tight deadline
• I work for a weather prediction agency
• I am getting input from weather stations/sensors
• I’d like to predict tomorrow’s forecast today
11. Where are we using HPC?
• Used to solve complex modeling problems in a spectrum of
disciplines
• Topics include:
o Artificial intelligence
o Climate modeling
o Automotive engineering
o Cryptographic analysis
o Geophysics
o Molecular biology
o Molecular dynamics
o Nuclear physics
o Physical oceanography
o Plasma physics
o Quantum physics
o Quantum chemistry
o Solid state physics
o Structural dynamics
• HPC is currently applied to business uses as well
o data warehouses
o transaction processing
17. Parallel Computing
• Form of computation in which many calculations are
carried out simultaneously, operating on the principle
that large problems can often be divided into smaller
ones, which are then solved concurrently, i.e. "in
parallel"
• So we need to rewrite serial programs so that they're
parallel.
• An alternative is to write translation programs that
automatically convert serial programs into parallel programs.
– This is very difficult to do.
– Success has been limited.
19. Parallel Computing
• Example
– We have p cores, p much smaller than n.
– Each core performs a partial sum of approximately n/p
values.
Each core uses its own private variables
and executes this block of code
independently of the other cores.
22. Parallel Computing
• Parallel Computer classification
– The semiconductor industry has settled on two main
trajectories
• Multicore trajectory – CPU
– Coarse, heavyweight threads
– Maximizes the speed of sequential programs
(better performance per thread)
• Many‐core trajectory – GPU
– A large number of much smaller cores to improve the execution
throughput of parallel applications
– Fine, lightweight threads; single‐thread performance is poor
28. CPU vs. GPU
• Architecture and Technology
– Control hardware dominates μprocessors
• Complex, difficult to build and verify
• Scales poorly
– Pay for max throughput, sustain average throughput
– Control hardware doesn’t do any math!
– Industry moving from “instructions per second” to
“instructions per watt”
• Traditional μprocessors are not power‐efficient
– We can continue to put more transistors on a chip
• … but we can’t scale their voltage like we used to …
• … and we can’t clock them as fast …
29. Why GPU?
• GPU is a massively parallel architecture
– Many problems map well to GPU‐style computing
– GPUs have large amount of arithmetic capability
– Increasing amount of programmability in the pipeline
– CPUs have dual‐ and quad‐core chips, but a GPU currently has 240 cores
(GeForce GTX 280)
• Memory Bandwidth
– CPU – 3.2 GB/s; GPU – 141.7 GB/s
• Speed
– CPU – 20 GFLOPS (per core)
– GPU – 933 (single‐precision or int) / 78 (double‐precision) GFLOPS
• Direct access to compute units in new APIs
32. Parallel Programming
• HPC Parallel Programming Models associated with
different computing technology
– Parallel programming in CPU Clusters
– General purpose GPU programming
33. Operational Model: CPU
• Originally was designed for distributed memory architectures
• Tasks are divided among p processes
• Data‐parallel, compute‐intensive functions should be selected
to be assigned to these processes
• Functions that are executed many times, but independently
on different data, are prime candidates
– e.g. the bodies of for‐loops
35. Programming Language: MPI
• Message Passing Interface
– An application programming interface (API) specification
that allows processes to communicate with one another by
sending and receiving messages
– Now a de facto standard for parallel programs running on
distributed memory systems in computer clusters and
supercomputers
– A message‐passing API with language‐independent
protocol and semantic specifications
– Supports both point‐to‐point and collective communication
– Communications are defined by the APIs
36. Programming Language: MPI
• Message Passing Interface
– Goals are standardization, high performance, scalability,
and portability
– Consists of a specific set of routines (i.e. APIs) directly
callable from C, C++, Fortran and any language able to
interface with such libraries
– Program consists of autonomous processes
• The processes may run either the same code (SPMD style) or
different codes (heterogeneous)
– Processes communicate with each other via calls to MPI
functions
37. Operation Model: GPU
• The CPU and GPU operate with separate memory pools
• CPUs are masters and GPUs are workers
– CPUs launch computations onto the GPUs
– CPUs can be used for other computations as well
– GPUs have limited communication back to CPUs
• CPU must initiate data transfer to the GPU memory
– Synchronous data transfer – CPU waits for transfer to complete
– Asynchronous data transfer – CPU continues with other work; checks if
transfer is complete
40. Programming Language: CUDA
• Compute Unified Device Architecture
– Introduced by Nvidia in late 2006
– It is a compiler and toolkit for programming NVIDIA GPUs
– API extends the C programming language
– Adds library functions to access GPU
– Adds directives to translate C into instructions that run on
the host CPU or the GPU when needed
– Allows easy multi‐threading ‐ parallel execution on all
thread processors on the GPU
– Runs on thousands of threads
– It is a scalable model
41. Programming Language: CUDA
• Compute Unified Device Architecture
– General purpose programming model
• User kicks off batches of threads on the GPU
• Specific language and tools
– Driver for loading computation programs into GPU
• Standalone Driver ‐ Optimized for computation
• Interface designed for compute – graphics‐free API
• Explicit GPU memory management
– Objectives
• Express parallelism
• Provide a high‐level abstraction of the hardware
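Since CUDA extends C, the model on these slides can be sketched as below — a hedged sketch using the standard CUDA runtime API (`cudaMalloc`, `cudaMemcpy`, the `<<<blocks, threads>>>` launch syntax). It needs `nvcc` and an NVIDIA GPU to run, and the kernel name `scale` and the element count are illustrative.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

/* Kernel: each of the many lightweight GPU threads scales one element. */
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;                   /* one thread per element */
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    /* The CPU (host) explicitly manages GPU (device) memory ...        */
    float *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  /* synchronous copy */

    /* ... and kicks off a batch of threads on the GPU                   */
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d, 2.0f, n);

    /* Copying the result back waits for the kernel to finish */
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    printf("h[0] = %f\n", h[0]);

    cudaFree(d);
    free(h);
    return 0;
}
```

The explicit host/device copies and the CPU-initiated launch mirror the master/worker and separate-memory-pool model from the "Operation Model: GPU" slide; `cudaMemcpyAsync` would be the asynchronous variant mentioned there.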