When We Need High Performance
To perform time-consuming operations in less time or before a tighter deadline.
►I am a bioinformatics engineer.
►I need to run computationally complex programs.
►I’d rather have the result in 5 minutes than in 5 days.
To perform a high number of operations per second
►I am an engineer at Amazon.com
►My Web server gets 1,000 hits per second
►I’d like my web server and my databases to handle 1,000 transactions per second so that
customers do not experience bad delays
Amazon “processes” several GBytes of data per second
What Does High Performance
► It includes the following subjects
- Computer Architecture
- Network Connections
- Programming paradigms
INTERNATIONAL COMPETITION FOR
Nations and global regions, including China, the United States, Japan, and
Russia, are racing ahead and have created national programs that are
investing large sums of money to develop exascale supercomputers.
Supercomputer: A computing system exhibiting high-end performance
capabilities and resource capacities within practical constraints of technology, cost,
power, and reliability. Thomas Sterling, 2007
Supercomputer: a large very fast mainframe used especially for scientific
computations. Merriam-Webster Online
Supercomputer: any of a class of extremely powerful computers. The term is
commonly applied to the fastest high-performance systems available at any given time.
Such computers are used primarily for scientific and engineering work requiring
exceedingly high-speed computations. Encyclopedia Britannica Online
Research Applications of HPCs
• Sports and Entertainment
• Weather Forecasting
• Space Research
• Health-Care-Related Applications
• to unravel the morphology of cancer cells,
• to diagnose and treat cancers and improve the safety of cancer
• medical research,
• Personalized medicine
Introduction - today’s lecture
System Architectures (Single Instruction - Single Data, Single Instruction - Multiple
Data, Multiple Instruction - Multiple Data, Shared Memory, Distributed Memory,
Cluster, Multiple Instruction - Single Data)
Performance Analysis of parallel calculations (the speedup, efficiency, time execution
Parallel numerical methods (Principles of Parallel Algorithm Design, Analytical
Modeling of Parallel Programs, Matrices Operations, Matrix-Vector Operations, Graph
Software (Programming Using Compute Unified Device Architecture)
In the first section we will discuss the importance of parallel computing to high
performance computing. We will show the basic concepts of parallel computing.
The advantages and disadvantages of parallel computing will be discussed. We
will present an overview of current and future trends in HPC hardware.
The second section will provide an introduction to parallel GPU implementations
of numerical methods, such as Matrices Operations, Matrix-Vector Operations,
Graph Algorithms. The second section will also present the application of parallel
computing techniques using Graphics Processing Units (GPUs) in order to improve
the computational efficiency of numerical methods.
The third section will briefly discuss important HPC topics such as computing
on graphics processing units (GPUs) with CUDA, running relatively simple
examples on this hardware. As tradition dictates, we will show how to write
"Hello World" in CUDA. Some computational libraries available for HPC with GPU
will be highlighted.
What is traditional programming view?
Why Use Parallel Computing?
Motivation for parallelism (Moore’s law).
An Overview of Parallel Processing
Parallelism in Uniprocessor Systems
Organization of Multiprocessor
MIMD System Architectures
Concepts and terminology.
Parallelism is a method to improve computer
system performance by executing two or more
The goals of parallel processing.
One goal is to reduce the “wall-clock” time or the
amount of real time that you need to wait for a
problem to be solved.
Another goal is to solve bigger problems that
might not fit in the limited memory of a single
Consider your favorite computational application
• One processor can give me results in N hours
• Why not use N processors -- and get the
results in just one hour?
Parallel computing: the use
of multiple computers or
processors working together
on a common task.
Each processor works on
its section of the problem.
Processors are allowed to
with other processors.
Grid of a Problem to be
Flynn’s Classification (Taxonomy)
Was proposed by researcher Michael J. Flynn in
It is the most commonly accepted taxonomy of
In this classification, computers are classified by
whether they process a single instruction at a time or
multiple instructions simultaneously, and whether they
operate on one or multiple data sets.
4 categories of Flynn’s classification of multiprocessor
systems by their instruction and data streams
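The two questions in the classification (one or many instruction streams, one or many data streams) can be sketched as a tiny lookup. This is only an illustration: the function name `flynn_category` and the example stream counts are assumptions introduced here, not part of the lecture.

```python
def flynn_category(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn category for the given numbers of streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    # "S" for a single stream, "M" for multiple streams.
    instr = "S" if instruction_streams == 1 else "M"
    data = "S" if data_streams == 1 else "M"
    return f"{instr}I{data}D"


# Hypothetical machines: (instruction streams, data streams)
for streams in [(1, 1), (1, 8), (4, 1), (4, 8)]:
    print(streams, flynn_category(*streams))
```

The four combinations of the two answers give exactly the four categories of the taxonomy.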
Simple Diagrammatic Representation
SISD machines execute a single instruction
on individual data values using a single
Based on traditional Von Neumann
uniprocessor architecture, instructions are
executed sequentially or serially, one step
after the next.
Until recently, most computers were of
An SIMD machine executes a single
instruction on multiple data values
simultaneously using many processors.
Since there is only one instruction, each
processor does not have to fetch and
decode each instruction. Instead, a single
control unit does the fetch and decoding for
SIMD architectures include array processors.
This category (MISD, Multiple Instruction - Single Data) has almost no
practical implementations. It was included in the taxonomy for
the sake of completeness.
MIMD machines are usually referred to as
multiprocessors or multicomputers.
They can execute multiple instructions
simultaneously, unlike SIMD machines.
Each processor must include its own control
unit that will assign to the processors parts
of a task or a separate task.
It has two subclasses: Shared memory and
Shared memory UMA (all processors have equal
access to memory. Can talk via memory.)
Processors only have access to their
local memory and “talk” to other processors
over a network
Shared memory nodes connected by a network
Hybrid Machines
•Add special purpose
processors to normal
•Not a new concept
but, regaining traction
Example: our Tesla
Shared memory and distributed memory
machines have become mainstream.
Manycore architectures: GPUs used for
High performance computing is almost synonymous
with parallel computing
Finally, the architecture of a MIMD system, in
contrast to its topology, refers to its
connections to its system memory.
Systems may also be classified by their
architectures. Two of these are:
Uniform memory access (UMA)
Nonuniform memory access (NUMA)
The UMA is a type of symmetric
multiprocessor, or SMP, that has two or more
processors that perform symmetric functions.
UMA gives all CPUs equal (uniform) access to
all memory locations in shared memory.
They interact with shared memory by some
communications mechanism like a simple bus
or a complex multistage interconnection
NUMA architectures, unlike UMA architectures,
do not allow uniform access to all shared
memory locations. This architecture still
allows all processors to access all shared
memory locations, but in a nonuniform way:
each processor can access its local shared
memory more quickly than the other memory
modules not next to it.
An analogy of Flynn’s classification is the
check-in desk at an airport
SISD: a single desk
SIMD: many desks and a supervisor with a
megaphone giving instructions that every desk
MIMD: many desks working at their own pace,
synchronized through a central database
All parallel programs contain:
• Parallel sections
• Serial sections
• Serial sections are where work is duplicated or no
useful work is being done (for example, waiting for others)
• Serial sections limit the parallel effectiveness
• If you have a lot of serial computation then you will not
get good speedup
• Only a program with no serial work “allows” perfect speedup
• Amdahl’s Law states this formally
Amdahl’s Law places a strict limit on the speedup that can be
realized by using multiple processors.
Effect of multiple processors on run time
• Effect of multiple processors on speed up
• fS = serial fraction of code
• fp = parallel fraction of code
• N = number of processors
• Perfect speedup: t = t1/N, or S(N) = N
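With the symbols listed above, Amdahl's Law is usually written as follows. The names $t_1$ (run time on one processor), $t_N$ (run time on $N$ processors) and $S(N)$ (speedup) are notation assumed here for the sketch; the slides' own formulas did not survive extraction.

```latex
% Serial and parallel fractions partition the code.
f_s + f_p = 1

% Effect of multiple processors on run time:
t_N = \left( f_s + \frac{f_p}{N} \right) t_1

% Effect of multiple processors on speedup:
S(N) = \frac{t_1}{t_N} = \frac{1}{f_s + \dfrac{f_p}{N}}

% Limit for an unbounded number of processors:
\lim_{N \to \infty} S(N) = \frac{1}{f_s}
```

Setting $f_s = 0$ recovers the perfect-speedup case, $t_N = t_1 / N$ and $S(N) = N$.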
Amdahl’s Law provides a theoretical upper limit
on parallel speedup assuming that there are no
costs for communications.
In reality, communications will result in a
further degradation of performance
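A minimal numerical sketch of both points: the ideal Amdahl limit, and how a communication cost pulls the speedup below it. The function names, the 5% serial fraction, and the linear communication-cost model are all assumptions chosen for illustration; real communication costs depend on the machine and the algorithm.

```python
def amdahl_speedup(serial_fraction: float, n: int) -> float:
    """Ideal Amdahl speedup on n processors (no communication cost)."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n)


def speedup_with_comm(serial_fraction: float, n: int, comm_cost: float) -> float:
    """Speedup with a hypothetical communication cost.

    comm_cost is an assumed cost per additional processor, expressed as a
    fraction of the one-processor run time; it is not a measured value.
    """
    parallel_fraction = 1.0 - serial_fraction
    run_time = serial_fraction + parallel_fraction / n + comm_cost * (n - 1)
    return 1.0 / run_time


for n in (1, 4, 16, 64, 1024):
    ideal = amdahl_speedup(0.05, n)
    real = speedup_with_comm(0.05, n, 0.001)
    print(f"N={n:5d}  ideal={ideal:6.2f}  with communication={real:6.2f}")
```

With a 5% serial fraction the ideal speedup can never exceed 1/0.05 = 20, and under this assumed cost model adding processors eventually makes the program slower.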
Writing effective parallel applications is difficult
• Communication can limit parallel efficiency
• Serial time can dominate
• Load balance is important
Is it worth your time to rewrite your application?
• Do the CPU requirements justify
• Will the code be used just once?
Superlinear speedup, S(n) > n,
may be seen on occasion, but usually this is due to using a
suboptimal sequential algorithm or some unique feature of the
architecture that favors the parallel formulation.
One common reason for superlinear speedup is the extra cache
in the multiprocessor system, which can hold more of the
problem data at any instant; it leads to less, relatively slow
Efficiency = (execution time using one processor) /
(execution time using N processors × N)
It is just the speedup divided by the number of processors.
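As a small worked example, speedup and efficiency can be computed directly from two measured run times. The function names and the timings (120 s on one processor, 20 s on eight) are hypothetical values chosen for illustration.

```python
def speedup(t1: float, tn: float) -> float:
    """Speedup: one-processor time divided by n-processor time."""
    return t1 / tn


def efficiency(t1: float, tn: float, n: int) -> float:
    """Efficiency: speedup divided by the number of processors."""
    return t1 / (tn * n)


# Hypothetical measurements, in seconds.
t1, t8 = 120.0, 20.0
print("speedup   :", speedup(t1, t8))
print("efficiency:", efficiency(t1, t8, 8))
```

An efficiency of 1.0 corresponds to perfect speedup; values below 1.0 show how much of the added processors' time is lost to serial work, communication, or idling.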
Scalability is used to indicate a hardware design that allows the system to be
increased in size and in doing so to obtain increased
performance - could be described as architecture or hardware
Scalability is also used to indicate that a parallel algorithm can
accommodate increased data items with a low and bounded
increase in computational steps - could be described as
Problem size: the number of basic steps in the
best sequential algorithm for a given problem and
data set size
•Intuitively, we would think of the number of data elements
being processed in the algorithm as a measure of size.
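•However, doubling the data set size would not necessarily
double the number of computational steps. It will depend upon the problem.
•For example, adding two matrices has this effect, but
multiplying matrices does not: with the standard algorithm the steps grow as n^3
for n × n matrices, so doubling the number of elements increases the work by a
factor of about 2.8.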
Note: bad sequential algorithms tend to scale well.
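A quick count makes the dependence on the problem concrete. The sketch below assumes n × n matrices, element-wise addition (n^2 steps) and the schoolbook multiplication algorithm (n^3 multiply-add steps); the function names and the matrix sizes are illustrative assumptions.

```python
import math


def add_steps(n: int) -> int:
    """Basic steps to add two n x n matrices (one addition per element)."""
    return n * n


def matmul_steps(n: int) -> int:
    """Multiply-add steps for schoolbook n x n matrix multiplication."""
    return n ** 3


n = 1000
# Dimension that roughly doubles the number of elements: n * sqrt(2).
n_doubled = round(n * math.sqrt(2))

add_growth = add_steps(n_doubled) / add_steps(n)
mul_growth = matmul_steps(n_doubled) / matmul_steps(n)
print(f"elements grow by  {n_doubled**2 / n**2:.2f}x")
print(f"addition steps by {add_growth:.2f}x")
print(f"multiply steps by {mul_growth:.2f}x")
```

Under these assumptions addition scales in proportion to the data, while multiplication grows faster than the data does.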
• Latency
• How long to get between nodes in the
• Bandwidth
• How much data can be moved per unit time.
• Bandwidth is limited by the number of wires
and the rate at which each wire can accept data and
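Latency and bandwidth are often combined into a simple first-order estimate of message transfer time: a fixed start-up delay plus the message size divided by the bandwidth. This model and the link parameters below (1 microsecond latency, 1 GB/s bandwidth) are assumptions for illustration, not figures from the lecture.

```python
def transfer_time(message_bytes: float, latency_s: float,
                  bandwidth_bytes_per_s: float) -> float:
    """Estimated time to move one message: latency + size / bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


# Hypothetical link parameters.
LATENCY = 1e-6      # seconds
BANDWIDTH = 1e9     # bytes per second

for size in (8, 1_000, 1_000_000, 1_000_000_000):
    t = transfer_time(size, LATENCY, BANDWIDTH)
    print(f"{size:>13,d} bytes -> {t:.6f} s")
```

Under this model small messages are dominated by latency and large messages by bandwidth, which is why sending many tiny messages tends to be expensive.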
For ultimate performance you may be
concerned how your nodes are connected.
Avoid communications between distant nodes.
For some machines it might be difficult to
control or know the placement of applications.
A system may also be classified by its topology.
A topology is the pattern of connections between
The cost-performance tradeoff determines which
topologies to use for a multiprocessor system.
A topology is characterized by its diameter,
total bandwidth, and bisection bandwidth
◦ Diameter – the maximum distance between two
processors in the computer system.
◦ Total bandwidth – the capacity of a
communications link multiplied by the number of
such links in the system.
◦ Bisection bandwidth – represents the maximum
data transfer that could occur at the bottleneck in
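Two of these characteristics can be computed for small example networks. The sketch below measures distance in hops, finds the diameter with a breadth-first search, and computes total bandwidth as link capacity times link count. The helper names, the two example topologies (a ring and a fully connected network of 8 processors), and the link capacity of 10 units are all illustrative assumptions.

```python
from collections import deque


def ring(n):
    """Each processor is linked to its two neighbors on a ring."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def fully_connected(n):
    """Each processor is linked directly to every other processor."""
    return {i: set(range(n)) - {i} for i in range(n)}


def diameter(graph):
    """Maximum over all pairs of the shortest hop count between them."""
    longest = 0
    for start in graph:
        dist = {start: 0}
        queue = deque([start])
        while queue:                      # breadth-first search from start
            node = queue.popleft()
            for nxt in graph[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        longest = max(longest, max(dist.values()))
    return longest


def link_count(graph):
    """Number of bidirectional links (each one is stored twice)."""
    return sum(len(neighbors) for neighbors in graph.values()) // 2


def total_bandwidth(graph, link_capacity):
    """Capacity of one link multiplied by the number of links."""
    return link_capacity * link_count(graph)


for name, net in (("ring", ring(8)), ("fully connected", fully_connected(8))):
    print(name, "diameter:", diameter(net),
          "links:", link_count(net),
          "total bandwidth:", total_bandwidth(net, 10))
```

The comparison shows the cost-performance tradeoff: the fully connected network has the smallest diameter but needs far more links than the ring.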
◦ Processors communicate with each other via a single
bus that can only handle one data transmission at a time.
◦ In most shared buses, processors communicate with their
own local memory.
◦ Uses direct connections
instead of a shared bus.
◦ Allows communication
links to be active
simultaneously but data
may have to travel
processors to reach its
◦ Uses direct
◦ There is only one
unique path between
any pair of
◦ In the mesh topology,
connects to the
processors above and
below it, and to its
right and left.
◦ Is a multiple mesh topology.
◦ Each processor connects to all other processors whose
binary values differ by one bit. For example,
0 (0000) connects to 1 (0001) and 2 (0010).
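The connection rule described above, where two processors are linked when their binary labels differ in exactly one bit, is the hypercube rule, and it can be sketched with a bit flip per dimension. The function name and the 4-dimensional (16-processor) example are assumptions for illustration.

```python
def hypercube_neighbors(node: int, dimensions: int) -> list:
    """Labels that differ from `node` in exactly one of `dimensions` bits."""
    return [node ^ (1 << bit) for bit in range(dimensions)]


DIMENSIONS = 4  # hypothetical 16-processor hypercube
for node in (0, 5):
    neighbors = hypercube_neighbors(node, DIMENSIONS)
    labels = ", ".join(f"{n} ({n:04b})" for n in neighbors)
    print(f"{node} ({node:04b}) connects to: {labels}")
```

Each processor has as many links as the label has bits, so a hypercube with 2^d processors needs only d links per processor.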
Every processor has
n-1 connections, one to
each of the other
There is an increase in
complexity as the system
grows but this offers
Moore's Law describes a long-
term trend in the history of
computing hardware, in which
the number of transistors that
can be placed inexpensively on
an integrated circuit has doubled
approximately every two years.
Mechanical Computing
◦ Babbage, Hollerith, Aiken
Electronic Digital Calculating
◦ Atanasoff, Eckert, Mauchly
von Neumann Architecture
◦ Turing, von Neumann, Eckert, Mauchly, Foster, Wilkes
Birth of the Supercomputer
◦ Cray, Watanabe
The Golden Age
◦ Batcher, Dennis, S. Chen, Hillis, Dally, Blank, B. Smith
Common Era of Killer Micros
◦ Scott, Culler, Sterling/Becker, Goodhue, A. Chen, Tomkins
◦ Messina, Sterling, Stevens, P. Smith,
• Leibniz Stepped Reckoner
• Babbage Difference Engine
• Hollerith Tabulator
• Harvard Mark 1
• University of Pennsylvania ENIAC
• Cambridge EDSAC
• MIT Whirlwind
• Cray 1
• TMC CM-2
• Intel Touchstone Delta
• IBM Blue Gene/L
Eckert and Mauchly,
to problems in fields
such as atomic
energy and ballistic
Maurice Wilkes, 1949.
Mercury delay lines for
memory and vacuum
tubes for logic.
Used one of the first
Calculation of prime
numbers, solutions of
Jay Forrester, 1949.
First computer to use
Displayed real time
text and graphics on
a large oscilloscope
Cray Research,
Unique C-shape to
help increase the
signal speeds from
one end to the other.
INTEL, 1990.
LINPACK rating of
13.9 GFLOPS .
applications like real-
time processing of
satellite images and
molecular models for
Thomas Sterling and
Donald Becker, 1994.
Cluster formed of one
head node and
Nodes and network
dedicated to the
Compute nodes are
Use open source
Japan, 1997.
640 nodes with eight
and 16 gigabytes of
computer memory at
IBM, 2004.
ever to run over 100
on a real world
application, namely a
1992 to present
Killer Micro and mass market
High density DRAM
High cost of fab lines
◦ Message passing
Economy of scale S-curve
◦ Gustafson et al
Beowulf, NOW Clusters
Hybrid cluster solutions & services that fully leverage the performance
Demand for computing power is growing steadily, as scientists &
engineers seek to tackle increasingly complex problems. The emergence
of multi-core CPUs has allowed systems to keep pace with these demands, but
energy consumption, space, & cooling have become major inhibitors to
computing systems expansion. Hence the success of acceleration
technologies such as GPGPUs (General-Purpose Graphics Processing
Units), which offer both breakthrough performance & outstanding space &
GPGPUs can accelerate processing by a factor of anywhere from 1 to 100, depending on the workload.
Starvation
◦ Not enough work to do due to insufficient parallelism or
poor load balancing among distributed resources
Latency
◦ Waiting for access to memory or other parts of the system
Overhead
◦ Extra work that has to be done to manage program
concurrency and parallel resources, beyond the real work you
want to perform
Waiting for Contention
◦ Delays due to fighting over what task gets to use a shared
resource next. Network bandwidth is a major constraint.
Simply put, in an architectural sense, a CPU is composed of a few large
Arithmetic Logic Unit (ALU) cores for general-purpose processing, with lots
of cache memory and one large control module that can handle a few
software threads at a time. A CPU is optimized for serial operations since its
clock rate is very high. A GPU, on the other hand, has many small ALUs,
small control modules, and a small cache. A GPU is optimized for parallel
A simple way to understand the difference between a GPU and a CPU is to compare how
they process tasks. A CPU consists of a few cores optimized for sequential serial processing
while a GPU has a massively parallel architecture consisting of thousands of smaller, more
efficient cores designed for handling multiple tasks simultaneously.
GPUs have thousands of cores to process parallel workloads efficiently
CUDA: More mature, bigger ‘ecosystem’, NVIDIA
OpenCL: Vendor-independent, open industry
Interfaces to C/C++, Fortran, Python, .NET, . . .
Important: Hardware abstraction and
‘expressiveness’ are identical
Fourier Transforms
CUFFT: NVIDIA, part of the CUDA toolkit
APPML (formerly ACML-GPU): AMD Accelerated
Parallel Processing Math Libraries
Dense linear algebra
CUBLAS: NVIDIA’s basic linear algebra subprograms
APPML (formerly ACML-GPU): AMD Accelerated
Parallel Processing Math Libraries
CULA: Third-party LAPACK, matrix decompositions
and eigenvalue problems
MAGMA and PLASMA: BLAS/LAPACK for multicore
and manycore (ICL, Tennessee)
Ten years ago, when GPUs were first used to perform general-purpose computation, they were programmed using
low-level mechanisms such as the interrupt services of the BIOS, or by using graphics APIs such as OpenGL and
DirectX. Later, programs for GPUs were developed in assembly language for each card model, and they had
very limited portability. So, high-level languages were developed to fully exploit the capabilities of GPUs. In
2007, NVIDIA introduced CUDA, a software architecture for managing the GPU as a parallel computing device
without requiring the data and the computation to be mapped onto a graphics API. CUDA is based on an extension of the C
language, and it is available for graphics cards of the GeForce 8 Series and later, using the 32- and 64-bit versions of
the Linux and Windows (XP and successors) operating systems. Three software layers are used in CUDA to
communicate with the GPU (see Figure 1): a low-level hardware driver that performs the data communications
between the CPU and the GPU, a high-level API, and a set of libraries that includes CUBLAS for linear algebra
calculations and CUFFT for Fourier transform calculations. For the CUDA programmer, the GPU is a computing
device that is able to execute a large number of threads in parallel. A specific procedure to be executed many times
over different data can be isolated in a GPU function using many execution threads. The function is compiled using a
specific set of instructions, and the resulting program, named a kernel, is loaded onto the GPU. The GPU has its own DRAM,
and the data are copied from the DRAM of the GPU to the RAM of the host (and vice versa) using optimized calls to
the CUDA API. The CUDA architecture is built around a scalable array of multiprocessors, each one of them having
eight scalar processors, one multithreading unit, and a shared memory chip. The multiprocessors are able to create,
manage, and execute parallel threads with reduced overhead. The threads are grouped in blocks (with up to 512
threads), which are executed in a single multiprocessor of the graphics card, and the blocks are grouped in grids.
Each time that a CUDA program calls a grid to be executed on the GPU, each one of the blocks in the grid is
numbered and distributed to an available multiprocessor.
When a multiprocessor receives one (or more) blocks to execute, it splits the threads into warps, sets of 32
consecutive threads. Each warp executes a single instruction at a time, so the best efficiency is achieved when the 32
threads in the warp execute the same instruction. Otherwise, the warp serializes the threads. Each time that a
block finishes its execution, a new block is assigned to the available multiprocessor. The threads are able to access
the data using three memory spaces: the shared memory of the block, which can be used by the threads in the
block; the local memory of the thread; and the global memory of the GPU. Minimizing access to the slower
memory spaces (the local memory of the thread and the global memory of the GPU) is very important for
achieving efficiency in GPU programming. The shared memory, on the other hand, is placed within the GPU chip, and thus it
provides a faster way to store the data.
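The grid, block, and warp bookkeeping described above can be modeled on the CPU to make the arithmetic concrete. The sketch below is plain Python, not CUDA code and not the CUDA API: the helper names are assumptions, the index formula mirrors the usual one-dimensional `blockIdx.x * blockDim.x + threadIdx.x` pattern, and the warp size of 32 and block size of 512 are the figures quoted in the text.

```python
WARP_SIZE = 32  # threads per warp, as stated in the text


def global_thread_id(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """Unique index of a thread within a one-dimensional grid."""
    return block_idx * block_dim + thread_idx


def warps_in_block(block_dim: int) -> int:
    """Number of warps needed for a block (the last one may be partly empty)."""
    return (block_dim + WARP_SIZE - 1) // WARP_SIZE


def warp_of_thread(thread_idx: int) -> int:
    """Warp that a thread belongs to inside its block."""
    return thread_idx // WARP_SIZE


def warp_diverges(thread_indices, predicate) -> bool:
    """True if threads of one warp would take different branches."""
    return len({bool(predicate(t)) for t in thread_indices}) > 1


first_warp = range(WARP_SIZE)
print("warps in a 512-thread block:", warps_in_block(512))
print("global id of thread 5 in block 2:", global_thread_id(2, 512, 5))
# Branching on even/odd thread index splits every warp (serialized execution).
print("even/odd branch diverges:", warp_diverges(first_warp, lambda t: t % 2 == 0))
# Branching on the warp number keeps all 32 threads on the same path.
print("per-warp branch diverges:",
      warp_diverges(first_warp, lambda t: warp_of_thread(t) % 2 == 0))
```

The last two lines illustrate why branch conditions that are uniform across a warp are preferred: only the even/odd branch would force the warp to serialize its threads.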