This presentation was prepared by Abdussamad Muntahi for the Seminar on High Performance Computing on 11/7/13 (Thursday), organized by the BRAC University Computer Club (BUCC) in collaboration with the BRAC University Electronics and Electrical Club (BUEEC).
2. Introduction
• High‐speed computing, originally implemented only in
supercomputers for scientific research
• Tools and systems are now available to implement and create
high performance computing systems
• Used for scientific research and computational science
• The main focus of the discipline is developing parallel
processing algorithms and software, so that programs
can be divided into small independent parts and
executed simultaneously by separate processors
• HPC systems have shifted from supercomputers to
computing clusters
3. What is a Cluster?
• A cluster is a group of machines interconnected so that
they work together as a single system
• Terminology
o Node – individual machine in a cluster
o Head/Master node – connected to both the
private network of the cluster and a public
network, and used to access the
cluster. Responsible for providing users an
environment to work in and for distributing
tasks among the other nodes
o Compute nodes – connected to only the
private network of the cluster and are
generally used for running jobs assigned to
them by the head node(s)
4. What is a Cluster?
• Types of Cluster
o Storage
Storage clusters provide a consistent file system image,
allowing simultaneous reads and writes to a single shared file system
o High‐availability (HA)
Provide continuous availability of services by eliminating single points of
failure
o Load‐balancing
Sends network service requests to multiple cluster nodes to balance the
requested load among the cluster nodes
o High‐performance
Use cluster nodes to perform concurrent calculations
Allows applications to work in parallel to enhance the performance of the
applications
Also referred to as computational clusters or grid computing
5. Benefits of Clusters
• Reduced Cost
o The price of off‐the‐shelf consumer desktops has plummeted in recent
years, and this drop in price has corresponded with a vast increase in
their processing power and performance. The average desktop PC
today is many times more powerful than the first mainframe
computers.
• Processing Power
o The parallel processing power of a high‐performance cluster can, in
many cases, prove more cost effective than a mainframe with similar
power. This reduced price‐per‐unit of power enables enterprises to get
a greater ROI (Return On Investment).
• Scalability
o Perhaps the greatest advantage of computer clusters is the scalability
they offer. While mainframe computers have a fixed processing
capacity, computer clusters can be easily expanded as requirements
change by adding additional nodes to the network.
6. Benefits of Clusters
• Improved Network Technology
o In clusters, computers are typically connected via a single virtual local
area network (VLAN), and the network treats each computer as a
separate node. Information can be passed throughout these networks
with very little lag, ensuring that data doesn’t bottleneck between
nodes.
• Availability
o When a mainframe computer fails, the
entire system fails. However, if a node
in a computer cluster fails, its
operations can be simply transferred
to another node within the cluster,
ensuring that there is no interruption
in service.
8. Why do we need ever‐increasing
performance?
• Computational power is increasing, but so are our
computation problems and needs.
• Some Examples:
– Case 1: Complete a time‐consuming operation in less time
• I am an automotive engineer
• I need to design a new car that consumes less gasoline
• I’d rather have the design completed in 6 months than in 2 years
• I want to test my design using computer simulations rather than building
very expensive prototypes and crashing them
– Case 2: Complete an operation under a tight deadline
• I work for a weather prediction agency
• I am getting input from weather stations/sensors
• I’d like to predict tomorrow’s forecast today
11. Where are we using HPC?
• Used to solve complex modeling problems in a spectrum of
disciplines
• Topics include:
o Artificial intelligence
o Climate modeling
o Automotive engineering
o Cryptographic analysis
o Geophysics
o Molecular biology
o Molecular dynamics
o Nuclear physics
o Physical oceanography
o Plasma physics
o Quantum physics
o Quantum chemistry
o Solid state physics
o Structural dynamics
• HPC is currently applied to business uses as well
o data warehouses
o transaction processing
17. Parallel Computing
• Form of computation in which many calculations are
carried out simultaneously, operating on the principle
that large problems can often be divided into smaller
ones, which are then solved concurrently, i.e. "in
parallel"
• So we need to rewrite serial programs so that they're
parallel.
• An alternative is to write translation programs that
automatically convert serial programs into parallel programs.
– This is very difficult to do.
– Success has been limited.
19. Parallel Computing
• Example
– We have p cores, p much smaller than n.
– Each core performs a partial sum of approximately n/p
values.
Each core uses its own private variables
and executes this block of code
independently of the other cores.
22. Parallel Computing
• Parallel Computer classification
– The semiconductor industry has settled on two main
trajectories
• Multicore trajectory – CPU
– Coarse, heavyweight threads
– Maximizes the speed of sequential programs
(better performance per thread)
• Many‐core trajectory – GPU
– A large number of much smaller cores to improve the execution
throughput of parallel applications
– Fine, lightweight threads; single‐thread performance is poor
28. CPU vs. GPU
• Architecture and Technology
– Control hardware dominates μprocessors
• Complex, difficult to build and verify
• Scales poorly
– Pay for max throughput, sustain average throughput
– Control hardware doesn’t do any math!
– Industry moving from “instructions per second” to
“instructions per watt”
• Traditional μprocessors are not power‐efficient
– We can continue to put more transistors on a chip
• … but we can’t scale their voltage like we used to …
• … and we can’t clock them as fast …
29. Why GPU?
• GPU is a massively parallel architecture
– Many problems map well to GPU‐style computing
– GPUs have large amount of arithmetic capability
– Increasing amount of programmability in the pipeline
– CPUs have dual‐ and quad‐core chips, but a GPU currently has 240 cores
(GeForce GTX 280)
• Memory Bandwidth
– CPU – 3.2 GB/s; GPU – 141.7 GB/s
• Speed
– CPU – 20 GFLOPS (per core)
– GPU – 933 (single‐precision or int) / 78 (double‐precision) GFLOPS
• Direct access to compute units in new APIs
32. Parallel Programming
• HPC Parallel Programming Models associated with
different computing technology
– Parallel programming in CPU Clusters
– General purpose GPU programming
33. Operational Model: CPU
• Originally was designed for distributed memory architectures
• Tasks are divided among p processes
• Data‐parallel, compute‐intensive functions should be selected
to be assigned to these processes
• Functions that are executed many times, but independently
on different data, are prime candidates
– e.g. the bodies of for‐loops
35. Programming Language: MPI
• Message Passing Interface
– An application programming interface (API) specification
that allows processes to communicate with one another by
sending and receiving messages
– Now a de facto standard for parallel programs running on
distributed memory systems in computer clusters and
supercomputers
– A message‐passing API with language‐independent
protocol and semantic specifications
– Supports both point‐to‐point and collective communication
– Communications are defined by the APIs
36. Programming Language: MPI
• Message Passing Interface
– Goals are standardization, high performance, scalability,
and portability
– Consists of a specific set of routines (i.e. APIs) directly
callable from C, C++, Fortran and any language able to
interface with such libraries
– Program consists of autonomous processes
• The processes may run either the same code (SPMD style) or
different codes (heterogeneous)
– Processes communicate with each other via calls to MPI
functions
37. Operation Model: GPU
• The CPU and GPU operate with separate memory pools
• CPUs are masters and GPUs are workers
– CPUs launch computations onto the GPUs
– CPUs can be used for other computations as well
– GPUs have limited communication back to CPUs
• CPU must initiate data transfer to the GPU memory
– Synchronous data transfer – CPU waits for transfer to complete
– Asynchronous data transfer – CPU continues with other work; checks if
transfer is complete
40. Programming Language: CUDA
• Compute Unified Device Architecture
– Introduced by Nvidia in late 2006
– It is a compiler and toolkit for programming NVIDIA GPUs
– API extends the C programming language
– Adds library functions to access GPU
– Adds directives to translate C into instructions that run on
the host CPU or the GPU when needed
– Allows easy multi‐threading ‐ parallel execution on all
thread processors on the GPU
– Runs on thousands of threads
– It is a scalable model
41. Programming Language: CUDA
• Compute Unified Device Architecture
– General purpose programming model
• User kicks off batches of threads on the GPU
• Specific language and tools
– Driver for loading computation programs into GPU
• Standalone Driver ‐ Optimized for computation
• Interface designed for compute – graphics‐free API
• Explicit GPU memory management
– Objectives
• Express parallelism
• Provide a high‐level abstraction of the hardware
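Since CUDA extends C, the model on these slides can be sketched as below — a hedged sketch using the standard CUDA runtime API (`cudaMalloc`, `cudaMemcpy`, the `<<<blocks, threads>>>` launch syntax). It needs `nvcc` and an NVIDIA GPU to run, and the kernel name `scale` and the element count are illustrative.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

/* Kernel: each of the many lightweight GPU threads scales one element. */
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;                   /* one thread per element */
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    /* The CPU (host) explicitly manages GPU (device) memory ...        */
    float *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  /* synchronous copy */

    /* ... and kicks off a batch of threads on the GPU                   */
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d, 2.0f, n);

    /* Copying the result back waits for the kernel to finish */
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    printf("h[0] = %f\n", h[0]);

    cudaFree(d);
    free(h);
    return 0;
}
```

The explicit host/device copies and the CPU-initiated launch mirror the master/worker and separate-memory-pool model from the "Operation Model: GPU" slide; `cudaMemcpyAsync` would be the asynchronous variant mentioned there.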