UBa/NAHPI-2020
Department of Computer Engineering
PARALLEL AND DISTRIBUTED
COMPUTING
By
Malobe LOTTIN Cyrille .M
Network and Telecoms Engineer
PhD Student- ICT–U USA/CAMEROON
Contact
Email: malobecyrille.marcel@ictuniversity.org
Phone: 243004411/695654002
CHAPTER 2
Parallel and Distributed Computer
Architectures, Performance Metrics
And Parallel Programming Models
Previous … Chap 1: General Introduction (Parallel and Distributed Computing)
CONTENTS
• INTRODUCTION
• Why parallel Architecture ?
• Modern Classification of Parallel Computers
• Structural Classification of Parallel Computers
• Parallel Computers Memory Architectures
• Hardware Classification
• Performance of Parallel Computers architectures
- Peak and Sustained Performance
• Measuring Performance of Parallel Computers
• Other Common Benchmarks
• Parallel Programming Models
- Shared Memory Programming Model
- Thread Model
- Distributed Memory
- Data Parallel
- SPMD/MPMD
• Conclusion
Exercises (Check your Progress, Further Reading and Evaluation)
Previously on Chap 1
 Part 1- Introducing Parallel and Distributed Computing
• Background Review of Parallel and Distributed Computing
• INTRODUCTION TO PARALLEL AND DISTRIBUTED COMPUTING
• Some key terminologies
• Why parallel Computing?
• Parallel Computing: the Facts
• Basic Design Computer Architecture: the von Neumann Architecture
• Classification of Parallel Computers (SISD,SIMD,MISD,MIMD)
• Assignment 1a
 Part 2- Initiation to Parallel Programming Principles
• High Performance Computing (HPC)
• Speed: a need to solve Complexity
• Some Case Studies Showing the need of Parallel Computing
• Challenge of explicit Parallelism
• General Structure of Parallel Programs
• Introduction to Amdahl's LAW
• GUSTAFSON's LAW
• SCALABILITY
• Fixed Size Versus Scale Size
• Assignment 1b
• Conclusion
INTRODUCTION
• Parallel Computer Architecture is the method of organizing and maximizing
computer resources to achieve maximum performance.
- Performance, at any instant in time, is achievable only within the limits given
by the technology.
- The same system may be characterized both as "parallel" and
"distributed"; the processors in a typical distributed system run
concurrently in parallel.
• The use of more processors to compute tasks simultaneously
contributes to providing more features to computer systems.
• In the Parallel architecture, processors may, during computation, have
access to a shared memory to exchange information between them.
Image source: Wikipedia, Distributed Computing, 2020
• In a Distributed architecture, each processor makes use of its own private
memory (distributed memory) during computation. In this case, information
is exchanged by passing messages between the processors.
• Significant characteristics of distributed systems are: concurrency of
components, lack of a global clock (clock synchronization), and
independent failure of components.
• The use of distributed systems to solve computational problems is
called Distributed Computing (the problem is divided into many tasks, each handled by one or
more computers, which communicate with each other via message passing).
• High-performance parallel computation on a shared-memory
multiprocessor uses parallel algorithms, while the coordination of
a large-scale distributed system uses distributed algorithms.
INTRODUCTION
Image source: Wikipedia, Distributed Computing, 2020
• Parallelism is nowadays present at all levels of computer architecture.
• It is the enhancement of processors that explains the success in the
development of parallelism.
• Today, processors are superscalar (they execute several instructions in parallel each clock cycle).
- Besides, the advancement of the underlying Very Large-Scale Integration (VLSI) technology
allows larger and larger numbers of components to fit on a chip and clock rates to increase.
• Three main elements define structure and performance of Multiprocessor:
- Processors
- Memory Hierarchies (registers, cache, main memory, magnetic discs, magnetic tapes)
- Interconnection Network
• But the performance gap between the processor and the memory is still
increasing …
• Parallelism is used by computer architecture to translate the raw potential of
the technology into greater performance and expanded capability of the
computer system
• Diversity in parallel computer architecture makes the field challenging to learn
and challenging to present.
INTRODUCTION ( Cont…)
Remember that:
A parallel computer is a collection of processing elements that
cooperate and communicate to solve large problems fast.
• The attempt to solve these large problems raises some fundamental
questions that can only be answered by understanding:
- The various components of Parallel and Distributed systems (design
and operation),
- How large a problem a given Parallel and Distributed system can
solve,
- How processors cooperate, communicate and transmit data between
them,
- The primitive abstractions that the hardware and software provide
to the programmer for better control,
- And, How to ensure a proper translation to performance once these
elements are under control.
INTRODUCTION (Cont…)
Why Parallel Architecture ?
• No matter the performance of a single processor at a given time, we can
in principle achieve higher performance by utilizing many such processors,
as long as we are ready to pay the price (cost).
Parallel Architecture is needed To:
 Respond to Applications Trends
• Advances in hardware capability enable new application functionality 
drives parallel architecture harder, since parallel architecture focuses on the
most demanding of these applications.
• At the Low end level, we have the largest volume of machines and greatest
number of users; at the High end, most demanding applications.
• Consequence: pressure for increased performance  most demanding
applications must be written as parallel programs to respond to this
demand generated from the High end
 Satisfy the need of High Computing in the field of computational science
and engineering
- A response to the need to simulate physical phenomena that are impossible or very
costly to observe through empirical means (modeling global climate change
over long periods, the evolution of galaxies, the atomic structure of materials,
etc.)
 Respond to Technology Trends
• Can’t “wait for the single processor to get fast enough ”
Respond to Architectural Trends
• Advances in technology determine what is possible; architecture
translates the potential of the technology into performance and
capability .
• Four generations of computer architectures (tubes, transistors,
integrated circuits, and VLSI), where the strong distinction is a function of
the type of parallelism implemented (bit-level parallelism  4 bits
to 64 bits, 128 bits is the future).
• There have been tremendous architectural advances over this period:
bit-level parallelism, instruction-level parallelism, thread-level
parallelism
All these forces driving the development of parallel architectures can be
summarized under one main quest: achieving absolute maximum
performance (Supercomputing)
Why Parallel Architecture ? (Cont …)
Modern Classification
According to (Sima, Fountain, Kacsuk)
Before modern classification,
Recall Flynn’s taxonomy classification of Computers
- based on the number of instructions that can be executed and how they operate on data.
Four Main Types:
• SISD: traditional sequential architecture
• SIMD: processor arrays, vector processor
• Parallel computing on a budget – reduced control unit cost
• Many early supercomputers
• MIMD: most general purpose parallel computer today
• Clusters, MPP, data centers
• MISD: not a general purpose architecture
Note: Globally, four types of parallelism are implemented:
- Bit-Level Parallelism: performance of processors based on word size (bits)
- Instruction-Level Parallelism: gives processors the ability to execute more than one instruction
per clock cycle
- Task Parallelism: characterizes parallel programs
- Superword-Level Parallelism: based on vectorization techniques
Computer Architectures
SISD SIMD MIMD MISD
• Classification here is based on how parallelism is achieved
• by operating on multiple data: Data parallelism
• by performing many functions in parallel: Task parallelism (function)
• This is also called control parallelism or task parallelism, depending on the level of the functional
parallelism.
Modern Classification
According to (Sima, Fountain, Kacsuk)
Parallel architectures are divided into Data-parallel architectures and Function-parallel architectures.

Data-parallel architectures
- Same operations are performed on different subsets of the same data
- Synchronous computation
- Speedup is higher, as there is only one execution thread operating on all sets of data
- Amount of parallelization is proportional to the input data size
- Designed for optimum load balance on multiprocessor systems
- Applicability: arrays, matrices

Function-parallel architectures
- Different operations are performed on the same or different data
- Asynchronous computation
- Speedup is lower, as each processor will execute a different thread or process on the same or a different set of data
- Amount of parallelization is proportional to the number of independent tasks to be performed
- Load balancing depends on the availability of the hardware and on scheduling algorithms such as static and dynamic scheduling
- Applicability: pipelining
• Flynn’s classification focuses on the behavioral aspect of computers.
• Looking at the structure, parallel computers can be classified based on
how processors communicate with the memory.
 When multiple processors communicate through global shared memory modules,
the organization is called a Shared memory computer, or Tightly coupled system.
 When every processor in a multiprocessor system has its own local memory and
the processors communicate via messages transmitted between their local memories,
the organization is called a Distributed memory computer, or Loosely coupled system.
Structural Classification of Parallel Computers
Parallel Computer Memory Architectures
Shared Memory Parallel Computers architecture
- Processors can access all memory as global
address space
- Multi-processors can operate independently but
share the same memory resources
- Changes in a memory location made by one
processor are visible to all other processors
Based on memory access time, we can
classify Shared memory Parallel Computers into
two:
 Uniform Memory Access (UMA)
 Non-Uniform Memory Access (NUMA)
Parallel Computer Memory Architectures (Cont…)
 Uniform Memory Access (UMA) (known as Cache Coherent -
UMA)
• Commonly represented today by Symmetric
Multiprocessor (SMP) machines
• Identical processors
• Equal access and access times to memory
Note: Cache coherence is a hardware mechanism whereby any update of a
location in shared memory by one processor is announced to all the
other processors.
Source: Images retrieved from https://computing.llnl.gov/tutorials/parallel_comp/#SharedMemory
Non-Uniform Memory Access (NUMA)
• The architecture often links two or more SMPs
such that:
- One SMP can directly access the memory of another SMP
- Not all processors have equal access time to all memories
- Memory access across the link is slower
Note: if cache coherence is maintained, then we can also call it
Cache Coherent NUMA (CC-NUMA)
• The proximity of memory to CPUs on a Shared Memory parallel computer
makes data sharing between tasks fast and uniform.
• But there is a lack of scalability between memory and CPUs.
Parallel Computer Memory Architectures (Cont…)
Source: Images retrieved from https://computing.llnl.gov/tutorials/parallel_comp/#SharedMemory
Bruce Jacob, ... David T. Wang, in Memory Systems, 2008
Parallel Computer Memory Architectures (Cont…)
 Distributed Memory Parallel Computer Architecture
• Comes in different varieties, just like shared memory computers.
• Requires a communication network to connect inter-processor memory.
- Each processor operates independently with its own local memory
- Changes made by individual processors do not affect the memory of other
processors
- Cache coherency does not apply here!
• Access to data in another processor is usually the task of the
programmer (who must explicitly define how and when data is communicated)
• This architecture is cost effective (it can use commodity, off-the-shelf
processors and networking).
• But the programmer is more heavily engaged in data
communication between processors
Source: Retrieved from https://www.futurelearn.com/courses/supercomputing/0/steps/24022
Parallel Computer Memory Architectures (Cont…)
Source: Nikolaos Ploskas, Nikolaos Samaras, in GPU Programming in MATLAB, 2016
Parallel Computer Memory Architectures (Cont…)
Overview of Parallel Memory Architecture
Note: - The largest and fastest computers in the world today employ both
shared and distributed memory architectures (Hybrid Memory)
- In hybrid design, Shared memory component here can be a shared
memory machine and/or graphics processing units (GPU)
- And, Distributed memory component is the networking of multiple
shared memory/GPU machines
- This type of memory architecture will continue to prevail and increase
• Parallel computers can be roughly classified according to the level
at which the hardware in the parallel architecture supports
parallelism.
 Multicore Computing
Symmetric multiprocessing ( tightly coupled multiprocessing)
Hardware Classification
- Made of a computer system with multiple
identical processors that share memory
and connect via a bus
- Does not comprise more than 32 processors,
to minimize bus contention
- Symmetric multiprocessors are extremely
cost-effective
Retrieved from https://en.wikipedia.org/wiki/Parallel_computing#Bit-level_parallelism, 2020
- Processor includes multiple processing units (called "cores") on the
same chip.
- Issues multiple instructions per clock cycle from multiple instruction
streams
- Differs from a superscalar processor; but each core in a multi-core
processor can potentially be superscalar as well.
Superscalar: issue multiple instructions per clock cycle from one instruction stream
(thread).
- Example: IBM's Cell microprocessor in Sony PlayStation 3
 Distributed Computing (distributed memory multiprocessor)
Cluster Computing
Hardware Classification (Cont…)
• Not to be confused with Decentralized computing
- Allocation of resources (Hardware + software) to individual
workstations
• components are located on different networked computers,
which communicate and coordinate their actions by passing
messages to one another
• Interaction of components is done to achieve a common goal
• Characterized by concurrency of components, lack of a global
clock, and independent failure of components.
• can include heterogeneous computations where some nodes
may perform a lot more computation, some perform very
little computation and a few others may perform specialized
functionality
• Example: Multiplayer Online game
• loosely coupled computers that work together closely
• in some respects they can be regarded as a single computer
• multiple standalone machines connected by a network constitute a cluster.
• computer clusters have each node set to perform the same
task, controlled and scheduled by software.
• Computer clustering relies on a centralized management
approach which makes the nodes available as orchestrated
shared servers.
• Example: IBM's Sequoia
Sources: Dinkar Sitaram, Geetha Manjunath, in Moving To The Cloud, 2012; Cisco Systems, 2003
PERFORMANCE METRICS
Performance of parallel architectures
 Various ways exist to measure the performance of a parallel algorithm running
on a parallel processor.
 Most commonly used measurements:
- Speed-up
- Efficiency / Isoefficiency
- Elapsed time (a very important factor)
- Price/performance (elapsed time for a program divided by the cost of the machine that ran the job)
Note: none of these metrics should be used independently of the run time of the parallel system
 Common metrics of Performance
• FLOPS and MIPS are units of measure for the numerical computing performance of a
computer
• Distributed computing uses the Internet to link personal computers to achieve more
FLOPS
- MIPS: millions of instructions per second
MIPS = instruction count / (execution time × 10^6)
- MFLOPS: millions of floating point operations per second
MFLOPS = FP operations in program / (execution time × 10^6)
• Which metric is better?
• MFLOPS is more closely related to the run time of a task in numerical code:
the number of FP operations per program is determined by, for example, the matrix size (see the sketch below).
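As a quick illustration, here is a minimal C++ sketch applying the two formulas above; the instruction count, floating-point operation count and execution time are hypothetical values chosen for the example, not measurements from a real run.

#include <iostream>

int main() {
    // Hypothetical measurements for one program run
    double instruction_count = 5.0e9;   // total instructions executed
    double fp_ops            = 1.2e9;   // floating point operations executed
    double exec_time_s       = 4.0;     // execution time in seconds

    // MIPS   = instruction count / (execution time x 10^6)
    // MFLOPS = FP operations     / (execution time x 10^6)
    double mips   = instruction_count / (exec_time_s * 1.0e6);
    double mflops = fp_ops            / (exec_time_s * 1.0e6);

    std::cout << "MIPS   = " << mips   << "\n";   // 1250
    std::cout << "MFLOPS = " << mflops << "\n";   // 300
    return 0;
}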
See Chapter 1
“In June 2020, Fugaku turned in a High Performance Linpack (HPL) result
of 415.5 petaFLOPS, besting the now second-place Summit system by a
factor of 2.8x. Fugaku is powered by Fujitsu’s 48-core A64FX SoC,
becoming the first number one system on the list to be powered by ARM
processors. In single or further reduced precision, used in machine learning
and AI applications, Fugaku’s peak performance is over 1,000 petaflops (1
exaflops). The new system is installed at RIKEN Center for Computational
Science (R-CCS) in Kobe, Japan ” (wikipedia Flops, 2020).
Performance of parallel architectures
[Figure: growth of computing performance over time, comparing single-CPU performance with parallel systems ("Here we are!") and projecting into the future]
Peak and sustained performance
Peak performance
• Measured in MFLOPS
• Highest possible MFLOPS when the system does nothing but
numerical computation
• Rough hardware measure
• Gives little indication of how the system will perform in practice.
Peak Theoretical Performance
• Node performance in GFLOPS = (CPU speed in GHz) × (number of
CPU cores) × (CPU instructions per cycle) × (number of CPUs per
node)
Peak and sustained performance
• Sustained performance
- The MFLOPS rate that a program achieves over its entire run.
• Measuring sustained performance
- Using benchmarks
• Peak MFLOPS is usually much larger than sustained MFLOPS
• Efficiency rate = sustained MFLOPS / peak MFLOPS (see the sketch below)
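A minimal C++ sketch applying the peak-performance formula and the efficiency rate above; the CPU characteristics and the sustained rate are hypothetical figures, not those of a real machine.

#include <iostream>

int main() {
    // Hypothetical node characteristics
    double ghz             = 2.5;   // CPU speed in GHz
    int    cores_per_cpu   = 16;    // number of CPU cores
    int    flops_per_cycle = 8;     // CPU (floating point) instructions per cycle
    int    cpus_per_node   = 2;     // number of CPUs per node

    // Peak node performance in GFLOPS
    double peak_gflops = ghz * cores_per_cpu * flops_per_cycle * cpus_per_node;  // 640

    // Hypothetical sustained rate measured with a benchmark (e.g. LINPACK)
    double sustained_gflops = 410.0;

    // Efficiency rate = sustained / peak
    double efficiency = sustained_gflops / peak_gflops;

    std::cout << "Peak       = " << peak_gflops << " GFLOPS\n";
    std::cout << "Efficiency = " << efficiency * 100.0 << " %\n";   // about 64 %
    return 0;
}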
Measuring the performance of
parallel computers
• Benchmarks: programs that are used to measure the
performance.
• LINPACK benchmark: a measure of a system’s floating point
computing power
• Solving a dense N by N system of linear equations Ax=b
• Used to rank supercomputers in the TOP500 list.
No. 1 since June 2020:
Fugaku is powered by Fujitsu’s 48-core A64FX SoC, becoming the first
number-one system on the list to be powered by ARM processors.
Other common benchmarks
• Micro benchmark suites
• Numerical computing
• LAPACK
• ScaLAPACK
• Memory bandwidth
• STREAM
• Kernel benchmarks
• NPB (NAS parallel benchmark)
• PARKBENCH
• SPEC
• Splash
PARALLEL PROGRAMMING MODELS
A programming perspective of Parallelism implementation in parallel
and distributed Computer architectures
Parallel Programming Models
Parallel programming models exist as an abstraction above hardware
and memory architectures.
 There are several parallel programming models in common use:
• Shared Memory (without threads)
• Threads
• Distributed Memory / Message Passing
• Data Parallel
• Hybrid
• Single Program Multiple Data (SPMD)
• Multiple Program Multiple Data (MPMD)
 These models are NOT specific to a particular type of machine or
memory architecture (a given model can be implemented on any
underlying hardware).
Example: a SHARED memory model on a DISTRIBUTED memory
machine (machine memory is physically distributed across networked
machines, but appears at the user level as a single shared global address
space --- Kendall Square Research (KSR) ALLCACHE ---)
Which Model to USE ??
There is no "best" model
However, there are certainly better implementations of some models over others
Parallel Programming Models
Shared Memory Programming Model
(Without Threads)
• A thread is the basic unit to which the operating system allocates
processor time. It is the smallest sequence of programmed instructions
that can be managed independently by the operating system.
• In a Shared Memory programming model,
- Processes/tasks share a common address space, which they
read and write to asynchronously.
- They make use of mechanisms such as locks and semaphores to control
access to the shared memory, resolve contention, and prevent race
conditions and deadlocks.
• This may be considered the simplest parallel programming model.
• Note: Locks, mutexes and semaphores are types of
synchronization objects in a shared-resource
environment. They are abstract concepts.
- A lock protects access to some kind of shared resource, and gives the
right to access the protected shared resource when owned.
Example: if you have a lockable object ABC you may:
- acquire the lock on ABC,
- take the lock on ABC,
- lock ABC,
- take ownership of ABC, or relinquish ownership of ABC when it is no longer needed
- Mutex (MUTual EXclusion): a lockable object that can be owned by
exactly one thread at a time
• Example: in C++, std::mutex, std::timed_mutex, std::recursive_mutex (see the sketch below)
- Semaphore: a very relaxed type of lockable object,
with a predefined maximum count and a current count.
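As a small C++ illustration of the lock/mutex idea, a minimal sketch in which several threads update a shared counter; the counter, the number of threads and the iteration count are arbitrary choices for the example.

#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

std::mutex counter_mutex;    // lockable object protecting the shared resource
long shared_counter = 0;     // shared resource in the common address space

void worker(int increments) {
    for (int i = 0; i < increments; ++i) {
        // acquire the lock (take ownership); it is released automatically at scope exit
        std::lock_guard<std::mutex> lock(counter_mutex);
        ++shared_counter;    // race-free: only the current lock owner reaches this line
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back(worker, 100000);
    for (auto& t : threads)
        t.join();
    std::cout << "Final counter: " << shared_counter << "\n";   // 400000
    return 0;
}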
Shared Memory Programming Model (Cont…)
Advantages
• No need to specify explicitly the communication of data between tasks,
so no need to implement "ownership". Very advantageous for the programmer.
• All processes see and have equal access to the shared memory.
• Open for simplification during the development of the program.
Disadvantages
• It becomes more difficult to understand and manage data locality.
• Keeping data local to a given process conserves memory accesses,
cache refreshes and bus traffic, but controlling data locality is hard to
understand and may be beyond the control of the average user.
Shared Memory Programming Model (Cont…)
During Implementation,
• Case: stand-alone shared memory machines
- native operating systems, compilers and/or hardware provide support for
shared memory programming. E.g. the POSIX standard provides an API for using shared memory (see the sketch below).
• Case: distributed memory machines:
- memory is physically distributed across a network of machines, but made
global through specialized hardware and software
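For the stand-alone case, a minimal sketch of the POSIX shared memory API mentioned above (Linux/Unix; the segment name and size are hypothetical and error handling is kept to a bare minimum).

#include <fcntl.h>      // shm_open, O_* flags
#include <sys/mman.h>   // mmap, munmap
#include <unistd.h>     // ftruncate, close
#include <cstdio>       // perror
#include <cstring>
#include <iostream>

int main() {
    const char*  name = "/demo_segment";   // hypothetical segment name
    const size_t size = 4096;

    // create (or open) a named shared memory object visible to other processes
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd == -1) { perror("shm_open"); return 1; }
    ftruncate(fd, size);                   // set the size of the segment

    // map it into this process's address space
    void* addr = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); return 1; }

    // any cooperating process that maps the same name sees these bytes
    std::strcpy(static_cast<char*>(addr), "hello from shared memory");
    std::cout << static_cast<char*>(addr) << "\n";

    munmap(addr, size);
    close(fd);
    shm_unlink(name);                      // remove the segment when done
    return 0;
}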
• This is a type of shared memory programming.
• Here, a single "heavy weight" process can have multiple "light weight",
concurrent execution paths.
• To understand this model, let us consider the execution of a main
program a.out , scheduled to run by the native operating system.
Thread Model
 a.out starts by loading and acquiring all of the necessary system and user resources
to run. This constitutes the "heavy weight" process.
 a.out performs some serial work, and then creates a number of tasks (threads) that
can be scheduled and run by the operating system concurrently.
 Each thread has local data, but also shares the entire resources of a.out ("light
weight") and benefits from a global memory view because it shares the memory
space of a.out.
 Synchronization is needed to ensure that no two threads update the same global
address at the same time.
• During implementation, thread implementations commonly comprise:
 A library of subroutines that are called from within parallel source code
 A set of compiler directives embedded in either serial or parallel source
code.
Note: often, the programmer is responsible for determining the parallelism.
• Unrelated standardization efforts have resulted in two very different
implementations of threads:
- POSIX Threads
* Specified by the IEEE POSIX 1003.1c standard (1995). C language only; part of Unix/Linux operating systems;
very explicit parallelism -- requires significant programmer attention to detail.
- OpenMP (used for the tutorials in the context of this course).
* Industry standard; compiler-directive based; portable / multi-platform, including Unix and Windows
platforms; available in C/C++ and Fortran implementations. Can be very easy and simple to use - provides for
"incremental parallelism": one can begin with serial code.
Others include: - Microsoft threads
- Java, Python threads
- CUDA threads for GPUs
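A minimal OpenMP sketch of the a.out picture described above: some serial work, then a team of threads that share the memory of the process while keeping some data private to each thread. Compile with an OpenMP-capable compiler (e.g. g++ -fopenmp); the shared variable is an arbitrary example.

#include <omp.h>
#include <iostream>

int main() {
    int shared_total = 0;   // lives in the shared address space of the "heavy weight" process

    // serial part of a.out, executed by the master thread only
    std::cout << "serial work\n";

    // a.out now creates a number of threads scheduled by the operating system
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();   // local (private) data of each thread

        // synchronization: only one thread at a time may update the shared variable
        #pragma omp critical
        {
            shared_total += tid;
            std::cout << "thread " << tid << " sees the shared memory of a.out\n";
        }
    }

    std::cout << "shared_total = " << shared_total << "\n";
    return 0;
}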
Thread Model (Cont…)
• In this model,
 A set of tasks use their own local memory during computation.
 Multiple tasks can reside on the same physical machine and/or across an arbitrary
number of machines.
 Tasks exchange data through communication (sending and receiving messages).
 Data transfer requires cooperative operations to be performed by each process
(for example, a send operation must have a matching receive operation).
During Implementation,
• The programmer is responsible for determining all parallelism
• Message passing implementations usually comprise a library of subroutines that
are embedded in the source code.
• MPI is the "de facto" industry standard for message passing (see the sketch below).
- The Message Passing Interface (MPI) specification is available at
http://www.mpi-forum.org/docs/.
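A minimal MPI sketch of the message passing model, showing the cooperative nature of a data transfer (an explicit send matched by an explicit receive); run with, for example, mpirun -np 2 ./a.out. The tag and the message value are arbitrary.

#include <mpi.h>
#include <iostream>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int tag = 0;
    if (rank == 0) {
        int value = 42;   // data in rank 0's local (private) memory
        MPI_Send(&value, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);      // explicit send ...
    } else if (rank == 1) {
        int value = 0;
        MPI_Recv(&value, 1, MPI_INT, 0, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                               // ... matched by a receive
        std::cout << "rank 1 received " << value << "\n";
    }

    MPI_Finalize();
    return 0;
}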
Distributed Memory / Message Passing Model
Can also be referred to as the Partitioned Global Address Space (PGAS) model.
Here,
 Address space is treated globally
 Most of the parallel work focuses on performing operations on a data set
typically organized into a common structure, such as an array or cube
 A set of tasks work collectively on the same data structure, however, each task
works on a different partition of the same data structure.
 Tasks perform the same operation on their partition of work, for example, "add 4
to every array element" (see the sketch below)
 Can be implemented on shared memory architectures (the data structure is accessed through
global memory) and on distributed memory architectures (the global data structure
can be logically and/or physically split across tasks).
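A sketch of the "add 4 to every array element" example, with each task working on its own partition of the same data structure. Here the tasks are OpenMP threads that compute their partition bounds explicitly; the array size is arbitrary.

#include <omp.h>
#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    const int n = 16;
    std::vector<int> data(n, 1);        // the common data structure

    #pragma omp parallel
    {
        int t        = omp_get_thread_num();
        int nthreads = omp_get_num_threads();

        // each task owns a contiguous partition of the same array
        int chunk = (n + nthreads - 1) / nthreads;
        int lo    = t * chunk;
        int hi    = std::min(n, lo + chunk);

        for (int i = lo; i < hi; ++i)
            data[i] += 4;               // same operation, different partition
    }

    for (int v : data) std::cout << v << ' ';   // prints sixteen 5s
    std::cout << '\n';
    return 0;
}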
Data Parallel Model
For the Implementation,
• Various popular, and sometimes developmental, parallel
programming implementations are based on the Data Parallel / PGAS model.
• - Coarray Fortran, compiler dependent
* further reading (https://en.wikipedia.org/wiki/Coarray_Fortran)
• - Unified Parallel C (UPC), extension to the C programming
language for SPMD parallel programming.
* further reading http://upc.lbl.gov/
- Global Arrays , shared memory style programming environment in the context of
distributed array data structures.
* Further reading on https://en.wikipedia.org/wiki/Global_Arrays
Data Parallel Model ( Cont…)
SPMD and MPMD are "high level" programming models that can be built on top of any other parallel programming model.

Single Program Multiple Data (SPMD)
- Why SINGLE PROGRAM? All tasks execute their copy of the same program (threads, message passing, data parallel or hybrid) simultaneously.
- Why MULTIPLE DATA? All tasks may use different data.
- "Intelligent" enough: tasks do not necessarily have to execute the entire program (see the sketch below).

Multiple Program Multiple Data (MPMD)
- Why MULTIPLE PROGRAM? Tasks may execute different programs (threads, message passing, data parallel or hybrid) simultaneously.
- Why MULTIPLE DATA? All tasks may use different data.
- Not as "intelligent" as SPMD, but may be better suited for certain types of problems (functional decomposition problems).
Single Program Multiple Data (SPMD) /
Multiple Program Multiple Data (MPMD)
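As a small illustration of the SPMD idea, a sketch in which every task runs the same program but branches on its own identity, so that no task has to execute the entire program; MPI ranks are used here as the task identity.

#include <mpi.h>
#include <iostream>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // single program: every task executes this same binary ...
    if (rank == 0) {
        std::cout << "task 0: coordination and I/O\n";                     // ... but each task
    } else {
        std::cout << "task " << rank << ": computing on its own data\n";   // only runs its branch
    }

    MPI_Finalize();
    return 0;
}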
Conclusion
• Parallel computer architectures contribute in achieving maximum performance within the limit
given by the technology.
• Diversity in parallel computer architecture makes the field challenging to learn and challenging to
present
• Classification can be based on the number of instructions that can be executed and how they
operate on data- Flynn (SISD,SIMD,MISD,MIMD)
• Also, classification can be based on how parallelism is achieved (Data parallel architectures,
Function-parallel architectures)
• Classification can as well focus on how processors communicate with the memory (Shared
memory computer, or tightly coupled system; Distributed memory computer, or loosely coupled system)
• There must be a way to appreciate the performance of the parallel architecture
• FLOPS and MIPS are units of measure for the numerical computing performance of a computer.
• Parallelism is made possible with the implementation of adequate parallel programming models.
• The simplest model appears to be the Shared Memory Programming Model.
• The SPMD and MPMD programming models require mastery of the previous programming models for
proper implementation.
• How do we then design a Parallel Program for effective parallelism?
See Next Chapter: Designing Parallel Programs and understanding notion of
Concurrency and Decomposition.
Challenge your understanding
1- What is the difference between a Parallel Computer and Parallel Computing?
2- What do you understand by True data dependency and Resource dependency?
3- Illustrate the notion of Vertical Waste and Horizontal Waste.
4- In your view, which of the design architectures can provide better performance? Use
performance metrics to justify your arguments.
5- What is Concurrent-Read, Concurrent-Write (CRCW) PRAM?
6-
On this figure, we have an illustration of bus-based interconnects (a) with no local caches and (b)
bus-based interconnects with local memory/caches.
Explain the difference, focusing on:
- The design architecture
- The operation
- The pros and cons
7- Discuss the HANDLER’S CLASSIFICATION of computer architectures compared to Flynn’s and other classifications.
Class Work Group and Presentation
• Purpose: Demonstrate Condition to detect eventual
Parallelism.
“Parallel computing requires that the segments to be executed
in parallel must be independent of each other. So, before
executing parallelism, all the conditions of parallelism between
the segments must be analyzed”.
Use Bernstein's Conditions for the Detection of Parallelism to demonstrate when
instructions i1, i2, …, in can be said to be "parallelizable".
REFERENCES
1. Xin Yuan, CIS4930/CDA5125: Parallel and Distributed Systems.
Retrieved from http://www.cs.fsu.edu/~xyuan/cda5125/index.html
2. EECC722 - Shaaban, #1 lec # 3, Fall 2000, 9-18-2000
3. Blaise Barney, Lawrence Livermore National Laboratory,
https://computing.llnl.gov/tutorials/parallel_comp/#ModelsOverview,
Last Modified: 11/02/2020 16:39:01
4. J. Blazewicz et al., Handbook on Parallel and Distributed
Processing, International Handbooks on Information Systems,
Springer, 2000
5. Phillip J. Windley, Parallel Architectures, Lesson 6, CS462, Large
Scale Distributed Systems, 2020
6. A. Grama et al., Introduction to Parallel Computing, Lecture 3
END.

Más contenido relacionado

La actualidad más candente

Clustering: Large Databases in data mining
Clustering: Large Databases in data miningClustering: Large Databases in data mining
Clustering: Large Databases in data miningZHAO Sam
 
Distributed database management system
Distributed database management  systemDistributed database management  system
Distributed database management systemPooja Dixit
 
Parallel computing chapter 3
Parallel computing chapter 3Parallel computing chapter 3
Parallel computing chapter 3Md. Mahedi Mahfuj
 
Unit 5 Advanced Computer Architecture
Unit 5 Advanced Computer ArchitectureUnit 5 Advanced Computer Architecture
Unit 5 Advanced Computer ArchitectureBalaji Vignesh
 
advanced computer architesture-conditions of parallelism
advanced computer architesture-conditions of parallelismadvanced computer architesture-conditions of parallelism
advanced computer architesture-conditions of parallelismPankaj Kumar Jain
 
Structure of the page table
Structure of the page tableStructure of the page table
Structure of the page tableduvvuru madhuri
 
Dichotomy of parallel computing platforms
Dichotomy of parallel computing platformsDichotomy of parallel computing platforms
Dichotomy of parallel computing platformsSyed Zaid Irshad
 
Os Swapping, Paging, Segmentation and Virtual Memory
Os Swapping, Paging, Segmentation and Virtual MemoryOs Swapping, Paging, Segmentation and Virtual Memory
Os Swapping, Paging, Segmentation and Virtual Memorysgpraju
 
04 cache memory.ppt 1
04 cache memory.ppt 104 cache memory.ppt 1
04 cache memory.ppt 1Anwal Mirza
 
Inter Process Communication
Inter Process CommunicationInter Process Communication
Inter Process CommunicationAdeel Rasheed
 
File organization 1
File organization 1File organization 1
File organization 1Rupali Rana
 
Parallel algorithms
Parallel algorithmsParallel algorithms
Parallel algorithmsguest084d20
 
Page replacement algorithms
Page replacement algorithmsPage replacement algorithms
Page replacement algorithmsPiyush Rochwani
 
Introduction to Distributed System
Introduction to Distributed SystemIntroduction to Distributed System
Introduction to Distributed SystemSunita Sahu
 

La actualidad más candente (20)

Parallel Processing Concepts
Parallel Processing Concepts Parallel Processing Concepts
Parallel Processing Concepts
 
Clustering: Large Databases in data mining
Clustering: Large Databases in data miningClustering: Large Databases in data mining
Clustering: Large Databases in data mining
 
Distributed database management system
Distributed database management  systemDistributed database management  system
Distributed database management system
 
Parallel computing chapter 3
Parallel computing chapter 3Parallel computing chapter 3
Parallel computing chapter 3
 
Unit 5 Advanced Computer Architecture
Unit 5 Advanced Computer ArchitectureUnit 5 Advanced Computer Architecture
Unit 5 Advanced Computer Architecture
 
Parallel Algorithms
Parallel AlgorithmsParallel Algorithms
Parallel Algorithms
 
advanced computer architesture-conditions of parallelism
advanced computer architesture-conditions of parallelismadvanced computer architesture-conditions of parallelism
advanced computer architesture-conditions of parallelism
 
Structure of the page table
Structure of the page tableStructure of the page table
Structure of the page table
 
Dichotomy of parallel computing platforms
Dichotomy of parallel computing platformsDichotomy of parallel computing platforms
Dichotomy of parallel computing platforms
 
6.distributed shared memory
6.distributed shared memory6.distributed shared memory
6.distributed shared memory
 
Os Swapping, Paging, Segmentation and Virtual Memory
Os Swapping, Paging, Segmentation and Virtual MemoryOs Swapping, Paging, Segmentation and Virtual Memory
Os Swapping, Paging, Segmentation and Virtual Memory
 
04 cache memory.ppt 1
04 cache memory.ppt 104 cache memory.ppt 1
04 cache memory.ppt 1
 
Inter Process Communication
Inter Process CommunicationInter Process Communication
Inter Process Communication
 
Course outline of parallel and distributed computing
Course outline of parallel and distributed computingCourse outline of parallel and distributed computing
Course outline of parallel and distributed computing
 
DBMS - RAID
DBMS - RAIDDBMS - RAID
DBMS - RAID
 
File organization 1
File organization 1File organization 1
File organization 1
 
Parallel algorithms
Parallel algorithmsParallel algorithms
Parallel algorithms
 
Page replacement algorithms
Page replacement algorithmsPage replacement algorithms
Page replacement algorithms
 
Introduction to Distributed System
Introduction to Distributed SystemIntroduction to Distributed System
Introduction to Distributed System
 
Open mp
Open mpOpen mp
Open mp
 

Similar a Chap 2 classification of parralel architecture and introduction to parllel program. models

Simulation of Heterogeneous Cloud Infrastructures
Simulation of Heterogeneous Cloud InfrastructuresSimulation of Heterogeneous Cloud Infrastructures
Simulation of Heterogeneous Cloud InfrastructuresCloudLightning
 
Parallel Computing-Part-1.pptx
Parallel Computing-Part-1.pptxParallel Computing-Part-1.pptx
Parallel Computing-Part-1.pptxkrnaween
 
Parallel and Distributed Computing chapter 1
Parallel and Distributed Computing chapter 1Parallel and Distributed Computing chapter 1
Parallel and Distributed Computing chapter 1AbdullahMunir32
 
SecondPresentationDesigning_Parallel_Programs.ppt
SecondPresentationDesigning_Parallel_Programs.pptSecondPresentationDesigning_Parallel_Programs.ppt
SecondPresentationDesigning_Parallel_Programs.pptRubenGabrielHernande
 
Lec 2 (parallel design and programming)
Lec 2 (parallel design and programming)Lec 2 (parallel design and programming)
Lec 2 (parallel design and programming)Sudarshan Mondal
 
Computing notes
Computing notesComputing notes
Computing notesthenraju24
 
CSA unit5.pptx
CSA unit5.pptxCSA unit5.pptx
CSA unit5.pptxAbcvDef
 
Data Parallel and Object Oriented Model
Data Parallel and Object Oriented ModelData Parallel and Object Oriented Model
Data Parallel and Object Oriented ModelNikhil Sharma
 
Introduction to Parallel Computing
Introduction to Parallel ComputingIntroduction to Parallel Computing
Introduction to Parallel ComputingAkhila Prabhakaran
 
introduction to cloud computing for college.pdf
introduction to cloud computing for college.pdfintroduction to cloud computing for college.pdf
introduction to cloud computing for college.pdfsnehan789
 
Week 1 lecture material cc
Week 1 lecture material ccWeek 1 lecture material cc
Week 1 lecture material ccAnkit Gupta
 
_Cloud_Computing_Overview.pdf
_Cloud_Computing_Overview.pdf_Cloud_Computing_Overview.pdf
_Cloud_Computing_Overview.pdfTyStrk
 
Week 1 Lecture_1-5 CC_watermark.pdf
Week 1 Lecture_1-5 CC_watermark.pdfWeek 1 Lecture_1-5 CC_watermark.pdf
Week 1 Lecture_1-5 CC_watermark.pdfJohn422973
 

Similar a Chap 2 classification of parralel architecture and introduction to parllel program. models (20)

Chap 1(one) general introduction
Chap 1(one)  general introductionChap 1(one)  general introduction
Chap 1(one) general introduction
 
Parallel processing
Parallel processingParallel processing
Parallel processing
 
CC unit 1.pptx
CC unit 1.pptxCC unit 1.pptx
CC unit 1.pptx
 
Simulation of Heterogeneous Cloud Infrastructures
Simulation of Heterogeneous Cloud InfrastructuresSimulation of Heterogeneous Cloud Infrastructures
Simulation of Heterogeneous Cloud Infrastructures
 
CCUnit1.pdf
CCUnit1.pdfCCUnit1.pdf
CCUnit1.pdf
 
Aca module 1
Aca module 1Aca module 1
Aca module 1
 
Parallel Computing-Part-1.pptx
Parallel Computing-Part-1.pptxParallel Computing-Part-1.pptx
Parallel Computing-Part-1.pptx
 
Parallel and Distributed Computing chapter 1
Parallel and Distributed Computing chapter 1Parallel and Distributed Computing chapter 1
Parallel and Distributed Computing chapter 1
 
SecondPresentationDesigning_Parallel_Programs.ppt
SecondPresentationDesigning_Parallel_Programs.pptSecondPresentationDesigning_Parallel_Programs.ppt
SecondPresentationDesigning_Parallel_Programs.ppt
 
Lec 2 (parallel design and programming)
Lec 2 (parallel design and programming)Lec 2 (parallel design and programming)
Lec 2 (parallel design and programming)
 
Computing notes
Computing notesComputing notes
Computing notes
 
Par com
Par comPar com
Par com
 
CSA unit5.pptx
CSA unit5.pptxCSA unit5.pptx
CSA unit5.pptx
 
Data Parallel and Object Oriented Model
Data Parallel and Object Oriented ModelData Parallel and Object Oriented Model
Data Parallel and Object Oriented Model
 
Introduction to Parallel Computing
Introduction to Parallel ComputingIntroduction to Parallel Computing
Introduction to Parallel Computing
 
introduction to cloud computing for college.pdf
introduction to cloud computing for college.pdfintroduction to cloud computing for college.pdf
introduction to cloud computing for college.pdf
 
Week 1 lecture material cc
Week 1 lecture material ccWeek 1 lecture material cc
Week 1 lecture material cc
 
_Cloud_Computing_Overview.pdf
_Cloud_Computing_Overview.pdf_Cloud_Computing_Overview.pdf
_Cloud_Computing_Overview.pdf
 
Week 1 Lecture_1-5 CC_watermark.pdf
Week 1 Lecture_1-5 CC_watermark.pdfWeek 1 Lecture_1-5 CC_watermark.pdf
Week 1 Lecture_1-5 CC_watermark.pdf
 
Pdc lecture1
Pdc lecture1Pdc lecture1
Pdc lecture1
 

Último

KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesPrabhanshu Chaturvedi
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdfankushspencer015
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 

Último (20)

KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and Properties
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 

Chap 2 classification of parralel architecture and introduction to parllel program. models

  • 1. UBa/NAHPI-2020 DepartmentofComputer Engineering PARALLEL AND DISTRIBUTED COMPUTING By Malobe LOTTIN Cyrille .M Network and Telecoms Engineer PhD Student- ICT–U USA/CAMEROON Contact Email:malobecyrille.marcel@ictuniversity.org Phone:243004411/695654002
  • 2. CHAPTER 2 Parallel and Distributed Computer Architectures, Performance Metrics And Parallel Programming Models Previous … Chap 1: General Introduction (Parallel and Distributed Computing)
  • 3. CONTENTS • INTRODUCTION • Why parallel Architecture ? • Modern Classification of Parallel Computers • Structural Classification of Parallel Computers • Parallel Computers Memory Architectures • Hardware Classification • Performance of Parallel Computers architectures - Peak and Sustained Performance • Measuring Performance of Parallel Computers • Other Common Benchmarks • Parallel Programming Models - Shared Memory Programming Model - Thread Model - Distributed Memory - Data Parallel - SPMD/MPMD • Conclusion Exercises ( Check your Progress, Further Reading and Evaluation)
  • 4. Previously on Chap 1  Part 1- Introducing Parallel and Distributed Computing • Background Review of Parallel and Distributed Computing • INTRODUCTION TO PARALLEL AND DISTRIBUTED COMPUTING • Some keys terminologies • Why parallel Computing? • Parallel Computing: the Facts • Basic Design Computer Architecture: the von Neumann Architecture • Classification of Parallel Computers (SISD,SIMD,MISD,MIMD) • Assignment 1a  Part 2- Initiation to Parallel Programming Principles • High Performance Computing (HPC) • Speed: a need to solve Complexity • Some Case Studies Showing the need of Parallel Computing • Challenge of explicit Parallelism • General Structure of Parallel Programs • Introduction to the Amdahl's LAW • The GUSTAFSON’s LAW • SCALIBILITY • Fixed Size Versus Scale Size • Assignment 1b • Conclusion
  • 5. INTRODUCTION • Parallel Computer Architecture is the method that consist of Maximizing and organizing computer resources to achieve Maximum performance. - Performance at any instance of time, is achievable within the limit given by the technology. - The same system may be characterized both as "parallel" and "distributed"; the processors in a typical distributed system run concurrently in parallel. • The use of more processors to compute tasks simultaneously contribute in providing more features to computers systems. • In the Parallel architecture, Processors during computation may have access to a shared memory to exchange information between them. • imagesSource:Wikipedia,DistributingComputing,2020
  • 6. • In a Distributed architecture, each processor during computation, make use of its own private memory (distributed memory). In this case, Information is exchanged by passing messages between the processors. • Significant characteristics of distributed systems are: concurrency of components, lack of a global clock (Clock synchronization) , and independent failure of components. • The use of distributed systems to solve computational problems is Called Distributed Computing (Divide problem into many tasks, each task is handle by one or more computers, which communicate with each other via message passing). • High-performance parallel computation operating shared-memory multiprocessor uses parallel algorithms while the coordination of a large-scale distributed system uses distributed algorithms. INTRODUCTION imagesSource:Wikipedia,DistributingComputing,2020
  • 7. • Parallelism is nowadays in all levels of computer architectures. • It is the Enhancements of Processors that justify the success in the development of Parallelism. • Today, they are superscalar (Execute several instructions in parallel each clock cycle). - besides, The advancement of the underlying Very Large-Scale Integration (VLSI )technology, which allows larger and larger numbers of components to fit on a chip and clock rates to increase. • Three main elements define structure and performance of Multiprocessor: - Processors - Memory Hierarchies (registers, cache, main memory, magnetic discs, magnetic tapes) - Interconnection Network • But, the gap of performance between the processor and the memory is still increasing …. • Parallelism is used by computer architecture to translate the raw potential of the technology into greater performance and expanded capability of the computer system • Diversity in parallel computer architecture makes the field challenging to learn and challenging to present. INTRODUCTION ( Cont…)
  • 8. Remember that: A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast. • The attempt to solve this large problems raises some fundamental questions which the answer can only by satisfy by understanding: - Various components of Parallel and Distributed systems( Design and operation), - How much problems a given Parallel and Distributed system can solve, - How processors corporate, communicate / transmit data between them, - The primitive abstractions that the hardware and software provide to the programmer for better control, - And, How to ensure a proper translation to performance once these elements are under control. INTRODUCTION (Cont…)
  • 9. Why Parallel Architecture ? • No matter the performance of a single processor at a given time, we can achieve in principle higher performance by utilizing many such processors so far we are ready to pay the price (Cost). Parallel Architecture is needed To:  Respond to Applications Trends • Advances in hardware capability enable new application functionality  drives parallel architecture harder, since parallel architecture focuses on the most demanding of these applications. • At the Low end level, we have the largest volume of machines and greatest number of users; at the High end, most demanding applications. • Consequence: pressure for increased performance  most demanding applications must be written as parallel programs to respond to this demand generated from the High end  Satisfy the need of High Computing in the field of computational science and engineering - A response to simulate physical phenomena impossible or very costly to observe through empirical means (modeling global climate change over long periods, the evolution of galaxies, the atomic structure of materials, etc…)
  • 10.  Respond to Technology Trends • Can’t “wait for the single processor to get fast enough ” Respond to Architectural Trends • Advances in technology determine what is possible; architecture translates the potential of the technology into performance and capability . • Four generation of Computer architectures (tubes, transistors, integrated circuits, and VLSI ) where strong distinction is function of the type of parallelism implemented ( Bit level parallelism  4-bits to 64 bits, 128 bits is the future). • There has been tremendous architectural advances over this period : Bit level parallelism, Instruction level Parallelism, Thread Level Parallelism All these forces driving the development of parallel architectures are resumed under one main quest: Achieve absolute maximum performance ( Supercomputing) Why Parallel Architecture ? (Cont …)
  • 11. Modernclassification Accordingto(Sima,Fountain,Kacsuk) Before modern classification, Recall Flynn’s taxonomy classification of Computers - based on the number of instructions that can be executed and how they operate on data. Four Main Type: • SISD: traditional sequential architecture • SIMD: processor arrays, vector processor • Parallel computing on a budget – reduced control unit cost • Many early supercomputers • MIMD: most general purpose parallel computer today • Clusters, MPP, data centers • MISD: not a general purpose architecture Note: Globally four type of parallelism are implemented: - Bit Level Parallelism: performance of processors based on word size ( bits) - Instruction Level Parallelism: give ability to processors to execute more than instruction per clock cycle - Task Parallelism: characterize Parallel programs - Superword Level Parallelism: Based on vectorization Techniques Computer Architectures SISD SIMD MIMD MISD
• 12. Modern Classification According to (Sima, Fountain, Kacsuk) • Classification here is based on how parallelism is achieved: • by operating on multiple data: data parallelism • by performing many functions in parallel: task (function) parallelism • control parallelism or task parallelism, depending on the level of the functional parallelism. (Diagram: Parallel architectures → Data-parallel architectures, Function-parallel architectures)
Function-parallel architectures: - Different operations are performed on the same or different data - Asynchronous computation - Speedup is lower, since each processor executes a different thread or process on the same or a different set of data - The amount of parallelization is proportional to the number of independent tasks to be performed - Load balancing depends on the availability of the hardware and on scheduling algorithms such as static and dynamic scheduling - Applicability: pipelining
Data-parallel architectures: - The same operations are performed on different subsets of the same data - Synchronous computation - Speedup is higher, since there is only one execution flow operating on all sets of data - The amount of parallelization is proportional to the input data size - Designed for optimum load balance on multiprocessor systems - Applicability: arrays, matrices
• 13. Structural Classification of Parallel Computers • Flynn's classification focuses on the behavioral aspect of computers. • Looking at the structure, parallel computers can also be classified by how processors communicate with memory: - When the processors of a multiprocessor communicate through global shared memory modules, the organization is called a shared memory computer or tightly coupled system. - When every processor in a multiprocessor system has its own local memory and the processors communicate via messages transmitted between their local memories, the organization is called a distributed memory computer or loosely coupled system.
• 14. Parallel Computer Memory Architectures - Shared Memory Parallel Computer architecture - Processors can access all memory as a global address space - Multiple processors can operate independently but share the same memory resources - Changes in a memory location effected by one processor are visible to all other processors. Based on memory access time, we can classify shared memory parallel computers into two categories: - Uniform Memory Access (UMA) - Non-Uniform Memory Access (NUMA)
• 15. Parallel Computer Memory Architectures (Cont…) - Uniform Memory Access (UMA) (also known as Cache Coherent UMA, CC-UMA) • Commonly represented today by Symmetric Multiprocessor (SMP) machines • Identical processors • Equal access and access times to memory. Note: cache coherence is a hardware mechanism whereby any update of a location in shared memory by one processor is announced to all the other processors. Source: images retrieved from https://computing.llnl.gov/tutorials/parallel_comp/#SharedMemory
• 16. Parallel Computer Memory Architectures (Cont…) - Non-Uniform Memory Access (NUMA) • The architecture often links two or more SMPs such that: - one SMP can directly access the memory of another SMP - not all processors have equal access time to all memories - memory access across the link is slower. Note: if cache coherence is implemented, the architecture is also called Cache Coherent NUMA (CC-NUMA). • The proximity of memory to CPUs on a shared memory parallel computer makes data sharing between tasks fast and uniform. • But there is a lack of scalability between memory and CPUs. Source: images retrieved from https://computing.llnl.gov/tutorials/parallel_comp/#SharedMemory; Bruce Jacob, ... David T. Wang, in Memory Systems, 2008
• 18. Parallel Computer Memory Architectures (Cont…) - Distributed Memory Parallel Computer Architecture • Comes in as many varieties as shared memory computers. • Requires a communication network to connect inter-processor memory. - Each processor operates independently with its own local memory. - Changes made by one processor do not affect the memory of the other processors. - Cache coherency does not apply here! • Access to data in another processor is usually the task of the programmer (who must explicitly define how and when data is communicated). • This architecture is cost-effective (it can use commodity, off-the-shelf processors and networking). • But the programmer carries greater responsibility for data communication between processors. Source: retrieved from https://www.futurelearn.com/courses/supercomputing/0/steps/24022
• 19. Parallel Computer Memory Architectures (Cont…) Overview of Parallel Memory Architecture. Note: - The largest and fastest computers in the world today employ both shared and distributed memory architectures (hybrid memory). - In a hybrid design, the shared memory component can be a shared memory machine and/or graphics processing units (GPUs). - The distributed memory component is the networking of multiple shared memory/GPU machines. - This type of memory architecture is expected to continue to prevail and increase. Source: Nikolaos Ploskas, Nikolaos Samaras, in GPU Programming in MATLAB, 2016
• 20. Hardware Classification • Parallel computers can be roughly classified according to the level at which the hardware in the parallel architecture supports parallelism.
- Multicore computing: the processor includes multiple processing units (called "cores") on the same chip, and can issue multiple instructions per clock cycle from multiple instruction streams. This differs from a superscalar processor, which issues multiple instructions per clock cycle from one instruction stream (thread); each core in a multicore processor can potentially be superscalar as well. Example: IBM's Cell microprocessor used in the Sony PlayStation 3.
- Symmetric multiprocessing (tightly coupled multiprocessing): a computer system with multiple identical processors that share memory and connect via a bus. Such systems usually do not comprise more than 32 processors, to minimize bus contention. Symmetric multiprocessors are extremely cost-effective.
Source: retrieved from https://en.wikipedia.org/wiki/Parallel_computing#Bit-level_parallelism, 2020
• 21. Hardware Classification (Cont…)
- Distributed computing (distributed memory multiprocessor): • Not to be confused with decentralized computing (the allocation of resources, hardware and software, to individual workstations). • Components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another. • Components interact to achieve a common goal. • Characterized by concurrency of components, lack of a global clock, and independent failure of components. • Can include heterogeneous computations where some nodes perform much more computation, some perform very little, and a few others perform specialized functionality. • Example: a multiplayer online game.
- Cluster computing: • Loosely coupled computers that work together closely; in some respects they can be regarded as a single computer. • Multiple standalone machines constitute a cluster and are connected by a network. • Computer clusters have each node set to perform the same task, controlled and scheduled by software. • Computer clustering relies on a centralized management approach that makes the nodes available as orchestrated shared servers. • Example: IBM's Sequoia.
Sources: Dinkar Sitaram, Geetha Manjunath, in Moving To The Cloud, 2012; Cisco Systems, 2003
• 23. Performance of parallel architectures - There are various ways to measure the performance of a parallel algorithm running on a parallel processor. - The most commonly used measurements are: - speed-up - efficiency / isoefficiency - elapsed time (a very important factor) - price/performance: the elapsed time for a program divided by the cost of the machine that ran the job. Note: none of these metrics should be used independently of the run time of the parallel system. - Common metrics of performance • FLOPS and MIPS are units of measure for the numerical computing performance of a computer. • Distributed computing uses the Internet to link personal computers to achieve more FLOPS. - MIPS: million instructions per second; MIPS = instruction count / (execution time x 10^6) - MFLOPS: million floating-point operations per second; MFLOPS = FP operations in program / (execution time x 10^6) • Which metric is better? • FLOP count is more closely related to the actual work of a numerical code; the number of FLOPs per program is determined, for example, by the matrix size. See Chapter 1.
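Worked example (hypothetical numbers, for illustration only): a program that executes 800 x 10^6 instructions in 4 seconds runs at 800 x 10^6 / (4 x 10^6) = 200 MIPS; if 200 x 10^6 of those instructions are floating-point operations, the same run achieves 200 x 10^6 / (4 x 10^6) = 50 MFLOPS.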
• 24. Performance of parallel architectures "In June 2020, Fugaku turned in a High Performance Linpack (HPL) result of 415.5 petaFLOPS, besting the now second-place Summit system by a factor of 2.8x. Fugaku is powered by Fujitsu's 48-core A64FX SoC, becoming the first number one system on the list to be powered by ARM processors. In single or further reduced precision, used in machine learning and AI applications, Fugaku's peak performance is over 1,000 petaflops (1 exaflops). The new system is installed at RIKEN Center for Computational Science (R-CCS) in Kobe, Japan" (Wikipedia, FLOPS, 2020). (Chart: growth of computing performance over time, annotated "Single CPU performance", "Here we are!", and "The future")
• 25. Peak and sustained performance - Peak performance • Measured in MFLOPS • The highest possible MFLOPS when the system does nothing but numerical computation • A rough hardware measure • Gives little indication of how the system will perform in practice. - Peak theoretical performance • Node performance in GFlops = (CPU speed in GHz) x (number of CPU cores) x (instructions per cycle, typically the number of floating-point operations per cycle) x (number of CPUs per node)
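Worked example (a hypothetical node, for illustration only): with 2 CPUs per node, 16 cores per CPU, a 2.5 GHz clock, and 8 floating-point operations per cycle, the peak node performance is 2.5 x 16 x 8 x 2 = 640 GFlops.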
  • 26. Peak and sustained performance • Sustained performance • The MFLOPS rate that a program achieves over the entire run. • Measuring sustained performance • Using benchmarks • Peak MFLOPS is usually much larger than sustained MFLOPS • Efficiency rate = sustained MFLOPS / peak MFLOPS
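Worked example (hypothetical numbers, continuing the node above): if an application sustains 80 GFlops on a node whose theoretical peak is 640 GFlops, the efficiency rate is 80 / 640 = 12.5%; as noted above, sustained performance is usually only a fraction of peak.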
• 27. Measuring the performance of parallel computers • Benchmarks: programs that are used to measure performance. • LINPACK benchmark: a measure of a system's floating-point computing power • It solves a dense N x N system of linear equations Ax = b • It is used to rank supercomputers in the TOP500 list. • No. 1 since June 2020: Fugaku, powered by Fujitsu's 48-core A64FX SoC, the first number-one system on the list to be powered by ARM processors.
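For reference, the High Performance Linpack benchmark counts approximately (2/3)N^3 + 2N^2 floating-point operations for an N x N system, so the reported FLOPS rate is that operation count divided by the measured run time.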
• 28. Other common benchmarks • Micro-benchmark suites - Numerical computing: LAPACK, ScaLAPACK - Memory bandwidth: STREAM • Kernel benchmarks - NPB (NAS Parallel Benchmarks) - PARKBENCH - SPEC - SPLASH
• 29. PARALLEL PROGRAMMING MODELS A programming perspective on how parallelism is implemented in parallel and distributed computer architectures
• 30. Parallel Programming Models - Parallel programming models exist as an abstraction above hardware and memory architectures. - Several parallel programming models are in common use: • Shared Memory (without threads) • Threads • Distributed Memory / Message Passing • Data Parallel • Hybrid • Single Program Multiple Data (SPMD) • Multiple Program Multiple Data (MPMD) - These models are NOT specific to a particular type of machine or memory architecture (a given model can be implemented on any underlying hardware). Example: a SHARED memory model on a DISTRIBUTED memory machine (machine memory is physically distributed across networked machines, but appears at the user level as a single shared global address space, as in the Kendall Square Research (KSR) ALLCACHE approach).
• 31. Parallel Programming Models - Which model to use? There is no "best" model. However, there are certainly better implementations of some models than others.
• 32. Shared Memory Programming Model (Without Threads) • A thread is the basic unit to which the operating system allocates processor time; it is the smallest sequence of programmed instructions that can be scheduled. • In a shared memory programming model: - processes/tasks share a common address space, which they read and write asynchronously; - mechanisms such as locks and semaphores are used to control access to the shared memory, resolve contention, and prevent race conditions and deadlocks. • This may be considered the simplest parallel programming model.
• 33. Shared Memory Programming Model (Cont…) • Note: locks, mutexes and semaphores are types of synchronization objects for shared resources; they are abstract concepts. - Lock: protects access to some kind of shared resource and, when owned, gives the right to access the protected shared resource. For example, with a lockable object ABC you may: acquire the lock on ABC, take the lock on ABC, lock ABC, take ownership of ABC, or relinquish ownership of ABC when it is no longer needed. - Mutex (MUTual EXclusion): a lockable object that can be owned by exactly one thread at a time. Example: in C++, std::mutex, std::timed_mutex, std::recursive_mutex. - Semaphore: a very relaxed type of lockable object, with a predefined maximum count and a current count.
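A minimal sketch of the lock idea, assuming POSIX threads (pthread_mutex_t); the shared counter, thread count and loop bound are illustrative only:

#include <pthread.h>
#include <stdio.h>

/* Hypothetical shared resource protected by a lock (mutex). */
static long counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&counter_lock);   /* acquire the lock (take ownership) */
        counter++;                           /* critical section: one thread at a time */
        pthread_mutex_unlock(&counter_lock); /* relinquish ownership */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);      /* 400000: the lock prevented a race condition */
    return 0;
}

Compile with, for example, gcc -pthread; without the lock, the four threads would race on the shared counter.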
• 34. Shared Memory Programming Model (Cont…)
Advantages: • There is no need to specify explicitly the communication of data between tasks, so no notion of data "ownership" has to be implemented: very advantageous for the programmer. • All processes see and have equal access to shared memory. • Program development can often be simplified.
Disadvantages: • It becomes more difficult to understand and manage data locality (keeping data local to a given process conserves memory accesses, cache refreshes and bus traffic). • Controlling data locality is hard to understand and may be beyond the control of the average user.
During implementation: • Case: stand-alone shared memory machines: the native operating systems, compilers and/or hardware provide support for shared memory programming, e.g. the POSIX standard provides an API for using shared memory. • Case: distributed memory machines: memory is physically distributed across a network of machines, but is made global through specialized hardware and software.
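A minimal sketch of shared memory programming without threads, assuming a POSIX system (shm_open/mmap/fork); the segment name "/demo_shm" and the single shared integer are illustrative, and error checking is omitted for brevity:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* Create a named shared memory object and map it into the address space. */
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(int));
    int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);

    *shared = 0;
    if (fork() == 0) {     /* child process writes into the common address space */
        *shared = 42;
        return 0;
    }
    wait(NULL);            /* parent waits for the child, then reads the shared value */
    printf("value written by child: %d\n", *shared);

    munmap(shared, sizeof(int));
    close(fd);
    shm_unlink("/demo_shm");
    return 0;
}

On Linux this typically compiles with gcc demo.c -lrt; here the "tasks" are whole processes, not threads, yet they still communicate through a common address space rather than by passing messages.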
• 35. Thread Model • This is a type of shared memory programming. • Here, a single "heavy weight" process can have multiple "light weight", concurrent execution paths. • To understand this model, consider the execution of a main program a.out, scheduled to run by the native operating system: - a.out starts by loading and acquiring all of the necessary system and user resources to run; this constitutes the "heavy weight" process. - a.out performs some serial work, and then creates a number of tasks (threads) that can be scheduled and run by the operating system concurrently. - Each thread has local data, but also shares the entire resources of a.out: it is "light weight" and benefits from a global memory view because it shares the memory space of a.out. - Synchronization and coordination are needed to ensure that no two threads update the same global address at the same time.
• 36. Thread Model (Cont…) • In practice, threads implementations commonly comprise: - a library of subroutines that are called from within parallel source code; - a set of compiler directives embedded in either serial or parallel source code. Note: often, the programmer is responsible for determining the parallelism. • Unrelated standardization efforts have resulted in two very different implementations of threads: - POSIX Threads: specified by the IEEE POSIX 1003.1c standard (1995); C language only; part of Unix/Linux operating systems; very explicit parallelism that requires significant programmer attention to detail. - OpenMP (used for the tutorials in this course): an industry standard; compiler-directive based; portable/multi-platform, including Unix and Windows; available in C/C++ and Fortran implementations; can be very easy and simple to use, providing "incremental parallelism" that can begin with serial code. Others include: Microsoft threads; Java and Python threads; CUDA threads for GPUs.
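A minimal OpenMP sketch of the thread model described above (compile with an OpenMP-capable compiler, e.g. gcc -fopenmp; the loop and its bound are illustrative only):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    double sum = 0.0;

    /* Serial part: the "heavy weight" process does some initial work alone. */
    printf("serial section, %d thread(s)\n", omp_get_num_threads());

    /* The compiler directive creates a team of "light weight" threads that share
       the process's memory; the reduction clause handles the synchronization
       needed on the shared variable sum. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000000; i++)
        sum += 1.0 / (i + 1);

    /* The threads have joined; execution is serial again. */
    printf("parallel result: %f\n", sum);
    return 0;
}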
• 37. Distributed Memory / Message Passing Model • In this model, a set of tasks use their own local memory during computation. Multiple tasks can reside on the same physical machine and/or across an arbitrary number of machines. Tasks exchange data through communication (sending and receiving messages), and data transfer requires cooperative operations by the processes involved. During implementation: • The programmer is responsible for determining all parallelism. • Message passing implementations usually comprise a library of subroutines that are embedded in source code. • MPI is the "de facto" industry standard for message passing: the Message Passing Interface (MPI) specification is available at http://www.mpi-forum.org/docs/.
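A minimal message-passing sketch using MPI (build and run with an MPI implementation, e.g. mpicc and mpirun -np 2; the tag and the message value are illustrative only):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* each task has its own rank and local memory */

    if (rank == 0) {
        value = 99;                            /* data lives in task 0's local memory ...   */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("task 1 received %d\n", value); /* ... and reaches task 1 only via an explicit message */
    }

    MPI_Finalize();
    return 0;
}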
• 38. Data Parallel Model • Can also be referred to as the Partitioned Global Address Space (PGAS) model. Here: - the address space is treated globally; - most of the parallel work focuses on performing operations on a data set, typically organized into a common structure such as an array or cube; - a set of tasks works collectively on the same data structure, but each task works on a different partition of that structure; - tasks perform the same operation on their partition of the work, for example, "add 4 to every array element"; - it can be implemented on shared memory architectures (the data structure is accessed through global memory) and on distributed memory architectures (the global data structure can be logically and/or physically split across tasks).
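A minimal sketch of the data-parallel idea ("add 4 to every array element"), using OpenMP on a shared memory machine purely as an illustration; a true PGAS language such as UPC or Coarray Fortran would express the same idea with a globally addressed, partitioned array (the array size and values are illustrative):

#include <stdio.h>

#define N 16

int main(void)
{
    int a[N];
    for (int i = 0; i < N; i++) a[i] = i;

    /* The same operation is applied to every element; the runtime gives each
       thread a different partition (chunk) of the same data structure. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] += 4;

    for (int i = 0; i < N; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}

Compile with gcc -fopenmp; without the flag the directive is ignored and the loop simply runs serially, which illustrates the "incremental parallelism" mentioned earlier.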
• 39. Data Parallel Model (Cont…) For the implementation: • Several popular, and sometimes still developmental, parallel programming implementations are based on the Data Parallel / PGAS model: - Coarray Fortran: compiler dependent (further reading: https://en.wikipedia.org/wiki/Coarray_Fortran) - Unified Parallel C (UPC): an extension of the C programming language for SPMD parallel programming (further reading: http://upc.lbl.gov/) - Global Arrays: a shared-memory-style programming environment in the context of distributed array data structures (further reading: https://en.wikipedia.org/wiki/Global_Arrays)
• 40. Single Program Multiple Data (SPMD) / Multiple Program Multiple Data (MPMD) • Both are "high level" programming models that can be built on top of any other parallel programming model.
Single Program Multiple Data (SPMD): - Why SINGLE program? All tasks execute their own copy of the same program (threads, message passing, data parallel or hybrid) simultaneously. - Why MULTIPLE data? All tasks may use different data. - Tasks are "intelligent enough" that they do not necessarily have to execute the entire program, only the parts they are designed to execute.
Multiple Program Multiple Data (MPMD): - Why MULTIPLE program? Tasks may execute different programs (threads, message passing, data parallel or hybrid) simultaneously. - Why MULTIPLE data? All tasks may use different data. - Not as "intelligent" as SPMD, but may be better suited for certain types of problems (functional decomposition problems).
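A minimal SPMD sketch built on MPI: every task runs a copy of the same program but branches on its rank, so tasks do not all execute the same parts (the coordinator/worker split is illustrative only):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Single program: every task executes this same file.         */
    /* Multiple data/behavior: each task branches on its own rank. */
    if (rank == 0)
        printf("task 0 of %d acts as coordinator\n", size);
    else
        printf("task %d of %d acts as worker\n", rank, size);

    MPI_Finalize();
    return 0;
}

An MPMD launch, by contrast, would start different executables for different tasks (for example mpirun -np 1 coordinator : -np 3 worker with many MPI implementations).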
• 41. Conclusion • Parallel computer architectures contribute to achieving maximum performance within the limits given by the technology. • Diversity in parallel computer architecture makes the field challenging to learn and challenging to present. • Classification can be based on the number of instruction and data streams and how they operate on data: Flynn's taxonomy (SISD, SIMD, MISD, MIMD). • Classification can also be based on how parallelism is achieved (data-parallel architectures, function-parallel architectures). • Classification can as well focus on how processors communicate with memory (shared memory computer or tightly coupled system, distributed memory computer or loosely coupled system). • There must be a way to appreciate the performance of the parallel architecture: FLOPS and MIPS are units of measure for the numerical computing performance of a computer. • Parallelism is made possible by the implementation of adequate parallel programming models. • The simplest model appears to be the shared memory programming model. • SPMD and MPMD programming require mastery of the previous programming models for proper implementation. • How do we then design a parallel program for effective parallelism? See the next chapter: Designing Parallel Programs and understanding the notions of Concurrency and Decomposition.
• 42. Challenge your understanding 1- What difference do you make between a parallel computer and parallel computing? 2- What do you understand by true data dependency and resource dependency? 3- Illustrate the notions of vertical waste and horizontal waste. 4- According to you, which of the design architectures can provide better performance? Use performance metrics to justify your arguments. 5- What is a concurrent-read, concurrent-write (CRCW) PRAM? 6- This figure illustrates (a) bus-based interconnects with no local caches and (b) bus-based interconnects with local memory/caches. Explain the difference, focusing on: - the design architecture - the operation - the pros and cons. 7- Discuss how Handler's classification of computer architectures compares to Flynn's and other classifications.
• 43. Class Work: Group and Presentation • Purpose: demonstrate the conditions for detecting potential parallelism. "Parallel computing requires that the segments to be executed in parallel must be independent of each other. So, before exploiting parallelism, all the conditions of parallelism between the segments must be analyzed." Use Bernstein's conditions for the detection of parallelism to demonstrate when instructions i1, i2, …, in can be said to be parallelizable.
• 44. REFERENCES 1. Xin Yuan, CIS4930/CDA5125: Parallel and Distributed Systems, retrieved from http://www.cs.fsu.edu/~xyuan/cda5125/index.html 2. EECC722 – Shaaban, lec #3, Fall 2000, 9-18-2000 3. Blaise Barney, Lawrence Livermore National Laboratory, https://computing.llnl.gov/tutorials/parallel_comp/#ModelsOverview, last modified 11/02/2020 4. J. Błażewicz et al., Handbook on Parallel and Distributed Processing, International Handbooks on Information Systems, Springer, 2000 5. Phillip J. Windley, Parallel Architectures, Lesson 6, CS462, Large Scale Distributed Systems, 2020 6. A. Grama et al., Introduction to Parallel Computing, Lecture 3
  • 45. END.