Parallel Architectures Overview
Krste Asanovic
krste@lcs.mit.edu
http://www.cag.lcs.mit.edu/6.893-f2000/
Parallelism is the “Ketchup* of Computer Architecture”
• Long latencies in off-chip memory and on-chip interconnect? Interleave parallel operations to hide the latencies.
• High power consumption? Use parallelism to increase throughput, then use voltage scaling to reduce energy per operation.
• The problem is finding and exploiting application parallelism.
• The biggest problem is software.
Apply parallelism to a hardware problem and it usually tastes better.
[*substitute condiment of choice…]
Little’s Law
[Figure: a stream of operations, each with a latency in cycles, completing at some throughput per cycle; the number of operations in flight is the parallelism]
Parallelism = Throughput × Latency
• To maintain a throughput of T operations/cycle when each operation has a latency of L cycles, T×L independent operations must be in flight (worked example below).
• For fixed parallelism:
– decreased latency allows increased throughput
– decreased throughput allows increased latency tolerance
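A quick worked example of the law above, with purely illustrative numbers (not from the slides): sustaining 4 memory operations per cycle across a 100-cycle memory latency requires 4 × 100 = 400 independent operations in flight.

    /* Little's Law sketch: independent operations needed to sustain
       a target throughput past a fixed latency. Numbers are assumed. */
    #include <stdio.h>

    int main(void) {
        double throughput = 4.0;   /* operations completed per cycle (assumed) */
        double latency = 100.0;    /* cycles per operation, e.g. a memory access */
        /* Parallelism = Throughput * Latency */
        printf("independent operations needed: %.0f\n", throughput * latency);
        return 0;
    }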
Types of Parallelism
[Figure: timelines contrasting pipelining, Instruction-Level Parallelism (ILP), Thread-Level Parallelism (TLP), and Data-Level Parallelism (DLP) as different ways of overlapping operations in time]
Dual Processor Pentium-III Desktop
• Pipelined
– minimum of 11 stages for any instruction
• Data-parallel instructions
– MMX (64-bit) and SSE (128-bit) extensions provide short-vector support (see the sketch below)
• Instruction-level parallel core
– can execute up to 3 x86 instructions per cycle
• Thread-level parallelism at the system level
– bus architecture supports shared-memory multiprocessing
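To make the short-vector support concrete, here is a minimal sketch using the standard x86 SSE intrinsics (illustrative code, not from the slides): one instruction performs four single-precision adds.

    /* Minimal SSE sketch: four single-precision adds per instruction. */
    #include <xmmintrin.h>

    void vec_add(const float *a, const float *b, float *c, int n) {
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);          /* load 4 floats */
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&c[i], _mm_add_ps(va, vb)); /* 4 adds at once */
        }
        for (; i < n; i++)                            /* scalar cleanup */
            c[i] = a[i] + b[i];
    }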
Translating Parallelism Types
[Figure: arrows showing that data-parallel, thread-parallel, and instruction-parallel forms of a computation can be translated into one another, with pipelining as an intermediary]
Speculation and Parallelism
• Speculation can increase effective parallelism by avoiding stalls on true dependencies
– branch prediction for control-flow speculation (sketch below)
– address prediction for memory access reordering
– value prediction to avoid operation latency
• Requires a mechanism to recover on misprediction
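A minimal sketch of control-flow speculation: the classic 2-bit saturating-counter branch predictor. The table size and PC indexing here are assumptions for illustration, not details from the slides.

    #include <stdint.h>

    static uint8_t counters[1024];                 /* 2-bit counters, values 0..3 */

    int predict_taken(uint32_t pc) {
        return counters[(pc >> 2) & 1023] >= 2;    /* 2,3 = predict taken */
    }

    void train(uint32_t pc, int taken) {
        uint8_t *c = &counters[(pc >> 2) & 1023];
        if (taken  && *c < 3) (*c)++;              /* saturate toward taken */
        if (!taken && *c > 0) (*c)--;              /* saturate toward not taken */
    }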
Issues in Parallel Machine Design
• Communication
– how do parallel operations communicate data results?
• Synchronization
– how are parallel operations coordinated?
• Resource management
– how are a large number of parallel tasks scheduled onto finite hardware?
• Scalability
– how large a machine can be built?
Flynn’s Classification (1966)
Broad classification of parallel computing systems based on the number of instruction and data streams:
• SISD: Single Instruction, Single Data
– conventional uniprocessor
• SIMD: Single Instruction, Multiple Data
– one instruction stream, multiple datapaths
– distributed-memory SIMD (MPP, DAP, CM-1&2, Maspar)
– shared-memory SIMD (STARAN, vector computers)
• MIMD: Multiple Instruction, Multiple Data
– message-passing machines (Transputers, nCube, CM-5)
– non-cache-coherent shared-memory machines (BBN Butterfly, T3D)
– cache-coherent shared-memory machines (Sequent, Sun Starfire, SGI Origin)
• MISD: Multiple Instruction, Single Data
– no commercial examples
SIMD Architecture
• Central controller broadcasts instructions to multiple processing elements (PEs)
[Figure: an array controller sends control to a row of PE + memory pairs linked by an inter-PE connection network; data flows between the PEs and their memories]
• Only requires one controller for the whole array
• Only requires storage for one copy of the program
• All computations are fully synchronized
SIMD Machines
• Illiac IV (1972)
– 64 64-bit PEs, 16 KB/PE, 2D network
• Goodyear STARAN (1972)
– 256 bit-serial associative PEs, 32 B/PE, multistage network
• ICL DAP (Distributed Array Processor) (1980)
– 4K bit-serial PEs, 512 B/PE, 2D network
• Goodyear MPP (Massively Parallel Processor) (1982)
– 16K bit-serial PEs, 128 B/PE, 2D network
• Thinking Machines Connection Machine CM-1 (1985)
– 64K bit-serial PEs, 512 B/PE, 2D + hypercube router
– CM-2: 2048 B/PE, plus 2,048 32-bit floating-point units
• Maspar MP-1 (1989)
– 16K 4-bit processors, 16-64 KB/PE, 2D + Xnet router
– MP-2: 16K 32-bit processors, 64 KB/PE
Vector Register Machine
[Figure: scalar registers r0-r15 alongside vector registers v0-v15, each vector register holding elements [0] through [VLRMAX-1]; a vector length register (VLR) selects how many elements participate. Vector arithmetic instructions (e.g., VADD v3, v1, v2) add the first VLR elements of v1 and v2 into v3; vector load/store instructions (e.g., VLD v1, r1, r2) move VLR elements between memory and v1 using a base address in r1 and a stride in r2]
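The semantics of the two instruction forms in the figure, written out as scalar C (a sketch: VLR is the current vector length, and the stride is counted in elements):

    /* VLD v1, r1, r2 : strided load of VLR elements from base r1, stride r2 */
    void vld(double *v1, const double *base, long stride, int vlr) {
        for (int i = 0; i < vlr; i++)
            v1[i] = base[i * stride];
    }

    /* VADD v3, v1, v2 : elementwise add of the first VLR elements */
    void vadd(double *v3, const double *v1, const double *v2, int vlr) {
        for (int i = 0; i < vlr; i++)
            v3[i] = v1[i] + v2[i];
    }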
Cray-1 (1976)
[Figure: Cray-1 datapath: single-port memory of 16 banks of 64-bit words plus 8-bit SECDED, sustaining 80 MW/sec data load/store and 320 MW/sec instruction buffer refill; four 64-bit×16 instruction buffers feeding the NIP/LIP/CIP instruction registers; eight 64-element vector registers V0-V7 with vector mask and vector length registers; scalar registers S0-S7 backed by 64 T registers; address registers A0-A7 backed by B registers; functional units for FP add, FP multiply, FP reciprocal, integer add, integer logic, integer shift, population count, address add, and address multiply]
Memory bank cycle: 50 ns; processor cycle: 12.5 ns (80 MHz)
Vector Instruction Execution
VADD C, A, B
[Figure: with one pipelined functional unit, results C[0], C[1], C[2], … complete one element per cycle; with four pipelined functional units, four elements complete per cycle, each unit handling every fourth element (unit 0 produces C[0], C[4], C[8], …; unit 1 produces C[1], C[5], C[9], …; and so on)]
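How a loop longer than a vector register maps onto such instructions: a strip-mining sketch, assuming 64-element vector registers as on the Cray-1 (the function and loop structure are illustrative, not from the slides).

    /* Strip-mined y[i] += a*x[i]: each outer iteration sets the vector
       length (at most 64) and does one vector instruction's worth of work. */
    void daxpy(long n, double a, const double *x, double *y) {
        for (long i = 0; i < n; i += 64) {
            long vlr = (n - i < 64) ? (n - i) : 64;  /* set VLR */
            for (long j = 0; j < vlr; j++)           /* one vector op */
                y[i + j] += a * x[i + j];
        }
    }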
Vector Unit Structure
[Figure: the vector unit split into parallel lanes; each lane contains a slice of the vector registers and one pipeline of each functional unit, with its own connection to the memory subsystem]
Sequential ISA Bottleneck
[Figure: sequential source code (e.g., a = foo(b); for (i=0; i<…)) passes through a superscalar compiler, which finds independent operations and schedules them, yet must emit sequential machine code; the superscalar processor then re-checks instruction dependencies and re-schedules execution at run time]
VLIW: Very Long Instruction Word
• Compiler schedules parallel execution
• Multiple parallel operations packed into one long instruction word
• Compiler must avoid data hazards (no interlocks)
[Figure: one instruction word with six slots: Int Op 1 and Int Op 2 (two integer units, single-cycle latency), Mem Op 1 and Mem Op 2 (two load/store units, three-cycle latency), FP Op 1 and FP Op 2 (two floating-point units, four-cycle latency)]
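One way to picture the six-slot word above is as a fixed-layout record; this is a hypothetical encoding for illustration only, not an actual VLIW format.

    #include <stdint.h>

    /* Hypothetical VLIW bundle: one of these issues every cycle, and the
       compiler must fill unused slots with NOPs (no hardware interlocks). */
    typedef struct {
        uint32_t int_op[2];   /* two integer units, 1-cycle latency    */
        uint32_t mem_op[2];   /* two load/store units, 3-cycle latency */
        uint32_t fp_op[2];    /* two FP units, 4-cycle latency         */
    } vliw_bundle;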
ILP Datapath Hardware Scaling
• Replicating functional units and cache/memory banks is straightforward and scales linearly
• Register file ports and bypass logic for N functional units scale quadratically (N×N) (sketch below)
• Memory interconnect among N functional units and N memory banks also scales quadratically
– (for large N, could try O(N log N) interconnect schemes)
• Technology scaling: wires are getting even slower relative to gate delays
• Complex interconnect adds latency as well as area
⇒ Need greater parallelism to hide latencies
[Figure: a register file feeding multiple functional units, connected through a memory interconnect to multiple cache/memory banks]
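A back-of-envelope sketch of the linear-vs-quadratic contrast above; this is pure arithmetic applied to the N×N claim, with no additional architectural data.

    #include <stdio.h>

    int main(void) {
        for (int n = 2; n <= 16; n *= 2)  /* N functional units */
            printf("N=%2d: ~%3d bypass paths (N*N), %2d banks (linear)\n",
                   n, n * n, n);
        return 0;
    }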
Clustered VLIW
• Divide the machine into clusters of local register files and local functional units
• Lower-bandwidth/higher-latency interconnect between clusters
• Software is responsible for mapping computations so as to minimize communication overhead
[Figure: each cluster pairs a local regfile with its functional units; a cluster interconnect links the clusters, and a memory interconnect connects them to multiple cache/memory banks]
MIMD Machines
• Message passing
– Thinking Machines CM-5
– Intel Paragon
– Meiko CS-2
– many cluster systems (e.g., IBM SP-2, Linux Beowulfs)
• Shared memory, no hardware cache coherence
– IBM RP3
– BBN Butterfly
– Cray T3D/T3E
– parallel vector supercomputers (Cray T90, NEC SX-5)
• Shared memory, hardware cache coherence
– many small-scale SMPs (e.g., quad Pentium Xeon systems)
– large-scale bus/crossbar-based SMPs (Sun Starfire)
– large-scale directory-based SMPs (SGI Origin)
Message-Passing MPPs (Massively Parallel Processors)
• Initial research projects
– Caltech Cosmic Cube (early 1980s) using custom Mosaic processors
• Commercial microprocessors including MPP support
– Transputer (1985)
– nCube-1 (1986) / nCube-2 (1990)
• Standard microprocessors + network interfaces
– Intel Paragon (i860)
– TMC CM-5 (SPARC)
– Meiko CS-2 (SPARC)
– IBM SP-2 (RS/6000)
• MPP vector supercomputers
– Fujitsu VPP series
[Figure: many identical nodes, each a microprocessor (µP) with local memory (Mem) and a network interface (NI), attached to an interconnect network]
Designs scale to 100s or 1000s of nodes.
Message-Passing MPP Problems
• All data layout must be handled by software
– cannot retrieve remote data except with a message request/reply
• Message passing has high software overhead
– early machines had to invoke the OS on each message (100 µs-1 ms per message)
– even user-level access to the network interface incurs dozens of cycles of overhead (the NI might be on the I/O bus)
– sending messages can be cheap (just like stores)
– receiving messages is expensive: must poll or take an interrupt
Shared-Memory Multiprocessors
• Will work with any data placement (though it might be slow)
– can choose to optimize only the critical portions of the code
• Load and store instructions are used to communicate data between processes
– no OS involvement
– low software overhead
• Usually some special synchronization primitives (see the sketch below)
– fetch&op
– load-linked/store-conditional
• In large-scale systems, the logically shared memory is implemented as physically distributed memory modules
• Two main categories
– non-cache-coherent
– hardware cache-coherent
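A minimal sketch of the fetch&op primitive's semantics in portable C11 atomics; the machines on these slides implement it in hardware, so this only illustrates what the operation does.

    #include <stdatomic.h>

    atomic_long counter = 0;              /* shared across threads */

    long fetch_and_add(long k) {
        /* atomically: old = counter; counter += k; return old */
        return atomic_fetch_add(&counter, k);
    }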
Cray T3E
• Up to 2048 600 MHz Alpha 21164 processors connected in a 3D torus
• Each node has 256 MB-2 GB of local DRAM
• Loads and stores access global memory over the network
• Only local memory is cached by the on-chip caches
• The Alpha microprocessor is surrounded by custom “shell” circuitry to make it an effective MPP node. The shell provides:
– multiple stream buffers instead of a board-level (L3) cache
– an external copy of the on-chip cache tags, checked against remote writes to local memory; a match generates on-chip invalidates
– 512 external E registers (an asynchronous vector load/store engine)
– address management to allow all external physical memory to be addressed
– atomic memory operations (fetch&op)
– support for hardware barrier/eureka networks to synchronize parallel tasks
Bus-Based Cache-Coherent SMPs
• Small-scale (≤4 processors) bus-based SMPs are by far the most common parallel processing platform today
• The bus provides a broadcast and serialization point for a simple snooping cache coherence protocol
• Modern microprocessors integrate support for this protocol
[Figure: four microprocessors (µP), each with a cache ($), sharing a single bus to a central memory]
Sun Starfire (UE10000)
• Up to 64-way SMP using a bus-based snooping protocol
• Uses 4 interleaved address buses to scale the snooping protocol
• Separate data transfer over a high-bandwidth 16×16 data crossbar
• 4 processors + memory module per system board
[Figure: multiple system boards, each holding four processors with caches and a memory module on a board interconnect, all connected through the 16×16 data crossbar]
SGI Origin 2000
• Large-scale distributed-directory SMP
• Scalable hypercube switching network supports up to 64 two-processor nodes (128 processors total); some installations reach 512 processors
• Scales from a 2-processor workstation to a 512-processor supercomputer
• Each node contains:
– two MIPS R10000 processors plus caches
– a memory module including the directory
– a connection to the global network
– a connection to I/O
Convergence in Parallel VLSI Architectures?
[Figure: a tiled chip: each tile holds a control unit, an address unit, a data unit, and SRAM/cache; tiles communicate over separate data and control networks and connect to bulk SRAM/embedded DRAM, off-chip DRAM, and I/O]
Portable Parallel Programming?
• Most large-scale commercial installations emphasize throughput
– database servers, web servers, file servers
– independent transactions
• Wide variety of parallel systems
– message passing
– shared memory
– shared memory within a node, message passing between nodes
⇒ Little commercial software support for portable parallel programming
• Message Passing Interface (MPI) standard is widely used for portability (see the sketch below)
– lowest common denominator
– “assembly language” level of parallel programming
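A minimal MPI sketch at the “assembly language” level described above: rank 0 sends one integer to rank 1 using the standard point-to-point calls (compile with mpicc, run with mpirun -np 2; the program itself is illustrative).

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value = 42;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            /* explicit send: the receiver must post a matching receive */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }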