Instruction Level Parallelism
and Superscalar Processors
What is Superscalar?
• Common instructions (arithmetic, load/store, conditional branch) can
be initiated and executed independently
• Equally applicable to RISC & CISC
• In practice usually RISC
Why Superscalar?
• Most operations are on scalar quantities (see RISC notes)
• Improve these operations to get an overall improvement
General Superscalar Organization
Superpipelined
• Many pipeline stages need less than half a clock cycle
• Double internal clock speed gets two tasks per external clock cycle
• Superscalar allows parallel fetch and execute
Superscalar vs. Superpipelined (Diagram)
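As a rough way to compare the approaches, the sketch below uses an idealized timing model that is not taken from the slides: a k-stage pipeline, N fully independent instructions, no stalls or branches. The function names are illustrative, and the superscalar formula assumes N is a multiple of the degree.

```python
from fractions import Fraction

def base_cycles(n_instr, k_stages):
    """Base scalar pipeline: the first instruction takes k cycles, then one completes per cycle."""
    return Fraction(k_stages + n_instr - 1)

def superscalar_cycles(n_instr, k_stages, degree):
    """Degree-m superscalar: m instructions enter the pipeline together each cycle."""
    return Fraction(k_stages) + Fraction(n_instr - degree, degree)

def superpipelined_cycles(n_instr, k_stages, degree):
    """Degree-m superpipeline: one instruction enters every 1/m of a base cycle."""
    return Fraction(k_stages) + Fraction(n_instr - 1, degree)

# Six instructions on a four-stage pipeline, degree 2, measured in base cycles
print(base_cycles(6, 4), superscalar_cycles(6, 4, 2), superpipelined_cycles(6, 4, 2))
```

Under these assumptions the superpipelined machine finishes slightly later than the superscalar one because each instruction stream starts half a cycle apart instead of together.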
Limitations
• Instruction level parallelism
• Compiler based optimisation
• Hardware techniques
• Limited by
• True data dependency
• Procedural dependency
• Resource conflicts
• Output dependency
• Antidependency
True Data Dependency
• ADD r1, r2 (r1 := r1+r2;)
• MOVE r3,r1 (r3 := r1;)
• Can fetch and decode second instruction in parallel with first
• Can NOT execute second instruction until first is finished
Procedural Dependency
• Cannot execute instructions after a branch in parallel with
instructions before a branch
• Also, if instruction length is not fixed, instructions have to be
decoded to find out how many fetches are needed
• This prevents simultaneous fetches
Resource Conflict
• Two or more instructions requiring access to the same resource at
the same time
• e.g. two arithmetic instructions
• Can duplicate resources
• e.g. have two arithmetic units
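A minimal sketch of why duplicating a resource removes the conflict, assuming independent single-cycle arithmetic operations and identical units. The function name and the single-cycle assumption are illustrative, not from the slides.

```python
import math

def cycles_for_arithmetic(n_ops, n_units):
    """Cycles needed to start n_ops independent single-cycle operations on n_units identical ALUs."""
    if n_units < 1:
        raise ValueError("need at least one functional unit")
    return math.ceil(n_ops / n_units)

# Two arithmetic instructions: serialized on one unit, parallel on two
print(cycles_for_arithmetic(2, 1), cycles_for_arithmetic(2, 2))
```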
Effect of Dependencies (Diagram)
Design Issues
• Instruction level parallelism
• Instructions in a sequence are independent
• Execution can be overlapped
• Governed by data and procedural dependency
• Machine Parallelism
• Ability to take advantage of instruction level parallelism
• Governed by number of parallel pipelines
Instruction Issue Policy
• Order in which instructions are fetched
• Order in which instructions are executed
• Order in which instructions change registers and memory
In-Order Issue In-Order Completion
• Issue instructions in the order they occur
• Not very efficient
• May fetch >1 instruction
• Instructions must stall if necessary
In-Order Issue In-Order Completion
(Diagram)
In-Order Issue Out-of-Order Completion
• Output dependency
• R3:= R3 + R5; (I1)
• R4:= R3 + 1; (I2)
• R3:= R5 + 1; (I3)
• I2 depends on result of I1 - true data dependency
• If I3 completes before I1, I1's later write overwrites I3's result and
leaves the wrong value in R3 - output (write-write) dependency
In-Order Issue Out-of-Order Completion
(Diagram)
Out-of-Order Issue Out-of-Order Completion
• Decouple decode pipeline from execution pipeline
• Can continue to fetch and decode until this pipeline is full
• When a functional unit becomes available an instruction can be
executed
• Since instructions have been decoded, processor can look ahead
Out-of-Order Issue Out-of-Order Completion
(Diagram)
Antidependency
• Read-write dependency (write after read)
• R3:=R3 + R5; (I1)
• R4:=R3 + 1; (I2)
• R3:=R5 + 1; (I3)
• R7:=R3 + R4; (I4)
• I3 cannot complete before I2 starts, as I2 needs the old value in R3 and I3
changes R3
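The three register dependencies (true data, output, anti) can be checked mechanically for a pair of instructions. The sketch below is an illustration only: the `Instr` tuple and `classify` function are invented names, and the check compares a single pair without accounting for writes by instructions in between.

```python
from collections import namedtuple

# dest: register written; srcs: registers read (illustrative representation)
Instr = namedtuple("Instr", ["name", "dest", "srcs"])

def classify(earlier, later):
    """Return the dependencies of `later` on `earlier`, where `earlier` comes first in program order."""
    kinds = set()
    if earlier.dest in later.srcs:
        kinds.add("true (RAW)")      # later reads what earlier writes
    if later.dest in earlier.srcs:
        kinds.add("anti (WAR)")      # later overwrites what earlier reads
    if later.dest == earlier.dest:
        kinds.add("output (WAW)")    # both write the same register
    return kinds

# The I1..I4 sequence from the slide (immediate operands omitted)
I1 = Instr("I1", "R3", ("R3", "R5"))
I2 = Instr("I2", "R4", ("R3",))
I3 = Instr("I3", "R3", ("R5",))
I4 = Instr("I4", "R7", ("R3", "R4"))

print(classify(I1, I2))  # true data dependency on R3
print(classify(I2, I3))  # antidependency on R3
print(classify(I1, I3))  # output dependency on R3, plus an antidependency since I1 also reads R3
```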
Reorder Buffer
• Temporary storage for results
• Commit to register file in program order
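A toy model of the idea, assuming each instruction writes one register. The class and method names are invented for this sketch and do not describe any particular processor's reorder buffer.

```python
from collections import deque

class ReorderBuffer:
    """Toy reorder buffer: results may arrive in any order but are committed in program order."""

    def __init__(self):
        self._entries = deque()   # entries kept in program order
        self.registers = {}       # architectural register file

    def allocate(self, tag, dest):
        """Reserve an entry at issue time, in program order."""
        self._entries.append({"tag": tag, "dest": dest, "value": None, "done": False})

    def complete(self, tag, value):
        """Record a result when an instruction finishes executing (any order)."""
        for entry in self._entries:
            if entry["tag"] == tag:
                entry["value"], entry["done"] = value, True
                return
        raise KeyError(tag)

    def commit(self):
        """Retire finished entries from the head only; return the tags committed."""
        committed = []
        while self._entries and self._entries[0]["done"]:
            entry = self._entries.popleft()
            self.registers[entry["dest"]] = entry["value"]
            committed.append(entry["tag"])
        return committed
```

For example, if I3 finishes before I1, `commit()` returns nothing until I1 is also complete, and the register file then receives I1's value followed by I3's.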
Register Renaming
• Output and antidependencies occur because register contents may
not reflect the correct ordering from the program
• May result in a pipeline stall
• Registers allocated dynamically
• i.e. registers are not specifically named
Register Renaming example
• R3b:=R3a + R5a (I1)
• R4b:=R3b + 1 (I2)
• R3c:=R5a + 1 (I3)
• R7b:=R3c + R4b (I4)
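• A register name without a suffix letter refers to the logical register in the instruction
• A name with a suffix letter (a, b, c) is the hardware register allocated
• Note R3a, R3b, R3c: three hardware registers for logical register R3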
• Alternative: Scoreboarding
• Bookkeeping technique
• Allow instruction execution whenever not dependent on previous
instructions and no structural hazards
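The renaming example above (I1 to I4) can be reproduced with a small mapping table. This is a sketch under simplifying assumptions: hardware registers are named by appending a letter to the logical name, the supply of letters is treated as unlimited for short programs, and immediate operands are left out. The function name `rename` is illustrative.

```python
import string

def rename(program, initial_regs):
    """Rename each destination to a fresh hardware register; sources use the latest mapping.

    program: list of (dest, [sources]) using logical register names.
    """
    version = {reg: 0 for reg in initial_regs}   # index of the current suffix letter per register

    def hw(reg):
        return reg + string.ascii_lowercase[version[reg]]

    renamed = []
    for dest, srcs in program:
        new_srcs = [hw(s) for s in srcs]             # read the current mappings first
        version[dest] = version.get(dest, -1) + 1    # then allocate a new register for the result
        renamed.append((hw(dest), new_srcs))
    return renamed

program = [
    ("R3", ["R3", "R5"]),   # I1
    ("R4", ["R3"]),         # I2
    ("R3", ["R5"]),         # I3
    ("R7", ["R3", "R4"]),   # I4
]
for dest, srcs in rename(program, ["R3", "R4", "R5", "R7"]):
    print(dest, "<-", srcs)
```

Because I3 writes R3c rather than R3b, it no longer conflicts with I1's write or I2's read, which is how renaming removes the output dependency and antidependency.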
Machine Parallelism
• Duplication of Resources
• Out of order issue
• Renaming
• Not worth duplicating functional units without register renaming
• Need instruction window large enough (more than 8)
Speedups of Machine Organizations
Without Procedural Dependencies
Branch Prediction
• 80486 fetches both next sequential instruction after branch and
branch target instruction
• Gives two cycle delay if branch taken
RISC - Delayed Branch
• Calculate result of branch before unusable instructions pre-fetched
• Always execute single instruction immediately following branch
• Keeps pipeline full while fetching new instruction stream
• Not as good for superscalar
• Multiple instructions need to execute in delay slot
• Instruction dependence problems
• Revert to branch prediction
Superscalar Execution
Superscalar Implementation
• Simultaneously fetch multiple instructions
• Logic to determine true dependencies involving register values
• Mechanisms to communicate these values
• Mechanisms to initiate multiple instructions in parallel
• Resources for parallel execution of multiple instructions
• Mechanisms for committing process state in correct order
Pentium 4
• 80486 - CISC
• Pentium – some superscalar components
• Two separate integer execution units
• Pentium Pro – Full blown superscalar
• Subsequent models refine & enhance superscalar design
Pentium 4 Block Diagram
Pentium 4 Operation
• Fetch instructions from memory in order of static program
• Translate instruction into one or more fixed length RISC
instructions (micro-operations)
• Execute micro-ops on superscalar pipeline
• micro-ops may be executed out of order
• Commit results of micro-ops to register set in original
program flow order
• Outer CISC shell with inner RISC core
• Inner RISC core pipeline at least 20 stages
• Some micro-ops require multiple execution stages
• Longer pipeline
• c.f. five stage pipeline on x86 up to Pentium
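To illustrate the translation step, the toy function below splits a two-operand instruction with a memory operand into simpler register-only operations. It is not the Pentium 4's actual micro-op format: the micro-op names (LOAD, STORE), the temporaries (T0, T1) and the bracket syntax for memory operands are all invented for this sketch.

```python
def to_micro_ops(op, dest, src):
    """Split a toy two-operand instruction into fixed-format micro-ops.

    Operands written as "[addr]" are memory operands; anything else is a register.
    """
    def is_mem(operand):
        return operand.startswith("[") and operand.endswith("]")

    uops = []
    a, b = dest, src
    if is_mem(src):
        uops.append(("LOAD", "T1", src))     # bring the source operand into a temporary
        b = "T1"
    if is_mem(dest):
        uops.append(("LOAD", "T0", dest))    # read-modify-write: load the old value first
        a = "T0"
    uops.append((op, a, b))                  # the operation itself uses registers only
    if is_mem(dest):
        uops.append(("STORE", dest, "T0"))   # write the result back to memory
    return uops

print(to_micro_ops("ADD", "EAX", "EBX"))
print(to_micro_ops("ADD", "[0x1000]", "EAX"))
```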
Pentium 4 Pipeline
Pentium 4 Pipeline Operation (1) to (6) (Diagrams)
ARM CORTEX-A8
• ARM refers to Cortex-A8 as an application processor
• Embedded processor running complex operating system
• Wireless, consumer and imaging applications
• Mobile phones, set-top boxes, gaming consoles, automotive
navigation/entertainment systems
• Three functional units
• Dual, in-order-issue, 13-stage pipeline
• Keep power required to a minimum
• Out-of-order issue needs extra logic consuming extra power
• Figure 14.11 shows the details of the main Cortex-A8
pipeline
• Separate SIMD (single-instruction-multiple-data) unit
• 10-stage pipeline
ARM Cortex-A8 Block Diagram
Instruction Fetch Unit
• Predicts instruction stream
• Fetches instructions from the L1 instruction cache
• Up to four instructions per cycle
• Into buffer for decode pipeline
• Fetch unit includes L1 instruction cache
• Speculative instruction fetches
• Branch or exceptional instruction causes pipeline flush
• Stages:
• F0 address generation unit generates virtual address
• Normally the next sequential address
• Can also be branch target address
• F1 Used to fetch instructions from L1 instruction cache
• In parallel fetch address used to access branch prediction arrays
• F3 Instruction data are placed in instruction queue
• If a branch is predicted, new target address is sent to address generation unit
• Two-level global history branch predictor
• Branch Target Buffer (BTB) and Global History Buffer (GHB)
• Return stack to predict subroutine return addresses
• Can fetch and queue up to 12 instructions
• Issues instructions two at a time
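The two-level global history idea can be sketched as a history register that selects a 2-bit saturating counter. This is a generic textbook-style model, not the Cortex-A8 design: it omits the BTB, ignores the branch address, and the class name, table size and initial counter value are assumptions made for illustration.

```python
class GlobalHistoryPredictor:
    """Toy two-level predictor: a global history register selects a 2-bit saturating counter."""

    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.history = 0                             # last N outcomes, newest in bit 0
        self.counters = [1] * (1 << history_bits)    # values 0..3; start weakly not-taken

    def predict(self):
        """Return True to predict taken."""
        return self.counters[self.history] >= 2

    def update(self, taken):
        """Train the selected counter, then shift the outcome into the history."""
        count = self.counters[self.history]
        self.counters[self.history] = min(3, count + 1) if taken else max(0, count - 1)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask

# An alternating taken/not-taken branch becomes predictable once the history has warmed up
predictor = GlobalHistoryPredictor(history_bits=2)
for outcome in [True, False] * 5:
    guess = predictor.predict()
    predictor.update(outcome)
```

A single 2-bit counter without history would keep mispredicting an alternating branch; selecting the counter by recent history is what lets the pattern be learned.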
Instruction Decode Unit
• Decodes and sequences all instructions
• Dual pipeline structure, pipe0 and pipe1
• Two instructions can progress at a time
• Pipe0 contains older instruction in program order
• If instruction in pipe0 cannot issue, pipe1 will not issue
• Instructions progress in order
• Results written back to register file at end of execution
pipeline
• Prevents WAR hazards
• Keeps tracking of WAW hazards and recovery from flush conditions
straightforward
• Main concern of decode pipeline is prevention of RAW hazards
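The pairing rule described above can be sketched as follows. This is a deliberately simplified model: it checks only register conflicts between the two candidate instructions, treats a shared destination as blocking (an assumption), and ignores latencies, the scoreboard and structural limits. The function name is illustrative.

```python
def dual_issue_schedule(program):
    """Group instructions into issue cycles for a toy dual in-order pipeline.

    program: list of (dest, [sources]) in program order.
    pipe0 takes the older instruction; the next instruction joins it in pipe1
    only if it neither reads nor writes pipe0's destination register.
    Returns a list of cycles, each a list of instruction indices.
    """
    cycles = []
    i = 0
    while i < len(program):
        group = [i]                                   # older instruction goes to pipe0
        if i + 1 < len(program):
            dest0, _ = program[i]
            dest1, srcs1 = program[i + 1]
            if dest0 not in srcs1 and dest0 != dest1:
                group.append(i + 1)                   # younger instruction issues in pipe1
        cycles.append(group)
        i += len(group)                               # never skip ahead: issue stays in order
    return cycles

program = [
    ("R3", ["R3", "R5"]),   # I1
    ("R4", ["R3"]),         # I2 reads I1's result, so it cannot pair with I1
    ("R3", ["R5"]),         # I3 can pair with I2
    ("R7", ["R3", "R4"]),   # I4
]
print(dual_issue_schedule(program))
```

I2 and I3 may issue together here even though I3 overwrites a register that I2 reads, since in this pipeline registers are read early and written only at the end of execution.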
Instruction Processing Stages
• D0 Thumb instructions decompressed and preliminary decode is
performed
• D1 Instruction decode is completed
• D2 Write instruction to and read instructions from pending/replay
queue
• D3 Contains the instruction scheduling logic
• Scoreboard predicts register availability using static scheduling
• Hazard checking
• D4 Final decode for control signals for integer execute and load/store
units
Integer Execution Unit
• Two symmetric (ALU) pipelines, an address generator for load and
store instructions, and multiply pipeline
• Pipeline stages:
• E0 Access register file
• Up to six registers for two instructions
• E1 Barrel shifter if needed.
• E2 ALU function
• E3 If needed, completes saturation arithmetic
• E4 Change in control flow prioritized and processed
• E5 Results written back to register file
• Multiply unit instructions routed to pipe0
• Performed in stages E1 through E3
• Multiply accumulate operation in E4
Load/store pipeline
• Parallel to integer pipeline
• E1 Memory address generated from base and index register
• E2 address applied to cache arrays
• E3 load, data returned and formatted
• E3 store, data are formatted and ready to be written to cache
• E4 Updates L2 cache, if required
• E5 Results are written to register file
ARM Cortex-A8 Integer Pipeline
SIMD and Floating-Point Pipeline
• SIMD and floating-point instructions pass through integer pipeline
• Processed in separate 10-stage pipeline
• NEON unit
• Handles packed SIMD instructions
• Provides two types of floating-point support
• If implemented, vector floating-point (VFP) coprocessor performs
IEEE 754 floating-point operations
• If not, separate multiply and add pipelines implement floating-point
operations
ARM Cortex-A8 NEON & Floating Point Pipeline

Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...
 

Instruction Level Parallelism and Superscalar Processors

  • 1. Instruction Level Parallelism and Superscalar Processors
  • 2. What is Superscalar? • Common instructions (arithmetic, load/store, conditional branch) can be initiated and executed independently • Equally applicable to RISC & CISC • In practice usually RISC
  • 3. Why Superscalar? • Most operations are on scalar quantities (see RISC notes) • Improve these operations to get an overall improvement
  • 5. Superpipelined • Many pipeline stages need less than half a clock cycle • Double internal clock speed gets two tasks per external clock cycle • Superscalar allows parallel fetch execute
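As a rough illustration of the comparison on the slide above, the following sketch counts the cycles needed to run n instructions through an idealized k-stage pipeline in three organizations. It assumes no dependencies or stalls of any kind, and the function names and the choice of degree 2 are illustrative only, not taken from the slides.

```python
from math import ceil


def base_cycles(n, k):
    """Scalar k-stage pipeline: one instruction enters per cycle."""
    return k + n - 1


def superpipelined_cycles(n, k, degree=2):
    """Each stage is split into `degree` substages, so a new instruction
    enters every 1/degree of a base clock cycle."""
    return k + (n - 1) / degree


def superscalar_cycles(n, k, degree=2):
    """`degree` instructions enter the pipeline together each cycle."""
    return k + ceil(n / degree) - 1


if __name__ == "__main__":
    n, k = 6, 4  # six instructions, four-stage pipeline (assumed values)
    print(base_cycles(n, k))            # 9
    print(superpipelined_cycles(n, k))  # 6.5
    print(superscalar_cycles(n, k))     # 6
```

Under these idealized assumptions both approaches finish well ahead of the base pipeline, with the superpipelined machine trailing the superscalar one slightly because each instruction starts a fraction of a cycle later.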
  • 7. Limitations • Instruction level parallelism • Compiler based optimisation • Hardware techniques • Limited by • True data dependency • Procedural dependency • Resource conflicts • Output dependency • Antidependency
  • 8. True Data Dependency • ADD r1, r2 (r1 := r1+r2;) • MOVE r3,r1 (r3 := r1;) • Can fetch and decode second instruction in parallel with first • Can NOT execute second instruction until first is finished
  • 9. Procedural Dependency • Can not execute instructions after a branch in parallel with instructions before a branch • Also, if instruction length is not fixed, instructions have to be decoded to find out how many fetches are needed • This prevents simultaneous fetches
  • 10. Resource Conflict • Two or more instructions requiring access to the same resource at the same time • e.g. two arithmetic instructions • Can duplicate resources • e.g. have two arithmetic units
  • 12. Design Issues • Instruction level parallelism • Instructions in a sequence are independent • Execution can be overlapped • Governed by data and procedural dependency • Machine Parallelism • Ability to take advantage of instruction level parallelism • Governed by number of parallel pipelines
  • 13. Instruction Issue Policy • Order in which instructions are fetched • Order in which instructions are executed • Order in which instructions change registers and memory
  • 14. In-Order Issue In-Order Completion • Issue instructions in the order they occur • Not very efficient • May fetch >1 instruction • Instructions must stall if necessary
  • 15. In-Order Issue In-Order Completion (Diagram)
  • 16. In-Order Issue Out-of-Order Completion • Output dependency • R3:= R3 + R5; (I1) • R4:= R3 + 1; (I2) • R3:= R5 + 1; (I3) • I2 depends on result of I1 - data dependency • If I3 completes before I1, I1's later write leaves the wrong value in R3 - output (write-write) dependency
  • 17. In-Order Issue Out-of-Order Completion (Diagram)
  • 18. Out-of-Order Issue Out-of-Order Completion • Decouple decode pipeline from execution pipeline • Can continue to fetch and decode until this pipeline is full • When a functional unit becomes available an instruction can be executed • Since instructions have been decoded, processor can look ahead
  • 20. Antidependency • Read-write dependency • R3:=R3 + R5; (I1) • R4:=R3 + 1; (I2) • R3:=R5 + 1; (I3) • R7:=R3 + R4; (I4) • I3 cannot complete before I2 starts as I2 needs a value in R3 and I3 changes R3
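The three register dependency types can be found mechanically by tracking, for each register, the last instruction that wrote it and the instructions that have read it since. The sketch below is one possible way to do this for the four-instruction example above; the `Instr` type and `find_dependencies` name are hypothetical, and immediate operands such as the constant 1 are simply left out of the source list.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    name: str
    dest: str     # register written
    srcs: tuple   # registers read (immediates omitted)


def find_dependencies(program):
    """Return (earlier, later, kind) tuples for a straight-line program."""
    deps = []
    last_writer = {}  # register -> most recent instruction that wrote it
    readers = {}      # register -> instructions that read it since that write
    for ins in program:
        srcs = list(dict.fromkeys(ins.srcs))  # drop duplicate sources
        for reg in srcs:
            if reg in last_writer:            # read after write
                deps.append((last_writer[reg], ins.name, "true"))
        for reader in readers.get(ins.dest, []):  # write after read
            deps.append((reader, ins.name, "anti"))
        if ins.dest in last_writer:           # write after write
            deps.append((last_writer[ins.dest], ins.name, "output"))
        for reg in srcs:
            readers.setdefault(reg, []).append(ins.name)
        last_writer[ins.dest] = ins.name
        readers[ins.dest] = []                # new value, no readers yet
    return deps


program = [
    Instr("I1", "R3", ("R3", "R5")),  # R3 := R3 + R5
    Instr("I2", "R4", ("R3",)),       # R4 := R3 + 1
    Instr("I3", "R3", ("R5",)),       # R3 := R5 + 1
    Instr("I4", "R7", ("R3", "R4")),  # R7 := R3 + R4
]

if __name__ == "__main__":
    for dep in find_dependencies(program):
        print(dep)
```

For this program the sketch reports the true dependency of I2 on I1, the antidependency between I2 and I3, the output dependency between I1 and I3, and the true dependencies of I4 on I3 and I2.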
  • 21. Reorder Buffer • Temporary storage for results • Commit to register file in program order
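A minimal sketch of the reorder buffer idea follows, assuming a simple queue of entries held in program order. Results may arrive in any order, but only a finished entry at the head of the queue is written to the register file. The class and method names are invented for illustration and do not describe any particular processor.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # program order, oldest entry on the left
        self.regfile = {}       # architectural register file

    def issue(self, tag, dest):
        """Allocate an entry when the instruction is issued."""
        self.entries.append(
            {"tag": tag, "dest": dest, "value": None, "done": False}
        )

    def complete(self, tag, value):
        """Record a result; completion may happen out of order."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"] = value
                entry["done"] = True
                return
        raise KeyError(tag)

    def commit(self):
        """Retire finished entries from the head only, in program order."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.regfile[entry["dest"]] = entry["value"]
            committed.append(entry["tag"])
        return committed


if __name__ == "__main__":
    rob = ReorderBuffer()
    rob.issue("I1", "R3")
    rob.issue("I2", "R4")
    rob.issue("I3", "R3")
    rob.complete("I3", 6)   # I3 finishes first
    print(rob.commit())     # [] because I1 is still pending
    rob.complete("I1", 8)
    print(rob.commit())     # ['I1']
    rob.complete("I2", 9)
    print(rob.commit())     # ['I2', 'I3']
    print(rob.regfile["R3"])  # 6, the value from the later write
```

Because I3's result waits in the buffer until I1 and I2 have retired, R3 ends with the value that program order requires even though I3 completed first.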
  • 22. Register Renaming • Output and antidependencies occur because register contents may not reflect the correct ordering from the program • May result in a pipeline stall • Registers allocated dynamically • i.e. registers are not specifically named
  • 23. Register Renaming example • R3b:=R3a + R5a (I1) • R4b:=R3b + 1 (I2) • R3c:=R5a + 1 (I3) • R7b:=R3c + R4b (I4) • Without subscript refers to logical register in instruction • With subscript is hardware register allocated • Note R3a R3b R3c • Alternative: Scoreboarding • Bookkeeping technique • Allow instruction execution whenever not dependent on previous instructions and no structural hazards
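The renaming in the example above can be reproduced with a small sketch in which every logical register starts at version "a" and each write allocates the next letter. This lettering scheme just mirrors the slide's notation; real hardware would typically draw physical registers from a free pool, and the `rename` function and tuple layout here are assumptions made for illustration.

```python
def rename(program):
    """Rename (dest, srcs) pairs; operands not starting with 'R' are immediates."""
    version = {}  # logical register -> index of its current version (0 = 'a')

    def current(reg):
        return reg + chr(ord("a") + version.setdefault(reg, 0))

    renamed = []
    for dest, srcs in program:
        # Sources are renamed first so they refer to the versions that
        # were live before this instruction's own write.
        new_srcs = [current(s) if s.startswith("R") else s for s in srcs]
        version[dest] = version.get(dest, 0) + 1  # allocate a fresh register
        renamed.append((current(dest), new_srcs))
    return renamed


program = [
    ("R3", ["R3", "R5"]),  # I1
    ("R4", ["R3", "1"]),   # I2
    ("R3", ["R5", "1"]),   # I3
    ("R7", ["R3", "R4"]),  # I4
]

if __name__ == "__main__":
    for dest, srcs in rename(program):
        print(f"{dest} := {' + '.join(srcs)}")
    # R3b := R3a + R5a
    # R4b := R3b + 1
    # R3c := R5a + 1
    # R7b := R3c + R4b
```

After renaming, I3 writes R3c rather than the R3b that I2 reads, so the antidependency and output dependency disappear and only the true dependencies remain.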
  • 24. Machine Parallelism • Duplication of Resources • Out of order issue • Renaming • Not worth duplicating functions without register renaming • Need instruction window large enough (more than 8)
  • 25. Speedups of Machine Organizations Without Procedural Dependencies
  • 26. Branch Prediction • 80486 fetches both next sequential instruction after branch and branch target instruction • Gives two cycle delay if branch taken
  • 27. RISC - Delayed Branch • Calculate result of branch before unusable instructions pre-fetched • Always execute single instruction immediately following branch • Keeps pipeline full while fetching new instruction stream • Not as good for superscalar • Multiple instructions need to execute in delay slot • Instruction dependence problems • Revert to branch prediction
  • 29. Superscalar Implementation • Simultaneously fetch multiple instructions • Logic to determine true dependencies involving register values • Mechanisms to communicate these values • Mechanisms to initiate multiple instructions in parallel • Resources for parallel execution of multiple instructions • Mechanisms for committing process state in correct order
  • 30. Pentium 4 • 80486 - CISC • Pentium – some superscalar components • Two separate integer execution units • Pentium Pro – Full blown superscalar • Subsequent models refine & enhance superscalar design
  • 31. Pentium 4 Block Diagram
  • 32. Pentium 4 Operation • Fetch instructions from memory in order of static program • Translate instruction into one or more fixed length RISC instructions (micro-operations) • Execute micro-ops on superscalar pipeline • micro-ops may be executed out of order • Commit results of micro-ops to register set in original program flow order • Outer CISC shell with inner RISC core • Inner RISC core pipeline at least 20 stages • Some micro-ops require multiple execution stages • Longer pipeline • cf. five stage pipeline on x86 up to Pentium
  • 34. Pentium 4 Pipeline Operation (1)
  • 35. Pentium 4 Pipeline Operation (2)
  • 36. Pentium 4 Pipeline Operation (3)
  • 37. Pentium 4 Pipeline Operation (4)
  • 38. Pentium 4 Pipeline Operation (5)
  • 39. Pentium 4 Pipeline Operation (6)
  • 40. ARM CORTEX-A8 • ARM refers to the Cortex-A8 as an application processor • Embedded processor running complex operating system • Wireless, consumer and imaging applications • Mobile phones, set-top boxes, gaming consoles, automotive navigation/entertainment systems • Three functional units • Dual, in-order-issue, 13-stage pipeline • Keep power required to a minimum • Out-of-order issue needs extra logic consuming extra power • Figure 14.11 shows the details of the main Cortex-A8 pipeline • Separate SIMD (single-instruction-multiple-data) unit • 10-stage pipeline
  • 42. Instruction Fetch Unit • Predicts instruction stream • Fetches instructions from the L1 instruction cache • Up to four instructions per cycle • Into buffer for decode pipeline • Fetch unit includes L1 instruction cache • Speculative instruction fetches • Branch or exceptional instruction cause pipeline flush • Stages: • F0 address generation unit generates virtual address • Normally next sequentially • Can also be branch target address • F1 Used to fetch instructions from L1 instruction cache • In parallel fetch address used to access branch prediction arrays • F3 Instruction data are placed in instruction queue • If branch prediction, new target address sent to address generation unit • Two-level global history branch predictor • Branch Target Buffer (BTB) and Global History Buffer (GHB) • Return stack to predict subroutine return addresses • Can fetch and queue up to 12 instructions • Issues instructions two at a time
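To give a feel for what a two-level global history predictor does, here is a generic sketch: a shift register of recent branch outcomes is combined with the branch address to select a 2-bit saturating counter. The table size, the XOR indexing and all names are assumptions chosen for brevity; the slides do not give the Cortex-A8's actual GHB organization, and the BTB and return stack are not modelled.

```python
class GlobalHistoryPredictor:
    def __init__(self, history_bits=4):
        self.mask = (1 << history_bits) - 1
        self.history = 0                            # recent outcomes, newest in bit 0
        self.counters = [1] * (1 << history_bits)   # 2-bit counters, weakly not-taken

    def _index(self, pc):
        # Assumed indexing: XOR of address bits and global history
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        """Predict taken when the selected counter is 2 or 3."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Train the selected counter, then shift the outcome into the history."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


if __name__ == "__main__":
    predictor = GlobalHistoryPredictor()
    correct = 0
    for step in range(20):
        taken = step % 2 == 0  # a branch that alternates taken / not taken
        correct += predictor.predict(0) == taken
        predictor.update(0, taken)
    print(f"{correct} of 20 predictions correct")
```

A single 2-bit counter handles an alternating branch poorly, whereas the history-indexed table learns the pattern after a short warm-up, which is the motivation for using global history.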
  • 43. Instruction Decode Unit • Decodes and sequences all instructions • Dual pipeline structure, pipe0 and pipe1 • Two instructions can progress at a time • Pipe0 contains older instruction in program order • If instruction in pipe0 cannot issue, pipe1 will not issue • Instructions progress in order • Results written back to register file at end of execution pipeline • Prevents WAR hazards • Keeps tracking of WAW hazards and recovery from flush conditions straightforward • Main concern of decode pipeline is prevention of RAW hazards
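The in-order dual-issue rule described above can be sketched as a check over the two oldest queued instructions: the younger one issues only if the older one does, and only if it does not read a result the older one is about to produce. This is a simplified model under stated assumptions (no forwarding between the paired instructions, no structural hazards), with hypothetical names; it is not a description of the actual Cortex-A8 issue logic.

```python
def dual_issue(queue, ready):
    """Return the instructions (at most two) that issue this cycle.

    queue: list of (dest, srcs) tuples in program order, oldest first
    ready: set of registers whose values are currently available
    """
    issued = []
    for dest, srcs in queue[:2]:
        # Registers written by an older instruction issuing this cycle
        pending = {d for d, _ in issued}
        raw_hazard = any(s not in ready or s in pending for s in srcs)
        if raw_hazard:
            break  # in-order issue: a stalled instruction blocks younger ones
        issued.append((dest, srcs))
    return issued


if __name__ == "__main__":
    ready = {"R1", "R2"}
    # Independent pair: both issue
    print(dual_issue([("R3", ("R1",)), ("R4", ("R2",))], ready))
    # Second reads the first's result: only the first issues
    print(dual_issue([("R3", ("R1", "R2")), ("R4", ("R3",))], ready))
    # First cannot issue, so the second does not either
    print(dual_issue([("R3", ("R9",)), ("R4", ("R1",))], ready))
```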
  • 44. Instruction Processing Stages • D0 Thumb instructions decompressed and preliminary decode is performed • D1 Instruction decode is completed • D2 Write instruction to and read instructions from pending/replay queue • D3 Contains the instruction scheduling logic • Scoreboard predicts register availability using static scheduling • Hazard checking • D4 Final decode for control signals for integer execute and load/store units
  • 45. Integer Execution Unit • Two symmetric (ALU) pipelines, an address generator for load and store instructions, and multiply pipeline • Pipeline stages: • E0 Access register file • Up to six registers for two instructions • E1 Barrel shifter if needed. • E2 ALU function • E3 If needed, completes saturation arithmetic • E4 Change in control flow prioritized and processed • E5 Results written back to register file • Multiply unit instructions routed to pipe0 • Performed in stages E1 through E3 • Multiply accumulate operation in E4
  • 46. Load/store pipeline • Parallel to integer pipeline • E1 Memory address generated from base and index register • E2 address applied to cache arrays • E3 load, data returned and formatted • E3 store, data are formatted and ready to be written to cache • E4 Updates L2 cache, if required • E5 Results are written to register file
  • 48. SIMD and Floating-Point Pipeline • SIMD and floating-point instructions pass through integer pipeline • Processed in separate 10-stage pipeline • NEON unit • Handles packed SIMD instructions • Provides two types of floating-point support • If implemented, vector floating-point (VFP) coprocessor performs IEEE 754 floating-point operations • If not, separate multiply and add pipelines implement floating-point operations
  • 49. ARM Cortex-A8 NEON & Floating Point Pipeline