Instruction-Level Parallelism Limitations 
EECE528: Parallel and Reconfigurable Computing 
Jose P. Pinilla
CONTENT 
I. ILP Background 
II. Hardware Model 
III. Study of Limitations 
IV. Simultaneous Multithreading 
V. ILP today
I. ILP 
• MIPS Example 
– Hazards 
• Structural 
• Data 
• Control 
• Power5 
• ILP Optimizations 
– Register Renaming 
– Branch/Jump Prediction 
– Alias Analysis
I. ILP: MIPS 
[Pipeline diagram: a Load and four subsequent instructions overlap across clock cycles in the five-stage MIPS pipeline: I$ (fetch), Reg (decode/register read), ALU, D$ (memory), Reg (write-back)]
I. ILP: Hazards 
1. Structural 
2. Data 
3. Control
I. ILP: Structural Hazards 
Conflict over the use of resources 
[Pipeline diagram: two overlapping instructions require the same hardware resource in the same clock cycle]
I. ILP: Structural Hazards 
[Pipeline diagram: one instruction writes the register file (Reg) in the same cycle another instruction reads it]
Solutions for register-file read/write conflicts: 
• Read and write on the same clock cycle (write in the first half-cycle, read in the second) 
• Use separate read and write ports
I. ILP: Data Hazards 
[Pipeline diagram: five overlapping instructions; later instructions read a register before an earlier instruction has written its result back]
I. ILP: Data Hazards 
add $t0, $t1, $t2 
sub $t4, $t0, $t3 
and $t5, $t0, $t6 
or  $t7, $t0, $t8 
xor $t9, $t0, $t10
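The sequence above has read-after-write (RAW) dependences: sub, and, or, and xor all read $t0, which add writes. A minimal Python sketch (an illustrative representation, not real MIPS tooling) that finds such dependences:

```python
# Detect read-after-write (RAW) hazards in a short instruction trace.
# Each instruction is represented as (dest_register, (source_registers)).
def raw_hazards(instrs):
    hazards = []
    writers = {}  # register -> index of the last instruction that wrote it
    for i, (dest, srcs) in enumerate(instrs):
        for s in srcs:
            if s in writers:
                hazards.append((writers[s], i, s))  # (producer, consumer, reg)
        writers[dest] = i
    return hazards

code = [
    ("$t0", ("$t1", "$t2")),   # add $t0, $t1, $t2
    ("$t4", ("$t0", "$t3")),   # sub $t4, $t0, $t3
    ("$t5", ("$t0", "$t6")),   # and $t5, $t0, $t6
    ("$t7", ("$t0", "$t8")),   # or  $t7, $t0, $t8
    ("$t9", ("$t0", "$t10")),  # xor $t9, $t0, $t10
]
print(raw_hazards(code))  # every later instruction depends on add's $t0
```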
I. ILP: Data Hazards
I. ILP: Data Hazards 
Forwarding
I. ILP: Data Hazards 
Hardware Interlock 
Allows Forwarding
I. ILP: Data Hazards 
Stalling by compiler
I. ILP: Data Hazards 
Stalling by compiler 
The compiler could instead schedule useful work into the stall cycle; the hardware can also do this dynamically.
I. ILP: Data Hazards 
Avoid Stalling
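Forwarding removes most stalls, but on the classic five-stage MIPS pipeline a load followed immediately by a dependent instruction still costs one bubble; the compiler can hide it by moving an independent instruction in between. A small sketch (hypothetical instruction encoding) counting those load-use stalls:

```python
# Count load-use stalls in a five-stage pipeline with full forwarding:
# a dependent instruction immediately after a load still stalls one cycle.
def load_use_stalls(instrs):
    # instrs: list of (opcode, dest_register, source_registers)
    stalls = 0
    for prev, cur in zip(instrs, instrs[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1
    return stalls

before = [("lw", "$t0", ("$a0",)), ("add", "$t1", ("$t0", "$t2")),
          ("sub", "$t3", ("$t4", "$t5"))]
# Compiler moves the independent sub between the load and its consumer.
after  = [("lw", "$t0", ("$a0",)), ("sub", "$t3", ("$t4", "$t5")),
          ("add", "$t1", ("$t0", "$t2"))]
print(load_use_stalls(before), load_use_stalls(after))  # 1 0
```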
I. ILP: Control Hazards
I. ILP: Control Hazards 
Solutions: 
• Add hardware to compute the branch outcome earlier, in stage 2 (DECODE) 
• Predict the branch: to simplify hardware, statically predict NOT TAKEN. A loop-closing 
branch is then mispredicted on every iteration; predicting TAKEN would be wrong only once, at loop exit 
• Branch delay slot: the compiler places an instruction after the branch that is always executed (used by MIPS)
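The static-prediction trade-off above can be checked with a toy count: for a loop-closing branch, "not taken" is wrong on every iteration, while "taken" is wrong only at the exit. A minimal sketch (illustrative, not from the sources):

```python
# Mispredictions for a 100-iteration loop-closing branch under the two
# static predictions: the branch is taken 99 times, not taken once (exit).
def mispredictions(outcomes, predict_taken):
    return sum(1 for taken in outcomes if taken != predict_taken)

outcomes = [True] * 99 + [False]  # True = branch taken
print(mispredictions(outcomes, predict_taken=False))  # 99
print(mispredictions(outcomes, predict_taken=True))   # 1
```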
I. ILP: Power5 Architecture 
A 16-stage pipeline
I. ILP: Optimizations 
Instruction window: the trace of incoming instructions analyzed for execution. 
Register Renaming: on false (name) dependences, the hardware can rename the register. 
Compilers should also remove these false dependences (R2R memory model). 
Branch Prediction: 
Static: always not taken, always taken, forward-not-taken/backward-taken, branch delay slot. 
Dynamic: one-level (1-bit, 2-bit, ...), two-level, and multiple-component predictors. 
Jump Prediction: static profiling; dynamic: last taken, 2-bit tables, return-address stack. 
Alias Analysis: disambiguating indirect memory references; instruction inspection.
I. ILP: Branch Prediction 
Saturating counter: incremented on branch 
taken, decremented on not taken; saturates 
at its extremes (no overflow or underflow). 
Branch correlation: inter- and intra-branch. 
Two-level: remembers the history of the 
last n occurrences of the branch and 
uses one saturating counter for each of 
the 2^n possible history patterns. 
Many more...
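As a concrete illustration of the saturating counter, here is a minimal 2-bit predictor sketch (states 0-3, predict taken at 2 or 3; the starting state is an assumption):

```python
# Two-bit saturating counter branch predictor: states 0..3, predict taken
# when the counter is 2 or 3; increments/decrements saturate (no wraparound).
class TwoBitPredictor:
    def __init__(self):
        self.ctr = 1  # start weakly not-taken (assumed initial state)

    def predict(self):
        return self.ctr >= 2

    def update(self, taken):
        if taken:
            self.ctr = min(3, self.ctr + 1)
        else:
            self.ctr = max(0, self.ctr - 1)

p = TwoBitPredictor()
correct = 0
for taken in [True] * 9 + [False]:  # loop branch: taken 9 times, then exit
    correct += (p.predict() == taken)
    p.update(taken)
print(correct)  # 8: wrong only while warming up and at the loop exit
```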
II. HARDWARE MODEL 
• Profiling Framework 
– Assumptions 
– Window Size
II. HW MODEL: Profiling Framework 
A set of assumptions and a methodology to experimentally extract a parallelism 
profile from a set of benchmarks. 
The program is executed to completion, producing a trace of instructions. 
The trace includes the data addresses referenced and the results of branches and jumps. 
(D. Wall's 1993 study) The trace is divided into cycles of 64 instructions in flight. 
The only limits on ILP in such a processor are those imposed by the actual data 
flows through either registers or memory.
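Under these assumptions, the only scheduling constraint is true data flow: each instruction can issue one cycle after all of its producers. A toy sketch of that limit computation (the trace format is hypothetical, not Wall's actual tooling):

```python
# Ideal-ILP limit: schedule each traced instruction at
# 1 + max(issue cycle of its producers); ILP = instructions / cycles.
def ideal_ilp(trace):
    # trace: list of (dest, sources); true register data flow only,
    # with perfect prediction, renaming, and alias analysis assumed.
    ready = {}        # register -> cycle its value becomes available
    cycles_used = 0
    for dest, srcs in trace:
        cycle = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        ready[dest] = cycle
        cycles_used = max(cycles_used, cycle)
    return len(trace) / cycles_used

trace = [("a", ()), ("b", ()), ("c", ("a", "b")), ("d", ()), ("e", ("d",))]
print(ideal_ilp(trace))  # 5 instructions in 2 cycles -> 2.5
```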
II. HW MODEL: Assumptions 
• No limits on replicated functional units or ports to registers or memory. 
• Register Renaming: Perfect, Infinite, Finite, None 
• Branch Prediction: Perfect, Infinite, Finite, None 
• Jump Prediction: Perfect, Infinite, Finite, None 
• Memory Address Alias Analysis: Perfect, Inspection, None 
• Perfect caches 
• Unit-cycle latencies 
• 2K (2048-instruction) window size
II. HW MODEL: Register Renaming 
• Perfect: Infinite number of registers to avoid false register dependencies. 
• Finite: Normally 256 integer registers and 256 floating point registers used in LRU 
(Least Recently Used) fashion. 
• No renaming: only the registers named in the code are available.
II. HW MODEL: Branch Prediction 
• Perfect: All branches are correctly predicted. 
• 2-bit predictor with finite tables: Dynamic. One 2-bit counter per entry, indexed by the 
low-order bits of the branch's address; incremented on branch taken and saturating (no 
overflow). The branch is predicted taken if the entry is 2 or 3. Up to 512 2-bit entries. 
• 2-bit predictor with infinite tables: An infinite number of counters. 
• Tournament-based branch predictor: Two 2-bit predictors competing, with a 2-bit selector 
that is decremented/incremented according to which predictor was correct. 
• Profile-based: Static predictions. 
• No prediction: Every branch is predicted wrong. 
(Not listed in order of performance)
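The tournament selector itself can be sketched as a 2-bit saturating counter per branch that moves toward whichever component predictor was correct (a minimal illustration, not a full predictor):

```python
# Tournament selector: a 2-bit counter chooses between two predictors.
# It moves toward predictor 0 when only predictor 0 was right, toward
# predictor 1 when only predictor 1 was right; ties leave it unchanged.
def select(selector):
    return 0 if selector < 2 else 1

def update_selector(selector, p0_correct, p1_correct):
    if p0_correct and not p1_correct:
        return max(0, selector - 1)
    if p1_correct and not p0_correct:
        return min(3, selector + 1)
    return selector

sel = 2                      # currently favoring predictor 1
sel = update_selector(sel, p0_correct=True, p1_correct=False)
sel = update_selector(sel, p0_correct=True, p1_correct=False)
print(select(sel))  # 0: two wins in a row for predictor 0 swing the choice
```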
II. HW MODEL: Jump Prediction 
• Direct jumps: targets are known. 
• Indirect jumps: 
– Perfect: Always predicted correctly. 
– Finite prediction: A table of destination addresses, indexed by the jump's address. 
Whenever a jump executes, its destination is stored in the table; the next execution 
of that jump is predicted to go to the stored destination. 
– Infinite prediction: Infinite table entries. 
• No prediction: Every jump is predicted wrong.
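The finite "last taken" table can be sketched as a map from jump address to the destination seen on the previous execution (addresses below are made up):

```python
# "Last taken" indirect-jump predictor: a table keyed by the jump's
# address stores the destination it branched to last time.
class JumpPredictor:
    def __init__(self):
        self.table = {}

    def predict(self, jump_addr):
        return self.table.get(jump_addr)  # None on a cold (never-seen) entry

    def update(self, jump_addr, dest):
        self.table[jump_addr] = dest

jp = JumpPredictor()
jp.update(0x400100, 0x400800)                # jump executed once
print(jp.predict(0x400100) == 0x400800)     # True: repeats last destination
print(jp.predict(0x400200))                 # None: never seen this jump
```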
II. HW MODEL: Alias Analysis 
• If two memory references do not refer to the same address, they may 
be safely interchanged. 
• Indirect memory references are resolved before the instruction executes. 
• There is no need to predict the actual values, only whether the references conflict. 
• Perfect: All global, stack, and heap references are disambiguated perfectly. 
• Inspection: Examine the base register and offset. 
• None: All indirect memory references are assumed to conflict.
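Inspection can prove independence in simple cases, e.g. two accesses off the same unmodified base register with different offsets. A conservative sketch of that check (register names illustrative; equal access widths assumed):

```python
# Alias analysis by inspection: two accesses off the same (unmodified)
# base register with different offsets cannot touch the same address.
def may_alias(ref_a, ref_b):
    # ref = (base_register, offset); equal access widths assumed
    base_a, off_a = ref_a
    base_b, off_b = ref_b
    if base_a == base_b:
        return off_a == off_b   # same base: alias only on equal offsets
    return True                 # different bases: must assume a conflict

print(may_alias(("$sp", 4), ("$sp", 8)))   # False: provably independent
print(may_alias(("$sp", 4), ("$gp", 8)))   # True: cannot prove otherwise
```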
II. HW MODEL: Window Size 
• The set of instructions examined for simultaneous execution. 
• The cycle width limits the number of instructions that can be scheduled per cycle. 
• A window size of 2K looks at 2048 instructions. 
• Cycle width: suppose 111 instructions have been found that can be parallelized; a 
cycle width of 64 limits actual parallelism to 64 in-flight instructions.
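The cycle-width example above is just a ceiling division: 111 ready instructions at width 64 need two cycles (64, then 47). A one-line sketch:

```python
# Cycle width limits issue: ready instructions beyond the width spill
# into later cycles (ceiling division).
def cycles_needed(ready_instructions, cycle_width):
    return -(-ready_instructions // cycle_width)  # ceil without math module

print(cycles_needed(111, 64))  # 2: one full 64-wide cycle, then 47 leftovers
```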
II. HW MODEL 
ctr: counter 
gsh: gshare (global history)
III. STUDY OF LIMITATIONS 
• Effects of... 
– Register Renaming 
– Branch/Jump Prediction 
– Alias Analysis 
– Realizable processor 
• Window Size (Discrete/Continuous) 
• Results
III. LIMITATIONS: Benchmarks
III. LIMITATIONS: Register Renaming
III. LIMITATIONS: Branch Prediction
III. LIMITATIONS: Branch Prediction
III. LIMITATIONS: Alias Analysis
III. LIMITATIONS: Results
III. LIMITATIONS: Realizable Processor 
• Up to 64 instruction issues per clock with no issue restrictions, or roughly 10 times the 
total issue width of the widest processor in 2011 
• A tournament predictor with 1K entries and a 16-entry return predictor. This predictor 
is comparable to the best predictors in 2011; the predictor is not a primary bottleneck 
• Perfect disambiguation of memory references done dynamically—this is ambitious but 
perhaps attainable for small window sizes (and hence small issue rates and load-store 
buffers) or through address aliasing prediction 
• Register renaming with 64 additional integer and 64 additional FP registers, which is 
slightly less than the most aggressive processor in 2011 
• No issue restrictions, no cache misses, unit latencies 
• Variable Window Size (Power5 200, Intel Core i7 ~128)
III. LIMITATIONS: Realizable Processor
III. LIMITATIONS: Conclusions 
• Plateau behavior 
• The effect of window size on the integer programs (top three) is not 
as severe; the FP programs gain more from large windows, due to loop-level parallelism. 
• Designers are faced with the challenge: 
– Simpler processors with larger caches and 
higher clock rates 
Vs 
– ILP with slower clock and smaller caches 
• Persistent limitations: 
– WAW and WAR hazards through memory 
– Unnecessary dependences 
– Data flow limit
IV. SIMULTANEOUS MULTITHREADING 
• TLP Background 
– TLP approaches 
– Design Challenges 
• Limits of Multiple-Issue Processors 
– Power 
– Complexity
IV. SMT: TLP Background 
• Threads are largely independent 
– Each has a separate copy of the register file, PC, and page table 
• A thread could represent 
– A process that is part of a parallel program consisting of multiple processes 
– An independent program on its own 
• Thread-level parallelism occurs naturally 
• It can be used to keep functional units busy when ILP is insufficient
IV. SMT: TLP Approaches
IV. SMT: Changes 
• Increasing the associativity of the L1 instruction cache and the instruction 
address translation buffers 
• Adding per-thread load and store queues 
• Increasing the size of the L2 and L3 caches 
• Adding separate instruction prefetch and buffering 
• Increasing the number of virtual registers from 152 to 240 
• Increasing the size of several issue queues
IV. SMT: Results
IV. SMT: Results 
• SMT reduces energy by 7% 
• “Because of the costs and diminishing returns in performance, however, rather than 
implement wider superscalars and more aggressive versions of SMT, many designers are 
opting to implement multiple CPU cores on a single die with slightly less aggressive support 
for multiple issue and multithreading; we return to this topic in the next chapter.” - Hennessy 
et al.
V. ILP TODAY: x86 
• Instruction fetch—The processor 
uses a multilevel branch target buffer 
to achieve a balance between speed 
and prediction accuracy. There is also a 
return address stack to speed up 
function return. Mispredictions cause a 
penalty of about 17 cycles. Using the 
predicted address, the instruction fetch 
unit fetches 16 bytes from the 
instruction cache. 
• Micro-code and Macro-code 
• Total pipeline depth is 14 stages 
• 128 reorder (renaming) buffer size
V. ILP TODAY: x86 
• Hyper-Threading: 
– SMT 
– The processor may stall due to a 
cache miss, branch misprediction, 
or data dependency. 
– Branch misprediction costs 17 
cycles
V. ILP TODAY: x86
V. ILP TODAY: ARM 
• The average CPI for the ARM7 family is about 1.9. 
• The average CPI for the ARM9 family is about 1.5. 
• The average CPI for the ARM11 family is about 1.39.
SOURCES 
Hennessy, J. L., Patterson, D. A., Asanović, K. Computer Architecture: A Quantitative Approach. 
5th ed. Morgan Kaufmann/Elsevier, 2012. 
Wall, D. W. Limits of Instruction-Level Parallelism. 4th International Conference on Architectural 
Support for Programming Languages and Operating Systems (ASPLOS), pages 176-188, 1991. 
Franklin, M., Garcia, D. Computer Science 61C, Lecture 31: Instruction Level Parallelism. 
UC Berkeley, Fall 2011. 
Fatehi, E., Gratz, P. V. ILP and TLP in Shared Memory Applications: A Limit Study. Proceedings of 
the 23rd International Conference on Parallel Architectures and Compilation (PACT), pages 113-126, 2014. 
Langer, M. MIPS Multicycle Model: Pipelining. Introduction to Computer Systems, McGill 
University, 2012. 
Kalla, R., Sinharoy, B., Tendler, J. M. IBM Power5 Chip: A Dual-Core Multithreaded Processor. 
IEEE Micro, 2004.

Más contenido relacionado

La actualidad más candente

Pipelining and ILP (Instruction Level Parallelism)
Pipelining and ILP (Instruction Level Parallelism) Pipelining and ILP (Instruction Level Parallelism)
Pipelining and ILP (Instruction Level Parallelism) A B Shinde
 
program flow mechanisms, advanced computer architecture
program flow mechanisms, advanced computer architectureprogram flow mechanisms, advanced computer architecture
program flow mechanisms, advanced computer architecturePankaj Kumar Jain
 
Data transfer and manipulation
Data transfer and manipulationData transfer and manipulation
Data transfer and manipulationSanjeev Patel
 
Applications of paralleL processing
Applications of paralleL processingApplications of paralleL processing
Applications of paralleL processingPage Maker
 
Parallel computing and its applications
Parallel computing and its applicationsParallel computing and its applications
Parallel computing and its applicationsBurhan Ahmed
 
Introduction to parallel processing
Introduction to parallel processingIntroduction to parallel processing
Introduction to parallel processingPage Maker
 
Computer Architecture and organization ppt.
Computer Architecture and organization ppt.Computer Architecture and organization ppt.
Computer Architecture and organization ppt.mali yogesh kumar
 
Parallel computing
Parallel computingParallel computing
Parallel computingVinay Gupta
 
Floating point arithmetic operations (1)
Floating point arithmetic operations (1)Floating point arithmetic operations (1)
Floating point arithmetic operations (1)cs19club
 
program partitioning and scheduling IN Advanced Computer Architecture
program partitioning and scheduling  IN Advanced Computer Architectureprogram partitioning and scheduling  IN Advanced Computer Architecture
program partitioning and scheduling IN Advanced Computer ArchitecturePankaj Kumar Jain
 
Computer architecture pipelining
Computer architecture pipeliningComputer architecture pipelining
Computer architecture pipeliningMazin Alwaaly
 
Computer architecture instruction formats
Computer architecture instruction formatsComputer architecture instruction formats
Computer architecture instruction formatsMazin Alwaaly
 
Input output organization
Input output organizationInput output organization
Input output organizationabdulugc
 

La actualidad más candente (20)

Pipelining and ILP (Instruction Level Parallelism)
Pipelining and ILP (Instruction Level Parallelism) Pipelining and ILP (Instruction Level Parallelism)
Pipelining and ILP (Instruction Level Parallelism)
 
program flow mechanisms, advanced computer architecture
program flow mechanisms, advanced computer architectureprogram flow mechanisms, advanced computer architecture
program flow mechanisms, advanced computer architecture
 
Vector architecture
Vector architectureVector architecture
Vector architecture
 
Parallel Computing
Parallel ComputingParallel Computing
Parallel Computing
 
VLIW Processors
VLIW ProcessorsVLIW Processors
VLIW Processors
 
Data transfer and manipulation
Data transfer and manipulationData transfer and manipulation
Data transfer and manipulation
 
Applications of paralleL processing
Applications of paralleL processingApplications of paralleL processing
Applications of paralleL processing
 
Parallel computing and its applications
Parallel computing and its applicationsParallel computing and its applications
Parallel computing and its applications
 
Introduction to parallel processing
Introduction to parallel processingIntroduction to parallel processing
Introduction to parallel processing
 
Parallel Algorithms
Parallel AlgorithmsParallel Algorithms
Parallel Algorithms
 
Computer Architecture and organization ppt.
Computer Architecture and organization ppt.Computer Architecture and organization ppt.
Computer Architecture and organization ppt.
 
Parallel computing
Parallel computingParallel computing
Parallel computing
 
Computer arithmetic
Computer arithmeticComputer arithmetic
Computer arithmetic
 
Floating point arithmetic operations (1)
Floating point arithmetic operations (1)Floating point arithmetic operations (1)
Floating point arithmetic operations (1)
 
Parallel processing
Parallel processingParallel processing
Parallel processing
 
program partitioning and scheduling IN Advanced Computer Architecture
program partitioning and scheduling  IN Advanced Computer Architectureprogram partitioning and scheduling  IN Advanced Computer Architecture
program partitioning and scheduling IN Advanced Computer Architecture
 
Computer architecture pipelining
Computer architecture pipeliningComputer architecture pipelining
Computer architecture pipelining
 
Computer architecture instruction formats
Computer architecture instruction formatsComputer architecture instruction formats
Computer architecture instruction formats
 
Input output organization
Input output organizationInput output organization
Input output organization
 
Microprogrammed Control Unit
Microprogrammed Control UnitMicroprogrammed Control Unit
Microprogrammed Control Unit
 

Similar a Instruction Level Parallelism (ILP) Limitations

Advanced Techniques for Exploiting ILP
Advanced Techniques for Exploiting ILPAdvanced Techniques for Exploiting ILP
Advanced Techniques for Exploiting ILPA B Shinde
 
Automating the Hunt for Non-Obvious Sources of Latency Spreads
Automating the Hunt for Non-Obvious Sources of Latency SpreadsAutomating the Hunt for Non-Obvious Sources of Latency Spreads
Automating the Hunt for Non-Obvious Sources of Latency SpreadsScyllaDB
 
UNIT 3 - General Purpose Processors
UNIT 3 - General Purpose ProcessorsUNIT 3 - General Purpose Processors
UNIT 3 - General Purpose ProcessorsButtaRajasekhar2
 
Performance Enhancement with Pipelining
Performance Enhancement with PipeliningPerformance Enhancement with Pipelining
Performance Enhancement with PipeliningAneesh Raveendran
 
Reduced instruction set computers
Reduced instruction set computersReduced instruction set computers
Reduced instruction set computersSyed Zaid Irshad
 
Processor Organization and Architecture
Processor Organization and ArchitectureProcessor Organization and Architecture
Processor Organization and ArchitectureVinit Raut
 
2. ILP Processors.ppt
2. ILP Processors.ppt2. ILP Processors.ppt
2. ILP Processors.pptShifaZahra7
 
Computer Organization: Introduction to Microprocessor and Microcontroller
Computer Organization: Introduction to Microprocessor and MicrocontrollerComputer Organization: Introduction to Microprocessor and Microcontroller
Computer Organization: Introduction to Microprocessor and MicrocontrollerAmrutaMehata
 
Top schools in gudgao
Top schools in gudgaoTop schools in gudgao
Top schools in gudgaoEdhole.com
 
Top schools in gudgao
Top schools in gudgaoTop schools in gudgao
Top schools in gudgaoEdhole.com
 
Performance Tuning by Dijesh P
Performance Tuning by Dijesh PPerformance Tuning by Dijesh P
Performance Tuning by Dijesh PPlusOrMinusZero
 
Preparing Codes for Intel Knights Landing (KNL)
Preparing Codes for Intel Knights Landing (KNL)Preparing Codes for Intel Knights Landing (KNL)
Preparing Codes for Intel Knights Landing (KNL)AllineaSoftware
 

Similar a Instruction Level Parallelism (ILP) Limitations (20)

Advanced Techniques for Exploiting ILP
Advanced Techniques for Exploiting ILPAdvanced Techniques for Exploiting ILP
Advanced Techniques for Exploiting ILP
 
Automating the Hunt for Non-Obvious Sources of Latency Spreads
Automating the Hunt for Non-Obvious Sources of Latency SpreadsAutomating the Hunt for Non-Obvious Sources of Latency Spreads
Automating the Hunt for Non-Obvious Sources of Latency Spreads
 
UNIT 3 - General Purpose Processors
UNIT 3 - General Purpose ProcessorsUNIT 3 - General Purpose Processors
UNIT 3 - General Purpose Processors
 
Performance Enhancement with Pipelining
Performance Enhancement with PipeliningPerformance Enhancement with Pipelining
Performance Enhancement with Pipelining
 
13 risc
13 risc13 risc
13 risc
 
Reduced instruction set computers
Reduced instruction set computersReduced instruction set computers
Reduced instruction set computers
 
RISC.ppt
RISC.pptRISC.ppt
RISC.ppt
 
13 risc
13 risc13 risc
13 risc
 
13 superscalar
13 superscalar13 superscalar
13 superscalar
 
13_Superscalar.ppt
13_Superscalar.ppt13_Superscalar.ppt
13_Superscalar.ppt
 
Processor Organization and Architecture
Processor Organization and ArchitectureProcessor Organization and Architecture
Processor Organization and Architecture
 
13 risc
13 risc13 risc
13 risc
 
2. ILP Processors.ppt
2. ILP Processors.ppt2. ILP Processors.ppt
2. ILP Processors.ppt
 
A12 vercelletto indexing_techniques
A12 vercelletto indexing_techniquesA12 vercelletto indexing_techniques
A12 vercelletto indexing_techniques
 
Unit iii
Unit iiiUnit iii
Unit iii
 
Computer Organization: Introduction to Microprocessor and Microcontroller
Computer Organization: Introduction to Microprocessor and MicrocontrollerComputer Organization: Introduction to Microprocessor and Microcontroller
Computer Organization: Introduction to Microprocessor and Microcontroller
 
Top schools in gudgao
Top schools in gudgaoTop schools in gudgao
Top schools in gudgao
 
Top schools in gudgao
Top schools in gudgaoTop schools in gudgao
Top schools in gudgao
 
Performance Tuning by Dijesh P
Performance Tuning by Dijesh PPerformance Tuning by Dijesh P
Performance Tuning by Dijesh P
 
Preparing Codes for Intel Knights Landing (KNL)
Preparing Codes for Intel Knights Landing (KNL)Preparing Codes for Intel Knights Landing (KNL)
Preparing Codes for Intel Knights Landing (KNL)
 

Más de Jose Pinilla

Summary - Adaptive Insertion Policies for High Performance Caching. Qureshi, ...
Summary - Adaptive Insertion Policies for High Performance Caching. Qureshi, ...Summary - Adaptive Insertion Policies for High Performance Caching. Qureshi, ...
Summary - Adaptive Insertion Policies for High Performance Caching. Qureshi, ...Jose Pinilla
 
CWCAS X-ISCKER Poster
CWCAS X-ISCKER PosterCWCAS X-ISCKER Poster
CWCAS X-ISCKER PosterJose Pinilla
 
Presentación Proyecto de Grado: X-ISCKER
Presentación Proyecto de Grado: X-ISCKERPresentación Proyecto de Grado: X-ISCKER
Presentación Proyecto de Grado: X-ISCKERJose Pinilla
 
Medical images compression: JPEG variations for DICOM standard
Medical images compression: JPEG variations for DICOM standardMedical images compression: JPEG variations for DICOM standard
Medical images compression: JPEG variations for DICOM standardJose Pinilla
 
Black wednesday SOPA/PIPA Report
Black wednesday SOPA/PIPA ReportBlack wednesday SOPA/PIPA Report
Black wednesday SOPA/PIPA ReportJose Pinilla
 
Telemedicine and telecardiology report
Telemedicine and telecardiology reportTelemedicine and telecardiology report
Telemedicine and telecardiology reportJose Pinilla
 
The internet success factors
The internet success factorsThe internet success factors
The internet success factorsJose Pinilla
 
FPGA como alternativa
FPGA como alternativaFPGA como alternativa
FPGA como alternativaJose Pinilla
 
"Basta de historias" de Andrés Oppenheimer
"Basta de historias" de Andrés Oppenheimer"Basta de historias" de Andrés Oppenheimer
"Basta de historias" de Andrés OppenheimerJose Pinilla
 

Más de Jose Pinilla (11)

Summary - Adaptive Insertion Policies for High Performance Caching. Qureshi, ...
Summary - Adaptive Insertion Policies for High Performance Caching. Qureshi, ...Summary - Adaptive Insertion Policies for High Performance Caching. Qureshi, ...
Summary - Adaptive Insertion Policies for High Performance Caching. Qureshi, ...
 
X-ISCKER
X-ISCKERX-ISCKER
X-ISCKER
 
CWCAS X-ISCKER Poster
CWCAS X-ISCKER PosterCWCAS X-ISCKER Poster
CWCAS X-ISCKER Poster
 
Presentación Proyecto de Grado: X-ISCKER
Presentación Proyecto de Grado: X-ISCKERPresentación Proyecto de Grado: X-ISCKER
Presentación Proyecto de Grado: X-ISCKER
 
Medical images compression: JPEG variations for DICOM standard
Medical images compression: JPEG variations for DICOM standardMedical images compression: JPEG variations for DICOM standard
Medical images compression: JPEG variations for DICOM standard
 
Black wednesday SOPA/PIPA Report
Black wednesday SOPA/PIPA ReportBlack wednesday SOPA/PIPA Report
Black wednesday SOPA/PIPA Report
 
Telemedicine and telecardiology report
Telemedicine and telecardiology reportTelemedicine and telecardiology report
Telemedicine and telecardiology report
 
The internet success factors
The internet success factorsThe internet success factors
The internet success factors
 
FPGA como alternativa
FPGA como alternativaFPGA como alternativa
FPGA como alternativa
 
FPGA @ UPB-BGA
FPGA @ UPB-BGAFPGA @ UPB-BGA
FPGA @ UPB-BGA
 
"Basta de historias" de Andrés Oppenheimer
"Basta de historias" de Andrés Oppenheimer"Basta de historias" de Andrés Oppenheimer
"Basta de historias" de Andrés Oppenheimer
 

Último

KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
Vivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design SpainVivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design Spaintimesproduction05
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)simmis5
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdfKamal Acharya
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . pptDineshKumar4165
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college projectTonystark477637
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfKamal Acharya
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...Call Girls in Nagpur High Profile
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduitsrknatarajan
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfJiananWang21
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 

Último (20)

KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Vivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design SpainVivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design Spain
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 

Instruction Level Parallelism (ILP) Limitations

  • 16. I. ILP: Data Hazards Stalling by compiler
  • 17. I. ILP: Data Hazards Stalling by compiler The compiler could schedule a more useful instruction into the stall cycle; the hardware scheduler can do the same.
  • 18. I. ILP: Data Hazards Avoid Stalling
  • 19. I. ILP: Control Hazards
  • 20. I. ILP: Control Hazards Solutions: Add hardware to compute the branch outcome in stage 2 (DECODE). Predict the branch: to simplify hardware, predict branches as NOT TAKEN. The final iteration of a loop will always be mispredicted, but only once per loop. Branch delay slot: insert an instruction after the branch that always executes; the compiler fills the slot (as in MIPS).
  • 21. I. ILP: Power5 Architecture 16 different stages
  • 22. I. ILP: Optimizations Instruction window: trace of incoming instructions analyzed for execution. Register Renaming: on false data dependences, hardware can rename the register; compilers should also optimize these false dependences (R2R memory model). Branch Prediction: Static: always not taken, always taken, forward/backward taken, branch delay slot. Dynamic: one-level (1-bit, 2-bit, ...), two-level, and multiple-component. Jump Prediction: static profiling; dynamic: last taken, 2-bit tables, return stack. Alias Analysis: indirect memory references; instruction inspection.
  • 23. I. ILP: Branch Prediction Saturating counter: increment on branch taken, decrement on not taken; the counter saturates, so it never overflows or underflows. Branch correlation: inter/intra. Two-level: remembers the history of the last n occurrences of the branch and uses one saturating counter for each of the possible 2^n history patterns. Many more...
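A minimal Python sketch of the 2-bit saturating counter described above (illustrative code, not from the deck):

```python
class SaturatingCounter:
    """2-bit saturating counter: states 0-3; 0-1 predict not taken,
    2-3 predict taken; increments on taken, decrements on not taken,
    and saturates instead of wrapping around."""
    def __init__(self, state=1):
        self.state = state

    def predict(self):
        return self.state >= 2            # True means "predict taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)   # saturate at 3 (no overflow)
        else:
            self.state = max(0, self.state - 1)   # saturate at 0 (no underflow)

# A loop branch taken 9 times, then not taken once on loop exit:
ctr = SaturatingCounter()
hits = 0
for taken in [True] * 9 + [False]:
    hits += (ctr.predict() == taken)
    ctr.update(taken)
print(hits)   # 8 of 10: misses only the first branch and the loop exit
```

Because the counter needs two consecutive mispredictions to flip its prediction, a loop branch costs only one misprediction per loop exit rather than two.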
  • 24. CONTENT I. ILP Background II. Hardware Model III. Study of Limitations IV. Simultaneous Multithreading V. ILP today
  • 25. II. HARDWARE MODEL • Profiling Framework – Assumptions – Window Size
  • 26. II. HW MODEL: Profiling Framework A set of assumptions and a methodology to experimentally extract a parallelism profile from a set of benchmarks. The program is executed completely, producing a trace of instructions. The trace includes the data addresses referenced and the results of branches and jumps (D. Wall's 1993 study). The trace is divided into cycles of up to 64 instructions in flight. The only limits on ILP in such a processor are those imposed by the actual data flows through either registers or memory.
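The methodology above can be sketched as a toy scheduler (an illustrative model, not Wall's actual tool): each instruction in the trace issues in the earliest cycle its data dependences allow, capped at 64 instructions per cycle.

```python
from collections import Counter

def ilp_profile(trace, width=64):
    """trace: list of dependence lists; trace[i] holds the indices of earlier
    instructions whose results instruction i consumes. Returns average
    instructions per cycle under a greedy data-flow-only schedule."""
    cycle = {}             # instruction index -> issue cycle
    per_cycle = Counter()  # issue cycle -> instructions already placed
    for i, deps in enumerate(trace):
        earliest = max((cycle[d] + 1 for d in deps), default=0)
        while per_cycle[earliest] >= width:   # cycle-width limit
            earliest += 1
        cycle[i] = earliest
        per_cycle[earliest] += 1
    n_cycles = max(cycle.values()) + 1
    return len(trace) / n_cycles

# Four independent instructions followed by one that uses all four results:
print(ilp_profile([[], [], [], [], [0, 1, 2, 3]]))  # -> 2.5
```

Only true data flow constrains the schedule here, matching the model's "perfect everything else" assumption.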
  • 27. II. HW MODEL: Assumptions • No limits on replicated functional units or ports to registers or memory. • Register Renaming: Perfect, Infinite, Finite, None • Branch Prediction: Perfect, Infinite, Finite, None • Jump Prediction: Perfect, Infinite, Finite, None • Memory Address Alias Analysis: Perfect, Inspection, None • Perfect caches • Unit-cycle latencies • 2k window size
  • 28. II. HW MODEL: Register Renaming • Perfect: an infinite number of registers, so no false register dependences. • Finite: normally 256 integer registers and 256 floating-point registers, allocated in LRU (least recently used) fashion. • No renaming: only the registers named in the code.
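A minimal sketch of the finite-renaming model (hypothetical; real renaming hardware frees physical registers at instruction commit rather than by pure LRU):

```python
from collections import OrderedDict

class Renamer:
    """Map architectural registers onto a finite physical pool, reclaiming
    the least recently used mapping when the pool runs out."""
    def __init__(self, n_physical):
        self.free = list(range(n_physical))   # free physical registers
        self.map = OrderedDict()              # arch reg -> phys reg, LRU order

    def rename(self, arch_reg):
        if arch_reg in self.map:
            self.map.move_to_end(arch_reg)    # mark mapping as recently used
            return self.map[arch_reg]
        if not self.free:
            _, phys = self.map.popitem(last=False)  # reclaim the LRU mapping
            self.free.append(phys)
        self.map[arch_reg] = self.free.pop()
        return self.map[arch_reg]

r = Renamer(2)
r.rename("t0")
r.rename("t1")
r.rename("t2")   # only two physical registers: t0's mapping is reclaimed
```

With 256 + 256 physical registers the pool rarely runs dry; the "None" model corresponds to a pool exactly the size of the architectural register file.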
  • 29. II. HW MODEL: Branch Prediction • Perfect: all branches are correctly predicted. • 2-bit predictor with finite tables: dynamic; one 2-bit counter per branch, indexed by the low-order bits of the branch's address; incremented on taken and saturating (no overflow); the branch is predicted taken if the table entry is 2 or 3; up to 512 2-bit entries. • 2-bit predictor with infinite tables: an unbounded number of counters. • Tournament-based branch predictor: two 2-bit predictors competing, with a 2-bit selector that is incremented/decremented according to which predictor was correct. • Profile based: static predictions. • No prediction: every branch is predicted wrong. (Not listed in order of performance.)
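The tournament predictor's selection mechanism can be sketched as follows (an illustrative toy; the two component predictions are passed in as plain values):

```python
def choose(sel, pred1, pred2):
    """2-bit selector: values 0-1 trust predictor 1, values 2-3 predictor 2."""
    return pred1 if sel < 2 else pred2

def update_selector(sel, p1_correct, p2_correct):
    """Move the saturating selector toward whichever component was correct."""
    if p1_correct and not p2_correct:
        return max(0, sel - 1)
    if p2_correct and not p1_correct:
        return min(3, sel + 1)
    return sel   # both right or both wrong: leave the selector alone

sel = 2                                   # currently favoring predictor 2
sel = update_selector(sel, True, False)   # predictor 1 was right: sel -> 1
print(choose(sel, "taken", "not taken"))  # -> taken (now trusts predictor 1)
```

The selector itself is just another saturating counter, so it takes two consecutive wins by one component to switch allegiance from the other.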
  • 30. II. HW MODEL: Jump Prediction • Direct jumps are known. • Indirect jumps – Perfect: always predicted correctly. – Finite prediction: a table of destination addresses, indexed by the address of the jump. Whenever a jump executes, its target is stored in the table; the next execution of that jump is predicted to go to the stored address. – Infinite prediction: an unbounded number of table entries. • No prediction: every jump is predicted wrong.
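The finite jump-prediction table can be sketched as follows (a toy model; the entry count and modulo indexing are illustrative assumptions, and aliasing between jumps that share an index is ignored):

```python
class JumpPredictor:
    """Last-target table for indirect jumps: indexed by jump address,
    predicts the most recently observed target for that index."""
    def __init__(self, entries=64):
        self.entries = entries
        self.table = {}

    def predict(self, jump_addr):
        return self.table.get(jump_addr % self.entries)  # None = no prediction

    def update(self, jump_addr, target):
        self.table[jump_addr % self.entries] = target

jp = JumpPredictor()
jp.update(0x400, 0x1000)
print(jp.predict(0x400))   # predicts the last observed target, 0x1000
```

A return stack (mentioned on slide 22) handles the common special case of function returns better than this last-target scheme.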
  • 31. II. HW MODEL: Alias Analysis • If two memory references do not refer to the same address, they may be safely interchanged. • Indirect memory addresses are known before the instruction executes. • No need to predict the actual values, only whether the addresses conflict. • Perfect: all global and stack references are disambiguated perfectly; heap references are assumed to conflict. • Inspection: examine base register and offset. • None: all indirect memory references are assumed to conflict.
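Inspection can be sketched as a conservative check on (base register, offset) pairs (a simplified model that ignores access width):

```python
def may_alias(ref_a, ref_b):
    """Inspection-style alias check: each reference is (base_register, offset).
    Two accesses off the same base with different offsets cannot overlap;
    anything else is conservatively assumed to conflict."""
    base_a, off_a = ref_a
    base_b, off_b = ref_b
    if base_a == base_b:
        return off_a == off_b
    return True   # different bases: cannot prove independence

print(may_alias(("sp", 4), ("sp", 8)))   # False: provably disjoint
print(may_alias(("sp", 4), ("r1", 0)))   # True: must assume a conflict
```

The check only ever answers "provably disjoint" or "might conflict"; it never needs the actual addresses, matching the bullet above.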
  • 32. II. HW MODEL: Window Size • The set of instructions which is examined for simultaneous execution. • The cycle width limits the number of instructions which can be scheduled. • A window size of 2k will look at 2048 instructions. • Cycle width: Assume we have found 111 instructions which can be parallelized. A cycle width of 64 would limit actual parallelism to 64 in flight instructions.
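The slide's numbers work out as follows (a trivial sketch):

```python
import math

window = 2048   # a "2k" window examines 2048 instructions at a time
ready = 111     # independent instructions found within the window
width = 64      # cycle width: issue slots available per cycle

cycles = math.ceil(ready / width)
print(cycles)   # 2: only 64 of the 111 can issue in the first cycle
```

So even with 111 instructions proven independent, a 64-wide machine achieves at most 64 instructions per cycle of actual parallelism.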
  • 33. II. HW MODEL ctr: counter gsh: gshared (global history)
  • 34. CONTENT I. ILP Background II. Hardware Model III. Study of Limitations IV. Simultaneous Multithreading V. ILP today
  • 35. III. STUDY OF LIMITATIONS • Effects of... – Register Renaming – Branch/Jump Prediction – Alias Analysis – Realizable processor • Window Size (Discrete/Continuous) • Results
  • 42. III. LIMITATIONS: Realizable Processor • Up to 64 instruction issues per clock with no issue restrictions, or roughly 10 times the total issue width of the widest processor in 2011 • A tournament predictor with 1K entries and a 16-entry return predictor. This predictor is comparable to the best predictors in 2011; the predictor is not a primary bottleneck • Perfect disambiguation of memory references done dynamically—this is ambitious but perhaps attainable for small window sizes (and hence small issue rates and load-store buffers) or through address aliasing prediction • Register renaming with 64 additional integer and 64 additional FP registers, which is slightly less than the most aggressive processor in 2011 • No issue restrictions, no cache misses, unit latencies • Variable Window Size (Power5 200, Intel Core i7 ~128)
  • 44. III. LIMITATIONS: Conclusions • Plateau behavior • The window-size effect on the integer programs (top three) is not as severe, due to loop-level parallelism. • Designers face a choice: – simpler processors with larger caches and higher clock rates vs. – more ILP with a slower clock and smaller caches • Persistent limitations: – WAW and WAR hazards through memory – Unnecessary dependences – The data flow limit
  • 45. CONTENT I. ILP Background II. Hardware Model III. Study of Limitations IV. Simultaneous Multithreading V. ILP today
  • 46. IV. SIMULTANEOUS MULTITHREADING • TLP Background – TLP approaches – Design Challenges • Limits of Multiple-Issue Processors – Power – Complexity
  • 47. IV. SMT: TLP Background • Largely independent – Separate copy of regFile, PC and page table • Thread could represent – A process that is part of a parallel program consisting of multiple processes – An independent program on its own • Thread level parallelism occurs naturally • It can be used to employ the functional units idle when ILP is insufficient
  • 48. IV. SMT: TLP Approaches
  • 49. IV. SMT: Changes • Increasing the associativity of the L1 instruction cache and the instruction address translation buffers • Adding per-thread load and store queues • Increasing the size of the L2 and L3 caches • Adding separate instruction prefetch and buffering • Increasing the number of virtual registers from 152 to 240 • Increasing the size of several issue queues
  • 51. IV. SMT: Results • SMT reduces energy by 7% • “Because of the costs and diminishing returns in performance, however, rather than implement wider superscalars and more aggressive versions of SMT, many designers are opting to implement multiple CPU cores on a single die with slightly less aggressive support for multiple issue and multithreading; we return to this topic in the next chapter.” - Hennessy et al.
  • 52. CONTENT I. ILP Background II. Hardware Model III. Study of Limitations IV. Simultaneous Multithreading V. ILP today
  • 53. V. ILP TODAY: x86 • Instruction fetch—The processor uses a multilevel branch target buffer to achieve a balance between speed and prediction accuracy. There is also a return address stack to speed up function return. Mispredictions cause a penalty of about 17 cycles. Using the predicted address, the instruction fetch unit fetches 16 bytes from the instruction cache. • Micro-code and Macro-code • Total pipeline depth is 14 stages • 128 reorder (renaming) buffer size
  • 54. V. ILP TODAY: x86 • Hyper-Threading: – SMT – The processor may stall due to a cache miss, branch misprediction, or data dependency. – Branch misprediction costs 17 cycles
  • 56. V. ILP TODAY: ARM - The average CPI for the ARM7 family is about 1.9 cycles per instruction. - The average CPI for the ARM9 family is about 1.5 cycles per instruction. - The average CPI for the ARM11 family is about 1.39 cycles per instruction.
  • 57. SOURCES Hennessy, J. L., Patterson, D. A., Asanović, K. Computer Architecture: A Quantitative Approach, 5th ed. Morgan Kaufmann/Elsevier, 2012. Wall, D. W. Limits of Instruction-Level Parallelism. In Proc. 4th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 176–188, 1991. Franklin, M., Garcia, D. Computer Science 61C, Lecture 31: Instruction Level Parallelism. UC Berkeley, Fall 2011. Fatehi, E., Gratz, P. V. ILP and TLP in Shared Memory Applications: A Limit Study. In Proc. 23rd International Conference on Parallel Architectures and Compilation, pages 113–126, 2014. Langer, M. MIPS Multicycle Model: Pipelining. Introduction to Computer Systems, McGill University, 2012. Kalla, R., Sinharoy, B., Tendler, J. M. IBM Power5 Chip: A Dual-Core Multithreaded Processor. IEEE Micro, 2004.