1. Computer Organization and
Architecture
(3 Credits/SKS)
Prof. Dr. Bagio Budiardjo
Even Semester 2010/2011
2. About the Course :
Course Objectives: After completing this course, students are
expected to understand and be able to analyze computer
architecture, in particular instruction-set design (e.g.
addressing modes) and its influence on performance. Students
are also expected to understand the meaning of computer
organization, that is, the interconnection of the computer
sub-systems (CPU, memory, bus and I/O) that form a computing
system.
Students are also expected to understand a more advanced
technique in processor design : pipelining.
Key words : architecture, instruction-set design, computer
organization, performance, processor design and
pipelining techniques
3. About the grading scheme :
• This part is not rigid, but the grade will be a
combination of homework, quizzes, exercises,
mid-test and final test, whenever possible.
• One possible scheme is :
Homework : 15% (4)
Mid test : 40%
Final test : 45%
• Grading the homework : maximum 5 points each.
Three levels of grading : Good (5), OK (3)
and Bad (2).
4. The books and supporting materials :
• William Stallings’s book Computer Organization and
Architecture, Seventh Edition, Prentice Hall 2006, will be used
as the main reference for this lecture. A new edition of this
book was issued in 2010 but is still unavailable in Jakarta.
• The classic book Logic and Computer Design Fundamentals, by
M. Morris Mano and Charles R. Kime (Pearson Asia, 2004), is good
but puts too much stress on digital logic. We use material from
this book to explain the hardware design of computer
components, whenever possible.
• Chapters covered will be : Chapters 1, 2, 3, 4, 5, 10, 11
and 13 (Stallings). Additional material about pipelining is
taken from another book.
5. Books and supporting materials - continued
• There will be no handouts (unless very important).
• Lecture notes are distributed on memory stick/CD; the SAP
can be downloaded from SIAK-NG
• Students are encouraged to read books/papers in this field
of study.
Schedule of class :
• At the scheduled time and place (K-102) for about 120
minutes
• Lectures will be given mainly using an LCD projector
6. About the “course direction”
Why do we study Computer Architecture ?
History :
Courses under this name were taught in many
universities long before microprocessors
existed. Years ago, people studied mainframe
architectures : IBM S/370, CDC Cyber, CRAY,
Amdahl, etc.
Since microprocessors emerged, the course has
changed slightly to cover more advanced
topics : computer design and performance issues
7. About the “course direction”
[Diagram: Computer Organization & Architecture (OAK) and the related courses]
• Microprocessors : application of µproc; analyzing and implementing
computer systems to achieve the best processing speed (cost effectiveness)
• Processors Architecture & Design : analyzing processor design, emphasizing
how to obtain better processing speed (cost effectiveness)
• Embedded Systems : embedding µproc-based intelligence into new systems/devices
• Parallel & Distributed Computing Systems : organizing processors/computing
systems to obtain better speed-up with a different processing paradigm
8. About the “course direction” - continued
This course is aimed at :
1. Explaining the phenomena of computer
architecture and computer design :
knowing the basic instruction cycle and its
implications for processing speed
2. Studying the “key” problems :
a. CPU-memory bottleneck
b. CPU-I/O device problems
3. Studying how “performance” could be
improved
Example : CPU-memory : cache memory
4. Studying how execution speed could be improved
with other techniques
Example : pipelining
9. Reasons for studying
Computer Architecture
(Stallings’s arguments)
• Able to select a “proper” computer system for a
particular environment (cost and effectiveness)
• Able to analyze a processor “embedded” in an
environment, e.g. the use of a processor in an
automobile, and able to use the proper tools for
the analysis
• Able to choose proper software for a particular
computer system
11. – Processor Organization : Another view
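[Figure: block diagram of the CPU (Central Processing Unit) built around a single internal BUS]
• Control Unit, IR, PC
• MMU (Memory Management Unit), cache memory, MAR and MBR on the path to/from memory
• Registers R1, R2, R3
• ALU1, ALU2, ADDER, ALU3
• FPU (Floating Point Unit)
Issues : clock speed, gating signal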
13. Frequently Asked Question
What is the role of the CPU clock ?
What is the difference between a P IV/2.4 G and a
P IV/3.0 G ? (CPU clock speed 2.4 and 3.0 GHz)
Consider an instruction of a CPU :
AR R1, R2
(add register : add the content of R1 and the content of
register R2, place the result in R1)
14. – Execution steps of AR R1,R2
The “possible” micro-execution steps are :
a. ALU1 ← [R1] {content of R1 is moved to ALU1}
b. ALU2 ← [R2] {content of R2 is moved to ALU2}
c. ADD {content of ALU1 + ALU2 = ALU3}
d. R1 ← [ALU3] {Result of addition is moved to R1}
If each micro-step is executed in “one” clock cycle,
then this AR instruction needs 4 clock cycles.
For the time being, we ignore the fetch cycle.
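The four micro-steps can be traced with a small Python sketch. This is only an illustration, not part of the original slides: the dictionary-based register model and the function name execute_ar are assumptions, and each micro-step is assumed to cost exactly one clock.

```python
# Hypothetical single-bus model: every micro-step uses the one shared bus
# (or the adder) and is assumed to take exactly one clock cycle.

def execute_ar(regs):
    """Run AR R1,R2 as four micro-steps; return (final state, clock count)."""
    state = dict(regs, ALU1=0, ALU2=0, ALU3=0)
    micro_steps = [
        ("ALU1 <- [R1]", lambda s: s.update(ALU1=s["R1"])),
        ("ALU2 <- [R2]", lambda s: s.update(ALU2=s["R2"])),
        ("ADD",          lambda s: s.update(ALU3=s["ALU1"] + s["ALU2"])),
        ("R1 <- [ALU3]", lambda s: s.update(R1=s["ALU3"])),
    ]
    clocks = 0
    for _name, action in micro_steps:
        action(state)
        clocks += 1  # one micro-step per clock cycle
    return state, clocks


final_state, clocks = execute_ar({"R1": 7, "R2": 5, "R3": 0})
print(final_state["R1"], clocks)  # prints: 12 4
```

With R1 = 7 and R2 = 5, the result 12 lands in R1 after 4 clocks, matching the count on the slide.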
15. Question : How do we fetch the instruction?
(from memory)
• The procedure that brings an instruction from memory
to the CPU (IR) is called the instruction fetch
• PC always holds the address of the (next) instruction in
memory
• PC transfers the address to MAR, and memory is READ
• PC is usually incremented by 1 (to point to the next instruction)
• The instruction is placed by memory in MBR
• The content of MBR is transferred to IR
(the instruction is fetched, ready to be executed)
16. Question : How do we fetch the instruction?
(from memory) - continued
• Or with register transfer language, we could express the
fetch cycle as
1. MAR ← [PC]
2. READ (memory) and wait for completion
3. IR ← [MBR]
In terms of the CPU clock, these steps may take up to 50 CPU
clocks, depending on the memory clock speed.
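As a rough illustration, the fetch cycle can be sketched in Python. The names fetch and memory_wait_clocks are hypothetical. The wait of 48 clocks is an assumed figure, chosen only so that the total matches the “up to 50 clocks” mentioned above, and the PC increment is assumed to overlap with the memory wait.

```python
def fetch(cpu, memory, memory_wait_clocks=48):
    """Perform one instruction fetch; return the CPU clocks consumed."""
    clocks = 0
    cpu["MAR"] = cpu["PC"]            # 1. MAR <- [PC]
    clocks += 1
    cpu["MBR"] = memory[cpu["MAR"]]   # 2. READ memory and wait for completion
    clocks += memory_wait_clocks      #    (assumed memory latency)
    cpu["PC"] += 1                    #    PC now points to the next instruction
    cpu["IR"] = cpu["MBR"]            # 3. IR <- [MBR]
    clocks += 1
    return clocks


cpu = {"PC": 0, "MAR": 0, "MBR": None, "IR": None}
memory = ["AR R1,R2", "MOV R1,R3"]    # toy memory holding instruction text
fetch_clocks = fetch(cpu, memory)
print(cpu["IR"], cpu["PC"], fetch_clocks)  # prints: AR R1,R2 1 50
```

Compared with the 4 clocks of the execute phase, the fetch dominates, which is the CPU-memory bottleneck the course returns to later.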
17. – Processor Organization – continued.1
[Figure: single-bus CPU diagram (Control Unit, IR, PC, MAR, MBR, R1, R2, R3, ALU1, ALU2, ADDER, ALU3, BUS) showing micro-step a : ALU1 ← [R1]. Legend : marked paths/units are inactive.]
18. – Processor Organization – continued.2
[Figure: single-bus CPU diagram (Control Unit, IR, PC, MAR, MBR, R1, R2, R3, ALU1, ALU2, ADDER, ALU3, BUS) showing micro-step b : ALU2 ← [R2]. Legend : marked paths/components are inactive.]
19. – Processor Organization – continued.3
[Figure: single-bus CPU diagram (Control Unit, IR, PC, MAR, MBR, R1, R2, R3, ALU1, ALU2, ADDER, ALU3, BUS) showing micro-step c : ADD. Legend : marked paths/components are inactive.]
20. – Processor Organization – continued.4
[Figure: single-bus CPU diagram (Control Unit, IR, PC, MAR, MBR, R1, R2, R3, ALU1, ALU2, ADDER, ALU3, BUS) showing micro-step d : R1 ← [ALU3]. Legend : marked paths/components are inactive.]
21. Analysis of Instruction Cycle
• With a single bus, execution is slow, since only one
transfer can be executed in each “clock”
• Is there any other way to “improve” the speed?
• A dual-bus processor may be faster
• Additional processor cost
22. Dual processor-bus : A way to improve speed
[Figure: dual-bus CPU (buses 1 and 2) connecting R1, R2, R3, ALU1, ALU2, ADDER, ALU3 and the other components (Control Unit, IR, PC, MAR, MBR)]
1. ALU1 ← [R1] (bus1)
   ALU2 ← [R2] (bus2)
2. ADD
3. R1 ← [ALU3] (bus1)
Only 3 clock cycles needed : 25% fewer than the 4 cycles of the single bus
How about this :
1. ALU1 ← [R1] (bus1)
   ALU2 ← [R2] (bus2)
   ADD
2. R1 ← [ALU3] (bus1)
Only 2 clock cycles needed : 50% fewer than the single bus
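The percentages quoted for the dual bus are reductions in clock count relative to the 4-clock single-bus sequence. The short Python sketch below (the helper name compare is hypothetical) computes both the fraction of clocks saved and the corresponding speedup factor, which are easy to confuse.

```python
def compare(baseline_clocks, new_clocks):
    """Return (fraction of clocks saved, speedup factor)."""
    saved = (baseline_clocks - new_clocks) / baseline_clocks
    speedup = baseline_clocks / new_clocks
    return saved, speedup


SINGLE_BUS_CLOCKS = 4  # AR R1,R2 on the single-bus processor

for label, clocks in [("dual bus, 3 steps", 3), ("dual bus, 2 steps", 2)]:
    saved, speedup = compare(SINGLE_BUS_CLOCKS, clocks)
    print(f"{label}: {saved:.0%} fewer clocks, {speedup:.2f}x speedup")
# prints:
# dual bus, 3 steps: 25% fewer clocks, 1.33x speedup
# dual bus, 2 steps: 50% fewer clocks, 2.00x speedup
```

So the 2-step version halves the clock count, which is a 2x speedup of the execute phase only; the fetch cycle is still ignored here.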
23. Triple processor-bus : Can the processing speed be improved?
[Figure: triple-bus CPU (buses 1, 2 and 3) connecting R1, R2, R3, ALU1, ALU2, ADDER, ALU3 and the other components (Control Unit, IR, PC, MAR, MBR). Please notice the direction of the arrows.]
If all the CPU components (registers, ALUs and adder) could work in
one third (1/3) of a clock cycle (transfer of bits, adding numbers),
how many clock(s) are needed to complete an addition operation
(ADD R1,R2) ?
Write down the “register transfer” language (micro-instruction steps)!
24. Program Execution
• A scientific program written in assembly language is run on a
microprocessor with a 1 GHz clock. To complete the program, it needs
to execute :
a. 150,000 arithmetic instructions (e.g. ADD R1,R2; MUL R1,R3;
etc.)
b. 250,000 register transfer instructions (e.g. MOV R1,R2; etc.)
c. 100,000 memory access instructions (e.g. LOAD R1,X; STORE
R2,Y; etc.)
If arithmetic instructions need on average 2 clocks (to complete),
register transfer instructions 1 clock and memory access
instructions 10 clocks, calculate the average CPI (clocks per
instruction) of the above program.
How long does it take to complete the program (in seconds)?
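One way to work the exercise is sketched below in Python. The variable names are illustrative; the instruction counts, clocks per instruction and the 1 GHz clock come from the problem statement. Average CPI is total clocks divided by total instructions, and execution time is total clocks divided by the clock frequency.

```python
CLOCK_HZ = 1_000_000_000  # 1 GHz

# instruction class -> (instruction count, average clocks per instruction)
workload = {
    "arithmetic":        (150_000, 2),
    "register transfer": (250_000, 1),
    "memory access":     (100_000, 10),
}

total_instructions = sum(count for count, _ in workload.values())
total_clocks = sum(count * cpi for count, cpi in workload.values())

average_cpi = total_clocks / total_instructions
execution_time_s = total_clocks / CLOCK_HZ

print(total_instructions, total_clocks)  # prints: 500000 1550000
print(average_cpi)                       # prints: 3.1
print(execution_time_s)                  # prints: 0.00155
```

Although memory access instructions are only 20% of the instruction count, they account for 1,000,000 of the 1,550,000 clocks.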
25. Can it be “one clock?” – Yes it can !
Views of Other Books on “Micro Operations”
• The bus is called the “data path”
• It consists not only of the bus (a bunch of wires) but also of
other digital devices
• Enable signals are used to speed up execution
• Additional (processor) cost
26. Datapath Example :
Taken from Mano and Kime’s book
• Four parallel-load registers (R0 to R3)
• Two mux-based register selectors
• Register destination decoder
• Mux B for external constant input
• Buses A and B with external address and data outputs
• ALU and Shifter with Mux F for output select
• Mux D for external data input
• Logic for generating status bits V, C, N, Z
[Figure: n-bit datapath. Register file (R0 to R3 with Load inputs, destination decoder, A select and B select muxes; inputs : Load enable/Write, A address, B address, D address, D data) driving A data and B data; MUX B (MB select, Constant in) onto Bus B; Address Out and Data Out; Function unit with arithmetic/logic unit (G select, 4 bits), Shifter (H select, 2 bits), Zero Detect and MUX F (MF select); MUX D (MD select, Data In) driving Bus D.]
27. Datapath Example: Performing a Microoperation
Microoperation: R0 ← R1 + R2
• Apply 01 to A select to place contents of R1 onto Bus A
• Apply 10 to B select to place contents of R2 onto B data and
apply 0 to MB select to place B data on Bus B
• Apply 0010 to G select to perform addition G = Bus A + Bus B
• Apply 0 to MF select and 0 to MD select to place the value of G onto
Bus D
• Apply 00 to Destination select to enable the Load input to R0
• Apply 1 to Load Enable to force the Load input to R0 to 1 so that R0 is
loaded on the clock pulse (not shown)
• The overall microoperation requires 1 clock cycle (!)
[Figure: the datapath of the previous slide, with the control inputs above applied]
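The control settings above can be tried out on a very simplified software model of the datapath. The Python sketch below is an assumption-laden illustration, not the book's design: the function name datapath_cycle, the 8-bit width and the list-based register file are hypothetical, and only the codes named on the slide (G select 0010 for add, MF select 0, and so on) are modeled.

```python
def datapath_cycle(regs, a_sel, b_sel, mb_sel, g_sel, mf_sel, md_sel,
                   dest_sel, load_enable, constant_in=0, data_in=0, n=8):
    """Return the register file after one clock of a simplified datapath."""
    mask = (1 << n) - 1
    bus_a = regs[a_sel]                              # A select mux
    bus_b = constant_in if mb_sel else regs[b_sel]   # B select mux, then MUX B
    if g_sel == 0b0010:                              # ALU: G = Bus A + Bus B
        g = (bus_a + bus_b) & mask
    else:
        raise NotImplementedError("only G select 0010 (add) is modeled")
    if mf_sel != 0:                                  # MUX F: 0 selects ALU output
        raise NotImplementedError("shifter path (MF select = 1) is not modeled")
    bus_d = data_in if md_sel else g                 # MUX D
    new_regs = list(regs)
    if load_enable:                                  # decoder enables one Load input
        new_regs[dest_sel] = bus_d
    return new_regs


# R0 <- R1 + R2 using the control values from the slide
regs = [0, 20, 22, 0]
regs = datapath_cycle(regs, a_sel=0b01, b_sel=0b10, mb_sel=0, g_sel=0b0010,
                      mf_sel=0, md_sel=0, dest_sel=0b00, load_enable=1)
print(regs)  # prints: [42, 20, 22, 0]
```

The whole function corresponds to a single clock: the muxes and ALU are combinational, and only the final register load waits for the clock pulse.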
28. Lesson Learned
• We could improve the instruction execution speed by
increasing the processor clock speed (can we?)
• We could improve the instruction execution speed by
implementing a dual bus (can we?)
• We can (partly) overcome the CPU-memory bottleneck by
inserting cache memory between the CPU and main memory
(can we?)
• Is there any other way to improve instruction execution
speed (increase performance)? - pipelining
• Do these improvements incur extra cost? (cost vs
performance issue)
29. What do we get after studying Computer
Architecture ?
• It is always a complicated question to answer.
• Basically we learn about processor design
issues, namely the hardware of a computer, but it is
taught through “software” logic.
• At least we know the basic building blocks of a
computer
• We know the design development trends
30. What is our topic ?
Instruction Set Architecture (ISA)
[Diagram: layers from top to bottom : Application Program; Compiler and OS; ISA; CPU Design; Circuit Design; Chip Layout]
32. 1.1. Introduction : Organization & Architecture
• Organization and Architecture : two terms that are often
confused
• Computer organization refers to the operational units and
their interconnections that realize the architectural
specifications (!)
• Computer architecture refers to those attributes of a
system visible to a programmer, or put another way, those
attributes that have a direct impact on the logical execution
of a program (!)
• The latter definition (architecture) is more concerned with
performance than the former (organization)
33. 1.1. Introduction - continued
• Architecture is more concerned with basic instruction
design, which may lead to better system performance
• Organization is the implementation of a computer
system in terms of the interconnection of its functional units :
CPU, memory, bus and I/O devices.
• Example : the IBM S/370 family architecture. Plenty of IBM
products have the same architecture (S/370) but different
organizations, depending on their price/performance targets.
Cost and performance differentiate the organizations
• So, the organization of a computer is the implementation of
its architecture, tailored to fit the intended price and
performance targets.
35. ENIAC - background
• Electronic Numerical Integrator And Computer
• Eckert and Mauchly
• University of Pennsylvania
• Trajectory tables for weapons
• Started 1943
• Finished 1946
– Too late for war effort
• Used until 1955
36. ENIAC - details
• Decimal (not binary)
• 20 accumulators of 10 digits
• Programmed manually by switches
• 18,000 vacuum tubes
• 30 tons
• 15,000 square feet
• 140 kW power consumption
• 5,000 additions per second
41. IAS - details
• 1000 x 40 bit words
– Binary number
– 2 x 20 bit instructions
• Set of registers (storage in CPU)
– Memory Buffer Register
– Memory Address Register
– Instruction Register
– Instruction Buffer Register
– Program Counter
– Accumulator
– Multiplier Quotient
42. 2.1. Evolution and Performance - history
• 1946 : Von Neumann and his colleagues proposed the IAS computer
(named after the Institute for Advanced Study)
• The design included :
– main memory
– ALU
– Control Unit
– I/O
• First stored-program design, able to perform :
+, -, x, :
• The “father” of all modern computers/processors
46. 2.1. Evolution and Performance - history
IAS components are :
• MBR (memory buffer register), MAR (memory address
register), IR (instruction register), IBR (instruction buffer
register), PC (program counter), AC (accumulator) and
MQ (multiplier quotient), memory (1000 locations)
• 20-bit instruction : 8-bit opcode, 12-bit address (addressing
one of 1000 memory locations, 0 to 999)
• 40-bit data word : 1 sign bit and 39 bits of value
• Operations : data transfer between registers and ALU,
unconditional branch, conditional branch, arithmetic,
address modify
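The IAS instruction layout (two 20-bit instructions per 40-bit word, each with an 8-bit opcode and a 12-bit address) can be illustrated with a short Python sketch. The function names are hypothetical, the opcode values 0x01 and 0x05 are placeholders chosen for the example, and placing the left instruction in the high-order 20 bits is an assumption of this sketch.

```python
OPCODE_BITS = 8
ADDRESS_BITS = 12
INSTRUCTION_BITS = OPCODE_BITS + ADDRESS_BITS   # 20 bits per instruction


def encode_instruction(opcode, address):
    """Pack an 8-bit opcode and a 12-bit address into 20 bits."""
    assert 0 <= opcode < (1 << OPCODE_BITS)
    assert 0 <= address < (1 << ADDRESS_BITS)
    return (opcode << ADDRESS_BITS) | address


def decode_instruction(instruction):
    """Split a 20-bit instruction into (opcode, address)."""
    return instruction >> ADDRESS_BITS, instruction & ((1 << ADDRESS_BITS) - 1)


def decode_word(word):
    """Split a 40-bit word into its (left, right) decoded instructions."""
    left = word >> INSTRUCTION_BITS
    right = word & ((1 << INSTRUCTION_BITS) - 1)
    return decode_instruction(left), decode_instruction(right)


# Two placeholder instructions referring to memory locations 500 and 501
word = (encode_instruction(0x01, 500) << INSTRUCTION_BITS) | encode_instruction(0x05, 501)
print(decode_word(word))        # prints: ((1, 500), (5, 501))
print(2 ** ADDRESS_BITS)        # prints: 4096
```

A 12-bit address field can name 4096 locations, so it comfortably covers the 1000 words of IAS memory.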
47. 2.1. Evolution - History of Commercial computers
• First Generation : in 1950 Mauchly & Eckert developed
UNIVAC I, used by the Census Bureau
• Then came UNIVAC II, which later grew into the UNIVAC 1100
series (1103, 1104, 1105, 1106, 1108) - vacuum tubes and later
transistors
• Second Generation : transistors, e.g. IBM 7094 (NCR, RCA and
others also tried to develop their own versions, but were
commercially not successful)
• Third Generation : Integrated Circuit (IC) - SSI. IBM S/360
was the successful example
• Later generations (possibly fourth and fifth) : LSI and VLSI
technology
51. 2.1. Evolution - System 360 Family
Characteristic                 Model 30  Model 40  Model 50  Model 65  Model 75
--------------------------------------------------------------------------------
Max memory size (bytes)        64K       256K      256K      512K      512K
Memory data rate (MB/s)        0.5       0.8       2.0       8.0       16.0
Processor cycle time (µs)      1.0       0.625     0.5       0.25      0.2
Relative speed                 1         3.5       10        21        50
Max number of data channels    3         3         4         6         6
Max channel data rate (KB/s)   250       400       800       1250      1250
--------------------------------------------------------------------------------
• The family architecture gave rise to the terms upward and
downward compatible
52. Generations of Computer
• Vacuum tube - 1946-1957
• Transistor - 1958-1964
• Small scale integration - 1965 on
– Up to 100 devices on a chip
• Medium scale integration - to 1971
– 100-3,000 devices on a chip
• Large scale integration - 1971-1977
– 3,000 - 100,000 devices on a chip
• Very large scale integration - 1978 to date
– 100,000 - 100,000,000 devices on a chip
• Ultra large scale integration
– Over 100,000,000 devices on a chip
53. Moore’s Law
• Increased density of components on chip
• Gordon Moore - cofounder of Intel
• Number of transistors on a chip will double every year
• Since 1970’s development has slowed a little
– Number of transistors doubles every 18 months
• Cost of a chip has remained almost unchanged
• Higher packing density means shorter electrical paths,
giving higher performance
• Smaller size gives increased flexibility
• Reduced power and cooling requirements
• Fewer interconnections increases reliability
57. IBM 360 series
• 1964
• Replaced (& not compatible with) 7000 series
• First planned “family” of computers
– Similar or identical instruction sets
– Similar or identical O/S
– Increasing speed
– Increasing number of I/O ports (i.e. more terminals)
– Increased memory size
– Increased cost
• Multiplexed switch structure
58. 2.1. Evolution - Later generations
• Semiconductor memories :
1K,4K,16K,64K,256K,1M,4M,16 Mbits on a single chip
At present : 256 Mbit, 512 Mbit per chip
• Microprocessors appeared :
Intel 4004 (1971), Intel 8008 (1972), Intel 8080 (8-bit, 1974),
8086 (16-bit, 1978), 80386 (32-bit, 1985) onward.
• At almost the same time, Motorola : 6800 (8-bit), 68000
(16-bit), 68010 (16-bit), 68020 (32-bit), 68030/40 (32-bit)
• Then Motorola’s products disappeared commercially
• Intel products have dominated the market since the appearance
of the IBM PC
59. 2.1. Evolution of Microprocessors
Table 2.2
-----------------------------------------------------------------
Feature                   8008    8080    8086    80386   80486
-----------------------------------------------------------------
Year introduced           1972    1974    1978    1985    1989
# of instructions         66      111     133     154     235
Address bus width         8       16      20      32      32
Data bus width            8       8       16      32      32
# of registers            8       8       16      8       8
Memory addressability     16KB    64KB    1 MB    4 GB    4 GB
Bus bandwidth (MB/s)      -       0.75    5       32      32
Reg-reg add time (µs)     -       1.3     0.3     0.125   0.06
-----------------------------------------------------------------
60. 2.2 Designing for Performance
• Prices of µprocessors continue to drop every year
• $1000 is today’s price for an advanced system : in it you
may find more than 100 million transistors !
• Even 100 million sheets of toilet paper cost more !!
• Computing power is virtually free !!
• People solve problems that were never thought possible
before : image processing, speech recognition,
videoconferencing, multimedia authoring, etc.
• We need more and more computing power
• The organization and architecture of today’s processors
remain (basically) the same as those of the IAS !
• The algorithms used to improve speed and efficiency differ !
61. 2.2. Designing - µprocessor speed
• Intel Pentium and PowerPC follow Moore’s Law :
By shrinking the size of lines in IC chips by 10%, the industry may get
a new IC with 4 times the transistor density every 3 years !
• The above law also holds for DRAM (Dynamic Random Access
Memory)
• Even if capacity increases, speed does not increase
automatically
• More work is needed in designing instructions
• Also, techniques for faster instruction execution must be
developed : branch prediction, data flow analysis and
speculative execution
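The “4 times every 3 years” figure is the same growth rate as the “doubles every 18 months” form of Moore’s Law. A minimal Python sketch of the arithmetic follows; the function name density_growth is hypothetical and steady exponential growth is assumed.

```python
def density_growth(years, factor=4.0, period_years=3.0):
    """Growth multiple after `years`, assuming `factor`x every `period_years`."""
    return factor ** (years / period_years)


for years in (1.5, 3, 6):
    print(f"after {years} years: about {density_growth(years):.1f}x")
```

Under this assumption, 1.5 years gives about 2x, 3 years gives 4x and 6 years gives 16x the transistor density.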
63. Pentium Evolution (1)
• 8080
– first general purpose microprocessor
– 8 bit data path
– Used in first personal computer – Altair
• 8086
– much more powerful
– 16 bit
– instruction cache, prefetch few instructions
– 8088 (8 bit external bus) used in first IBM PC
• 80286
– 16 Mbyte memory addressable
– up from 1 Mbyte
• 80386
– 32 bit
– Support for multitasking
64. Pentium Evolution (2)
• 80486
– sophisticated powerful cache and instruction pipelining
– built in maths co-processor
• Pentium
– Superscalar
– Multiple instructions executed in parallel
• Pentium Pro
– Increased superscalar organization
– Aggressive register renaming
– branch prediction
– data flow analysis
– speculative execution
65. Pentium Evolution (3)
• Pentium II
– MMX technology
– graphics, video & audio processing
• Pentium III
– Additional floating point instructions for 3D graphics
• Pentium 4
– Note Arabic rather than Roman numerals
– Further floating point and multimedia enhancements
• Itanium
– 64 bit
– see chapter 15
• See Intel web pages for detailed information on processors
68. Summary: Important Points
• Organization and Architecture
• Family Architectures
• Functions of a Computer (data processing, control, data movement)
• Birth of Computers (ENIAC : decimal, IAS : binary), Mauchly-Eckert
• Microprocessors (Intel 4004, 8008, 8080, 8086/16-bit, 80386/32-bit)
• IAS Instructions
• Von Neumann bottleneck
• Increasing clock speed, making the bus wider, cache memory
• Losers : e.g. Motorola microprocessors, Radio Shack
• Denser transistors on a single chip (4 times every 3 years, by
shrinking lines by 10%)