Summary Of Course Projects
1. SUMMARY OF COURSE PROJECTS
SETIAWAN SOEKAMTOPUTRA
MASTER OF ELECTRICAL AND COMPUTER ENGINEERING
ILLINOIS INSTITUTE OF TECHNOLOGY
DECEMBER 2010 GRADUATE
2. CONTENTS
• 32-bit Pipelined CPU
• MC68K-Based Monitor Program
• Pipelined MIPS Processor with hazard handler and data forwarding
• Simple Mesh-Like and Ring-Like Network on Chip Design
• Small office network design
• 4-bit 10T adder circuit with dual-Vt logic design
• Single-ended 6T vs. standard 6T SRAM bitcell design
• QR Matrix Factorization
• Electro Active Polymer Energy Harvesting Design
• Advanced Encryption Standard Hardware Design
3. SPRING 2009
• Introduction to VLSI Design
• 32-bit Pipelined CPU
• Multiplier with accumulator and pipeline optimization
• Microcomputer
• MC68K-Based Monitor Program
• Advanced Computer Architecture
• Pipelined MIPS Processor with hazard handler and data forwarding
4. 32-BIT PIPELINED CPU
• Hardware Description Language
• Verilog
• Tools
• Compiler: Cadence Verilog XL
• Logic Synthesis: Synopsys Design Compiler
• Simulation tools: Cadence's SimVision, Mentor Graphics ModelSim
• Place and Route: Cadence SOC Encounter
• Objectives
• Run the full ASIC flow for this implementation using Verilog
• RTL, post-synthesis, and post-PR simulation for verification
• Determine maximum frequency, area, delay, and power
5. 32-BIT PIPELINED CPU
• 32-bit Memory File
• Eight ALU functions, including multiplication, addition, subtraction, OR, AND, XOR, and XNOR
• M: multiplicand, N: multiplier
• Multiplier:
• Radix-2^r encoding produces N/r partial products for an N-bit multiplier
• Radix-4 Booth-encoded multiplier: halves the number of partial products (N/2 vs. N)
• Wallace tree: reduces the number of logic levels required to sum the partial products
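The partial-product reduction can be illustrated with a minimal Python sketch of radix-4 Booth recoding. This is a behavioral model under stated assumptions, not the project's Verilog: the function names `booth_radix4_digits` and `booth_multiply` are hypothetical, the multiplier is assumed to fit in `n_bits` as a two's-complement value, and the Wallace tree is replaced by a plain sum.

```python
def booth_radix4_digits(multiplier: int, n_bits: int) -> list[int]:
    """Recode an n_bits-wide two's-complement multiplier into radix-4 Booth digits.

    Each digit lies in {-2, -1, 0, 1, 2}; digit k carries weight 4**k.
    """
    if n_bits % 2 != 0:
        raise ValueError("n_bits must be even for radix-4 recoding")
    bits = multiplier & ((1 << n_bits) - 1)  # two's-complement bit pattern
    digits = []
    prev = 0  # implicit 0 to the right of bit 0
    for k in range(0, n_bits, 2):
        b0 = (bits >> k) & 1
        b1 = (bits >> (k + 1)) & 1
        digits.append(b0 + prev - 2 * b1)  # Booth radix-4 recoding of (b1, b0, prev)
        prev = b1
    return digits


def booth_multiply(multiplicand: int, multiplier: int, n_bits: int = 32) -> int:
    """Form one partial product per Booth digit (n_bits / 2 in total) and sum them."""
    digits = booth_radix4_digits(multiplier, n_bits)
    partial_products = [(d * multiplicand) << (2 * k) for k, d in enumerate(digits)]
    return sum(partial_products)  # a Wallace tree would perform this sum in hardware


print(len(booth_radix4_digits(-60, 32)))  # number of partial products for N = 32
print(booth_multiply(8, -60))
```

For a 32-bit multiplier the recoding yields 16 digits, so 16 partial products are summed instead of 32.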
10. 32-BIT PIPELINED CPU
• Case studies:
• Case 1: Replace the ALU multiplier with a multiplier-accumulator (MAC), useful for implementing DSP functions
• Case 2: Pipeline optimization
• MAC benefit: reduces the number of instructions needed to compute a sum of products.
• Pipeline optimization is applied by inserting registers along the critical path (in this case, the MAC unit)
14. 32-BIT PIPELINED CPU
• Provided:
• Multiplier accumulator block diagram
• Simple CPU design written in Verilog
• All required tools
• Implementation
• Build the aforementioned MAC unit in Verilog and modify the design to accommodate it
• Insert a number of registers for pipelining
• Design functionality test
• Verify in simulation that the function F = (-10)*5 + (-60)*2 + (-60)*8 produces the correct result
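The expected value for this test can be worked out with a small behavioral sketch of a multiply-accumulate loop. This is a reference model in Python, not the Verilog unit; `mac_accumulate` is a hypothetical name, and the 32-bit two's-complement view is included on the assumption that the result is read from a 32-bit register in the simulator.

```python
def mac_accumulate(pairs, acc=0):
    """Apply one multiply-accumulate step per (a, b) pair: acc <- acc + a * b."""
    for a, b in pairs:
        acc += a * b
    return acc


F = mac_accumulate([(-10, 5), (-60, 2), (-60, 8)])
print(F)                               # -650
print(format(F & 0xFFFFFFFF, "08X"))   # FFFFFD76, the 32-bit two's-complement pattern
```

Each pair costs one MAC step here, whereas a separate multiplier and adder would need a multiply followed by an add for every term after the first.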
16. 32-BIT PIPELINED CPU
• Additional Analysis Result
• Finding the maximum frequency
• Expected maximum frequency of the design: 58 MHz
• Frequency vs. area vs. power consumption
17. MC68K-BASED MONITOR PROGRAM
• Instructor: Dr. Jafar Saniie
• Requirements/Specifications
• Build a simple monitor program for the MC68000 processor that lets the user perform common memory and register accesses and exercise basic exception handlers.
• Language
• 68000 assembly language
• Tools
• EASy68K Editor/Assembler/Simulator
20. MC68K-BASED MONITOR PROGRAM
• Includes a command interpreter that checks and validates user input.
• Monitor debugger commands:
• MEMD: Memory Display
• MEMS: Memory Set
• SORT: Memory Sort
• FILL: Memory Fill
• MOVE: Memory Move
• MEMM: Memory Modify
• FIND: Block Memory Search
• REGM: Register Modify
• REGD: Register Display
• RUNS: Execute program at a specified location
21. MC68K-BASED MONITOR PROGRAM
• Monitor debugger exception-handling commands:
• TBUS: Bus Error Exception
• TADD: Address Error Exception
• TILL: Illegal Instruction Exception
• TPRI: Privilege Violation Exception
• TDIV: Division by Zero Exception
22. MC68K-BASED MONITOR PROGRAM
• Results (a subset of the 17 commands implemented)
[Screenshots: register display, memory display, command interpreter]
23. HIGH-PERFORMANCE PIPELINED MIPS PROCESSOR
• MIPS (Microprocessor without Interlocked Pipeline Stages) is a reduced instruction set computer (RISC) instruction set architecture (ISA)
• Instructor: Prof. Jia Wang
• Requirements/Specifications
• Design a MIPS processor with pipelining, data forwarding, and hazard handling capabilities.
• Run RTL simulation to verify functionality
• Language
• VHDL
• Tools
• ModelSim PE 6.5
• MARS 3.6 MIPS Simulator
• Provided:
• Data memory unit design
• Testbench code
24. HIGH-PERFORMANCE PIPELINED MIPS PROCESSOR
• Data width: 32-bit
• 5-stage pipeline: Instruction Fetch, Instruction Decode, Execute, Memory Access, Write-Back
• Main modules: Program Counter (PC), Control Unit, ALU Control Unit, Register File, ALU, Instruction Memory, Data Memory, Hazard Detection Unit, Forwarding Unit
• Branch hazard
• Branch calculation occurs in the Instruction Decode stage
• A branch miss costs only one stall cycle
• Data hazard
• Stall if data still being written is needed by the next instruction
• Data forwarding
• Result data is used immediately rather than being written back to the register file first
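The forwarding and stall decisions can be sketched behaviorally in Python. This is a minimal model assuming the textbook five-stage MIPS conditions (forward from the EX/MEM or MEM/WB pipeline register, stall only on a load followed immediately by a dependent instruction); it is not the project's VHDL, and `PipeReg`, `forward_select`, and `must_stall` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class PipeReg:
    """Fields of a pipeline register that the hazard logic inspects."""
    reg_write: bool = False  # instruction will write the register file
    mem_read: bool = False   # instruction is a load
    rd: int = 0              # destination register number


def forward_select(src: int, ex_mem: PipeReg, mem_wb: PipeReg) -> str:
    """Choose where the ALU operand for source register src comes from."""
    if ex_mem.reg_write and ex_mem.rd != 0 and ex_mem.rd == src:
        return "EX/MEM"      # newest result takes priority
    if mem_wb.reg_write and mem_wb.rd != 0 and mem_wb.rd == src:
        return "MEM/WB"
    return "REGFILE"         # no hazard: use the value read in Decode


def must_stall(id_ex: PipeReg, rs: int, rt: int) -> bool:
    """Load-use hazard: the load's data is not available in time to forward."""
    return id_ex.mem_read and id_ex.rd != 0 and id_ex.rd in (rs, rt)


# add $1, ... immediately followed by an instruction that reads $1
print(forward_select(1, PipeReg(reg_write=True, rd=1), PipeReg()))
# lw $2, ... immediately followed by an instruction that reads $2
print(must_stall(PipeReg(reg_write=True, mem_read=True, rd=2), rs=2, rt=3))
```

Register 0 is excluded because it is hardwired to zero in MIPS, so a write to it never creates a real dependency.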
28. FALL 2009
• Hardware/Software Co-Design
• Simple Mesh-Like Network on Chip Design
• Simple Ring-Like Network on Chip Design
• Introduction to Computer Network
• Design of 2-story small office computer network
29. HARDWARE/SOFTWARE CO-DESIGN
• Projects:
• Network on chip prototype design with three nodes
• Simple Mesh-Like Network on Chip Design
30. NETWORK ON CHIP PROTOTYPE DESIGN WITH THREE NODES
• Instructor: Prof. Jia Wang
• Specifications
• Three-node NoC architecture in a partially connected mesh topology
• Three processing elements and three routers.
• Queue system: FIFO
• Language
• SystemC running on Visual C++
• Tools
• Microsoft Visual C++
31. NETWORK ON CHIP PROTOTYPE DESIGN WITH THREE NODES
• Three-node NoC System Diagram
• Function of the third node (called PE_dumpbox)
• It receives all packets that the destination processing unit cannot process because the network is overloaded
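The dump-box behavior can be sketched in a few lines of Python. This is a minimal model under stated assumptions rather than the SystemC design: the `Router` class, its `receive` and `deliver` methods, and the buffer depth of 2 are all hypothetical, since the slides do not give the interface or the FIFO depth.

```python
from collections import deque


class Router:
    """Router with a bounded FIFO; overflow packets are diverted to a dump box."""

    def __init__(self, depth, dumpbox):
        self.depth = depth        # assumed buffer depth (not given in the slides)
        self.fifo = deque()
        self.dumpbox = dumpbox    # plain list standing in for PE_dumpbox

    def receive(self, packet):
        """Queue the packet, or divert it when the FIFO is already full."""
        if len(self.fifo) < self.depth:
            self.fifo.append(packet)
            return True
        self.dumpbox.append(packet)
        return False

    def deliver(self):
        """Hand the oldest queued packet to the local PE (one packet per call)."""
        return self.fifo.popleft() if self.fifo else None


pe_dumpbox = []
router1 = Router(depth=2, dumpbox=pe_dumpbox)
accepted = [router1.receive(p) for p in ("pkt0", "pkt1", "pkt2")]
print(accepted, pe_dumpbox)
```

With three packets arriving before any is delivered, the third one overflows the two-entry buffer and ends up in the dump box.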
32. NETWORK ON CHIP PROTOTYPE DESIGN WITH THREE NODES
• Results
• Overload in the Router 1 network buffer at cycle 3
• The third processing unit, PE_dumpbox, receives the packet
33. MESH-LIKE NETWORK ON CHIP PROTOTYPE DESIGN
• Specifications
• A simple mesh-like NoC architecture
• Each router has one processing element (PE)
• Queue system: FIFO
• Size: 4-by-4 grid
• Language
• SystemC
• Tools
• Microsoft Visual C++
35. MESH-LIKE NETWORK ON CHIP PROTOTYPE DESIGN
• Results
• Generated packets
• The results show that the packets are delivered
36. MESH-LIKE NETWORK ON CHIP PROTOTYPE DESIGN
• Results
• Delays arise because only one packet at a time is delivered to a processing element (PE)
37. MESH-LIKE NETWORK ON CHIP PROTOTYPE DESIGN
• Benefit and drawback:
• Benefit: packets reach the destination address in fewer hops, reducing contention and increasing the average bit rate.
• Drawback: the design becomes more complex and more wires are needed.
38. INTRODUCTION TO COMPUTER NETWORK
• Project:
• Design a prototype of a 2-story small office computer network capable of serving 20 users, with three department LANs, four servers, and wireless Internet
• Language
• N/A
• Tools
• Microsoft Visio
39. SMALL OFFICE NETWORK DESIGN
• Proposed configurations
• IP address allocation
41. SMALL OFFICE NETWORK DESIGN
• Office Layout
[Figure: 1st-floor and 2nd-floor layouts; colored arrows show how cables are managed]
42. SPRING 2010
• Advanced VLSI
• 4-bit 10T adder circuit with dual-Vt logic design
• High Performance VLSI IC System
• Single-ended 6T vs. standard 6T SRAM bitcell design comparison
• QR Factorization
• Implementing QR factorization algorithm in C
43. 4-BIT 10T ADDER CIRCUIT WITH DUAL-VT LOGIC DESIGN
• Project:
• 4-bit 10T adder circuit with dual-Vt logic design
• Specifications
• Adder circuit is based on:
J. Lin, M. Sheu, and C. Ho. A Novel High-Speed and Energy Efficient 10-Transistor Full Adder Design. IEEE Trans. on Circuits and Systems, May 2007.
• Adder: cascaded ripple-carry adders
• Technology node: 45 nm (FreePDK)
• Supply voltage: 1.1 V @ 25 MHz
• Performance measurements (delay and power consumption) for the 10T adder circuit using high-threshold-voltage (high-Vt), low-Vt, and dual-Vt transistors
• Tools
• Cadence Virtuoso Schematic Design
• Synopsys HSPICE Simulator
• Nanosim Simulator
44. 4-BIT 10T ADDER CIRCUIT WITH DUAL-VT LOGIC DESIGN
• High Vt vs. low Vt
• Full Adder Design (1-bit)
• Complementary and level restoring carry logic (CLRCL)
45. 4-BIT 10T ADDER CIRCUIT WITH DUAL-VT LOGIC DESIGN
• Full Adder Design (1-bit): Critical Path
• Dual-Vt: low-Vt transistors are used on the critical path for speed; high-Vt transistors are used elsewhere for low leakage
• The NMOS in the multiplexer and the PMOS in the inverter are low-Vt transistors
46. 4-BIT 10T ADDER CIRCUIT WITH DUAL-VT LOGIC DESIGN
• Logic Equations
Sum = (A XNOR B)·Cin + (A XOR B)·Cin_bar
Cout = (A XOR B)·Cin + (A XNOR B)·A
• Design Components
• Inverter (left) and multiplexer (right)
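The two logic equations can be checked against ordinary binary addition with a short Python sketch. It models only the Boolean function, not the transistor-level circuit; `full_adder_10t` and `ripple_add4` are hypothetical names, and the 4-bit version assumes the carry simply ripples from bit 0 to bit 3.

```python
from itertools import product


def full_adder_10t(a: int, b: int, cin: int) -> tuple[int, int]:
    """Evaluate the slide's Sum and Cout equations for one bit position."""
    xor = a ^ b
    xnor = xor ^ 1
    s = (xnor & cin) | (xor & (cin ^ 1))   # Sum  = XNOR·Cin + XOR·Cin_bar
    cout = (xor & cin) | (xnor & a)        # Cout = XOR·Cin  + XNOR·A
    return s, cout


def ripple_add4(x: int, y: int, cin: int = 0) -> tuple[int, int]:
    """Chain four 1-bit adders; returns (4-bit sum, carry out)."""
    total, carry = 0, cin
    for i in range(4):
        s, carry = full_adder_10t((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total, carry


# Exhaustive check of the 1-bit equations against a + b + cin
for a, b, cin in product((0, 1), repeat=3):
    s, cout = full_adder_10t(a, b, cin)
    assert s + 2 * cout == a + b + cin

print(ripple_add4(0b1111, 0b0001))
```

All eight input combinations satisfy Sum + 2·Cout = A + B + Cin, so the equations describe a full adder.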
47. 4-BIT 10T ADDER CIRCUIT WITH DUAL-VT LOGIC DESIGN
• 1-bit Full Adder (consisting of multiplexers and inverters) and its symbol
• 4-bit Full Adder
48. 4-BIT 10T ADDER CIRCUIT WITH DUAL-VT LOGIC DESIGN
• Methodology
• Use combinations of input vectors to measure delay and power consumption
• Delay: switching delay between the least significant bit (bit 0) and the most significant bit (bit 3)
• Power: average and maximum power during simulation
• Results
• Delay (in seconds)
[Bar chart: high-to-low and low-to-high delay for the high-Vt, low-Vt, and dual-Vt designs]
49. 4-BIT 10T ADDER CIRCUIT WITH DUAL-VT LOGIC DESIGN
• Results
• Power consumption (in watts)
[Bar charts: average power and maximum power for the high-Vt, low-Vt, and dual-Vt designs]
51. 4-BIT 10T ADDER CIRCUIT WITH DUAL-VT LOGIC DESIGN
• Issue
• The output voltage degrades, particularly with high-Vt devices or at high frequency (> 125 MHz), because pass transistors deliver a weak 1 (NMOS) or a weak 0 (PMOS).
52. SINGLE-ENDED 6T VS. STANDARD 6T SRAM BITCELL DESIGN
• Specifications
• Design from:
J. Singh, et al. Single Ended 6T SRAM with Isolated Read-Port for Low-Power Embedded Systems. IEEE. 2009.
• Technology node: 45 nm
• Devices: high-Vt MOSFETs
• Tools
• Cadence Virtuoso Schematic Design
• Synopsys HSPICE Simulator
53. SINGLE-ENDED 6T VS. STANDARD 6T SRAM BITCELL DESIGN
• Background
• SRAM occupies the majority of the die area
• Dynamic power: consumed by read and write activity
• Static power: consumed while retaining the stored logic value
• Benefits/Drawbacks of Single-Ended SRAM
• Faster read of logic '1'
• One bit line (no complementary bit-bar line), which reduces wiring
• Slower write of '1' due to the weak-1 behavior of the NMOS pass transistor (but around 85% of writes are zero writes)
• Role of the isolated read port: prevents the bitcell content from being exposed during reads
• Considerably lower power dissipation and better read static noise margin (SNM)
60. SINGLE-ENDED 6T VS. STANDARD 6T SRAM BITCELL DESIGN
• Standard SRAM Design (using Cadence Virtuoso)
61. SINGLE-ENDED 6T VS. STANDARD 6T SRAM BITCELL DESIGN
• Single-Ended SRAM Design
62. SINGLE-ENDED 6T VS. STANDARD 6T SRAM BITCELL DESIGN
• Comparison Results
• Write Delay (0 to 0.5Vdd or 1 to 0.5Vdd)
"…around 85% of the instruction write bits are '0,' and over 90% of the data write bits are '0.'…" (quoted from [3])
[3] Y. Chang, F. Lai, C. Yang. Zero-Aware Asymmetric SRAM Cell for Reducing Cache Power in Writing Zero. IEEE Trans. on VLSI Systems, Vol. 12, No. 8, August 2004.
63. SINGLE-ENDED 6T VS. STANDARD 6T SRAM BITCELL DESIGN
• Comparison Results
• Power Consumption Comparison
64. SINGLE-ENDED 6T VS. STANDARD 6T SRAM BITCELL DESIGN
• Noise Margin
65. QR MATRIX FACTORIZATION
• Purpose:
• Implement the QR factorization algorithm in C
• Specifications
• Written in C under Red Hat Linux
• QR Factorization
• A matrix decomposition used to solve linear systems without inverting the left-hand-side matrix.
• Applicable to: an m-by-n matrix A
• Decomposition: A = QR, where Q is an m-by-m orthogonal matrix and R is an m-by-n upper triangular matrix
• QR decomposition provides an alternative way of solving the system Ax = b without inverting A. Because Q is orthogonal, Q^T Q = I, so Ax = b is equivalent to Rx = Q^T b, which is easier to solve since R is triangular.
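The solve step Rx = Q^T b can be sketched in pure Python. This is an illustrative sketch, not the project's C code: it assumes a modified Gram-Schmidt factorization (the slides do not say which QR algorithm was used), which yields the thin form with Q of size m-by-n rather than the full m-by-m Q, and it assumes A has full column rank. The names `qr_gram_schmidt` and `solve_qr` are hypothetical.

```python
import math


def qr_gram_schmidt(a):
    """Factor a (list of m rows, n columns, full column rank) as a = q r.

    Modified Gram-Schmidt: q is m-by-n with orthonormal columns, r is n-by-n upper triangular.
    """
    m, n = len(a), len(a[0])
    cols = [[a[i][j] for i in range(m)] for j in range(n)]  # working copies of the columns
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j]
        for k in range(j):
            r[k][j] = sum(q_cols[k][i] * v[i] for i in range(m))
            v = [v[i] - r[k][j] * q_cols[k][i] for i in range(m)]  # remove component along q_k
        r[j][j] = math.sqrt(sum(x * x for x in v))
        q_cols.append([x / r[j][j] for x in v])
    q = [[q_cols[j][i] for j in range(n)] for i in range(m)]
    return q, r


def solve_qr(a, b):
    """Solve a x = b through R x = Q^T b and back substitution (no matrix inverse)."""
    q, r = qr_gram_schmidt(a)
    n = len(r)
    y = [sum(q[i][j] * b[i] for i in range(len(b))) for j in range(n)]  # y = Q^T b
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(r[i][k] * x[k] for k in range(i + 1, n))) / r[i][i]
    return x


x = solve_qr([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0])
print(x)  # approximately [0.8, 1.4]
```

Back substitution works from the last row of R upward, which is why the triangular form makes the system easy to solve.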
68. FALL 2010
• Electro Active Polymer Energy Harvesting
• Advanced Encryption Standard
69. ELECTRO ACTIVE POLYMER ENERGY HARVESTING DESIGN
• EAP circuitry converts mechanical energy to electrical energy when the polymer is stretched under a bias voltage.
• EAP material: VHB 4905 tape and carbon grease
70. ELECTRO ACTIVE POLYMER ENERGY HARVESTING DESIGN
• Previous prototype:
• Charge management IC: TI's bq2000
• Li-ion battery: 3 V, 45 mAh
• Application: TI's eZ430-F2013
• Boost converter to supply the biasing voltage (5 V to 1.5 kV): EMCO Q15N-5
• Drawbacks
• High energy consumption
• EAP output power is too small even to turn on the battery charging circuit (which needs 20.6 mA)
• Solutions
• EAP material efficiency
• Higher capacitance
• A battery and circuit that can store small amounts of energy without requiring much energy to operate
• Apply a low biasing voltage to eliminate the boost converter
71. ELECTRO ACTIVE POLYMER ENERGY HARVESTING DESIGN
• Simulation model using Simulink
• Circuit model parameters:
• EAP model parameters, input voltage (battery), and output capacitor Co
72. ELECTRO ACTIVE POLYMER ENERGY HARVESTING DESIGN
• Simulation model using Simulink
• EAP Model Parameters:
• Cidle, Cforced, and force frequency f (how often the EAP is stretched)
• An absolute-value function produces an always-positive waveform from the original sine wave
78. ADVANCED ENCRYPTION STANDARD HARDWARE DESIGN
• AES variants with 512-bit and 1024-bit keys
• Area and power consumption comparison with 128-bit and 256-bit AES keys
• CMOS technology: 45 nm
• Operating voltage: 1.1 V @ 100 MHz
• Language: Verilog
• Tools:
• Synthesis: Synopsys Design Compiler
• Simulation: ModelSim
• Objective: find the relationship between key size and the implemented hardware area and power consumption.
79. ADVANCED ENCRYPTION STANDARD HARDWARE DESIGN
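[Flowchart: AES encryption]
• Inputs: Cipher Key, Plaintext
• Key Expansion: produces RoundKey[0] and RoundKey[i] for each round
• Initial Round: AddRoundKey with RoundKey[0]
• Normal Round (repeated with i = i + 1 while i < number of rounds): SubBytes, ShiftRows, MixColumns, AddRoundKey with RoundKey[i]
• Final Round: SubBytes, ShiftRows, AddRoundKey
• Output: Ciphered Text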
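The order of transformations in this flow can be written out with a short Python sketch. It only lists step names and round-key indices and implements none of the transformations; `aes_step_sequence` is a hypothetical helper, and the example uses the standard round count of 10 for AES-128 because the slides do not state the round counts of the 512-bit and 1024-bit variants.

```python
def aes_step_sequence(num_rounds: int) -> list[tuple[str, int]]:
    """List (transformation, round-key index) pairs in encryption order."""
    steps = [("AddRoundKey", 0)]                       # initial round
    for i in range(1, num_rounds):                     # normal rounds
        steps += [("SubBytes", i), ("ShiftRows", i),
                  ("MixColumns", i), ("AddRoundKey", i)]
    steps += [("SubBytes", num_rounds),                # final round: no MixColumns
              ("ShiftRows", num_rounds),
              ("AddRoundKey", num_rounds)]
    return steps


sequence = aes_step_sequence(10)                       # AES-128 uses 10 rounds
round_keys_used = sum(1 for name, _ in sequence if name == "AddRoundKey")
print(len(sequence), round_keys_used)
```

With 10 rounds the sequence contains 11 AddRoundKey steps, one per round key produced by the key expansion, and MixColumns appears only in the 9 normal rounds.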
81. ADVANCED ENCRYPTION STANDARD HARDWARE DESIGN
• Block Diagram
[Block diagram: Plain_text and Cipher_key inputs; Key Expansion Module; SubBytes, ShiftRows, MixColumns, and AddRoundKey blocks connected through multiplexers; an initial value (zero) input; Ciphered_text output]
82. ADVANCED ENCRYPTION STANDARD HARDWARE DESIGN
Results
[Chart: dynamic, static, and total power in mW for AES128, AES256, AES512, and AES1024; linear trendline on total power: y = 0.852x + 2.739, R² = 0.985]
Design     Area
AES128     58824.876
AES256     64188.036
AES512     76881.193
AES1024    96312.560
Design     Power (dynamic) in mW   Power (static) in mW   Total power in mW
AES128     3.3574                  0.2971603              3.6545603
AES256     3.9442                  0.3341722              4.2783722
AES512     5.0289                  0.409219               5.438119
AES1024    5.6042                  0.5053051              6.1095051
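The reported trendline can be checked with an ordinary least-squares fit over the total-power figures. This is a minimal sketch that assumes the x values are the category positions 1 to 4 (AES128 through AES1024), which is how a spreadsheet trendline over a category axis would treat them; the slides do not state this, and `linear_fit` is a hypothetical helper.

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept, plus R squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Total power in mW, copied from the table above
total_power_mw = [3.6545603, 4.2783722, 5.438119, 6.1095051]
positions = [1, 2, 3, 4]  # assumed x values: AES128, AES256, AES512, AES1024

slope, intercept, r_squared = linear_fit(positions, total_power_mw)
print(round(slope, 3), round(intercept, 3), round(r_squared, 4))
```

Under that assumption the fitted slope and intercept agree with the slide's trendline to three decimals and R² lands within about 0.001 of the reported value. Since each step along the x axis doubles the key size, the fit suggests total power grows roughly linearly with the logarithm of the key size over this range.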
83. ADVANCED ENCRYPTION STANDARD HARDWARE DESIGN
Results: Area
[Bar chart: area for AES128, AES256, AES512, and AES1024]
Design     Area
AES128     58824.87654
AES256     64188.0369
AES512     76881.19388
AES1024    96312.56036