1. An Energy Efficient Sub-threshold Multiplication
and Accumulation Unit for Low Power Digital
Signal Processing Applications
Harsha Yelisala
SPRING 2009 - SUMMER 2010
2. Technology Profile
The following technologies are used in this project,
90nm Pass Transistor Technology.
Cadence IC design.
Virtuoso schematic.
Virtuoso Analog Design Environment.
Cadence Spectre Simulator.
Virtuoso Layout Suite.
Synopsys Nanosim.
Synopsys Hspice.
Tcl Scripting.
Perl Scripting.
Python Programming Language.
3. Aim
The Objectives of this project are
1. To design an industry standard energy efficient circuit in a
90nm Technology.
2. To emphasize the Subthreshold mode of operation.
3. To get hands on expertise on Cadence and Synopsys Tools.
4. To understand the hardware design flow.
5. To work with Perl and Tcl Scripting Languages.
4. Introduction
Abstract
The growing use of power-hungry devices has opened a new area of
research in energy- and power-efficient design. Conventional
design methodologies prove inefficient when energy efficiency is
the prime metric. Of the several novel approaches, the one that
promises high energy savings with reduced complexity is the
sub-threshold mode of operation. A 220mV energy-efficient
subthreshold MAC unit is designed from a custom cell library in
90nm pass transistor technology.
5. Work Flow
1. Studying the literature regarding Subthreshold operation.
2. Investigating various logic families for Subthreshold scheme.
3. Designing a custom library of standard cells out of the
proposed logic family.
4. Designing a MAC unit.
5. Verifying and testing the unit from power and energy
perspective.
6. Subthreshold Mode
What is Subthreshold mode
A basic MOS transistor works in three different modes of operation.
1. Active or Saturation Mode
2. Linear or Triode Mode
3. Cutoff or Subthreshold Mode
8. All about Subthreshold Mode!
What is Subthreshold mode
The subthreshold operation of a CMOS transistor occurs when the
gate-to-source potential (Vgs) is less than the threshold
voltage (Vth).
Advantages:
1. As the device operates at ultra-low voltages (200-300mV),
the dynamic power component is greatly reduced.
2. Highly suitable for low-power, low-speed applications such as
sensor nodes and battery-operated devices.
Disadvantages:
1. As the driving currents are the weak leakage currents, nodes
take a long time to charge and discharge, limiting the speed to
about 1-10MHz.
2. Transistor sizing is critical.
3. Low on-off current ratio.
4. High sensitivity to process, voltage and temperature (PVT) variations.
9. Subthreshold Current Model (1 of 2)
In the subthreshold regime, the drain current (Ids) varies
exponentially with the gate-to-source voltage. In a long-channel
device, the threshold voltage does not depend on the drain voltage
or channel length. In sub-micron technology, however, drain induced
barrier lowering (DIBL) makes the threshold voltage depend on the
drain voltage, because the source/drain depletion regions penetrate
significantly into the channel.
The subthreshold current of a CMOS transistor is given by the
following equation, where vt is the thermal voltage,
$$I_{sub} = I_0 \, e^{(V_{gs} - V_{th} + \eta V_{ds})/(n v_t)} \left(1 - e^{-V_{ds}/v_t}\right) \tag{1}$$
10. Subthreshold Current Model (2 of 2)
$$I_{sub} = I_0 \, e^{(V_{gs} - V_{th} + \eta V_{ds})/(n v_t)} \left(1 - e^{-V_{ds}/v_t}\right) \tag{2}$$
where
$$I_0 = \mu_o C_{ox} (W/L)(n - 1) v_t^2 \tag{3}$$
and Vgs = transistor gate to source voltage,
Vds = drain to source voltage,
Vth = threshold voltage,
vt = kT/q is the thermal voltage,
n = subthreshold slope factor = (1 + Cd/Cox),
Cd = depletion-layer capacitance,
Cox = gate-oxide capacitance,
η = DIBL coefficient,
µo = carrier mobility.
W and L are the channel width and length of the MOSFET,
respectively.
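The current model can be evaluated numerically. The following is a minimal sketch, not the simulation setup used in this project: it uses the thermal voltage vt in the drain term, as in the standard weak-inversion model, and the device values in PARAMS (vth, n, eta, i0) are hypothetical numbers chosen only for illustration.

```python
import math

BOLTZMANN = 1.380649e-23           # J/K
ELECTRON_CHARGE = 1.602176634e-19  # C


def thermal_voltage(temp_k=300.0):
    """Thermal voltage vt = kT/q, in volts."""
    return BOLTZMANN * temp_k / ELECTRON_CHARGE


def subthreshold_current(vgs, vds, vth, n, eta, i0, temp_k=300.0):
    """I_sub = I0 * exp((Vgs - Vth + eta*Vds) / (n*vt)) * (1 - exp(-Vds/vt))."""
    vt = thermal_voltage(temp_k)
    gate_term = math.exp((vgs - vth + eta * vds) / (n * vt))
    drain_term = 1.0 - math.exp(-vds / vt)
    return i0 * gate_term * drain_term


# Hypothetical device values, for illustration only (not from the 90nm kit).
PARAMS = dict(vth=0.30, n=1.5, eta=0.08, i0=1e-7)

i_on = subthreshold_current(vgs=0.22, vds=0.22, **PARAMS)   # gate at Vdd
i_off = subthreshold_current(vgs=0.0, vds=0.22, **PARAMS)   # gate at 0 V
print(f"I_on = {i_on:.3e} A, I_off = {i_off:.3e} A, ratio = {i_on / i_off:.1f}")
```

With both devices at the same Vds, the DIBL and drain terms cancel in the ratio, so I_on/I_off depends only on Vgs/(n vt).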
11. Subthreshold Power Model (1 )
For low-frequency mobile devices, the main advantage of
subthreshold design is a radical reduction in circuit power at
the cost of operating speed. The total power consumption of a
digital circuit is given by the following equation.
$$P_{total} = P_{dynamic} + P_{short-circuit} + P_{static} \tag{4}$$
12. Subthreshold Power Model (2 )
Dynamic Power
Dynamic power is described by the following equation,
$$P_{dynamic} = \alpha f C_{eff} V_{dd}^2 \tag{5}$$
where α is the activity factor, f is the switching frequency and
Ceff is the effective capacitance. As dynamic power is proportional
to the square of the supply voltage, significant power reduction is
achieved at subthreshold voltages.
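To get a feel for the quadratic term, the sketch below compares dynamic power at 1.2V and 220mV with the activity factor, frequency and capacitance held equal. The values of ALPHA, FREQ_HZ and C_EFF_F are hypothetical placeholders; they cancel in the ratio.

```python
def dynamic_power(alpha, freq_hz, c_eff_f, vdd):
    """P_dynamic = alpha * f * C_eff * Vdd^2, in watts."""
    return alpha * freq_hz * c_eff_f * vdd ** 2


# Hypothetical operating point, for illustration only.
ALPHA = 0.1        # activity factor
FREQ_HZ = 1e6      # switching frequency
C_EFF_F = 1e-12    # effective capacitance

p_nominal = dynamic_power(ALPHA, FREQ_HZ, C_EFF_F, vdd=1.2)
p_subthreshold = dynamic_power(ALPHA, FREQ_HZ, C_EFF_F, vdd=0.22)

# With alpha, f and C_eff equal, only the (Vdd ratio)^2 term remains.
voltage_only_ratio = p_nominal / p_subthreshold
print(f"Reduction from voltage scaling alone: {voltage_only_ratio:.2f}X")
```

The voltage term alone accounts for a reduction of roughly 30X; a subthreshold circuit also runs at a much lower frequency, which lowers the dynamic power further.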
13. Subthreshold Power Model (3 )
Dynamic Power
At 220mV, the dynamic charging current, which is directly
proportional to dynamic power, is reduced by almost 248.49X
compared to a supply voltage of 1.2V for an inverter at the TT
process corner.
Figure: Dynamic charging current (uA, log scale) versus supply voltage (0.2 V to 1.2 V) at the TT, FS, SF, SS and FF process corners
14. Subthreshold Power Model (4 )
Static Power
Static power is the power consumed by the circuit in the idle state
and is described by the following equation.
$$P_{static} = I_{leakage} V_{dd} \tag{6}$$
The leakage current consists of several components: subthreshold
leakage, gate tunneling, gate induced drain leakage (GIDL) and
reverse-bias diode leakage. The subthreshold leakage varies
according to equation (2). Thus, as the drain voltage is reduced,
the DIBL effect weakens, which in turn reduces the subthreshold
leakage current. Gate tunneling contributes significantly to the
overall leakage current and also falls with the gate or supply
voltage. GIDL and reverse-bias diode leakage are likewise
significantly reduced by the lower supply voltage of a
subthreshold circuit.
15. Subthreshold Power Model (5 )
Static Power
At 220mV, the subthreshold leakage current at weak inversion is
reduced by almost 8.55X compared to strong inversion(supply
voltage 1.2V) at TT process corner.
Figure: Leakage current (nA, log scale) versus supply voltage (0.2 V to 1.2 V) at the TT, FS, SF, SS and FF process corners
16. Subthreshold Power Model (6 )
Short Circuit Power
Short circuit power is the power dissipated by current conducting
between Vdd and VSS during a logic transition. It is described by
the following equation.
$$P_{short-circuit} = I_{short-circuit} V_{dd} \tag{7}$$
Although the short-circuit current flows for longer because of the
slower operation in subthreshold, the reduced supply voltage
decreases conduction, which in turn reduces Ishort-circuit.
17. Subthreshold Power Model (5 )
Short Circuit Power
At 220mV, there is a 446.45X reduction in short circuit current
compared to full rail voltage of 1.2V in TT process corner.
Figure: Short circuit current rating under varying supply voltage for an
(Axes: current in uA, log scale, versus supply voltage from 0.2 V to 1.2 V; curves for the TT, FS, SF, SS and FF corners.)
19. Subthreshold Design Challenges (2)
Transistor Sizing Criticality
The relative strength of the pull-up and pull-down networks is
critical for optimal rise and fall times. As the subthreshold
current depends exponentially on Vth, any variation in the
threshold of the NMOS and PMOS devices can change the β ratio
drastically, which directly affects rise/fall time and may trigger
logic failure. This shift in β ratio is observed at low voltage,
forcing us to size the cell transistors very carefully.
21. Subthreshold Design Challenges (2)
Figure: Ratio of NMOS ION and PMOS ION at different corners (ION NMOS / ION PMOS, log scale, versus supply voltage from 0 V to 1.4 V; curves for the TT, FF, FS, SF and SS corners)
22. Subthreshold Design Challenges (2)
Figure: Ratio of NMOS ION and PMOS ION at different temperatures (ION NMOS / ION PMOS versus supply voltage from 0 V to 1.4 V; curves from −40C to 120C in 20C steps)
23. Subthreshold Design Challenges (3)
On-Off Current Ratio
The drain current of a MOSFET increases exponentially in the
subthreshold region, whereas in strong inversion it changes slowly
because of velocity saturation of the majority carriers. In the
subthreshold region, threshold voltage deviation and degradation
of the ION/IOFF current ratio make circuit operation very
critical. At a subthreshold supply such as 0.2V, ION/IOFF degrades
to below 300 at room temperature. There is a strong race condition
between on and off devices while a critical signal is being set,
and this determines the maximum number of allowable cells per
bit-line. When the current ratio degrades to a very low value, it
becomes very difficult to differentiate between logic ‘1’ and logic
‘0’. If we consider process variations, this ratio becomes worse in
the FF corner, as shown.
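A rough estimate of this ratio follows from the subthreshold current model: taking the on device at Vgs = Vdd and the off device at Vgs = 0, both at the same Vds, gives ION/IOFF = exp(Vdd/(n vt)). The sketch below evaluates that expression; the slope factor n = 1.5 is an assumed value, not one extracted from the 90nm process, and the expression only holds while Vdd stays below Vth.

```python
import math

BOLTZMANN = 1.380649e-23           # J/K
ELECTRON_CHARGE = 1.602176634e-19  # C


def on_off_ratio(vdd, n=1.5, temp_c=27.0):
    """Estimate I_ON/I_OFF = exp(Vdd / (n*vt)) for a device kept in subthreshold.

    n is an assumed subthreshold slope factor (hypothetical value).
    """
    vt = BOLTZMANN * (temp_c + 273.15) / ELECTRON_CHARGE
    return math.exp(vdd / (n * vt))


for temp_c in (-40.0, 27.0, 120.0):
    print(f"{temp_c:6.1f} C: I_ON/I_OFF at 0.2 V = {on_off_ratio(0.2, temp_c=temp_c):.0f}")
```

Under these assumptions the estimate at 0.2V and room temperature falls below 300, and it degrades further as temperature rises because vt grows with temperature.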
24. Subthreshold Design Challenges (3)
On-Off Current Ratio
Figure: Ratio of NMOS ION and IOFF at different temperatures (NMOS ION / IOFF, log scale, versus supply voltage from 0 V to 1.4 V; curves from −40C to 120C in 20C steps)
Observation: Significant β ratio variation is observed in low
voltage at different temperatures.
25. Subthreshold Design Challenges (3)
On-Off Current Ratio
Figure: Ratio of PMOS ION and IOFF at different temperatures (PMOS ION / IOFF, log scale, versus supply voltage from 0 V to 1.4 V; curves from −40C to 120C in 20C steps)
Observation: Significant β ratio variation is observed in low
voltage at different temperatures.
26. On-Off Current Ratio
Figure: Ratio of NMOS ION and IOFF at different corners (NMOS ION / IOFF, log scale, versus supply voltage from 0 V to 1.4 V; curves for the TT, FF, FS, SF and SS corners)
Observation: Significant β ratio variation is observed in low
voltage at different temperatures.
27. On-Off Current Ratio
Figure: Ratio of PMOS ION and IOFF at different corners (PMOS ION / IOFF, log scale, versus supply voltage from 0 V to 1.4 V; curves for the TT, FF, FS, SF and SS corners)
Observation: Significant β ratio variation is observed in low
voltage at different temperatures.
28. A Look into other Logic families
The conventional Complementary MOS (CMOS) logic family has several
disadvantages when operated at subthreshold voltages.
A few of them are:
1. High power dissipation.
2. Weak noise margins.
3. Large delays.
Thus a CMOS logic family is not optimal for subthreshold
operation.
29. A study of several other logic families is made with power and
energy consumption as the prime concern.
Table: Minimum working voltages for different logic families for a basic
AND gate
Logic Family      Minimum Voltage(mV)   Delay(ns)   Driving Current(nA)   Power(nW)   PDP(fJ)
Sub-CMOS          250                   2.56        3330                  1859        4.759
Pseudo NMOS       220                   4.765       102.56                602.3       2.87
DTMOS             180                   8.4173      32.54                 233.63      1.97
Domino            240                   7.6477      476.13                639.41      4.89
Pass Transistor   200                   4.9953      201.43                426.17      2.13
DTPT              175                   6.598       128.39                204.68      1.35
Table: Energy comparison at 250mV for different logic families for a basic
AND gate
Logic Family      Delay(ns)   Driving Current(nA)   Power(uW)   PDP(fJ)
Sub-CMOS          2.56        3330                  1.859       4.759
Pseudo NMOS       3.8637      761.938               0.9848      3.805
DTMOS             11.116      89.204                1.501       16.68
Domino            4.5477      568.31                1.119       5.09
Pass Transistor   2.2641      652.88                1.502       3.39
DTPT              1.8432      830                   1.503       2.77
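The power delay product (PDP) column can be cross-checked as power times delay. The sketch below does this for the 250mV comparison, assuming the power figures are in uW (with the Sub-CMOS entry of 1859 nW written as 1.859 uW), which is the unit that makes power times delay reproduce the listed PDP values in fJ. The names CELLS_250MV and pdp_fj are illustrative.

```python
# (delay in ns, power in uW, listed PDP in fJ) at 250 mV, copied from the table.
# Assumption: power is in uW for every row (Sub-CMOS: 1859 nW = 1.859 uW).
CELLS_250MV = {
    "Sub-CMOS": (2.56, 1.859, 4.759),
    "Pseudo NMOS": (3.8637, 0.9848, 3.805),
    "DTMOS": (11.116, 1.501, 16.68),
    "Domino": (4.5477, 1.119, 5.09),
    "Pass Transistor": (2.2641, 1.502, 3.39),
    "DTPT": (1.8432, 1.503, 2.77),
}


def pdp_fj(delay_ns, power_uw):
    """Power delay product: 1 uW * 1 ns = 1e-15 J = 1 fJ."""
    return delay_ns * power_uw


computed = {name: pdp_fj(d, p) for name, (d, p, _) in CELLS_250MV.items()}
ranking = sorted(computed, key=computed.get)  # lowest energy first

for name in ranking:
    listed = CELLS_250MV[name][2]
    print(f"{name:16s} computed {computed[name]:7.3f} fJ, listed {listed:7.3f} fJ")
```

Sorting by the computed PDP puts the two pass-transistor-based families (DTPT and Pass Transistor) at the low-energy end of the 250mV comparison.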
30. Custom Cell Library
All the standard cells are designed in 90nm PT technology. The
cells are fine tuned for their sizings, driving capability and minimum
working voltage magnitudes. The cells that are customized are:
Inverter
Buffer
And
Or
Xor
Xnor
31. Inverter
This is the only gate in the library that is based on CMOS
technology. The only modification is that the driving capability of
the cell is increased by improving the effective channel length of
the P and N devices as shown.
32. Buffer
Buffer gate is obtained by connecting two inverters in series.
33. And (1 of 2)
Operation:
When A=0, B=0 the transistors p1, n1, n3 are on and p2, n2, p3,
p4 are off, and the gate transmits gnd.
When A=0, B=1 the transistors p1, n1, p3 are on and p2, n2, n3,
p4 are off, and the gate transmits gnd.
When A=1, B=0 the transistors p2, n2, n3, p4 are on and p1, n1,
p3 are off, and the gate transmits B.
When A=1, B=1 the transistors p2, n2, p3, p4 are on and p1, n1,
n3 are off, and the gate transmits vdc.
Figure: And gate (schematic with inputs A, A', B, B', transistors p1, p2, p3, p4, n1, n2, n3, supplies vdc and gnd, and the output node)
34. And (2 of 2)
Need for additional Mosfets n3, p3, p4:
When the inputs are A=1, B=0, the output node is discharged to zero.
When the inputs are A=1, B=1, the output should be connected to B,
which should charge it to ‘1’.
But due to the larger subthreshold delay, the node which was
discharged earlier takes a longer time to charge to ‘1’.
Hence an alternate path is provided to charge the output
node to ‘1’.
Figure: And gate (schematic with inputs A, A', B, B', transistors p1, p2, p3, p4, n1, n2, n3, supplies vdc and gnd, and the output node)
35. Or (1 of 2)
Operation:
When A=0, B=0 the transistors p1, n1 are on and p2 is off, and
the gate transmits B.
When A=0, B=1 the transistors p1, n1 are on and p2 is off, and
the gate transmits B.
When A=1, B=0 the transistors p1, n1 are off and p2 is on, and
the gate transmits A.
When A=1, B=1 the transistors p1, n1 are off and p2 is on, and
the gate transmits A.
Figure: Or gate (schematic with inputs A, A', B, transistors p1, p2, n1, and the output node)
36. Or (2 of 2)
This works fine in the strong inversion region. But in subthreshold
mode, the output current is not sufficient for the gate to drive a
FO4 load. Hence a chain of two inverters is connected at the final
output, and the combination is treated as the custom OR gate.
Figure: Or gate (schematic with inputs A, A', B, transistors p1, p2, n1, and the output node)
37. Xnor
Operation:
When A=0, B=0 the transistors p1, n1 are on and p2, n2 are off,
and the gate transmits B'.
When A=0, B=1 the transistors p1, n1 are on and p2, n2 are off,
and the gate transmits B'.
When A=1, B=0 the transistors p1, n1 are off and p2, n2 are on,
and the gate transmits B.
When A=1, B=1 the transistors p1, n1 are off and p2, n2 are on,
and the gate transmits B.
Figure: Xnor gate (schematic with inputs A, A', B, B', transistors p1, p2, n1, n2, and the output node)
38. Xor(1 of 2)
Operation:
When A=0, B=0 the transistors p1, n1 are off and p2, n2 are on,
and the gate transmits B.
When A=0, B=1 the transistors p1, n1 are off and p2, n2 are on,
and the gate transmits B.
When A=1, B=0 the transistors p1, n1 are on and p2, n2 are off,
and the gate transmits B'.
When A=1, B=1 the transistors p1, n1 are on and p2, n2 are off,
and the gate transmits B'.
Figure: Xor gate (schematic with inputs A and B, transistors p1, p2, n1, n2, and the output node)
39. Xor(2 of 2)
However, the direct XOR implementation is not used in our custom
library: on investigation, the XOR derived from the XNOR works
down to a much lower minimum working voltage than the direct XOR
implementation. The details are given in later slides.
Figure: Xor gate (schematic with inputs A and B, transistors p1, p2, n1, n2, and the output node)
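Logically, each of the And, Or, Xnor and Xor cells described above acts as a selector: input A decides which signal is passed to the output. The sketch below is a purely behavioural model of that selection, under the assumption that the complemented input B' is passed in the Xnor (A=0) and Xor (A=1) cases. It ignores transistor-level effects such as delay, drive strength and the And gate's extra charging path, and the function names are illustrative.

```python
from itertools import product


def pt_and(a, b):
    """A=1 passes B to the output; A=0 connects the output to gnd."""
    return b if a else 0


def pt_or(a, b):
    """A=1 passes A to the output; A=0 passes B."""
    return a if a else b


def pt_xnor(a, b):
    """A=1 passes B; A=0 passes the complement B'."""
    return b if a else 1 - b


def pt_xor(a, b):
    """A=1 passes the complement B'; A=0 passes B."""
    return 1 - b if a else b


print(" A B | AND OR XNOR XOR")
for a, b in product((0, 1), repeat=2):
    print(f" {a} {b} |  {pt_and(a, b)}   {pt_or(a, b)}   {pt_xnor(a, b)}    {pt_xor(a, b)}")
```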
40. Summary of the standard cells in PT technology
Table: Electrical characteristics of different basic cells using pass
transistor logic in TT process corner
Basic cell   Minimum Voltage(mV)   Delay(ns)   Driving Current(fA)   Power(nW)   PDP(aJ)
Buffer       148                   2.7258      582.06                0.134       0.365
Inverter     150                   1.5655      590.65                0.197       0.308
XOR          155                   1.5739      611.69                0.562       0.884
NAND         170                   0.9638      673.64                0.435       0.419
AND          175                   2.1523      689.82                0.47        1.011
OR           155                   3.9219      611.81                0.431       1.6903
Full adder   185                   2.9647      734.61                29.516      87.506
41. Design of a MAC Unit
MAC is one of the most frequently occurring and energy-consuming
operations in DSP and other computationally intensive
applications.
It represents a fundamental building block in all DSP tasks.
Therefore, designing an ultra-low power MAC is a
subject of substantial research interest.
An energy efficient MAC unit is designed using the custom
cell library.
42. Design of a MAC Unit
Brief Specifications:
Inputs : 8-bit Multiplier, 8-bit Multiplicand, 17-bit Addend
Outputs :17-bit MAC output
Type of Multiplier : Radix-4 Booth encoded multiplier
Type of Adder : Ripple carry adder
43. Block diagram of MAC unit
Figure: Block diagram of MAC unit (MULTIPLIER section: inputs MD<7:0> and MR<7:0>, a 2s Complement block producing -MD, Shifters producing 2MD and -2MD, a Booth Encoder, Partial Product Generation blocks PP0 to PP3 feeding P0 to P3, and a Partial Product Adder; ADDER section: an Adder on <16:0> buses driving the output)
44. Flowchart of MAC Unit
Figure: Flowchart for MAC operation (MULTIPLICAND into 2s Complement and Shifters; MULTIPLIER into Booth encoder; then Partial product generation and Partial product addition; the Adder combines the result with the ADDER INPUT to give the MAC OUTPUT)
45. Sequence of logic flow
The multiplicand (MD) input enters the 2s complement block,
which negates the value of MD.
The obtained -MD, when shifted left, gives -2MD.
The non-negated MD is also shifted left to obtain 2MD.
The Booth encoder block encodes the 8 bit multiplier (MR) to
12 bits, which are used to control the partial product
generation.
The partial product generation involves selection of four 8 bit
vectors based on the encoded bits.
The four partial products are generated by the PP0, PP1, PP2
and PP3 blocks respectively.
The partial products are shifted and sign extended to 16 bits
by the P0, P1, P2 and P3 blocks respectively.
The obtained partial products are finally added to obtain the
17 bit multiplier output.
A 17 bit external input is added to the obtained multiplier
product to give the final MAC output.
46. Modified booth encoding algorithm
The modified Booth encoding algorithm is a common choice for
multiplication of signed numbers. It is selected because it
reduces the number of partial products to half the number of
multiplier bits, compared with a conventional Booth encoding
scheme. This reduces the number of iterations at the cost of
increased circuit complexity. Thus the power consumption is also
reduced by half. The modified Booth encoder based multiplier
architecture is designed with power consumption in view.
47. Algorithm Description and Control Implementation
The modified Booth algorithm considers 3 multiplier bits (MRi+1,
MRi, MRi−1) at a time and encodes them to one of the values -2MD,
-MD, 0, MD, 2MD based on the table below. MRi refers to the i-th
bit of the multiplier, where i steps through the even bit positions
(0, 2, 4, 6 for the 8 bit multiplier), and MR−1 is taken to be 0.
Table: Mapping of multiplier bits to encoded bits using Radix 4 Booth
Encoder
MRi+1   MRi   MRi−1   Partial Product   A   B   C
0       0     0       0                 0   0   0
0       0     1       MD                0   1   0
0       1     0       MD                0   1   0
0       1     1       2MD               0   0   1
1       0     0       -2MD              1   0   1
1       0     1       -MD               1   1   0
1       1     0       -MD               1   1   0
1       1     1       0                 1   0   0
where A, B, C indicate the encoded bits for a given MRi+1, MRi,
MRi−1 group of the multiplier bit sequence, starting from the LSB.
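The mapping table can be written down as a lookup. The sketch below is one possible software model of the encoder, not the gate-level circuit used in the design: BOOTH_TABLE and booth_encode are illustrative names, and each group returns both the selected multiple of MD and the (A, B, C) code from the table.

```python
# (MR_i+1, MR_i, MR_i-1) -> (multiple of MD, (A, B, C)), as in the mapping table.
BOOTH_TABLE = {
    (0, 0, 0): (0, (0, 0, 0)),
    (0, 0, 1): (1, (0, 1, 0)),
    (0, 1, 0): (1, (0, 1, 0)),
    (0, 1, 1): (2, (0, 0, 1)),
    (1, 0, 0): (-2, (1, 0, 1)),
    (1, 0, 1): (-1, (1, 1, 0)),
    (1, 1, 0): (-1, (1, 1, 0)),
    (1, 1, 1): (0, (1, 0, 0)),
}


def booth_encode(mr, width=8):
    """Encode a two's complement multiplier into radix-4 Booth groups.

    Returns one (multiple, (A, B, C)) pair per group, starting from the LSB.
    """
    # bits[0] is MR_-1 = 0, bits[k + 1] is MR_k.
    bits = [0] + [(mr >> k) & 1 for k in range(width)]
    groups = []
    for i in range(0, width, 2):
        triplet = (bits[i + 2], bits[i + 1], bits[i])  # (MR_i+1, MR_i, MR_i-1)
        groups.append(BOOTH_TABLE[triplet])
    return groups


for multiple, code in booth_encode(0b01001000):
    print(f"multiple of MD: {multiple:+d}, encoded bits A B C: {code}")
```

Weighting the group multiples by powers of 4 recovers the signed value of the multiplier, which is why four partial products are enough for an 8 bit multiplier.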
48. Example
Consider an example where,
Multiplier(MR)   : 01001000
Multiplicand(MD) : 00110110
Adder input      : 01100010001000001
So, 2MD=01101100, -MD=11001010, -2MD=10010100
Encoding the MR (010010000 once MR−1 = 0 is appended), three bits at a
time starting from the LSB:
000 encodes to 000
100 encodes to 101
001 encodes to 010
010 encodes to 010
Partial Products:   After shifting and sign extending:
pp0 :00000000       p0 :0000000000000000
pp1 :10010100       p1 :1111111001010000
pp2 :00110110       p2 :0000001101100000
pp3 :00110110       p3 :0000110110000000
Product    = 00000111100110000
Adder      = 01100010001000001
MAC OUTPUT = 01101001101110001
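The arithmetic of this example can be replayed in software. The sketch below is an illustrative integer model of the datapath (Booth digits, shifted partial products, final addition), not the transistor-level implementation; the function names and the 17 bit wrap-around of the result are assumptions made for the sketch.

```python
def to_signed(value, width):
    """Interpret the low `width` bits of value as a two's complement number."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value


def booth_digits(mr, width=8):
    """Radix-4 Booth digits (each in -2..2) of the multiplier, LSB group first."""
    padded = (mr & ((1 << width) - 1)) << 1  # append MR_-1 = 0
    digits = []
    for i in range(0, width, 2):
        b0 = (padded >> i) & 1        # MR_i-1
        b1 = (padded >> (i + 1)) & 1  # MR_i
        b2 = (padded >> (i + 2)) & 1  # MR_i+1
        digits.append(b0 + b1 - 2 * b2)
    return digits


def mac(md, mr, addend, width=8, out_width=17):
    """Return (shifted partial products, product, MAC result as a bit string)."""
    md_signed = to_signed(md, width)
    partials = [d * md_signed * 4 ** k for k, d in enumerate(booth_digits(mr, width))]
    product = sum(partials)
    total = (product + to_signed(addend, out_width)) & ((1 << out_width) - 1)
    return partials, product, format(total, f"0{out_width}b")


partials, product, mac_out = mac(0b00110110, 0b01001000, 0b01100010001000001)
for k, p in enumerate(partials):
    print(f"p{k} = {format(p & 0xFFFF, '016b')} ({p})")
print(f"Product    = {format(product & 0x1FFFF, '017b')} ({product})")
print(f"MAC OUTPUT = {mac_out}")
```

For these inputs the shifted partial products are 0, -432, 864 and 3456, the product is 54 × 72 = 3888, and adding the 17 bit adder input (50241) gives 54129.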
49. Test Chip
A 17 bit subthreshold MAC unit is implemented in 90nm
CMOS technology. The fan-in of each logic gate is carefully
selected to achieve maximum robustness at near-threshold supply
voltage. Since the pad-frame input to the MAC is at 1.2V, the input
data and clock signals are down-converted using a down-converting
level shifter. The output of the MAC is up-converted to 1.2V before
being latched to the output pad frame, using an efficient 2-stage
level shifter. The design layout is done using Cadence Virtuoso. A
total of four metal layers are employed in the MAC unit.
The MAC unit measures 658.4µm × 149.49µm, an area of 0.098mm2
in 90nm technology. The transistor-level circuit analysis is
performed using random test vectors. The design is extensively
tested for PVT variations.
50. Full chip layout of the proposed design with pad frame
Figure: Layout of MAC unit
51. Design Specs
Table: Subthreshold MAC design specifications
Minimum voltage 220mV
Speed 1 MHz
Energy per operation 1.63pJ
Average power 2.04uW
Standby power 1.4uW
The MAC unit is configured to operate at an extremely low voltage
of 220mV at a speed of 1MHz for the worst case process corner
(SS) at room temperature and can be functional even down to
180mV at typical corner (TT).
52. MAC Simulation Results (1 of 8)
Figure: Average Power Consumption of MAC at different supply voltages (power in uW versus supply voltage from 200 mV to 500 mV)
53. MAC Simulation Results (2 of 8)
Figure: Operating frequency of MAC unit at different supply voltages
under global variation (frequency in MHz versus supply voltage from 220 mV to 250 mV; curves for the SS, SF, FS, TT and FF corners)
54. MAC Simulation Results (3 of 8)
Figure: Energy/operation at different supply voltages (energy/op in fJ versus supply voltage from 200 mV to 500 mV)
55. MAC Simulation Results (4 of 8)
Figure: Short circuit, static and capacitive current ratings at different
supply voltages (current in uA versus supply voltage from 200 mV to 500 mV; curves labelled static current, dynamic current and capacitive current)
56. MAC Simulation Results (5 of 8)
Figure: Standby power versus supply voltage at different temperatures (standby power in uW versus supply voltage from 200 mV to 500 mV; curves for 0C, 27C and 100C)
57. MAC Simulation Results (6 of 8)
Figure: Current ratings at different operating temperatures at supply
voltage 220mV (current in uA versus temperature from −40C to 120C; curves labelled static current, dynamic current and capacitive current)
58. MAC Simulation Results (7 of 8)
Figure: Performance of MAC at different temperatures at supply voltage
220mV (delay in ns versus temperature from −40C to 120C)
59. MAC Simulation Results (8 of 8)
Figure: Average power of MAC at different temperatures at supply
voltage 220mV (power in uW versus temperature from −40C to 120C)
60. Conclusion
In this research project,
Several logic families are investigated in the subthreshold range
to build optimum subthreshold standard cells.
The pass transistor logic family was chosen for its energy
efficiency compared to other subthreshold logic families.
An optimal design choice is made for each subthreshold
standard cell, based on power delay product.
A 17 bit subthreshold MAC chip is implemented using
customized subthreshold standard cells.
The custom cell layout is done using Cadence Virtuoso and
tested in all process corners using the Nanosim simulator.
It is designed to work at a minimum voltage of 220mV and
consumes ultra-low energy, as little as 1.62pJ per
operation, at an operating frequency of 1.0MHz.