This document analyzes the energy dissipation of digital half band filters operated in the sub-threshold region with throughput constraints. It explores various architectures of a 12-bit half band filter including the basic implementation and unfolded structures. Simulation results show that the unfolded by 2 architecture dissipates 22% less energy per sample compared to the original filter, making it the most energy efficient. The unfolded by 4 architecture best meets throughput requirements of 120K-1M samples/sec, dissipating less energy than other implementations in this speed range.
Ultra Low Energy vs Throughput Design Exploration of 65 nm Sub-VT CMOS Digital Filters
S. M. Yasser Sherazi, Joachim N. Rodrigues, Omer C. Akgun, Henrik Sjöland, and Peter Nilsson
Department of Electrical and Information Technology, Lund University
Box 118, SE-221 00 Lund, Sweden
Email: {yasser.sherazi, joachim.rodrigues, omercan.akgun, henrik.sjoland, peter.nilsson}@eit.lth.se
978-1-4244-8971-8/10/$26.00 © 2010 IEEE

Abstract: This paper presents an analysis of the energy dissipation of digital half band filters operated in the sub-threshold (sub-VT) region with throughput constraints. The degradation of speed in the sub-VT domain is counteracted by unfolding the architectures. A filter is implemented as a basic 12-bit structure and as various unfolded structures. The designs are synthesized in a 65 nm low-leakage high-threshold CMOS technology. A sub-VT energy model is applied to characterize the designs in the sub-VT domain. The results from the application of the energy model show that the unfolded by 2 architecture is the most energy efficient, dissipating 22 % less energy than the original filter implementation at the energy minimum voltage. The unfolded by 4 architecture, however, is the best for throughput requirements of around 120 Ksamples/s to 1 Msamples/s, as it dissipates less energy than any other implementation in this speed range.

I. INTRODUCTION

Miniaturized devices are important in medicine, sensor networks, and many other applications. Engineers aim to develop ultra compact and low energy circuits that may be used in devices like hearing aids, medical implants, and remote sensors. There is currently a major interest in small wireless devices with ultra low energy dissipation targeting on-body applications or medical implants. In such devices, minimal energy dissipation in active and standby mode is of highest importance as it makes the battery last longer, which matters because it is non-trivial to change or charge a battery in a medical implant. Devices like hearing aids that communicate between the two ears to improve binaural hearing may benefit from energy efficient wireless receivers. Another example is a neural sensor inside the body that communicates with a robotic arm or leg. If a radio is made sufficiently small and with minimal power consumption, there will be vast possibilities for new applications.

In the conducted project the design constraints are: less than 1 mW and 1 µW power consumption in active and standby mode, respectively; capacity to handle data rates up to 250 kbits/s; and realization on a single chip with an area of 1 mm² in 65 nm CMOS. A block diagram of the receiver system is shown in Fig. 1. It contains an RF front-end (2.5 GHz), an analog-to-digital converter, a digital baseband for demodulation and control, and finally, an analog decoder that processes the received data packets.

Fig. 1. Receiver system.

The main focus of this paper is on the digital baseband part of the receiver system. The first task of the digital baseband circuit is to re-sample data from 4 Msamples/s to 250 Ksamples/s. Therefore, a chain of decimation filters needs to be applied. To achieve lower energy dissipation, we employ voltage scaling rigorously, so that the designed circuits run in the sub-threshold (sub-VT) domain [1]. When operating in the sub-VT domain, leakage currents have to be dealt with, as they are the source of energy dissipation in idle CMOS [2]. This current imposes an important design constraint, especially in implantable medical devices. Consequently, we need to optimize the circuits in terms of energy dissipation and throughput for sub-VT operation.

In Sec. II we briefly present the applied sub-VT energy model. In Sec. III we present a 12-bit architecture of a Half Band Digital (HBD) filter that is implemented as a direct-mapped structure and as various unfolded structures. In Sec. IV the results attained from the HBD filters are shown and discussed, and finally, the conclusions are presented in Sec. V.

Fig. 2. Half Band Digital Filter: (a) single HBD filter, (b) uf-2 HBD filter.

II. SUB-VT ENERGY MODEL

The current of a MOS transistor does not drop to zero when the gate-to-source voltage VGS is equal to or below the threshold voltage VT, i.e., VGS ≤ VT. The remaining current is a leakage current, commonly referred to as sub-VT or weak inversion conduction [3]. This current is low in amperage, and in the sub-VT domain it is used as the operating switching current. The drawback of sub-VT circuits is a speed penalty. However, circuits that operate at sub-VT manage to satisfy the ultra low energy requirements, since orders of magnitude less energy is dissipated compared to super-threshold circuits [3]. The total energy dissipation of static CMOS digital circuits is typically modelled as

$$E_{total} = \underbrace{\alpha C_{tot} V_{DD}^2}_{E_{dyn}} + \underbrace{I_{leak} V_{DD} T_{clk}}_{E_{leak}} + \underbrace{I_{peak} t_{sc} V_{DD}}_{E_{sc}}, \qquad (1)$$

where Edyn is the average switching energy and Eleak is the leakage energy dissipated during a clock cycle Tclk. The energy dissipation due to short circuit currents (Esc) in the sub-VT domain is known to be minor compared to the overall energy dissipation and is therefore neglected [1]. In (1), Edyn during one clock period is proportional to the switching activity factor (α) and the total switched capacitance of the circuit (Ctot).
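The three terms of (1) can be evaluated directly. The following is a minimal sketch, not taken from the paper: the function name `energy_terms` and all numerical values are hypothetical and chosen only to illustrate how each term scales. The short circuit contribution defaults to zero, reflecting that it is neglected in the sub-VT domain.

```python
def energy_terms(alpha, c_tot, v_dd, i_leak, t_clk, i_peak=0.0, t_sc=0.0):
    """Return (E_dyn, E_leak, E_sc) of Eq. (1), all in joules."""
    e_dyn = alpha * c_tot * v_dd ** 2   # switching energy per clock cycle
    e_leak = i_leak * v_dd * t_clk      # leakage energy integrated over one cycle
    e_sc = i_peak * t_sc * v_dd         # short circuit energy (neglected in sub-VT)
    return e_dyn, e_leak, e_sc


# Hypothetical operating point, for illustration only (not values from the paper).
ALPHA = 0.1      # switching activity factor
C_TOT = 1e-12    # total switched capacitance [F]
V_DD = 0.25      # supply voltage [V]
I_LEAK = 1e-9    # average leakage current [A]
T_CLK = 50e-6    # clock period [s], i.e. a 20 kHz clock

e_dyn, e_leak, e_sc = energy_terms(ALPHA, C_TOT, V_DD, I_LEAK, T_CLK)
e_total = e_dyn + e_leak + e_sc

# Doubling the clock period leaves E_dyn unchanged but doubles E_leak per cycle.
e_dyn_slow, e_leak_slow, _ = energy_terms(ALPHA, C_TOT, V_DD, I_LEAK, 2 * T_CLK)

print(f"E_dyn = {e_dyn * 1e15:.2f} fJ, E_leak = {e_leak * 1e15:.2f} fJ")
print(f"E_leak at half the clock rate = {e_leak_slow * 1e15:.2f} fJ")
```

The sketch makes the trade-off explicit: the dynamic term depends only on the supply voltage, whereas the leakage term grows with the clock period, which is why slow sub-VT operation shifts the balance towards leakage.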
The model used to calculate energy dissipation delivers SPICE-accurate results [4]. This model calculates the total energy dissipation by (2), and the key parameters required are obtained during synthesis and high level simulations.

$$E_T = C_{inv} V_{DD}^2 \left[ \mu_e k_{cap} + k_{crit} k_{leak} e^{-V_{DD}/(n U_t)} \right], \qquad (2)$$

where kleak is the average leakage scaling factor of the circuit, normalized to the average leakage current of a single inverter. The scaling factor kcap is the total capacitance of the circuit normalized to the capacitance of a single inverter. The coefficient kcrit expresses the critical path delay of the circuit in terms of single inverter delays. The average switching activity of the circuit per N sample operations is µe. The slope factor n is a process dependent constant, and Ut is the thermal voltage, which is 26 mV at 300 K. For more details the reader is referred to [4].

III. FILTER ARCHITECTURES

Minimum energy dissipation with a medium to high throughput requirement puts stringent constraints on a design. Therefore, it is important to explore and analyse the architectures that best fulfill the requirements. This section presents the HBD filter and the architectural differences between the basic and unfolded versions.

A. Half Band Digital Filter

An optimized third order filter structure is evaluated for minimum energy dissipation. The filter structure for the parallel implementation, see Fig. 2(a), is a parallel third-order bi-reciprocal lattice wave digital filter [5], considered highly suitable as a decimator or interpolator for sample rate conversions with a factor of two. The benefit of using this type of filter is that all filtering may be performed at the lower sample rate with low arithmetic complexity, therefore yielding both low energy dissipation and a low chip area [6]. The transfer function of the proposed filter is

$$H(z) = \frac{1 + 2z^{-1} + 2z^{-2} + z^{-3}}{2 + z^{-2}}. \qquad (3)$$

Fig. 3. Unfolded architectures of the HBD filter: (a) uf-4 HBD filter, (b) uf-8 HBD filter.

All the filter coefficients are 1 or 2 and may be implemented by simple shifting, thereby saving area and energy dissipation of the circuit. An initial analysis indicates that the required throughput would not be achieved by a single sample implementation of this filter. Therefore, unfolding was applied. Unfolding is a transformation technique that calculates j samples per clock cycle, where j is the unfolding factor. Unfolding has the property of preserving the number of delays in a Data Flow Graph (DFG) [7]. The basic HBD filter architecture was unfolded to get three more structures, i.e., unfolded by 2 (uf-2), unfolded by 4 (uf-4), and unfolded by 8 (uf-8). In all unfolded architectures the number of registers remains unchanged, whereas the number of adders scales with the unfolding factor. Fig. 2(b) shows the uf-2 version of the filter. The critical path of this circuit is equal to that of the original HBD filter structure. Fig. 3(a) shows the architecture unfolded by a factor of 4. The number of adders has increased according to the unfolding factor. The critical path has increased, since two of the feedback paths do not contain a register. Similarly, Fig. 3(b) shows the architecture of the uf-8 HBD filter, in which the number of adders has increased by a factor of 8 compared to the original HBD structure. The critical path increases further, since six of the feedback paths do not contain any register. However, there are more samples processed per clock cycle in
the unfolded structures, and this gain in throughput outweighs the limited increase in the critical path [8].

B. Hardware Mapping

All the cells used for implementation are from a low-leakage high-threshold (LL-HVT) standard cell library. Tight synthesis constraints were set to get minimum area and a short critical path. The parameters for the energy model were retrieved by gate-level simulations with back annotated toggle and timing information, which includes glitches. The parameters obtained were applied to the energy model to characterize the designs in the sub-VT domain.

IV. SIMULATION RESULTS

In this section the architectures of the filter are evaluated with respect to energy and throughput. The parameters required for the energy model [4], extracted during synthesis and energy simulations as discussed in Sec. II, are presented in Table I. The values for kleak follow the area cost, indicating that leakage is proportional to area. The k parameters for the unfolded implementations are not proportional to the unfolding factor j since the number of internal registers remains unchanged from the basic implementation, although there is an increase in the number of input and output registers.

TABLE I
EXTRACTED PARAMETERS FOR THE SYNTHESIZED IMPLEMENTATIONS
Arch.   kleak    kcap     kcrit   µe      Area   tp [ns]
par     1113.6   835.4    127.4   0.727   1124   2.84
uf-2    1695.5   1375.7   127.4   0.708   1836   2.84
uf-4    3172.5   2797.9   164.2   0.703   3275   3.66
uf-8    5924.5   5422.3   232.2   0.890   6170   5.22

Energy dissipation is calculated under the assumption that the designs operate at critical path speed, which gives an Energy Minimum Voltage (EMV) point [9]. The threshold voltage for this LL-HVT device is around 430 mV. The energy characteristics of the designs per clock cycle over a scaled supply voltage VDD are presented in Fig. 4(a). The basic HBD filter implementation, denoted by par, dissipates the least energy per clock cycle of the four implementations, because its smaller area results in less leakage than in the other circuits. The energy minimum (per clock cycle) of 45.5 fJ for the par implementation is reached at around 241 mV (indicated by the dot) and is lower than the energy minimum of any other architecture, which confirms that a smaller area contributes to less energy per clock cycle. However, it is crucial to investigate the energy spent on the processing of each sample of data, and the apparent benefit of using the par structure is lost when the energy per operation, or energy per sample, is considered. Fig. 4(b) shows the energy dissipation per sample for the different structures. Since the unfolded circuits perform twice, four, and eight times as many operations per clock cycle, the overall energy per sample for these circuits is reduced compared to a single sample implementation. Fig. 4(b) shows that the most efficient architecture is uf-2, as it dissipates 35.8 fJ per sample, which is 22 % less than the energy dissipated by the par structure. Here, we may observe that the uf-8 architecture is less energy efficient than par even in energy dissipation per sample at lower voltages, and is almost equal to par near the threshold voltage. The reason for this behaviour is that uf-8 has a higher switching activity µe. The maximum frequency attainable with respect to VDD is shown in Fig. 4(c). The maximum frequency of both par and uf-2 is always higher than that of their counterparts due to a shorter critical path, and uf-8 has the lowest maximum speed because of its longer critical path, see Table I. Fig. 4(d) shows the energy dissipation of all the structures with respect to throughput. Table II presents the characteristics of all the presented architectures at EMV, showing the maximum frequencies attainable, the corresponding throughputs, and the energy dissipated per clock cycle as well as per sample. These simulations show that we benefit from the unfolding technique, both in energy per sample and in throughput.

TABLE II
CHARACTERIZATION OF THE IMPLEMENTATIONS AT EMV
Arch.   EMV [mV]   Freq. [kHz]   Throughput [ksamples/s]   E/Cyc [fJ]   E/smp [fJ]
par     241        23.6          23.6                      45           45
uf-2    238        23.6          47.2                      71           35
uf-4    247        22.0          88.0                      150          38
uf-8    251        15.4          123.4                     380          48

TABLE III
PERFORMANCES OF THE IMPLEMENTATIONS AT REQUIRED THROUGHPUTS
Throughput       Circuit   VDD [mV]   E/Cyc [fJ]   E/smp [fJ]
2 Msamples/s     uf-8      390        656          82.2
1 Msamples/s     uf-8      368        586          73.3
                 uf-4      376        246          61.5
                 uf-2      400        136          68.3
500 Ksamples/s   uf-8      344        525          65.2
                 uf-4      352        226          54.7
                 uf-2      368        116          58.4
                 par       400        85.2         85.2
250 Ksamples/s   uf-8      300        434          55.0
                 uf-4      320        188          47.0
                 uf-2      344        126          51.8
                 par       368        72.9         72.9

In the project discussed in Sec. I, we need a chain of four HBD filters that reduces the data rate from the 4 Msamples/s delivered by the ADC to the actual data rate of 250 Ksamples/s. The first HBD filter must process the input data stream with a rate of 2 Msamples/s. This throughput requirement is only fulfilled by the uf-8 HBD near 390 mV, as shown in Table III and Fig. 4(d). The throughput requirement of 1 Msamples/s for the second HBD is fulfilled by any of the three unfolded structures, uf-8, uf-4, and uf-2. The throughput requirement of 500 Ksamples/s for the third HBD is fulfilled by all four structures, as shown in Table III and Fig. 4(d). The throughput requirement of 250 Ksamples/s for the last HBD is again fulfilled by all structures. In Fig. 4(b),
the uf-2 structure appears to be the most energy efficient circuit. However, when stringent throughput requirements are in place, the uf-4 structure proves to be the best option, as shown in Fig. 4(d) and Table III. This analysis shows that it is crucial to identify the most suitable architectures for the given throughput and energy requirements. Furthermore, in [10] it is argued that low-leakage low-threshold cells are more beneficial at higher throughput rates in the sub-VT domain, which needs to be further investigated for these filter implementations.

Fig. 4. Simulation plots of the HBD filter architectures (par, uf-2, uf-4, uf-8): (a) energy per clock cycle [fJ] vs. VDD [V], (b) energy per sample [fJ] vs. VDD [V], (c) maximum frequency fmax [kHz] vs. VDD [V], (d) energy per sample [fJ] vs. throughput.

In [1] it was shown that the supply voltage of sub-VT circuits may be reduced down to 50 mV. However, in practical terms, at such low voltage values functional failures frequently occur due to process variations. It was found in [11] that the supply voltage which realizes operation with a failure rate of less than 0.001 for a 65 nm LL-HVT process is 250 mV, and this value is taken as the minimum reliable operating voltage (ROV), indicated in Fig. 4(b) by a line at 250 mV. The simulations show that for the required throughputs we are operating safely above ROV, see Table III.

V. CONCLUSION

In this paper four HBD filter structures are evaluated for minimum energy dissipation in the sub-VT domain for a throughput constrained system. All architectures, i.e., the unfolded by 2, 4, and 8 and the basic HBD filter, are implemented and simulated using 65 nm LL-HVT standard cells. The application of a sub-VT energy model reveals that it is beneficial to use an unfolded implementation to achieve low energy dissipation per sample at EMV, when compared to the energy dissipated by a basic HBD filter implementation.

ACKNOWLEDGMENT

The authors would like to thank the Swedish Foundation for Strategic Research (SSF) for funding the Wireless Communication for Ultra Portable Devices projects at Lund University.

REFERENCES

[1] E. Vittoz, Low-Power Electronics Design. CRC Press, 2004, ch. 16.
[2] P. van der Meer, Low-Power Deep Sub-Micron CMOS Logic. Kluwer Academic Publishers, 2006.
[3] H. Soeleman et al., “Robust subthreshold logic for ultra-low power operation,” IEEE T-VLSI Systems, vol. 9, pp. 90–99, Feb 2001.
[4] O. C. Akgun and Y. Leblebici, “Energy efficiency comparison of asynchronous and synchronous circuits operating in the sub-threshold regime,” Journal of Low Power Electronics, vol. 4, Oct 2008.
[5] P. Nilsson and M. Torkelson, “Method to save silicon area by increasing the filter order,” in Electronic letters. ACM, NY, USA, 1995.
[6] H. Ohlsson et al., “Arithmetic transformations for increased maximal sample rate of bit-parallel bireciprocal lattice wave digital filters,” in ISCAS, 2001.
[7] K. K. Parhi, VLSI Digital Signal Processing Systems, 1999, ch. 5.
[8] P. Åstrom, P. Nilsson, et al., “Power reduction in custom CMOS digital filter structures,” AICSP Journal, vol. 18, pp. 97–105, 1998.
[9] J. Rodrigues et al., “A <1 pJ Sub-VT cardiac event detector in 65 nm LL-HVT CMOS,” VLSI-SOC, 2010.
[10] D. Markovic, J. M. Rabaey, et al., “Ultralow-power design in near-threshold region,” Proceedings of the IEEE, 2010.
[11] J. Rodrigues et al., “Energy dissipation reduction of a cardiac event detector in the sub-Vt domain by architectural folding,” PATMOS, 2009.
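The relation between the extracted parameters of Table I, the energy model (2), and the EMV figures of Table II can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the authors' tool flow: the slope factor n = 1.5 and the inverter capacitance C_inv = 0.87 fF are not given in the paper and are assumed here (C_inv is back-fitted so that par reaches roughly 45.5 fJ), and the function names are hypothetical. The supply voltage is swept over the 150 mV to 400 mV range shown in Fig. 4, and the energy per sample is obtained by dividing the energy per cycle by the unfolding factor j.

```python
import math

U_T = 0.026        # thermal voltage at 300 K [V], as stated in Sec. II
N_SLOPE = 1.5      # ASSUMPTION: slope factor n, not given in the paper
C_INV = 0.87e-15   # ASSUMPTION: inverter capacitance [F], back-fitted to par

# Table I: architecture -> (k_leak, k_cap, k_crit, mu_e, unfolding factor j)
TABLE_I = {
    "par":  (1113.6,  835.4, 127.4, 0.727, 1),
    "uf-2": (1695.5, 1375.7, 127.4, 0.708, 2),
    "uf-4": (3172.5, 2797.9, 164.2, 0.703, 4),
    "uf-8": (5924.5, 5422.3, 232.2, 0.890, 8),
}


def energy_per_cycle(v_dd, k_leak, k_cap, k_crit, mu_e):
    """Energy per clock cycle [J] according to model (2)."""
    leak_term = k_crit * k_leak * math.exp(-v_dd / (N_SLOPE * U_T))
    return C_INV * v_dd ** 2 * (mu_e * k_cap + leak_term)


def find_emv(k_leak, k_cap, k_crit, mu_e, lo_mv=150, hi_mv=400):
    """Grid-search (1 mV steps) the supply voltage [V] with minimum energy per cycle."""
    return min(
        (mv / 1000.0 for mv in range(lo_mv, hi_mv + 1)),
        key=lambda v: energy_per_cycle(v, k_leak, k_cap, k_crit, mu_e),
    )


results = {}
for arch, (k_leak, k_cap, k_crit, mu_e, j) in TABLE_I.items():
    emv = find_emv(k_leak, k_cap, k_crit, mu_e)
    e_cyc = energy_per_cycle(emv, k_leak, k_cap, k_crit, mu_e)
    results[arch] = (emv, e_cyc, e_cyc / j)  # j samples are produced per cycle
    print(f"{arch:5s} EMV = {emv * 1e3:.0f} mV, "
          f"E/cyc = {e_cyc * 1e15:.1f} fJ, E/smp = {e_cyc / j * 1e15:.1f} fJ")
```

With these assumed values the minima fall close to the EMV column of Table II, and uf-2 has the lowest energy per sample. A different choice of n or C_inv shifts the absolute numbers, so the output should be read as a plausibility check rather than a reproduction of the reported results.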