10T Dual-voltage Low Power SRAM
Ruobai Feng, Zhuonan Li, Zhesheng Lou, Yimai Peng, Jie Song
ABSTRACT
With technology scaling, low power operation has become one of the
crucial topics in VLSI. In memory design, it is vital to reduce the
power consumption of the memory with little trade-off in
performance and area. In this paper, we propose a low power 10T
SRAM bit cell to be implemented in a RISC processor, compare it
with the 6T SRAM cell from various perspectives, and present
HSPICE simulation results.
Concepts
• Hardware → Static memory
• Hardware → Clock generation and timing
Keywords
SRAM; 10T Bit Cell SRAM; Voltage Scaling; Read Static Noise
Margin
1. INTRODUCTION
As the demand for memory continuously grows, SRAM becomes
increasingly important in modern VLSI design. However, since
SRAM occupies a large fraction of chip area and consumes a
significant amount of dynamic and leakage power, aggressive
technology scaling makes cooling and power issues even worse.
As a result, power consumption has become the major concern in
SRAM design. Several approaches exist to reduce the total power
consumption: voltage scaling, multi-voltage supply, logic
optimization, pipelining, parallelism, etc. Because the required
performance varies between the components of a SRAM in most
cases, a multiple voltage supply appears to be a good solution for
balancing performance against overall power consumption.
The bit cell is the core storage structure of a SRAM and greatly
influences its performance. The 6T SRAM cell is conventionally
used as the memory cell. Because of its compact design and the
voltage division between the access and driver transistors, the 6T
SRAM cell has relatively small hold and read noise margins, and
substantial problems occur especially when the power supply
voltage is low. To deal with such problems, we propose a
non-conventional 10T SRAM cell that achieves higher read and
write stability in a low voltage environment while, at the same
time, consuming less overall power.
In this paper, Section 2 compares the 6T and 10T SRAM cells in
terms of delay and power consumption during read and write
operations, as well as noise margins. Section 3 explains the layout
techniques we adopted when implementing the bit cell in an actual
SRAM circuit, and Section 4 details the peripheral circuit design of
the SRAM. Simulation results are presented and discussed in
Section 5, and Section 6 describes the problems we encountered in
design as well as possible future improvements.
As noted above, the static noise margin during read operation
decreases because of the voltage division between the access and
driver transistors. To find a SRAM cell with proper read and write
performance at low voltage and higher stability, 8T and 10T
SRAM cells have been proposed for comparison with the
conventional 6T SRAM cell.
2. COMPARISON BETWEEN 6T AND 10T
SRAM CELLS
After comparing and combining ideas from references [2]–[7],
we propose a 10T SRAM cell that achieves higher read and write
stability in a low voltage environment and consumes less power.
The proposed 10T SRAM bit cell is controlled by three control
signals: write, read, and footer. In a read operation, both bit lines
are pre-charged high, and the read and footer signals are driven
high. The inverter pair is grounded as in the 6T SRAM cell, and
the data stored in the cell turns on one of the pass transistors,
allowing the voltage on the corresponding bit line to drop. In a
write operation, the write signal is turned on and the footer signal
turned off to float the nodes to be written. After the write driver
alters the state in the cell, the footer is turned back on to finish the
pull-down transition.
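The control sequence above can be summarized with a small behavioral sketch. This Python model is illustrative only: the class and method names (Cell10T, read, write) are ours, and it models logic states of the read/write/footer signals rather than analog circuit behavior.

```python
# Behavioral sketch of the proposed 10T cell's control sequence.
# Names are illustrative, not taken from the actual netlist.

class Cell10T:
    def __init__(self, q=0):
        self.q = q          # stored data node
        self.footer = True  # footer on: inverter pair grounded

    def read(self, bl, blb):
        # Read: both bit lines precharged high; read and footer driven high.
        # The stored data turns on one lower pass transistor, discharging the
        # corresponding bit line; the data node stays off the discharge path.
        assert bl == 1 and blb == 1, "bit lines must be precharged high"
        self.footer = True
        return (0, 1) if self.q else (1, 0)   # (BL, BLB) after discharge

    def write(self, d):
        # Write: footer off floats the internal nodes so the write driver can
        # flip them without contention; footer then restores the ground path.
        self.footer = False
        self.q = d
        self.footer = True
        return self.q
```

A usage example: `Cell10T(0).write(1)` flips the stored bit with the footer momentarily off, mirroring the two-phase write described above.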
Because low power is our ultimate goal in SRAM design, and we
decided to decrease the power supply voltage as far as performance
remains acceptable, robustness at low supply voltage is an
important criterion in our design process. Figures 2 and 3
compare the read performance of the 6T and 10T SRAM cells when
VDD is lowered to 570 mV. While the 6T cell fails to give the
correct output, the 10T cell still works properly.
Permission to make digital or hard copies of all or part of this work
for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial
advantage and that copies bear this notice and the full citation on the
first page. To copy otherwise, or republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee.
EECS 427, Fall, 2015, Ann Arbor, Michigan, U.S.
Figure 1. Proposed 10T SRAM cell.
In the 6T SRAM cell, the data stored in the cell not only controls
the pull-down transistor but also lies in the bit line discharging
path. The contention between the pull-down transistor and the
access transistor makes the data susceptible to noise. In contrast,
the data stored in the 10T SRAM cell is connected to the gates of
the lower pass transistors and hence is isolated from the
discharging path. Monte Carlo simulation results for the read noise
margin of both SRAM cells are presented in Figures 4.a and 4.b.
While the read noise margin of the 6T cell reaches 0 at sigma = 4,
the 10T cell retains a fair noise margin even as sigma approaches 6.
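If the read noise margin is roughly Gaussian across process variation, the sigma at which the margin distribution crosses zero is simply mean/stddev. The sketch below illustrates that relationship; the two margin distributions are made-up placeholders, not our HSPICE Monte Carlo data.

```python
# Hedged sketch: estimate the "failure sigma" of a read-SNM distribution
# assuming it is approximately Gaussian. Sample parameters are placeholders.
import random
import statistics

def snm_failure_sigma(samples):
    # Sigma at which a Gaussian fitted to the samples crosses zero margin.
    return statistics.mean(samples) / statistics.stdev(samples)

random.seed(0)
cell_6t  = [random.gauss(120e-3, 30e-3) for _ in range(10_000)]  # placeholder
cell_10t = [random.gauss(240e-3, 35e-3) for _ in range(10_000)]  # placeholder
print(f"6T  fails near sigma = {snm_failure_sigma(cell_6t):.1f}")
print(f"10T fails near sigma = {snm_failure_sigma(cell_10t):.1f}")
```

With the placeholder parameters, the 6T distribution crosses zero near sigma = 4 while the 10T one survives well past sigma = 6, matching the qualitative trend reported above.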
Our 10T SRAM cell design greatly decreases contention during
both read and write operations, and thus lowers the overall
dynamic power consumption. The footer contributes the most to
reducing write time and write power: before each write operation,
the footer transistor is turned off, the inverter pair is disconnected
from ground, and as a result the floating internal nodes can be
easily flipped by the write driver. Table 1 below provides
quantitative comparisons between the 6T and 10T SRAM cells in
terms of delay and power.
Table 1. Comparison between 6T and 10T SRAM cells in delay and power consumption.

                      6T SRAM Cell    10T SRAM Cell
Write Time (ps)       165             113
Read Time (ps)        332             360
Write Power (uW)      5.73            2.675
Read Power (uW)       26.88           25.63
Static Power (nW)     0.46            0.898
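The relative differences implied by Table 1 can be computed directly; the numbers below are taken from the table itself.

```python
# Relative comparison derived from Table 1.
t6  = {"write_ps": 165, "read_ps": 332, "wr_uW": 5.73,  "rd_uW": 26.88}
t10 = {"write_ps": 113, "read_ps": 360, "wr_uW": 2.675, "rd_uW": 25.63}

write_speedup      = (t6["write_ps"] - t10["write_ps"]) / t6["write_ps"]
write_power_saving = (t6["wr_uW"]    - t10["wr_uW"])    / t6["wr_uW"]
read_penalty       = (t10["read_ps"] - t6["read_ps"])   / t6["read_ps"]

print(f"write time:  {write_speedup:.0%} faster")      # ~32% faster
print(f"write power: {write_power_saving:.0%} lower")  # ~53% lower
print(f"read time:   {read_penalty:.0%} slower")       # ~8% slower
```

The write path improves substantially while the read path pays only a small penalty, which is the trade the footer-based write scheme is making.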
3. LAYOUT FLOOR PLAN
3.1 Layout of bit cell
The 10T cell layout uses the thin-cell style, with a body contact
shared by every 32 rows to reduce area cost. The cell is actually
not so thin, since we need an extra M3 track for the control signals.
The cell footer is shared between every two vertically adjacent
cells, and the VSS of the read footer is likewise shared between the
two vertically adjacent cells. Initially we also considered sharing
the read transistor, but this is not a good idea, as the two bit lines
could be shorted through the connected drain of the read transistor.
The fact that the footer must be shared inside the cell caused great
trouble in the layout, and we had to assign it an additional 0.4 um
of height to avoid using the already-scarce M3 resource. The total
area of the 10T cell is 2.4 um x 4 um, while the 6T cell we laid out
is 1.2 um x 2.4 um, so the area penalty of the 10T cell is quite
large.
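From the quoted dimensions, the area penalty works out as follows:

```python
# Area penalty computed from the cell dimensions quoted above.
area_10t = 2.4 * 4.0    # um^2, proposed 10T cell
area_6t  = 1.2 * 2.4    # um^2, 6T cell laid out for comparison
print(f"10T/6T area ratio: {area_10t / area_6t:.2f}x")  # 3.33x
```

A 3.33x area cost per bit is the main price paid for the improved low-voltage stability and write energy.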
Figure 2. 6T SRAM cell fails at VDD = 570 mV.
Figure 3. 10T SRAM cell works at VDD = 570 mV.
Figure 4.a. (Left) 6T SRAM cell read noise margin.
Figure 4.b. (Right) 10T SRAM cell read noise margin.
3.2 Layout of single word column
Our SRAM is constructed with 64 rows and 4 words per row; as a
result, 256 words can be stored in the SRAM, controlled by an
8-bit address code. Figure 6 shows the structure of a single word
column. A control block is assigned to each word column; it
combines the address information, the read and write signals, and
the clock generated by the clock generator, locally selects the
correct row, and issues the clocked read, write, or footer signal.
Column controllers that pass on instructions and process signals are
located below the bit cells. For the sake of power efficiency, all
blocks except the level converter operate under a virtual VDD,
which is substantially lower than the normal VDD. The output
signal after a read operation is then converted back to normal VDD
by the level converter.
Figure 7 shows the overall structure of the entire SRAM circuit.
Four word columns are used, and peripheral circuits such as
multiplexers, decoders, clock generators, drivers, and buffers are
added to enhance the functionality. The actual layout is shown in
Figure 8.
4. PERIPHERAL CIRCUITS
4.1 Clock Generator
Timing must be precisely controlled in a SRAM for correct
functionality. As a result, a clock generator that generates clock
signals with different skews based on the global CLK is necessary.
The reference timing is depicted in Figure 9; inverter chains are
used to adjust the skew. The precharge clock is generated by
inverting the CLK signal and adding a 300 ps delay, which ensures
the proper completion of the previous evaluation cycle. When
precharge finishes and the bit cells are ready for read or write, the
read and write instructions are sent to the bit cells 550 ps after the
negative edge of the global CLK. The connection between the bit
lines and the sense amplifier is turned off 400 ps after the read
signal is issued; by then the potential difference between the bit
lines is expected to reach 120 mV, ready to be amplified by the
sense amplifier in the following circuit.
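The skews above compose into a simple timing chain. The sketch below records the constants quoted in the text and composes them; the function name is ours.

```python
# Internal clock skews described in the text (all values from the paper).
PRECHARGE_DELAY_PS = 300   # precharge clock: inverted CLK plus 300 ps
RW_ISSUE_PS        = 550   # read/write issued after CLK's negative edge
SA_ISOLATE_PS      = 400   # bit lines cut from SA after the read signal
TARGET_SWING_MV    = 120   # expected bit-line difference at isolation

def sa_isolation_time_ps():
    # Time from the negative clock edge until the sense amplifier is
    # isolated from the bit lines, ready to regenerate the 120 mV swing.
    return RW_ISSUE_PS + SA_ISOLATE_PS

print(sa_isolation_time_ps())  # 950
```

So the sense amplifier is cut off 950 ps after the negative edge, well inside the 1.9 ns half-cycle reported later in Table 2.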
Figure 9. Instruction timing for SRAM.
Figure 5. 10T SRAM bit cell layout.
Figure 6. 10T SRAM single word column structure.
Figure 7. 4K SRAM structure.
Figure 8. 4K SRAM layout.
4.2 Pre-charge Circuit
Both bit lines need to be pre-charged to virtual VDD before a read
or write. The pre-charge circuit we used is shown in Figure 10.
M1 and M2 are the driver transistors that pre-charge the bit lines,
and transistor M3 equalizes the voltages to ensure both bit lines are
at the same potential before a read.
4.3 Sense Amplifier
The sense amplifier design greatly impacts the read performance of
a SRAM. A properly designed sense amplifier can reduce the
voltage swing on the bit lines, improve read speed, lower power
consumption, and avoid potential disturbance to the state stored in
the cell. However, with technology scaling, SRAM circuits become
denser and more bit cells are attached to each bit line. The parasitic
capacitances increase, slowing down the voltage sensing process.
In our SRAM, we used the conventional regenerative inverter-based
sense amplifier shown in Figure 11.
The read time is determined by the SRAM cell drivability and the
required sense amplifier swing. The required swing was simulated
with Monte Carlo analysis: 10k simulations show that the sense
amplifier works properly with a swing as small as 50 mV at a
working voltage of 0.7 V with 3σ robustness, as shown in
Figure 12. However, we set the read time to provide a SA swing of
100 mV to allow for possible noise.
To achieve the best read delay and compensate for the large load
capacitance, we ran parametric sweeps to optimize the size of each
transistor. For example, Figure 13 shows a sweep of the pull-down
NMOS sizes in the inverter pair; a wider NMOS results in a faster
propagation delay. After taking the area penalty into consideration,
we chose 1.5 um as the final width for the NMOS.
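The shape of such a sizing sweep can be reproduced with a first-order RC model: widening the pull-down NMOS lowers its channel resistance but adds self-loading. The constants below are illustrative placeholders, not values extracted from our HSPICE runs.

```python
# First-order sketch of a sizing sweep like Figure 13's. Placeholder constants.
def delay_ps(width_um, r_unit=2000.0, c_load_fF=50.0, c_self_fF=2.0):
    r = r_unit / width_um                 # channel resistance ~ 1/W (ohms)
    c = c_load_fF + c_self_fF * width_um  # fixed load plus self-capacitance (fF)
    return 0.69 * r * c * 1e-3            # 0.69*R*C in fs, converted to ps

for w in (0.5, 1.0, 1.5, 2.0):
    print(f"W = {w} um -> {delay_ps(w):.0f} ps")
```

The delay falls monotonically with width in this range, so the chosen 1.5 um reflects the area trade-off rather than a delay minimum.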
4.4 Decoding Circuits
Decoding is required to select the proper cells for a read or write
operation. Due to the great number of cells in modern SRAMs, the
load on the decoders grows, so it becomes important to develop a
fast decoder with good drivability. Meanwhile, it is tricky to lay out
the decoding circuits to match the cell height as the cell bank
becomes denser and denser.
The 10T cell our group used has different working conditions for
the read and write processes, so it suffers from the half-select
problem and the word lines cannot be shared. To solve this
problem, we placed the cells of the same word next to each other,
with a control unit in front of each word in charge of its three
control lines. The word controls receive signals from the decoding
circuits, which are implemented as pre-decoding followed by
separate column and row decoding. Input addresses are first
processed by the pre-decoder. The lowest two bits of the address
are decoded into four separate enable lines to four column
decoders, where the column enable signals meet CEN and
Figure 10. Pre-charge circuit.
Figure 11. Sense amplifier circuit.
Figure 12. Monte Carlo results for SA working at 0.7V with 50mV swing.
Figure 13. Sense amplifier sizing sweep.
WEN, and the resulting read/write enable is not sent to the cell
controls until the clock signal arrives. The higher 6 bits of the
address are sent to the row decoder along with their complements.
The row decoder then completes a two-step decoding with a
NAND-INV-NAND-INV chain. This multi-level decoding
technique relieves the load pressure from the high branching factor
and long wires, while reducing the size of each stage to save area
and power. With post-PEX capacitance loads added to the decoding
path, we observe a 260 ps delay from CLK to the READ word line,
which is the critical path.
5. RESULTS ANALYSIS
The timing waveforms are shown in Figures 14.a and 14.b, and the
post-PEX simulation data for the SRAM are shown in Table 2. In
Figure 14, the CLK2 signal is the clock signal of the chip with a
550 ps skew. The reason for generating CLK2 is that in the
baseline design, the cycle time is 3.8 ns, and the decoder and
register file take 1.7 ns. To send the SRAM control signals at the
negative edge of the clock, the decoding time in the SRAM would
have to be smaller than 0.2 ns, so it is not safe to use the negative
edge of CLK directly. Adding a 550 ps skew ensures correct
operation.
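The timing budget behind CLK2's skew follows directly from the figures in the text:

```python
# Timing budget for CLK2, using the values quoted above.
cycle_ns      = 3.8
half_cycle_ns = cycle_ns / 2          # 1.9 ns to the negative edge of CLK
dec_rf_ns     = 1.7                   # decoder + register file delay
margin_ns     = half_cycle_ns - dec_rf_ns
print(f"margin at negative edge: {margin_ns:.1f} ns")  # only 0.2 ns

skew_ps = 550   # delaying CLK2 by 550 ps restores a safe control window
```

With only 0.2 ns of slack at the negative edge, any decoder variation would violate timing, which is why the 550 ps skewed CLK2 is used instead.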
In the read cycle, at the negative edge of CLK2, the column
decoder starts to generate the read control signal for the SRAM
cells; the decoding takes 0.18 ns. This time could be reduced with a
better layout design. When the read signal arrives, the SRAM read
begins. The read time equals the time needed to generate the
differential voltage on the bit lines. In this implementation, the
bit-line differential voltage is 216 mV at 25°C, and the input
differential voltage of the sense amplifier is 154 mV; the
corresponding read time is 0.3 ns. The test data show that the
differential voltage needed for the sense amplifier to regenerate the
signal reliably under noise is close to 100 mV. In that case, the read
time could be shortened to accelerate reading and decrease power
consumption, which is one way to improve the design. After
regenerating the data, the Sense output is sent to the level converter
to convert the swing from 0.75 V to 1.2 V, which takes 500 ps to
produce the output Q. In the level converter, the inverter at the
output port is sized very large in order to drive the high output
capacitance. However, since the actual load capacitance is lower
than the expected value, using an inverter chain would be more
efficient and would save approximately 200 ps of conversion time.
In the write cycle, at the negative edge of CLK2, the column
decoder starts to generate the write control signal for the SRAM
cells. Since the write window spans from the negative edge of
CLK2 to the rising edge of CLK, which is ample time for the data
to be written into the cell, we concentrate instead on the data write
time shown in the figure, which is 250 ps. The data hold time
ensures that the data written into the SRAM cell is stable; it is
120 ps.
Table 2. Post-PEX simulation results.

Parameter                  Symbol      -55°C (min)  25°C (min)  125°C (min)
Cycle time (ns)            Tclk        3.8          3.8         3.8
Clock high (ns)            Tclk,high   1.9          1.9         1.9
Clock low (ns)             Tclk,low    1.9          1.9         1.9
Read signal decoding (ps)  Tclk-r      218          250         260
Read time (ps)             Tread       388          437         485
SA regenerating (ps)       Tsense      60           67          70
Level converting (ps)      Tout        453          504         542
Write time (ps)            Twrite      206          250         220
Data setup (ps)            Ts          0            0           0
Data hold (ps)             Th          110          120         125
6. REFERENCES
[1] Neil Weste and David Harris, CMOS VLSI Design: A Circuits
and Systems Perspective, Addison Wesley, Fourth Edition,
2011.
Figure 14.a. Read Timing.
Figure 14.b. Write Timing.
[2] Vamsi Kiran, P.N.; Saxena, N., Design and Analysis of
Different Types SRAM Cell, Electronics and Communication
Systems (ICECS), 2015 2nd International Conference on,
vol., pp.1060-1065, 2015.
[3] Athe, P.; Dasgupta, S., A comparative study of 6T, 8T and
9T decanano SRAM cell, Industrial Electronics &
Applications, 2009. ISIEA 2009. IEEE Symposium on, vol.2,
pp.889-894, 2009.
[4] Zamani, M.; Hassanzadeh, S.; Hajsadeghi, K.; Saeidi, R.,
A 32kb 90nm 9T-cell sub-threshold SRAM with improved
read and write SNM, Design & Technology of Integrated
Systems in Nanoscale Era (DTIS), 2013 8th International
Conference on, vol., pp.104-107, 2013.
[5] Ramani, A.R.; Ken Choi., A Novel 9T SRAM Design in Sub-
Threshold Region, Electro/Information Technology (EIT),
2011 IEEE International Conference on, vol., pp.1-6, 2011.
[6] Jinmo Kwon; Ik Joon Chang; Insoo Lee; Heemin
Park; Jongsun Park, Heterogeneous SRAM Cell Sizing for
Low-Power H.264 Applications, Circuits and Systems I:
Regular Papers, IEEE Transactions on, vol.59, pp.2275-2284,
2012.
[7] Madiwalar, B.; Kariyappa, B.S., Single Bit-line 7T SRAM
cell for low Power and High SNM, Automation, Computing,
Communication, Control and Compressed Sensing (iMac4s),
2013 International Multi-Conference on, vol., pp.223-228,
2013.
APPENDIX
1. PROCESSOR INTEGRATION
1.1 Processor Floor Plan
The processor is fully integrated in APR. The floor plan is designed to minimize communication wiring by placing related blocks
adjacently, and we sized the decoder and PC to match the height of the RF-ALU and shifter to facilitate routing.
The decoder is a crucial part of the processor; it contains substantial logic and accounts for an important part of the total delay. To decrease
the decoding delay and decoder size, we allowed M4 and M5 to be used, trading off against final integration routing. Fortunately, the final
routing was not much of a problem, as we had considered the floor plan carefully.
When implementing the power rings, we considered the power consumption of the different blocks. For example, the SRAM has only a
small fraction of its cells active during an active cycle, so its power ring need not be very wide. On the other hand, the decoder and datapath
have a much larger spatial activity factor, which requires more power and wider rings. In the final floor plan, however, the power rings were
sized for a more compact layout.
Figure 1. Floor plan of the processor.
Figure 2. Processor image.
1.2 Processor testability
We used 22 of the 40 generated pads for testability, among which are 5 inputs, 16 bits of D to be written back to the RF, and scan signals
for the PC. We did not implement any scan in the decoder, since its logic is already complex; instead, the D signal combined with the PC
scan-in is enough to test the chip functionality.
1.3 Timing consideration and clock signal
The PC, RF, ROM, and RAM are timed by clock signals. The PC output appears immediately after the control signal from the decoder,
and the PC register is written at the rising edge. The RF holds data in the master latch when the clock is high and writes the data into the
slave latch when the clock is low. The write address signal for the RF is refreshed every cycle on the falling edge of the clock, when
decoding has already completed and before the RF write timing. If the write address were latched at the rising edge, violations could occur
when the address arrives too late and D has already been written to the previous address. Our customized RF is designed to work at the
negative edge of CLK, after the decoder and RF.
Clock skew is unavoidable. To reduce it, the clock inputs of the different blocks are placed close to each other. However, we did not
anticipate the large load on the clock signal, and we did not make good use of the clock pad drivers. The clock wire we used is too narrow,
causing a large slew.
1.4 Processor performance
Before the final integration, we improved the performance of our ALU by pushing the limits of both layout and schematic, reducing its
delay from 2.08 ns to 1.75 ns without changing the architecture, which is still based on a carry-select adder. The delays of the other blocks
are as follows: the RF has a setup time of 460 ps for data to be written into the master latch, and its worst-case read time is 700 ps,
corresponding to the CLK-Q delay where data are first written into the slave latch and sent to the output immediately. The delay of the
shifter is 1.13 ns.