Managing Retransmissions Using the Retransmission Queue
The method for detecting lost segments and retransmitting them is conceptually
simple. Each time we send a segment, we start a retransmission timer. This timer
starts at a predetermined value and counts down over time. If the timer expires
before an acknowledgment is received for the segment, we retransmit the
segment.
TCP uses this basic technique but implements it in a slightly different way. The
reason for this is the need to efficiently deal with many segments that may be
unacknowledged at once, to ensure that they are each retransmitted at the
appropriate time if needed. The TCP system works according to the following
specific sequence:
o Placement On Retransmission Queue, Timer Start: As soon as a segment containing
data is transmitted, a copy of the segment is placed in a data structure called the
retransmission queue. A retransmission timer is started for the segment when it is placed
on the queue. Thus, every segment is at some point placed in this queue. The queue is
kept sorted by the time remaining in the retransmission timer, so the TCP software can
keep track of which timers have the least time remaining before they expire.
o Acknowledgment Processing: If an acknowledgment is received for a segment before
its timer expires, the segment is removed from the retransmission queue.
o Retransmission Timeout: If an acknowledgment is not received before the timer for a
segment expires, a retransmission timeout occurs, and the segment is automatically
retransmitted.
Of course, we have no more guarantee that a retransmitted segment will be
received than we had for the original segment. For this reason, after
retransmitting a segment, it remains on the retransmission queue. The
retransmission timer is reset, and the countdown begins again. Hopefully an
acknowledgment will be received for the retransmission, but if not, the segment
will be retransmitted again and the process repeated.
Certain conditions may cause even repeated retransmissions of a segment to
fail. We don't want TCP to just keep retransmitting forever, so TCP will only
retransmit a lost segment a certain number of times before concluding that there
is a problem and terminating the connection.
Key Concept: To provide basic reliability for sent data, each device’s TCP
implementation uses a retransmission queue. Each sent segment is placed on
the queue and a retransmission timer started for it. When an acknowledgment is
received for the data in the segment, it is removed from the retransmission
queue. If the timer goes off before an acknowledgment is received the segment
is retransmitted and the timer restarted.
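The queue behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any real TCP stack is written: the class and method names, the fixed timeout, the per-segment acknowledgment, and the retry limit of 3 are all hypothetical choices made for the example.

```python
import heapq
from dataclasses import dataclass, field

MAX_RETRIES = 3  # hypothetical limit; real implementations choose their own


@dataclass(order=True)
class Entry:
    expires_at: float                      # the only field used for ordering
    seq: int = field(compare=False)
    data: bytes = field(compare=False)
    retries: int = field(compare=False, default=0)


class RetransmissionQueue:
    def __init__(self, timeout):
        self.timeout = timeout
        self.heap = []       # entry with the least time remaining is on top
        self.acked = set()   # segments acknowledged since they were queued

    def on_send(self, seq, data, now):
        """Queue a copy of a just-sent segment and start its timer."""
        heapq.heappush(self.heap, Entry(now + self.timeout, seq, data))

    def on_ack(self, seq):
        """Mark a segment acknowledged; it is dropped when it reaches the top."""
        self.acked.add(seq)

    def tick(self, now):
        """Retransmit every expired, unacknowledged segment; return their seqs."""
        resent = []
        while self.heap and self.heap[0].expires_at <= now:
            entry = heapq.heappop(self.heap)
            if entry.seq in self.acked:
                continue                   # acknowledged in time, just discard
            if entry.retries >= MAX_RETRIES:
                raise ConnectionError("too many retransmissions, giving up")
            entry.retries += 1
            entry.expires_at = now + self.timeout   # restart the timer
            heapq.heappush(self.heap, entry)        # segment stays on the queue
            resent.append(entry.seq)
        return resent
```

A heap keeps the entry closest to expiry at the front, which matches the idea of a queue sorted by time remaining. A caller would invoke tick() periodically with the current time.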
TCP uses a cumulative acknowledgment system. The Acknowledgment Number field in a
segment received by a device indicates that all bytes of data with sequence numbers less than
that value have been successfully received by the other device. A segment is considered
acknowledged when all of its bytes have been acknowledged; in other words, when an
Acknowledgment Number containing a value larger than the sequence number of its last byte is
received.
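As a small illustration of this rule, the check below decides whether a segment is fully acknowledged. The function name and parameters are hypothetical, and the sketch ignores sequence number wraparound, which a real implementation has to handle.

```python
def is_acknowledged(seg_seq, seg_len, ack_number):
    """True when every byte of the segment is covered by the acknowledgment.

    seg_seq is the sequence number of the segment's first byte.
    Assumption: sequence numbers do not wrap around in this sketch.
    """
    last_byte = seg_seq + seg_len - 1
    return ack_number > last_byte


# A segment carrying bytes 1 through 80:
print(is_acknowledged(1, 80, 81))   # True: all 80 bytes acknowledged
print(is_acknowledged(1, 80, 41))   # False: only bytes 1 to 40 acknowledged
```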
Policies For Dealing with Retransmission When Unacknowledged
Segments Exist
This then leads to an important question: how do we handle retransmissions
when there are subsequent segments outstanding beyond the lost segment? In
our example above, when the server experiences a retransmission timeout on
Segment #3, it must decide what to do about Segment #4, when it simply doesn't
know whether or not the client received it. In our “worst-case scenario”, we have
19 segments that may or may not have shown up at the client after the first one
that was lost.
We have two different possible ways to handle this situation.
Retransmit Only Timed-Out Segments
This is the more “conservative”, or if you prefer, “optimistic” approach. We
retransmit only the segment that timed out, hoping that the other segments
beyond it were successfully received.
This method is best if the segments after the timed-out segment actually showed
up. It doesn't work so well if they did not. In the latter case, each segment would
have to time out individually and be retransmitted. Imagine that in our “worst-case
scenario” all 20 500-byte segments were lost. We would have to wait
for Segment #1 to time out and be retransmitted. This retransmission would be
acknowledged (hopefully) but then we would get stuck waiting for Segment #2 to
time out and be resent. We would have to do this many times.
Retransmit All Outstanding Segments
This is the more “aggressive” or “pessimistic” method. Whenever a segment
times out we re-send not only it but all other segments that are still
unacknowledged.
This method ensures that any time there is a hold up with acknowledgments, we
“refresh” all outstanding segments to give the other device an extra chance at
receiving them in case they too were lost. In the case where all 20 segments
were lost, this saves substantial amounts of time over the “optimistic” approach.
The problem here is that these retransmissions may not be necessary. If the first
of 20 segments was lost and the other 19 were actually received, we'd be re-
sending 9,500 bytes of data (plus headers) for no reason.
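The trade-off between the two policies can be made concrete with a rough cost model. The sketch below is a simplification under loud assumptions: every retransmission is assumed to arrive, timeouts under the first policy happen one after another, and the function name and policy labels are invented for this example.

```python
SEGMENT_SIZE = 500  # bytes per segment, as in the example in the text


def retransmit_cost(lost, total_segments, policy):
    """Return (timeouts_waited_for, unnecessary_bytes_resent).

    lost is the set of segment numbers (1-based) that never arrived.
    Assumption: each retransmission is delivered successfully.
    """
    if not lost:
        return (0, 0)
    if policy == "timed_out_only":
        # Every lost segment must time out on its own; nothing is wasted.
        return (len(lost), 0)
    if policy == "all_outstanding":
        # One timeout triggers a resend of everything still unacknowledged.
        resent = range(min(lost), total_segments + 1)
        unnecessary = sum(1 for s in resent if s not in lost)
        return (1, unnecessary * SEGMENT_SIZE)
    raise ValueError(f"unknown policy: {policy}")


print(retransmit_cost({1}, 20, "all_outstanding"))                 # (1, 9500)
print(retransmit_cost(set(range(1, 21)), 20, "timed_out_only"))    # (20, 0)
```

The first call reproduces the 9,500 wasted bytes mentioned above; the second shows the 20 separate timeouts the "optimistic" policy would have to sit through when everything was lost.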
The optional TCP selective acknowledgment feature provides a more elegant way of handling
subsequent segments when a retransmission timer expires. When a device receives a non-
contiguous segment it includes a special Selective Acknowledgment (SACK) option in its regular
acknowledgment that identifies non-contiguous segments that have already been received, even
if they are not yet acknowledged. This saves the original sender from having to retransmit them.
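To show how selective acknowledgment narrows down what needs resending, here is a minimal sketch. It assumes SACK blocks are given as (left, right) pairs where right is the sequence number just past the last byte received, and that outstanding segments are (sequence number, length) pairs; the function name is hypothetical and sequence wraparound is ignored.

```python
def segments_to_retransmit(outstanding, ack_number, sack_blocks):
    """Return the outstanding segments the receiver has not reported.

    outstanding: list of (seq, length) for segments sent but not yet cleared.
    ack_number:  cumulative acknowledgment (all bytes below it were received).
    sack_blocks: list of (left, right) ranges received out of order,
                 with right being one past the last byte (an assumption here).
    """
    missing = []
    for seq, length in outstanding:
        end = seq + length                      # one past the segment's last byte
        if end <= ack_number:
            continue                            # covered by the cumulative ACK
        if any(left <= seq and end <= right for left, right in sack_blocks):
            continue                            # covered by a SACK block
        missing.append((seq, length))
    return missing


sent = [(1, 500), (501, 500), (1001, 500), (1501, 500)]
print(segments_to_retransmit(sent, 501, [(1001, 2001)]))   # [(501, 500)]
```

Only the second segment is resent; the third and fourth were reported in the SACK block even though the cumulative acknowledgment cannot yet move past byte 500.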
TCP Adaptive Retransmission and
Retransmission Timer Calculations
Whenever a TCP segment is transmitted, a copy of it is also placed on the
retransmission queue. When the segment is placed on the queue, a
retransmission timer is started for the segment, which starts from a particular
value and counts down to zero. It is this timer that controls how long a segment
can remain unacknowledged before the sender gives up, concludes that it is lost
and sends it again.
The length of time we use for the retransmission timer is thus very important. If it is
set too low, we might start retransmitting a segment that was actually received,
because we didn't wait long enough for the acknowledgment of that segment to
arrive. Conversely, if we set the timer too long, we waste time waiting for an
acknowledgment that will never arrive, reducing overall performance.
Difficulties in Choosing the Duration of the Retransmission Timer
Ideally, we would like to set the retransmission timer to a value just slightly larger
than the round-trip time (RTT) between the two TCP devices, that is, the typical
time it takes to send a segment from a client to a server and the server to send
an acknowledgment back to the client (or the other way around, of course). The
problem is that there is no such “typical” round-trip time. There are two main
reasons for this:
o Differences In Connection Distance: Suppose you are at work in the United States,
and during your lunch hour you are transferring a large file between your workstation and
a local server connection using 100 Mbps Fast Ethernet. At the same time, you are
downloading a picture of your nephew from your sister's personal Web site—which is
connected to the Internet using an analog modem to an ISP in a small town near Lima,
Peru. Would you want both of these TCP connections to use the same retransmission
timer value? I certainly hope not!
o Transient Delays and Variability: The amount of time it takes to send data between any
two devices will vary over time due to various happenings on the internetwork:
fluctuations in traffic, router loads and so on. To see an example of this for yourself, try
typing “ping www.tcpipguide.com” from the command line of an Internet-connected PC
and you'll see how the reported times can vary.
Adaptive Retransmission Based on Round-Trip Time Calculations
It is for these reasons that TCP does not attempt to use a static, single number
for its retransmission timers. Instead, TCP uses a dynamic, or adaptive
retransmission scheme. TCP attempts to determine the approximate round-trip
time between the devices, and adjusts it over time to compensate for increases
or decreases in the average delay. The practical issues of how this is done are
important, but are not covered in much detail in the main TCP standard. RFC
2988, Computing TCP's Retransmission Timer (since obsoleted by RFC 6298), discusses the issue extensively.
Round-trip times can “bounce” up and down, as we have seen, so we want to
aim for an average RTT value for the connection. This average should respond
to consistent movement up or down in the RTT without overreacting to a few very
slow or fast acknowledgments. To allow this to happen, the RTT calculation uses
a smoothing formula:
New RTT = (α * Old RTT) + ( (1-α) * Newest RTT Measurement)
Where “α” (alpha) is a smoothing factor between 0 and 1. Higher values of “α”
(closer to 1) provide better smoothing and avoid sudden changes as a result
of one very fast or very slow RTT measurement. Conversely, this also slows
down how quickly TCP reacts to more sustained changes in round-trip time.
Lower values of alpha (closer to 0) make the RTT change more quickly in
reaction to changes in measured RTT, but can cause “over-reaction” when RTTs
fluctuate wildly.
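The effect of alpha is easy to see with numbers. The sketch below applies the smoothing formula to one unusually slow measurement; the function name and the sample values (in milliseconds) are made up for illustration.

```python
def smooth_rtt(old_rtt, newest_measurement, alpha):
    """New RTT = (alpha * Old RTT) + ((1 - alpha) * Newest RTT Measurement)."""
    return alpha * old_rtt + (1 - alpha) * newest_measurement


old_rtt = 100.0   # current average, in milliseconds (hypothetical)
spike = 200.0     # one very slow acknowledgment

# High alpha: the average barely moves (to about 110 ms).
print(smooth_rtt(old_rtt, spike, alpha=0.9))

# Low alpha: the average jumps most of the way to the spike (about 190 ms).
print(smooth_rtt(old_rtt, spike, alpha=0.1))
```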
Acknowledgment Ambiguity
Measuring the round-trip time between two devices is simple in concept: note the
time that a segment is sent, note the time that an acknowledgment is received,
and subtract the two. The measurement is trickier in actual implementation,
however. One of the main potential “gotchas” occurs when a segment is
assumed lost and is retransmitted. The retransmitted segment carries nothing
that distinguishes it from the original. When an acknowledgment is received for
this segment, it's unclear as to whether this corresponds to the retransmission or
the original segment. (Even though we decided the segment was lost and
retransmitted it, it's possible the segment eventually got there, after taking a long
time; or that the segment got there quickly but the acknowledgment took a long
time!)
This is called acknowledgment ambiguity, and is not trivial to solve. We can't just
decide to assume that an acknowledgment always goes with the oldest copy of
the segment sent, because this makes the round-trip time appear too high. We
also don't want to just assume an acknowledgment always goes with the latest
sending of the segment, as that may artificially lower the average round-trip time.
Refinements to RTT Calculation and Karn's Algorithm
TCP's solution to round-trip time calculation is based on the use of a technique
called Karn's algorithm, after its inventor, Phil Karn. The main change this
algorithm makes is the separation of the calculation of average round-trip time
from the calculation of the value to use for timers on retransmitted segments.
The first change made under Karn's algorithm is to not use measured round-trip
times for any segments that are retransmitted in the calculation of the overall
average round-trip time for the connection. This completely eliminates the
problem of acknowledgment ambiguity.
However, this by itself would not allow increased delays due to retransmissions
to affect the average round-trip time. For this, we need the second change:
incorporation of a timer backoff scheme for retransmitted segments. We start by
setting the retransmission timer for each newly-transmitted segment based on the
current average round-trip time. When a segment is retransmitted, the timer is
not reset to the same value it was set for the initial transmission. It is “backed off”
(increased) using a multiplier (typically 2) to give the retransmission more time to
be received. The timer continues to be increased until a retransmission is
successful, up to a certain maximum value. This prevents retransmissions from
being sent too quickly and further adding to network congestion.
Once the retransmission succeeds, the round-trip timer is kept at the longer
(backed-off) value until a valid round-trip time can be measured on a segment
that is sent and acknowledged without retransmission. This permits a device to
respond with longer timers to occasional circumstances that cause delays to
persist for a period of time on a connection, while eventually having the round-trip
time settle back to a long-term average when normal conditions resume.
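The two changes described above can be sketched as a small estimator class. This is a simplified illustration, not a faithful implementation: the class name, the default alpha of 0.875, and the 60-second ceiling are assumptions, and the timer is set equal to the smoothed round-trip time, whereas real implementations add a safety margin based on measured variance.

```python
class RttEstimator:
    def __init__(self, initial_rtt, alpha=0.875, backoff=2, max_timeout=60.0):
        self.srtt = initial_rtt        # smoothed (average) round-trip time
        self.alpha = alpha             # assumed smoothing factor
        self.backoff = backoff         # multiplier applied on each timeout
        self.max_timeout = max_timeout # assumed ceiling for the backed-off timer
        self.timeout = initial_rtt     # value used for the next timer

    def on_timeout(self):
        """A segment was retransmitted: back the timer off, up to the maximum."""
        self.timeout = min(self.timeout * self.backoff, self.max_timeout)

    def on_ack(self, measured_rtt, was_retransmitted):
        """Update the average only from unambiguous measurements."""
        if was_retransmitted:
            return   # ambiguous sample: ignore it and keep the backed-off timer
        self.srtt = self.alpha * self.srtt + (1 - self.alpha) * measured_rtt
        self.timeout = self.srtt       # a valid sample ends the backoff


est = RttEstimator(initial_rtt=1.0)
est.on_timeout()                       # timer backs off to 2.0
est.on_timeout()                       # and again to 4.0
est.on_ack(5.0, was_retransmitted=True)
print(est.timeout)                     # 4.0: the ambiguous sample is ignored
est.on_ack(1.0, was_retransmitted=False)
print(est.timeout)                     # 1.0: back to the smoothed average
```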
Reducing Send Window Size To Reduce The Rate Data Is Sent
Let's go back to our earlier example so I can hopefully explain better what I
mean, but let’s make a few changes. First, to keep things simple, let’s just look at
the transmissions made from the client to the server, not the server’s replies
(other than acknowledgments)—this is illustrated in Figure 222. As before, the
client sends 140 bytes to the server. After sending the 140 bytes, the client has
220 bytes remaining in its usable window—360 in the send window less the 140
bytes it just sent.
Sometime later, the server receives the 140 bytes and puts them in the buffer.
Now, in an “ideal world”, the 140 bytes go into the buffer, are acknowledged and
immediately removed from the buffer. Another way of thinking of this is that the
buffer is of “infinite size” and can hold as much as the client can send. The
buffer's free space remains 360 bytes in size, so the same window size can be
advertised back to the client. This was the “simplification” in the previous
example.
As long as the server can process the data as fast as it comes in, it will keep the
window size at 360 bytes. The client, upon receipt of the acknowledgment of 140
bytes and the same window size it had before, “slides” the full 360-byte window
140 bytes to the right. Since there are now 0 unacknowledged bytes, the client
can now once again send 360 bytes of data. These correspond to the 220 bytes
that were formerly in the usable window, plus 140 new bytes for the ones that
were just acknowledged.
In the “real world”, however, that server might be dealing with dozens, hundreds
or even thousands of TCP connections. The TCP might not be able to process
the data immediately. Alternately, it is possible the application itself might not be
ready for the 140 bytes for whatever reason. In either case, the server's TCP
may not be able to immediately remove all 140 bytes from the buffer. If so, upon
sending an acknowledgment back to the client, it will want to change the window
size that it advertises to the client, to reflect the fact that the buffer is partially
filled.
Suppose that we receive 140 bytes as above, but are able to send only 40 bytes
to the application, leaving 100 bytes in the buffer. When we send back the
acknowledgment for the 140 bytes, the server can reduce the window it advertises by
100, to 260. When the client receives this segment from the server it will see the
acknowledgment of the 140 bytes sent and slide its window 140 bytes to the
right. However, as it slides this window, it reduces its size to only 260 bytes. We
can consider this like sliding the left edge of the window 140 bytes, but the right
edge only 40 bytes. The new, smaller window ensures that the server receives a
maximum of 260 bytes from the client, which will fit in the 260 bytes remaining in
its receive buffer. This is illustrated in the first exchange of messages (Steps #1
through #3) at the top of Figure 226.
Reducing Send Window Size To Stop The Sending of New Data
What if the server is so bogged down that it can't process any of the bytes
received? Let’s suppose that the next transmission from the client is 180 bytes in
size, but the server is so busy it can’t remove any of them. It could buffer the 180
bytes and in the acknowledgment it sends for those bytes, reduce the window
size by the same amount: from 260 to 80. When the client received the
acknowledgment for 180 bytes it would see the window size had reduced by 180
bytes as well. It would “slide” its window by the same amount as the window size
was reduced! This is effectively like the server saying “I acknowledge receipt of
180 bytes, but I am not allowing you to send any new bytes to replace them”.
Another way of looking at this is that the left edge of the window slides 180 bytes
while the right edge remains fixed. And as long as the right edge of the window
doesn't move, the client can't send any more data than it could before receipt of
the acknowledgment. This is the middle exchange (Steps #4 to #6) in Figure 226.
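The edge movements in the last two exchanges can be checked with a few lines of arithmetic. The sketch below tracks the client's send window as a left edge plus a size; the class name is hypothetical, and the starting sequence number of 1 is an assumption made so the numbers are easy to follow.

```python
class SendWindow:
    def __init__(self, left, size):
        self.left = left    # first byte not yet acknowledged
        self.size = size    # window size most recently advertised by the peer

    @property
    def right(self):
        return self.left + self.size   # one past the last byte we may send

    def on_ack(self, acked_bytes, advertised_window):
        """Apply an acknowledgment; return how far the right edge moved."""
        old_right = self.right
        self.left += acked_bytes        # left edge slides past acknowledged data
        self.size = advertised_window   # window takes the newly advertised size
        return self.right - old_right


w = SendWindow(left=1, size=360)
print(w.on_ack(140, 260))   # 40: left edge moved 140, right edge only 40
print(w.on_ack(180, 80))    # 0:  left edge moved 180, right edge stayed put
```

When the returned value is 0, the acknowledgment has freed no new room for sending, which is exactly the "I acknowledge but do not allow replacements" situation described above.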
The TCP sliding window system is used not just for ensuring reliability through acknowledgments
and retransmissions—it is also the basis for TCP’s flow control mechanism. By increasing or
reducing the size of its receive window, a device can raise or lower the rate at which its
connection partner sends it data. In the case where a device becomes extremely busy, it can
even reduce the receive window to zero, closing it; this will halt any further transmissions of data
until the window is reopened.
Problem of shrinking window
What if the server were so overloaded that we actually needed to reduce the size
of the buffer itself? Say memory was short and the operating system said “I
know you have 360 bytes allocated for the receive buffer for this connection, but I
need to free up memory so now you only have 240”. The server still can't
immediately process the 140 bytes it received, so it would need to drop the
window size it sent back to the client all the way from 360 bytes down to 100
(240 in the total buffer less the 140 already received).
In effect, doing this actually moves the right edge of the client's send window
back to the left. It says “not only can't you send more data when you receive this
acknowledgment, but you now can send less”. In TCP parlance, this is called
shrinking the window.
There's a very serious problem with doing this, however: while the original 140
bytes were in transit from the client to the server, the client still thought it had 360
bytes of total window, of which 220 bytes were usable (360 less 140). The client
may well have already sent some of that 220 bytes of data to the server before it
gets notification that the server has shrunk the window! If so, and the server
reduces its buffer to 240 bytes with 140 used, when those 220 bytes show up at
the server, only 100 will fit and any additional ones will need to be discarded.
This will force the client to have to retransmit that data, which is inefficient. Figure
227 illustrates graphically how this situation would play out.
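The numbers in this scenario can be verified with a short sketch. Both helper functions are hypothetical names introduced for the illustration; the first detects that an advertisement would move the right edge to the left, and the second computes how much in-transit data would no longer fit.

```python
def shrinks_window(old_left, old_size, acked_bytes, new_size):
    """True when the new advertisement moves the right edge to the left."""
    old_right = old_left + old_size
    new_right = old_left + acked_bytes + new_size
    return new_right < old_right


def bytes_discarded(buffer_size, buffered, in_flight):
    """Bytes of in-transit data that will not fit in the receive buffer."""
    free = max(buffer_size - buffered, 0)
    return max(in_flight - free, 0)


# Buffer cut from 360 to 240 with 140 bytes still held: window drops to 100.
print(shrinks_window(old_left=1, old_size=360, acked_bytes=140, new_size=100))  # True

# The client had already sent its 220 usable bytes; only 100 fit.
print(bytes_discarded(buffer_size=240, buffered=140, in_flight=220))            # 120
```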
A phenomenon called shrinking the window occurs when a device reduces its
receive window so much that its partner device’s usable transmit window shrinks
in size (meaning that the right edge of its send window moves to the left). Since
this can result in data already in transit having to be discarded, devices must
instead reduce their receive window size more gradually.
A device that reduces its receive window to zero is said to have closed the window. The other
device’s send window is thus closed; it may not send regular data segments. It may, however,
send probe segments to check the status of the window, thus making sure it does not miss
notification when the window reopens.
The MSS parameter ensures that we don't send segments that are too large—
TCP is not allowed to create a segment larger than the MSS. Unfortunately, the
basic sliding windows mechanism doesn't provide any minimum size of segment
that can be transmitted. In fact, not only is it possible for a device to send very
small, inefficient segments, the simplest implementation of flow control using
unrestricted window size adjustments ensures that under conditions of heavy
load, window size will become small, leading to significant performance
reduction!
How Silly Window Syndrome Occurs
To see how this can happen, let's consider an example that is a variation on the
one we’ve been using so far in this section. We'll assume the MSS is 360 and a
client/server pair where again, the server's initial receive window is set to this
same value, 360. This means the client can send a “full-sized” segment to the
server. As long as the server can keep removing the data from the buffer as fast
as the client sends it, we should have no problem. (In reality the buffer size would
normally be larger than the MSS.)
Now, imagine that instead, the server is bogged down for whatever reason while
the client needs to send it a great deal of data. For simplicity, let's say that the
server is only able to remove 1 byte of data from the buffer for every 3 it receives.
Let's say it also removes 40 additional bytes from the buffer during the time it
takes for the client's next segment to arrive. Here's what will happen:
1. The client's send window is 360, and it has lots of data to send. It immediately sends a
360 byte segment to the server. This uses up its entire send window.
2. When the server gets this segment it acknowledges it. However, it can only remove 120
bytes so the server reduces the window size from 360 to 120. It sends this in the Window
field of the acknowledgment.
3. The client receives an acknowledgment of 360 bytes, and sees that the window size has
been reduced to 120. It wants to send its data as soon as possible, so it sends off a 120
byte segment.
4. The server has removed 40 more bytes from the buffer by the time the 120-byte segment
arrives. The buffer thus contains 200 bytes (240 from the first segment, less the 40
removed). The server is able to immediately process one-third of those 120 bytes, or 40
bytes. This means 80 bytes are added to the 200 that already remain in the buffer, so
280 bytes are used up. The server must reduce the window size to 80 bytes.
5. The client will see this reduced window size and send an 80-byte segment.
6. The server started with 280 bytes and removed 40 to yield 240 bytes left. It receives 80
bytes from the client, removes one third, so 53 are added to the buffer, which becomes
293 bytes. It reduces the window size to 67 bytes (360-293).
This process, which is illustrated in Figure 228, will continue for many rounds,
with the window size getting smaller and smaller, especially if the server gets
even more overloaded. Its rate of clearing the buffer may decrease even more,
and the window may close entirely.
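The steps above can be reproduced with a short simulation. It is a sketch of this specific example only: the function name is hypothetical, the client is assumed to always fill the advertised window, and the "one byte in three" rule is rounded to the nearest whole byte, as the example does in step 6.

```python
def simulate_sws(buffer_size=360, drain_between=40, rounds=3):
    """Return the window size advertised by the server after each segment."""
    used = 0                 # bytes currently sitting in the receive buffer
    window = buffer_size     # window the client believes it has
    advertised = []
    for i in range(rounds):
        if i > 0:
            # While the next segment is in transit, the server drains a little.
            used = max(used - drain_between, 0)
        segment = window                          # client sends all it is allowed
        removed_now = round(segment / 3)          # server processes one third at once
        used += segment - removed_now             # the rest stays in the buffer
        window = buffer_size - used               # new window to advertise
        advertised.append(window)
    return advertised


print(simulate_sws())            # [120, 80, 67]
print(simulate_sws(rounds=6))    # the window keeps creeping downward
```

The first three values match steps 2, 4 and 6 of the example. Running more rounds shows the segments staying far below the 360-byte MSS, which is the inefficiency the section is describing.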
The basic TCP sliding window system sets no minimum size on transmitted segments. Under
certain circumstances, this can result in a situation where many small, inefficient segments are
sent, rather than a smaller number of large ones. Affectionately termed silly window syndrome
(SWS), this phenomenon can occur either as a result of a recipient advertising window sizes that
are too small, or a transmitter being too aggressive in immediately sending out very small
amounts of data.