3. Back in 2000, home Internet was slow
Modem data rate: 33.6Kbps or 56Kbps
Round-trip latency: >100ms
2 minutes to load a webpage
4. Today, Internet isn’t always fast
Satellite link (e.g. Iridium)
◦ high latency
◦ 2.4Kbps
◦ $1.35 per minute
2G cellular data (e.g. H2O Wireless)
◦ high latency
◦ low bandwidth
◦ $0.30 per MB
5. Web content is redundant
Screenshots of http://quotes.wsj.com/index/CN/SHCOMP taken during a trading day. The quotes change, but everything else stays the same.
6. Web content is often not cached
Web authors don’t want you to cache their content, because:
◦ Content is dynamic. Stock prices may change at any time. News articles are posted throughout the day.
◦ Content is personalized. Your Facebook homepage is different from anyone else’s.
◦ Access counts must be accurate. Advertising revenue is calculated per thousand impressions.
response headers of http://www.dailyfinance.com/
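9. Architecture
(Diagram: a sender and a receiver at the two ends of a bandwidth-constrained channel, each with its own cache.)
◦ Sender: converts repeated strings into tokens
◦ Receiver: reconstructs the original packet
◦ Works at the network layer, protocol-independent
◦ Contents of both caches must be consistent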
10. The Cache
Cache: holds most recent packets
◦ admission policy: admit all
◦ replacement policy: FIFO
Indexed by the representative fingerprints of the packets it holds
◦ maps each fingerprint to the most recent packet in which it appears
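A minimal sketch of such a cache, assuming packets are counted rather than measured in bytes. The class name `PacketCache` and its methods are hypothetical, chosen only for illustration; the slides do not prescribe a data structure.

```python
from collections import OrderedDict


class PacketCache:
    """FIFO packet store plus a fingerprint index (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity      # assumed: maximum number of packets held
        self.packets = OrderedDict()  # packet id -> payload, oldest first
        self.index = {}               # fingerprint -> (packet id, offset)
        self.next_id = 0

    def add(self, payload, fingerprints):
        """Admit every packet; evict the oldest one when the cache is full.

        fingerprints is an iterable of (fingerprint, offset) pairs.
        """
        if len(self.packets) >= self.capacity:
            old_id, _ = self.packets.popitem(last=False)  # FIFO eviction
            # Drop index entries that still point at the evicted packet.
            self.index = {fp: loc for fp, loc in self.index.items()
                          if loc[0] != old_id}
        pid = self.next_id
        self.next_id += 1
        self.packets[pid] = payload
        for fp, off in fingerprints:
            self.index[fp] = (pid, off)  # the most recent packet wins
        return pid

    def lookup(self, fp):
        """Return (payload, offset) for a fingerprint, or None if unknown."""
        loc = self.index.get(fp)
        if loc is None:
            return None
        pid, off = loc
        return self.packets[pid], off
```

Rebuilding the index on every eviction keeps the sketch short; a real implementation would more likely remove stale entries lazily at lookup time.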
11. Representative fingerprints
window size: β
select one in 2^γ fingerprints
fingerprint space: M
1. Calculate rolling Rabin fingerprints for sequences of β bytes, mod M.
2. Select fingerprints ending with γ zero bits as representative fingerprints.
Rabin fingerprints are not cryptographically secure. The algorithm should not assume they are collision-free.
Rabin fingerprints are used for finding similar documents, not for chunking.
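The two steps above can be sketched as follows. This is not a true Rabin fingerprint (which works with polynomials over GF(2)); it uses an ordinary polynomial rolling hash as a stand-in. The function name, the base 257 and the prime modulus 2^61 - 1 are assumptions made for the sketch and differ from the slides' M.

```python
def representative_fingerprints(data, beta=64, gamma=5,
                                modulus=(1 << 61) - 1, base=257):
    """Return [(fingerprint, offset)] for windows whose hash ends in gamma zero bits."""
    if len(data) < beta:
        return []
    mask = (1 << gamma) - 1
    top = pow(base, beta - 1, modulus)  # weight of the byte leaving the window
    h = 0
    for b in data[:beta]:               # hash of the first window
        h = (h * base + b) % modulus
    selected = []
    for start in range(len(data) - beta + 1):
        if h & mask == 0:               # low gamma bits are all zero
            selected.append((h, start))
        if start + beta < len(data):
            # Roll: drop data[start], append data[start + beta].
            h = ((h - data[start] * top) * base + data[start + beta]) % modulus
    return selected
```

Because selection depends only on the window's content, the same β-byte string yields the same representative fingerprint wherever it appears, which is what lets the index find repeats at arbitrary offsets.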
12. Sender process
1. Generate representative fingerprints for the packet.
2. Look up the fingerprints in the cache index.
3. For each hit, verify that it is not a collision, then expand the match to the left and to the right, byte by byte.
4. Convert matched regions into tokens.
5. Add the packet to the cache, evicting the oldest packet if necessary.
6. Send the encoded, smaller packet.
Token format:
• the fingerprint
• # bytes expanded to the left
• # bytes expanded to the right
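The collision check and the byte-by-byte expansion can be sketched like this. The function `expand_match` and its argument names are hypothetical; it assumes the index has already returned a cached packet and the offset of the matching β-byte window.

```python
def expand_match(new, new_off, old, old_off, beta):
    """Verify a fingerprint hit and grow it byte by byte.

    Returns (left, right), the number of extra matching bytes before and
    after the beta-byte window, or None if the windows differ (a collision).
    """
    if new[new_off:new_off + beta] != old[old_off:old_off + beta]:
        return None  # fingerprints matched but the bytes do not
    left = 0
    while (new_off - left > 0 and old_off - left > 0
           and new[new_off - left - 1] == old[old_off - left - 1]):
        left += 1
    right = 0
    n_end, o_end = new_off + beta, old_off + beta
    while (n_end + right < len(new) and o_end + right < len(old)
           and new[n_end + right] == old[o_end + right]):
        right += 1
    return left, right
```

The returned pair, together with the fingerprint, is exactly the information the token format lists.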
13. Receiver process
1. Look up the tokens in the cache index.
2. Reconstruct the original packet.
3. Generate representative fingerprints for the reconstructed packet.
4. Add the packet to the cache, evicting the oldest packet if necessary.
5. Deliver the original packet.
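A sketch of the reconstruction step, under an assumed in-memory representation: the encoded packet is a list whose items are either literal `bytes` or `(fingerprint, left, right)` tuples, and `index` maps a fingerprint to a cached payload and window offset. The slides do not specify a wire format, so this layout is hypothetical.

```python
def decode(encoded, index, beta):
    """Rebuild a packet from literals and (fingerprint, left, right) tokens.

    index maps fingerprint -> (cached payload, window offset).
    Raises KeyError when a token names an unknown fingerprint.
    """
    parts = []
    for item in encoded:
        if isinstance(item, bytes):
            parts.append(item)        # literal bytes are copied as-is
        else:
            fp, left, right = item
            payload, off = index[fp]  # KeyError here means the caches disagree
            parts.append(payload[off - left: off + beta + right])
    return b"".join(parts)
```

After decoding, the receiver would fingerprint and cache the rebuilt packet exactly as the sender did, so that both indexes evolve in step.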
14. Cache consistency
Contents of the sender cache and the receiver cache must be consistent.
Why might the caches be inconsistent?
◦ The network channel isn’t reliable. A packet that entered the sender cache but was lost on the channel will not be present in the receiver cache.
How to detect cache inconsistency?
◦ Fingerprints! Barring collisions, receiving an unrecognized fingerprint indicates that the caches are inconsistent.
What happens if the caches are inconsistent?
◦ The receiver cannot reconstruct the original packet.
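A toy simulation of the failure mode described above. Each packet is reduced to the fingerprints it adds to the cache and the fingerprints its tokens reference; eviction and collisions are ignored, and the function name `undecodable_packets` is hypothetical.

```python
def undecodable_packets(packets, lost):
    """Return the sequence numbers the receiver cannot reconstruct.

    packets: list of (new_fingerprints, referenced_fingerprints) set pairs.
    lost: set of sequence numbers dropped by the channel.
    """
    sender_seen, receiver_seen = set(), set()
    failed = []
    for seq, (new_fps, refs) in enumerate(packets):
        assert refs <= sender_seen    # the sender only references its own cache
        sender_seen |= new_fps        # the sender caches every packet it sends
        if seq in lost:
            continue                  # never reaches the receiver
        if not refs <= receiver_seen:
            failed.append(seq)        # unrecognized fingerprint: inconsistency
            continue                  # cannot rebuild, so cannot cache either
        receiver_seen |= new_fps
    return failed
```

For example, if packet 1 introduces a fingerprint and is lost, a later packet whose token references that fingerprint is reported as undecodable, while packets that reference only delivered data still decode.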
17. Parameters
Fingerprint space: M = 2^60
◦ collisions are almost impossible
Penalty for each matching region: 12 octets
◦ represents the space needed for the token
Window size β and fingerprint selection frequency (one in 2^γ)
◦ large β: better “quality” of matches, fewer potential bytes saved
◦ small β: worse “quality” of matches (shorter matches in more recent packets)
◦ small γ: more likely to find a match, larger index (= less memory for cached packets)
◦ large γ: less likely to find a match, less memory usage
◦ chosen values: γ=5, β=64
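The trade-off can be made concrete with a little arithmetic. The sketch below assumes fingerprints are uniformly distributed, so that on average one window in 2^γ is selected; both helper names are hypothetical.

```python
def expected_index_entries(packet_len, beta=64, gamma=5):
    """Average number of representative fingerprints in one packet."""
    windows = max(packet_len - beta + 1, 0)  # number of beta-byte windows
    return windows / 2 ** gamma              # one in 2^gamma is selected


def net_saving(match_len, token_penalty=12):
    """Bytes saved by replacing a matched region with one token."""
    return match_len - token_penalty
```

With γ=5 and β=64, a 1500-byte packet has 1437 windows and therefore about 45 index entries on average, and even the shortest possible match (one 64-byte window) saves 52 bytes after the 12-octet penalty.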
18. Performance
45Mbps on a PC with Pentium Ⅲ-550 and 1GB memory
This work is designed for slow links.
19. Follow-up work
Later work building on this technique:
◦ universal redundancy elimination
◦ SmartRE: coordinated network-wide redundancy elimination
◦ EndRE: end-system redundancy elimination
21. Amount of redundancy
Internet => corporate: 30% redundant
corporate => Internet: 50% redundant
With just 1MB of memory for cache+index: at least 10% redundant
22. Redundancy by protocol
(Bar chart: redundant traffic as a share of traffic amount (%), per protocol: HTTP, RTSP, Napster, Lotus, HTTPS, FTP-data, NNTP, DNS, ASF, AOL, SMTP, POP, Telnet, Other.)
HTTP, Telnet, POP, ASF have a high percentage of repeated strings.
HTTPS, FTP-data, Napster, RTSP, NNTP have a low percentage of repeated strings.
The redundancy elimination algorithm is protocol-independent, so we can save bytes on non-Web traffic.
23. Comparison with HTTP caching
(Bar chart: traffic (%) for Squid, gzip, Squid+gzip, RE, Squid+RE.)
Redundancy elimination works better than HTTP caching and compression.