1. Caching HTTP 404 Responses Eliminates Unnecessary Archival Replay Requests
Kritika Garg (1), Himarsha R. Jayanetti (1), Sawood Alam (2), Michele C. Weigle (1), and Michael L. Nelson (1)
@kritika_garg @HimarshaJ @ibnesayeed @weiglemc @phonedude_mln
(1) Web Science & Digital Libraries Research Group, Old Dominion University, Norfolk VA, USA (@WebSciDL)
(2) Wayback Machine, Internet Archive, San Francisco, California, USA (@internetarchive)
The 24th International Conference on Asia-Pacific Digital Libraries (ICADL 2022), November 30
2. Caching HTTP 404 Responses Eliminates Unnecessary Archival Replay Requests ○ ICADL 2022 ○ @WebSciDL ○ @kritika_garg, @HimarshaJ, @ibnesayeed, @weiglemc, @phonedude_mln
Several types of web pages make repeated HTTP requests to the
server for the latest/live updates
● Social media: https://twitter.com/home
● Live news: https://www.cbsnews.com/
● Radio: https://www.iheart.com/
● Live sports scores: https://www.livesport.com/en/
3. Web archives rehost the captured web page (memento)
https://web.archive.org/web/20221115072418/https://oduwsdl.github.io/
The archive banner provides details of the capture. For example, this capture is from Nov 15, 2022.
All embedded resources and outlinked pages are also served from the web archive. For example, https://web.archive.org/web/20221115072418im_/https://oduwsdl.github.io/img/bg-masthead.jpg
5. 1098 requests per minute to the server because embedded resources are missing
https://arquivo.pt/wayback/20090628044051/http://www.radiocomercial.iol.pt/
http://www.radiocomercial.iol.pt/styles/slideshow/loader-0.png
The following types of archived web pages are more likely to cause recurring requests:
1. Web pages with image carousels, banners, widgets, etc.
2. Web pages that require regular updates and poll the server periodically for them. For example:
● sports score updates,
● stock market updates,
● news updates,
● chat applications,
● social media feeds
Carousel with missing images
6. Linear growth in the number of wasteful requests to the server by the radiocomercial.iol.pt memento
The cumulative number of requests over time (per second) by the radiocomercial.iol.pt memento.
After the first 203 requests, growth is linear because of recurring requests (1098 requests/min).
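The growth pattern described here can be approximated with a simple linear model. The sketch below is only an illustration: it assumes (a simplification not stated on the slide) that the initial 203 page-load requests are complete at time zero and that the recurring rate stays constant. The function name `cumulative_requests` is hypothetical.

```python
def cumulative_requests(seconds, initial=203, rate_per_min=1098):
    """Approximate cumulative requests after `seconds` of replay.

    Assumes the initial page-load requests are done at t=0 and that the
    recurring requests arrive at a constant rate (a simplification).
    """
    return initial + rate_per_min * seconds / 60


if __name__ == "__main__":
    for minutes in (1, 5, 10):
        total = cumulative_requests(minutes * 60)
        print(f"after {minutes:>2} min: about {total:.0f} requests")
```

Under these assumptions, leaving the memento open for ten minutes adds 10,980 recurring requests on top of the initial 203.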
7. Archived web pages at esdica.pt with a missing banner average 400 requests per minute
https://arquivo.pt/wayback/20131105211447/http://esdica.pt/
8. The archived web page of livesport.com, which polls for regular feeds, causes unnecessary recurring requests
https://web.archive.org/web/20210901092755/https://www.livesport.com/en/
9. Some archives may patch missing resources by archiving them from the live web
https://web.archive.org/web/20221122230303/https://edition.cnn.com/
Archiving the resource by requesting it from the live web
Successfully archived
10. Patching the archive from the live web creates unnecessary writes & reads
https://web.archive.org/web/20100822133654/http://www.radiocomercial.iol.pt/
Missing resource
Archiving the missing resource is unsuccessful (since the resource does not exist on the live web)
13. Archived carousel example making recurring requests to the server due to missing resources
We archived this carousel example locally using pywb.
14. Avoid recurring requests using the Cache-Control HTTP response header
The Cache-Control HTTP header field is used to specify directives for caching mechanisms in both requests and responses.
● public: The response may be cached by any cache, even if the response would normally be non-cacheable.
● max-age: The cached response remains fresh for N seconds.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control
1st request for archived resource (response sent from the web archive server):
GET /1.jpg HTTP/1.1
HTTP/1.1 404 Not Found
Cache-Control: public, max-age=600
Recurring request within 600s (response sent from cache):
GET /1.jpg HTTP/1.1
HTTP/1.1 404 Not Found
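The exchange above can be mimicked with a small simulation. This is a minimal sketch, not a real browser cache: `FakeArchive`, `TinyCache`, and `replay` are hypothetical names, time is passed in explicitly as `now` (in seconds), and only the `max-age` directive is interpreted.

```python
import re


class FakeArchive:
    """Hypothetical stand-in for a web archive where /1.jpg was never captured."""

    def __init__(self, max_age=600):
        self.hits = 0
        self.max_age = max_age

    def get(self, path):
        self.hits += 1  # count every request that actually reaches the server
        return 404, {"Cache-Control": f"public, max-age={self.max_age}"}


class TinyCache:
    """Very small model of a cache that honors max-age, including for 404s."""

    def __init__(self, origin):
        self.origin = origin
        self.entries = {}  # path -> (status, time at which the entry goes stale)

    def get(self, path, now):
        entry = self.entries.get(path)
        if entry is not None and now < entry[1]:
            return entry[0], "cache"  # still fresh: no request to the server
        status, headers = self.origin.get(path)
        match = re.search(r"max-age=(\d+)", headers.get("Cache-Control", ""))
        if match:
            self.entries[path] = (status, now + int(match.group(1)))
        return status, "server"


def replay(cache, path, duration_s, interval_s):
    """Request `path` every `interval_s` seconds for `duration_s` seconds."""
    return [cache.get(path, now) for now in range(0, duration_s, interval_s)]


if __name__ == "__main__":
    archive = FakeArchive()
    replay(TinyCache(archive), "/1.jpg", duration_s=600, interval_s=10)
    print("requests that reached the server:", archive.hits)
```

In this model, polling `/1.jpg` every 10 seconds for 600 seconds produces 60 requests from the page, but only the first one reaches the archive; the next request at or after the 600-second mark goes to the server again.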
15. Avoid recurring requests for missing resources in the MRE by caching HTTP 404 responses
We used an Nginx proxy server to set the Cache-Control HTTP response header so that HTTP 404 responses are cached.
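The slides do not show the Nginx configuration, so the following is not the authors' setup. It is a hedged Python sketch of the same idea using a different mechanism, WSGI middleware, placed in front of a replay application. The names `cache_404s` and `empty_archive` are hypothetical, and the `max_age` default of 600 simply mirrors the header value shown in the slides.

```python
def cache_404s(app, max_age=600):
    """Wrap a WSGI app so its 404 responses carry a Cache-Control header."""

    def wrapped(environ, start_response):
        def patched_start(status, headers, exc_info=None):
            if status.startswith("404"):
                # Replace any existing Cache-Control value on the 404.
                headers = [(k, v) for k, v in headers
                           if k.lower() != "cache-control"]
                headers.append(("Cache-Control", f"public, max-age={max_age}"))
            return start_response(status, headers, exc_info)

        return app(environ, patched_start)

    return wrapped


def empty_archive(environ, start_response):
    """Hypothetical replay app that has no capture for any URL."""
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"Resource not in archive\n"]


# The wrapped app behaves like the original, except that 404s become cacheable.
app = cache_404s(empty_archive)
```

To try it locally, `app` could be served with the standard library's `wsgiref.simple_server.make_server`. Responses with other status codes pass through unchanged, so only the missing-resource case is affected.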
17. Growth of recurring requests by the MRE to the server halts after caching
The cumulative number of requests over time (per second) by the MRE memento before and after caching 404 responses.
Before caching: linear growth after the first seven requests due to recurring requests (174 requests/min).
After caching 404 responses: 0 recurring requests/second. No new requests are sent to the server until the max-age value times out.
18. No recurring wasteful requests to the server by the radiocomercial.iol.pt memento after caching
The cumulative number of requests over time (per second) by the radiocomercial.iol.pt memento before caching (red line) and the anticipated number after caching 404 responses (blue line).
Red line: linear growth after the first 203 requests due to recurring requests (1098 requests/min).
Blue line: anticipated rate of recurring requests after caching 404 responses (until the max-age value times out).
Arquivo.pt has implemented this solution: it has added a Cache-Control HTTP response header to cache HTTP 404 responses.
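As a rough way to reason about the anticipated traffic after caching, the sketch below gives an upper bound on how many requests for missing resources can still reach the server. It assumes a single client cache that honors `max-age`. The parameter `distinct_missing_urls` is hypothetical, since the slides do not report how many distinct resources are missing from the memento.

```python
import math


def max_server_requests(window_s, distinct_missing_urls, max_age=600):
    """Upper bound on 404 requests reaching the server in `window_s` seconds.

    Assumes one client cache and that each missing URL is refetched at most
    once per max-age period, starting with the first request at t=0.
    """
    fetches_per_url = math.ceil(window_s / max_age)
    return distinct_missing_urls * fetches_per_url


if __name__ == "__main__":
    # Illustrative value only: 5 distinct missing URLs is an assumption.
    print(max_server_requests(window_s=600, distinct_missing_urls=5))
```

The bound depends on the number of distinct missing URLs and the length of the window, not on how often the page polls, which is why the polling rate stops mattering to the server once 404 responses are cacheable.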
19. Summary: Use Cache-Control response headers
● Replaying an archived web page with carousels, widgets, etc.
should not cause ~1000 requests/min to the web archive!
● Web archives that try to patch 404s from the live web may cause
even more unnecessary traffic (reads + writes) to the web archive.
● We demonstrated that these requests can be mitigated by sending
the 404 responses with:
○ Cache-Control: public, max-age=600