Caching HTTP 404 Responses Eliminates
Unnecessary Archival Replay Requests
Kritika Garg1, Himarsha R. Jayanetti1, Sawood Alam2, Michele C. Weigle1, and Michael L. Nelson1
1 Web Science & Digital Libraries Research Group, Old Dominion University, Norfolk, VA, USA (@WebSciDL)
2 Wayback Machine, Internet Archive, San Francisco, California, USA (@internetarchive)
The 24th International Conference on Asia-Pacific Digital Libraries (ICADL 2022), November 30
@kritika_garg @HimarshaJ @ibnesayeed @weiglemc @phonedude_mln
Caching HTTP 404 Responses Eliminates Unnecessary Archival Replay Requests ○ ICADL 2022 ○ @WebSciDL ○ @kritika_garg, @HimarshaJ, @ibnesayeed, @weiglemc, @phonedude_mln
Several types of web pages make repeated HTTP requests to the
server for the latest/live updates
Live news: https://www.cbsnews.com/
Social media: https://twitter.com/home
Radio: https://www.iheart.com/
Live sports scores: https://www.livesport.com/en/
Web archive rehosting the captured webpage
(memento)
https://web.archive.org/web/20221115072418/https://oduwsdl.github.io/
All embedded resources and outlinked pages are also served from the web archive. For example:
https://web.archive.org/web/20221115072418im_/https://oduwsdl.github.io/img/bg-masthead.jpg
The archive banner provides details of the capture. For example, this capture is from Nov 15, 2022.
Archived web page averaging 1098 requests per minute
https://arquivo.pt/wayback/20090628044051/http://www.radiocomercial.iol.pt/
1098 requests per minute to the server because
embedded resources are missing
https://arquivo.pt/wayback/20090628044051/http://www.radiocomercial.iol.pt/
http://www.radiocomercial.iol.pt/styles/slideshow/loader-0.png
The following types of archived web pages are more likely to cause recurring requests:
1. Web pages with image carousels, banners, widgets, etc.
2. Web pages that require regular updates and poll the server periodically for them. For example,
● sports score updates,
● stock market updates,
● news updates,
● chat applications,
● social media feeds
Screenshot: carousel with missing images
Linear growth in number of wasteful requests to the server by
radiocomercial.iol.pt memento
Cumulative number of requests over time (in seconds) made by the radiocomercial.iol.pt memento.
After the first 203 requests, the growth is linear because of recurring requests (1098 requests/min).
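The linear growth described above can be written as a small model. This is only an illustrative sketch: it assumes the first 203 requests all happen at page load (time zero) and that the recurring rate stays constant at 1098 requests/min. The function name `cumulative_requests` is hypothetical and does not come from the original study.

```python
def cumulative_requests(elapsed_seconds, initial_requests=203, recurring_per_minute=1098):
    """Estimate the total requests sent after `elapsed_seconds` of replay.

    Assumes an instantaneous initial page load and a constant recurring
    rate; both are simplifications made for illustration.
    """
    if elapsed_seconds < 0:
        raise ValueError("elapsed_seconds must be non-negative")
    return initial_requests + recurring_per_minute * elapsed_seconds / 60


if __name__ == "__main__":
    for minutes in (1, 10, 60):
        total = cumulative_requests(minutes * 60)
        print(f"after {minutes:>2} min: about {total:.0f} requests")
```

Under these assumptions, a memento left open in a browser tab keeps adding roughly 1098 requests to the archive's load every minute.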
Archived web pages at esdica.pt with missing banner averaging
400 requests per minute
https://arquivo.pt/wayback/20131105211447/http://esdica.pt/
Archived web page of livesport.com that polls for regular feeds
causes unnecessary recurring requests
https://web.archive.org/web/20210901092755/https://www.livesport.com/en/
Some archives may patch the missing resources by archiving the
resource from the live web
https://web.archive.org/web/20221122230303/https://edition.cnn.com/
Archiving the resource by
requesting the live web
Successfully archived
Patching the archive from the live web creates
unnecessary writes & reads
https://web.archive.org/web/20100822133654/http://www.radiocomercial.iol.pt/
Missing resource: archiving it is unsuccessful, since the resource does not exist on the live web.
Minimal reproducible example (MRE): Carousel
https://kritikagarg.github.io/Unnecessary-Archival-Replay-Requests/MREcarousel_working.html
MRE with missing embedded resources averaging 174 requests/min
https://kritikagarg.github.io/Unnecessary-Archival-Replay-Requests/MREcarousel.html
Archived carousel example making recurring requests to the server due
to missing resources
We archived this carousel example locally using pywb.
Avoid recurring requests using Cache-Control HTTP
response header
The Cache-Control HTTP header field is used to specify directives for caching mechanisms in both requests and responses.
public: The response may be cached by any cache, even if the response would normally be non-cacheable.
max-age: The cached response remains fresh for N seconds.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control
1st request for the archived resource (response sent from the web archive server):
GET /1.jpg HTTP/1.1
HTTP/1.1 404 Not Found
Cache-Control: public, max-age=600
Recurring request within the next 600s (response sent from cache):
GET /1.jpg HTTP/1.1
HTTP/1.1 404 Not Found
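The exchange above can be mimicked with a toy cache. The sketch below is an illustration and not a real browser cache: `NegativeCache`, `archive_server`, and the two-second retry interval are all hypothetical names and values chosen for the example. It treats a stored response as fresh only while its age is strictly less than max-age.

```python
import time


class NegativeCache:
    """Toy client-side cache that stores responses, including 404s, for max-age seconds."""

    def __init__(self, fetch, clock=time.monotonic):
        self._fetch = fetch        # callable: url -> (status, max_age_seconds)
        self._clock = clock        # injectable clock so the demo needs no sleeping
        self._entries = {}         # url -> (status, expires_at)
        self.origin_requests = 0   # requests that actually reached the server

    def get(self, url):
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now < entry[1]:
            return entry[0]        # still fresh: answered from the cache
        status, max_age = self._fetch(url)
        self.origin_requests += 1
        if max_age > 0:
            self._entries[url] = (status, now + max_age)
        return status


def archive_server(url):
    """Hypothetical archive: the resource is missing and the 404 is cacheable for 600 s."""
    return 404, 600


if __name__ == "__main__":
    fake_now = [0.0]
    cache = NegativeCache(archive_server, clock=lambda: fake_now[0])
    for second in range(0, 1200, 2):   # a carousel retrying every 2 s for 20 minutes
        fake_now[0] = float(second)
        cache.get("/1.jpg")
    print("requests that reached the server:", cache.origin_requests)
```

In this simulation the carousel asks for `/1.jpg` 600 times, but only the request at time zero and the one after the 600-second freshness lifetime has elapsed reach the server.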
Avoid recurring requests for missing resources in MRE by caching
HTTP 404 responses
We used an Nginx proxy server to set the Cache-Control HTTP response header so that HTTP 404 responses are cached.
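The slides do not show the Nginx configuration itself. As a stand-in, the same idea can be sketched in Python as a WSGI middleware that adds the header to 404 responses only. This is not the authors' setup: `cache_404_middleware` and `replay_app` are hypothetical names, and the replay application here is a dummy that reports every resource as missing.

```python
def cache_404_middleware(app, cache_control="public, max-age=600"):
    """Wrap a WSGI app so that its 404 responses carry a Cache-Control header."""

    def wrapped(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            if status.startswith("404"):
                # Drop any existing Cache-Control so only one value is sent.
                headers = [(k, v) for k, v in headers if k.lower() != "cache-control"]
                headers.append(("Cache-Control", cache_control))
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return wrapped


def replay_app(environ, start_response):
    """Hypothetical stand-in for an archive replay app: every resource is missing."""
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"Not Found"]


if __name__ == "__main__":
    seen = []
    app = cache_404_middleware(replay_app)
    # Call the app directly with a recording start_response instead of a real server.
    app({}, lambda status, headers, exc_info=None: seen.append((status, headers)))
    print(seen[0])
```

The wrapped app could be served by any WSGI server. Responses with other status codes pass through unchanged, so successful captures keep whatever caching policy the archive already applies.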
Optimizing the replay using Cache-Control
HTTP response header
Halt in growth of recurring requests by MRE to server
after caching
Cumulative number of requests over time (in seconds) made by the MRE memento before and after caching 404 responses.
Before caching: after the first seven requests, the growth is linear because of recurring requests (174 requests/min).
After caching: 0 recurring requests/second. No new requests are sent to the server until the max-age value expires.
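The before-and-after behaviour can be compared with a rough estimate. The sketch below rests on stated assumptions: the seven initial requests happen at time zero, the uncached recurring rate is constant, and after caching each missing resource is re-requested at most once per expired max-age window. The number of missing resources is not given in the slides, so `missing_resources=3` is a purely hypothetical value, as are the function names.

```python
def requests_without_caching(elapsed_seconds, initial=7, recurring_per_minute=174):
    """Total requests when every retry for a missing resource reaches the server."""
    return initial + recurring_per_minute * elapsed_seconds / 60


def requests_with_caching(elapsed_seconds, initial=7, missing_resources=3, max_age=600):
    """Upper-bound estimate when 404s are cached for `max_age` seconds.

    `missing_resources` is a hypothetical count, not a figure from the slides.
    """
    expired_windows = int(elapsed_seconds // max_age)
    return initial + missing_resources * expired_windows


if __name__ == "__main__":
    for seconds in (60, 599, 600, 3600):
        before = requests_without_caching(seconds)
        after = requests_with_caching(seconds)
        print(f"t={seconds:>4}s  without caching: {before:.0f}  with caching: {after}")
```

For any time before the first 600 seconds elapse, the cached estimate stays at the seven initial requests, which matches the flat line reported above.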
No recurring wasteful requests to the server by
radiocomercial.iol.pt memento after caching
Cumulative number of requests over time (in seconds) made by the radiocomercial.iol.pt memento before caching 404 responses (red line) and the anticipated number after caching (blue line).
Red line: after the first 203 requests, the growth is linear because of recurring requests (1098 requests/min).
Blue line: anticipated rate of recurring requests after caching 404 responses (until the max-age value expires).
Arquivo.pt has implemented this solution by adding a Cache-Control HTTP response header to cache HTTP 404 responses.
Summary: Use Cache-Control response headers
● Replaying an archived web page with carousels, widgets, etc.
should not cause ~1000 requests/min to the web archive!
● Web archives that try to patch 404s from the live web may cause
even more unnecessary traffic (reads + writes) to the web archive.
● We demonstrated that these requests can be mitigated by sending
the 404 responses with:
○ Cache-Control: public, max-age=600