SlideShare una empresa de Scribd logo
1 de 20
Descargar para leer sin conexión
TrendMachine:
Temporal Resilience of Web Pages
@WaybackMachine
IIPC Web Archiving Conference (WAC), May 03, 2023, Online
Sawood Alam
Mark Graham
Kritika Garg
Michele C. Weigle
Michael L. Nelson
Dietrich Ayala
Internet Archive
Internet Archive
Old Dominion University
Old Dominion University
Old Dominion University
Protocol Labs
@WebSciDL @ProtocolLabs
Supported in part by Protocol Labs and Filecoin Foundation
TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 2
Research Question
How healthy has a web page been
throughout its lifetime?
TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 3
Temporal and Spatial Landscape of Archival Analysis
Long Duration
Single
Webpage
● TMVis
● Wayback Machine Changes
● TrendMachine
● MementoMap
● CDX Summary
● Archives Unleashed Toolkit
Webpage
Collection
● Memento Damage
● Archival ACID Test
● Reconstructive
● Warrick
● Wayback Machine Downloader
● Video Archiving Insights
Short Duration
TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 4
Modeling Web Page Health: Linear vs. S-Curve
TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 5
Sigmoid Function for Web Page Resilience
Spread: How far up or down the value can go from its starting position?
Shift: How soon any significant change in the value can begin?
Slope: How quickly the value reaches close to the maximum change?
TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 6
TrendMachine: Composite Sigmoid Parameters of Resilience
TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 7
TrendMachine: Overview
Code: https://github.com/internetarchive/trendmachine
Demo: https://trendmachine.sawood-dev.us.archive.org/
TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 8
TrendMachine: Temporal Distribution of Archiving Activities
The page is archived
as few as one or zero
times and as many as
tens of thousands of
times in a single day.
TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 9
Specimen Selection Algorithm
PRIORITY = ["2xx", "4xx", "5xx", "3xx"]
FOREACH st OF PRIORITY
IF st IN statuses(day)
specimen = statuses(day).match(st)[0]
BREAK
DAY1 DAY2 DAY3 DAY4
4xx 3xx 5xx 3xx
3xx 3xx 3xx 5xx
2xx 3xx 5xx 3xx
5xx 4xx 5xx
2xx 4xx
A 3xx specimen usually suggests that the URL is
redirecting to somewhere other than a variation of
the same URL.
TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 10
Filling Missing Observations
Policy DAY1 DAY2 DAY3 DAY4 DAY5 DAY6
Identical 2xx 2xx 2xx 4xx 2xx
Closest 2xx 2xx 2xx 4xx 4xx 2xx
Forward 2xx 2xx 2xx 2xx 4xx 2xx
Backward 2xx 2xx 4xx 4xx 4xx 2xx
ANY 2xx 2xx
Do not fill the gap if the
status codes before and
after are not identical.
Do not fill the gap if it is
larger than a configured
threshold.
TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 11
TrendMachine: TimeMap Status Codes vs. Daily Specimens
Most of the self-redirect 3xx observations
(HTTP/HTTPS or WWW/Apex domain) are
eliminated in daily specimens.
About one third of the days since the first
observation have no captures, of which
some are filled using a filling policy.
TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 12
TrendMachine: Resilience
● Resilience score is calculated using Sigmoid function on status codes of daily specimens
● Initial value of 0.5 and normalized between 0 and 1
● After the first few observations, Wayback Machine did not archive it for several months in 2002
● Towards the end of 2002, Resilience score went up slowly due to infrequent archiving
● In 2003 “wikipedia.org” started to redirect to “en.wikipedia.org”
● After 2005, Resilience of the Wikipedia home page has mostly been stable and high
TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 13
TrendMachine: Fixity
● Fixity score (normalized) is calculated using Sigmoid
function on content digests of daily specimens
● Content digest reported in CDX can be sensitive to
Content-Encoding, resulting in false alarms, even
when the underlying content remains unchanged
TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 14
TrendMachine: Chaos
● Chaos score (normalized) is calculated using a Run-Length Encoding inspired technique on all
status codes of the CDX data in which consecutive duplicates are removed in the numerator
● An alternate sliding-window calculation is performed on the last N observations as the score
becomes insensitive to recent changes on large TimeMaps
● A high Chaos along with a high Resilience is often an indication of canonical redirects (e.g.,
adoption of HTTPS and/or consolidation of WWW and Apex domain)
Chaos =
| 2xx, 2xx, 2xx, 3xx, 3xx, 2xx |
=
3
= 0.5
| 2xx, 2xx, 2xx, 3xx, 3xx, 2xx | 6
TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 15
TrendMachine: Status Code Transitions
● Large numbers along the major diagonal
indicate status code stability for extended
periods of time
● Large numbers in non-diagonal cells suggest
frequent changes in Resilience curve
● Web pages with high Resilience score for
extended periods usually exhibit large numbers
in the top-left cell (2xx -> 2xx)
● A large number in the 3xx -> 3xx cell usually
indicates extended periods of redirection to
other URLs (e.g., URL restructuring, login wall,
domain change, and parked domain)
TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 16
TrendMachine: Compare First and Last Mementos
TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 17
TrendMachine: Live Web Page With Headers
TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 18
Potential Use Cases
● Detect points of interest in a large TimeMap
● Sample captures/mementos from TimeMaps for visual summarization
● Detect archival sinks (like login pages, paywalls, and misconfigured redirects)
● Detect poor-quality pages like Soft-404 and parked domains
● Detect potential link-rot (and fix them when possible, like in a wiki page)
● Optimize crawl jobs by minimizing wasteful downloads and maximizing coverage
● Archival quality assurance
● Cluster pages of a large archival collection in different categories
TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 19
Future Work
● Report heuristics-based archival summary by combining various scores
● Report/embed captures/mementos that can be points of interest
● Calculate Fixity using less-sensitive digests (e.g., SimHash)
● Calculate Chaos after applying convolutions to smooth out alternate changes
● Allow alternate web page health models (not just Sigmoid functions)
● Deploy in production by integrating with Wayback Machine
TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 20
Summary
Code: https://github.com/internetarchive/trendmachine
Demo: https://trendmachine.sawood-dev.us.archive.org/
A mathematical model
to quantify temporal
health of a web page
Resilience, Fixity,
Chaos, Distributions,
Transitions, etc. reports
An interactive portal with
configuration options for
experiments
An evolving
open-source codebase
and demo deployment

Más contenido relacionado

Similar a TrendMachine: Temporal Resilience of Web Pages

Performance-driven front-end development
Performance-driven front-end developmentPerformance-driven front-end development
Performance-driven front-end developmentyouth Overturn
 
WordPress Cluster for Enterprise High-Availability and On-Demand Scaling
WordPress Cluster for Enterprise High-Availability and On-Demand ScalingWordPress Cluster for Enterprise High-Availability and On-Demand Scaling
WordPress Cluster for Enterprise High-Availability and On-Demand ScalingJelastic Multi-Cloud PaaS
 
Lessons Learned From the Longitudinal Sampling of a Large Web Archive
Lessons Learned From the Longitudinal Sampling of a Large Web ArchiveLessons Learned From the Longitudinal Sampling of a Large Web Archive
Lessons Learned From the Longitudinal Sampling of a Large Web ArchiveKritika Garg
 
MySQL Schema Design in Practice
MySQL Schema Design in PracticeMySQL Schema Design in Practice
MySQL Schema Design in PracticeJaime Crespo
 
Monitoring web application response times^lj a hybrid approach for windows
Monitoring web application response times^lj a hybrid approach for windowsMonitoring web application response times^lj a hybrid approach for windows
Monitoring web application response times^lj a hybrid approach for windowsMark Friedman
 
Why is this ASP.NET web app running slowly?
Why is this ASP.NET web app running slowly?Why is this ASP.NET web app running slowly?
Why is this ASP.NET web app running slowly?Mark Friedman
 
Targeting Mobile Platform with MVC 4.0
Targeting Mobile Platform with MVC 4.0Targeting Mobile Platform with MVC 4.0
Targeting Mobile Platform with MVC 4.0Mayank Srivastava
 
Introduction to WSO2 Storage Server
Introduction to WSO2 Storage Server Introduction to WSO2 Storage Server
Introduction to WSO2 Storage Server WSO2
 
Majid_Jalili_SRC_2014
Majid_Jalili_SRC_2014Majid_Jalili_SRC_2014
Majid_Jalili_SRC_2014Majid Jalili
 
Private cloud with vmware
Private cloud with vmwarePrivate cloud with vmware
Private cloud with vmwareAnton An
 
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...DataStax Academy
 
Docker в автоматизации тестирования
Docker в автоматизации тестированияDocker в автоматизации тестирования
Docker в автоматизации тестированияCOMAQA.BY
 
Understanding the Top Four Use Cases for IoT
Understanding the Top Four Use Cases for IoTUnderstanding the Top Four Use Cases for IoT
Understanding the Top Four Use Cases for IoTVoltDB
 
Quarkus Denmark 2019
Quarkus Denmark 2019Quarkus Denmark 2019
Quarkus Denmark 2019Max Andersen
 
CloudBridge and NetApp Storage Solutions - The Killer App
CloudBridge and NetApp Storage Solutions - The Killer AppCloudBridge and NetApp Storage Solutions - The Killer App
CloudBridge and NetApp Storage Solutions - The Killer AppNetApp
 
VMworld 2013: Maximize Database Performance in Your Software-Defined Data Center
VMworld 2013: Maximize Database Performance in Your Software-Defined Data CenterVMworld 2013: Maximize Database Performance in Your Software-Defined Data Center
VMworld 2013: Maximize Database Performance in Your Software-Defined Data CenterVMworld
 
Velocity spa faster_092116
Velocity spa faster_092116Velocity spa faster_092116
Velocity spa faster_092116Manuel Alvarez
 
Making Single Page Applications (SPA) faster
Making Single Page Applications (SPA) faster Making Single Page Applications (SPA) faster
Making Single Page Applications (SPA) faster Boris Livshutz
 

Similar a TrendMachine: Temporal Resilience of Web Pages (20)

Performance-driven front-end development
Performance-driven front-end developmentPerformance-driven front-end development
Performance-driven front-end development
 
WordPress Cluster for Enterprise High-Availability and On-Demand Scaling
WordPress Cluster for Enterprise High-Availability and On-Demand ScalingWordPress Cluster for Enterprise High-Availability and On-Demand Scaling
WordPress Cluster for Enterprise High-Availability and On-Demand Scaling
 
IT Resilience Technical
IT Resilience TechnicalIT Resilience Technical
IT Resilience Technical
 
Lessons Learned From the Longitudinal Sampling of a Large Web Archive
Lessons Learned From the Longitudinal Sampling of a Large Web ArchiveLessons Learned From the Longitudinal Sampling of a Large Web Archive
Lessons Learned From the Longitudinal Sampling of a Large Web Archive
 
MySQL Schema Design in Practice
MySQL Schema Design in PracticeMySQL Schema Design in Practice
MySQL Schema Design in Practice
 
Monitoring web application response times^lj a hybrid approach for windows
Monitoring web application response times^lj a hybrid approach for windowsMonitoring web application response times^lj a hybrid approach for windows
Monitoring web application response times^lj a hybrid approach for windows
 
Why is this ASP.NET web app running slowly?
Why is this ASP.NET web app running slowly?Why is this ASP.NET web app running slowly?
Why is this ASP.NET web app running slowly?
 
Targeting Mobile Platform with MVC 4.0
Targeting Mobile Platform with MVC 4.0Targeting Mobile Platform with MVC 4.0
Targeting Mobile Platform with MVC 4.0
 
Introduction to WSO2 Storage Server
Introduction to WSO2 Storage Server Introduction to WSO2 Storage Server
Introduction to WSO2 Storage Server
 
Majid_Jalili_SRC_2014
Majid_Jalili_SRC_2014Majid_Jalili_SRC_2014
Majid_Jalili_SRC_2014
 
Private cloud with vmware
Private cloud with vmwarePrivate cloud with vmware
Private cloud with vmware
 
Web Performance Optimization
Web Performance OptimizationWeb Performance Optimization
Web Performance Optimization
 
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...
 
Docker в автоматизации тестирования
Docker в автоматизации тестированияDocker в автоматизации тестирования
Docker в автоматизации тестирования
 
Understanding the Top Four Use Cases for IoT
Understanding the Top Four Use Cases for IoTUnderstanding the Top Four Use Cases for IoT
Understanding the Top Four Use Cases for IoT
 
Quarkus Denmark 2019
Quarkus Denmark 2019Quarkus Denmark 2019
Quarkus Denmark 2019
 
CloudBridge and NetApp Storage Solutions - The Killer App
CloudBridge and NetApp Storage Solutions - The Killer AppCloudBridge and NetApp Storage Solutions - The Killer App
CloudBridge and NetApp Storage Solutions - The Killer App
 
VMworld 2013: Maximize Database Performance in Your Software-Defined Data Center
VMworld 2013: Maximize Database Performance in Your Software-Defined Data CenterVMworld 2013: Maximize Database Performance in Your Software-Defined Data Center
VMworld 2013: Maximize Database Performance in Your Software-Defined Data Center
 
Velocity spa faster_092116
Velocity spa faster_092116Velocity spa faster_092116
Velocity spa faster_092116
 
Making Single Page Applications (SPA) faster
Making Single Page Applications (SPA) faster Making Single Page Applications (SPA) faster
Making Single Page Applications (SPA) faster
 

Más de Sawood Alam

CDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection InsightsCDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection InsightsSawood Alam
 
Video Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback MachineVideo Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback MachineSawood Alam
 
Profiling Web Archival Voids for Memento Routing
Profiling Web Archival Voids for Memento RoutingProfiling Web Archival Voids for Memento Routing
Profiling Web Archival Voids for Memento RoutingSawood Alam
 
Readying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web BundlesReadying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web BundlesSawood Alam
 
Summarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMapSummarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMapSawood Alam
 
MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento RoutingMementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento RoutingSawood Alam
 
Supporting Web Archiving via Web Packaging
Supporting Web Archiving via Web PackagingSupporting Web Archiving via Web Packaging
Supporting Web Archiving via Web PackagingSawood Alam
 
MementoMap: An Archive Profile Dissemination Framework
MementoMap: An Archive Profile Dissemination FrameworkMementoMap: An Archive Profile Dissemination Framework
MementoMap: An Archive Profile Dissemination FrameworkSawood Alam
 
Impact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web ArchivesImpact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web ArchivesSawood Alam
 
Archive Assisted Archival Fixity Verification Framework
Archive Assisted Archival Fixity Verification FrameworkArchive Assisted Archival Fixity Verification Framework
Archive Assisted Archival Fixity Verification FrameworkSawood Alam
 
MementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive ProfilingMementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive ProfilingSawood Alam
 
Web ARChive (WARC) File Format
Web ARChive (WARC) File FormatWeb ARChive (WARC) File Format
Web ARChive (WARC) File FormatSawood Alam
 
InterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving
InterPlanetary Wayback: The Next Step Towards Decentralized Web ArchivingInterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving
InterPlanetary Wayback: The Next Step Towards Decentralized Web ArchivingSawood Alam
 
MemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in GoMemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in GoSawood Alam
 
Dockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to ContainerizationDockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to ContainerizationSawood Alam
 
Avoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorkerAvoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorkerSawood Alam
 
Client-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorkerClient-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorkerSawood Alam
 
TPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive ProfilingTPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive ProfilingSawood Alam
 
Introducing Web Archiving and WSDL Research Group
Introducing Web Archiving and WSDL Research GroupIntroducing Web Archiving and WSDL Research Group
Introducing Web Archiving and WSDL Research GroupSawood Alam
 
InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web ArchivesInterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web ArchivesSawood Alam
 

Más de Sawood Alam (20)

CDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection InsightsCDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection Insights
 
Video Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback MachineVideo Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback Machine
 
Profiling Web Archival Voids for Memento Routing
Profiling Web Archival Voids for Memento RoutingProfiling Web Archival Voids for Memento Routing
Profiling Web Archival Voids for Memento Routing
 
Readying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web BundlesReadying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web Bundles
 
Summarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMapSummarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMap
 
MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento RoutingMementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
 
Supporting Web Archiving via Web Packaging
Supporting Web Archiving via Web PackagingSupporting Web Archiving via Web Packaging
Supporting Web Archiving via Web Packaging
 
MementoMap: An Archive Profile Dissemination Framework
MementoMap: An Archive Profile Dissemination FrameworkMementoMap: An Archive Profile Dissemination Framework
MementoMap: An Archive Profile Dissemination Framework
 
Impact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web ArchivesImpact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web Archives
 
Archive Assisted Archival Fixity Verification Framework
Archive Assisted Archival Fixity Verification FrameworkArchive Assisted Archival Fixity Verification Framework
Archive Assisted Archival Fixity Verification Framework
 
MementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive ProfilingMementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive Profiling
 
Web ARChive (WARC) File Format
Web ARChive (WARC) File FormatWeb ARChive (WARC) File Format
Web ARChive (WARC) File Format
 
InterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving
InterPlanetary Wayback: The Next Step Towards Decentralized Web ArchivingInterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving
InterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving
 
MemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in GoMemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in Go
 
Dockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to ContainerizationDockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to Containerization
 
Avoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorkerAvoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorker
 
Client-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorkerClient-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorker
 
TPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive ProfilingTPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive Profiling
 
Introducing Web Archiving and WSDL Research Group
Introducing Web Archiving and WSDL Research GroupIntroducing Web Archiving and WSDL Research Group
Introducing Web Archiving and WSDL Research Group
 
InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web ArchivesInterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
 

Último

Networking in the Penumbra presented by Geoff Huston at NZNOG
Networking in the Penumbra presented by Geoff Huston at NZNOGNetworking in the Penumbra presented by Geoff Huston at NZNOG
Networking in the Penumbra presented by Geoff Huston at NZNOGAPNIC
 
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445ruhi
 
Radiant Call girls in Dubai O56338O268 Dubai Call girls
Radiant Call girls in Dubai O56338O268 Dubai Call girlsRadiant Call girls in Dubai O56338O268 Dubai Call girls
Radiant Call girls in Dubai O56338O268 Dubai Call girlsstephieert
 
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Top Rated Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
Top Rated  Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...Top Rated  Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
Top Rated Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...Call Girls in Nagpur High Profile
 
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort ServiceEnjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort ServiceDelhi Call girls
 
Challengers I Told Ya ShirtChallengers I Told Ya Shirt
Challengers I Told Ya ShirtChallengers I Told Ya ShirtChallengers I Told Ya ShirtChallengers I Told Ya Shirt
Challengers I Told Ya ShirtChallengers I Told Ya Shirtrahman018755
 
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024APNIC
 
Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$
Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$
Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$kojalkojal131
 
10.pdfMature Call girls in Dubai +971563133746 Dubai Call girls
10.pdfMature Call girls in Dubai +971563133746 Dubai Call girls10.pdfMature Call girls in Dubai +971563133746 Dubai Call girls
10.pdfMature Call girls in Dubai +971563133746 Dubai Call girlsstephieert
 
✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663Call Girls Mumbai
 
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLLucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLimonikaupta
 
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark WebGDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark WebJames Anderson
 
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.soniya singh
 
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...Diya Sharma
 

Último (20)

Rohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
 
Networking in the Penumbra presented by Geoff Huston at NZNOG
Networking in the Penumbra presented by Geoff Huston at NZNOGNetworking in the Penumbra presented by Geoff Huston at NZNOG
Networking in the Penumbra presented by Geoff Huston at NZNOG
 
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
 
Radiant Call girls in Dubai O56338O268 Dubai Call girls
Radiant Call girls in Dubai O56338O268 Dubai Call girlsRadiant Call girls in Dubai O56338O268 Dubai Call girls
Radiant Call girls in Dubai O56338O268 Dubai Call girls
 
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
 
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Top Rated Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
Top Rated  Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...Top Rated  Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
Top Rated Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
 
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort ServiceEnjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
 
Challengers I Told Ya ShirtChallengers I Told Ya Shirt
Challengers I Told Ya ShirtChallengers I Told Ya ShirtChallengers I Told Ya ShirtChallengers I Told Ya Shirt
Challengers I Told Ya ShirtChallengers I Told Ya Shirt
 
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
 
Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$
Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$
Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$
 
Call Girls In Noida 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SERVICE
Call Girls In Noida 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SERVICECall Girls In Noida 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SERVICE
Call Girls In Noida 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SERVICE
 
10.pdfMature Call girls in Dubai +971563133746 Dubai Call girls
10.pdfMature Call girls in Dubai +971563133746 Dubai Call girls10.pdfMature Call girls in Dubai +971563133746 Dubai Call girls
10.pdfMature Call girls in Dubai +971563133746 Dubai Call girls
 
Call Girls In South Ex 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SERVICE
Call Girls In South Ex 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SERVICECall Girls In South Ex 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SERVICE
Call Girls In South Ex 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SERVICE
 
✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
 
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLLucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
 
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark WebGDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
 
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
 
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
 
Rohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
 

TrendMachine: Temporal Resilience of Web Pages

  • 1. TrendMachine: Temporal Resilience of Web Pages @WaybackMachine IIPC Web Archiving Conference (WAC), May 03, 2023, Online Sawood Alam Mark Graham Kritika Garg Michele C. Weigle Michael L. Nelson Dietrich Ayala Internet Archive Internet Archive Old Dominion University Old Dominion University Old Dominion University Protocol Labs @WebSciDL @ProtocolLabs Supported in part by Protocol Labs and Filecoin Foundation
  • 2. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 2 Research Question How healthy has a web page been throughout its lifetime?
  • 3. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 3 Temporal and Spatial Landscape of Archival Analysis Long Duration Single Webpage ● TMVis ● Wayback Machine Changes ● TrendMachine ● MementoMap ● CDX Summary ● Archives Unleashed Toolkit Webpage Collection ● Memento Damage ● Archival ACID Test ● Reconstructive ● Warrick ● Wayback Machine Downloader ● Video Archiving Insights Short Duration
  • 4. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 4 Modeling Web Page Health: Linear vs. S-Curve
  • 5. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 5 Sigmoid Function for Web Page Resilience Spread: How far up or down the value can go from its starting position? Shift: How soon any significant change in the value can begin? Slope: How quickly the value reaches close to the maximum change?
  • 6. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 6 TrendMachine: Composite Sigmoid Parameters of Resilience
  • 7. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 7 TrendMachine: Overview Code: https://github.com/internetarchive/trendmachine Demo: https://trendmachine.sawood-dev.us.archive.org/
  • 8. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 8 TrendMachine: Temporal Distribution of Archiving Activities The page is archived as few as one or zero times and as many as tens of thousands of times in a single day.
  • 9. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 9 Specimen Selection Algorithm PRIORITY = ["2xx", "4xx", "5xx", "3xx"] FOREACH st OF PRIORITY IF st IN statuses(day) specimen = statuses(day).match(st)[0] BREAK DAY1 DAY2 DAY3 DAY4 4xx 3xx 5xx 3xx 3xx 3xx 3xx 5xx 2xx 3xx 5xx 3xx 5xx 4xx 5xx 2xx 4xx A 3xx specimen usually suggests that the URL is redirecting to somewhere other than a variation of the same URL.
  • 10. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 10 Filling Missing Observations Policy DAY1 DAY2 DAY3 DAY4 DAY5 DAY6 Identical 2xx 2xx 2xx 4xx 2xx Closest 2xx 2xx 2xx 4xx 4xx 2xx Forward 2xx 2xx 2xx 2xx 4xx 2xx Backward 2xx 2xx 4xx 4xx 4xx 2xx ANY 2xx 2xx Do not fill the gap if the status codes before and after are not identical. Do not fill the gap if it is larger than a configured threshold.
  • 11. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 11 TrendMachine: TimeMap Status Codes vs. Daily Specimens Most of the self-redirect 3xx observations (HTTP/HTTPS or WWW/Apex domain) are eliminated in daily specimens. About one third of the days since the first observation have no captures, of which some are filled using a filling policy.
  • 12. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 12 TrendMachine: Resilience ● Resilience score is calculated using Sigmoid function on status codes of daily specimens ● Initial value of 0.5 and normalized between 0 and 1 ● After the first few observations, Wayback Machine did not archive it for several months in 2002 ● Towards the end of 2002, Resilience score went up slowly due to infrequent archiving ● In 2003 “wikipedia.org” started to redirect to “en.wikipedia.org” ● After 2005, Resilience of the Wikipedia home page has mostly been stable and high
  • 13. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 13 TrendMachine: Fixity ● Fixity score (normalized) is calculated using Sigmoid function on content digests of daily specimens ● Content digest reported in CDX can be sensitive to Content-Encoding, resulting in false alarms, even when the underlying content remains unchanged
  • 14. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 14 TrendMachine: Chaos ● Chaos score (normalized) is calculated using a Run-Length Encoding inspired technique on all status codes of the CDX data in which consecutive duplicates are removed in the numerator ● An alternate sliding-window calculation is performed on the last N observations as the score becomes insensitive to recent changes on large TimeMaps ● A high Chaos along with a high Resilience is often an indication of canonical redirects (e.g., adoption of HTTPS and/or consolidation of WWW and Apex domain) Chaos = | 2xx, 2xx, 2xx, 3xx, 3xx, 2xx | = 3 = 0.5 | 2xx, 2xx, 2xx, 3xx, 3xx, 2xx | 6
  • 15. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 15 TrendMachine: Status Code Transitions ● Large numbers along the major diagonal indicate status code stability for extended periods of time ● Large numbers in non-diagonal cells suggest frequent changes in Resilience curve ● Web pages with high Resilience score for extended periods usually exhibit large numbers in the top-left cell (2xx -> 2xx) ● A large number in the 3xx -> 3xx cell usually indicates extended periods of redirection to other URLs (e.g., URL restructuring, login wall, domain change, and parked domain)
  • 16. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 16 TrendMachine: Compare First and Last Mementos
  • 17. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 17 TrendMachine: Live Web Page With Headers
  • 18. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 18 Potential Use Cases ● Detect points of interest in a large TimeMap ● Sample captures/mementos from TimeMaps for visual summarization ● Detect archival sinks (like login pages, paywalls, and misconfigured redirects) ● Detect poor-quality pages like Soft-404 and parked domains ● Detect potential link-rot (and fix them when possible, like in a wiki page) ● Optimize crawl jobs by minimizing wasteful downloads and maximizing coverage ● Archival quality assurance ● Cluster pages of a large archival collection in different categories
  • 19. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 19 Future Work ● Report heuristics-based archival summary by combining various scores ● Report/embed captures/mementos that can be points of interest ● Calculate Fixity using less-sensitive digests (e.g., SimHash) ● Calculate Chaos after applying convolutions to smooth out alternate changes ● Allow alternate web page health models (not just Sigmoid functions) ● Deploy in production by integrating with Wayback Machine
  • 20. TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 20 Summary Code: https://github.com/internetarchive/trendmachine Demo: https://trendmachine.sawood-dev.us.archive.org/ A mathematical model to quantify temporal health of a web page Resilience, Fixity, Chaos, Distributions, Transitions, etc. reports An interactive portal with configuration options for experiments An evolving open-source codebase and demo deployment