7. variable (Sanderson, Phillips, and Van de Sompel, 2011)
• literature review of 17 studies
• research focused on scholarly citations
• decay rates of 39-82%
• over periods of 1-13 years
7. "Digital documents last forever—or five years, whichever comes first."
(Jeff Rothenberg, 1997)
"Out of books sprout... plants" by DeviantArt user quinn.anya under CC BY-SA 2.0
9. The Art and Science of LINK CHECKING
"http Blue Background" by DeviantArt user SoulArt2012 under CC BY-NC-ND 3.0
12. what link checking tells us
"200 ok" by Flickr user reidab under CC BY-NC-SA 2.0
13. possible scenarios
• link works; same website
• link works; different website
– website may or may not still exist
• link doesn’t work; website still exists
• link doesn’t work; website no longer exists
14. link works; same website
http://www.fair.org/ (2002) http://www.fair.org/ (2013)
20. assumptions
• link works; same website
• link works; different website
– website may or may not still exist
• link doesn’t work; website still exists
• link doesn’t work; website no longer exists
21. research questions
• how much are we overestimating website persistence?
– some working links point to different websites
• how much are we underestimating website persistence?
– websites may still exist even though links don’t work, or do work but point to different websites
24. preparing the list of links
• exclude links corresponding to electoral candidate websites
• 1,071 links
– state government
– political parties
– advocacy organizations
– major newspapers
– political blogs
25. methodology
automated
• run Heritrix against links, ignoring robots.txt
• log http response codes
• log redirects
manual
• manually check each link
• same website behind working link?
• does website still exist?
26. methodology
automated
• run Heritrix against links, ignoring robots.txt
• log http response codes
• log redirects
manual
• manually check each link
• same website behind working link?
• does website still exist?
31. summary of results
• how much are we overestimating website persistence?
– 8% of working links point to different websites
• how much are we underestimating website persistence?
– 82% of websites associated with non-working links still exist
– 48% of websites whose links now point to different websites still exist
32. what does it mean?
• websites are (much
more) persistent than
links
• websites are
surprisingly durable?
"Golden Spider Silk" by Flickr user amandabhslater under CC BY-SA 2.0
34. building a website checker
1. check whether link still works
2. check whether link still corresponds to website
3. check whether website still exists
35. "Most web archiving problems are problems of scale."
(Kris Carpenter Negulescu, 2012)
"chutes and ladders" by Flickr user reallyboring under CC BY-NC-SA 2.0
36. building a website checker
1. check whether link still works
2. check whether link still corresponds to website
3. check whether website still exists
38. …but checksums are limited
"Hashing Emily" by Flickr user wlef70 under CC BY-NC-SA 3.0
39. visual analysis of page changes
Pehlivan, Ben-Saad, and Gançarski: "Vi-DIFF: Understanding Web Pages Changes"
40. building a website checker
1. check whether link still works
2. check whether link still corresponds to website
3. check whether website still exists
41. lexical signature of archived page
Ware, Klein, and Nelson: "An Evaluation of Link Neighborhood Lexical Signatures to Rediscover Missing Web Pages"
42. find archived pages w/ Memento
• http protocol enhancement
• enables discovery of archived resources in distributed web archives
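At the protocol level, Memento (standardized as RFC 7089) adds an `Accept-Datetime` request header; a "TimeGate" uses it to select the archived snapshot nearest the requested datetime. A sketch of building such a request follows; the TimeGate URL here is purely illustrative, not a real endpoint:

```python
import urllib.request

# Memento datetime negotiation: ask a TimeGate for the snapshot of a
# target URL nearest a given datetime. The TimeGate host below is a
# placeholder, not a guaranteed service.
target = "http://www.fair.org/"
req = urllib.request.Request(
    "http://example-timegate.org/timegate/" + target,
    headers={"Accept-Datetime": "Thu, 31 Oct 2002 00:00:00 GMT"},
)
# A conforming TimeGate answers with a redirect to the chosen memento,
# plus Link headers advertising related snapshots (the TimeMap).
```

Because the negotiation rides on ordinary http headers, any client or crawler can be extended to ask distributed archives "what did this URL look like then?" without archive-specific APIs.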
44. "The future is already here; it’s just not very evenly distributed."
(William Gibson, 1999)
"Time Travel" by Flickr user xcalibr under CC BY-NC-ND 2.0
We’ve been losing the web for as long as it’s existed; the first webpage, created by Tim Berners-Lee, exists as only a copy recreated a year after the original. http://www.w3.org/History/19921103-hypertext/hypertext/WWW/TheProject.html
Mainstream recognition of the once-esoteric “page not found” http response code reflects the popular perception of the ephemerality of the web.
I started looking into the literature on link persistence in preparation for writing a blog post for the Library of Congress’ digital preservation blog, the Signal. Brewster Kahle, founder of the Internet Archive, has offered various numbers for the average lifespan of a webpage over the years. As someone trying to archive the entire public web, he seemed like someone who would know.
A meta-study of 17 other studies of link persistence suggested that links decay, but at widely varying rates.
The ambiguity about the ephemerality of web content is reminiscent of Rothenberg’s famous quote about the persistence of digital documents in general.
Now let’s take a look at the simplest automated approach to checking the persistence of links.
When the client’s browser requests the resource at a particular URL, the web server first sends an http response code, indicating the disposition of the resource at the requested URL. These are some common response codes.
An automated link checker, also known as a “spider” or “robot”, works by requesting a series of links and recording the response codes. https://secure.flickr.com/photos/chidorian/3461667159/
Response codes are limited, however; they can tell us about the disposition of content at the specified URL, but they can’t tell us what the content at the specified URL is.
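A minimal automated link checker along these lines can be sketched with the Python standard library. The actual study used Heritrix; the `check_link` and `classify` helpers here are hypothetical illustrations of the same idea:

```python
import urllib.request
import urllib.error

def check_link(url, timeout=10):
    """Request a URL; return (status_code, final_url) after redirects.

    Returns (None, url) when no http response arrives at all
    (DNS failure, refused connection, timeout).
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.geturl()
    except urllib.error.HTTPError as e:
        return e.code, url
    except urllib.error.URLError:
        return None, url

def classify(status):
    """Bucket an http status code the way a link-rot study might."""
    if status is None:
        return "no response"
    if 200 <= status < 300:
        return "working"
    if 300 <= status < 400:
        return "redirect"
    if 400 <= status < 500:
        return "client error"   # e.g. 404 Not Found
    return "server error"       # 5xx series
```

Note that this tells us only the *disposition* of each URL, exactly the limitation described above: a "200 ok" says nothing about whether the content behind the URL is still the same website.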
Considering a link and a corresponding website over time, there are a number of possible scenarios when we go back to check on the persistence of both.
The most straightforward case is where the link, the website, and their correspondence all persist.
Sometimes, however, the link still works, but it points to a different website.
That website may still exist at another URL.
Alternatively, maybe the link doesn’t work.
But the website that the link previously corresponded to may still exist at another URL.
Lastly, sometimes both the link doesn’t work and the website doesn’t exist.
These examples illustrate that link persistence and website persistence are two different things and that using the former as a proxy for the latter misses some of the possible scenarios.
Considering those scenarios, conflating link persistence with website persistence will result in systematic mis-measurements of website persistence. How significant are these mis-measurements?
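The scenario space can be made explicit in code. A link checker observes only the first of the three facts below, which is exactly why it mis-measures website persistence; this toy `scenario` function is illustrative, not part of the study’s methodology:

```python
def scenario(link_works, same_website, website_exists):
    """Map three observable facts onto the four slide scenarios."""
    if link_works and same_website:
        return "link works; same website"
    if link_works:
        # The original site may or may not survive at another URL.
        return "link works; different website"
    if website_exists:
        return "link doesn't work; website still exists"
    return "link doesn't work; website no longer exists"

# A link checker sees only `link_works`, so it lumps the second case
# in with the first, and the third in with the fourth.
```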
Measuring website persistence requires knowing about the state of websites in the past, a perfect use case for web archives. I decided to do a study based on the web archives I was most familiar with, those of the Library of Congress. http://lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html
The U.S. Election 2002 Web Archive is one of their earliest web archive collections. The Library of Congress has archived U.S. national election websites every two years since 2000.
There were many more links in the collection than were utilized in this study. Links corresponding to electoral candidate websites were excluded given that they were universally short-lived and would skew the results.
The study consisted of two stages. First, we ran Heritrix against the prepared list of links and logged http response codes and redirects.
In the second stage, we manually visited each link and noted whether it was the same website as we had previously archived. If it was a different website or if the link didn’t work, we attempted to locate the new location of the previously-archived website using a search engine.
The link checker found that 91% of the links ultimately returned a “200” response code. The remaining 9% ultimately returned either “4xx” or “5xx” series response codes.
Bringing in the manual data on whether working links still corresponded to the same websites, the share of links that both work and point to the same website drops to 83%; 8% of all the links are working links pointing to different sites.
Diving in on the non-working links, roughly 77% of the previously-archived websites still exist, even though the previously-archived links no longer point to them.
In aggregate, the percentage of websites that still exist after 10 years is 3% higher (94%) than link checking would’ve suggested (91%). This isn’t at all to say that web archiving isn’t important – if I included the candidate websites, the pie chart would suddenly show that less than half of the websites still existed. Also, for example, the White House website has existed for these last ten years, but specific content on the website has inevitably disappeared.
The results suggest that we may be marginally overestimating website persistence by conflating working links with website persistence but greatly underestimating website persistence by conflating non-working links with websites that have disappeared.
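The reported percentages combine arithmetically into the aggregate figure. A back-of-the-envelope check, using the study’s rounded numbers (so the result is approximate):

```python
# Shares of all links after ~10 years (rounded figures from the study):
working_same = 0.83   # working links still pointing to the same website
working_diff = 0.08   # working links now pointing to different websites
broken       = 0.09   # non-working links

# Fraction of each group whose website still exists somewhere:
surviving = (working_same
             + 0.48 * working_diff   # 48% of redirected-away sites survive
             + 0.82 * broken)        # 82% of sites behind broken links survive

# Link checking alone would report 0.91 (all working links) as the
# persistence rate; actual website persistence comes out near 0.94.
```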
The key caveat for these results is that I excluded from the study over 1,000 URLs in the web archive collection that would likely have been both non-working links and websites that no longer existed. The remaining URLs were those for which persistence and disappearance both seemed reasonably likely outcomes.
We’re able to effectively perform link checking with current technologies. Can we come up with a better approach to checking the persistence of websites? Better understanding website persistence would facilitate better capacity planning (e.g., by reducing storage requirements for near-duplicate resources), inform capture frequency scheduling, and increase confidence that captured links corresponded to desired websites.
A website checker would need to be able to check links, too, but that functionality is already covered. What are the prospects for tools that could check link and website correspondence and check whether a website still exists?
In theory, these two latter tasks aren’t that difficult; it’s just that they need to be automated in order to be scalable.
Let’s look first at possible tools for checking link and website correspondence.
Heritrix already has the ability to compare the checksums of a resource at a particular URL over successive visits. This allows for an “absolute” assessment of sameness.
However, even the smallest change is enough to produce a checksum mis-match. We need a tool that can assess the magnitude or importance of the difference between successive versions, not just the fact of a difference.
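The brittleness is easy to demonstrate: any single-byte edit changes the checksum, so checksum comparison can report *that* a page changed but never *how much*. (Heritrix has its own digest handling; this is just an illustration using SHA-1.)

```python
import hashlib

page_v1 = b"<html><body>Election results</body></html>"
page_v2 = b"<html><body>Election results </body></html>"  # one extra space

h1 = hashlib.sha1(page_v1).hexdigest()
h2 = hashlib.sha1(page_v2).hexdigest()

# The pages are visually near-identical, yet the checksums disagree
# just as completely as they would for two unrelated pages.
assert h1 != h2
```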
The Vi-DIFF algorithm evaluates both the structure of a webpage and its segmented visual appearance to assess the magnitude of change. As a follow-on to a link checker, the algorithm could be calibrated to indicate whether it was the same site as previously visited or an entirely new one.
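As a crude stand-in for that idea, a graded (rather than absolute) change measure can be sketched with text similarity alone; Vi-DIFF itself compares page structure and segmented visual rendering, which this toy `change_magnitude` does not attempt:

```python
import difflib

def change_magnitude(old_html, new_html):
    """Rough magnitude of change between two page versions:
    0.0 = identical, 1.0 = completely different.
    Text-similarity only; a placeholder for structural/visual diffing."""
    ratio = difflib.SequenceMatcher(None, old_html, new_html).ratio()
    return 1.0 - ratio

def same_site(old_html, new_html, threshold=0.5):
    """A calibrated threshold decides 'same site' vs 'entirely new site'.
    The 0.5 default is arbitrary, for illustration."""
    return change_magnitude(old_html, new_html) < threshold
```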
Now let’s look at possible tools for checking website persistence, irrespective of link persistence.
The lexical signature is a set of keywords that are sufficiently descriptive and unique to be used in a search engine to dereference the page.
If the URL no longer works but exists in an archive, the lexical signature can be derived from the archived page and used to locate the new URL.
If the URL itself isn’t archived, the lexical signature can be derived from backlinks.
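A toy version of the idea can be sketched with plain term frequency. The published approaches (e.g., Ware, Klein, and Nelson) use TF-IDF weighting and more careful term selection; this `lexical_signature` helper is only a simplified illustration:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}

def lexical_signature(text, k=5):
    """Toy lexical signature: the k most frequent non-stopword terms
    of a page's text, usable as a search-engine query to rediscover
    the page at a new URL."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(k)]
```

Applied to an archived copy of a missing page (or to the anchor text of its backlinks), the resulting terms become the query used to hunt for the website’s new location.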
These tools exist but are not yet in wide use in the web archiving community. Wider utilization of these tools would allow us to better assess website persistence and the discrepancy with link persistence.