7. variable (Sanderson, Phillips, and Van de Sompel, 2011)
• literature review of 17 studies
• research focused on scholarly citations
• decay rates of 39-82%
• over periods of 1-13 years
7. "Digital documents last forever—or five years, whichever comes first."
(Jeff Rothenberg, 1997)
"Out of books sprout... plants" by DeviantArt user quinn.anya under CC BY-SA 2.0
9. The Art and Science of LINK CHECKING
"http Blue Background" by DeviantArt user SoulArt2012 under CC BY-NC-ND 3.0
12. what link checking tells us
"200 ok" by Flickr user reidab under CC BY-NC-SA 2.0
13. possible scenarios
• link works; same website
• link works; different website
– website may or may not still exist
• link doesn’t work; website still exists
• link doesn’t work; website no longer exists
14. link works; same website
http://www.fair.org/ (2002) http://www.fair.org/ (2013)
20. assumptions
• link works; same website
• link works; different website
– website may or may not still exist
• link doesn’t work; website still exists
• link doesn’t work; website no longer exists
21. research questions
• how much are we overestimating website persistence?
– some working links point to different websites
• how much are we underestimating website persistence?
– websites may still exist even though links don’t work, or do work but point to different websites
24. preparing the list of links
• exclude links corresponding to electoral candidate websites
• 1,071 links
– state government
– political parties
– advocacy organizations
– major newspapers
– political blogs
25. methodology
automated
• run Heritrix against links, ignoring robots.txt
• log http response codes
• log redirects
manual
• manually check each link
• same website behind working link?
• does website still exist?
26. methodology
automated
• run Heritrix against links, ignoring robots.txt
• log http response codes
• log redirects
manual
• manually check each link
• same website behind working link?
• does website still exist?
31. summary of results
• how much are we overestimating website persistence?
– 8% of working links point to different websites
• how much are we underestimating website persistence?
– 82% of websites associated with non-working links still exist
– 48% of websites whose links now point to different websites still exist
32. what does it mean?
• websites are (much
more) persistent than
links
• websites are
surprisingly durable?
"Golden Spider Silk" by Flickr user amandabhslater under CC BY-SA 2.0
34. building a website checker
1. check whether link still works
2. check whether link still corresponds to website
3. check whether website still exists
35. "Most web archiving problems are problems of scale."
(Kris Carpenter Negulescu, 2012)
"chutes and ladders" by Flickr user reallyboring under CC BY-NC-SA 2.0
36. building a website checker
1. check whether link still works
2. check whether link still corresponds to website
3. check whether website still exists
38. …but checksums are limited
"Hashing Emily" by Flickr user wlef70 under CC BY-NC-SA 3.0
39. visual analysis of page changes
Pehlivan, Ben-Saad, and Gançarski: "Vi-DIFF: Understanding Web Pages Changes"
40. building a website checker
1. check whether link still works
2. check whether link still corresponds to website
3. check whether website still exists
41. lexical signature of archived page
Ware, Klein, and Nelson: "An Evaluation of Link Neighborhood Lexical Signatures to Rediscover Missing Web Pages"
42. find archived pages w/ Memento
• http protocol enhancement
• enables discovery of archived resources in distributed web archives
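At the protocol level, Memento (standardized as RFC 7089) adds an `Accept-Datetime` request header; a "TimeGate" uses it to select the archived snapshot nearest the requested datetime. A sketch of building such a request follows; the TimeGate URL here is purely illustrative, not a real endpoint:

```python
import urllib.request

# Memento datetime negotiation: ask a TimeGate for the snapshot of a
# target URL nearest a given datetime. The TimeGate host below is a
# placeholder, not a guaranteed service.
target = "http://www.fair.org/"
req = urllib.request.Request(
    "http://example-timegate.org/timegate/" + target,
    headers={"Accept-Datetime": "Thu, 31 Oct 2002 00:00:00 GMT"},
)
# A conforming TimeGate answers with a redirect to the chosen memento,
# plus Link headers advertising related snapshots (the TimeMap).
```

Because the negotiation rides on ordinary http headers, any client or crawler can be extended to ask distributed archives "what did this URL look like then?" without archive-specific APIs.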
44. "The future is already here; it’s just not very evenly distributed."
(William Gibson, 1999)
"Time Travel" by Flickr user xcalibr under CC BY-NC-ND 2.0
We’ve been losing the web for as long as it’s existed; the first webpage, created by Tim Berners-Lee, exists as only a copy recreated a year after the original. http://www.w3.org/History/19921103-hypertext/hypertext/WWW/TheProject.html
Mainstream recognition of the once-esoteric “page not found” http response code reflects the popular perception of the ephemerality of the web.
I started looking into the literature on link persistence in preparation for writing a blog post for the Library of Congress’ digital preservation blog, the Signal. Brewster Kahle, founder of the Internet Archive, has offered various numbers for the average lifespan of a webpage over the years. As someone trying to archive the entire public web, he seemed like someone who would know.
A meta-study of 17 other studies of link persistence suggested that links decay, but at widely varying rates.
The ambiguity about the ephemerality of web content is reminiscent of Rothenberg’s famous quote about the persistence of digital documents in general.
Now let’s take a look at the simplest automated approach to checking the persistence of links.
When the client’s browser requests the resource at a particular URL, the web server first sends an http response code, indicating the disposition of the resource at the requested URL. These are some common response codes.
An automated link checker, also known as a “spider” or “robot”, works by requesting a series of links and recording the response codes. https://secure.flickr.com/photos/chidorian/3461667159/
Response codes are limited, however; they can tell us about the disposition of content at the specified URL, but they can’t tell us what the content at the specified URL is.
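A minimal automated link checker along these lines can be sketched with the Python standard library. The actual study used Heritrix; the `check_link` and `classify` helpers here are hypothetical illustrations of the same idea:

```python
import urllib.request
import urllib.error

def check_link(url, timeout=10):
    """Request a URL; return (status_code, final_url) after redirects.

    Returns (None, url) when no http response arrives at all
    (DNS failure, refused connection, timeout).
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.geturl()
    except urllib.error.HTTPError as e:
        return e.code, url
    except urllib.error.URLError:
        return None, url

def classify(status):
    """Bucket an http status code the way a link-rot study might."""
    if status is None:
        return "no response"
    if 200 <= status < 300:
        return "working"
    if 300 <= status < 400:
        return "redirect"
    if 400 <= status < 500:
        return "client error"   # e.g. 404 Not Found
    return "server error"       # 5xx series
```

Note that this tells us only the *disposition* of each URL, exactly the limitation described above: a "200 ok" says nothing about whether the content behind the URL is still the same website.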
Considering a link and a corresponding website over time, there are a number of possible scenarios when we go back to check on the persistence of both.
The most straightforward case is where the link, the website, and their correspondence all persist.
Sometimes, however, the link still works, but it points to a different website.
That website may still exist at another URL.
Alternatively, maybe the link doesn’t work.
But the website that the link previously corresponded to may still exist at another URL.
Lastly, sometimes both the link doesn’t work and the website doesn’t exist.
These examples illustrate that link persistence and website persistence are two different things and that using the former as a proxy for the latter misses some of the possible scenarios.
Considering those scenarios, conflating link persistence with website persistence will result in systematic mis-measurements of website persistence. How significant are these mis-measurements?
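The scenario space can be made explicit in code. A link checker observes only the first of the three facts below, which is exactly why it mis-measures website persistence; this toy `scenario` function is illustrative, not part of the study’s methodology:

```python
def scenario(link_works, same_website, website_exists):
    """Map three observable facts onto the four slide scenarios."""
    if link_works and same_website:
        return "link works; same website"
    if link_works:
        # The original site may or may not survive at another URL.
        return "link works; different website"
    if website_exists:
        return "link doesn't work; website still exists"
    return "link doesn't work; website no longer exists"

# A link checker sees only `link_works`, so it lumps the second case
# in with the first, and the third in with the fourth.
```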
Measuring website persistence requires knowing about the state of websites in the past, a perfect use case for web archives. I decided to do a study based on the web archives I was most familiar with, those of the Library of Congress. http://lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html
The U.S. Election 2002 Web Archive is one of their earliest web archive collections. The Library of Congress has archived U.S. national election websites every two years since 2000.
There were many more links in the collection than were utilized in this study. Links corresponding to electoral candidate websites were excluded given that they were universally short-lived and would skew the results.
The study consisted of two stages. First, we ran Heritrix against the prepared list of links and logged http response codes and redirects.
In the second stage, we manually visited each link and noted whether it was the same website as we had previously archived. If it was a different website or if the link didn’t work, we attempted to locate the new location of the previously-archived website using a search engine.
The link checker found that 91% of the links ultimately returned a “200” response code. The remaining 9% ultimately returned either “4xx” or “5xx” series response codes.
Bringing in the manual data on whether working links still corresponded to the same websites, the share of links that both work and point to the same website drops to 83%; 8% of all the links are working links pointing to different sites.
Diving in on the non-working links, roughly 77% of the previously-archived websites still exist, even though the previously-archived links no longer point to them.
In aggregate, the percentage of websites that still exist after 10 years is 3% higher (94%) than link checking would’ve suggested (91%). This isn’t at all to say that web archiving isn’t important – if I included the candidate websites, the pie chart would suddenly show that less than half of the websites still existed. Also, for example, the White House website has existed for these last ten years, but specific content on the website has inevitably disappeared.
The results suggest that we may be marginally overestimating website persistence by conflating working links with website persistence but greatly underestimating website persistence by conflating non-working links with websites that have disappeared.
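The reported percentages combine arithmetically into the aggregate figure. A back-of-the-envelope check, using the study’s rounded numbers (so the result is approximate):

```python
# Shares of all links after ~10 years (rounded figures from the study):
working_same = 0.83   # working links still pointing to the same website
working_diff = 0.08   # working links now pointing to different websites
broken       = 0.09   # non-working links

# Fraction of each group whose website still exists somewhere:
surviving = (working_same
             + 0.48 * working_diff   # 48% of redirected-away sites survive
             + 0.82 * broken)        # 82% of sites behind broken links survive

# Link checking alone would report 0.91 (all working links) as the
# persistence rate; actual website persistence comes out near 0.94.
```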
The key caveat for these results is that I excluded from the study over 1,000 URLs in the web archive collection that would likely have been both non-working links and websites that no longer existed. The remaining URLs were those for which persistence and disappearance both seemed reasonably likely outcomes.
We’re able to effectively perform link checking with current technologies. Can we come up with a better approach to checking the persistence of websites? Better understanding website persistence would facilitate better capacity planning (e.g., by reducing storage requirements for near-duplicate resources), inform capture frequency scheduling, and increase confidence that captured links corresponded to desired websites.
A website checker would need to be able to check links, too, but that functionality is already covered. What are the prospects for tools that could check link and website correspondence and check whether a website still exists?
In theory, these two latter tasks aren’t that difficult; it’s just that they need to be automated in order to be scalable.
Let’s look first at possible tools for checking link and website correspondence.
Heritrix already has the ability to compare the checksums of a resource at a particular URL over successive visits. This allows for an “absolute” assessment of sameness.
However, even the smallest change is enough to produce a checksum mis-match. We need a tool that can assess the magnitude or importance of the difference between successive versions, not just the fact of a difference.
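The brittleness is easy to demonstrate: any single-byte edit changes the checksum, so checksum comparison can report *that* a page changed but never *how much*. (Heritrix has its own digest handling; this is just an illustration using SHA-1.)

```python
import hashlib

page_v1 = b"<html><body>Election results</body></html>"
page_v2 = b"<html><body>Election results </body></html>"  # one extra space

h1 = hashlib.sha1(page_v1).hexdigest()
h2 = hashlib.sha1(page_v2).hexdigest()

# The pages are visually near-identical, yet the checksums disagree
# just as completely as they would for two unrelated pages.
assert h1 != h2
```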
The Vi-DIFF algorithm evaluates both the structure of a webpage and its segmented visual appearance to assess the magnitude of change. As a follow-on to a link checker, the algorithm could be calibrated to indicate whether it was the same site as previously visited or an entirely new one.
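As a crude stand-in for that idea, a graded (rather than absolute) change measure can be sketched with text similarity alone; Vi-DIFF itself compares page structure and segmented visual rendering, which this toy `change_magnitude` does not attempt:

```python
import difflib

def change_magnitude(old_html, new_html):
    """Rough magnitude of change between two page versions:
    0.0 = identical, 1.0 = completely different.
    Text-similarity only; a placeholder for structural/visual diffing."""
    ratio = difflib.SequenceMatcher(None, old_html, new_html).ratio()
    return 1.0 - ratio

def same_site(old_html, new_html, threshold=0.5):
    """A calibrated threshold decides 'same site' vs 'entirely new site'.
    The 0.5 default is arbitrary, for illustration."""
    return change_magnitude(old_html, new_html) < threshold
```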
Now let’s look at possible tools for checking website persistence, irrespective of link persistence.
The lexical signature is a set of keywords that are sufficiently descriptive and unique to be used in a search engine to dereference the page.
If the URL no longer works but exists in an archive, the lexical signature can be derived from the archived page and used to locate the new URL.
If the URL itself isn’t archived, the lexical signature can be derived from backlinks.
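A toy version of the idea can be sketched with plain term frequency. The published approaches (e.g., Ware, Klein, and Nelson) use TF-IDF weighting and more careful term selection; this `lexical_signature` helper is only a simplified illustration:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}

def lexical_signature(text, k=5):
    """Toy lexical signature: the k most frequent non-stopword terms
    of a page's text, usable as a search-engine query to rediscover
    the page at a new URL."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(k)]
```

Applied to an archived copy of a missing page (or to the anchor text of its backlinks), the resulting terms become the query used to hunt for the website’s new location.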
These tools exist but are not yet in wide use in the web archiving community. Wider utilization of these tools would allow us to better assess website persistence and the discrepancy with link persistence.