1. Herbert Van de Sompel
VOGIN-IP, Amsterdam, the Netherlands, March 9, 2017
Herbert Van de Sompel
LANL & DANS
@hvdsomp
En toen was er niets meer … ("And then there was nothing left …")
2. Herbert Van de Sompel
The Web
3. Herbert Van de Sompel
The Web Evolves
4. Herbert Van de Sompel
Yet, the Web Exists in a Perpetual Now
5. Herbert Van de Sompel
• Content Management Systems
• Web Archives
• Transactional archives
• Search engine caches
• …
Traces of the Past Web Exist
6. Herbert Van de Sompel
But Past and Current Web(s) are Parallel Universes
7. Herbert Van de Sompel
The Memento Protocol Integrates the Current and Past Web
http://mementoweb.org/guide/rfc/
8. Herbert Van de Sompel
Original Resource and Mementos
9. Herbert Van de Sompel
Bridge from Present to Past
10. Herbert Van de Sompel
Bridge from Present to Past
11. Herbert Van de Sompel
Bridge from Past to Present
12. Herbert Van de Sompel
Today
Select Date
Mar 9 1999
Feb 8 1999
Bibliotheca
Alexandrina
Web Archive
Memento: Access Versions via the Original URI and a Datetime
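A minimal sketch of how a client might ask for a past version of a page using only the original URI and a datetime. It builds, but does not send, a request carrying the Memento `Accept-Datetime` header. The TimeGate prefix used here (the public Memento aggregator) and the function names are assumptions for illustration, not something shown on the slide.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Assumption: the public Memento aggregator TimeGate endpoint.
TIMEGATE_PREFIX = "http://timetravel.mementoweb.org/timegate/"


def accept_datetime(moment: datetime) -> str:
    """Format a timezone-aware datetime as an RFC 1123 date in GMT."""
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


def timegate_request(original_uri: str, moment: datetime) -> Request:
    """Build (but do not send) a datetime-negotiation request for a URI."""
    return Request(
        TIMEGATE_PREFIX + original_uri,
        headers={"Accept-Datetime": accept_datetime(moment)},
        method="HEAD",
    )


request = timegate_request(
    "http://www.vogin.nl/",
    datetime(1999, 2, 8, 2, 12, 57, tzinfo=timezone.utc),
)
print(request.full_url)
# urllib stores header names capitalized as "Accept-datetime".
print(request.get_header("Accept-datetime"))
```

Sending such a request to a TimeGate would normally yield a redirect to the Memento closest to the requested datetime; that network step is left out here on purpose.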
13. Herbert Van de Sompel
vogin.nl in 1999
http://web.archive.bibalex.org/web/19990208021257/http://www.vogin.nl/
14. Herbert Van de Sompel
Memento for Chrome
http://bit.ly/memento-for-chome
15. Herbert Van de Sompel
Hyperlinks
Eric Sieverts (2017) https://vogin-ip-lezing.net/2017/01/17/linkrot-linkroest-en-webarchieven/
16. Herbert Van de Sompel
Hyperlinks in Theory
17. Herbert Van de Sompel
Hyperlinks in Reality
18. Herbert Van de Sompel
Hyperlinks in Reality
19. Herbert Van de Sompel
Link Rot
20. Herbert Van de Sompel
Link Rot
http://404-resto.com/typo3temp/pics/7580ea80fa.jpg
21. Herbert Van de Sompel
Hyperlinks in Reality
22. Herbert Van de Sompel
Content Drift
23. Herbert Van de Sompel
Content Drift
24. Herbert Van de Sompel
Content Drift
http://icecube.wisc.edu/ on May 8 2009 (left) and August 27 2009 (right)
25. Herbert Van de Sompel
Content Drift
http://dl00.org in 2000, 2004, 2005, 2008
26. Herbert Van de Sompel
No Content Drift
http://www.ifa.hawaii.edu/~cowie/k_table.html on June 9 1997 (left) and March 2016 (right)
27. Herbert Van de Sompel
The Web, All Hyperlinks Subject to Link Rot, Content Drift
28. Herbert Van de Sompel
The Web, All Hyperlinks Subject to Reference Rot
• Reference Rot hinders our ability to follow links as they were
intended when they were put in place:
• Link rot: A link stops working altogether
• Content drift: The linked content changes over time and may
eventually no longer be representative of the content that was
originally linked
29. Herbert Van de Sompel
Creating Pockets of Persistence
• How to maintain the integrity of links?
• This challenge exists for the entire web. Some communities with well
managed collections care about addressing it because they consider
it a Quality of Service issue:
• Scholarly communication
• Cultural heritage
• Legal publications
• Government communication
• Journalism
• Wikipedia
• …
• What can these communities do to create Pockets of Persistence?
30. Herbert Van de Sompel
A Managed Collection Desires Reliable Outlinks
31. Herbert Van de Sompel
Links to another Managed Collection
32. Herbert Van de Sompel
Links to Web at Large Resources
33. Herbert Van de Sompel
Exploring Link Rot & Content Drift
34. Herbert Van de Sompel
Preamble 2 - Hiberlink Study of Reference Rot in STM Articles
PMC articles published 1997-2012
  Total                                   479,194
  With links to articles                  240,857
  With links to web-at-large resources    156,160
Links in PMC articles
  To articles                             744,678
  To web-at-large resources               480,853
35. Herbert Van de Sompel
Number of Articles & Links - PMC
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0115253
36. Herbert Van de Sompel
Links to Articles & to Web At Large Resources - PMC
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0115253
37. Herbert Van de Sompel
Exploring Link Rot & Content Drift
38. Herbert Van de Sompel
Link Rot Occurs when B Moves to C
39. Herbert Van de Sompel
Introduce PID(B)
40. Herbert Van de Sompel
Link to PID(B); HTTP Redirect from PID(B) to B
41. Herbert Van de Sompel
When B moves to C: HTTP Redirect from PID(B) to C
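The PID mechanism can be sketched as a lookup table that answers every request for PID(B) with an HTTP redirect to the current location. The class name and all example.org URIs are hypothetical; real resolvers (such as the DOI resolver) are far more elaborate.

```python
from typing import Dict, Tuple


class PidResolver:
    """Toy resolver: maps a persistent identifier to its current location."""

    def __init__(self) -> None:
        self._locations: Dict[str, str] = {}

    def register(self, pid: str, location: str) -> None:
        """Create or update the location that a PID redirects to."""
        self._locations[pid] = location

    def resolve(self, pid: str) -> Tuple[int, Dict[str, str]]:
        """Return an HTTP status code and response headers for a PID request."""
        location = self._locations.get(pid)
        if location is None:
            return 404, {}
        return 302, {"Location": location}


resolver = PidResolver()
# Hypothetical URIs: links always point at the PID, never at B or C directly.
resolver.register("https://pid.example.org/B", "http://old.example.org/B")
print(resolver.resolve("https://pid.example.org/B"))

# When B moves to C, only the resolver entry changes; existing links keep working.
resolver.register("https://pid.example.org/B", "https://new.example.org/C")
print(resolver.resolve("https://pid.example.org/B"))
```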
42. Herbert Van de Sompel
Herbert Van de Sompel, Martin Klein, and Shawn Jones (2016) Persistent URIs Must Be Used
to Be Persistent. In: WWW2016. http://arxiv.org/1602.09102
Core assumption in the PID solution:
PIDs will be used to establish links.
But are they?
43. Herbert Van de Sompel
• When classifying links extracted from PMC as linking to articles, we
assumed that filtering on http://dx.doi.org/* would do the trick
• But we found many links such as http://link.springer.com/article/*
• For example:
• http://link.springer.com/article/10.1007%2Fs00799-014-0108-0
• Instead of:
• http://dx.doi.org/10.1007/s00799-014-0108-0
• We used CrossRef’s Reverse Domain Lookup to classify these
extracted links as linking to articles
A Disconcerting Observation
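The classification problem can be illustrated with a small sketch: filtering on the DOI resolver host alone misses article links that use the publisher's own URI. The set of publisher hosts below is a hypothetical stand-in for the CrossRef lookup mentioned above, and the category labels are invented for this example.

```python
from urllib.parse import unquote, urlsplit

DOI_RESOLVER_HOSTS = {"dx.doi.org", "doi.org"}
# Hypothetical stand-in for a domain lookup service; not real lookup data.
KNOWN_PUBLISHER_HOSTS = {"link.springer.com"}


def classify_link(uri: str) -> str:
    """Classify a URI reference by the host it points to."""
    host = (urlsplit(uri).hostname or "").lower()
    if host in DOI_RESOLVER_HOSTS:
        return "article (PID link)"
    if host in KNOWN_PUBLISHER_HOSTS:
        return "article (publisher URI)"
    return "web at large"


pid_link = "http://dx.doi.org/10.1007/s00799-014-0108-0"
publisher_link = "http://link.springer.com/article/10.1007%2Fs00799-014-0108-0"

print(classify_link(pid_link))
print(classify_link(publisher_link))
# Percent-decoding the publisher path reveals the DOI that could have been used.
print(unquote(urlsplit(publisher_link).path))
```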
44. Herbert Van de Sompel
URI References - PMC
Herbert Van de Sompel, Martin Klein, and Shawn Jones (2016) Persistent URIs Must Be Used to Be Persistent.
In: WWW2016. http://arxiv.org/1602.09102
45. Herbert Van de Sompel
Cartoon by Patrick Hochstenbach
A Proposal to Get PIDs Used: Signposting
http://signposting.org
46. Herbert Van de Sompel
Signposting: HTTP Link with identifier Relation Type
http://signposting.org/identifier/
47. Herbert Van de Sompel
Signposting: HTTP Link with identifier Relation Type
http://signposting.org/identifier/
48. Herbert Van de Sompel
Signposting: Use HTTP Link with identifier Relation Type
curl -I http://www.dlib.org/dlib/november15/vandesompel/11vandesompel.html
HTTP/1.1 200 OK
Date: Wed, 26 Oct 2016 12:36:37 GMT
Server: Apache/2.2.15 (CentOS)
Last-Modified: Thu, 19 Nov 2015 14:50:19 GMT
ETag: "205a5e-f5ef-524e5e0ab80c0"
Accept-Ranges: bytes
Content-Length: 62959
Content-Type: text/html; charset=UTF-8
Link: <http://doi.org/10.1045/november2015-vandesompel>; rel="identifier"
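A client such as a reference manager could pick up the PID from this response header. The sketch below extracts link targets for a given relation type from a `Link` header value; it is a simplified parser written for this illustration and does not cover the full Web Linking syntax.

```python
import re
from typing import List


def links_with_rel(link_header: str, relation: str) -> List[str]:
    """Return the targets in a Link header value that carry the given rel."""
    targets = []
    # Each link is "<target>" followed by its parameters, up to the next "<".
    for target, params in re.findall(r"<([^>]*)>([^<]*)", link_header):
        match = re.search(
            r'\brel\s*=\s*(?:"([^"]*)"|([^;,\s]+))', params, re.IGNORECASE
        )
        if not match:
            continue
        # A rel parameter may hold several space-separated relation types.
        relations = (match.group(1) or match.group(2) or "").lower().split()
        if relation.lower() in relations:
            targets.append(target)
    return targets


header_value = '<http://doi.org/10.1045/november2015-vandesompel>; rel="identifier"'
print(links_with_rel(header_value, "identifier"))
```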
49. Herbert Van de Sompel
PID Alternative - When B Moves to C: HTTP Redirect from B to C
50. Herbert Van de Sompel
PID Alternative - When B Moves to C: HTTP Redirect from B to C
• Custodian of C needs to hold on to domain of B
• Custodian of C needs to establish redirection patterns, often rather
simple rules
• No need to get links established to PID(B); the URI in the browser
address bar (initially B, later C) is just fine
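The "rather simple rules" could look like the following sketch, in which the custodian keeps serving the old domain and rewrites old URIs to new ones by pattern. The domains, paths, and rule are hypothetical examples; in practice such rules usually live in web server configuration rather than application code.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical rule: old paper pages moved to a new host and path layout.
REDIRECT_RULES: List[Tuple[re.Pattern, str]] = [
    (
        re.compile(r"^http://old\.example\.org/papers/(\d+)\.html$"),
        r"https://new.example.org/articles/\1",
    ),
]


def redirect_target(requested_uri: str) -> Optional[str]:
    """Return the new location for an old URI, or None if no rule applies."""
    for pattern, template in REDIRECT_RULES:
        match = pattern.match(requested_uri)
        if match:
            return match.expand(template)
    return None


print(redirect_target("http://old.example.org/papers/42.html"))
print(redirect_target("http://old.example.org/about.html"))
```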
51. Herbert Van de Sompel
Exploring Link Rot & Content Drift
52. Herbert Van de Sompel
Content Drift Occurs when B Changes over Time
53. Herbert Van de Sompel
Content Drift Occurs when B Changes over Time
• Was not really considered an issue because:
• the objects that receive PIDs were typically static, e.g. scientific
papers
• when a (substantially) new version of an object is published, a
new PID is assigned
• But:
• PID links (typically) lead to landing pages, not the identified
objects
• landing pages are increasingly rich and aggregate comments,
discussion, and annotations; they do change over time.
54. Herbert Van de Sompel
Content Drift Occurs when B Changes over Time
55. Herbert Van de Sompel
Custodian of B Takes Snapshots of B as it Evolves over Time
56. Herbert Van de Sompel
Custodian of B Ensures Snapshots of B as it Evolves over Time
• This does not happen for PID-identified objects, AFAIK
• Version Control Systems (e.g. Wikipedia) hold on to all versions;
snapshots are local.
• Pro-active archiving solutions for web servers that create snapshots
when e.g. new content is published/visited or at regular intervals:
• on-demand archiving of a web server, cf. archiefweb.eu,
archive-it.org
• self-archiving web server, cf. SiteStory
• How to access the snapshots of B? Memento!
57. Herbert Van de Sompel
SiteStory Transactional Archive & Memento
https://mementoweb.github.io/SiteStory/
58. Herbert Van de Sompel
SiteStory, Wikipedia, Web Archive, Memento in Action
http://lanlsource.lanl.gov/hello
59. Herbert Van de Sompel
Exploring Link Rot & Content Drift
60. Herbert Van de Sompel
Scholarly Context Not Found
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0115253
61. Herbert Van de Sompel
Link Rot - PMC
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0115253
62. Herbert Van de Sompel
Exploring Link Rot & Content Drift
63. Herbert Van de Sompel
Scholarly Context Adrift
Shawn Jones, Herbert Van de Sompel, et al. (2016) Scholarly context adrift. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0167475
64. Herbert Van de Sompel
How to Assess Content Drift?
Shawn Jones, Herbert Van de Sompel, et al. (2016) Scholarly context adrift. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0167475
65. Herbert Van de Sompel
Step 1: Find Pre/Post Mementos
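A sketch of how Pre/Post Mementos might be located, assuming that "Pre" means the latest snapshot strictly before a reference date (for example the publication date of the referencing article) and "Post" the earliest snapshot at or after it. That reading, the function name, and the sample dates are assumptions for illustration.

```python
from bisect import bisect_left
from datetime import datetime
from typing import List, Optional, Tuple


def pre_post_mementos(
    memento_datetimes: List[datetime], reference: datetime
) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Return (latest snapshot before reference, earliest snapshot at or after it)."""
    ordered = sorted(memento_datetimes)
    # Index of the first snapshot that is not earlier than the reference date.
    index = bisect_left(ordered, reference)
    pre = ordered[index - 1] if index > 0 else None
    post = ordered[index] if index < len(ordered) else None
    return pre, post


snapshots = [
    datetime(2009, 5, 8),
    datetime(2009, 8, 27),
    datetime(2010, 1, 15),
]
print(pre_post_mementos(snapshots, datetime(2009, 7, 1)))
```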
66. Herbert Van de Sompel
Step 2: Select Representative Mementos
67. Herbert Van de Sompel
Text Similarity Measures
• Compute aggregate text similarity scores (values between 0...100)
for:
• Simhash
• Jaccard
• Sørensen-Dice
• Cosine
• If the aggregate score is 100, we decide that the Pre/Post
Mementos are representative
• We find 313K URI references with representative Mementos
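Three of the listed measures can be sketched on word tokens as follows, scaled to 0...100. The tokenization, the handling of empty texts, and the rule that every measure must reach 100 are assumptions made for this illustration; Simhash is omitted, and the study's actual text extraction and aggregation may differ.

```python
import math
import re
from collections import Counter
from typing import List


def tokens(text: str) -> List[str]:
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"\w+", text.lower())


def jaccard(a: List[str], b: List[str]) -> float:
    """Shared distinct tokens divided by all distinct tokens, times 100."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 100.0
    return 100.0 * len(sa & sb) / len(sa | sb)


def sorensen_dice(a: List[str], b: List[str]) -> float:
    """Twice the shared distinct tokens divided by the sum of set sizes, times 100."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 100.0
    return 100.0 * 2 * len(sa & sb) / (len(sa) + len(sb))


def cosine(a: List[str], b: List[str]) -> float:
    """Cosine of the angle between term-frequency vectors, times 100."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values()) * sum(v * v for v in cb.values()))
    return 100.0 * dot / norm if norm else 0.0


def is_representative(pre_text: str, post_text: str) -> bool:
    """Assumed rule: Pre and Post texts must score 100 on every measure."""
    a, b = tokens(pre_text), tokens(post_text)
    return all(m(a, b) >= 100.0 - 1e-9 for m in (jaccard, sorensen_dice, cosine))


print(jaccard(tokens("alpha beta gamma"), tokens("beta gamma delta")))
print(is_representative("Same text.", "same text"))
```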
68. Herbert Van de Sompel
URI References without Representative Mementos - PMC
Shawn Jones, Herbert Van de Sompel, et al. (2016) Scholarly context adrift. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0167475
69. Herbert Van de Sompel
Step 3: Dereference Live Web Version of URI
70. Herbert Van de Sompel
Step 4: Representative Memento vs. Live Version
71. Herbert Van de Sompel
Content Drift - PMC
Shawn Jones, Herbert Van de Sompel, et al. (2016) Scholarly context adrift. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0167475
72. Herbert Van de Sompel
Exploring Link Rot & Content Drift
73. Herbert Van de Sompel
Uncertainty Regarding the Future of B when A Links to It
74. Herbert Van de Sompel
Custodian of A Takes a Snapshot of B when Linking to It
75. Herbert Van de Sompel
Taking a Snapshot of B: Automation is Key
• Web archive APIs for on-demand archiving
• perma.cc, Internet Archive, archive.is, webcitation
• Amber for WordPress & Drupal archives resources linked in a page
• http://amberlink.org/
• Hiberlink’s experimental Zotero extension archives bookmarked
URLs
• http://hiberlink.org/zotero.html
• Hiberlink’s experimental HiberActive archives all URLs referenced in
a newly submitted paper
• https://www.slideshare.net/martinklein0815/hiberactive
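The kind of automation these tools provide can be sketched as: collect the outlinks of a page and prepare one on-demand archiving request per link. The sketch only builds request URLs and sends nothing. The save endpoint prefix is an assumption about the Internet Archive's on-demand capture service, and the class and function names are invented for this example; this is not how Amber or the Hiberlink tools are implemented.

```python
from html.parser import HTMLParser
from typing import List

# Assumption: prefix of the Internet Archive's on-demand capture service.
SAVE_ENDPOINT = "https://web.archive.org/save/"


class LinkCollector(HTMLParser):
    """Collects absolute http(s) link targets from anchor elements."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.append(value)


def archive_request_urls(page_html: str) -> List[str]:
    """Return one archiving request URL per outlink found in the page."""
    collector = LinkCollector()
    collector.feed(page_html)
    return [SAVE_ENDPOINT + link for link in collector.links]


page = '<p><a href="http://www.vogin.nl/">VOGIN</a> <a href="#top">top</a></p>'
print(archive_request_urls(page))
```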
76. Herbert Van de Sompel
Linking to Snapshot of B = Potentially Creating a Rotten Link
• Existing practice for linking to snapshots:
<a href="URL of snapshot of B">
• Problems with existing practice:
o Impossible to visit the original URI, if desired
o Requires the permanent existence/uptime of the archive that
holds the snapshot
- One link rot problem replaced by another
http://robustlinks.mementoweb.org/about/
77. Herbert Van de Sompel
Permanent Existence/Uptime of Archives?
Capture of http://webcitation.org dated July 17 2013
https://archive.today/eAETp
78. Herbert Van de Sompel
Permanent Existence/Uptime of Archives?
Remnant of discontinued web archive http://mummify.it captured on February 14 2014
https://web.archive.org/web/20140214233752/https://www.mummify.it/
79. Herbert Van de Sompel
Permanent Existence/Uptime of Archives?
http://www.themoscowtimes.com/news/article/russia-bans-wayback-machine-internet-archive-over-islamic-state-video/510074.html
80. Herbert Van de Sompel
Permanent Existence/Uptime of Archives?
http://web.archive.org/web/20121101043952/http://vogin.nl on March 6 2017 at 15:59 CET
81. Herbert Van de Sompel
Link to Snapshot of B and Decorate the Link
• Desired practice for linking to captures is to decorate the link so it
provides a variety of options:
<a href="URL of snapshot of B"
data-originalurl="B"
data-versiondate="datetime of snapshot of B">
• Supports:
o Revisiting the original URL
o Finding snapshots in any web archive (original URL)
o Finding a temporally appropriate snapshot in any web archive
(original URL & snapshot datetime)
o Automatically accessing a temporally appropriate snapshot in
any web archive (Memento, original URL & snapshot datetime)
http://robustlinks.mementoweb.org/spec/
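Generating a decorated link is straightforward to automate. The sketch below assembles the anchor element from the three pieces of information named above. The function name is hypothetical, and the date format in the example (YYYY-MM-DD) is an assumption; the specification linked above is the authority on permitted values.

```python
from html import escape


def robust_link(snapshot_url: str, original_url: str, version_date: str, text: str) -> str:
    """Build an anchor element decorated with the original URL and snapshot date."""
    return '<a href="{}" data-originalurl="{}" data-versiondate="{}">{}</a>'.format(
        escape(snapshot_url, quote=True),
        escape(original_url, quote=True),
        escape(version_date, quote=True),
        escape(text),
    )


print(
    robust_link(
        "http://web.archive.bibalex.org/web/19990208021257/http://www.vogin.nl/",
        "http://www.vogin.nl/",
        "1999-02-08",
        "vogin.nl in 1999",
    )
)
```

Because the original URL and the snapshot datetime travel with the link, a reader or script can still look for another snapshot elsewhere if the archive in `href` disappears.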
82. Herbert Van de Sompel
Robust Links: Link Decoration in Action
Van de Sompel H. & Nelson, M.L. (2015) Reminiscing about 15 years of interoperability efforts. In:
D-Lib Magazine. https://doi.org/10.1045/november2015-vandesompel
JavaScript makes the
link decorations actionable