1. Tools for Managing the Past Web
Dr. Michele C. Weigle
Web Sciences and Digital Libraries (WS-DL) Group
Department of Computer Science
Old Dominion University
ODU - ECE Seminar
February 20, 2015
5. But webpages can disappear
• Average lifespan of a webpage: 50-100 days
• A year after publication, about 11% of content shared on social media will be gone.
SalahEldeen and Nelson, "Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?", TPDL 2012
http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
7. Why archives matter
• Malaysia Airlines Flight 17 (MH17)
• Ukrainian separatists originally took credit for downing a transport plane in that location
• Later deleted the post
• Internet Archive had archived the post before deletion
http://www.csmonitor.com/World/Europe/2014/0717/Web-evidence-points-to-pro-Russia-rebels-in-downing-of-MH17-video
8. Web archiving in the news - 2015
http://www.newyorker.com/magazine/2015/01/26/cobweb
9. But Wayback is not Google
• Wayback Machine has no full-text search
– too big to be indexed
– 452 billion web pages, 9 petabytes of data
– growing at 20 TB/week
• Enter URL and pick a date
"It’s more like a phone book than like an archive."
-Jill Lepore, The New Yorker
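The "enter URL and pick a date" lookup corresponds to the Wayback Machine's URL scheme, in which a 14-digit timestamp precedes the original URL. A minimal sketch, assuming the public web.archive.org endpoint; `wayback_url` is an illustrative helper name, not part of any tool named in this talk.

```python
from datetime import datetime

# Assumption: the public Wayback Machine endpoint.
WAYBACK_PREFIX = "http://web.archive.org/web"


def wayback_url(original_url: str, when: datetime) -> str:
    """Build a Wayback-style URL asking for the capture nearest `when`."""
    timestamp = when.strftime("%Y%m%d%H%M%S")  # 14 digits: YYYYMMDDhhmmss
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


if __name__ == "__main__":
    print(wayback_url("http://www.usps.com/", datetime(2015, 2, 20)))
```

Wayback redirects such a request to the capture closest to the requested timestamp, which is why a lookup needs both a URL and a date rather than search terms.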
11. How can I access the archives?
MementoFox: http://ws-dl.blogspot.com/2010/03/2010-03-19-mementofox-add-on-released.html
Memento for Chrome: http://ws-dl.blogspot.com/2013/10/2013-10-14-right-click-to-past-memento.html
Mink: http://ws-dl.blogspot.com/2014/10/2014-10-03-integrating-live-and.html
http://www.mementoweb.org
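These browser tools build on the Memento protocol (RFC 7089), in which a client asks a TimeGate for the capture of a URI nearest a desired date by sending an Accept-Datetime request header. A minimal sketch that only prepares such a request and performs no network access; the aggregator TimeGate prefix is an assumption, and `timegate_request` is a hypothetical helper name.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumption: a Memento aggregator TimeGate prefix. Substitute the TimeGate
# of whichever archive or aggregator you actually use.
TIMEGATE_PREFIX = "http://timetravel.mementoweb.org/timegate/"


def timegate_request(uri: str, when: datetime):
    """Return (url, headers) for a Memento datetime-negotiation request.

    `when` must be timezone-aware; it is converted to GMT because
    Accept-Datetime uses the HTTP date format.
    """
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return TIMEGATE_PREFIX + uri, {"Accept-Datetime": http_date}


if __name__ == "__main__":
    url, headers = timegate_request(
        "http://www.usps.com/", datetime(2014, 10, 1, tzinfo=timezone.utc)
    )
    print(url)
    print(headers)
```

A TimeGate normally answers with a redirect to the selected memento, whose response carries a Memento-Datetime header stating when that capture was made.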
13. ODU WS-DL Projects
Tools for Managing the Past Web
• Archive Quality
• Tweet Intention
• TimeMap Summaries
• Archive What I See Now
• Storytelling for Archives
15. The State of Web Archiving
current: "Hooray! It's in the archive!"
vs.
future: "How well was it archived?"
20. How damaged are these mementos?
M = 0.17, D = 0.09 (live web)
M = 0.24, D = 0.41 (missing main)
M = 0.29, D = 0.36 (missing logo + navigation)
Brunelle, Kelly, SalahEldeen, Weigle, and Nelson, "Not All Mementos Are Created Equal: Measuring the Impact of Missing Resources", JCDL 2014, Best Student Paper
21. How to detect damage?
vs.
Brunelle et al., JCDL 2014
22. Good News: Although M is steady/increasing, D is decreasing
M = percentage of embedded resources missing
D = our damage metric
Sampled 45,000 mementos
- one memento/year of ~1850 webpages
- webpages from Bitly URIs shared over Twitter and Archive-It collections
Brunelle et al., JCDL 2014
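M and D can diverge because M counts every missing resource equally, while a damage measure can weight resources by how much they matter to the page. A minimal sketch of that contrast, assuming each embedded resource carries a hypothetical importance weight; the `Resource` type and the weights are illustrative only, and this is not the damage formula from Brunelle et al.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Resource:
    uri: str
    weight: float   # hypothetical importance (e.g. larger for main content)
    missing: bool


def percent_missing(resources: List[Resource]) -> float:
    """M-style measure: fraction of embedded resources that are missing."""
    if not resources:
        return 0.0
    return sum(r.missing for r in resources) / len(resources)


def weighted_damage(resources: List[Resource]) -> float:
    """D-style measure: fraction of total importance that is missing."""
    total = sum(r.weight for r in resources)
    if total == 0:
        return 0.0
    return sum(r.weight for r in resources if r.missing) / total


if __name__ == "__main__":
    page = [
        Resource("main-image.jpg", weight=6.0, missing=True),
        Resource("logo.png", weight=2.0, missing=False),
        Resource("style.css", weight=1.0, missing=False),
        Resource("tracker.gif", weight=1.0, missing=False),
    ]
    print(percent_missing(page), weighted_damage(page))
```

In this toy page only one of four resources is missing, yet it carries most of the assumed importance, so the weighted measure is much higher than the raw percentage.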
23. Using JavaScript can result in damaged mementos
JavaScript is responsible for an increasing proportion of missing embedded resources over time.
Brunelle, Kelly, Weigle and Nelson, "The Impact of JavaScript on Archivability," International Journal of Digital Libraries (IJDL), 2015
25. Different parts of a page can be crawled at different times
Ainsworth and Nelson, "Evaluating Sliding and Sticky Target Policies by Measuring Temporal Drift in Acyclic Walks Through a Web Archive", JCDL 2013
27. Which page did Chris Hayes mean to tweet?
Tweet on Oct 3, 2014
Likely target (captured Oct 1, 2014)
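One simple way to approximate the page an author saw when sharing a link is to pick the capture closest in time to the tweet. A minimal sketch, assuming the capture datetimes are already known; it illustrates nearest-date selection only and is not the temporal intention model from the cited work. The dates in the usage example are illustrative.

```python
from datetime import datetime
from typing import Iterable


def closest_capture(captures: Iterable[datetime], shared_at: datetime) -> datetime:
    """Return the capture datetime nearest to when the link was shared."""
    return min(captures, key=lambda captured: abs(captured - shared_at))


if __name__ == "__main__":
    captures = [
        datetime(2014, 10, 1),
        datetime(2014, 10, 9),
        datetime(2014, 10, 10),
        datetime(2014, 11, 19),
    ]
    print(closest_capture(captures, datetime(2014, 10, 3)))
```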
28. What you see depends on when you click
Oct 9, 2014
Oct 10, 2014
Nov 19-Dec 15, 2014
Today (Feb 2015) – now fergusonaction.com
29. Mapping Tweet Relevance
SalahEldeen and Nelson, "Reading the Correct History? Modeling Temporal Intention in Resource Sharing", JCDL 2013
30. Let the reader choose live or archived
33. What did usps.com look like?
http://whatdiditlooklike.mementoweb.org/
Animated GIF
1st memento of each year
Submit a URL via Twitter: “#whatdiditlooklike URL”
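Choosing the first memento of each year is a small grouping step once the capture datetimes are known. A minimal sketch, assuming the datetimes have already been extracted from a TimeMap; `first_memento_per_year` is an illustrative name and this is not the service's actual code.

```python
from datetime import datetime
from typing import Iterable, List


def first_memento_per_year(memento_datetimes: Iterable[datetime]) -> List[datetime]:
    """Return the earliest capture datetime for each calendar year, in order."""
    first_by_year = {}
    for captured in sorted(memento_datetimes):
        # setdefault keeps the first (earliest) datetime seen for a year.
        first_by_year.setdefault(captured.year, captured)
    return [first_by_year[year] for year in sorted(first_by_year)]


if __name__ == "__main__":
    captures = [
        datetime(2013, 6, 1),
        datetime(2012, 3, 4),
        datetime(2013, 1, 15),
        datetime(2012, 11, 30),
    ]
    for captured in first_memento_per_year(captures):
        print(captured.isoformat())
```

Each selected capture would then be rendered to an image and the frames combined into the animation.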
34. Which tells you more about the past of www.apple.com?
700 thumbnails (not even all of them!)
32 sampled thumbnails
AlSum and Nelson, "Thumbnail Summarization Techniques for Web Archives", ECIR 2014
35. TimeMap Thumbnail Summaries
• Compare HTML, not images
• Compute SimHash of HTML
– result is a string representing the content of the page
• Calculate Hamming distance between SimHashes of consecutive mementos
• Generate thumbnails of mementos that have at least a 4 character difference in SimHash
– threshold too low -> near duplicate images
– threshold too high -> miss important changes
3 lines of difference
AlSum and Nelson, "Thumbnail Summarization Techniques for Web Archives", ECIR 2014
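The selection steps above can be sketched as follows. This is a simplified illustration under stated assumptions: whitespace tokenization, MD5 as the per-token hash, a 64-bit fingerprint rendered as 16 hex characters, and a character-level Hamming distance. None of these choices are confirmed by the slides, and the function names are hypothetical.

```python
import hashlib
from typing import List

BITS = 64  # assumed fingerprint width


def simhash(text: str) -> str:
    """Return a 16-character hex SimHash of whitespace-separated tokens."""
    counts = [0] * BITS
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: BITS // 8], "big")
        for bit in range(BITS):
            counts[bit] += 1 if (token_hash >> bit) & 1 else -1
    value = sum(1 << bit for bit, count in enumerate(counts) if count > 0)
    return format(value, "016x")


def hamming_distance(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have equal length")
    return sum(x != y for x, y in zip(a, b))


def select_for_thumbnails(fingerprints: List[str], threshold: int = 4) -> List[int]:
    """Return indexes of mementos to thumbnail.

    The first memento is always kept; a later one is kept when its
    fingerprint differs from the previous memento's by >= threshold.
    """
    selected = [0] if fingerprints else []
    for i in range(1, len(fingerprints)):
        if hamming_distance(fingerprints[i - 1], fingerprints[i]) >= threshold:
            selected.append(i)
    return selected


if __name__ == "__main__":
    pages = ["<html>old home page</html>", "<html>old home page</html>",
             "<html>completely redesigned storefront</html>"]
    prints = [simhash(p) for p in pages]
    print(prints, select_for_thumbnails(prints))
```

Raising `threshold` yields fewer thumbnails at the risk of skipping real changes, while lowering it produces near-duplicate images, which is the trade-off the slide describes.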
40. Archive What I See Now
• Humanities researchers know they should archive web resources
• Standard web archiving tools are difficult for non-IT experts
"Archive What I See Now", NEH Digital Humanities Implementation Grant, 2014-2017, http://bit.ly/odu-dhig-2014
41. Why not just take a screenshot or “save as”?
Can't interact with a screenshot
"Save Page As..." output is difficult to keep organized, especially with multiple captures over time
42. What about archiving pages behind authentication or that change quickly?
Facebook - requires login
Twitter - changes faster than typical crawling rate
43. How we're addressing the problem
WARCreate
• Google Chrome extension
• Archive the current state of the page in standard Web Archive (WARC) format
• Compatible with Wayback
Kelly and Weigle, "WARCreate - Create Wayback-Consumable WARC Files from Any Webpage", JCDL 2012
Kelly, Weigle, and Nelson, "WARCreate - Create Wayback-Consumable WARC Files from Any Webpage", Digital Preservation 2012, Tools Demo Session
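To make the WARC format concrete: a WARC file is a sequence of records, each with a block of named header fields, a blank line, the payload, and two trailing CRLFs. A minimal sketch that serializes one `response` record, assuming WARC/1.0 field names; it is not WARCreate's code (which is a browser extension), and `warc_response_record` is a hypothetical helper.

```python
import uuid
from datetime import datetime, timezone


def warc_response_record(target_uri: str, http_response: bytes,
                         captured_at: datetime) -> bytes:
    """Serialize a single WARC/1.0 'response' record.

    `http_response` is the raw HTTP response (status line, headers, body);
    `captured_at` must be timezone-aware.
    """
    warc_date = captured_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",
        "WARC-Type: response",
        "WARC-Target-URI: " + target_uri,
        "WARC-Date: " + warc_date,
        "WARC-Record-ID: <urn:uuid:" + str(uuid.uuid4()) + ">",
        "Content-Type: application/http; msgtype=response",
        "Content-Length: " + str(len(http_response)),
    ]
    header = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    return header + http_response + b"\r\n\r\n"


if __name__ == "__main__":
    body = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
    record = warc_response_record(
        "http://example.com/", body, datetime(2015, 2, 20, tzinfo=timezone.utc)
    )
    print(record.decode("utf-8"))
```

A complete file usually begins with a `warcinfo` record and is often gzip-compressed per record; both are omitted here for brevity.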
44. WARCreate - Work in Progress
• New modes of operation
– record mode
• while activated, add capture of each page visited to the WARC
– countdown mode
• every interval, refresh and add new capture of page
– event mode
• add new capture of page every time it dynamically reloads or refreshes
45. What to do with created WARCs?
WAIL
• Load created WARCs into a Wayback instance on your local computer
• Single-click install of Wayback (and other archiving tools)
• Available for Windows, OS X
Kelly, Weigle, and Nelson, "Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving," Personal Digital Archiving 2013, Poster Session
Kelly, Nelson, and Weigle, "WARCreate and WAIL: WARC, Wayback and Heritrix Made Easy," Digital Preservation 2013
46. Bridging the gap between the past web and the live web
Mink
• Google Chrome extension
• For each page you visit, displays the number of archived versions available
• Provides access by date
• Allows for submission to public archiving services
Kelly, Nelson, and Weigle, "Mink: Integrating the Live and Archived Web Viewing Experience Using Web Browsers and Memento," poster, ACM/IEEE Digital Libraries (DL), September 2014.
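Counting the archived versions of a page amounts to counting the memento entries in its Memento TimeMap, which RFC 7089 serializes in link format. A minimal sketch that parses such text with regular expressions; the sample entries and archive host are made up for illustration, and this is not Mink's implementation.

```python
import re

# One link-format entry: <URI> followed by ;name="value" parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def count_mementos(timemap_text: str) -> int:
    """Count entries whose rel value includes the token 'memento'."""
    count = 0
    for _uri, params in LINK_RE.findall(timemap_text):
        attrs = dict(PARAM_RE.findall(params))
        if "memento" in attrs.get("rel", "").split():
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical TimeMap for illustration only.
    sample = (
        '<http://example.com/>; rel="original",\n'
        '<http://archive.example.org/timegate/http://example.com/>; rel="timegate",\n'
        '<http://archive.example.org/web/20000620180259/http://example.com/>; '
        'rel="first memento"; datetime="Tue, 20 Jun 2000 18:02:59 GMT",\n'
        '<http://archive.example.org/web/20091027204954/http://example.com/>; '
        'rel="last memento"; datetime="Tue, 27 Oct 2009 20:49:54 GMT"'
    )
    print(count_mementos(sample))
```

Splitting the `rel` value into tokens keeps entries such as `timegate` or `original` out of the count while still matching `first memento` and `last memento`.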
53. Storytelling For Archives
Archived collections
Storytelling services
Archived enriched stories
AlNoamany, "Using Web Archives to Enrich the Live Web Experience Through Storytelling", TCDL Bulletin, December 2013.
54. Tools for Storytelling
• Tools for Users
– use existing tools like Storify to view the stories of a collection
• Tools for Curators
– use existing stories to augment your collections
– create stories from your collections
• candidate mementos automatically selected
55. Story Types
Fixed Page – Fixed Time: differences in GeoIP, mobile, etc.
Fixed Page – Sliding Time: evolution of a single page (or domain) through time
Sliding Page – Fixed Time: different perspectives on a point in time
Sliding Page – Sliding Time: broadest possible coverage of a collection
Axes: URI (same vs. different), Time (same vs. different)
Issues: topic modeling, eliminating duplicates, maximizing novelty, structural & content quality
57. Web Sciences and Digital Libraries Group (WS-DL)
Faculty
• Dr. Michael L. Nelson
• Dr. Michele C. Weigle
PhD Students
• Scott Ainsworth
• Sawood Alam
• Lulwah Alkwai
• Yasmin AlNoamany
• Mohamed Aturban
• Justin Brunelle
• Mat Kelly
• Corren McCoy
• Shawn Jones
• Amara Naas
• Louis Nguyen
• Alexander Nwala
• Hany SalahEldeen
@WebSciDL
http://ws-dl.cs.odu.edu/
http://ws-dl.blogspot.com/
Dr. Michele C. Weigle
mweigle@cs.odu.edu
@weiglemc
http://www.cs.odu.edu/~mweigle/