1. Intro to Web Archiving
Dr. Michele C. Weigle, @weiglemc
Web Science and Digital Libraries (WS-DL) Group, @WebSciDL
Department of Computer Science
Old Dominion University
June 26, 2018
ODU Machine Learning and Data Sciences Camp
2. @weiglemc, @WebSciDL
ODU WS-DL Group
• Web Science and Digital Libraries
– digital preservation
– web archiving
– web science (social media analysis, web usage analysis)
• Our recent work has been featured in the popular press
@WebSciDL
http://ws-dl.cs.odu.edu/
http://ws-dl.blogspot.com/
3. @weiglemc, @WebSciDL
ODU WS-DL Group
PhD Students
• Scott Ainsworth
• Sawood Alam
• Lulwah Alkwai
• Mohamed Aturban
• Brian Griffin
• Hussam Hallak
• Shawn Jones
• Mat Kelly
• Corren McCoy
• Louis Nguyen
• Alexander Nwala
MS Students
• Nauman Siddique
• Miranda Smith
Faculty
• Dr. Michael L. Nelson
• Dr. Michele C. Weigle
Coming in Fall 2018!
• Dr. Sampath Jayarathna
• Dr. Jian Wu
6. @weiglemc, @WebSciDL
But webpages can disappear
• Average lifespan of a webpage: 50-100 days
• A year after publication, about 11% of content shared on social media will be gone.
SalahEldeen and Nelson, "Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?", TPDL 2012
http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
8. @weiglemc, @WebSciDL
Why archives matter
• Malaysia Airlines Flight 17 (MH17)
• Ukrainian separatists originally took credit for downing a transport plane in that location
• Later deleted the post
• Internet Archive had archived the post before deletion
http://www.csmonitor.com/World/Europe/2014/0717/Web-evidence-points-to-pro-Russia-rebels-in-downing-of-MH17-video
9. @weiglemc, @WebSciDL
We can use archives to tell stories
similar to our Hurricane Katrina example: https://www.slideshare.net/phonedude/why-careaboutthepast
https://www.nytimes.com/2016/11/17/insider/in-13-headlines-the-drama-of-election-night.html
12. @weiglemc, @WebSciDL
Internet Archive's Wayback Machine has gone mainstream
"God bless you Internet Archive"
- Rachel Maddow, Dec 12, 2016
Last Week Tonight, Mar 18, 2018
Jill Lepore, "The Cobweb", The New Yorker, Jan 26, 2015
13. @weiglemc, @WebSciDL
But Wayback is not Google
• Wayback Machine has no full-text search
– too big to be indexed
– 654 billion web pages, 9 petabytes of data
– growing at 20 TB/week
• Enter URL and pick a date
"It's more like a phone book than like an archive."
- Jill Lepore, The New Yorker
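"Enter URL and pick a date" maps directly onto the shape of a Wayback Machine address: a 14-digit timestamp followed by the original URL. The sketch below assumes the `https://web.archive.org/web/<YYYYMMDDhhmmss>/<URL>` pattern seen in archived links; the function names `wayback_uri` and `parse_wayback_uri` are made up for illustration and are not part of any Internet Archive library.

```python
from datetime import datetime

WAYBACK_PREFIX = "https://web.archive.org/web/"  # prefix seen in Wayback Machine links
TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"                # 14 digits: YYYYMMDDhhmmss


def wayback_uri(url, when):
    """Build a Wayback Machine address for `url` at the datetime `when`."""
    return WAYBACK_PREFIX + when.strftime(TIMESTAMP_FORMAT) + "/" + url


def parse_wayback_uri(uri):
    """Split a Wayback Machine address into (capture datetime, original URL)."""
    if not uri.startswith(WAYBACK_PREFIX):
        raise ValueError("not a Wayback Machine address: " + uri)
    # Everything up to the first slash is the timestamp; the rest is the URL.
    timestamp, _, original = uri[len(WAYBACK_PREFIX):].partition("/")
    return datetime.strptime(timestamp, TIMESTAMP_FORMAT), original


if __name__ == "__main__":
    address = wayback_uri("http://ws-dl.cs.odu.edu/", datetime(2018, 6, 26, 12, 0, 0))
    print(address)
    print(parse_wayback_uri(address))
```

The requested timestamp does not have to match a capture exactly; in practice the Wayback Machine usually redirects to the capture closest to it.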
14. @weiglemc, @WebSciDL
What do people think the Wayback Machine is?
https://www.politico.com/story/2018/04/25/joy-reid-anti-gay-posts-550213
15. @weiglemc, @WebSciDL
What do people think the Wayback Machine is?
https://www.cnn.com/2018/02/16/politics/richard-pinedo-guilty-plea/index.html
https://www.politico.com/story/2018/04/25/joy-reid-anti-gay-posts-550213
https://web.archive.org/web/20180115103952/https:/auctionessistance.com/
16. @weiglemc, @WebSciDL
Caches are not archives
http://ws-dl.blogspot.com/2018/01/2018-01-02-link-to-web-archives-not.html
http://www.wired.co.uk/article/russia-propaganda-online-blog-longform-medium-posts
https://webcache.googleusercontent.com/search?q=cache:qwqnGPqC2vsJ:https://medium.com/%40TheFoundingSon/huffington-post-vs-whiteness-and-white-women-1e67193085d4+&cd=15&hl=en&ct=clnk&gl=uk
17. @weiglemc, @WebSciDL
Is it really that important to archive instead of just taking a screenshot?
https://twitter.com/AngryBlackLady/status/990032514080108544
https://twitter.com/phonedude_mln/status/990070331737100288
18. @weiglemc, @WebSciDL
We should be doing both
https://twitter.com/conspirator0/status/1000475042017366017
19. @weiglemc, @WebSciDL
"If you see something, save something"
https://blog.archive.org/2017/01/25/see-something-save-something/
20. @weiglemc, @WebSciDL
There's more than just the Internet Archive
http://timetravel.mementoweb.org/list/20020908180610/http://blog.reidreport.com/
22. @weiglemc, @WebSciDL
Pro tip: submit pages to multiple archives
https://twitter.com/phonedude_mln/status/998948823845261312
23. @weiglemc, @WebSciDL
We've built tools to help people submit webpages to multiple archives
• Mink – Google Chrome extension
• #icanhazmemento – Twitter bot
• ArchiveNow – Python module, Docker container, local web service
24. @weiglemc, @WebSciDL
Mink
Google Chrome extension
Submit currently viewed webpage to public archives
https://github.com/machawk1/Mink
Mat Kelly, Michael L. Nelson and Michele C. Weigle, "Mink: Integrating the Live and Archived Web Viewing Experience Using Web Browsers and Memento," JCDL 2014, poster.
http://ws-dl.blogspot.com/2014/10/2014-10-03-integrating-live-and.html
"Archive What I See Now: Bringing Institutional Web Archiving Tools to the Individual Researcher", 2014-2017, HK-50181-14
25. @weiglemc, @WebSciDL
#icanhazmemento
Twitter bot
Include #icanhazmemento in a tweet with a URL
Bot replies with a link to the memento of the page closest to the time of the tweet
If page not archived, bot submits URL to multiple public archives, replies with a link to the memento in Time Travel
Alexander Nwala, "2015-07-22: I Can Haz Memento," http://ws-dl.blogspot.com/2015/07/2015-07-22-i-can-haz-memento.html
https://github.com/anwala/icanhazmemento
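The "memento closest to the time of the tweet" step can be sketched in a few lines. This is only an illustration of the selection logic, not the bot's actual code: the function name `closest_memento`, the input shape (a list of capture-time/URI pairs), and the example URIs are all assumptions.

```python
from datetime import datetime


def closest_memento(mementos, target):
    """Return the (capture datetime, memento URI) pair nearest in time to `target`.

    Returns None when `mementos` is empty, which corresponds to the case where
    the page has not been archived yet and would need to be submitted.
    """
    if not mementos:
        return None
    # Compare absolute time differences, so captures before and after count equally.
    return min(mementos, key=lambda m: abs((m[0] - target).total_seconds()))


if __name__ == "__main__":
    # Hypothetical capture list for one page (example.org URIs are placeholders).
    captures = [
        (datetime(2015, 1, 1), "https://archive.example.org/20150101/page"),
        (datetime(2015, 7, 20), "https://archive.example.org/20150720/page"),
        (datetime(2016, 1, 1), "https://archive.example.org/20160101/page"),
    ]
    tweet_time = datetime(2015, 7, 22)
    print(closest_memento(captures, tweet_time))
```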
26. @weiglemc, @WebSciDL
ArchiveNow
Python module, Docker container
Submit URI to multiple archives
https://github.com/oduwsdl/archivenow
Mohamed Aturban, Mat Kelly, Sawood Alam, John Berlin, Michael L. Nelson and Michele C. Weigle, "ArchiveNow: Simplified, Extensible, Multi-Archive Preservation," JCDL 2018, poster.
http://ws-dl.blogspot.com/2017/02/2017-02-22-archive-now-archivenow.html
"Towards a Web-Centric Approach for Capturing the Scholarly Record", 2016-2019
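As a rough sketch of using the Python module to submit one URI to several archives: the code below assumes the call form `archivenow.push(uri, archive_id)` shown in the project README, returning a list of strings, and assumes the identifiers `"ia"` (Internet Archive) and `"is"` (archive.is). Check the README for the current interface before relying on either. The wrapper `submit_to_archives` and the stand-in `fake_push` are hypothetical names added here so the example can run without network access.

```python
def submit_to_archives(uri, archive_ids=("ia", "is"), push=None):
    """Submit `uri` to each archive and collect whatever each push returns.

    `push` defaults to archivenow.push (assumed form: push(uri, archive_id)).
    Passing another callable allows a dry run without the package installed.
    """
    if push is None:
        from archivenow import archivenow  # assumes `pip install archivenow`
        push = archivenow.push
    results = {}
    for archive_id in archive_ids:
        try:
            results[archive_id] = push(uri, archive_id)
        except Exception as exc:  # one failing archive should not stop the rest
            results[archive_id] = ["error: " + str(exc)]
    return results


def fake_push(uri, archive_id):
    """Stand-in for archivenow.push that makes no network requests."""
    return ["https://example.org/" + archive_id + "/" + uri]


if __name__ == "__main__":
    # Dry run; drop the `push=` argument to call ArchiveNow itself.
    print(submit_to_archives("http://ws-dl.cs.odu.edu/", push=fake_push))
```

Submitting to each archive separately keeps one slow or unavailable archive from hiding the results of the others, which is the point of the "multiple archives" advice.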
27. @weiglemc, @WebSciDL
Memento: Time Travel for the Web
Access mementos in multiple web archives
Memento's core components:
• A bridge between present and past: link and content negotiation
• A bridge between past and present: link
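Both bridges can be illustrated with the standard library alone. In the sketch below, the first function builds (without sending) a datetime-negotiation request carrying an `Accept-Datetime` header, and the second pulls link relations such as `original` out of a `Link` header. The TimeGate address `http://timetravel.mementoweb.org/timegate/` is an assumption about the Time Travel service, the function names are made up, and the `Link` parser is deliberately simplified rather than a full implementation of the header grammar.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Assumed TimeGate endpoint of the Time Travel aggregator.
TIMEGATE_PREFIX = "http://timetravel.mementoweb.org/timegate/"
LINK_PATTERN = re.compile(r"<([^>]+)>\s*;([^<]*)")


def timegate_request(uri, when):
    """Build (but do not send) a datetime-negotiation request for `uri`."""
    if when.tzinfo is None:
        raise ValueError("`when` must be a timezone-aware datetime")
    # Accept-Datetime carries an HTTP date in GMT.
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return Request(TIMEGATE_PREFIX + uri,
                   headers={"Accept-Datetime": http_date}, method="HEAD")


def links_by_relation(link_header):
    """Group the targets of a Link header by relation type (simplified parser)."""
    relations = {}
    for target, params in LINK_PATTERN.findall(link_header):
        match = re.search(r'rel="([^"]*)"', params)
        if match:
            # One rel value may list several relation types, e.g. "first memento".
            for rel in match.group(1).split():
                relations.setdefault(rel, []).append(target)
    return relations


if __name__ == "__main__":
    req = timegate_request("http://ws-dl.cs.odu.edu/",
                           datetime(2018, 6, 26, tzinfo=timezone.utc))
    print(req.get_method(), req.full_url, req.header_items())
    # Hypothetical Link header; archive.example.net is a placeholder.
    header = ('<http://a.example.org/>; rel="original", '
              '<http://archive.example.net/1/http://a.example.org/>; rel="first memento"')
    print(links_by_relation(header))
```

Sending the request (for example with `urllib.request.urlopen`) is left out so the sketch runs offline; a TimeGate would be expected to answer with a redirect to a memento near the requested datetime.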
30. @weiglemc, @WebSciDL
How can I use Memento?
Memento for Chrome
http://ws-dl.blogspot.com/2013/10/2013-10-14-right-click-to-past-memento.html
Mink
http://ws-dl.blogspot.com/2014/10/2014-10-03-integrating-live-and.html
http://timetravel.mementoweb.org
41. @weiglemc, @WebSciDL
#whatdiditlooklike
Twitter bot
Include #whatdiditlooklike in a tweet with a URL
Bot generates animated GIF of first memento of each year
Bot replies with a link to entry in Tumblr
Tumblr: http://whatdiditlooklike.mementoweb.org/
Source: https://github.com/anwala/wdill
Alexander Nwala, "2015-02-05: What Did It Look Like?," http://ws-dl.blogspot.com/2015/01/2015-02-05-what-did-it-look-like.html
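Picking the "first memento of each year" is a small grouping problem. The sketch below shows one way to do it, assuming the captures are available as (capture datetime, memento URI) pairs; it is not taken from the bot's source, and the name `first_memento_per_year` and the example URIs are hypothetical.

```python
from datetime import datetime


def first_memento_per_year(mementos):
    """Return the earliest (capture datetime, memento URI) pair of each year, oldest first."""
    first = {}
    for captured, uri in mementos:
        year = captured.year
        # Keep a capture only if it is the earliest seen so far for its year.
        if year not in first or captured < first[year][0]:
            first[year] = (captured, uri)
    return [first[year] for year in sorted(first)]


if __name__ == "__main__":
    # Hypothetical capture list (example.org URIs are placeholders).
    captures = [
        (datetime(2014, 6, 1), "https://archive.example.org/20140601/page"),
        (datetime(2013, 3, 5), "https://archive.example.org/20130305/page"),
        (datetime(2014, 1, 9), "https://archive.example.org/20140109/page"),
    ]
    for captured, uri in first_memento_per_year(captures):
        print(captured.year, uri)
```

The resulting list, one capture per year in chronological order, is the sequence of frames an animated GIF would step through.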
42. @weiglemc, @WebSciDL
Use web archives to save the current web and view the past web
• Web Science and Digital Libraries (WS-DL) group at ODU
– ws-dl.blogspot.com, @WebSciDL (Twitter)
• Websites/Tools for web archiving
– Internet Archive's Wayback Machine - archive.org/web
– On-demand archiving - archive.is
– Memento Time Travel - timetravel.mementoweb.org
– Mink - matkelly.com/mink/
– #icanhazmemento
– #whatdiditlooklike