Intro to Web Archiving
Dr. Michele C. Weigle, @weiglemc
Web Sciences and Digital Libraries (WS-DL) Group, @WebSciDL
Department of Computer Science
Old Dominion University
June 26, 2018
ODU Machine Learning and Data Sciences Camp
@weiglemc, @WebSciDL
ODU WS-DL Group
• Web Sciences and Digital Libraries
– digital preservation
– web archiving
– web science (social media analysis, web usage analysis)
• Our recent work has been featured in the popular press
@WebSciDL
http://ws-dl.cs.odu.edu/
http://ws-dl.blogspot.com/
ODU WS-DL Group
PhD Students
• Scott Ainsworth
• Sawood Alam
• Lulwah Alkwai
• Mohamed Aturban
• Brian Griffin
• Hussam Hallak
• Shawn Jones
• Mat Kelly
• Corren McCoy
• Louis Nguyen
• Alexander Nwala
MS Students
• Nauman Siddique
• Miranda Smith
Faculty
• Dr. Michael L. Nelson
• Dr. Michele C. Weigle
• Coming in Fall 2018!
– Dr. Sampath Jayarathna
– Dr. Jian Wu
What is the past web?
The Web holds our stories
But webpages can disappear
• Average lifespan of a webpage: 50-100 days
• A year after publication, about 11% of content shared on social media will be gone.
SalahEldeen and Nelson, "Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?", TPDL 2012
http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
Maybe it's archived?
https://archive.org/web
Why archives matter
• Malaysia Airlines Flight 17 (MH17)
• Ukrainian separatists originally took credit for downing a transport plane in that location
• Later deleted the post
• Internet Archive had archived the post before deletion
http://www.csmonitor.com/World/Europe/2014/0717/Web-evidence-points-to-pro-Russia-rebels-in-downing-of-MH17-video
We can use archives to tell stories
Similar to our Hurricane Katrina example: https://www.slideshare.net/phonedude/why-careaboutthepast
https://www.nytimes.com/2016/11/17/insider/in-13-headlines-the-drama-of-election-night.html
If something's gone from the live web, check a web archive
Web archives to the rescue!
https://twitter.com/brian3354/status/966081774194511874
Internet Archive's Wayback Machine has gone mainstream
"God bless you Internet Archive"
- Rachel Maddow, Dec 12, 2016
Last Week Tonight, Mar 18, 2018
Jill Lepore, "The Cobweb", The New Yorker, Jan 26, 2015
But Wayback is not Google
• Wayback Machine has no full-text search
– too big to be indexed
– 654 billion web pages, 9 petabytes of data
– growing at 20 TB/week
• Enter URL and pick a date
"It’s more like a phone book than like an archive."
-Jill Lepore, The New Yorker
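The "enter URL and pick a date" lookup can also be written directly as a URL. The following is a minimal sketch, assuming the `https://web.archive.org/web/<YYYYMMDDhhmmss>/<URL>` pattern that appears in Wayback Machine links elsewhere in these slides. The helper name `wayback_url` is made up for illustration and only formats a string. The Wayback Machine typically redirects such a request to the capture closest to the requested datetime.

```python
from datetime import datetime


def wayback_url(url: str, when: datetime) -> str:
    """Build a Wayback Machine URL for `url` near the datetime `when`.

    Hypothetical helper: it formats a string and makes no network request.
    """
    timestamp = when.strftime("%Y%m%d%H%M%S")  # 14-digit datetime, e.g. 20180115103952
    return f"https://web.archive.org/web/{timestamp}/{url}"


if __name__ == "__main__":
    # Ask for odu.edu as it looked around April 16, 2011.
    print(wayback_url("http://www.odu.edu/", datetime(2011, 4, 16)))
```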
What do people think the Wayback Machine is?
https://www.politico.com/story/2018/04/25/joy-reid-anti-gay-posts-550213
https://www.cnn.com/2018/02/16/politics/richard-pinedo-guilty-plea/index.html
https://web.archive.org/web/20180115103952/https:/auctionessistance.com/
Caches are not archives
http://ws-dl.blogspot.com/2018/01/2018-01-02-link-to-web-archives-not.html
http://www.wired.co.uk/article/russia-propaganda-online-blog-longform-medium-posts
https://webcache.googleusercontent.com/search?q=cache:qwqnGPqC2vsJ:https://medium.com/%40TheFoundingSon/huffington-post-vs-whiteness-and-white-women-1e67193085d4+&cd=15&hl=en&ct=clnk&gl=uk
Is it really that important to archive instead of just taking a screenshot?
https://twitter.com/AngryBlackLady/status/990032514080108544
https://twitter.com/phonedude_mln/status/990070331737100288
We should be doing both
https://twitter.com/conspirator0/status/1000475042017366017
“If you see something, save something”
https://blog.archive.org/2017/01/25/see-something-save-something/
There's more than just the Internet Archive
http://timetravel.mementoweb.org/list/20020908180610/http://blog.reidreport.com/
TimeTravel
http://timetravel.mementoweb.org
Pro tip: submit pages to multiple archives
https://twitter.com/phonedude_mln/status/998948823845261312
We've built tools to help people submit webpages to multiple archives
• Mink – Google Chrome extension
• #icanhazmemento – Twitter bot
• ArchiveNow – Python module, Docker container, local web service
Mink
• Google Chrome extension
• Submit currently viewed webpage to public archives
https://github.com/machawk1/Mink
Mat Kelly, Michael L. Nelson and Michele C. Weigle, "Mink: Integrating the Live and Archived Web Viewing Experience Using Web Browsers and Memento," JCDL 2014, poster.
http://ws-dl.blogspot.com/2014/10/2014-10-03-integrating-live-and.html
“Archive What I See Now: Bringing Institutional Web Archiving Tools to the Individual Researcher”, 2014-2017, HK-50181-14
#icanhazmemento
• Twitter bot
• Include #icanhazmemento in a tweet with a URL
• Bot replies with a link to the memento of the page closest to the time of the tweet
• If page not archived, bot submits URL to multiple public archives, replies with a link to the memento in Time Travel
https://github.com/anwala/icanhazmemento
Alexander Nwala, "2015-07-22: I Can Haz Memento," http://ws-dl.blogspot.com/2015/07/2015-07-22-i-can-haz-memento.html
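The "memento closest to the time of the tweet" step can be sketched as a nearest-datetime search. This is an illustration only, not the bot's actual code: the function name `closest_memento`, the `(datetime, uri)` tuple layout, and the example URIs are all assumptions.

```python
from datetime import datetime


def closest_memento(mementos, target):
    """Return the (datetime, uri) pair whose datetime is nearest to `target`.

    `mementos` is a list of (datetime, uri) tuples. Returns None when the
    list is empty, the case where a bot would submit the URL to archives.
    """
    if not mementos:
        return None
    return min(mementos, key=lambda m: abs(m[0] - target))


if __name__ == "__main__":
    # Made-up mementos of one page, plus a made-up tweet time.
    known = [
        (datetime(2015, 1, 1), "https://archive.example/20150101/page"),
        (datetime(2015, 7, 20), "https://archive.example/20150720/page"),
        (datetime(2016, 3, 1), "https://archive.example/20160301/page"),
    ]
    print(closest_memento(known, datetime(2015, 7, 22)))
```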
ArchiveNow
• Python module, Docker container
• Submit URI to multiple archives
https://github.com/oduwsdl/archivenow
Mohamed Aturban, Mat Kelly, Sawood Alam, John Berlin, Michael L. Nelson and Michele C. Weigle, "ArchiveNow: Simplified, Extensible, Multi-Archive Preservation," JCDL 2018, poster.
http://ws-dl.blogspot.com/2017/02/2017-02-22-archive-now-archivenow.html
“Towards a Web-Centric Approach for Capturing the Scholarly Record”, 2016-2019
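The idea of submitting one URI to several archives can be sketched as a small dispatcher with one handler per archive. This is not ArchiveNow's API: `push_to_archives`, the handler signature, and the two fake archives below are hypothetical, and the fake handlers return made-up URIs rather than contacting any service.

```python
from typing import Callable, Dict

# A handler takes a URL and returns the URI of the newly created memento.
Handler = Callable[[str], str]


def push_to_archives(url: str, handlers: Dict[str, Handler]) -> Dict[str, str]:
    """Submit `url` to every archive in `handlers`; collect results or errors."""
    results = {}
    for name, handler in handlers.items():
        try:
            results[name] = handler(url)
        except Exception as err:  # one failing archive should not stop the rest
            results[name] = f"error: {err}"
    return results


def fake_archive_a(url: str) -> str:
    # Stand-in for a real submission call; the returned URI is invented.
    return f"https://archive-a.example/20180626000000/{url}"


def fake_archive_b(url: str) -> str:
    # Stand-in that simulates an archive being down.
    raise RuntimeError("archive unavailable")


if __name__ == "__main__":
    handlers = {"a": fake_archive_a, "b": fake_archive_b}
    print(push_to_archives("http://example.com/", handlers))
```

Keeping each archive behind its own handler means adding another archive is one more dictionary entry, and a failure at one archive still leaves copies at the others.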
Memento: Time Travel for the Web
Access mementos in multiple web archives
Memento’s core components:
• A bridge between present and past: link and content negotiation
• A bridge between past and present: link
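In the Memento protocol (RFC 7089), the negotiation from present to past is driven by an `Accept-Datetime` request header carrying the desired datetime as an HTTP date in GMT. The sketch below only builds that header. The helper name `accept_datetime_header` is an assumption, and no request is sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(when: datetime) -> dict:
    """Return the request header used for datetime negotiation.

    `when` should be timezone-aware (a naive value is treated as local
    time); it is converted to UTC and formatted as an HTTP date ending
    in "GMT".
    """
    when_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}


if __name__ == "__main__":
    wanted = datetime(2007, 5, 31, 20, 35, tzinfo=timezone.utc)
    print(accept_datetime_header(wanted))
    # {'Accept-Datetime': 'Thu, 31 May 2007 20:35:00 GMT'}
```

A client would send this header to a TimeGate for the page it wants, and the response points to a memento near that datetime.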
Memento Aggregator
How can I use Memento?
• Memento for Chrome: http://ws-dl.blogspot.com/2013/10/2013-10-14-right-click-to-past-memento.html
• Mink: http://ws-dl.blogspot.com/2014/10/2014-10-03-integrating-live-and.html
• Time Travel: http://timetravel.mementoweb.org
Use Mink to view the odu.edu of the past
Click the Mink icon
Then choose your datetime
Archived odu.edu
Fixing 404 Pages: Google Results Page
Fixing 404 Pages: Result Page
http://www.clashmusic.com/news/johnny-marr-leaves-the-cribs
Fixing 404 Pages: Scrolling Down
Fixing 404 Pages: Server Up, Page 404
Fixing 404 Pages: Using Mink
Fixing 404 Pages: Archived Page 2011-04-16
#whatdiditlooklike
• Twitter bot
• Include #whatdiditlooklike in a tweet with a URL
• Bot generates animated GIF of first memento of each year
• Bot replies with a link to entry in Tumblr
Tumblr: http://whatdiditlooklike.mementoweb.org/
Source: https://github.com/anwala/wdill
Alexander Nwala, "2015-02-05: What Did It Look Like?," http://ws-dl.blogspot.com/2015/01/2015-02-05-what-did-it-look-like.html
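Choosing the "first memento of each year" can be sketched as a single pass over mementos sorted by datetime. This is an illustration, not the bot's code: `first_memento_per_year`, the `(datetime, uri)` tuple layout, and the example URIs are assumptions.

```python
from datetime import datetime


def first_memento_per_year(mementos):
    """Return {year: (datetime, uri)} holding the earliest memento of each year.

    `mementos` is an iterable of (datetime, uri) tuples in any order.
    """
    first = {}
    for when, uri in sorted(mementos):
        # setdefault keeps the first (earliest) entry seen for each year.
        first.setdefault(when.year, (when, uri))
    return first


if __name__ == "__main__":
    captures = [
        (datetime(2015, 3, 9), "https://archive.example/20150309/page"),
        (datetime(2015, 1, 2), "https://archive.example/20150102/page"),
        (datetime(2016, 6, 30), "https://archive.example/20160630/page"),
    ]
    for year, (when, uri) in first_memento_per_year(captures).items():
        print(year, when.date(), uri)
```

The selected mementos, one per year, are the frames that an animated GIF of the page's history would be built from.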
Use web archives to save the current web and view the past web
• Web Science and Digital Libraries (WS-DL) group at ODU
– ws-dl.blogspot.com, @WebSciDL (Twitter)
• Websites/Tools for web archiving
– Internet Archive's Wayback Machine - archive.org/web
– On-demand archiving - archive.is
– Memento Time Travel - timetravel.mementoweb.org
– Mink - matkelly.com/mink/
– #icanhazmemento
– #whatdiditlooklike

Scaling API-first – The story of a global engineering organization
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 

Intro to Web Archiving

  • 1. Intro to Web Archiving Dr. Michele C. Weigle, @weiglemc Web Sciences and Digital Libraries (WS-DL) Group, @WebSciDL Department of Computer Science Old Dominion University June 26, 2018 ODU Machine Learning and Data Sciences Camp
  • 2. ODU WS-DL Group • Web Sciences and Digital Libraries – digital preservation – web archiving – web science (social media analysis, web usage analysis) • Our recent work has been featured in the popular press • @WebSciDL • http://ws-dl.cs.odu.edu/ • http://ws-dl.blogspot.com/
  • 3. ODU WS-DL Group • PhD Students: Scott Ainsworth, Sawood Alam, Lulwah Alkwai, Mohamed Aturban, Brian Griffin, Hussam Hallak, Shawn Jones, Mat Kelly, Corren McCoy, Louis Nguyen, Alexander Nwala • MS Students: Nauman Siddique, Miranda Smith • Faculty: Dr. Sampath Jayarathna and Dr. Jian Wu (coming in Fall 2018!), Dr. Michael L. Nelson, Dr. Michele C. Weigle • @WebSciDL • http://ws-dl.cs.odu.edu/ • http://ws-dl.blogspot.com/
  • 4. What is the past web?
  • 5. The Web holds our stories
  • 6. But webpages can disappear • Average lifespan of a webpage: 50-100 days • A year after publication, about 11% of content shared on social media will be gone. • SalahEldeen and Nelson, "Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?", TPDL 2012, http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
  • 7. Maybe it's archived? • https://archive.org/web
  • 8. Why archives matter • Malaysia Airlines Flight 17 (MH17) • Ukrainian separatists originally took credit for downing a transport plane in that location • Later deleted the post • Internet Archive had archived the post before deletion • http://www.csmonitor.com/World/Europe/2014/0717/Web-evidence-points-to-pro-Russia-rebels-in-downing-of-MH17-video
  • 9. We can use archives to tell stories • https://www.nytimes.com/2016/11/17/insider/in-13-headlines-the-drama-of-election-night.html • similar to our Hurricane Katrina example: https://www.slideshare.net/phonedude/why-careaboutthepast
  • 10. If something's gone from the live web, check a web archive
  • 11. Web archives to the rescue! • https://twitter.com/brian3354/status/966081774194511874
  • 12. Internet Archive's Wayback Machine has gone mainstream • "God bless you Internet Archive" - Rachel Maddow, Dec 12, 2016 • Last Week Tonight, Mar 18, 2018 • Jill Lepore, "The Cobweb", The New Yorker, Jan 26, 2015
  • 13. But Wayback is not Google • Wayback Machine has no full-text search – too big to be indexed – 654 billion web pages, 9 petabytes of data – growing at 20 TB/week • Enter URL and pick a date • "It’s more like a phone book than like an archive." - Jill Lepore, The New Yorker
  • 14. What do people think the Wayback Machine is? • https://www.politico.com/story/2018/04/25/joy-reid-anti-gay-posts-550213
  • 15. What do people think the Wayback Machine is? • https://www.cnn.com/2018/02/16/politics/richard-pinedo-guilty-plea/index.html • https://www.politico.com/story/2018/04/25/joy-reid-anti-gay-posts-550213 • https://web.archive.org/web/20180115103952/https:/auctionessistance.com/
  • 16. Caches are not archives • http://ws-dl.blogspot.com/2018/01/2018-01-02-link-to-web-archives-not.html • http://www.wired.co.uk/article/russia-propaganda-online-blog-longform-medium-posts • https://webcache.googleusercontent.com/search?q=cache:qwqnGPqC2vsJ:https://medium.com/%40TheFoundingSon/huffington-post-vs-whiteness-and-white-women-1e67193085d4+&cd=15&hl=en&ct=clnk&gl=uk
  • 17. Is it really that important to archive instead of just taking a screenshot? • https://twitter.com/AngryBlackLady/status/990032514080108544 • https://twitter.com/phonedude_mln/status/990070331737100288
  • 18. We should be doing both • https://twitter.com/conspirator0/status/1000475042017366017
  • 19. “If you see something, save something” • https://blog.archive.org/2017/01/25/see-something-save-something/
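The "enter URL and pick a date" lookup can also be written directly as a URL. A minimal sketch, assuming the `https://web.archive.org/web/<YYYYMMDDhhmmss>/<url>` pattern seen in the archived link on slide 15; the function name `wayback_url` and the example values are illustrative, not part of the talk:

```python
from datetime import datetime, timezone

# Prefix taken from the archived link shown on slide 15.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(url: str, when: datetime) -> str:
    """Build a Wayback Machine URL for `url` at the datetime `when`.

    `when` should be timezone-aware; it is converted to UTC and
    formatted as the 14-digit timestamp used in Wayback URLs.
    """
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{stamp}/{url}"


if __name__ == "__main__":
    # Hypothetical example: odu.edu as of 2011-04-16 (UTC).
    print(wayback_url("http://www.odu.edu/", datetime(2011, 4, 16, tzinfo=timezone.utc)))
```

The Wayback Machine redirects such a request to the capture closest to the requested timestamp, so the date does not need to match a capture exactly.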
  • 20. There's more than just the Internet Archive • http://timetravel.mementoweb.org/list/20020908180610/http://blog.reidreport.com/
  • 21. TimeTravel • http://timetravel.mementoweb.org
  • 22. Pro tip: submit pages to multiple archives • https://twitter.com/phonedude_mln/status/998948823845261312
  • 23. We've built tools to help people submit webpages to multiple archives • Mink – Google Chrome extension • #icanhazmemento – Twitter bot • ArchiveNow – Python module, Docker container, local web service
  • 24. Mink • Google Chrome extension • Submit currently viewed webpage to public archives • https://github.com/machawk1/Mink • Mat Kelly, Michael L. Nelson and Michele C. Weigle, "Mink: Integrating the Live and Archived Web Viewing Experience Using Web Browsers and Memento," JCDL 2014, poster. http://ws-dl.blogspot.com/2014/10/2014-10-03-integrating-live-and.html • “Archive What I See Now: Bringing Institutional Web Archiving Tools to the Individual Researcher”, 2014-2017, HK-50181-14
  • 25. #icanhazmemento • Twitter bot • Include #icanhazmemento in a tweet with a URL • Bot replies with a link to the memento of the page closest to the time of the tweet • If page not archived, bot submits URL to multiple public archives, replies with a link to the memento in Time Travel • Alexander Nwala, "2015-07-22: I Can Haz Memento," http://ws-dl.blogspot.com/2015/07/2015-07-22-i-can-haz-memento.html • https://github.com/anwala/icanhazmemento
  • 26. ArchiveNow • Python module, Docker container • Submit URI to multiple archives • https://github.com/oduwsdl/archivenow • Mohamed Aturban, Mat Kelly, Sawood Alam, John Berlin, Michael L. Nelson and Michele C. Weigle, "ArchiveNow: Simplified, Extensible, Multi-Archive Preservation," JCDL 2018, poster. http://ws-dl.blogspot.com/2017/02/2017-02-22-archive-now-archivenow.html • “Towards a Web-Centric Approach for Capturing the Scholarly Record”, 2016-2019
  • 27. Memento: Time Travel for the Web • Access mementos in multiple web archives • Memento’s core components: – A bridge between present and past: link and content negotiation – A bridge between past and present: link
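The bot's "memento closest to the time of the tweet" step comes down to a nearest-datetime selection. A minimal sketch of that idea only, not the bot's actual code; the function name `closest_memento` and the sample capture list are hypothetical:

```python
from datetime import datetime


def closest_memento(mementos, target):
    """Return the (datetime, uri) pair whose datetime is nearest to `target`.

    `mementos` is a non-empty list of (capture_datetime, memento_uri) tuples.
    """
    return min(mementos, key=lambda m: abs(m[0] - target))


if __name__ == "__main__":
    # Hypothetical captures of one page; the URIs are placeholders.
    captures = [
        (datetime(2010, 1, 5), "https://archive.example/2010/page"),
        (datetime(2011, 4, 16), "https://archive.example/2011/page"),
        (datetime(2014, 9, 30), "https://archive.example/2014/page"),
    ]
    tweet_time = datetime(2011, 6, 1)
    print(closest_memento(captures, tweet_time)[1])
```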
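The "content negotiation" half of the present-to-past bridge is negotiation in the datetime dimension: a client sends an `Accept-Datetime` header (an HTTP-date in GMT, per RFC 7089) to a TimeGate, which redirects to the memento closest to that time. A minimal sketch that only builds the request and sends nothing; the TimeGate path under timetravel.mementoweb.org and the function name `timegate_request` are assumptions made for illustration:

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumed TimeGate endpoint of the Time Travel service.
TIMEGATE = "http://timetravel.mementoweb.org/timegate/"


def timegate_request(url: str, when: datetime):
    """Return (timegate_url, headers) for a Memento datetime negotiation.

    `when` should be timezone-aware; it is converted to UTC and formatted
    as an HTTP-date, the format required for Accept-Datetime.
    """
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return TIMEGATE + url, {"Accept-Datetime": http_date}


if __name__ == "__main__":
    target, headers = timegate_request(
        "http://www.odu.edu/", datetime(2011, 4, 16, tzinfo=timezone.utc)
    )
    print(target)
    print(headers)
```

The returned pair can be handed to any HTTP client. In the response, the `Link` header carries the relations (`original`, `timegate`, `timemap`, `memento`) that make up the "link" half of the bridge in both directions.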
  • 30. How can I use Memento? • Memento for Chrome: http://ws-dl.blogspot.com/2013/10/2013-10-14-right-click-to-past-memento.html • Mink: http://ws-dl.blogspot.com/2014/10/2014-10-03-integrating-live-and.html • http://timetravel.mementoweb.org
  • 31. Use Mink to view the odu.edu of the past
  • 32. Click the Mink icon
  • 33. Then choose your datetime
  • 35. Fixing 404 Pages: Google Results Page
  • 36. Fixing 404 Pages: Result Page • http://www.clashmusic.com/news/johnny-marr-leaves-the-cribs
  • 37. Fixing 404 Pages: Scrolling Down
  • 38. Fixing 404 Pages: Server Up, Page 404
  • 39. Fixing 404 Pages: Using Mink
  • 40. Fixing 404 Pages: Archived Page 2011-04-16
  • 41. #whatdiditlooklike • Twitter bot • Include #whatdiditlooklike in a tweet with a URL • Bot generates animated GIF of first memento of each year • Bot replies with a link to entry in Tumblr • Tumblr: http://whatdiditlooklike.mementoweb.org/ • Source: https://github.com/anwala/wdill • Alexander Nwala, "2015-02-05: What Did It Look Like?," http://ws-dl.blogspot.com/2015/01/2015-02-05-what-did-it-look-like.html
  • 42. Use web archives to save the current web and view the past web • Web Science and Digital Libraries (WS-DL) group at ODU – ws-dl.blogspot.com, @WebSciDL (Twitter) • Websites/Tools for web archiving – Internet Archive's Wayback Machine - archive.org/web – On-demand archiving - archive.is – Memento Time Travel - timetravel.mementoweb.org – Mink - matkelly.com/mink/ – #icanhazmemento – #whatdiditlooklike