Link Persistence,
Website Persistence
Nicholas Taylor
@nullhandle
May 28, 2013
"Forward" by Flickr user Hitchster under CC BY 2.0
why preserve the web?
broken links
"404" by Flickr user adactio under CC BY 2.0
44 days (Kahle, 1997)
75 days (Kahle, 2001)
100 days (Kahle, 2003)
variable (Sanderson, Phillips, and
Van de Sompel, 2011)
• literature review of 17 studies
• research focused on scholarly citations
• decay rates of 39-82%
• over periods of 1-13 years
"Digital documents last forever—or five years, whichever comes first."
(Jeff Rothenberg, 1997)
"Out of books sprout... plants" by DeviantArt user quinn.anya under CC BY-SA 2.0
The Art and Science of
LINK CHECKING
"http Blue Background" by DeviantArt user SoulArt2012 under CC BY-NC-ND 3.0
http response codes
• 404: "Not Found"
• 200: "OK"
• 301: "Moved Permanently"
• 500: "Internal Server Error"
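A minimal sketch of how a link checker might read these codes, using only Python's standard library. The three-way grouping (working, redirect, non-working) is an assumption made for illustration; the slides do not define it.

```python
from http import HTTPStatus


def describe(code: int) -> str:
    """Return the code with its standard reason phrase, e.g. '404: Not Found'."""
    return f"{code}: {HTTPStatus(code).phrase}"


def classify(code: int) -> str:
    """Group a status code the way a simple link checker might (assumed grouping)."""
    if 200 <= code < 300:
        return "working"
    if 300 <= code < 400:
        return "redirect"      # follow the Location header before deciding
    return "non-working"       # everything else, notably 4xx and 5xx errors


for code in (404, 200, 301, 500):
    print(describe(code), "->", classify(code))
```

A redirect is neither a success nor a failure by itself; the checker has to follow it and classify the final response.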
automated link checker
"La Machine @ Yokohama" by Flickr user chidorian under CC BY-SA 2.0
what link checking tells us
"200 ok" by Flickr user reidab under CC BY-NC-SA 2.0
possible scenarios
• link works; same website
• link works; different website
– website may or may not still exist
• link doesn’t work; website still exists
• link doesn’t work; website no longer exists
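The four scenarios can be written down as a small decision function. This is a sketch only: the names `Scenario` and `scenario`, and the three boolean inputs, are hypothetical and chosen for illustration.

```python
from enum import Enum


class Scenario(Enum):
    WORKS_SAME = "link works; same website"
    WORKS_DIFFERENT = "link works; different website"
    BROKEN_EXISTS = "link doesn't work; website still exists"
    BROKEN_GONE = "link doesn't work; website no longer exists"


def scenario(link_works: bool, same_site: bool, site_exists: bool) -> Scenario:
    """Map three observations about a link onto one of the four scenarios."""
    if link_works:
        # For a working link that points elsewhere, the original website
        # may or may not still exist, so site_exists does not change the label.
        return Scenario.WORKS_SAME if same_site else Scenario.WORKS_DIFFERENT
    return Scenario.BROKEN_EXISTS if site_exists else Scenario.BROKEN_GONE


print(scenario(link_works=True, same_site=False, site_exists=True).value)
```

Only `link_works` can be read off an HTTP response code; the other two inputs need a comparison against what the website used to be.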
link works; same website
http://www.fair.org/ (2002) http://www.fair.org/ (2013)
link works; different website…
http://www.fb.com/ (2002) http://www.fb.com/ (2013)
…but website still exists
http://www.fb.org/ (2013)
link doesn’t work…
http://www.state.mo.us/ (2002) http://www.state.mo.us/ (2013)
…but website still exists
http://www.sos.mo.gov/ (2013)
link doesn’t work;
website no longer exists
assumptions
• link works; same website
• link works; different website
– website may or may not still exist
• link doesn’t work; website still exists
• link doesn’t work; website no longer
exists
research questions
• how much are we overestimating website
persistence?
– some working links point to different websites
• how much are we underestimating website
persistence?
– websites may still exist even though links
don’t work or do work but point to different
websites
A Study Using
WEB ARCHIVES
Library of Congress
U.S. Election 2002 Web Archive
preparing the list of links
• exclude links corresponding to electoral
candidate websites
• 1,071 links
– state government
– political parties
– advocacy organizations
– major newspapers
– political blogs
methodology
automated
• run Heritrix against links,
ignoring robots.txt
• log http response codes
• log redirects
manual
• manually check each link
• same website behind
working link?
• does website still exist?
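The automated half of the methodology (log response codes, log redirects) can be sketched as follows. This is not Heritrix: `trace_link` and the `fetch` callable are hypothetical stand-ins for a crawler, and the sample responses are invented so the sketch runs without network access.

```python
from typing import Callable, Optional
from urllib.parse import urljoin

# fetch(url) returns (status code, Location header or None); a hypothetical
# stand-in for a real HTTP client or crawler.
Fetch = Callable[[str], tuple[int, Optional[str]]]

REDIRECT_CODES = (301, 302, 303, 307, 308)


def trace_link(url: str, fetch: Fetch, max_hops: int = 10) -> list[tuple[str, int]]:
    """Log the (url, status) pair for every hop, following redirects."""
    hops = []
    for _ in range(max_hops):
        status, location = fetch(url)
        hops.append((url, status))
        if status in REDIRECT_CODES and location:
            url = urljoin(url, location)  # the Location header may be relative
        else:
            break
    return hops


# Invented responses for illustration only.
fake_web = {
    "http://old.example/": (301, "http://new.example/"),
    "http://new.example/": (200, None),
}
log = trace_link("http://old.example/", lambda u: fake_web[u])
print(log)
```

The log shows whether a link works and where it ends up, but not whether the destination is the same website. That is why the manual check is part of the method.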
working link?
• working: 91%
• non-working: 9%
same website?
• working link; same site: 83%
• non-working link: 9%
• working link; different site: 8%
non-working link; website still exists?
[pie chart: values 91%, 7%, 2%, 8%; legend: working, still exists, doesn't exist]
website still exists?
• still exists: 94%
• doesn't exist: 6%
summary of results
• how much are we overestimating website
persistence?
– 8% of working links point to different websites
• how much are we underestimating website
persistence?
– 82% of websites associated with non-working
links still exist
– 48% of websites whose links now point to
different websites still exist
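One way to read these figures together, as a rough check. It assumes the chart percentages (83%, 9%, 8%) are shares of all links and that the 82% and 48% figures apply to the non-working and different-site groups respectively; all inputs are rounded, so the result is approximate.

```python
# Shares of all links, taken from the "same website?" chart.
working_same = 0.83
non_working = 0.09
working_different = 0.08

# Shares within each group whose website still exists, from the summary.
exists_if_non_working = 0.82
exists_if_different = 0.48

# Assumption: a working link to the same site means the website still exists.
still_exists = (
    working_same
    + exists_if_non_working * non_working
    + exists_if_different * working_different
)

print(f"websites still existing: {still_exists:.1%}")
```

Under those assumptions the total comes to about 94%, in line with the "website still exists?" chart.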
what does it mean?
• websites are (much
more) persistent than
links
• websites are
surprisingly durable?
"Golden Spider Silk" by Flickr user amandabhslater under CC BY-SA 2.0
Beyond Link Checking,
WEBSITE CHECKING?
"Check" by Flickr user ex.libris under CC BY-NC-ND 2.0
building a website checker
1. check whether link still works
2. check whether link still corresponds to
website
3. check whether website still exists
"Most web archiving problems are problems of scale."
(Kris Carpenter Negulescu, 2012)
"chutes and ladders" by Flickr user reallyboring under CC BY-NC-SA 2.0
building a website checker
1. check whether link still works
2. check whether link still corresponds to
website
3. check whether website still exists
Heritrix compares checksums
"Fingerprint" by Flickr user CPOA under CC BY-ND 2.0
…but checksums are limited
"Hashing Emily" by Flickr user wlef70 under CC BY-NC-SA 3.0
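A small sketch of why a checksum alone is a blunt instrument for deciding whether a link still corresponds to the same website. SHA-1 and the sample pages are assumptions for illustration, not a description of how Heritrix is configured.

```python
import hashlib


def checksum(content: bytes) -> str:
    """Return a hex digest of the page content."""
    return hashlib.sha1(content).hexdigest()


# Invented pages for illustration only.
page_2002 = b"<html><body>Welcome to the site. Updated 2002-11-05</body></html>"
page_2013 = b"<html><body>Welcome to the site. Updated 2013-05-28</body></html>"
unrelated = b"<html><body>This domain is for sale.</body></html>"

same_site_changed = checksum(page_2002) != checksum(page_2013)
different_site = checksum(page_2002) != checksum(unrelated)

# Both comparisons report "changed"; the digest cannot say by how much.
print(same_site_changed, different_site)
```

A changed date stamp and a wholesale replacement look identical to the checksum, so it can confirm that nothing changed but cannot tell a minor update from a different website.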
visual analysis of page changes
Pehlivan, Ben-Saad, and Gançarski: "Vi-DIFF: Understanding Web Pages Changes"
building a website checker
1. check whether link still works
2. check whether link still corresponds to
website
3. check whether website still exists
lexical signature of archived page
Ware, Klein, and Nelson: "An Evaluation of Link Neighborhood Lexical Signatures to Rediscover Missing Web Pages"
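A lexical signature is a handful of terms that characterize a page well enough to search for it again. The sketch below assumes a common TF-IDF formulation with add-one smoothing; the cited paper may weight terms differently, and the function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page: str, corpus: list[str], k: int = 5) -> list[str]:
    """Return the k terms of `page` with the highest TF-IDF score."""
    tf = Counter(tokenize(page))
    docs = [set(tokenize(doc)) for doc in corpus]
    n = len(docs)

    def idf(term: str) -> float:
        df = sum(term in doc for doc in docs)
        return math.log((n + 1) / (df + 1))  # smoothed inverse document frequency

    scores = {term: count * idf(term) for term, count in tf.items()}
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Invented texts for illustration only.
archived_page = "missouri secretary of state elections missouri"
background = ["the state of the union", "secretary of defense", "state elections office"]
print(lexical_signature(archived_page, background, k=3))
```

Terms that are frequent on the archived page but rare elsewhere rise to the top; feeding them to a search engine is one way to look for the website at a new address.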
find archived pages w/ Memento
• http protocol
enhancement
• enables discovery of
archived resources in
distributed web
archives
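With Memento, a client asks a TimeGate for the archived version of a URL closest to a given date by sending an `Accept-Datetime` request header. The sketch below only builds the request and sends nothing. The TimeGate address is a placeholder, and appending the original URL to the TimeGate base is a common convention assumed here.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Placeholder TimeGate base URL; substitute a real archive or aggregator.
TIMEGATE = "https://timegate.example/timegate/"


def timegate_request(original_url: str, when: datetime) -> tuple[str, dict[str, str]]:
    """Return the URL and headers for a datetime-negotiated request."""
    when_utc = when.astimezone(timezone.utc)  # pass a timezone-aware datetime
    headers = {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}
    return TIMEGATE + original_url, headers


url, headers = timegate_request(
    "http://www.state.mo.us/",
    datetime(2002, 11, 5, tzinfo=timezone.utc),
)
print(url)
print(headers)
```

The TimeGate is expected to respond with a redirect to the closest archived copy, which can then supply the text for a lexical signature.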
lexical signatures of backlink pages
"The future is already here; it’s just not very evenly distributed."
(William Gibson, 1999)
"Time Travel" by Flickr user xcalibr under CC BY-NC-ND 2.0
Nicholas Taylor
@nullhandle
"Thank You" by Flickr user muffintinmom under CC BY 2.0

 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 

Último (20)

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 

Link Persistence, Website Persistence

Editor's notes

  1. We’ve been losing the web for as long as it’s existed; the first webpage, created by Tim Berners-Lee, survives only as a copy recreated a year after the original. http://www.w3.org/History/19921103-hypertext/hypertext/WWW/TheProject.html
  2. Mainstream recognition of the once-esoteric “page not found” HTTP response code reflects the popular perception of the web’s ephemerality.
  3. I started looking into the literature on link persistence in preparation for writing a blog post for the Library of Congress’ digital preservation blog, the Signal. Brewster Kahle, founder of the Internet Archive, has offered various numbers for the average lifespan of a webpage over the years. As someone trying to archive the entire public web, he seemed like someone who would know.
  4. A meta-study of 17 other studies of link persistence suggested that links decay, but at widely varying rates.
  5. The ambiguity about the ephemerality of web content is reminiscent of Rothenberg’s famous quote about the persistence of digital documents in general.
  6. Now let’s take a look at the simplest automated approach to checking the persistence of links.
  7. When the client’s browser requests the resource at a particular URL, the web server first sends an HTTP response code, indicating the disposition of the resource at the requested URL. These are some common response codes.
  8. An automated link checker, also known as a “spider” or “robot”, works by requesting a series of links and recording the response codes. https://secure.flickr.com/photos/chidorian/3461667159/
  9. Response codes are limited, however; they can tell us about the disposition of content at the specified URL, but they can’t tell us what the content at the specified URL is.
  10. Considering a link and a corresponding website over time, there are a number of possible scenarios when we go back to check on the persistence of both.
  11. The most straightforward case is where the link, the website, and their correspondence all persist.
  12. Sometimes, however, the link still works, but it points to a different website.
  13. That website may still exist at another URL.
  14. Alternatively, maybe the link doesn’t work.
  15. But the website that the link previously corresponded to may still exist at another URL.
  16. Lastly, sometimes the link doesn’t work and the website no longer exists either.
  17. These examples illustrate that link persistence and website persistence are two different things and that using the former as a proxy for the latter misses some of the possible scenarios.
  18. Considering those scenarios, conflating link persistence with website persistence will result in systematic mis-measurements of website persistence. How significant are these mis-measurements?
  19. Measuring website persistence requires knowing about the state of websites in the past, a perfect use case for web archives. I decided to do a study based on the web archives I was most familiar with, those of the Library of Congress. http://lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html
  20. The U.S. Election 2002 Web Archive is one of their earliest web archive collections. The Library of Congress has archived U.S. national election websites every two years since 2000.
  21. There were many more links in the collection than were utilized in this study. Links corresponding to electoral candidate websites were excluded given that they were universally short-lived and would skew the results.
  22. The study consisted of two stages. First, we ran Heritrix against the prepared list of links and logged HTTP response codes and redirects.
  23. In the second stage, we manually visited each link and noted whether it was the same website as we had previously archived. If it was a different website or if the link didn’t work, we attempted to locate the new location of the previously-archived website using a search engine.
  24. The link checker found that 91% of the links ultimately returned a “200” response code. The remaining 9% ultimately returned either “4xx” or “5xx” series response codes.
  25. Bringing in the data on whether the working links still corresponded to the same websites, the share of all links that both work and still point to the same website drops to 83%. The other 8% of all links work but now point to different sites.
  26. Looking more closely at the non-working links, for roughly 77% of them the previously archived website still exists, even though the archived link no longer points to it.
  27. In aggregate, the percentage of websites that still exist after 10 years (94%) is 3 percentage points higher than link checking would’ve suggested (91%). This isn’t at all to say that web archiving isn’t important: if I had included the candidate websites, the pie chart would suddenly show that less than half of the websites still existed. Also, for example, the White House website has existed for these last ten years, but specific content on the website has inevitably disappeared.
  28. The results suggest that we may be marginally overestimating website persistence by conflating working links with website persistence but greatly underestimating website persistence by conflating non-working links with websites that have disappeared.
  29. The key caveat for these results is that I excluded from the study over 1,000 URLs in the web archive collection that would likely all have been both non-working links and websites that no longer existed. The remaining URLs were those that I could more reasonably suppose had a typical chance of either persisting or disappearing.
  30. We’re able to perform link checking effectively with current technologies. Can we come up with a better approach to checking the persistence of websites? Better understanding website persistence would facilitate better capacity planning (e.g., by reducing storage requirements for near-duplicate resources), inform capture frequency scheduling, and increase confidence that captured links correspond to the desired websites.
  31. A website checker would need to be able to check links, too, but that functionality is already covered. What are the prospects for tools that could check link and website correspondence and check whether a website still exists?
  32. In theory, these two latter tasks aren’t that difficult; it’s just that they need to be automated in order to be scalable.
  33. Let’s look first at possible tools for checking link and website correspondence.
  34. Heritrix already has the ability to compare the checksums of a resource at a particular URL over successive visits. This allows for an “absolute” assessment of sameness.
  35. However, even the smallest change is enough to produce a checksum mismatch. We need a tool that can assess the magnitude or importance of the difference between successive versions, not just the fact of a difference.
  36. The Vi-DIFF algorithm evaluates both the structure of a webpage and its segmented visual appearance to assess the magnitude of change. As a follow-on to a link checker, the algorithm could be calibrated to indicate whether it was the same site as previously visited or an entirely new one.
  37. Now let’s look at possible tools for checking website persistence, irrespective of link persistence.
  38. The lexical signature is a set of keywords that are sufficiently descriptive and unique to be used in a search engine to dereference the page.
  39. If the URL no longer works but exists in an archive, the lexical signature can be derived from the archived page and used to locate the new URL.
  40. If the URL itself isn’t archived, the lexical signature can be derived from backlinks.
  41. These tools exist but are not yet in wide use in the web archiving community. Wider utilization of these tools would allow us to better assess website persistence and the discrepancy with link persistence.
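
As a rough illustration of the link checking described in notes 7 and 8, the sketch below requests a URL and records the response code it ultimately gets. It uses only Python's standard library; the function names, the outcome labels ("working", "redirect", "broken", "unreachable"), and the User-Agent string are assumptions made for this example and are not taken from Heritrix or from the study.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def classify_status(code):
    """Map an HTTP status code to a coarse link-checking outcome."""
    if 200 <= code < 300:
        return "working"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 600:
        return "broken"
    return "unknown"


def check_link(url, timeout=10.0):
    """Request an http(s) URL and return (final_url, status_code, outcome).

    urlopen follows redirects by default, so the recorded status is the
    one ultimately returned after any 301/302 hops.
    """
    request = Request(url, headers={"User-Agent": "link-check-sketch"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.geturl(), response.status, classify_status(response.status)
    except HTTPError as error:
        # 4xx and 5xx responses are raised as HTTPError; the code is still available.
        return url, error.code, classify_status(error.code)
    except OSError:
        # DNS failure, refused connection, or timeout: no response code at all.
        return url, None, "unreachable"


if __name__ == "__main__":
    for link in ["http://www.fair.org/", "http://www.state.mo.us/"]:
        print(check_link(link))
```

As note 9 points out, a "working" outcome here says nothing about whether the content behind the URL is still the same website.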
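
The percentages in notes 24 to 27 can be combined in a short back-of-the-envelope calculation. The sketch below uses only the rounded figures quoted in those notes. The last step is an inference rather than a reported result: it assumes that every website that still exists is either behind a working same-site link or has moved to a new URL, and it derives the share of relocated sites among the "working but different" links as whatever is needed to reach the 94% total.

```python
def website_persistence_breakdown():
    """Return shares of all links, using the rounded figures from notes 24-27."""
    working = 0.91               # links that ultimately returned 200
    not_working = 1.0 - working  # links that ended in 4xx or 5xx
    working_same_site = 0.83     # working links still pointing to the same site
    websites_exist = 0.94        # websites still existing after 10 years

    working_different_site = working - working_same_site
    # Roughly 77% of the non-working links have a website that still exists.
    relocated_from_broken = 0.77 * not_working
    # Assumption: the rest of the 94% comes from websites that moved away
    # from links that still work but now show a different site.
    relocated_from_working = websites_exist - working_same_site - relocated_from_broken
    return {
        "working_different_site": working_different_site,
        "relocated_from_broken": relocated_from_broken,
        "relocated_from_working": relocated_from_working,
    }


if __name__ == "__main__":
    for name, share in website_persistence_breakdown().items():
        print(f"{name}: {share:.1%}")
```

With these rounded inputs, about 6.9 percentage points of surviving websites sit behind non-working links, and about 4.1 points would have to sit behind working links that now point elsewhere. Because the inputs are rounded, both figures are approximate.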
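
Notes 34 to 36 contrast an "absolute" checksum comparison with a measure of how much a page has changed. The sketch below shows that contrast in miniature. The checksum uses SHA-256 as an arbitrary choice, and the graded measure is a simple word-overlap ratio from Python's `difflib`, which is only a stand-in: it is not the Vi-DIFF algorithm and ignores page structure and visual appearance. The 0.5 threshold and the sample page texts are invented for illustration.

```python
import hashlib
from difflib import SequenceMatcher


def checksum(content):
    """Digest of the page text; any change at all produces a different value."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def similarity(old, new):
    """Word-level similarity between two page texts, from 0.0 to 1.0."""
    return SequenceMatcher(None, old.split(), new.split()).ratio()


def looks_like_same_site(old, new, threshold=0.5):
    """Hypothetical decision rule: treat pages above the threshold as the same site."""
    return similarity(old, new) >= threshold


archived = "Welcome to the state government home page for residents and visitors"
edited = "Welcome to the state government home page for residents and tourists"
parked = "This domain is parked buy it today"

if __name__ == "__main__":
    print(checksum(archived) == checksum(edited))  # one changed word breaks the match
    print(round(similarity(archived, edited), 2))
    print(round(similarity(archived, parked), 2))
    print(looks_like_same_site(archived, edited), looks_like_same_site(archived, parked))
```

In practice the threshold would need calibrating against pages already known to be the same or different site, which is the calibration step note 36 alludes to.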
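
The lexical signature described in notes 38 and 39 can be approximated by picking the terms that are frequent in the archived page but rare elsewhere. The sketch below scores terms with a smoothed TF-IDF weighting, which is one common way to do this; the notes do not say which weighting the existing tools use. The function names, the tiny background corpus, and the sample page text are all made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page, corpus, size=5):
    """Return the `size` terms of `page` with the highest TF-IDF score."""
    term_counts = Counter(tokenize(page))
    doc_term_sets = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(doc_term_sets)
    scores = {}
    for term, tf in term_counts.items():
        df = sum(1 for terms in doc_term_sets if term in terms)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed inverse document frequency
        scores[term] = tf * idf
    # Highest score first; ties are broken alphabetically for repeatability.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:size]


archived_page = "missouri secretary of state elections missouri voter registration missouri"
background = [
    "the state of the union",
    "news of the day and the state of sports",
    "weather and the news of the day",
]

if __name__ == "__main__":
    signature = lexical_signature(archived_page, background)
    print(" ".join(signature))  # candidate search-engine query
```

The resulting terms would then be submitted to a search engine to look for the website's new location, much as the manual second stage of the study did.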