SlideShare a Scribd company logo
1 of 49
(Re-) Discovering Lost Web Pages Mathematics & Computer Science Seminar Emory University October 2, 2009 Martin Klein & Michael L. Nelson Department of Computer Science Old Dominion University Norfolk VA  www.cs.odu.edu/~{mklein,mln}
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],The Problem
The Actors Put a human -- lots of humans -- in the loop  for preservation purposes
[object Object],[object Object],[object Object],[object Object],The Environment
ftp://techreports.larc.nasa.gov/pub/techreports/larc/93/tm109025.ps.Z http://techreports.larc.nasa.gov/ltrs/PDF/tm109025.pdf
Web Infrastructure: Refreshing & Migrating
Lapsed Website
URI Content Mapping Problem 1 same  URI maps to  same  or very similar content at a later time 2 same  URI maps to  different  content at a later time 3 different  URI maps to  same  or very similar content at the same or at a later time 4 the content can  not be found at any URI U1 C1 U1 C1 time A B U1 C2 U1 C1 time A B U2 C1 U1 C1 U1 404 time A B U1 ?? U1 C1 time A B
JCDL 2008 http://www.jcdl2008.org/ July 2008 http://www.jcdl2008.org/ Today Scenario 1: Same URI, Same Content
Hypertext 2006 http://www.ht06.org/ August 2006 http://www.ht06.org/ Today Scenario 2: Same URI, Different Content
PSP 2003 http://www.pspcentral.org/events/annual_meeting_2003.html August 2003 http://www.pspcentral.org/events/ archive /annual_meeting_2003.html Today Scenario 3a: Same Content, Different URI
ECDL 1999 http://www-rocq.inria.fr/EuroDL99/ October 1999 http://www.informatik.uni-trier.de/~ley/db/conf/ercimdl/ercimdl99.html Today Scenario 3b: Similar Content, Different URI
Greynet 1999 http://www.konbib.nl/infolev/greynet/2.5.htm 1999 Today ? ? Scenario 4: Content Not Findable At Any URI
Otto : You eat a lot of acid, Miller, back in the hippie days?  Miller : A lot o' people don't realize what's really going on.  They view life as a bunch o' unconnected incidents 'n things.  They don't realize that there's this, like, lattice o' coincidence  that lays on top o' everything. Give you an example;  show you what I mean: suppose you're thinkin' about a  plate o' shrimp. Suddenly someone'll say, like,  plate, or  shrimp, or plate o' shrimp  out of the blue, no explanation.  No point in lookin' for one, either. It's all part of a cosmic  unconsciousness.
[object Object],[object Object],[object Object],[object Object],picture from http://www.crystalinks.com/jung.html Synchronicity
The Bigger Picture Synchronicity Architecture ,[object Object],[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],What Was That Web Page About? GET https://user:pass@api.del.icio.us/v1/posts/suggest?url=http://yahoo.com/ <?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?> <suggest> <popular>web</popular> <popular>tools</popular> <popular>searchengines</popular> <recommended>yahoo!</recommended> <recommended>yahoo</recommended> <recommended>web</recommended> <recommended>tools</recommended> <recommended>search</recommended> <recommended>reference</recommended> <recommended>portal</recommended> <recommended>news</recommended> </suggest> ? A B C
 
What is a Signature? (aka “message digest”, examples include “md5” and “sha-1”) image from Eddie Kohler http://www.cs.ucla.edu/~kohler/
What is a Lexical Signature? ,[object Object],[object Object],[object Object],[object Object],“ Removal Policies in Network Caches for World-Wide Web Documents” Query Google Resource Abstract REMOVAL HIT RATE PROXY CACHE LS
[object Object],[object Object],LSs as Proposed by Phelps and Wilensky ,[object Object],[object Object],[object Object],[object Object],http://www.cs.berkeley.edu/~wilensky/NLP.html becomes: http://www.cs.berkeley.edu/~wilensky/NLP.html? lexical-signature=texttiling+wilensky+disambiguation+subtopic+iago
Lexical Signatures -- Examples A “Googlewhack” ( http://en.wikipedia.org/wiki/Googlewhack ) can be thought  of as a two-term LS that produces a 1/1 ranking. libraries jcdl digital conference pst http://www. google .com/search? q=libraries + jcdl +digital+conference+ pst http://www.jcdl2008.org 1/51 (2/77 in 01/2008) nsdl multiagency imls testbeds extramural http://www. google .com/search? q=nsdl + multiagency + imls + testbeds +extramural http://www.dli2.nsf.gov 0/10 Rank/Results URL LS 1/1 http://www.cs.berkeley.edu/˜wilensky/NLP.html texttiling wilensky disambiguation subtopic iago http://www. google .com/search? q=texttiling + wilensky +disambiguation+subtopic+ iago 1/221,000 (1/174,000 in 01/2008) http://www.loc.gov library collections congress thomas american http://www. google .com/search? q=library +collections+congress+ thomas + american
Generating LSs ,[object Object],[object Object],[object Object],[object Object]
Generating LSs ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Theoretical Underpinnings of Synchronicity ,[object Object],[object Object],[object Object],[object Object]
Hacks for Estimating IDF ,[object Object],[object Object]
URL: http://www.perfect10wines.com Year: 2007 Union: 12 unique terms For LS purposes, it doesn’t matter much…
Comparing LSs Top 5, 10 and 15 terms LC – local universe  SC – screen scraping  NG – N-Grams ~4 of 5 LS terms are the same
How Does Google N-grams TC  Relate to DF? ,[object Object],[object Object],[object Object],[object Object],[object Object]
TC Ranks vs DF Ranks Within ukWaC
Rank Correlation Within ukWaC semi-log scale
TC Frequencies in ukWaC and N-Grams N-Grams have a threshold of 200
LS Evolution Over Time Copies of web pages from the IA (1996-2007) 300 Random URLs, winnowed to 98, 10493 observations over 12 years
Evolution Over Time -- Example 10-term LSs generated for http://www.perfect10wines.com
Two Methods for Measuring Evolution ,[object Object],[object Object],[object Object],Rooted Sliding
Evolution Over Time - Rooted ,[object Object],[object Object],[object Object]
Evolution Over Time - Sliding ,[object Object],[object Object]
Performance of LSs ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Performance – Number of Terms ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
The Bigger Picture Performance – Age Score of LSs consisting of 2, 5, 7 and 10 terms ,[object Object],[object Object],[object Object],Fair Optimistic
Titles (TI), 5- & 7-term Lexical Signatures (LS5, LS7), Tags (TA) 500 random URLs from dmoz.org winnowed to 309 (only 47 of 309 had tags in delicious.com). Due to query restrictions, link neighborhood only run on Yahoo -- results were similar to tags.
Frequency of Change ordered in increasing order by: 1) observations 2) changes Number of Title Changes and  Observations in the IA ,[object Object],[object Object],[object Object],6000 random URLs from dmoz.org, winnowed to 1090 URLs and 100k+ observations
Times of Change Mean Time Delta Between Changes Time Span Between First and  Last Observation in the IA ordered in increasing order by: 1) observations 2) changes ,[object Object],[object Object],[object Object]
Degree of Change Mean Levenshtein Scores of all Titles - Sliding ,[object Object],[object Object],[object Object]
Degree of Change Mean Levenshtein Scores of all Titles - Rooted ,[object Object],[object Object],[object Object]
http://www.sun.com/solutions   mean Levenshtein score  sliding: 0.84 rooted: 0.29 1998-01-27 Sun Software Products Selector Guides -Solutions Tree 1999-02-20 Sun Software Solutions 2002-02-01 Sun Microsystems Products 2002-06-01 Sun Microsystems - Business & Industry Solutions 2003-08-01 Sun Microsystems - Industry & Infrastructure Solutions 2004-02-02 Sun Microsystems - Solutions 2004-06-10 Gateway Page - Sun Solutions 2006-01-09 Sun Microsystems Solutions & Services 2007-01-03 Services & Solutions  2007-02-07 Sun Services & Solutions 2008-01-19 Sun Solutions http://www.datacity.com/mainf.html   mean Levenshtein score  sliding: 0.68 rooted: 0.15 2000-06-19 DataCity of Manassas Park Main Page 2000-10-12  DataCity of Manassas Park sells Custom Built Computers & Removable Hard Drives 2001-08-21  DataCity a computer company in Manassas Park sells Custom Built Computers & Removable Hard Drives  2002-10-16 computer company in Manassas Virginia sells Custom Built Computers with Removable Hard Drives Kits and Iomega 2GB Jaz Drives (jazz drives) October 2002 DataCity 800-326-5051 toll free 2006-03-14  Est 1989 Computer company in Stafford Virginia sells Custom Built Secure Computers with DoD 5200.1-R Approved Removable Hard Drives, Hard Drive Kits and Iomega 2GB Jaz Drives (jazz drives), introduces the IllumiNite&reg; lighted keyboard DataCity 800-326-5051 Service Disabled Veteran Owned Business SDVOB
Content Change vs. Title Change
Conclusions & Future Work ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Necronomicon

More Related Content

What's hot

Working with deeply nested documents in Apache Solr
Working with deeply nested documents in Apache SolrWorking with deeply nested documents in Apache Solr
Working with deeply nested documents in Apache SolrAnshum Gupta
 
Introduction to Graph Databases
Introduction to Graph DatabasesIntroduction to Graph Databases
Introduction to Graph DatabasesPaolo Pareti
 
Consuming Linked Data SemTech2010
Consuming Linked Data SemTech2010Consuming Linked Data SemTech2010
Consuming Linked Data SemTech2010Juan Sequeda
 
Introduction to Linked Data
Introduction to Linked DataIntroduction to Linked Data
Introduction to Linked DataJuan Sequeda
 
Semantic Technologies: Representing Semantic Data
Semantic Technologies: Representing Semantic DataSemantic Technologies: Representing Semantic Data
Semantic Technologies: Representing Semantic DataMatthew Rowe
 
Search Engines After The Semanatic Web
Search Engines After The Semanatic WebSearch Engines After The Semanatic Web
Search Engines After The Semanatic Websamar_slideshare
 
A Semantic Data Model for Web Applications
A Semantic Data Model for Web ApplicationsA Semantic Data Model for Web Applications
A Semantic Data Model for Web ApplicationsArmin Haller
 
2011 and still bruteforcing - OWASP Spain
2011 and still bruteforcing - OWASP Spain2011 and still bruteforcing - OWASP Spain
2011 and still bruteforcing - OWASP SpainChristian Martorella
 
Using MongoDB as a high performance graph database
Using MongoDB as a high performance graph databaseUsing MongoDB as a high performance graph database
Using MongoDB as a high performance graph databaseChris Clarke
 
Web open standards for linked data and knowledge graphs as enablers of EU dig...
Web open standards for linked data and knowledge graphs as enablers of EU dig...Web open standards for linked data and knowledge graphs as enablers of EU dig...
Web open standards for linked data and knowledge graphs as enablers of EU dig...Fabien Gandon
 
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...Lucidworks
 
Working with deeply nested documents in Apache Solr
Working with deeply nested documents in Apache SolrWorking with deeply nested documents in Apache Solr
Working with deeply nested documents in Apache SolrAnshum Gupta
 
Introduction to Linked Data
Introduction to Linked DataIntroduction to Linked Data
Introduction to Linked DataThomas Meehan
 
RDF Tutorial - SPARQL 20091031
RDF Tutorial - SPARQL 20091031RDF Tutorial - SPARQL 20091031
RDF Tutorial - SPARQL 20091031kwangsub kim
 
FedX - Optimization Techniques for Federated Query Processing on Linked Data
FedX - Optimization Techniques for Federated Query Processing on Linked DataFedX - Optimization Techniques for Federated Query Processing on Linked Data
FedX - Optimization Techniques for Federated Query Processing on Linked Dataaschwarte
 
The world is y0ur$: Geolocation-based wordlist generation with wordsmith
The world is y0ur$: Geolocation-based wordlist generation with wordsmithThe world is y0ur$: Geolocation-based wordlist generation with wordsmith
The world is y0ur$: Geolocation-based wordlist generation with wordsmithSanjiv Kawa
 

What's hot (20)

Working with deeply nested documents in Apache Solr
Working with deeply nested documents in Apache SolrWorking with deeply nested documents in Apache Solr
Working with deeply nested documents in Apache Solr
 
OpenML DALI
OpenML DALIOpenML DALI
OpenML DALI
 
Introduction to Graph Databases
Introduction to Graph DatabasesIntroduction to Graph Databases
Introduction to Graph Databases
 
Consuming Linked Data SemTech2010
Consuming Linked Data SemTech2010Consuming Linked Data SemTech2010
Consuming Linked Data SemTech2010
 
Introduction to Linked Data
Introduction to Linked DataIntroduction to Linked Data
Introduction to Linked Data
 
Semantic Technologies: Representing Semantic Data
Semantic Technologies: Representing Semantic DataSemantic Technologies: Representing Semantic Data
Semantic Technologies: Representing Semantic Data
 
Search Engines After The Semanatic Web
Search Engines After The Semanatic WebSearch Engines After The Semanatic Web
Search Engines After The Semanatic Web
 
A Semantic Data Model for Web Applications
A Semantic Data Model for Web ApplicationsA Semantic Data Model for Web Applications
A Semantic Data Model for Web Applications
 
Sem webmaubeuge
Sem webmaubeugeSem webmaubeuge
Sem webmaubeuge
 
2011 and still bruteforcing - OWASP Spain
2011 and still bruteforcing - OWASP Spain2011 and still bruteforcing - OWASP Spain
2011 and still bruteforcing - OWASP Spain
 
Using MongoDB as a high performance graph database
Using MongoDB as a high performance graph databaseUsing MongoDB as a high performance graph database
Using MongoDB as a high performance graph database
 
Name That Graph !
Name That Graph !Name That Graph !
Name That Graph !
 
Web open standards for linked data and knowledge graphs as enablers of EU dig...
Web open standards for linked data and knowledge graphs as enablers of EU dig...Web open standards for linked data and knowledge graphs as enablers of EU dig...
Web open standards for linked data and knowledge graphs as enablers of EU dig...
 
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...
 
Working with deeply nested documents in Apache Solr
Working with deeply nested documents in Apache SolrWorking with deeply nested documents in Apache Solr
Working with deeply nested documents in Apache Solr
 
Introduction to Linked Data
Introduction to Linked DataIntroduction to Linked Data
Introduction to Linked Data
 
RDF Tutorial - SPARQL 20091031
RDF Tutorial - SPARQL 20091031RDF Tutorial - SPARQL 20091031
RDF Tutorial - SPARQL 20091031
 
FedX - Optimization Techniques for Federated Query Processing on Linked Data
FedX - Optimization Techniques for Federated Query Processing on Linked DataFedX - Optimization Techniques for Federated Query Processing on Linked Data
FedX - Optimization Techniques for Federated Query Processing on Linked Data
 
Semantic Web Good News
Semantic Web Good NewsSemantic Web Good News
Semantic Web Good News
 
The world is y0ur$: Geolocation-based wordlist generation with wordsmith
The world is y0ur$: Geolocation-based wordlist generation with wordsmithThe world is y0ur$: Geolocation-based wordlist generation with wordsmith
The world is y0ur$: Geolocation-based wordlist generation with wordsmith
 

Viewers also liked

Music Video Redundancy and Half-Life in YouTube
Music Video Redundancy and Half-Life in YouTubeMusic Video Redundancy and Half-Life in YouTube
Music Video Redundancy and Half-Life in YouTubeMichael Nelson
 
A Research Agenda for "Obsolete Data or Resources"
A Research Agenda for "Obsolete Data or Resources"A Research Agenda for "Obsolete Data or Resources"
A Research Agenda for "Obsolete Data or Resources"Michael Nelson
 
Memento: Time Travel for the Web
Memento: Time Travel for the WebMemento: Time Travel for the Web
Memento: Time Travel for the WebMichael Nelson
 
Memento: Time Travel for the Web
Memento: Time Travel for the WebMemento: Time Travel for the Web
Memento: Time Travel for the WebMichael Nelson
 
Synchronicity: Just-In-Time Discovery of Lost Web Pages
Synchronicity: Just-In-Time Discovery of Lost Web PagesSynchronicity: Just-In-Time Discovery of Lost Web Pages
Synchronicity: Just-In-Time Discovery of Lost Web PagesMichael Nelson
 
My Point of View: Michael L. Nelson Web Archiving Cooperative
My Point of View: Michael L. Nelson  Web Archiving CooperativeMy Point of View: Michael L. Nelson  Web Archiving Cooperative
My Point of View: Michael L. Nelson Web Archiving CooperativeMichael Nelson
 
Using timed-release cryptography to mitigate the preservation risk of embargo...
Using timed-release cryptography to mitigate the preservation risk of embargo...Using timed-release cryptography to mitigate the preservation risk of embargo...
Using timed-release cryptography to mitigate the preservation risk of embargo...Michael Nelson
 
Can’t Find Your 404s?
Can’t Find Your 404s?Can’t Find Your 404s?
Can’t Find Your 404s?Michael Nelson
 
Memento: Time Travel for the Web
Memento: Time Travel for the WebMemento: Time Travel for the Web
Memento: Time Travel for the WebMichael Nelson
 
Memento: TimeGates, TimeBundles, and TimeMaps
Memento: TimeGates, TimeBundles, and TimeMapsMemento: TimeGates, TimeBundles, and TimeMaps
Memento: TimeGates, TimeBundles, and TimeMapsMichael Nelson
 
Tools for A Preservation Ready Web
Tools for A Preservation Ready WebTools for A Preservation Ready Web
Tools for A Preservation Ready WebMichael Nelson
 
Review of Web Archiving
Review of Web ArchivingReview of Web Archiving
Review of Web ArchivingMichael Nelson
 
The Open Archives Initiative
The Open Archives InitiativeThe Open Archives Initiative
The Open Archives InitiativeMichael Nelson
 
(Re-)Discovering Lost Web Pages
(Re-)Discovering Lost Web Pages(Re-)Discovering Lost Web Pages
(Re-)Discovering Lost Web PagesMichael Nelson
 
Why Care About the Past?
Why Care About the Past?Why Care About the Past?
Why Care About the Past?Michael Nelson
 
OAI-ORE: The Open Archives Initiative Object Reuse and Exchange Project
OAI-ORE:  The Open Archives Initiative  Object Reuse and Exchange ProjectOAI-ORE:  The Open Archives Initiative  Object Reuse and Exchange Project
OAI-ORE: The Open Archives Initiative Object Reuse and Exchange ProjectMichael Nelson
 

Viewers also liked (16)

Music Video Redundancy and Half-Life in YouTube
Music Video Redundancy and Half-Life in YouTubeMusic Video Redundancy and Half-Life in YouTube
Music Video Redundancy and Half-Life in YouTube
 
A Research Agenda for "Obsolete Data or Resources"
A Research Agenda for "Obsolete Data or Resources"A Research Agenda for "Obsolete Data or Resources"
A Research Agenda for "Obsolete Data or Resources"
 
Memento: Time Travel for the Web
Memento: Time Travel for the WebMemento: Time Travel for the Web
Memento: Time Travel for the Web
 
Memento: Time Travel for the Web
Memento: Time Travel for the WebMemento: Time Travel for the Web
Memento: Time Travel for the Web
 
Synchronicity: Just-In-Time Discovery of Lost Web Pages
Synchronicity: Just-In-Time Discovery of Lost Web PagesSynchronicity: Just-In-Time Discovery of Lost Web Pages
Synchronicity: Just-In-Time Discovery of Lost Web Pages
 
My Point of View: Michael L. Nelson Web Archiving Cooperative
My Point of View: Michael L. Nelson  Web Archiving CooperativeMy Point of View: Michael L. Nelson  Web Archiving Cooperative
My Point of View: Michael L. Nelson Web Archiving Cooperative
 
Using timed-release cryptography to mitigate the preservation risk of embargo...
Using timed-release cryptography to mitigate the preservation risk of embargo...Using timed-release cryptography to mitigate the preservation risk of embargo...
Using timed-release cryptography to mitigate the preservation risk of embargo...
 
Can’t Find Your 404s?
Can’t Find Your 404s?Can’t Find Your 404s?
Can’t Find Your 404s?
 
Memento: Time Travel for the Web
Memento: Time Travel for the WebMemento: Time Travel for the Web
Memento: Time Travel for the Web
 
Memento: TimeGates, TimeBundles, and TimeMaps
Memento: TimeGates, TimeBundles, and TimeMapsMemento: TimeGates, TimeBundles, and TimeMaps
Memento: TimeGates, TimeBundles, and TimeMaps
 
Tools for A Preservation Ready Web
Tools for A Preservation Ready WebTools for A Preservation Ready Web
Tools for A Preservation Ready Web
 
Review of Web Archiving
Review of Web ArchivingReview of Web Archiving
Review of Web Archiving
 
The Open Archives Initiative
The Open Archives InitiativeThe Open Archives Initiative
The Open Archives Initiative
 
(Re-)Discovering Lost Web Pages
(Re-)Discovering Lost Web Pages(Re-)Discovering Lost Web Pages
(Re-)Discovering Lost Web Pages
 
Why Care About the Past?
Why Care About the Past?Why Care About the Past?
Why Care About the Past?
 
OAI-ORE: The Open Archives Initiative Object Reuse and Exchange Project
OAI-ORE:  The Open Archives Initiative  Object Reuse and Exchange ProjectOAI-ORE:  The Open Archives Initiative  Object Reuse and Exchange Project
OAI-ORE: The Open Archives Initiative Object Reuse and Exchange Project
 

Similar to Discovering Lost Web Pages Through Digital Preservation and Information Retrieval Techniques

Finding knowledge, data and answers on the Semantic Web
Finding knowledge, data and answers on the Semantic WebFinding knowledge, data and answers on the Semantic Web
Finding knowledge, data and answers on the Semantic Webebiquity
 
Aqua Browser Implementation at Oklahoma State University
Aqua Browser Implementation at Oklahoma State UniversityAqua Browser Implementation at Oklahoma State University
Aqua Browser Implementation at Oklahoma State Universityyouthelectronix
 
Information Extraction and Linked Data Cloud
Information Extraction and Linked Data CloudInformation Extraction and Linked Data Cloud
Information Extraction and Linked Data CloudDhaval Thakker
 
YQL:: Select * from Internet
YQL:: Select * from InternetYQL:: Select * from Internet
YQL:: Select * from Internetdrgath
 
YQL: Select * from Internet
YQL: Select * from InternetYQL: Select * from Internet
YQL: Select * from Internetdrgath
 
Architecture Patterns for Semantic Web Applications
Architecture Patterns for Semantic Web ApplicationsArchitecture Patterns for Semantic Web Applications
Architecture Patterns for Semantic Web Applicationsbpanulla
 
Stream Reasoning : Where We Got So Far
Stream Reasoning: Where We Got So FarStream Reasoning: Where We Got So Far
Stream Reasoning : Where We Got So FarEmanuele Della Valle
 
Linked Data Overview - AGI Technical SIG
Linked Data Overview - AGI Technical SIGLinked Data Overview - AGI Technical SIG
Linked Data Overview - AGI Technical SIGChris Ewing
 
Semantic Web
Semantic WebSemantic Web
Semantic Webhardchiu
 
Linked Data and Locah, UKSG2011
Linked Data and Locah, UKSG2011 Linked Data and Locah, UKSG2011
Linked Data and Locah, UKSG2011 Jane Stevenson
 
Integrating a Domain Ontology Development Environment and an Ontology Search ...
Integrating a Domain Ontology Development Environment and an Ontology Search ...Integrating a Domain Ontology Development Environment and an Ontology Search ...
Integrating a Domain Ontology Development Environment and an Ontology Search ...Takeshi Morita
 
History and Background of the USEWOD Data Challenge
History and Background of the  USEWOD Data ChallengeHistory and Background of the  USEWOD Data Challenge
History and Background of the USEWOD Data ChallengeKnud Möller
 
Exploiter le Web Semantic, le comprendre et y contribuer
Exploiter le Web Semantic, le comprendre et y contribuerExploiter le Web Semantic, le comprendre et y contribuer
Exploiter le Web Semantic, le comprendre et y contribuerMathieu d'Aquin
 
WWW09 - Triplify Light-Weight Linked Data Publication from Relational Databases
WWW09 - Triplify Light-Weight Linked Data Publication from Relational DatabasesWWW09 - Triplify Light-Weight Linked Data Publication from Relational Databases
WWW09 - Triplify Light-Weight Linked Data Publication from Relational DatabasesSören Auer
 
Implementing the Open Government Directive using the technologies of the Soci...
Implementing the Open Government Directive using the technologies of the Soci...Implementing the Open Government Directive using the technologies of the Soci...
Implementing the Open Government Directive using the technologies of the Soci...George Thomas
 
Evolutionary & Swarm Computing for the Semantic Web
Evolutionary & Swarm Computing for the Semantic WebEvolutionary & Swarm Computing for the Semantic Web
Evolutionary & Swarm Computing for the Semantic WebAnkit Solanki
 
WTF is Semantic Web?
WTF is Semantic Web?WTF is Semantic Web?
WTF is Semantic Web?milesw
 
Introduction to Semantic Web for GIS Practitioners
Introduction to Semantic Web for GIS PractitionersIntroduction to Semantic Web for GIS Practitioners
Introduction to Semantic Web for GIS PractitionersEmanuele Della Valle
 
State of the Semantic Web
State of the Semantic WebState of the Semantic Web
State of the Semantic WebIvan Herman
 
Adding Meaning To Your Data
Adding Meaning To Your DataAdding Meaning To Your Data
Adding Meaning To Your DataDuncan Hull
 

Similar to Discovering Lost Web Pages Through Digital Preservation and Information Retrieval Techniques (20)

Finding knowledge, data and answers on the Semantic Web
Finding knowledge, data and answers on the Semantic WebFinding knowledge, data and answers on the Semantic Web
Finding knowledge, data and answers on the Semantic Web
 
Aqua Browser Implementation at Oklahoma State University
Aqua Browser Implementation at Oklahoma State UniversityAqua Browser Implementation at Oklahoma State University
Aqua Browser Implementation at Oklahoma State University
 
Information Extraction and Linked Data Cloud
Information Extraction and Linked Data CloudInformation Extraction and Linked Data Cloud
Information Extraction and Linked Data Cloud
 
YQL:: Select * from Internet
YQL:: Select * from InternetYQL:: Select * from Internet
YQL:: Select * from Internet
 
YQL: Select * from Internet
YQL: Select * from InternetYQL: Select * from Internet
YQL: Select * from Internet
 
Architecture Patterns for Semantic Web Applications
Architecture Patterns for Semantic Web ApplicationsArchitecture Patterns for Semantic Web Applications
Architecture Patterns for Semantic Web Applications
 
Stream Reasoning : Where We Got So Far
Stream Reasoning: Where We Got So FarStream Reasoning: Where We Got So Far
Stream Reasoning : Where We Got So Far
 
Linked Data Overview - AGI Technical SIG
Linked Data Overview - AGI Technical SIGLinked Data Overview - AGI Technical SIG
Linked Data Overview - AGI Technical SIG
 
Semantic Web
Semantic WebSemantic Web
Semantic Web
 
Linked Data and Locah, UKSG2011
Linked Data and Locah, UKSG2011 Linked Data and Locah, UKSG2011
Linked Data and Locah, UKSG2011
 
Integrating a Domain Ontology Development Environment and an Ontology Search ...
Integrating a Domain Ontology Development Environment and an Ontology Search ...Integrating a Domain Ontology Development Environment and an Ontology Search ...
Integrating a Domain Ontology Development Environment and an Ontology Search ...
 
History and Background of the USEWOD Data Challenge
History and Background of the  USEWOD Data ChallengeHistory and Background of the  USEWOD Data Challenge
History and Background of the USEWOD Data Challenge
 
Exploiter le Web Semantic, le comprendre et y contribuer
Exploiter le Web Semantic, le comprendre et y contribuerExploiter le Web Semantic, le comprendre et y contribuer
Exploiter le Web Semantic, le comprendre et y contribuer
 
WWW09 - Triplify Light-Weight Linked Data Publication from Relational Databases
WWW09 - Triplify Light-Weight Linked Data Publication from Relational DatabasesWWW09 - Triplify Light-Weight Linked Data Publication from Relational Databases
WWW09 - Triplify Light-Weight Linked Data Publication from Relational Databases
 
Implementing the Open Government Directive using the technologies of the Soci...
Implementing the Open Government Directive using the technologies of the Soci...Implementing the Open Government Directive using the technologies of the Soci...
Implementing the Open Government Directive using the technologies of the Soci...
 
Evolutionary & Swarm Computing for the Semantic Web
Evolutionary & Swarm Computing for the Semantic WebEvolutionary & Swarm Computing for the Semantic Web
Evolutionary & Swarm Computing for the Semantic Web
 
WTF is Semantic Web?
WTF is Semantic Web?WTF is Semantic Web?
WTF is Semantic Web?
 
Introduction to Semantic Web for GIS Practitioners
Introduction to Semantic Web for GIS PractitionersIntroduction to Semantic Web for GIS Practitioners
Introduction to Semantic Web for GIS Practitioners
 
State of the Semantic Web
State of the Semantic WebState of the Semantic Web
State of the Semantic Web
 
Adding Meaning To Your Data
Adding Meaning To Your DataAdding Meaning To Your Data
Adding Meaning To Your Data
 

More from Michael Nelson

Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035Michael Nelson
 
Uncertainty in replaying archived Twitter pages
Uncertainty in replaying archived Twitter pagesUncertainty in replaying archived Twitter pages
Uncertainty in replaying archived Twitter pagesMichael Nelson
 
Web Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsWeb Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsMichael Nelson
 
Web Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsWeb Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsMichael Nelson
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesMichael Nelson
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesMichael Nelson
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Michael Nelson
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Michael Nelson
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Michael Nelson
 
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...Michael Nelson
 
Summarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniquesSummarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniquesMichael Nelson
 
The Memento Protocol and Research Issues With Web Archiving
The Memento Protocol and Research Issues With Web ArchivingThe Memento Protocol and Research Issues With Web Archiving
The Memento Protocol and Research Issues With Web ArchivingMichael Nelson
 
We Need Multiple, Independent Web Archives
We Need Multiple, Independent Web ArchivesWe Need Multiple, Independent Web Archives
We Need Multiple, Independent Web ArchivesMichael Nelson
 
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptCombining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptMichael Nelson
 
Storytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web ArchivesStorytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web ArchivesMichael Nelson
 
Why We Need Multiple Archives
Why We Need Multiple ArchivesWhy We Need Multiple Archives
Why We Need Multiple ArchivesMichael Nelson
 
Combining Storytelling and Web Archives
Combining Storytelling and Web ArchivesCombining Storytelling and Web Archives
Combining Storytelling and Web ArchivesMichael Nelson
 
@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015Michael Nelson
 
Evaluating the Temporal Coherence of Archived Pages
Evaluating the Temporal Coherence of Archived PagesEvaluating the Temporal Coherence of Archived Pages
Evaluating the Temporal Coherence of Archived PagesMichael Nelson
 
When Should I Make Preservation Copies of Myself?
When Should I Make Preservation Copies of Myself?�When Should I Make Preservation Copies of Myself?�
When Should I Make Preservation Copies of Myself?Michael Nelson
 

More from Michael Nelson (20)

Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035
 
Uncertainty in replaying archived Twitter pages
Uncertainty in replaying archived Twitter pagesUncertainty in replaying archived Twitter pages
Uncertainty in replaying archived Twitter pages
 
Web Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsWeb Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed Originals
 
Web Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsWeb Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed Originals
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
 
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
 
Summarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniquesSummarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniques
 
The Memento Protocol and Research Issues With Web Archiving
The Memento Protocol and Research Issues With Web ArchivingThe Memento Protocol and Research Issues With Web Archiving
The Memento Protocol and Research Issues With Web Archiving
 
We Need Multiple, Independent Web Archives
We Need Multiple, Independent Web ArchivesWe Need Multiple, Independent Web Archives
We Need Multiple, Independent Web Archives
 
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptCombining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
 
Storytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web ArchivesStorytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web Archives
 
Why We Need Multiple Archives
Why We Need Multiple ArchivesWhy We Need Multiple Archives
Why We Need Multiple Archives
 
Combining Storytelling and Web Archives
Combining Storytelling and Web ArchivesCombining Storytelling and Web Archives
Combining Storytelling and Web Archives
 
@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015
 
Evaluating the Temporal Coherence of Archived Pages
Evaluating the Temporal Coherence of Archived PagesEvaluating the Temporal Coherence of Archived Pages
Evaluating the Temporal Coherence of Archived Pages
 
When Should I Make Preservation Copies of Myself?
When Should I Make Preservation Copies of Myself?�When Should I Make Preservation Copies of Myself?�
When Should I Make Preservation Copies of Myself?
 

Recently uploaded

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 

Recently uploaded (20)

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 

Discovering Lost Web Pages Through Digital Preservation and Information Retrieval Techniques

  • 1. (Re-) Discovering Lost Web Pages Mathematics & Computer Science Seminar Emory University October 2, 2009 Martin Klein & Michael L. Nelson Department of Computer Science Old Dominion University Norfolk VA www.cs.odu.edu/~{mklein,mln}
  • 2.
  • 3. The Actors Put a human -- lots of humans -- in the loop for preservation purposes
  • 4.
  • 8. URI Content Mapping Problem 1 same URI maps to same or very similar content at a later time 2 same URI maps to different content at a later time 3 different URI maps to same or very similar content at the same or at a later time 4 the content can not be found at any URI U1 C1 U1 C1 time A B U1 C2 U1 C1 time A B U2 C1 U1 C1 U1 404 time A B U1 ?? U1 C1 time A B
  • 9. JCDL 2008 http://www.jcdl2008.org/ July 2008 http://www.jcdl2008.org/ Today Scenario 1: Same URI, Same Content
  • 10. Hypertext 2006 http://www.ht06.org/ August 2006 http://www.ht06.org/ Today Scenario 2: Same URI, Different Content
  • 11. PSP 2003 http://www.pspcentral.org/events/annual_meeting_2003.html August 2003 http://www.pspcentral.org/events/ archive /annual_meeting_2003.html Today Scenario 3a: Same Content, Different URI
  • 12. ECDL 1999 http://www-rocq.inria.fr/EuroDL99/ October 1999 http://www.informatik.uni-trier.de/~ley/db/conf/ercimdl/ercimdl99.html Today Scenario 3b: Similar Content, Different URI
  • 13. Greynet 1999 http://www.konbib.nl/infolev/greynet/2.5.htm 1999 Today ? ? Scenario 4: Content Not Findable At Any URI
  • 14. Otto : You eat a lot of acid, Miller, back in the hippie days? Miller : A lot o' people don't realize what's really going on. They view life as a bunch o' unconnected incidents 'n things. They don't realize that there's this, like, lattice o' coincidence that lays on top o' everything. Give you an example; show you what I mean: suppose you're thinkin' about a plate o' shrimp. Suddenly someone'll say, like, plate, or shrimp, or plate o' shrimp out of the blue, no explanation. No point in lookin' for one, either. It's all part of a cosmic unconsciousness.
  • 15.
  • 16.
  • 17.
  • 18.  
  • 19. What is a Signature? (aka “message digest”, examples include “md5” and “sha-1”) image from Eddie Kohler http://www.cs.ucla.edu/~kohler/
  • 20.
  • 21.
  • 22. Lexical Signatures -- Examples A “Googlewhack” ( http://en.wikipedia.org/wiki/Googlewhack ) can be thought of as a two-term LS that produces a 1/1 ranking. libraries jcdl digital conference pst http://www. google .com/search? q=libraries + jcdl +digital+conference+ pst http://www.jcdl2008.org 1/51 (2/77 in 01/2008) nsdl multiagency imls testbeds extramural http://www. google .com/search? q=nsdl + multiagency + imls + testbeds +extramural http://www.dli2.nsf.gov 0/10 Rank/Results URL LS 1/1 http://www.cs.berkeley.edu/˜wilensky/NLP.html texttiling wilensky disambiguation subtopic iago http://www. google .com/search? q=texttiling + wilensky +disambiguation+subtopic+ iago 1/221,000 (1/174,000 in 01/2008) http://www.loc.gov library collections congress thomas american http://www. google .com/search? q=library +collections+congress+ thomas + american
  • 23.
  • 24.
  • 25.
  • 26.
  • 27. URL: http://www.perfect10wines.com Year: 2007 Union: 12 unique terms For LS purposes, it doesn’t matter much…
  • 28. Comparing LSs Top 5, 10 and 15 terms LC – local universe SC – screen scraping NG – N-Grams ~4 of 5 LS terms are the same
  • 29.
  • 30. TC Ranks vs DF Ranks Within ukWaC
  • 31. Rank Correlation Within ukWaC semi-log scale
  • 32. TC Frequencies in ukWaC and N-Grams N-Grams have a threshold of 200
  • 33. LS Evolution Over Time Copies of web pages from the IA (1996-2007) 300 Random URLs, winnowed to 98, 10493 observations over 12 years
  • 34. Evolution Over Time -- Example 10-term LSs generated for http://www.perfect10wines.com
  • 35.
  • 36.
  • 37.
  • 38.
  • 39.
  • 40.
  • 41. Titles (TI), 5- & 7-term Lexical Signatures (LS5, LS7), Tags (TA) 500 random URLs from dmoz.org winnowed to 309 (only 47 of 309 had tags in delicious.com). Due to query restrictions, link neighborhood only run on Yahoo -- results were similar to tags.
  • 42.
  • 43.
  • 44.
  • 45.
  • 46. http://www.sun.com/solutions mean Levenshtein score sliding: 0.84 rooted: 0.29 1998-01-27 Sun Software Products Selector Guides -Solutions Tree 1999-02-20 Sun Software Solutions 2002-02-01 Sun Microsystems Products 2002-06-01 Sun Microsystems - Business & Industry Solutions 2003-08-01 Sun Microsystems - Industry & Infrastructure Solutions 2004-02-02 Sun Microsystems - Solutions 2004-06-10 Gateway Page - Sun Solutions 2006-01-09 Sun Microsystems Solutions & Services 2007-01-03 Services & Solutions 2007-02-07 Sun Services & Solutions 2008-01-19 Sun Solutions http://www.datacity.com/mainf.html mean Levenshtein score sliding: 0.68 rooted: 0.15 2000-06-19 DataCity of Manassas Park Main Page 2000-10-12 DataCity of Manassas Park sells Custom Built Computers & Removable Hard Drives 2001-08-21 DataCity a computer company in Manassas Park sells Custom Built Computers & Removable Hard Drives 2002-10-16 computer company in Manassas Virginia sells Custom Built Computers with Removable Hard Drives Kits and Iomega 2GB Jaz Drives (jazz drives) October 2002 DataCity 800-326-5051 toll free 2006-03-14 Est 1989 Computer company in Stafford Virginia sells Custom Built Secure Computers with DoD 5200.1-R Approved Removable Hard Drives, Hard Drive Kits and Iomega 2GB Jaz Drives (jazz drives), introduces the IllumiNite&reg; lighted keyboard DataCity 800-326-5051 Service Disabled Veteran Owned Business SDVOB
  • 47. Content Change vs. Title Change
  • 48.