SlideShare una empresa de Scribd logo
1 de 41
Descargar para leer sin conexión
Synchronicity: Just-In-Time  Discovery of Lost Web Pages NDIIPP Partners Meeting June 25, 2009 Martin Klein & Michael L. Nelson Department of Computer Science Old Dominion University Norfolk VA  www.cs.odu.edu/~{mklein,mln}
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],The Problem
The Actors Put a human -- lots of humans -- in the loop  for preservation purposes
[object Object],[object Object],[object Object],[object Object],The Environment
ftp://techreports.larc.nasa.gov/pub/techreports/larc/93/tm109025.ps.Z http://techreports.larc.nasa.gov/ltrs/PDF/tm109025.pdf
Web Infrastructure: Refreshing & Migrating
Lapsed Website
URI Content Mapping Problem 1 same  URI maps to  same  or very similar content at a later time 2 same  URI maps to  different  content at a later time 3 different  URI maps to  same  or very similar content at the same or at a later time 4 the content can  not be found at any URI U1 C1 U1 C1 time A B U1 C2 U1 C1 time A B U2 C1 U1 C1 U1 404 time A B U1 ?? U1 C1 time A B
JCDL 2008 http://www.jcdl2008.org/ July 2008 http://www.jcdl2008.org/ Today Scenario 1: Same URI, Same Content
Hypertext 2006 http://www.ht06.org/ August 2006 http://www.ht06.org/ Today Scenario 2: Same URI, Different Content
PSP 2003 http://www.pspcentral.org/events/annual_meeting_2003.html August 2003 http://www.pspcentral.org/events/archive/annual_meeting_2003.html Today Scenario 3a: Same Content, Different URI
ECDL 1999 http://www-rocq.inria.fr/EuroDL99/ October 1999 http://www.informatik.uni-trier.de/~ley/db/conf/ercimdl/ercimdl99.html Today Scenario 3b: Similar Content, Different URI
Greynet 1999 http://www.konbib.nl/infolev/greynet/2.5.htm 1999 Today ? ? Scenario 4: Content Not Findable At Any URI
Otto : You eat a lot of acid, Miller, back in the hippie days?  Miller : A lot o' people don't realize what's really going on.  They view life as a bunch o' unconnected incidents 'n things.  They don't realize that there's this, like, lattice o' coincidence  that lays on top o' everything. Give you an example;  show you what I mean: suppose you're thinkin' about a  plate o' shrimp. Suddenly someone'll say, like,  plate, or  shrimp, or plate o' shrimp  out of the blue, no explanation.  No point in lookin' for one, either. It's all part of a cosmic  unconsciousness.
[object Object],[object Object],[object Object],[object Object],picture from http://www.crystalinks.com/jung.html Synchronicity
The Bigger Picture Synchronicity Architecture ,[object Object],[object Object],[object Object],[object Object]
What Was That Web Page About? ,[object Object],[object Object],[object Object],[object Object],? A B C
 
What is a Signature? (aka “message digest”, examples include “md5” and “sha-1”) image from Eddie Kohler http://www.cs.ucla.edu/~kohler/
What is a Lexical Signature? ,[object Object],[object Object],[object Object],[object Object],“ Removal Policies in Network Caches for World-Wide Web Documents” Query Google Resource Abstract REMOVAL HIT RATE PROXY CACHE LS
[object Object],[object Object],LSs as Proposed by Phelps and Wilensky ,[object Object],[object Object],[object Object],[object Object],http://www.cs.berkeley.edu/~wilensky/NLP.html becomes: http://www.cs.berkeley.edu/~wilensky/NLP.html? lexical-signature=texttiling+wilensky+disambiguation+subtopic+iago
Lexical Signatures -- Examples A “Googlewhack” ( http://en.wikipedia.org/wiki/Googlewhack ) can be thought  of as a two-term LS that produces a 1/1 ranking. libraries jcdl digital conference pst http://www.google.com/search?q=libraries+jcdl+digital+conference+pst http://www.jcdl2008.org 1/51 (2/77 in 01/2008) nsdl multiagency imls testbeds extramural http://www. google .com/search? q=nsdl + multiagency + imls + testbeds +extramural http://www.dli2.nsf.gov 0/10 Rank/Results URL LS 1/1 http://www.cs.berkeley.edu/˜wilensky/NLP.html texttiling wilensky disambiguation subtopic iago http://www. google .com/search? q=texttiling + wilensky +disambiguation+subtopic+ iago 1/221,000 (1/174,000 in 01/2008) http://www.loc.gov library collections congress thomas american http://www. google .com/search? q=library +collections+congress+ thomas + american
Generating LSs ,[object Object],[object Object],[object Object],[object Object]
Generating LSs ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Theoretical Underpinnings of Synchronicity ,[object Object],[object Object],[object Object]
LS Evolution Over Time Copies of web pages from the IA (1996-2007) 300 Random URLs, winnowed to 98, 10493 observations over 12 years
Evolution Over Time -- Example 10-term LSs generated for http://www.perfect10wines.com
Two Methods for Measuring Evolution ,[object Object],[object Object],[object Object],Rooted Sliding
Evolution Over Time - Rooted ,[object Object],[object Object],[object Object]
Evolution Over Time - Sliding ,[object Object],[object Object]
Performance of LSs ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Performance – Number of Terms ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Titles (TI), 5- & 7-term Lexical Signatures (LS5, LS7), Tags (TA) 500 random URLs from dmoz.org winnowed to 309 (only 47 of 309 had tags in delicious.com). Due to query restrictions, link neighborhood only run on Yahoo -- results were similar to tags.
Frequency of Change ordered in increasing order by: 1) observations 2) changes Number of Title Changes and  Observations in the IA ,[object Object],[object Object],[object Object],6000 random URLs from dmoz.org, winnowed to 1090 URLs and 100k+ observations
Times of Change Mean Time Delta Between Changes Time Span Between First and  Last Observation in the IA ordered in increasing order by: 1) observations 2) changes ,[object Object],[object Object],[object Object]
Degree of Change Mean Levenshtein Scores of all Titles - Sliding ,[object Object],[object Object],[object Object]
Degree of Change Mean Levenshtein Scores of all Titles - Rooted ,[object Object],[object Object],[object Object]
http://www.sun.com/solutions   mean Levenshtein score  sliding: 0.84 rooted: 0.29 1998-01-27 Sun Software Products Selector Guides -Solutions Tree 1999-02-20 Sun Software Solutions 2002-02-01 Sun Microsystems Products 2002-06-01 Sun Microsystems - Business & Industry Solutions 2003-08-01 Sun Microsystems - Industry & Infrastructure Solutions 2004-02-02 Sun Microsystems - Solutions 2004-06-10 Gateway Page - Sun Solutions 2006-01-09 Sun Microsystems Solutions & Services 2007-01-03 Services & Solutions  2007-02-07 Sun Services & Solutions 2008-01-19 Sun Solutions http://www.datacity.com/mainf.html   mean Levenshtein score  sliding: 0.68 rooted: 0.15 2000-06-19 DataCity of Manassas Park Main Page 2000-10-12  DataCity of Manassas Park sells Custom Built Computers & Removable Hard Drives 2001-08-21  DataCity a computer company in Manassas Park sells Custom Built Computers & Removable Hard Drives  2002-10-16 computer company in Manassas Virginia sells Custom Built Computers with Removable Hard Drives Kits and Iomega 2GB Jaz Drives (jazz drives) October 2002 DataCity 800-326-5051 toll free 2006-03-14  Est 1989 Computer company in Stafford Virginia sells Custom Built Secure Computers with DoD 5200.1-R Approved Removable Hard Drives, Hard Drive Kits and Iomega 2GB Jaz Drives (jazz drives), introduces the IllumiNite® lighted keyboard DataCity 800-326-5051 Service Disabled Veteran Owned Business SDVOB
Conclusions & Future Work ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Necronomicon
How much of the Web is indexed? Estimates from “The Indexable Web is More than 11.5 billion pages” by Gulli and Signorini (WWW’05)

Más contenido relacionado

La actualidad más candente

Querying datasets on the Web with high availability
Querying datasets on the Web with high availabilityQuerying datasets on the Web with high availability
Querying datasets on the Web with high availabilityRuben Verborgh
 
Live DBpedia querying with high availability
Live DBpedia querying with high availabilityLive DBpedia querying with high availability
Live DBpedia querying with high availabilityRuben Verborgh
 
Initial Usage Analysis of DBpedia's Triple Pattern Fragments
Initial Usage Analysis of DBpedia's Triple Pattern FragmentsInitial Usage Analysis of DBpedia's Triple Pattern Fragments
Initial Usage Analysis of DBpedia's Triple Pattern FragmentsRuben Verborgh
 
Querying data on the Web – client or server?
Querying data on the Web – client or server?Querying data on the Web – client or server?
Querying data on the Web – client or server?Ruben Verborgh
 
Building a data processing pipeline in Python
Building a data processing pipeline in PythonBuilding a data processing pipeline in Python
Building a data processing pipeline in PythonJoe Cabrera
 
Linking media, data, and services
Linking media, data, and servicesLinking media, data, and services
Linking media, data, and servicesRuben Verborgh
 
On the Persistence of Persistent Identifiers of the Scholarly Web
On the Persistence of Persistent Identifiers of the Scholarly WebOn the Persistence of Persistent Identifiers of the Scholarly Web
On the Persistence of Persistent Identifiers of the Scholarly WebMartin Klein
 
Introduction to Linked Data Platform (LDP)
Introduction to Linked Data Platform (LDP)Introduction to Linked Data Platform (LDP)
Introduction to Linked Data Platform (LDP)Hector Correa
 
Learning W3C Linked Data Platform with examples
Learning W3C Linked Data Platform with examplesLearning W3C Linked Data Platform with examples
Learning W3C Linked Data Platform with examplesNandana Mihindukulasooriya
 
Consuming Linked Data 4/5 Semtech2011
Consuming Linked Data 4/5 Semtech2011Consuming Linked Data 4/5 Semtech2011
Consuming Linked Data 4/5 Semtech2011Juan Sequeda
 
Yahoo Making The Web Searchable
Yahoo  Making The  Web  SearchableYahoo  Making The  Web  Searchable
Yahoo Making The Web Searchablekksst
 
Authentication, Authorization & Error Handling with GraphQL
Authentication, Authorization & Error Handling with GraphQLAuthentication, Authorization & Error Handling with GraphQL
Authentication, Authorization & Error Handling with GraphQLNikolas Burk
 
The Serverless GraphQL Backend Architecture
The Serverless GraphQL Backend ArchitectureThe Serverless GraphQL Backend Architecture
The Serverless GraphQL Backend ArchitectureNikolas Burk
 
Epiphany: Adaptable RDFa Generation Linking the Web of Documents to the Web o...
Epiphany: Adaptable RDFa Generation Linking the Web of Documents to the Web o...Epiphany: Adaptable RDFa Generation Linking the Web of Documents to the Web o...
Epiphany: Adaptable RDFa Generation Linking the Web of Documents to the Web o...Benjamin Adrian
 
Creating 3rd Generation Web APIs with Hydra
Creating 3rd Generation Web APIs with HydraCreating 3rd Generation Web APIs with Hydra
Creating 3rd Generation Web APIs with HydraMarkus Lanthaler
 
Location, location, location: A transaction comparison of catalog searches o...
Location, location, location:A transaction comparison of catalog searches o...Location, location, location:A transaction comparison of catalog searches o...
Location, location, location: A transaction comparison of catalog searches o...teaguese
 
Search engines coh m
Search engines coh mSearch engines coh m
Search engines coh mcpcmattc
 
Advanced MongoDB Aggregation Pipelines
Advanced MongoDB Aggregation PipelinesAdvanced MongoDB Aggregation Pipelines
Advanced MongoDB Aggregation PipelinesTom Schreiber
 
Intelligent web crawling
Intelligent web crawlingIntelligent web crawling
Intelligent web crawlingDenis Shestakov
 

La actualidad más candente (20)

Querying datasets on the Web with high availability
Querying datasets on the Web with high availabilityQuerying datasets on the Web with high availability
Querying datasets on the Web with high availability
 
Live DBpedia querying with high availability
Live DBpedia querying with high availabilityLive DBpedia querying with high availability
Live DBpedia querying with high availability
 
Initial Usage Analysis of DBpedia's Triple Pattern Fragments
Initial Usage Analysis of DBpedia's Triple Pattern FragmentsInitial Usage Analysis of DBpedia's Triple Pattern Fragments
Initial Usage Analysis of DBpedia's Triple Pattern Fragments
 
Querying data on the Web – client or server?
Querying data on the Web – client or server?Querying data on the Web – client or server?
Querying data on the Web – client or server?
 
Building a data processing pipeline in Python
Building a data processing pipeline in PythonBuilding a data processing pipeline in Python
Building a data processing pipeline in Python
 
Linking media, data, and services
Linking media, data, and servicesLinking media, data, and services
Linking media, data, and services
 
On the Persistence of Persistent Identifiers of the Scholarly Web
On the Persistence of Persistent Identifiers of the Scholarly WebOn the Persistence of Persistent Identifiers of the Scholarly Web
On the Persistence of Persistent Identifiers of the Scholarly Web
 
Introduction to Linked Data Platform (LDP)
Introduction to Linked Data Platform (LDP)Introduction to Linked Data Platform (LDP)
Introduction to Linked Data Platform (LDP)
 
Learning W3C Linked Data Platform with examples
Learning W3C Linked Data Platform with examplesLearning W3C Linked Data Platform with examples
Learning W3C Linked Data Platform with examples
 
Consuming Linked Data 4/5 Semtech2011
Consuming Linked Data 4/5 Semtech2011Consuming Linked Data 4/5 Semtech2011
Consuming Linked Data 4/5 Semtech2011
 
Yahoo Making The Web Searchable
Yahoo  Making The  Web  SearchableYahoo  Making The  Web  Searchable
Yahoo Making The Web Searchable
 
Authentication, Authorization & Error Handling with GraphQL
Authentication, Authorization & Error Handling with GraphQLAuthentication, Authorization & Error Handling with GraphQL
Authentication, Authorization & Error Handling with GraphQL
 
The Serverless GraphQL Backend Architecture
The Serverless GraphQL Backend ArchitectureThe Serverless GraphQL Backend Architecture
The Serverless GraphQL Backend Architecture
 
Epiphany: Adaptable RDFa Generation Linking the Web of Documents to the Web o...
Epiphany: Adaptable RDFa Generation Linking the Web of Documents to the Web o...Epiphany: Adaptable RDFa Generation Linking the Web of Documents to the Web o...
Epiphany: Adaptable RDFa Generation Linking the Web of Documents to the Web o...
 
Rethinking Notes
Rethinking NotesRethinking Notes
Rethinking Notes
 
Creating 3rd Generation Web APIs with Hydra
Creating 3rd Generation Web APIs with HydraCreating 3rd Generation Web APIs with Hydra
Creating 3rd Generation Web APIs with Hydra
 
Location, location, location: A transaction comparison of catalog searches o...
Location, location, location:A transaction comparison of catalog searches o...Location, location, location:A transaction comparison of catalog searches o...
Location, location, location: A transaction comparison of catalog searches o...
 
Search engines coh m
Search engines coh mSearch engines coh m
Search engines coh m
 
Advanced MongoDB Aggregation Pipelines
Advanced MongoDB Aggregation PipelinesAdvanced MongoDB Aggregation Pipelines
Advanced MongoDB Aggregation Pipelines
 
Intelligent web crawling
Intelligent web crawlingIntelligent web crawling
Intelligent web crawling
 

Destacado

Can’t Find Your 404s?
Can’t Find Your 404s?Can’t Find Your 404s?
Can’t Find Your 404s?Michael Nelson
 
Memento: Time Travel for the Web
Memento: Time Travel for the WebMemento: Time Travel for the Web
Memento: Time Travel for the WebMichael Nelson
 
A Research Agenda for "Obsolete Data or Resources"
A Research Agenda for "Obsolete Data or Resources"A Research Agenda for "Obsolete Data or Resources"
A Research Agenda for "Obsolete Data or Resources"Michael Nelson
 
Music Video Redundancy and Half-Life in YouTube
Music Video Redundancy and Half-Life in YouTubeMusic Video Redundancy and Half-Life in YouTube
Music Video Redundancy and Half-Life in YouTubeMichael Nelson
 
Memento: TimeGates, TimeBundles, and TimeMaps
Memento: TimeGates, TimeBundles, and TimeMapsMemento: TimeGates, TimeBundles, and TimeMaps
Memento: TimeGates, TimeBundles, and TimeMapsMichael Nelson
 
Using timed-release cryptography to mitigate the preservation risk of embargo...
Using timed-release cryptography to mitigate the preservation risk of embargo...Using timed-release cryptography to mitigate the preservation risk of embargo...
Using timed-release cryptography to mitigate the preservation risk of embargo...Michael Nelson
 
Memento: Time Travel for the Web
Memento: Time Travel for the WebMemento: Time Travel for the Web
Memento: Time Travel for the WebMichael Nelson
 
Memento: Time Travel for the Web
Memento: Time Travel for the WebMemento: Time Travel for the Web
Memento: Time Travel for the WebMichael Nelson
 
My Point of View: Michael L. Nelson Web Archiving Cooperative
My Point of View: Michael L. Nelson  Web Archiving CooperativeMy Point of View: Michael L. Nelson  Web Archiving Cooperative
My Point of View: Michael L. Nelson Web Archiving CooperativeMichael Nelson
 
(Re-)Discovering Lost Web Pages
(Re-)Discovering Lost Web Pages(Re-)Discovering Lost Web Pages
(Re-)Discovering Lost Web PagesMichael Nelson
 
The Open Archives Initiative
The Open Archives InitiativeThe Open Archives Initiative
The Open Archives InitiativeMichael Nelson
 
Review of Web Archiving
Review of Web ArchivingReview of Web Archiving
Review of Web ArchivingMichael Nelson
 
Tools for A Preservation Ready Web
Tools for A Preservation Ready WebTools for A Preservation Ready Web
Tools for A Preservation Ready WebMichael Nelson
 
OAI-ORE: The Open Archives Initiative Object Reuse and Exchange Project
OAI-ORE:  The Open Archives Initiative  Object Reuse and Exchange ProjectOAI-ORE:  The Open Archives Initiative  Object Reuse and Exchange Project
OAI-ORE: The Open Archives Initiative Object Reuse and Exchange ProjectMichael Nelson
 
Why Care About the Past?
Why Care About the Past?Why Care About the Past?
Why Care About the Past?Michael Nelson
 

Destacado (15)

Can’t Find Your 404s?
Can’t Find Your 404s?Can’t Find Your 404s?
Can’t Find Your 404s?
 
Memento: Time Travel for the Web
Memento: Time Travel for the WebMemento: Time Travel for the Web
Memento: Time Travel for the Web
 
A Research Agenda for "Obsolete Data or Resources"
A Research Agenda for "Obsolete Data or Resources"A Research Agenda for "Obsolete Data or Resources"
A Research Agenda for "Obsolete Data or Resources"
 
Music Video Redundancy and Half-Life in YouTube
Music Video Redundancy and Half-Life in YouTubeMusic Video Redundancy and Half-Life in YouTube
Music Video Redundancy and Half-Life in YouTube
 
Memento: TimeGates, TimeBundles, and TimeMaps
Memento: TimeGates, TimeBundles, and TimeMapsMemento: TimeGates, TimeBundles, and TimeMaps
Memento: TimeGates, TimeBundles, and TimeMaps
 
Using timed-release cryptography to mitigate the preservation risk of embargo...
Using timed-release cryptography to mitigate the preservation risk of embargo...Using timed-release cryptography to mitigate the preservation risk of embargo...
Using timed-release cryptography to mitigate the preservation risk of embargo...
 
Memento: Time Travel for the Web
Memento: Time Travel for the WebMemento: Time Travel for the Web
Memento: Time Travel for the Web
 
Memento: Time Travel for the Web
Memento: Time Travel for the WebMemento: Time Travel for the Web
Memento: Time Travel for the Web
 
My Point of View: Michael L. Nelson Web Archiving Cooperative
My Point of View: Michael L. Nelson  Web Archiving CooperativeMy Point of View: Michael L. Nelson  Web Archiving Cooperative
My Point of View: Michael L. Nelson Web Archiving Cooperative
 
(Re-)Discovering Lost Web Pages
(Re-)Discovering Lost Web Pages(Re-)Discovering Lost Web Pages
(Re-)Discovering Lost Web Pages
 
The Open Archives Initiative
The Open Archives InitiativeThe Open Archives Initiative
The Open Archives Initiative
 
Review of Web Archiving
Review of Web ArchivingReview of Web Archiving
Review of Web Archiving
 
Tools for A Preservation Ready Web
Tools for A Preservation Ready WebTools for A Preservation Ready Web
Tools for A Preservation Ready Web
 
OAI-ORE: The Open Archives Initiative Object Reuse and Exchange Project
OAI-ORE:  The Open Archives Initiative  Object Reuse and Exchange ProjectOAI-ORE:  The Open Archives Initiative  Object Reuse and Exchange Project
OAI-ORE: The Open Archives Initiative Object Reuse and Exchange Project
 
Why Care About the Past?
Why Care About the Past?Why Care About the Past?
Why Care About the Past?
 

Similar a Synchronicity: Just-In-Time Discovery of Lost Web Pages

Finding knowledge, data and answers on the Semantic Web
Finding knowledge, data and answers on the Semantic WebFinding knowledge, data and answers on the Semantic Web
Finding knowledge, data and answers on the Semantic Webebiquity
 
Linked Data and Locah, UKSG2011
Linked Data and Locah, UKSG2011 Linked Data and Locah, UKSG2011
Linked Data and Locah, UKSG2011 Jane Stevenson
 
CSCI 494 - Lect. 3. Anatomy of Search Engines/Building a Crawler
CSCI 494 - Lect. 3. Anatomy of Search Engines/Building a CrawlerCSCI 494 - Lect. 3. Anatomy of Search Engines/Building a Crawler
CSCI 494 - Lect. 3. Anatomy of Search Engines/Building a CrawlerSean Golliher
 
Future of Web 2.0 & The Semantic Web
Future of Web 2.0 & The Semantic WebFuture of Web 2.0 & The Semantic Web
Future of Web 2.0 & The Semantic Webis20090
 
Information Extraction and Linked Data Cloud
Information Extraction and Linked Data CloudInformation Extraction and Linked Data Cloud
Information Extraction and Linked Data CloudDhaval Thakker
 
Solr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance studySolr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance studyCharlie Hull
 
Dissertation Defense
Dissertation DefenseDissertation Defense
Dissertation DefenseMartin Klein
 
20100614 ISWSA Keynote
20100614 ISWSA Keynote20100614 ISWSA Keynote
20100614 ISWSA KeynoteAxel Polleres
 
Environment Canada's Data Management Service
Environment Canada's Data Management ServiceEnvironment Canada's Data Management Service
Environment Canada's Data Management ServiceSafe Software
 
Linked Data Overview - AGI Technical SIG
Linked Data Overview - AGI Technical SIGLinked Data Overview - AGI Technical SIG
Linked Data Overview - AGI Technical SIGChris Ewing
 
The Nature of Information
The Nature of InformationThe Nature of Information
The Nature of InformationAdrian Paschke
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commonsJesse Wang
 
Integrating a Domain Ontology Development Environment and an Ontology Search ...
Integrating a Domain Ontology Development Environment and an Ontology Search ...Integrating a Domain Ontology Development Environment and an Ontology Search ...
Integrating a Domain Ontology Development Environment and an Ontology Search ...Takeshi Morita
 
History and Background of the USEWOD Data Challenge
History and Background of the  USEWOD Data ChallengeHistory and Background of the  USEWOD Data Challenge
History and Background of the USEWOD Data ChallengeKnud Möller
 
WWW09 - Triplify Light-Weight Linked Data Publication from Relational Databases
WWW09 - Triplify Light-Weight Linked Data Publication from Relational DatabasesWWW09 - Triplify Light-Weight Linked Data Publication from Relational Databases
WWW09 - Triplify Light-Weight Linked Data Publication from Relational DatabasesSören Auer
 
WTF is Semantic Web?
WTF is Semantic Web?WTF is Semantic Web?
WTF is Semantic Web?milesw
 
Streaming ETL for Data Lakes using Amazon Kinesis Firehose - May 2017 AWS Onl...
Streaming ETL for Data Lakes using Amazon Kinesis Firehose - May 2017 AWS Onl...Streaming ETL for Data Lakes using Amazon Kinesis Firehose - May 2017 AWS Onl...
Streaming ETL for Data Lakes using Amazon Kinesis Firehose - May 2017 AWS Onl...Amazon Web Services
 
Usage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosUsage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosEUCLID project
 
Developing Distributed Web Applications, Where does REST fit in?
Developing Distributed Web Applications, Where does REST fit in?Developing Distributed Web Applications, Where does REST fit in?
Developing Distributed Web Applications, Where does REST fit in?Srinath Perera
 
Knowledge discoverylaurahollink
Knowledge discoverylaurahollinkKnowledge discoverylaurahollink
Knowledge discoverylaurahollinkSSSW
 

Similar a Synchronicity: Just-In-Time Discovery of Lost Web Pages (20)

Finding knowledge, data and answers on the Semantic Web
Finding knowledge, data and answers on the Semantic WebFinding knowledge, data and answers on the Semantic Web
Finding knowledge, data and answers on the Semantic Web
 
Linked Data and Locah, UKSG2011
Linked Data and Locah, UKSG2011 Linked Data and Locah, UKSG2011
Linked Data and Locah, UKSG2011
 
CSCI 494 - Lect. 3. Anatomy of Search Engines/Building a Crawler
CSCI 494 - Lect. 3. Anatomy of Search Engines/Building a CrawlerCSCI 494 - Lect. 3. Anatomy of Search Engines/Building a Crawler
CSCI 494 - Lect. 3. Anatomy of Search Engines/Building a Crawler
 
Future of Web 2.0 & The Semantic Web
Future of Web 2.0 & The Semantic WebFuture of Web 2.0 & The Semantic Web
Future of Web 2.0 & The Semantic Web
 
Information Extraction and Linked Data Cloud
Information Extraction and Linked Data CloudInformation Extraction and Linked Data Cloud
Information Extraction and Linked Data Cloud
 
Solr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance studySolr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance study
 
Dissertation Defense
Dissertation DefenseDissertation Defense
Dissertation Defense
 
20100614 ISWSA Keynote
20100614 ISWSA Keynote20100614 ISWSA Keynote
20100614 ISWSA Keynote
 
Environment Canada's Data Management Service
Environment Canada's Data Management ServiceEnvironment Canada's Data Management Service
Environment Canada's Data Management Service
 
Linked Data Overview - AGI Technical SIG
Linked Data Overview - AGI Technical SIGLinked Data Overview - AGI Technical SIG
Linked Data Overview - AGI Technical SIG
 
The Nature of Information
The Nature of InformationThe Nature of Information
The Nature of Information
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commons
 
Integrating a Domain Ontology Development Environment and an Ontology Search ...
Integrating a Domain Ontology Development Environment and an Ontology Search ...Integrating a Domain Ontology Development Environment and an Ontology Search ...
Integrating a Domain Ontology Development Environment and an Ontology Search ...
 
History and Background of the USEWOD Data Challenge
History and Background of the  USEWOD Data ChallengeHistory and Background of the  USEWOD Data Challenge
History and Background of the USEWOD Data Challenge
 
WWW09 - Triplify Light-Weight Linked Data Publication from Relational Databases
WWW09 - Triplify Light-Weight Linked Data Publication from Relational DatabasesWWW09 - Triplify Light-Weight Linked Data Publication from Relational Databases
WWW09 - Triplify Light-Weight Linked Data Publication from Relational Databases
 
WTF is Semantic Web?
WTF is Semantic Web?WTF is Semantic Web?
WTF is Semantic Web?
 
Streaming ETL for Data Lakes using Amazon Kinesis Firehose - May 2017 AWS Onl...
Streaming ETL for Data Lakes using Amazon Kinesis Firehose - May 2017 AWS Onl...Streaming ETL for Data Lakes using Amazon Kinesis Firehose - May 2017 AWS Onl...
Streaming ETL for Data Lakes using Amazon Kinesis Firehose - May 2017 AWS Onl...
 
Usage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosUsage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application Scenarios
 
Developing Distributed Web Applications, Where does REST fit in?
Developing Distributed Web Applications, Where does REST fit in?Developing Distributed Web Applications, Where does REST fit in?
Developing Distributed Web Applications, Where does REST fit in?
 
Knowledge discoverylaurahollink
Knowledge discoverylaurahollinkKnowledge discoverylaurahollink
Knowledge discoverylaurahollink
 

Más de Michael Nelson

Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035Michael Nelson
 
Uncertainty in replaying archived Twitter pages
Uncertainty in replaying archived Twitter pagesUncertainty in replaying archived Twitter pages
Uncertainty in replaying archived Twitter pagesMichael Nelson
 
Web Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsWeb Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsMichael Nelson
 
Web Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsWeb Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsMichael Nelson
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesMichael Nelson
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesMichael Nelson
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Michael Nelson
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Michael Nelson
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Michael Nelson
 
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...Michael Nelson
 
Summarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniquesSummarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniquesMichael Nelson
 
The Memento Protocol and Research Issues With Web Archiving
The Memento Protocol and Research Issues With Web ArchivingThe Memento Protocol and Research Issues With Web Archiving
The Memento Protocol and Research Issues With Web ArchivingMichael Nelson
 
We Need Multiple, Independent Web Archives
We Need Multiple, Independent Web ArchivesWe Need Multiple, Independent Web Archives
We Need Multiple, Independent Web ArchivesMichael Nelson
 
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptCombining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptMichael Nelson
 
Storytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web ArchivesStorytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web ArchivesMichael Nelson
 
Why We Need Multiple Archives
Why We Need Multiple ArchivesWhy We Need Multiple Archives
Why We Need Multiple ArchivesMichael Nelson
 
Combining Storytelling and Web Archives
Combining Storytelling and Web ArchivesCombining Storytelling and Web Archives
Combining Storytelling and Web ArchivesMichael Nelson
 
@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015Michael Nelson
 
Evaluating the Temporal Coherence of Archived Pages
Evaluating the Temporal Coherence of Archived PagesEvaluating the Temporal Coherence of Archived Pages
Evaluating the Temporal Coherence of Archived PagesMichael Nelson
 
When Should I Make Preservation Copies of Myself?
When Should I Make Preservation Copies of Myself?�When Should I Make Preservation Copies of Myself?�
When Should I Make Preservation Copies of Myself?Michael Nelson
 

Más de Michael Nelson (20)

Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035
 
Uncertainty in replaying archived Twitter pages
Uncertainty in replaying archived Twitter pagesUncertainty in replaying archived Twitter pages
Uncertainty in replaying archived Twitter pages
 
Web Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsWeb Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed Originals
 
Web Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsWeb Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed Originals
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
 
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
 
Summarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniquesSummarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniques
 
The Memento Protocol and Research Issues With Web Archiving
The Memento Protocol and Research Issues With Web ArchivingThe Memento Protocol and Research Issues With Web Archiving
The Memento Protocol and Research Issues With Web Archiving
 
We Need Multiple, Independent Web Archives
We Need Multiple, Independent Web ArchivesWe Need Multiple, Independent Web Archives
We Need Multiple, Independent Web Archives
 
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptCombining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
 
Storytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web ArchivesStorytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web Archives
 
Why We Need Multiple Archives
Why We Need Multiple ArchivesWhy We Need Multiple Archives
Why We Need Multiple Archives
 
Combining Storytelling and Web Archives
Combining Storytelling and Web ArchivesCombining Storytelling and Web Archives
Combining Storytelling and Web Archives
 
@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015
 
Evaluating the Temporal Coherence of Archived Pages
Evaluating the Temporal Coherence of Archived PagesEvaluating the Temporal Coherence of Archived Pages
Evaluating the Temporal Coherence of Archived Pages
 
When Should I Make Preservation Copies of Myself?
When Should I Make Preservation Copies of Myself?�When Should I Make Preservation Copies of Myself?�
When Should I Make Preservation Copies of Myself?
 

Último

Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
QMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfQMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfROWELL MARQUINA
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Karmanjay Verma
 
Français Patch Tuesday - Avril
Français Patch Tuesday - AvrilFrançais Patch Tuesday - Avril
Français Patch Tuesday - AvrilIvanti
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFMichael Gough
 
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...amber724300
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Nikki Chapple
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...BookNet Canada
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Mark Simos
 

Último (20)

Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
QMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfQMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdf
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#
 
Français Patch Tuesday - Avril
Français Patch Tuesday - AvrilFrançais Patch Tuesday - Avril
Français Patch Tuesday - Avril
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDF
 
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
 

Synchronicity: Just-In-Time Discovery of Lost Web Pages

  • 1. Synchronicity: Just-In-Time Discovery of Lost Web Pages NDIIPP Partners Meeting June 25, 2009 Martin Klein & Michael L. Nelson Department of Computer Science Old Dominion University Norfolk VA www.cs.odu.edu/~{mklein,mln}
  • 2.
  • 3. The Actors Put a human -- lots of humans -- in the loop for preservation purposes
  • 4.
  • 8. URI Content Mapping Problem 1 same URI maps to same or very similar content at a later time 2 same URI maps to different content at a later time 3 different URI maps to same or very similar content at the same or at a later time 4 the content can not be found at any URI U1 C1 U1 C1 time A B U1 C2 U1 C1 time A B U2 C1 U1 C1 U1 404 time A B U1 ?? U1 C1 time A B
  • 9. JCDL 2008 http://www.jcdl2008.org/ July 2008 http://www.jcdl2008.org/ Today Scenario 1: Same URI, Same Content
  • 10. Hypertext 2006 http://www.ht06.org/ August 2006 http://www.ht06.org/ Today Scenario 2: Same URI, Different Content
  • 11. PSP 2003 http://www.pspcentral.org/events/annual_meeting_2003.html August 2003 http://www.pspcentral.org/events/archive/annual_meeting_2003.html Today Scenario 3a: Same Content, Different URI
  • 12. ECDL 1999 http://www-rocq.inria.fr/EuroDL99/ October 1999 http://www.informatik.uni-trier.de/~ley/db/conf/ercimdl/ercimdl99.html Today Scenario 3b: Similar Content, Different URI
  • 13. Greynet 1999 http://www.konbib.nl/infolev/greynet/2.5.htm 1999 Today ? ? Scenario 4: Content Not Findable At Any URI
  • 14. Otto : You eat a lot of acid, Miller, back in the hippie days? Miller : A lot o' people don't realize what's really going on. They view life as a bunch o' unconnected incidents 'n things. They don't realize that there's this, like, lattice o' coincidence that lays on top o' everything. Give you an example; show you what I mean: suppose you're thinkin' about a plate o' shrimp. Suddenly someone'll say, like, plate, or shrimp, or plate o' shrimp out of the blue, no explanation. No point in lookin' for one, either. It's all part of a cosmic unconsciousness.
  • 15.
  • 16.
  • 17.
  • 18.  
  • 19. What is a Signature? (aka “message digest”, examples include “md5” and “sha-1”) image from Eddie Kohler http://www.cs.ucla.edu/~kohler/
  • 20.
  • 21.
  • 22. Lexical Signatures -- Examples A “Googlewhack” ( http://en.wikipedia.org/wiki/Googlewhack ) can be thought of as a two-term LS that produces a 1/1 ranking. libraries jcdl digital conference pst http://www.google.com/search?q=libraries+jcdl+digital+conference+pst http://www.jcdl2008.org 1/51 (2/77 in 01/2008) nsdl multiagency imls testbeds extramural http://www. google .com/search? q=nsdl + multiagency + imls + testbeds +extramural http://www.dli2.nsf.gov 0/10 Rank/Results URL LS 1/1 http://www.cs.berkeley.edu/˜wilensky/NLP.html texttiling wilensky disambiguation subtopic iago http://www. google .com/search? q=texttiling + wilensky +disambiguation+subtopic+ iago 1/221,000 (1/174,000 in 01/2008) http://www.loc.gov library collections congress thomas american http://www. google .com/search? q=library +collections+congress+ thomas + american
  • 23.
  • 24.
  • 25.
  • 26. LS Evolution Over Time Copies of web pages from the IA (1996-2007) 300 Random URLs, winnowed to 98, 10493 observations over 12 years
  • 27. Evolution Over Time -- Example 10-term LSs generated for http://www.perfect10wines.com
  • 28.
  • 29.
  • 30.
  • 31.
  • 32.
  • 33. Titles (TI), 5- & 7-term Lexical Signatures (LS5, LS7), Tags (TA) 500 random URLs from dmoz.org winnowed to 309 (only 47 of 309 had tags in delicious.com). Due to query restrictions, link neighborhood only run on Yahoo -- results were similar to tags.
  • 34.
  • 35.
  • 36.
  • 37.
  • 38. http://www.sun.com/solutions mean Levenshtein score sliding: 0.84 rooted: 0.29 1998-01-27 Sun Software Products Selector Guides -Solutions Tree 1999-02-20 Sun Software Solutions 2002-02-01 Sun Microsystems Products 2002-06-01 Sun Microsystems - Business & Industry Solutions 2003-08-01 Sun Microsystems - Industry & Infrastructure Solutions 2004-02-02 Sun Microsystems - Solutions 2004-06-10 Gateway Page - Sun Solutions 2006-01-09 Sun Microsystems Solutions & Services 2007-01-03 Services & Solutions 2007-02-07 Sun Services & Solutions 2008-01-19 Sun Solutions http://www.datacity.com/mainf.html mean Levenshtein score sliding: 0.68 rooted: 0.15 2000-06-19 DataCity of Manassas Park Main Page 2000-10-12 DataCity of Manassas Park sells Custom Built Computers & Removable Hard Drives 2001-08-21 DataCity a computer company in Manassas Park sells Custom Built Computers & Removable Hard Drives 2002-10-16 computer company in Manassas Virginia sells Custom Built Computers with Removable Hard Drives Kits and Iomega 2GB Jaz Drives (jazz drives) October 2002 DataCity 800-326-5051 toll free 2006-03-14 Est 1989 Computer company in Stafford Virginia sells Custom Built Secure Computers with DoD 5200.1-R Approved Removable Hard Drives, Hard Drive Kits and Iomega 2GB Jaz Drives (jazz drives), introduces the IllumiNite® lighted keyboard DataCity 800-326-5051 Service Disabled Veteran Owned Business SDVOB
  • 39.
  • 41. How much of the Web is indexed? Estimates from “The Indexable Web is More than 11.5 billion pages” by Gulli and Signorini (WWW’05)

Notas del editor

  1. GYM intersection less than 43%