Investigating the Change of Web Pages’ Titles Over Time
1. Investigating the Change of Web Pages’ Titles Over Time
Martin Klein and Michael L. Nelson
Old Dominion University
{mklein,mln}@cs.odu.edu
InDP 2009
Austin, TX
06/19/2009
7. The Environment
Web Infrastructure (WI) [McCown07]
• Web search engines (Google, Yahoo!, MSN Live) and
their caches
• Research projects (CiteSeer)
• Web archives (Internet Archive)
[McCown07] - F. McCown “Lazy Preservation: Reconstructing Websites from the Web Infrastructure”, PhD thesis, Old Dominion University, 2007.
8. The Bigger Picture
(1) Query for URL in:
· search engine caches
· Internet Archive
(2) Present results; if the user is satisfied: DONE
(3) Otherwise:
· identify dissimilar pages
· extract titles
· generate LSs
· obtain tags
· query search engines
(4) If results are found: present results; if the user is satisfied: DONE
(5) If no results are found:
· include link neighborhood
· relevance feedback
· user interaction: request keywords, change number of terms in LS, add/delete term from LS, advanced search operators
(6) Present results: DONE
13. The Bigger Picture
(1) Query for URL in search engine caches and the Internet Archive
• System catches 404 “Page not found” errors
(2) Present results; DONE if the user is satisfied
• Discovers copy of missing page in WI and provides it to the user
(3) Identify dissimilar pages, extract titles, generate LSs, obtain tags, query search engines
• Obtains further data about missing page (LS, title, tags) and feeds that back into WI
(4) Present found results; DONE if the user is satisfied
• Provides page at its new location or “good enough” alternative page
(5) Include link neighborhood, relevance feedback, user interaction
• More sophisticated methods needed if unsuccessful so far
(6) Present results; DONE
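The control flow of steps (1) to (6) can be sketched roughly as follows. This is only an illustration, not the authors' system: the function name `rediscover` and all four callables passed into it are hypothetical placeholders for the components named on the slide.

```python
def rediscover(url, query_wi, query_search_engines, user_is_satisfied,
               advanced_methods):
    """Walk through steps (1)-(6) for a URL that returned a 404 error.

    All four callables are hypothetical placeholders supplied by the caller.
    """
    # (1)-(2): look for copies in search engine caches and the Internet Archive.
    copies = query_wi(url)
    if copies and user_is_satisfied(copies):
        return copies

    # (3)-(4): use data derived from the copies (titles, LSs, tags)
    # to query search engines for the page at a new location.
    results = query_search_engines(url, copies)
    if results and user_is_satisfied(results):
        return results

    # (5)-(6): fall back to link neighborhood, relevance feedback
    # and user interaction, then present whatever that yields.
    return advanced_methods(url, results)
```

Each stage only runs if the previous one produced nothing the user accepted, which mirrors the "DONE" exits in the diagram.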
15. The Bigger Picture
Step (3) must run in REAL TIME:
· identify dissimilar pages
· extract titles
· generate LSs
· obtain tags
· query search engines
20. Search Engine Queries
• Lexical signatures (LSs)
• Small set of terms capturing the “aboutness” of a document
• Generated following the TF-IDF scheme
• Phelps and Wilensky assumed 5 terms [Phelps00]
• We have shown that 5- and 7-term LSs perform best [Klein08]
BUT:
• IDF can only be estimated, since the corpus is the entire web
• Expensive to generate
Alternative: web pages’ titles
[Phelps00] - T. A. Phelps and R. Wilensky “Robust Hyperlinks Cost Just Five Words Each”, Technical Report 2000
[Klein08] - M. Klein and M. L. Nelson “Revisiting Lexical Signatures to (Re-)Discover Web Pages”, ECDL 2008
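To make the TF-IDF scheme concrete, here is a minimal sketch of generating a k-term lexical signature. It assumes a small local list of documents stands in for the web-scale corpus, and the tokenizer, the add-one smoothing and the names `tokenize` and `lexical_signature` are illustrative choices, not taken from [Phelps00] or [Klein08].

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(document, corpus, k=5):
    """Return the k terms of `document` with the highest TF-IDF score.

    `corpus` is a list of background documents used to estimate
    document frequency (an assumption: the real corpus is the web).
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for other in corpus:
        # Count each term once per document it occurs in.
        doc_freq.update(set(tokenize(other)))

    term_freq = Counter(tokenize(document))
    scores = {}
    for term, tf in term_freq.items():
        # Add-one smoothing keeps terms missing from the corpus finite.
        idf = math.log((n_docs + 1) / (doc_freq[term] + 1))
        scores[term] = tf * idf

    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]
```

The need for `corpus` is exactly the cost the slide points out: document frequencies for the whole web are not available locally and have to be estimated.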
25. Web Pages’ Titles
• Easier/cheaper to obtain than LSs
• High availability (only 1-2% of web pages have no title)
• Also capture the “aboutness” of a web page
• We have shown that LSs decay over time and their retrieval performance decreases [Klein08]
• Investigate change of titles over time:
  • General frequency of change
  • Degree of change as Levenshtein score
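As an illustration of measuring the degree of change, the following sketch computes the Levenshtein (edit) distance between two titles. The normalization by the longer title's length in `normalized_score` is an assumption made here for comparability; the slides do not say how the score is scaled.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_score(old_title, new_title):
    """Scale the distance to [0, 1]; 0 means identical titles.

    Dividing by the longer length is an assumption, not the authors' choice.
    """
    longest = max(len(old_title), len(new_title))
    if longest == 0:
        return 0.0
    return levenshtein(old_title, new_title) / longest
```

For example, `levenshtein("kitten", "sitting")` is 3: two substitutions and one insertion.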
27. Dataset
• 6k URLs randomly sampled from DMOZ
• Parsed the pages and extracted up to three URLs referencing in-domain pages
• Filtered out:
  • Inaccessible pages
  • Pages not containing any links
  • Pages not in the .com, .net, .org or .edu domain
  • Pages without copies in the IA
• Result: 1090 URLs and more than 100K observations
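Two of the dataset steps, picking up to three in-domain links and filtering by top-level domain, might look like the sketch below. It assumes "in-domain" means "same host name" and that the `href` values have already been extracted from the page; the helper names `in_allowed_tld` and `in_domain_links` are hypothetical.

```python
from urllib.parse import urljoin, urlparse

ALLOWED_TLDS = (".com", ".net", ".org", ".edu")


def in_allowed_tld(url):
    """True if the URL's host ends in one of the four allowed domains."""
    host = urlparse(url).hostname or ""
    return host.endswith(ALLOWED_TLDS)


def in_domain_links(page_url, hrefs, limit=3):
    """Return up to `limit` distinct absolute URLs on the same host
    as `page_url` (assumption: in-domain means same host name)."""
    host = urlparse(page_url).hostname
    selected = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        same_host = urlparse(absolute).hostname == host
        if same_host and absolute != page_url and absolute not in selected:
            selected.append(absolute)
        if len(selected) == limit:
            break
    return selected
```

The accessibility and Internet Archive checks are left out because they require network requests.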
35. Frequency of Change
[Figure: Number of Changes and Observations in the IA. x-axis: URLs (0 to 1000), in increasing order by 1) observations, 2) changes; y-axis: Number of Changes/Observations (1 to 10000)]
• Generally low number of changes
• Max changes: 25
• Number of observations does not impact the number of changes
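One simple way to obtain the number of changes per URL is to compare each observed title with the one before it. The sketch below assumes the titles of all archived copies of a URL are available as a chronologically ordered list; the function name `count_title_changes` is hypothetical and exact string comparison is an assumption.

```python
def count_title_changes(titles):
    """Count how often a title differs from the previous observation.

    `titles` is a chronologically ordered list with one title per
    observation (archived copy) of the same URL.
    """
    return sum(1 for previous, current in zip(titles, titles[1:])
               if previous != current)
```

A URL with many observations but a stable title yields 0, which is consistent with the number of observations not driving the number of changes.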