Through analyzing full text search results from web archives, the authors developed a method called the Random Searcher Model (RSM) to efficiently generate profiles of web archive collections with low overhead. The profiles accurately predict an archive's likelihood of containing a URI's mementos while minimizing search costs. Different RSM modes allow customization based on collection characteristics. The authors recommend profile policies and RSM modes to balance accuracy, recall, and costs depending on available archive metadata. Future work includes combining profile attributes and evaluating profiles for applications beyond memento routing.
1. Web Archive Profiling
Through Fulltext Search
Sawood Alam and Michael L. Nelson
Computer Science Department, Old Dominion University
Norfolk, Virginia - 23529
Herbert Van de Sompel
Los Alamos National Laboratory, Los Alamos, NM
David S. H. Rosenthal
Stanford University Libraries, Stanford, CA
Supported in part by the IIPC and NSF 1526700
11. From: Michael Nelson [mailto:mln@cs.odu.edu]
Sent: Wednesday, December 02, 2015 12:33 PM
To: Jones, Gina
Cc: Rourke, Patrick; Grotke, Abigail
Subject: Re: WebSciDL
Hi Gina, I'll investigate. memgator is software that one my students wrote,
but I suspect the traffic you're seeing is b/c it is deployed in
http://oldweb.today/ can you share the IP addr from where you're seeing
the traffic? I presume the requests are for Memento TimeMaps? It should
not being actually scraping HTML pages.
regards,
Michael
On Wed, 2 Dec 2015, Jones, Gina wrote:
> Hi Michael, we have a slight configuration issue with the current OW
> set up for our webarchives. I think, from looking at the logs, that
> "MemGator:1.0-rc3 <@WebSciDL>" is really causing some issues
> on our wayback.
> Do you know who is running this scraper? It's not part of memento is it?
>
> Gina Jones
> Web Archiving Team
> Library of Congress
From: Ilya Kreymer <ikreymer@gmail.com>
Date: Wed, 2 Dec 2015 10:33:56 -0800
Subject: high traffic on oldweb!
To: Herbert Van de Sompel <hvdsomp@gmail.com>, Sawood Alam
<ibnesayeed@gmail.com>
Hi Herbert, Sawood,
Herbert: Perhaps you are lucky that I am not using the LANL aggregator,
as the traffic has gotten really high, and also I was asked to remove an
archive due to the traffic it was causing temporarily..
I am thinking that ability to remove source archives quickly is an
important aspect of an aggregator.
Sawood: Hopefully yours will support something like this so I don't need
to restart the container to change the archivelist ;)
Ilya
Broadcasting is Bad
12. Availability and Overlap
● Archives are sparse
● Broadcasting is wasteful, both clients and archives suffer
16. Why Do Small Archives Matter?
● 400B+ web pages at IA do not cover
everything
● The top three archives after IA produce a full
TimeMap 52% of the time (AlSum, et al., TPDL 2013)
● Targeted crawls
● Special focus archives
● Restricted resources
● Private archives
● Censorship
18. Archive Profile
● High-level summary of an archive
● Predicts presence of mementos of a URI-R in
an archive
● Provides various statistics about the holdings
● Small in size
● Publicly available
● Easy to update and partially patch
● Useful for Memento query routing and other
things
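The routing use of a profile can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each profile is simply a set of URI keys, and the archive names, the `route` function, and the sample keys are all hypothetical.

```python
from typing import Dict, List, Set

def route(uri_key: str, profiles: Dict[str, Set[str]]) -> List[str]:
    """Return the archives whose profile contains the lookup key.

    Assumption: the lookup key has already been reduced to the same
    profiling policy (level of detail) that the profiles were built with.
    """
    return sorted(name for name, keys in profiles.items() if uri_key in keys)

# Hypothetical profiles keyed at the registered-domain level.
profiles = {
    "archive_a": {"uk,co,bbc,)/", "com,example,)/"},
    "archive_b": {"com,example,)/"},
}

selected = route("uk,co,bbc,)/", profiles)
```

Only the archives in `selected` would be queried, instead of broadcasting the request to every archive.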
19. Profiling Strategies
● Sample URI Profiling (AlSum, et al., TPDL 2013)
● CDX Profiling (Alam, et al., TPDL 2015)
● Response Cache Profiling (Bornand, et al., JCDL 2016)
● Fulltext Search Profiling
21. Random Searcher Model (RSM)
[Flowchart of the Random Searcher Model]
Steps: START, Seed Vocabulary, NextWord(), Search(), Store search results, Select a random link from the search results, Fetch the contents of the selected document, ExtractWords(), GenerateProfile(), STOP
Decisions (Yes/No): Vocabulary seeding needed? Termination condition reached?
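One plausible reading of the flowchart steps, written as a loop. This is a sketch under stated assumptions, not the authors' code: the in-memory `CORPUS` stands in for an archive's fulltext search service, the termination condition is assumed to be a fixed query budget, and the `refresh_every_search` flag is a hypothetical way to express the "Vocabulary seeding needed?" decision.

```python
import random
import re
from typing import Dict, Iterable, List, Set

# Hypothetical toy corpus standing in for an archive's fulltext index.
CORPUS: Dict[str, str] = {
    "http://example.org/a": "web archive profile routing",
    "http://example.org/b": "archive search keyword vocabulary",
    "http://example.org/c": "memento routing keyword",
}

def search(word: str) -> List[str]:
    """Stand-in for Search(): URIs whose text contains the word."""
    return [uri for uri, text in CORPUS.items() if word in text.split()]

def extract_words(text: str) -> Set[str]:
    """Stand-in for ExtractWords(): lowercase alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def random_searcher(seed_words: Iterable[str], max_queries: int,
                    rng: random.Random,
                    refresh_every_search: bool = False) -> Set[str]:
    vocabulary = sorted(set(seed_words))   # Seed Vocabulary
    used: Set[str] = set()
    discovered: Set[str] = set()           # stored search results
    for _ in range(max_queries):           # assumed termination: query budget
        candidates = [w for w in vocabulary if w not in used]
        # "Vocabulary seeding needed?": always, or only when exhausted.
        if discovered and (refresh_every_search or not candidates):
            doc = rng.choice(sorted(discovered))  # select a random link
            new_words = extract_words(CORPUS[doc]) - set(vocabulary)
            vocabulary.extend(sorted(new_words))
            candidates = [w for w in vocabulary if w not in used]
        if not candidates:
            break
        word = rng.choice(candidates)      # NextWord()
        used.add(word)
        discovered.update(search(word))    # Search() and store results
    return discovered

found = random_searcher(["archive"], max_queries=10, rng=random.Random(0))
```

The returned set of discovered URIs is what a GenerateProfile() step would then summarize into profile keys.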
22. RSM Illustration
Teaching Resources Adjunct Toolkit NC NET Academy PD Planning Tools Regional
Centers Campus Liaisons Nontraditional Careers College Tech Prep NC ACCESS Co op
Education Green Technology You are here NC NET Teaching Resources Discipline Specific
English English Self Paced Modules Writing Across the Curriculum NC NET Western Center
Incorporating Visuals in Workplace Documents Sections 1 2 Wake Tech Community College
Incorporating Visuals in Workplace Documents Section 3 Wake Tech Community College
All self paced modules can be accessed through the NC NET Blackboard server Log in with
the user name faculty and the password nc net Once connected you can view the courses
by topic or alphabetically by title English Webliography North Carolina Community College
System 2012
23. RSM Modes
● Static: Externally supplied static word list
● PopularityBiased: Refresh Vocabulary after
every search attempt and consider term
frequency for selecting next search keyword
● EqualOpportunity: Refresh Vocabulary
after every search attempt and ignore term
frequency for selecting next search keyword
● Conservative: Discover new words only
when the Vocabulary is exhausted
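The difference between frequency-aware and frequency-blind keyword selection can be sketched as follows. This is an illustrative assumption about how the modes might pick the next keyword, not the authors' code; `next_word` and its parameters are hypothetical names. Static and Conservative differ from the other two modes mainly in when the Vocabulary is refreshed, so here they fall through to the uniform choice.

```python
import random
from collections import Counter
from typing import Set

def next_word(term_counts: Counter, used: Set[str], mode: str,
              rng: random.Random) -> str:
    """Pick the next search keyword from the unused vocabulary."""
    candidates = sorted(w for w in term_counts if w not in used)
    if not candidates:
        raise LookupError("vocabulary exhausted")
    if mode == "PopularityBiased":
        # Weight each word by its observed term frequency.
        weights = [term_counts[w] for w in candidates]
        return rng.choices(candidates, weights=weights, k=1)[0]
    # EqualOpportunity (and, in this sketch, Static and Conservative):
    # ignore term frequency and choose uniformly.
    return rng.choice(candidates)

# Hypothetical term frequencies gathered from fetched documents.
counts = Counter({"archive": 5, "memento": 1})
word = next_word(counts, used=set(), mode="PopularityBiased", rng=random.Random(0))
```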
24. Profiling Policies & Archive-It Dataset
Policy | # Keys | Example
URIR | 30,800,406 | uk,co,bbc,news,)/Images/Logo.png?height=80&width=200
HxP1 | 1,724,284 | uk,co,bbc,news,)/Images
DDom | 91,629 | uk,co,bbc,)/
H1P0 | 212 | uk,)/

Sample URI: https://www.news.BBC.co.uk/Images/Logo.png?width=80&height=40
For a detailed list of profiling policies please refer to:
Alam, et al.: Web Archive Profiling Through CDX Summarization. IJDL (2016) 17: 223-238
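The key shapes in the table can be reproduced with a short function. This is a sketch inferred from the examples shown, not the authors' canonicalization code: `profile_key` and `suffix_labels` are hypothetical names, and the registered domain for DDom is approximated by a fixed number of public-suffix labels rather than a real public suffix list.

```python
from urllib.parse import urlsplit

def surt_host(host: str) -> list:
    """Lowercase the host, drop a leading 'www', and reverse the labels."""
    labels = host.lower().split(".")
    if labels and labels[0] == "www":
        labels = labels[1:]
    return labels[::-1]

def profile_key(uri: str, policy: str, suffix_labels: int = 2) -> str:
    parts = urlsplit(uri)
    host = surt_host(parts.hostname or "")
    segments = [s for s in parts.path.split("/") if s]
    if policy == "URIR":    # full host, full path, sorted query parameters
        query = "&".join(sorted(parts.query.split("&"))) if parts.query else ""
        key = ",".join(host) + ",)/" + "/".join(segments)
        return key + ("?" + query if query else "")
    if policy == "HxP1":    # full host, first path segment
        return ",".join(host) + ",)/" + "/".join(segments[:1])
    if policy == "DDom":    # registered domain only (assumed suffix size)
        return ",".join(host[:suffix_labels + 1]) + ",)/"
    if policy == "H1P0":    # top-level domain only
        return ",".join(host[:1]) + ",)/"
    raise ValueError(f"unknown policy: {policy}")

sample = "https://www.news.BBC.co.uk/Images/Logo.png?width=80&height=40"
keys = {p: profile_key(sample, p) for p in ("URIR", "HxP1", "DDom", "H1P0")}
```

With the default `suffix_labels=2`, the two-label suffix `co.uk` is assumed; a host under a single-label suffix such as `.com` would need `suffix_labels=1`.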
26. RSM Operation Mode Costs
Mode | Query Cost | HTTP Cost | Remarks
Static | C | C | Suitable for a specialized collection with known top keywords
PopularityBiased | C | 2 * C | Human-like model, but costly
EqualOpportunity | C | 2 * C | Human-like model, but costly
Conservative | C | C plus an overhead much smaller than C | Suitable for any collection; works without any supplementary materials and with very little overhead
27. Routing Confusion Matrix
Predicted \ Actual | Present in the Archive | Not in the Archive
Routed to the Archive | True Positive (TP) | False Positive (FP)
Not Routed to the Archive | False Negative (FN) | True Negative (TN)
Routing Confusion Matrix Recall Accuracy
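Recall and accuracy can be computed from the four cells of the matrix. The sketch below assumes the standard definitions (recall = TP / (TP + FN), accuracy = (TP + TN) / total); the counts are made-up numbers for illustration only and are not results from this work.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of URIs actually present that were routed to the archive."""
    return tp / (tp + fn)

def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all routing decisions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)

# Made-up counts, for illustration only.
tp, fp, fn, tn = 90, 100, 10, 800
r = recall(tp, fn)
a = accuracy(tp, fp, fn, tn)
```

High recall means few mementos are missed, while accuracy also penalizes requests that are routed to an archive that holds nothing for them.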
29. Profile Policy Recommendations
● IF complete CDX is available THEN
○ Generate HxP1 profile
● ELSE IF fulltext search is available THEN
○ Generate DDom profile
● ELSE
○ Generate H1P0 or other smaller profiles using
Sample URIs
Note: It is possible to perform less detailed queries on more
specific (higher order) profiles, but not the other way around
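The note can be illustrated by rolling a more specific profile up into a coarser one. This is a sketch, assuming a profile is a mapping from key to count in the key format shown in the policy examples; `coarsen_to_h1p0`, `rollup`, and the counts are hypothetical.

```python
from collections import Counter
from typing import Dict

def coarsen_to_h1p0(key: str) -> str:
    """Truncate a more specific key to its top-level-domain (H1P0) form."""
    host_part = key.split(")", 1)[0]        # e.g. "uk,co,bbc,news,"
    return host_part.split(",")[0] + ",)/"

def rollup(profile: Dict[str, int]) -> Counter:
    """Aggregate counts of a detailed profile under the coarser keys."""
    coarse: Counter = Counter()
    for key, count in profile.items():
        coarse[coarsen_to_h1p0(key)] += count
    return coarse

# Hypothetical detailed profile with made-up counts.
detailed = {"uk,co,bbc,news,)/Images": 3, "uk,co,bbc,)/": 2, "com,example,)/a": 1}
coarse = rollup(detailed)
```

The reverse is not possible: a key such as `uk,)/` carries no information about which hosts or paths under it were archived.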
30. RSM Mode Recommendations
● IF the collection is about a specific topic in a
specific language AND a suitable top
keywords list is available THEN
○ Use Static mode
● ELSE
○ Use Conservative mode
31. Who Knows Term Frequency for
Estonian Nouns?
https://en.wiktionary.org/wiki/Category:Estonian_nouns
32. Future Work
● Evaluation of combination profiles such as
URI-Key along with Datetime
● Use archive profiles to generate a rank-ordered
list of archives
● Profiles for uses other than Memento
routing, such as site-classification-based
profiles (e.g., news, wiki, social media, blog,
etc.)
33. Conclusions
● Evaluated the search cost as a function of archive holdings’
coverage and profiling policy
● Developed the Random Searcher Model
● Correctly routed 80% of requests while maintaining 0.9 Recall
by discovering only 10% of the archive holdings and
generating a profile that costs less than 1% of the complete
knowledge profile