Martin Klein presented research on using focused crawls of web archives to build event collections. The approach crawls 22 web archives simultaneously via the Memento infrastructure, starting from an event's Wikipedia page and filtering pages by content and temporal relevance. Collections for recent events benefited more from the live web, while collections for older events benefited more from the archived web. Using multiple archives was beneficial, and focused crawls showed the potential to outperform manual collection building.
Focused Crawl of Web Archives to Build Event Collections
1. Focused Crawl of Web Archives to Build Event Collections
@mart1nkle1n
WebSci 2018, 05/30/2018, Amsterdam, NL
Martin Klein
Lyudmila Balakireva
Herbert Van de Sompel
Research Library
Los Alamos National Laboratory
2. Background: Event Collection Building
• Often orchestrated by subject matter experts, archivists, special collection librarians, technicians
• Potentially with guidance from institutional collection policy
• Results in a list of seeds (URIs, social media accounts, etc.)
• Utilization of crawling services such as Archive-It, Social Feed Manager
3. Problems and our Approach
• Temporal: time passed since event is of concern
  → Use of web archives
• Selection: seeds often picked manually
  → Use of references from Wikipedia pages
• Relevance: seed assessment often done by humans
  → Use of focused crawling with content and temporal relevance assessment
Inspiration from:
"Extracting Event-Centric Document Collections from Large-Scale Web Archives"
Gerhard Gossen, Elena Demidova, Thomas Risse
https://doi.org/10.1007/978-3-319-67008-9_10
4. Background: Archived Web
• Web archives are an invaluable resource for researchers, historians, journalists, etc.
• Often broad in scope, large in scale, covering different temporal intervals
• Makes discovery, access, and analysis difficult
7. Questions
• Can we create event collections by focused crawling online-available web archives?
• How do event collections created from the archived web compare to those created from the live web?
• How does the amount of time passed since the event affect the collections built from the live and the archived web?
• How do event collections built from the archived web compare to manually curated collections?
8. Experiment
• Topics limited to terror attacks and mass shootings in the U.S.
• From different times in the past
• Focused crawl of:
  a) 22 archives, simultaneously, via Memento infrastructure
  b) the live web
• Take content and temporal relevance into account, equally weighted
• Use events' Wikipedia page as input for focused crawler
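Crawling archives "via Memento infrastructure" can be pictured as datetime negotiation (RFC 7089): the client asks a TimeGate for the archived copy of a URI closest to a given datetime. The sketch below only builds such a request and does not send it. The aggregator base URL and the helper name `build_timegate_request` are assumptions for illustration; the slides do not say which endpoint the crawler used.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Assumption: a Memento aggregator TimeGate; the slides do not name the endpoint.
TIMEGATE_BASE = "http://timetravel.mementoweb.org/timegate/"


def build_timegate_request(uri: str, when: datetime, base: str = TIMEGATE_BASE) -> Request:
    """Build (but do not send) a Memento datetime-negotiation request."""
    # RFC 7089 expects an RFC 1123 date in GMT in the Accept-Datetime header.
    when_gmt = when.astimezone(timezone.utc)
    header_value = format_datetime(when_gmt, usegmt=True)
    return Request(base + uri, headers={"Accept-Datetime": header_value})


req = build_timegate_request(
    "http://www.cnn.com/",
    datetime(2011, 1, 8, 12, 0, 0, tzinfo=timezone.utc),
)
print(req.full_url)
# urllib stores header names capitalized, hence the lowercase "d" in the lookup.
print(req.get_header("Accept-datetime"))
```

Sending the request (for example with `urllib.request.urlopen`) would follow the TimeGate's redirect to a memento held by one of the aggregated archives, which is what allows a single crawler to draw on many archives at once.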
9. Content Relevance
1. Content of Wikipedia page + random 60% of page's references
   • Generate topic vector (TF-IDF of 1-grams + 2-grams)
2. Content of remaining 40% of Wikipedia page's outlinks
   • Generate topic vector (TF-IDF of 1-grams + 2-grams)
• Compute cosine similarity value between vectors 1 and 2
• Run 10 times
• Take average similarity value as content threshold
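The threshold procedure above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the tokenizer, the smoothed IDF computed over only the two documents being compared, and the function names (`ngrams`, `tfidf`, `cosine`, `content_threshold`) are all hypothetical, and the already-fetched texts of the referenced pages are passed in as one list that is split 60/40.

```python
import math
import random
import re
from collections import Counter


def ngrams(text):
    """Lowercased 1-grams and 2-grams of a text (assumed tokenizer)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return words + [" ".join(pair) for pair in zip(words, words[1:])]


def tfidf(term_lists):
    """One TF-IDF vector (dict) per term list, using a smoothed IDF."""
    n = len(term_lists)
    df = Counter(term for terms in term_lists for term in set(terms))
    vectors = []
    for terms in term_lists:
        tf = Counter(terms)
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def content_threshold(wiki_text, reference_texts, runs=10, split=0.6, seed=0):
    """Average similarity between a 60% and a 40% random split, over `runs` runs."""
    rng = random.Random(seed)
    similarities = []
    for _ in range(runs):
        refs = list(reference_texts)
        rng.shuffle(refs)
        cut = round(len(refs) * split)
        # Vector 1: Wikipedia page plus 60% of the referenced pages.
        part1 = ngrams(" ".join([wiki_text] + refs[:cut]))
        # Vector 2: the remaining 40% of the referenced pages.
        part2 = ngrams(" ".join(refs[cut:]))
        v1, v2 = tfidf([part1, part2])
        similarities.append(cosine(v1, v2))
    return sum(similarities) / len(similarities)
```

During a crawl, a page would then count as content-relevant when the cosine similarity between its vector and the topic vector meets or exceeds the returned threshold.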
10. Temporal Relevance
• Define temporal interval for which crawled pages are considered relevant
• Event date extracted from Wikipedia event page
• Change point determined from graph of proportional Wikipedia page edits per day
[Figure: temporal relevance score (0 to 1) over a timeline marked Event Date, Change Point, Today]
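One way to turn the interval between event date and change point into a score is sketched below. The slides only state that pages within the interval are considered relevant and that content and temporal relevance are weighted equally; the zero score before the event, the exponential decay after the change point, the `half_life_days` parameter, and both function names are assumptions made for illustration.

```python
from datetime import date


def temporal_relevance(page_date, event_date, change_point, half_life_days=30.0):
    """Score in [0, 1] for a page's datetime relative to the event interval."""
    if page_date < event_date:
        return 0.0  # assumption: pages dated before the event score zero
    if page_date <= change_point:
        return 1.0  # inside the interval the page is fully relevant
    # Assumption: relevance halves every `half_life_days` after the change point.
    days_past = (page_date - change_point).days
    return 0.5 ** (days_past / half_life_days)


def combined_relevance(content_score, temporal_score):
    """Equal weighting of content and temporal relevance, as stated on the slides."""
    return 0.5 * content_score + 0.5 * temporal_score


event = date(2011, 1, 8)
change = date(2011, 1, 22)  # hypothetical change point, for illustration only
print(temporal_relevance(date(2011, 1, 10), event, change))
print(combined_relevance(0.4, temporal_relevance(date(2011, 2, 21), event, change)))
```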
11. Datetime Extraction
• Extract datetime from pages via:
  • URI
    http://www.cnn.com/2017/12/09/us/wildfire-fighting-tactics/
  • Meta tags
    <meta property="article:published" itemprop="datePublished" content="2017-12-09T10:14:50-05:00" />
  • ODU's Carbondate tool
    http://carbondate.cs.odu.edu/
  • Memento datetime
  • X-Header
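The first two extraction routes, URI and meta tags, can be sketched with regular expressions. This is an illustration only: the patterns, the function names, and the assumption that `itemprop` precedes `content` inside the tag are mine, and a real crawler would more likely use an HTML parser. Carbondate, Memento datetime, and header lookups are not covered here.

```python
import re
from datetime import datetime
from typing import Optional

# Matches a /YYYY/MM/DD/ path segment such as the CNN example on the slide.
URI_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")

# Assumes itemprop="datePublished" appears before the content attribute.
META_DATE = re.compile(
    r"<meta\b[^>]*\bitemprop=[\"']datePublished[\"'][^>]*\bcontent=[\"']([^\"']+)[\"']",
    re.IGNORECASE,
)


def date_from_uri(uri: str) -> Optional[datetime]:
    """Return the date embedded in a URI path, or None if there is none."""
    match = URI_DATE.search(uri)
    if not match:
        return None
    try:
        return datetime(*map(int, match.groups()))
    except ValueError:  # e.g. /2017/13/45/ is not a valid date
        return None


def date_from_meta(html: str) -> Optional[datetime]:
    """Return the datePublished meta value as a datetime, or None."""
    match = META_DATE.search(html)
    if not match:
        return None
    try:
        return datetime.fromisoformat(match.group(1))
    except ValueError:
        return None


print(date_from_uri("http://www.cnn.com/2017/12/09/us/wildfire-fighting-tactics/"))
```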
12. Crawls
• Use version of Wikipedia page that was live at change point
• Crawl stop conditions:
  • No more relevant documents left
  • 5 levels deep
• Utilized crawl priority queue
[Figure: crawl tree. Level 0: Seed; Level 1: Child 1, Child 2, Child 3; Level 2: Child 1.1, Child 1.2, Child 2.1, Child 3.1, Child 3.2]
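The two stop conditions and the priority queue can be sketched with Python's `heapq`. This is a minimal sketch, not the crawler used in the study: `fetch_links` and `score` are hypothetical callables supplied by the caller, and the choice to expand only pages that meet the threshold is my reading of "no more relevant documents left".

```python
import heapq

MAX_DEPTH = 5  # "5 levels deep" from the slide


def focused_crawl(seed, fetch_links, score, threshold, max_depth=MAX_DEPTH):
    """Return (uri, depth, score) for every relevant page reached from the seed."""
    # heapq is a min-heap, so scores are negated to pop the best page first.
    frontier = [(-1.0, 0, seed)]
    seen = {seed}
    relevant = []
    while frontier:  # stops once no relevant pages are left to expand
        _, depth, uri = heapq.heappop(frontier)
        if depth >= max_depth:
            continue  # depth limit reached, do not follow links further
        for child in fetch_links(uri):
            if child in seen:
                continue
            seen.add(child)
            child_score = score(child)
            if child_score >= threshold:
                relevant.append((child, depth + 1, child_score))
                heapq.heappush(frontier, (-child_score, depth + 1, child))
    return relevant


# Toy link graph and scores, purely for illustration.
toy_graph = {"seed": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
toy_scores = {"a": 0.9, "b": 0.2, "c": 0.8, "d": 0.95}
result = focused_crawl("seed", lambda u: toy_graph.get(u, []), toy_scores.get, 0.5)
print(result)
```

In the toy graph, page `d` is never visited because its only parent `b` falls below the threshold, which is the pruning behavior that keeps a focused crawl on topic.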
13. Collections Crawled (in November 2017)
• New York City, October 31st 2017
• San Bernardino, December 2nd 2015
• Tucson, January 8th 2011
• Binghamton, April 3rd 2009
14. NYC, 10/31/2017: URIs per Level
[Figure: two panels, Web Archive Crawl and Live Web Crawl. x-axis: crawl depth (0 to 5); left y-axis: number of URIs (0 to 2000); right y-axis: percent (0 to 100); series: All URIs, Relevant URIs]
15. TUC, 01/08/2011: URIs per Level
[Figure: two panels, Web Archive Crawl and Live Web Crawl. x-axis: crawl depth (0 to 5); left y-axis: number of URIs (0 to 80000); right y-axis: percent (0 to 100); series: All URIs, Relevant URIs]
16. NYC, 10/31/2017: Relevance over…
[Figure: two panels, relevance over Crawled Documents and relevance over Crawl Time]
17. TUC, 01/08/2011: Relevance over…
[Figure: two panels, relevance over Crawled Documents and relevance over Crawl Time]
18. TUC, 01/08/2011: Comparison to Archive-It
[Figure: accumulated relevance (0 to 15000) over documents (0 to 15000); series: Web Archive Crawl, Archive-It Crawl]
19. TUC, 01/08/2011: Web Archive Contributions
• web.archive.org: 75%
• wayback.archive-it.org: 14%
• webarchive.loc.gov: 7%
• web.archive.bibalex.org: 2%
• archive.is: 2%
20. Takeaways
• Web archives are great resources to build event collections of web resources
• Crawling web archives is much slower than the live web
• Collections about very recent events benefit more from the live web than the archived web
but
• Collections about events from the distant past benefit more from the archived web than the live web
• Utilizing multiple web archives is beneficial for the collection
• Focused crawls have the potential to outperform manual collection building
21. https://web.archive.org/web/20171206181955/https:/twitter.com/TVNewsArchive/status/938466726190096384