"Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.

WEB ARCHIVE
SERVICES FRAMEWORK
FOR TIGHTER INTEGRATION
BETWEEN THE PAST AND PRESENT WEB
Ahmed AlSum
PhD Defense
February 2014
Committee Members:
• Michael L. Nelson
• Michele C. Weigle
• Hussein M. Abdel-Wahab
• M'Hammad Abdous
• Herbert Van de Sompel
Old Dominion University Computer Science Department
1

Domain
Contribution
Goal
WEB ARCHIVE
SERVICES FRAMEWORK
2

Outline
• Introduction
• Web Archiving Services Framework
• Content Service
• Metadata Service
• URI Service
• Archive Service
• Conclusions
3

INTRODUCTION
Motivation and Research Questions
4

What is a Web Archive?
Introduction  Motivation
http://www.cs.odu.edu
5

Who are using Web Archives? & How?
• Politicians
• Journalists
• Web designers
• Historians
• Researchers
• Social scientists
• Curious users
6
*IIPC Access Working Group 2006, Costa 2010, Dougherty 2010, Stirling 2011, Smith 2009

Web Archives interfaces are limited
7

Web Archiving Use Cases
• Ponguru asked on the Internet Archive forum on May 17, 2010*:
"Hi All - I am new to Archive.org. A few quick questions
(1) Is there any API or tools available to access the Archive.org contents programmatically?
(2) Are there any research papers where Archive.org was used for data collection / analysis (e.g. studying a particular topic over time, etc.)? I digged a little bit, could not find much, so checking with the group."
*http://archive.org/post/306799/api-or-tools-to-access-research-publications-on-archiveorg
8

Lack of APIs
• Popular websites provide APIs to third-party developers.
9

Limited and non-standard APIs
• Current Web Archives have a limited set of APIs that don't
cover the user's needs.
10

Wayback Machine API
• It returns a JSON list of the available Mementos.
11

Croatian Web Archive
Full-text search web interface Full-text search APIs in JSON
12

Memento
• Memento provides the TimeMap in the CoRE link format (application/link-format).
13

Memento Terminology
URI-R (R): Original Resource, e.g. http://www.amazon.com
URI-M (M): Memento, e.g. http://web.archive.org/web/20110411070244/http://amazon.com
URI-T (TM): TimeMap
14
Van de Sompel, H., Nelson, M. L., & Sanderson, R. (2013). RFC 7089 - HTTP framework for time-based access to resource states -- Memento. Internet Engineering Task Force
(IETF). Retrieved from http://tools.ietf.org/html/rfc7089
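The relationship between a URI-M and its URI-R can be illustrated with a minimal sketch. It assumes the Wayback-style URI-M layout seen in the example above (archive prefix, 14-digit timestamp, then the URI-R); that layout is a convention of the archive, not something Memento requires, and the function name `split_uri_m` is hypothetical.

```python
import re
from datetime import datetime, timezone

# Assumed Wayback-style layout: <prefix>/web/<14-digit timestamp>/<URI-R>
WAYBACK_URI_M = re.compile(r"^https?://[^/]+/web/(?P<ts>\d{14})/(?P<uri_r>.+)$")


def split_uri_m(uri_m):
    """Return (URI-R, capture datetime in UTC) for a Wayback-style URI-M."""
    match = WAYBACK_URI_M.match(uri_m)
    if match is None:
        raise ValueError(f"not a Wayback-style URI-M: {uri_m}")
    captured = datetime.strptime(match.group("ts"), "%Y%m%d%H%M%S")
    return match.group("uri_r"), captured.replace(tzinfo=timezone.utc)


uri_r, captured = split_uri_m(
    "http://web.archive.org/web/20110411070244/http://amazon.com"
)
print(uri_r, captured.isoformat())
```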

Memento Aggregator
• Merges TimeMaps from various archives.
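As a rough illustration of the merge step, the sketch below combines per-archive TimeMaps that are each already sorted by datetime. It is not the aggregator's actual code; the function name and the `archive-a.example` / `archive-b.example` URIs are hypothetical.

```python
import heapq
from datetime import datetime


def merge_timemaps(*timemaps):
    """Merge TimeMaps (lists of (datetime, URI-M), each sorted by datetime)."""
    return list(heapq.merge(*timemaps, key=lambda entry: entry[0]))


# Hypothetical TimeMaps from two archives.
archive_a = [
    (datetime(2001, 8, 19), "http://archive-a.example/m1"),
    (datetime(2005, 1, 1), "http://archive-a.example/m2"),
]
archive_b = [
    (datetime(2003, 6, 30), "http://archive-b.example/m1"),
]

merged = merge_timemaps(archive_a, archive_b)
for when, uri_m in merged:
    print(when.date(), uri_m)
```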
15

Web Archiving as Big Data
• The Internet Archive corpus has reached 5 petabytes.
• Bibliotheca Alexandrina needs one year to recompute
the checksums for its corpus.
• Tools
Apache Pig
16

Research Question
How Can We Enrich The Web Archive Access
Interface In Conjunction With The Live Web?
Introduction  Research Questions
17

Research Questions
• What are the required services for the web archiving user
community?
• Shall we work on the web archive collection as one entity or on
different levels?
• How can we use the web archive content beyond full-text
search?
• What are the metadata fields that could enhance user
browsing?
• How can we develop an access interface to the temporal web
graph?
• How can we optimize the creation of thumbnails?
• How can we use HTTP redirection to enhance the
URI-lookup query?
• How can we optimize the query routing mechanism across the
web archives?
Introduction  Research Questions
18

WEB ARCHIVE
SERVICE FRAMEWORK
Levels and Datasets
19

Web Archive Service Framework
20

• Archive level
• Web Archive profiling to
optimize the query routing.
• URI level
• URI HTTP redirection in the
web archive URI-lookup.
• Metadata level
• ArcLink
• ArcThumb
• Content level
• ArcContent
ArcSys
21

IIPC 2010 Winter Olympics
Web Archive Service Framework  Datasets
* http://olympics.us.archive.org/olympics2010/
Size: 700+ GB
From: Nov 2009
To: Mar 2010
#URI-R: 6.4M
#URI-M: 23.7M
22

Fortune 500
• 499,540 mementos from 488
TimeMaps.
• For each Memento, we download the
HTML and capture the thumbnail using
PhantomJS.
23

DMOZ
• Open directory of URIs
based on user
submissions.
24

CONTENT SERVICE
ArcContent
25
Archive
URI
Metadata
Content

Wayback Machine URI Rewriting
Original Rewritten
Content Service
26

Response Types
Raw Response
Modified Response
Extracted Response
Content Service
27

ArcContent Architecture Diagram
Content Service
28

Extracted Response Filters
Content Service
TextContent
TFContent
29

Extracted Response Formats
Content Service
XML
JSON
30

ArcContent Applications
Content Service
TFContent
TagClouds
31

METADATA SERVICE
ArcLink & ArcThumb
32
Archive
URI
Metadata
Content

Metadata Access Service
Metadata Service
• Metadata is data about data.
• The metadata layer is data about mementos.
Type: Technical
• Content-type: Entity MIME type. Example: text/html
• Content-length: Size of the entity-body. Example: 90883
Type: Extracted
• Title: Title of the page. Example: Egypt rejoices at Mubarak departure
• Description: Description of the content of the entity-body. Example: The BBC World Affairs Editor John Simpson reflects on how Egypt brought about the overthrow of President Hosni Mubarak.
• Outgoing Links: A list of all the outlinks that the page pointed to.
Type: Derived
• Thumbnail: Thumbnail of the representation of the web page.
• Incoming Links: A list of all the inlinks that pointed to the page.
33

ArcLink
Motivation, Stages, Cost Model, Applications
34

ArcLink: optimization
techniques to build and
retrieve the temporal web
graph
A. AlSum and M. L. Nelson.
In Proceedings of the 13th annual international ACM/IEEE joint conference on Digital
libraries
JCDL '13, Indianapolis, Indiana, 2013
See also: http://arxiv.org/abs/1305.5959
35

Easily Solved Questions
Q: What are the available mementos for
www.vancouver2010.com?
Metadata Service  ArcLink  Motivation
36

Solved Questions, but hard
Q. What are the HTML titles for www.vancouver2010.com
through time?
A. Page scraping for all mementos
37

Impossible Questions
Q. What are the anchor texts that pointed to
www.vancouver2010.com through time?
…
<a href=www.vancouver2010.com >
Vancouver Olympics
</a>
….
…
Winter Olympics
</a>
…
…
Vancouver 2010
</a>
…
38

Outlinks
39

ArcLink and Temporal Web Graph
What is ArcLink?
• ArcLink is a complete system to extract, preserve, and
access the temporal web graph.
What is the Temporal Web Graph?
• The link structure through time, including inlinks and
outlinks.
WG @t1   WG @t2   TWG
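A temporal web graph can be pictured as a set of links that each carry an observation date, so inlinks can be queried for a time window. The sketch below is only an in-memory illustration under that assumption; the class name, method names, source pages (`a.example`, `b.example`, `c.example`), and dates are hypothetical and do not describe ArcLink's storage design.

```python
from collections import defaultdict
from datetime import date


class TemporalWebGraph:
    """Links annotated with the date on which they were observed."""

    def __init__(self):
        self._inlinks = defaultdict(list)   # target -> [(date, source, anchor)]
        self._outlinks = defaultdict(list)  # source -> [(date, target, anchor)]

    def add_link(self, observed, source, target, anchor_text):
        self._outlinks[source].append((observed, target, anchor_text))
        self._inlinks[target].append((observed, source, anchor_text))

    def inlinks(self, target, start=date.min, end=date.max):
        """Inlinks to target observed within [start, end], oldest first."""
        links = self._inlinks.get(target, [])
        return sorted(link for link in links if start <= link[0] <= end)

    def anchor_texts(self, target, start=date.min, end=date.max):
        return [anchor for _, _, anchor in self.inlinks(target, start, end)]


graph = TemporalWebGraph()
target = "www.vancouver2010.com"
graph.add_link(date(2009, 11, 4), "http://a.example/", target, "Vancouver Olympics")
graph.add_link(date(2010, 1, 16), "http://b.example/", target, "Winter Olympics")
graph.add_link(date(2010, 2, 13), "http://c.example/", target, "Vancouver 2010")

january = graph.anchor_texts(target, date(2010, 1, 1), date(2010, 1, 31))
print(january)
```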
40

System Stages
Metadata Service  ArcLink  Stages
41

Filtering
• Use the CDX files to select the mementos
that will contribute to the web graph.
• For example,
• Exclude non-200 HTTP status codes
• Exclude images, style sheets, videos, etc.
• Exclude duplicate mementos
• Technique: Pig Latin script on the CDX files
• Results: the CDX was reduced to 25% of the original size,
from 23.8M mementos to 6.7M mementos.
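The three filtering rules can be sketched in Python; the slides use a Pig Latin script, so this is a stand-in rather than the original job. The sketch assumes space-separated CDX lines whose first six fields are urlkey, timestamp, original URI, MIME type, status code, and digest, and it assumes "duplicate" means the same urlkey with the same digest. The sample lines are made up.

```python
EXCLUDED_MIME_PREFIXES = ("image/", "video/", "audio/", "text/css")


def filter_cdx(lines):
    """Yield CDX lines for 200-status, non-media, non-duplicate mementos."""
    seen = set()
    for line in lines:
        if not line.strip() or line.lstrip().startswith("CDX"):
            continue  # skip blank lines and the CDX header line
        fields = line.split()
        if len(fields) < 6:
            continue  # malformed line under the assumed field layout
        urlkey, _timestamp, _original, mime, status, digest = fields[:6]
        if status != "200":
            continue
        if mime.startswith(EXCLUDED_MIME_PREFIXES):
            continue
        if (urlkey, digest) in seen:
            continue  # same URI with identical content: treated as a duplicate
        seen.add((urlkey, digest))
        yield line


# Hypothetical sample input.
sample = [
    " CDX N b a m s k r V g",
    "com,vancouver2010)/ 20091104000000 http://www.vancouver2010.com/ text/html 200 AAAA - 100 sample.warc.gz",
    "com,vancouver2010)/ 20091105000000 http://www.vancouver2010.com/ text/html 200 AAAA - 200 sample.warc.gz",
    "com,vancouver2010)/logo.png 20091104000000 http://www.vancouver2010.com/logo.png image/png 200 BBBB - 300 sample.warc.gz",
    "com,vancouver2010)/old 20091104000000 http://www.vancouver2010.com/old text/html 404 CCCC - 400 sample.warc.gz",
    "com,vancouver2010)/store 20091104000000 http://www.vancouver2010.com/store text/html 200 DDDD - 500 sample.warc.gz",
]
kept = list(filter_cdx(sample))
print(len(kept), "of", len(sample) - 1, "mementos kept")
```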
42

Extraction
• Technique: Hadoop
• Step 1: URI-ID generation
• Canonicalize the URI into SURT format
• Hash the canonicalized form using SimHash
• Completely distributed
• Step 2: Define data sources
• WARC
• Web archive UI
Tasks    Input source  Map (sec)  Reduce (sec)  Total (sec)
2 tasks  Wayback       21,422     4,194         25,616
2 tasks  WARC          13,327     2,770         16,098 (62%)
5 tasks  Wayback       13,721     2,257         15,978
5 tasks  WARC           8,304     1,746         10,051 (62%)
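Step 1 (URI-ID generation) can be sketched as follows. This is a simplified approximation: real SURT canonicalization applies more rules than the ones shown, and a truncated SHA-1 digest is used here as a stand-in for the SimHash named on the slide. The function names are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def to_surt(uri):
    """Simplified SURT-like key: reversed host labels, then path and query."""
    parts = urlsplit(uri if "://" in uri else "http://" + uri)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # assumption: treat www.example.com like example.com
    reversed_host = ",".join(reversed(host.split(".")))
    path = parts.path or "/"
    query = "?" + parts.query if parts.query else ""
    return f"{reversed_host}){path}{query}"


def uri_id(uri):
    """64-bit ID from the canonical form (SHA-1 stand-in, not SimHash)."""
    digest = hashlib.sha1(to_surt(uri).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


print(to_surt("http://www.vancouver2010.com/"))
print(uri_id("http://www.vancouver2010.com/") == uri_id("www.vancouver2010.com"))
```

Because the ID depends only on the URI itself, every worker can compute it independently, which is consistent with the "completely distributed" note above.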
43

Storage
• ArcLink uses a database to store the web
graph
Insertion Performance Update Performance
44

ArcLink Response
45

ArcLink Response
46

ArcLink Response
47

Impossible Questions
Q. What are the anchor texts that pointed to
www.vancouver2010.com through time?
Metadata Service  ArcLink  Applications
48

Temporal Page Rank
Rank | Nov-2009 | Dec-2009 | Jan-2010
1 | vancouver2010.com/code | - | topsport.com/sportch/liveticker/
2 | vancouver2010.com/en/langpolicy | - | vancouver2010.com/code
3 | vancouver2010.com/forgotpassword | - | canadacode.vancouver2010.com/user/register
4 | vancouver2010.com/store | - | canadacode.vancouver2010.com
5 | vancouver2010.com/store/index.html | - | canadacode.vancouver2010.com/explore
6 | vancouver2010.com/ | - | canadacode.vancouver2010.com/user/login?destination=node/add/image
7 | canadacode.vancouver2010.com | - | canadacode.vancouver2010.com/pulse
8 | canadacode.vancouver2010.com/nfb-onf | - | canadacode.vancouver2010.com/challenge
9 | canadacode.vancouver2010.com/contact | - | i-credible.nl
10 | canadacode.vancouver2010.com/resources | - | vpzschaatsteam.nl
Rank | Feb-2010 | Mar-2010 | Collection (Nov-09 to Mar-10)
1 | monlibe.liberation.fr | monlibe.liberation.fr | monlibe.liberation.fr
2 | topsport.com/sportch/liveticker/ | laprovence.com/la-provence-le-faq-de-la-moderation | vancouver2010.com/code
3 | lefigaro.fr | get.adobe.com/flashplayer | lefigaro.fr
4 | laprovence.com/la-provence-le-faq-de-la-moderation | vancouver2010.teamgb.com/teamgb/team-behind-team-gb/filenotfound.aspx | laprovence.com/la-provence-le-faq-de-la-moderation
5 | lefigaro.fr/sport | ledauphine.com | lefigaro.fr/sport
6 | get.adobe.com/flashplayer | lefigaro.fr/economie | get.adobe.com/flashplayer
7 | lefigaro.fr/meteo | lefigaro.fr/sport | lefigaro.fr/meteo
8 | lefigaro.fr/le-talk | lefigaro.fr/actualites-a-la-une | lefigaro.fr/le-talk
9 | dosb.de/de/vancouver-2010/vancouver-ticker/detail/printer.html | lemonde.fr/cgv | topsport.com/sportch/liveticker/
10 | ledauphine.com | ffs.fr/index.php | vancouver2010.com/en/langpolicy
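The slides do not say how the monthly rankings were computed. One plausible reading, shown here purely as an assumption, is to run standard PageRank on the web graph of each month and compare the top pages. The sketch uses plain power iteration; the edge lists and page names are hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    outlinks = {node: [] for node in nodes}
    for source, target in edges:
        outlinks[source].append(target)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for source in nodes:
            # A page without outlinks spreads its rank over all pages.
            targets = outlinks[source] or nodes
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


def top_pages(edges_by_month, k=10):
    """Top-k pages by PageRank for each monthly snapshot of the graph."""
    result = {}
    for month, edges in edges_by_month.items():
        rank = pagerank(edges)
        result[month] = sorted(rank, key=rank.get, reverse=True)[:k]
    return result


# Hypothetical monthly snapshots.
snapshots = {
    "Nov-2009": [("a", "c"), ("b", "c"), ("c", "a")],
    "Jan-2010": [("a", "b"), ("c", "b"), ("b", "a")],
}
ranks = pagerank(snapshots["Nov-2009"])
leaders = top_pages(snapshots, k=1)
print(leaders)
```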
49

ArcThumb
Motivation, Feature Exploration, Selection Algorithm
50

Thumbnail Summarization
Techniques For Web
Archives
A. AlSum and M. L. Nelson.
In Proceedings of the 36th European Conference on Information Retrieval.
ECIR 2014, Amsterdam, Netherlands, 2014
51

Thumbnails in Web Archive
Metadata Service  ArcThumb  Motivation
Internet Archive UK Web Archive
52

Thumbnails Creation Challenges
• Scalability in Time
• IA may need 361 years to create a thumbnail for each memento
using one hundred machines.
• Scalability in Space
• IA will need 355 TB to store one thumbnail per memento.
• Page quality
53

Thumbnails Usage Challenges
54
• This is a partial view of 700 thumbnails out of the 10,500
available mementos for www.apple.com

From 10,500 Mementos to 69 Thumbnails.
55

How many thumbnails do we need?
Metadata Service  ArcThumb  Methodology
www.unfi.com on the live Web
56

How many thumbnails do we need?
www.unfi.com on the live Web
57

40 Thumbnails are good.
58

Visual Similarity and Text Similarity
Similar   Different
HTML Text
59

Correlation between
Visual Similarity and Text Similarity
Metadata Service  ArcThumb  Feature Exploration
SimHash DOM tree
Embedded resources Memento Datetime
60
SimHash [Charikar 2002], DOM tree [Pawlik 2011], Memento Datetime [Van de Sompel 2013]

Threshold Grouping
Metadata Service  ArcThumb  Selection Algorithms
61

Threshold Grouping
62
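The slide text gives no details of threshold grouping, so the following is only one plausible reading: compute a SimHash fingerprint per memento, walk the TimeMap in datetime order, and select a new thumbnail whenever the Hamming distance from the last selected memento exceeds a threshold. The token hashing (MD5 of whitespace-separated tokens), the function names, and the threshold value are assumptions.

```python
import hashlib


def simhash(text, bits=64):
    """Charikar-style SimHash over whitespace tokens (token hash: MD5 prefix)."""
    weights = [0] * bits
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for bit in range(bits):
            weights[bit] += 1 if (token_hash >> bit) & 1 else -1
    return sum(1 << bit for bit in range(bits) if weights[bit] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def threshold_grouping(fingerprints, threshold):
    """Indices of mementos to thumbnail, given fingerprints in datetime order."""
    selected = []
    anchor = None
    for index, fingerprint in enumerate(fingerprints):
        if anchor is None or hamming(anchor, fingerprint) > threshold:
            selected.append(index)  # page changed enough: start a new group
            anchor = fingerprint
    return selected


# Toy 4-bit fingerprints standing in for real 64-bit SimHash values.
picked = threshold_grouping([0b0000, 0b0001, 0b1111, 0b1110, 0b0000], threshold=1)
print(picked)
```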

Clustering technique
SimHash Feature SimHash and Datetime Features
63
Park, H.-S., & Jun, C.-H. (2009). A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications, 36(2, Part 2), 3336–3341.

Time Normalization
64

Selection Algorithms Comparison
                        Threshold Grouping   K clustering   Time Normalization
TimeMap Reduction       27%                  9% to 12%      23%
Image Loss              28                   78 - 101       109
# Features              1 feature            1 or more      1 feature
Preprocessing required  Yes                  Yes            No
Efficient processing    Medium               Extensive      Light
Incremental             Yes                  No             Yes
Online/offline          Both                 Both           Both
65

URI SERVICE
66
Archive
URI
Metadata
Content

ARCHIVAL HTTP
REDIRECTION RETRIEVAL
POLICIES
A. AlSum, M. L. Nelson, R. Sanderson, and H. Van de Sompel
In Proceedings of 3rd Temporal Web Analytics Workshop.
TempWeb 2013, Rio de Janeiro, Brazil
67

Live Web Redirect
http://bit.ly/r9kIfC redirects to http://www.cs.odu.edu
URI Service
% curl -I http://bit.ly/r9kIfC
HTTP/1.1 301 Moved
….
Location: http://www.cs.odu.edu/
…
68

Live Web Redirect
URI Service
R http://bit.ly/r9kIfC R http://www.cs.odu.edu
redirects to
69

Archived Web Redirect
Live web: R1 www.draculathemusical.co.uk redirects to R2 www.mosaicstudio.co.uk
Web archive: R1 has Memento http://web.archive.org/web/20020212194020/http://www.draculathemusical.co.uk/
which redirects to R3 http://web.archive.org/web/20020212194020/http://www.geocities.com/draculathemusical
URI Service
70

Experiment
• Dataset: 10,000 sample URIs from
• Dataset does not include bit.ly nor doi.
• Experiment focused on the root page (no embedded resources)
URI Service  Experiment and Results
Live HTTP status code (10,000 URI-R)
OK (200)                 82.83%
Redirection (3xx)        14.71%
  Redirection (301)       8.4%
  Redirection (302)       6.1%
  Redirection (others)    0.2%
Not-Found (4xx)           1.18%
Others                    1.28%
Memento HTTP status code (894,717 URI-M)
OK (200)                 93.46%
Redirection (3xx)         5.69%
Not-Found (4xx)           0.26%
Others                    0.59%
71

URI Stability
• URI's stability is a count of the change in HTTP responses
across time (200, 3xx, or 4xx) and the number of different
URIs in the “Location” for 3xx status code.
High Stability = 1 No Stability = 0
URI Service
72

Abstract Model
•
URI Service
M1 M2 M3
73

Timemap Redirection Categories
URI Service
• All Mementos have 200 HTTP status code.
• All Mementos have redirection to the same URI.
• All Mementos have redirection to different URIs.
• Mementos have different HTTP status codes.
74

URI Stability
TimeMap Category                    Percentage   Stability
All Mementos have OK                52%          1
Mementos have mixed status codes    36%          0.91
All Mementos have Redirection       0.92%        0.85
  Redirection to the same URI       0.62%
  Redirection to different URIs     0.30%
URI has no Mementos at all          10.97%       0
Stability in semi-log scale Stability for |TM(R)| < 300
75

Current Wayback Machine Policy
•
URI Service  Retrieval Policies
76

Policy one:
URI-R with HTTP redirection
•
Flowchart steps: retrieve the memento M for R; test Status(M) = 200; test Status(M) = 3xx.
Flowchart outcomes: Stop, or go to Policy 2.
77

Policy one:
URI-R with HTTP redirection
• Evaluation:
• Policy scope: 1,471 URIs (that have live redirection)
• 77 out of 1,471 have no mementos at all
• For 17 of those 77, mementos were retrieved by following the live
redirection
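The flowchart itself is not recoverable from the slide text, so the sketch below is based only on the evaluation bullets: when a URI-R has no mementos but redirects on the live web, the lookup is retried with the redirection target. The function name, the two callables, and the example URIs are hypothetical.

```python
def lookup_with_live_redirect(uri_r, archive_lookup, live_redirect_target):
    """Return (URI used, mementos), retrying with the live redirection target.

    archive_lookup(uri) -> list of URI-Ms (empty if none)
    live_redirect_target(uri) -> target URI, or None if uri does not redirect
    """
    mementos = archive_lookup(uri_r)
    if mementos:
        return uri_r, mementos
    target = live_redirect_target(uri_r)
    if target is None:
        return uri_r, []
    return target, archive_lookup(target)


# Hypothetical archive holdings and live-web redirections.
archive = {
    "http://new.example/": [
        "http://archive.example/20100101000000/http://new.example/"
    ],
}
live_redirects = {"http://old.example/": "http://new.example/"}

used, found = lookup_with_live_redirect(
    "http://old.example/",
    lambda uri: archive.get(uri, []),
    live_redirects.get,
)
print(used, len(found))
```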
78

Policy two:
URI-M with HTTP redirection
•
http://www.cnn.com/
Accept-Datetime: Sun, 13 May 2006
http://www.cnn.com/
79

Policy two:
URI-M with HTTP redirection
• Evaluation:
• Policy scope: 2,980 TimeMaps (that showed an HTTP redirection status code in at least one memento)
• Success criterion: using policy two contributed mementos to the original
TimeMap
• Success percentage: 58% of the cases
80

ARCHIVE SERVICE
Percentage and Distribution
81
Archive
URI
Metadata
Content

How Much Of The Web Is
Archived?
S. G. Ainsworth, A. AlSum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson
In Proceedings of the 11th annual international ACM/IEEE joint conference on
Digital libraries
JCDL '11, Ottawa, Canada 2011
82

Experiment
• 4 Sample sets – 1000 URIs each
• For each URI, we used Memento Aggregator to record the
TimeMap for this URI.
Archive Service  Percentage  Experiment
83

Archives Under Experiment
2010 2010 and 2013 2013
84

How Much of the Web is Archived?
• It Depends on Which Web…
Archive Service  Percentage  Results
2010, including SE cache   2010, excluding SE cache   2013, general
90% 79% 90%
97% 68% 95%
88% 19% 52%
35% 16% 33%
85

Profiling Web Archive
Coverage For
Top-level Domain And
Content Language
A. AlSum, M. C. Weigle, M. L. Nelson, and H. Van de Sompel
In Proceedings of the 17th International Conference on Theory and Practice of
Digital Libraries
TPDL 2013, Valletta, Malta, 2013
An extended version was invited to a special issue of IJDL.
86

Memento Aggregator
Archive Service  Distribution
87

Where can you find?
http://www.google.com/
88

Where can you find?
http://www.google.com/
89

Where can you find?
http://www.japantimes.co.jp/
90

Where can you find?
http://www.japantimes.co.jp/
91

Research Question
Problem
• We need to profile the web archives around the world with
these characteristics:
• Age
• Top-level domains
• Languages
• Growth rate
Goal
• To optimize the query routing for Memento Aggregator.
• To determine the missing parts of the web.
92

URIs Samples Sources
Web
1. DMOZ – Random sample
2. DMOZ – TLD: 200 URIs for each TLD from DMOZ (80 TLDs)
3. DMOZ – Languages: 100 URIs for each language (40 languages)
Web Archives
4. Top 1-grams from Bing
5. Top 1,000 query terms by Yahoo in 9 languages
User requests
6. IA Wayback Machine log files
7. Memento aggregator log files
* We used hostnames only
93

TLD Coverage
IA: Internet Archive; AIT: Archive-It; LoC: Library of Congress; UK: UK National Library; BL: British Library;
SG: Singapore; CAT: Web Archive of Catalonia; CR: Croatian Web Archive; CZ: Archive of the Czech Web; TW: National Taiwan University;
IC: Icelandic Web Archive; CAN: Library and Archives Canada; SI: Slovenia; WEB: WebCite; AIS: Archive.is
94

Language Coverage
IA: Internet Archive; CAN: Library and Archives Canada; PO: Portuguese Web Archive; CZ: Archive of the Czech Web;
LoC: Library of Congress; BL: British Library; CAT: Web Archive of Catalonia; TW: National Taiwan University;
IC: Icelandic Web Archive; UK: UK National Library; CR: Croatian Web Archive; AIT: Archive-It
95

Growth Rate
Stopped archiving
in 2008
Steady growth
Stopped getting new
URIs, but still crawling
96

Building Web Archive Profile
97

Web Archive Selection Evaluation
TM(R):
A1: M1, M2, M3
A2: M4, M5
A3: M6
A4: M7
A5: M8
• RecallTM@1 = 3/8 = 0.375
• RecallTM@2 = 5/8 = 0.625

Number of archives   Including IA   Excluding IA
RecallTM@3           0.96           0.647
RecallTM@6           0.98           0.83
RecallTM@9           0.998          0.983
RecallTM@12          0.999          0.987
• Total number of archives N = 15
99

Conclusions
• We proposed a new service framework that divides the web archive
corpus into four levels: Content, Metadata, URI, and Archive.
• We developed ArcContent, which extends the web archive
interface with extracted versions of the mementos based on a set of
predefined filters.
• We studied the challenges of building the temporal web graph and
developed ArcLink, a distributed system to extract, preserve, and
expose the temporal web graph.
• We studied the optimization and summarization techniques to create
the thumbnails for the web graph collections based on SimHash
fingerprints.
• We extended the concept of URI-lookup in the web archive to include
the HTTP redirection status code.
• We defined the concept of a “Web Archive Profile” to characterize the web archive
corpus, with an application to distributed search in
the Memento Aggregator.
101

Publications
• S. G. Ainsworth, A. AlSum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson. “How
much of the Web is Archived?” In Proceedings of the 11th annual international
ACM/IEEE joint conference on Digital libraries, JCDL '11, 2011.
• A. AlSum, M. L. Nelson, R. Sanderson, and H. Van de Sompel. “Archival HTTP
Redirection Retrieval Policies.” In Proceedings of 3rd Temporal Web Analytics
Workshop, TempWeb '13, 2013.
• A. AlSum, and M. L. Nelson. “ArcLink: Optimization Techniques to Build and
Retrieve the Temporal Web Graph.” In Proceedings of the 13th annual international
ACM/IEEE joint conference on Digital libraries, JCDL '13, 2013.
• A. AlSum, M. C. Weigle, M. L. Nelson, and H. Van de Sompel. “Profiling Web
Archive Coverage for Top-Level Domain and Content Language.” In Proceedings
of the 17th International Conference on Theory and Practice of Digital Libraries, TPDL
2013, 2013.
• A. AlSum, and M. L. Nelson. “Thumbnail Summarization Techniques for Web
Archives.” In Proceedings of the 36th European Conference on Information Retrieval,
ECIR '14, 2014.
102

What's next?
• Web Archiving Engineer at Stanford University.
103

WEB ARCHIVE
SERVICES FRAMEWORK
Ahmed AlSum
PhD Defense
February 2014
104
@aalsum

Memento
• Memento is an HTTP
extension to integrate the
Past and the Current
Web
I. Jacobs and N. Walsh. Architecture of the World Wide Web. Technical report, W3C, 2004. http://www.w3.org/TR/webarch/
Now
T1
T2
T3
106

Memento
• Developer and administrator for Memento aggregator and proxies
107

Memento Clients
• Memento is now an RFC (RFC 7089).
108

Lack of APIs
• US agencies have started to provide APIs for data access.
110

Web Archiving Use Cases
• Temporal navigation.
• Full text search.
• Use language filters.
• Provide raw WARC.
• Import of metadata records
into other repositories.
*IIPC Access Working Group. Use cases for Access to Internet Archives. International Internet Preservation Consortium
Publications, http://www.netpreserve.org/resources/use-cases-access-internet-archives, 2006.
111

Related Projects
Data analysis for the web data
Tools and Methods to access the web archive
Enable the user to do experiments on the raw
crawled data on Amazon S3
Enable the user to browse the present and the
past web
Introduction
112

Selection
• Decide what to capture
Everything, any domain
National domains
Delegate selection to partners
Users' favorites
• We studied what is already captured
113

URI-Based
WayBack Machine
Web Archiving Trends  Accessing Web Archive
• Textbox to enter the
requested URI.
• BubbleMap to show
you the available
mementos.
114

Collection-Based
• In addition to
browsing the
collection, you can
browse the URIs in
this collection.
115

Full-text search
• BL interface provides
different filtering
techniques for the
results.
116

Past Web Browser
• You can replay the
pages with different
controls to forward,
backward, pause and
stop.
117

Zoetrope
• Different Views
• Comparison between
different Mementos
• Not feasible on the
current web archiving
infrastructure
118

DiffIE
• A browser plug-in that
caches the pages a
person visits and
highlights how those
pages have changed
when the person
returns to them
• It is feasible for
personal archiving.
119

Synchronicity
• Mozilla Firefox add-on
that supports Internet users
in (re-)discovering
missing web pages in
real time
120

Warrick
• It’s a utility for
reconstructing or
recovering a website
when a back-up is not
available
121

ArcSys Architecture Diagram
122

WAT files
• WAT files are metadata files for WARC files
• WAT files are used to create data analysis reports based
on large datasets.
Metadata Service
123

It's More than WAT files
WAT | ArcLink
Batch process on a set of WARCs | Batch process on a set of URIs
For internal use | For public use
No way to integrate with other WAT files in other locations | Can be aggregated with other graphs
No incremental update | Supports incremental update
Access at the WAT file level using Pig | Access at the URI level using a web service
124

Cost of Scaling Up
•
Metadata Service  ArcLink  Cost model
Internet
Archive
88 hrs
108 × 10^9 mementos
247 days
500 TB
Filtering
Extraction
Storage
*Numbers based on Wayback Machine published statistics on Oct 2013 of 360B mementos with total size 5PB
125

Time-Indexed Inlinks Information
Date Anchor Text
04-Nov-09 vancouver2010.com
16-Jan-10 Vancouver 2010 Olympic Games
23-Jan-10 vancouver2010.com
23-Jan-10 2010 Vancouver Olympic Games Medals Results Schedule Sports
30-Jan-10 2010 Vancouver Olympic Games Medals Results Schedule Sports
30-Jan-10 vancouver2010.com
13-Feb-10 Vancouver 2010 Olympic Winter Games
15-Feb-10 Vancouver 2010 Olympic Games
18-Feb-10 Official Vancouver Games site
19-Feb-10 vancouver2010.com
20-Feb-10 Official Vancouver Games site
21-Feb-10 VANOC 2010
126

HTTP Redirection Relationship
between URI-R & URI-M
                                 Live Web URI-R: OK   Live Web URI-R: Redirection
Web Archive URI-M: OK            Case 1               Case 5
Web Archive URI-M: Redirection   Case 2               Cases 3, 4
Case 1
Case 2 Case 3 Case 4 Case 5
80.8%
2.74% 1.34%
1.33%
13.7%
127

• Category 1
URI Service
All Mementos have 200 HTTP status code
128

• Category 2
URI Service
All Mementos have redirection to the same URI.
129

• Category 3
URI Service
All Mementos have redirection to different URIs.
130

• Category 4
URI Service
Mementos have different HTTP status code.
131

URI Reliability
•
URI Service
M1
3xx
M2
3xx
M3
3xx
rel=original
R`M
rel=original
R`M
rel=original
R`M
? ? ?200 404 3xx
133

Summary
• Quantitative study with 10,000 URIs.
• 48% were not fully stable through time.
• 27% were not perfectly reliable through time.
• New archival retrieval policy:
• Policy one: successfully retrieved mementos for 17 out of 77.
• Policy two: Expanded the TimeMap for 58% of cases.
134

URI Reliability
• 23% of the mementos did not lead to a successful
memento at the end.
Reliability in semi-log scale   Reliability for |TM(R)| < 300
135

Experiment
• For each sample set, we used Memento
Aggregator to get all the possible archived
copies (Mementos).
• For each URI, Memento Aggregator
responded with TimeMap for this URI.
Example
<http://memento.waybackmachine.org/memento/20010819194233/http://jcdl2002.org>;rel="first memento";datetime="Sun, 19 Aug 2001 19:42:33 GMT",
<http://memento.waybackmachine.org/memento/20011216220248/http://jcdl2002.org>;rel="memento";datetime="Sun, 16 Dec 2001 22:02:48 GMT",
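A TimeMap fragment like the example above can be read with a short parser. This is a minimal sketch that handles only the simple link-format shape shown on the slide (a URI in angle brackets followed by quoted parameters); it is not a complete link-format parser, and the function name is hypothetical.

```python
import re
from email.utils import parsedate_to_datetime

LINK = re.compile(r"<(?P<uri>[^>]+)>\s*;\s*(?P<params>[^<]*)")
PARAM = re.compile(r'(?P<key>[\w-]+)\s*=\s*"(?P<value>[^"]*)"')


def parse_timemap(text):
    """Return [(datetime, URI-M)] for every memento link, oldest first."""
    mementos = []
    for link in LINK.finditer(text):
        params = {
            match.group("key"): match.group("value")
            for match in PARAM.finditer(link.group("params"))
        }
        relations = params.get("rel", "").split()
        if "memento" in relations and "datetime" in params:
            when = parsedate_to_datetime(params["datetime"])
            mementos.append((when, link.group("uri")))
    return sorted(mementos)


timemap_text = """
<http://memento.waybackmachine.org/memento/20011216220248/http://jcdl2002.org>;rel="memento";datetime="Sun, 16 Dec 2001 22:02:48 GMT",
<http://memento.waybackmachine.org/memento/20010819194233/http://jcdl2002.org>;rel="first memento";datetime="Sun, 19 Aug 2001 19:42:33 GMT",
"""
parsed = parse_timemap(timemap_text)
for when, uri_m in parsed:
    print(when.isoformat(), uri_m)
```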
136

1000 URIs Ordered by First Observation Date
See also: http://ws-dl.blogspot.com/2011/06/2011-06-23-how-much-of-web-is-archived.html
137

2010
2013
138

2010 2013
139

2010 2013
140

2010 2013
141

URIs Samples Sources –
Live Web
1. DMOZ – Random sample
• 10,000 URIs randomly sampled from the DMOZ directory (~5M URIs).
2. DMOZ – TLD: 200 URIs for each TLD
• 80 TLDs.
3. DMOZ – Languages: 100 URIs for each language
• 40 languages.
142

Web Archive
• Query the full-text search interface of the web archives
with two sets of query terms.
4. Top 1-grams from Bing
• Most of them are English
5. Top 1,000 query terms by Yahoo in 9 languages
• We excluded general keywords such as Obama and
Facebook.
143

User requests
• Sampling from user requests for web archived
materials
6. Sample from IA Wayback Machine log files
• 10,000 URIs randomly sampled from Feb 22, 2012 to Feb 26,
2012.
7. Sample from Memento aggregator log files
• 1,000 URIs randomly sampled from the LANL Memento Aggregator
between 2011 and 2013.
144

General Coverage
145

iTunes cover application
149

"Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation.
