WEB ARCHIVE
SERVICES FRAMEWORK
FOR TIGHTER INTEGRATION
BETWEEN THE PAST AND PRESENT WEB
Ahmed AlSum
PhD Defense
February 2014
Committee Members:
• Michael L. Nelson
• Michele C. Weigle
• Hussein M. Abdel-Wahab
• M'Hammad Abdous
• Herbert Van de Sompel
Old Dominion University Computer Science Department
1
Domain
Contribution
Goal
2
Outline
• Introduction
• Web Archiving Services Framework
• Content Service
• Metadata Service
• URI Service
• Archive Service
• Conclusions
3
INTRODUCTION
Motivation and Research Questions
4
What is a Web Archive?
Introduction  Motivation
http://www.cs.odu.edu
5
Who Uses Web Archives, and How?
• Politicians
• Journalists
• Web designers
• Historians
• Researchers
• Social scientists
• Curious users
Introduction  Motivation
6
*IIPC Access Working Group 2006, Costa 2010, Dougherty 2010, Stirling 2011, Smith 2009
Web archive interfaces are limited
Introduction  Motivation
7
Web Archiving Use Cases
• Ponguru asked on the Internet Archive forum on May 17, 2010*:
• "Hi All - I am new to Archive.org. A few quick questions:
(1) Is there any API or tools available to access the Archive.org contents programmatically?
(2) Are there any research papers where Archive.org was used for data collection / analysis (e.g. studying a particular topic over time, etc.)? I digged a little bit, could not find much, so checking with the group."
Introduction  Motivation
*http://archive.org/post/306799/api-or-tools-to-access-research-publications-on-archiveorg
8
Lack of APIs
• Popular websites provide APIs to third-party developers.
Introduction  Motivation
9
Limited and non-standard APIs
• Current web archives offer a limited set of APIs that don't cover users' needs.
Introduction  Motivation
10
Wayback Machine API
Introduction  Motivation
• It provides a JSON interface for the list of available Mementos.
11
Croatian Web Archive
Introduction  Motivation
Full-text search web interface Full-text search APIs in JSON
12
Memento
Introduction  Motivation
• Memento provides TimeMaps in the application/link-format (CoRE Link Format) serialization.
13
Memento Terminology
Introduction  Motivation
URI-R, R
URI-M, M
URI-T, TM
http://www.amazon.com
http://web.archive.org/web/20110411070244/http://amazon.com
Original Resource
Memento
TimeMap
14
Van de Sompel, H., Nelson, M. L., & Sanderson, R. (2013). RFC 7089 - HTTP framework for time-based access to resource states -- Memento. Internet Engineering Task Force
(IETF). Retrieved from http://tools.ietf.org/html/rfc7089
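The relationship between the three terms can be illustrated with a minimal Python sketch that splits the URI-M shown above into its Memento datetime and its URI-R. The pattern `<host>/web/<14-digit timestamp>/<URI-R>` is an assumption based on the Wayback Machine example on this slide; other archives may lay out their URI-Ms differently, and the function name `split_uri_m` is hypothetical.

```python
import re
from datetime import datetime, timezone

# Assumed Wayback-style layout: <scheme>://<host>/web/<YYYYMMDDhhmmss>/<URI-R>
URI_M_PATTERN = re.compile(r"^https?://[^/]+/web/(?P<ts>\d{14})/(?P<uri_r>.+)$")


def split_uri_m(uri_m):
    """Return (memento_datetime, uri_r) for a Wayback-style URI-M."""
    match = URI_M_PATTERN.match(uri_m)
    if match is None:
        raise ValueError(f"not a Wayback-style URI-M: {uri_m}")
    # The 14-digit timestamp is interpreted as UTC.
    memento_datetime = datetime.strptime(match.group("ts"), "%Y%m%d%H%M%S")
    memento_datetime = memento_datetime.replace(tzinfo=timezone.utc)
    return memento_datetime, match.group("uri_r")


memento_datetime, uri_r = split_uri_m(
    "http://web.archive.org/web/20110411070244/http://amazon.com"
)
print(memento_datetime.isoformat(), uri_r)
```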
Memento Aggregator
• Merges TimeMaps from various archives.
Introduction  Motivation
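As a rough illustration of the merge step, the sketch below combines per-archive TimeMaps into one list ordered by Memento datetime. It assumes each TimeMap is already sorted and is represented as a list of `(datetime, URI-M)` tuples; the archive hostnames are placeholders, and this is not the aggregator's actual implementation.

```python
import heapq
from datetime import datetime


def merge_timemaps(*timemaps):
    """Merge already-sorted per-archive TimeMaps into one list sorted by datetime."""
    return list(heapq.merge(*timemaps, key=lambda entry: entry[0]))


# Hypothetical TimeMaps from two archives (placeholder hostnames).
archive_a = [
    (datetime(2001, 8, 19), "http://archive-a.example/20010819/http://jcdl2002.org"),
    (datetime(2005, 3, 1), "http://archive-a.example/20050301/http://jcdl2002.org"),
]
archive_b = [
    (datetime(2001, 12, 16), "http://archive-b.example/20011216/http://jcdl2002.org"),
]

merged = merge_timemaps(archive_a, archive_b)
for when, uri_m in merged:
    print(when.date(), uri_m)
```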
15
Web Archiving as Big Data
• The Internet Archive corpus has reached 5 petabytes.
• Bibliotheca Alexandrina needs one year to recompute the checksums for its corpus.
• Tools
Introduction  Motivation
Apache Pig
16
Research Question
How can we enrich the web archive access interface in conjunction with the live web?
Introduction  Research Questions
17
Research Questions
• What are the required services for the web archiving user
community?
• Should we work on the web archive collection as one entity or at different levels?
• How can we use the web archive content beyond full-text search?
• What are the metadata fields that could enhance user browsing?
• How can we develop an access interface to the temporal web graph?
• How can we optimize the creation of thumbnails?
• How can we use HTTP redirection to enhance the URI-lookup query?
• How can we optimize the query routing mechanism across the
web archives?
Introduction  Research Questions
18
WEB ARCHIVE
SERVICE FRAMEWORK
Levels and Datasets
19
Web Archive Service Framework
Web Archive Service Framework
20
• Archive level
• Web Archive profiling to
optimize the query routing.
• URI level
• URI HTTP redirection in the
web archive URI-lookup.
• Metadata level
• ArcLink
• ArcThumb
• Content level
• ArcContent
Web Archive Service Framework
ArcSys
21
IIPC 2010 Winter Olympics
Web Archive Service Framework  Datasets
* http://olympics.us.archive.org/olympics2010/
Size 700+GB
From Nov 2009
To Mar 2010
#URI-R 6.4M
#URI-M 23.7M
22
Fortune 500
• 499,540 mementos from 488
TimeMaps.
• For each Memento, we download the
HTML and capture the thumbnail using
PhantomJS.
Web Archive Service Framework  Datasets
23
DMOZ
Web Archive Service Framework  Datasets
• Open directory of URIs based on user submissions.
24
CONTENT SERVICE
ArcContent
25
Archive
URI
Metadata
Content
Wayback Machine URI Rewriting
Original Rewritten
Content Service
26
Response Types
Raw Response
Modified Response
Extracted Response
Content Service
27
ArcContent Architecture Diagram
Content Service
28
Extracted Response Filters
Content Service
TextContent
TFContent
29
Extracted Response Formats
Content Service
XML
JSON
30
ArcContent Applications
Content Service
TFContent
TagClouds
31
METADATA SERVICE
ArcLink & ArcThumb
32
Archive
URI
Metadata
Content
Metadata Access Service
Metadata Service
• Metadata is data about data.
• Metadata layer is data about mementos.
Type | Field | Description | Example
Technical | Content-type | Entity MIME type. | text/html
Technical | Content-length | Size of the entity-body. | 90883
Extracted | Title | Title of the page. | Egypt rejoices at Mubarak departure
Extracted | Description | Description of the content of the entity-body. | The BBC World Affairs Editor John Simpson reflects on how Egypt brought about the overthrow of President Hosni Mubarak.
Extracted | Outgoing Links | A list of all the outlinks that the page pointed to. |
Derived | Thumbnail | Thumbnail of the representation of the web page. |
Derived | Incoming Links | A list of all the inlinks that pointed to the page. |
33
ArcLink
Motivation, Stages, Cost Model, Applications
34
ArcLink: optimization
techniques to build and
retrieve the temporal web
graph
A. AlSum and M. L. Nelson.
In Proceedings of the 13th annual international ACM/IEEE joint conference on Digital libraries
JCDL '13, Indianapolis, Indiana, 2013
See also: http://arxiv.org/abs/1305.5959
35
Easily Solved Questions
Q: What are the available mementos for
www.vancouver2010.com?
Metadata Service  ArcLink  Motivation
36
Solvable Questions, but Hard
Q. What are the HTML titles for www.vancouver2010.com
through time?
A. Page scraping for all mementos
Metadata Service  ArcLink  Motivation
37
Impossible Questions
Q. What anchor texts pointed to www.vancouver2010.com through time?
Metadata Service  ArcLink  Motivation
…
<a href=www.vancouver2010.com >
Vancouver Olympics
</a>
….
…
<a href=www.vancouver2010.com >
Winter Olympics
</a>
…
…
<a href=www.vancouver2010.com >
Vancouver 2010
</a>
…
38
Outlinks
Metadata Service  ArcLink  Motivation
39
ArcLink and Temporal Web Graph
What is ArcLink?
• ArcLink is a complete system to extract, preserve, and access the Temporal Web Graph.
What is the Temporal Web Graph?
• The link structure through time, including inlinks and outlinks.
Metadata Service  ArcLink  Motivation
WG @t1, WG @t2, TWG
40
System Stages
Metadata Service  ArcLink  Stages
41
Filtering
• Using CDX files to filter the URI to select the mementos
that will contribute to the Web Graph.
• For example,
• Exclude non-200 HTTP status code
• Exclude Images, style-sheets, videos, etc
• Exclude duplicate mementos
• Technique: Using Pig Latin script on CDX files
• Results: CDX was reduced to 25% of the original size,
from 23.8M mementos to 6.7M mementos.
Metadata Service  ArcLink  Stages
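The filtering rules above can be sketched in a few lines. This example is written in Python rather than the Pig Latin script the slide mentions, and it works on a simplified, hypothetical record type (`CdxRecord`) rather than a real CDX line. Keeping only `text/html` and de-duplicating on `(uri, digest)` are assumptions about how "exclude images, style-sheets, videos" and "exclude duplicate mementos" might be applied.

```python
from typing import Iterable, Iterator, NamedTuple


class CdxRecord(NamedTuple):
    """Hypothetical, simplified subset of the fields found in a CDX line."""
    uri: str
    timestamp: str
    mime: str
    status: int
    digest: str


def filter_cdx(records: Iterable[CdxRecord]) -> Iterator[CdxRecord]:
    """Yield only the mementos that would contribute to the web graph."""
    seen = set()
    for record in records:
        if record.status != 200:          # exclude non-200 HTTP status codes
            continue
        if record.mime != "text/html":    # exclude images, style sheets, videos, etc.
            continue
        key = (record.uri, record.digest)
        if key in seen:                   # exclude duplicate mementos (same content digest)
            continue
        seen.add(key)
        yield record


sample = [
    CdxRecord("vancouver2010.com/", "20091104000000", "text/html", 200, "AAA"),
    CdxRecord("vancouver2010.com/", "20091111000000", "text/html", 200, "AAA"),
    CdxRecord("vancouver2010.com/logo.png", "20091104000000", "image/png", 200, "BBB"),
    CdxRecord("vancouver2010.com/old", "20091104000000", "text/html", 404, "CCC"),
    CdxRecord("vancouver2010.com/", "20100116000000", "text/html", 200, "DDD"),
]
kept = list(filter_cdx(sample))
print(len(kept), "of", len(sample), "records kept")
```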
42
Extraction
• Technique: Hadoop
• Step 1: URI-ID generation
• Canonicalize the URI into SURT format
• Hash the canonicalized format using SimHash
• Completely distributed
• Step 2: Define data sources
• WARC
• Web archive UI
Metadata Service  ArcLink  Stages
Tasks | Input source | Map (sec) | Reduce (sec) | Total (sec)
2 | Wayback | 21,422 | 4,194 | 25,616
2 | WARC | 13,327 | 2,770 | 16,098 (62%)
5 | Wayback | 13,721 | 2,257 | 15,978
5 | WARC | 8,304 | 1,746 | 10,051 (62%)
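The canonicalization in Step 1 can be illustrated with a deliberately simplified sketch. The function `simplified_surt` is hypothetical: it only lowercases the host, drops a leading `www.`, and reverses the host labels. Real SURT canonicalizers apply more rules than this, and the hashing step that produces the URI-ID is left out.

```python
from urllib.parse import urlsplit


def simplified_surt(uri):
    """Return a simplified SURT-like key: reversed host labels, then path and query."""
    parts = urlsplit(uri)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    reversed_host = ",".join(reversed(host.split(".")))
    path = parts.path or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{reversed_host}){path}{query}"


for uri in (
    "http://www.vancouver2010.com/code",
    "http://canadacode.vancouver2010.com/",
):
    print(uri, "->", simplified_surt(uri))
```

Because the host labels are reversed, keys for the same domain and its subdomains sort next to each other, which is convenient when URIs are grouped in a distributed job.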
43
Storage
• ArcLink uses a database to store the web graph.
Metadata Service  ArcLink  Stages
Insertion Performance Update Performance
44
ArcLink Response
Metadata Service  ArcLink  Stages
45
Impossible Questions
Q. What anchor texts pointed to www.vancouver2010.com through time?
Metadata Service  ArcLink  Applications
48
Temporal PageRank
Metadata Service  ArcLink  Applications
Rank | Nov-2009 | Dec-2009 | Jan-2010
1 | vancouver2010.com/code | - | topsport.com/sportch/liveticker/
2 | vancouver2010.com/en/langpolicy | - | vancouver2010.com/code
3 | vancouver2010.com/forgotpassword | - | canadacode.vancouver2010.com/user/register
4 | vancouver2010.com/store | - | canadacode.vancouver2010.com
5 | vancouver2010.com/store/index.html | - | canadacode.vancouver2010.com/explore
6 | vancouver2010.com/ | - | canadacode.vancouver2010.com/user/login?destination=node/add/image
7 | canadacode.vancouver2010.com | - | canadacode.vancouver2010.com/pulse
8 | canadacode.vancouver2010.com/nfb-onf | - | canadacode.vancouver2010.com/challenge
9 | canadacode.vancouver2010.com/contact | - | i-credible.nl
10 | canadacode.vancouver2010.com/resources | - | vpzschaatsteam.nl

Rank | Feb-2010 | Mar-2010 | Collection (Nov-09 to Mar-10)
1 | monlibe.liberation.fr | monlibe.liberation.fr | monlibe.liberation.fr
2 | topsport.com/sportch/liveticker/ | laprovence.com/la-provence-le-faq-de-la-moderation | vancouver2010.com/code
3 | lefigaro.fr | get.adobe.com/flashplayer | lefigaro.fr
4 | laprovence.com/la-provence-le-faq-de-la-moderation | vancouver2010.teamgb.com/teamgb/team-behind-team-gb/filenotfound.aspx | laprovence.com/la-provence-le-faq-de-la-moderation
5 | lefigaro.fr/sport | ledauphine.com | lefigaro.fr/sport
6 | get.adobe.com/flashplayer | lefigaro.fr/economie | get.adobe.com/flashplayer
7 | lefigaro.fr/meteo | lefigaro.fr/sport | lefigaro.fr/meteo
8 | lefigaro.fr/le-talk | lefigaro.fr/actualites-a-la-une | lefigaro.fr/le-talk
9 | dosb.de/de/vancouver-2010/vancouver-ticker/detail/printer.html | lemonde.fr/cgv | topsport.com/sportch/liveticker/
10 | ledauphine.com | ffs.fr/index.php | vancouver2010.com/en/langpolicy
49
ArcThumb
Motivation, Feature Exploration, Selection Algorithm
50
Thumbnail Summarization
Techniques For Web
Archives
A. AlSum and M. L. Nelson.
In Proceedings of the 36th European Conference on Information Retrieval.
ECIR 2014, Amsterdam, Netherlands, 2014
51
Thumbnails in Web Archive
Metadata Service  ArcThumb  Motivation
Internet Archive UK Web Archive
52
Thumbnails Creation Challenges
• Scalability in Time
• IA may need 361 years to create a thumbnail for each memento using one hundred machines.
• Scalability in Space
• IA will need 355 TB to store one thumbnail per memento.
• Page quality
Metadata Service  ArcThumb  Motivation
53
Thumbnails Usage Challenges
54
Metadata Service  ArcThumb  Motivation
• This is a partial view of 700 thumbnails out of the 10,500 available mementos for www.apple.com
From 10,500 Mementos to 69 Thumbnails.
Metadata Service  ArcThumb  Motivation
55
How many thumbnails do we need?
Metadata Service  ArcThumb  Methodology
www.unfi.com on the live Web
56
40 Thumbnails are good.
Metadata Service  ArcThumb  Methodology
58
Visual Similarity and Text Similarity
Metadata Service  ArcThumb  Methodology
Similar / Different
HTML Text
59
Correlation between
Visual Similarity and Text Similarity
Metadata Service  ArcThumb  Feature Exploration
SimHash DOM tree
Embedded resources Memento Datetime
60
SimHash [Charikar 2002], DOM tree [Pawlik 2011], Memento Datetime [Van de Sompel 2013]
Threshold Grouping
Metadata Service  ArcThumb  Selection Algorithms
61
Threshold Grouping
Metadata Service  ArcThumb  Selection Algorithms
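The slides do not spell out the threshold grouping procedure, so the following is only a sketch under stated assumptions: mementos are visited in TimeMap order, each is compared with the most recently selected memento using the Hamming distance between SimHash fingerprints, and a new thumbnail is selected only when the distance exceeds a threshold. The `simhash` helper is a plain word-level SimHash built on MD5 token hashes; the tokenization, the 64-bit width, and the threshold value are all illustrative choices.

```python
import hashlib


def simhash(text, bits=64):
    """Compute a word-level SimHash fingerprint of the given text."""
    weights = [0] * bits
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")  # 64 bits of the token hash
        for i in range(bits):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def threshold_grouping(fingerprints, threshold):
    """Return indexes of mementos selected for thumbnail creation."""
    selected = []
    for index, fingerprint in enumerate(fingerprints):
        # Start a new group when the page moved far enough from the last selected one.
        if not selected or hamming(fingerprint, fingerprints[selected[-1]]) > threshold:
            selected.append(index)
    return selected


# Toy 4-bit fingerprints in TimeMap order (illustrative values only).
toy = [0b0000, 0b0001, 0b1111, 0b1110]
print(threshold_grouping(toy, threshold=2))
```

Because each memento is compared only with the last selected one, the selection can be extended incrementally as new mementos are added to the TimeMap.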
62
Clustering technique
Metadata Service  ArcThumb  Selection Algorithms
SimHash Feature SimHash and Datetime Features
63
Park, H.-S., & Jun, C.-H. (2009). A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications, 36(2, Part 2), 3336–3341.
Time Normalization
Metadata Service  ArcThumb  Selection Algorithms
64
Selection Algorithms Comparison
Criterion | Threshold Grouping | K clustering | Time Normalization
TimeMap Reduction | 27% | 9% to 12% | 23%
Image Loss | 28 | 78 - 101 | 109
# Features | 1 feature | 1 or more | 1 feature
Preprocessing required | Yes | Yes | No
Efficient processing | Medium | Extensive | Light
Incremental | Yes | No | Yes
Online/offline | Both | Both | Both
Metadata Service  ArcThumb  Selection Algorithms
65
URI SERVICE
66
Archive
URI
Metadata
Content
ARCHIVAL HTTP
REDIRECTION RETRIEVAL
POLICIES
A. AlSum, M. L. Nelson, R. Sanderson, and H. Van de Sompel
In Proceedings of 3rd Temporal Web Analytics Workshop.
TempWeb 2013, Rio de Janeiro, Brazil
67
Live Web Redirect
http://bit.ly/r9kIfC redirects to http://www.cs.odu.edu
URI Service
% curl -I http://bit.ly/r9kIfC
HTTP/1.1 301 Moved
….
Location: http://www.cs.odu.edu/
…
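The same check can be sketched in Python without touching the network by parsing the text of a HEAD response. The helper `parse_head_response` is hypothetical, and the sample text contains only the two header lines visible on the slide.

```python
def parse_head_response(raw):
    """Return (status_code, headers) parsed from the raw text of an HTTP response head."""
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    status_code = int(lines[0].split()[1])       # e.g. "HTTP/1.1 301 Moved" -> 301
    headers = {}
    for line in lines[1:]:
        if ":" in line:
            name, value = line.split(":", 1)
            headers[name.strip().lower()] = value.strip()
    return status_code, headers


raw = """HTTP/1.1 301 Moved
Location: http://www.cs.odu.edu/
"""
status_code, headers = parse_head_response(raw)
is_redirect = 300 <= status_code < 400
print(status_code, is_redirect, headers.get("location"))
```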
68
Live Web Redirect
URI Service
R http://bit.ly/r9kIfC R http://www.cs.odu.edu
redirects to
69
R1 www.draculathemusical.co.uk R2 www.mosaicstudio.co.uk
R1
http://web.archive.org/web/20020212194020/http://www.draculathemusical.co.uk/
R3
http://web.archive.org/web/20020212194020/http://www.geocities.com/draculathemusical
Web Archive / Live Web
redirects to
redirects to
has Memento
Archived Web Redirect
URI Service
70
Experiment
• Dataset: 10,000 sample URIs from
• The dataset includes neither bit.ly nor DOI URIs.
• Experiment focused on the root page (no embedded resources)
URI Service  Experiment and Results
Live HTTP status code (10,000 URI-R) | Percentage
OK (200) | 82.83%
Redirection (3xx) | 14.71%
Redirection (301) | 8.4%
Redirection (302) | 6.1%
Redirection (others) | 0.2%
Not-Found (4xx) | 1.18%
Others | 1.28%

Memento HTTP status code (894,717 URI-M) | Percentage
OK (200) | 93.46%
Redirection (3xx) | 5.69%
Not-Found (4xx) | 0.26%
Others | 0.59%
71
URI Stability
• A URI's stability is based on the count of changes in its HTTP responses across time (200, 3xx, or 4xx) and the number of different URIs in the "Location" header for 3xx status codes.
High Stability = 1 No Stability = 0
URI Service
72
Abstract Model
•
URI Service
M1 M2 M3
73
Timemap Redirection Categories
URI Service
• All Mementos have 200 HTTP status code.
• All Mementos have redirection to the same URI.
• All Mementos have redirection to different URIs.
• Mementos have different HTTP status codes.
74
URI Stability
URI Service  Experiment and Results
TimeMap Category | Percentage | Stability
All Mementos have OK | 52% | 1
Mementos have mixed status codes | 36% | 0.91
All Mementos have Redirection | 0.92% | 0.85
Redirection to the same URI | 0.62% |
Redirection to different URIs | 0.30% |
URI has no Mementos at all | 10.97% | 0
Figure panels: Stability in semi-log scale; Stability for |TM(R)| < 300
75
Current Wayback Machine Policy
•
URI Service  Retrieval Policies
76
Policy one:
URI-R with HTTP redirection
•
URI Service  Retrieval Policies
Flowchart: Retrieve the memento M for R; decision nodes Status(M) = 200 and Status(M) = 3xx (Yes/No branches); outcomes Stop, Go to Policy 2, Stop.
77
Policy one:
URI-R with HTTP redirection
• Evaluation:
• Policy scope: 1,471 URIs (that have live redirection)
• 77 out of 1,471 have no mementos at all
• For 17 out of 77, mementos were retrieved by following the live redirection
URI Service  Retrieval Policies
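One reading of policy one, based on the evaluation above, is: when the archive has no mementos for R but R redirects on the live web, look up the redirection target in the archive instead. The sketch below encodes that reading only; the two lookup callables and the example URIs are placeholders, and the full flowchart on the slide may contain branches not captured here.

```python
def policy_one(uri_r, archive_lookup, live_redirect_target):
    """Return (uri_used, mementos), falling back to the live redirection target of uri_r.

    archive_lookup(uri) -> list of mementos for uri (hypothetical callable)
    live_redirect_target(uri) -> target URI if uri redirects on the live web, else None
    """
    mementos = archive_lookup(uri_r)
    if mementos:
        return uri_r, mementos
    target = live_redirect_target(uri_r)
    if target is None:
        return uri_r, []
    return target, archive_lookup(target)


# Stub data standing in for an archive index and the live web (placeholder URIs).
archive = {"http://new.example.org/": ["memento-2009", "memento-2011"]}
redirects = {"http://old.example.org/": "http://new.example.org/"}

uri_used, found = policy_one(
    "http://old.example.org/",
    archive_lookup=lambda uri: archive.get(uri, []),
    live_redirect_target=redirects.get,
)
print(uri_used, found)
```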
78
Policy two:
URI-M with HTTP redirection
•
URI Service  Retrieval Policies
http://www.cnn.com/
Accept-Datetime: Sat, 13 May 2006
http://www.cnn.com/
79
Policy two:
URI-M with HTTP redirection
• Evaluation:
• Policy scope: 2,980 TimeMaps (that showed an HTTP redirection status code in at least one memento)
• Success criterion: using policy two contributed additional mementos to the original TimeMap
• Success percentage: 58% of the cases
URI Service  Retrieval Policies
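A possible reading of policy two is sketched below: for every memento in the TimeMap that was archived as a redirection, the mementos of the redirection target are added to the TimeMap. The dictionary layout (`uri_m`, `status`, `location`), the lookup callable, and the example URIs are assumptions made for illustration, not the data model used in the study.

```python
def policy_two(timemap, archive_lookup):
    """Return the TimeMap expanded with mementos of archived redirection targets."""
    expanded = list(timemap)
    seen = {memento["uri_m"] for memento in timemap}
    for memento in timemap:
        is_redirect = 300 <= memento["status"] < 400
        if is_redirect and memento.get("location"):
            for extra in archive_lookup(memento["location"]):
                if extra["uri_m"] not in seen:
                    seen.add(extra["uri_m"])
                    expanded.append(extra)
    return expanded


# Stub TimeMap and archive index with placeholder URIs.
original = [
    {"uri_m": "arch/2006/http://a.example/", "status": 302, "location": "http://b.example/"},
    {"uri_m": "arch/2007/http://a.example/", "status": 200, "location": None},
]
index = {
    "http://b.example/": [
        {"uri_m": "arch/2006/http://b.example/", "status": 200, "location": None},
    ]
}
expanded = policy_two(original, lambda uri: index.get(uri, []))
contributed = len(expanded) > len(original)
print(len(original), "->", len(expanded), "mementos")
```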
80
ARCHIVE SERVICE
Percentage and Distribution
81
Archive
URI
Metadata
Content
How Much Of The Web Is
Archived?
S. G. Ainsworth, A. AlSum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson
In Proceedings of the 11th annual international ACM/IEEE joint conference on
Digital libraries
JCDL '11, Ottawa, Canada 2011
See also: http://arxiv.org/abs/1212.6177
82
Experiment
• 4 Sample sets – 1000 URIs each
• For each URI, we used Memento Aggregator to record the
TimeMap for this URI.
Archive Service  Percentage  Experiment
83
Archives Under Experiment
2010 2010 and 2013 2013
Archive Service  Percentage  Experiment
84
How Much of the Web is Archived?
• It Depends on Which Web…
Archive Service  Percentage  Results
2010, including SE cache | 2010, excluding SE cache | 2013, general
90% | 79% | 90%
97% | 68% | 95%
88% | 19% | 52%
35% | 16% | 33%
85
Profiling Web Archive
Coverage For
Top-level Domain And
Content Language
A. AlSum, M. C. Weigle, M. L. Nelson, and H. Van de Sompel
In Proceedings of the 17th International Conference on Theory and Practice of
Digital Libraries
TPDL 2013, Valletta, Malta, 2013
An extended version was invited to a special issue of IJDL.
See also: http://arxiv.org/abs/1309.4008
86
Memento Aggregator
Archive Service  Distribution
87
Where can you find?
Archive Service  Distribution
http://www.google.com/
88
Where can you find?
Archive Service  Distribution
http://www.japantimes.co.jp/
90
Research Question
Problem
• We need to profile the web archives around the world with
these characteristics:
• Age
• Top-level domains
• Languages
• Growth rate
Goal
• To optimize the query routing for Memento Aggregator.
• To determine the missing parts of the web.
Archive Service  Distribution
92
URIs Samples Sources
Archive Service  Distribution
Live Web
1. DMOZ – Random sample
2. DMOZ – TLD: 200 URIs for each TLD from DMOZ (80 TLDs)
3. DMOZ – Languages: 100 URIs for each language (40 languages)
Web Archives
4. Top 1-grams from Bing
5. Top 1,000 query terms from Yahoo in 9 languages
User requests
6. IA Wayback Machine log files
7. Memento Aggregator log files
* We used hostnames only
93
TLD Coverage
Archive Service  Distribution
Legend: IA: Internet Archive; AIT: Archive-It; LoC: Library of Congress; UK: UK National Library; BL: British Library; SG: Singapore; CAT: Web Archive of Catalonia; CR: Croatian Web Archive; CZ: Archive of the Czech Web; TW: National Taiwan University; IC: Icelandic Web Archive; CAN: Library and Archives Canada; SI: Slovenia; WEB: WebCite; AIS: Archive.is
94
Language Coverage
Archive Service  Distribution
Legend: IA: Internet Archive; CAN: Library and Archives Canada; PO: Portuguese Web Archive; CZ: Archive of the Czech Web; LoC: Library of Congress; BL: British Library; CAT: Web Archive of Catalonia; TW: National Taiwan University; IC: Icelandic Web Archive; UK: UK National Library; CR: Croatian Web Archive; AIT: Archive-It
95
Growth Rate
Archive Service  Distribution
Stopped archiving
in 2008
Steady growth
Stopped getting new
URIs, but still crawling
96
Building Web Archive Profile
Archive Service  Distribution
97
Web Archive Selection Evaluation
Archive Service  Distribution
TM(R):
A1: M1, M2, M3
A2: M4, M5
A3: M6
A4: M7
A5: M8
• RecallTM@1 = 3/8 = 0.375
• RecallTM@2 = 5/8 = 0.625
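The two recall values in this example can be reproduced with a short calculation. The sketch assumes the archives are queried in the ranked order A1 to A5 and that RecallTM@k is the fraction of the full TimeMap returned by the top k archives; the function name `recall_tm_at_k` is hypothetical.

```python
def recall_tm_at_k(mementos_per_archive, k):
    """Fraction of all mementos in TM(R) held by the top-k ranked archives."""
    total = sum(mementos_per_archive)
    return sum(mementos_per_archive[:k]) / total


# Memento counts for A1..A5 from the example: 3, 2, 1, 1, 1 (8 in total).
counts = [3, 2, 1, 1, 1]
recall_at_1 = recall_tm_at_k(counts, 1)
recall_at_2 = recall_tm_at_k(counts, 2)
print(recall_at_1, recall_at_2)
```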
98
Web Archive Selection Evaluation
Archive Service  Distribution
Number of archives | Including IA | Excluding IA
RecallTM@3 | 0.96 | 0.647
RecallTM@6 | 0.98 | 0.83
RecallTM@9 | 0.998 | 0.983
RecallTM@12 | 0.999 | 0.987
• Total number of archives N = 15
99
CONCLUSIONS
100
Conclusions
• We proposed a new service framework that divides the web archive
corpus into four levels: Content, Metadata, URI, and Archive.
• We developed ArcContent, which supports the web archive interface with extracted versions of the mementos based on a set of predefined filters.
• We studied the challenges of building the temporal web graph and
developed ArcLink, a distributed system to extract, preserve, and
expose the temporal web graph.
• We studied the optimization and summarization techniques to create
the thumbnails for the web graph collections based on SimHash
fingerprints.
• We extended the concept of URI-lookup in the web archive to include
the HTTP redirection status code.
• We defined the concept of a "Web Archive Profile" to characterize the web archive corpus, with an application to distributed search in the Memento Aggregator.
101
Publications
• S. G. Ainsworth, A. AlSum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson. “How
much of the Web is Archived?” In Proceedings of the 11th annual international
ACM/IEEE joint conference on Digital libraries, JCDL '11, 2011.
• A. AlSum, M. L. Nelson, R. Sanderson, and H. Van de Sompel. “Archival HTTP
Redirection Retrieval Policies.” In Proceedings of 3rd Temporal Web Analytics
Workshop, TempWeb '13, 2013.
• A. AlSum, and M. L. Nelson. “ArcLink: Optimization Techniques to Build and
Retrieve the Temporal Web Graph.” In Proceedings of the 13th annual international
ACM/IEEE joint conference on Digital libraries, JCDL '13, 2013.
• A. AlSum, Michele C. Weigle, M. L. Nelson, and H. Van de Sompel. “Profiling Web
Archive Coverage for Top-Level Domain and Content Language.” In Proceedings
of the 17th International Conference on Theory and Practice of Digital Libraries, TPDL
2013, 2013.
• A. AlSum, and M. L. Nelson. “Thumbnail Summarization Techniques for Web
Archives.” In Proceedings of the 36th European Conference on Information Retrieval.
ECIR '14, 2014.
102
What's next?
• Web Archiving Engineer at Stanford University.
103
104
@aalsum
BACKUP
105
Memento
• Memento is an HTTP
extension to integrate the
Past and the Current
Web
I. Jacobs and N. Walsh. Architecture of the World Wide Web. Technical report, W3C, 2004. http://www.w3.org/TR/webarch/
Now
T1
T2
T3
106
Memento
• Developer and administrator for Memento aggregator and proxies
107
Memento Clients
• Memento is now an RFC (RFC 7089).
108
109
Lack of APIs
• US agencies have started to offer APIs for data access.
Introduction  Motivation
110
Web Archiving Use Cases
• Temporal navigation.
• Full text search.
• Use language filters.
• Provide raw WARC.
• Import of metadata records
into other repositories.
Introduction  Motivation
*IIPC Access Working Group. Use cases for Access to Internet Archives. International Internet Preservation Consortium
Publications, http://www.netpreserve.org/resources/use-cases-access-internet-archives, 2006.
111
Related Projects
Data analysis for the web data
Tools and Methods to access the web archive
Enable the user to do experiments on the raw
crawled data on Amazon S3
Enable the user to browse the present and the
past web
Introduction
112
Selection
• Decide what to capture
Everything, any domain
National domains
Delegate selection to partners
Users' favorites
• We studied what is already captured
113
URI-Based
Wayback Machine
Web Archiving Trends  Accessing Web Archive
• Textbox to enter the
requested URI.
• BubbleMap to show
you the available
mementos.
114
Collection-Based
Web Archiving Trends  Accessing Web Archive
• In addition to
browsing the
collection, you can
browse the URIs in
this collection.
115
Full-text search
Web Archiving Trends  Accessing Web Archive
• BL interface provides
different filtering
techniques for the
results.
116
Past Web Browser
Web Archiving Trends  Accessing Web Archive
• You can replay the
pages with different
controls to forward,
backward, pause and
stop.
117
Zoetrope
Web Archiving Trends  Accessing Web Archive
• Different Views
• Comparison between
different Mementos
• Not feasible on the
current web archiving
infrastructure
118
DiffIE
Web Archiving Trends  Accessing Web Archive
• A browser plug-in that
caches the pages a
person visits and
highlights how those
pages have changed
when the person
returns to them
• It is possible on the
personal archiving.
119
Synchronicity
Web Archiving Trends  Accessing Web Archive
• Mozilla Firefox add-on
supports internet user
in (re-)discovering
missing web pages in
real time
120
Warrick
Web Archiving Trends  Accessing Web Archive
• It’s a utility for
reconstructing or
recovering a website
when a back-up is not
available
121
ArcSys Architecture Diagram
Web Archive Service Framework
122
WAT files
• WAT files are metadata files for WARC files
• WAT files are used to create data analysis reports based
on large datasets.
Metadata Service
123
It's More than WAT Files
WAT | ArcLink
Batch process on a set of WARCs | Batch process on a set of URIs
For internal use | For public use
No way to integrate with WAT files in other locations | Can be aggregated with other graphs
No incremental update | Supports incremental update
Access at the WAT file level using Pig | Access at the URI level using a web service
Metadata Service  ArcLink  Motivation
124
Cost of Scaling Up
•
Metadata Service  ArcLink  Cost model
Internet
Archive
88 hrs
108 × 10^9 mementos
247 days
500 TB
Filtering
Extraction
Storage
*Numbers based on Wayback Machine published statistics on Oct 2013 of 360B mementos with total size 5PB
125
Time-Indexed Inlinks Information
Metadata Service  ArcLink  Applications
Date Anchor Text
04-Nov-09 vancouver2010.com
11-Nov-09 vancouver2010.com
18-Nov-09 vancouver2010.com
16-Jan-10 Vancouver 2010 Olympic Games
16-Jan-10 Vancouver 2010 Olympic Games
23-Jan-10 vancouver2010.com
23-Jan-10 2010 Vancouver Olympic Games Medals Results Schedule Sports
30-Jan-10 2010 Vancouver Olympic Games Medals Results Schedule Sports
30-Jan-10 vancouver2010.com
30-Jan-10 Vancouver 2010 Olympic Games
13-Feb-10 Vancouver 2010 Olympic Winter Games
15-Feb-10 Vancouver 2010 Olympic Games
18-Feb-10 Official Vancouver Games site
19-Feb-10 vancouver2010.com
20-Feb-10 Official Vancouver Games site
21-Feb-10 VANOC 2010
126
HTTP Redirection Relationship
between URI-R & URI-M
URI Service  Experiment and Results
Web archive URI-M \ Live web URI-R | OK | Redirection
OK | Case 1 | Case 5
Redirection | Case 2 | Cases 3, 4
Case 1
Case 2 Case 3 Case 4 Case 5
80.8%
2.74% 1.34%
1.33%
13.7%
127
Timemap Redirection Categories
• Category 1
URI Service
All Mementos have 200 HTTP status code
128
Timemap Redirection Categories
• Category 2
URI Service
All Mementos have redirection to the same URI.
129
Timemap Redirection Categories
• Category 3
URI Service
All Mementos have redirection to different URIs.
130
Timemap Redirection Categories
• Category 4
URI Service
Mementos have different HTTP status code.
131
132
URI Reliability
•
URI Service
M1
3xx
M2
3xx
M3
3xx
rel=original
R`M
rel=original
R`M
rel=original
R`M
? ? ?200 404 3xx
133
Summary
• Quantitative study with 10,000 URIs.
• 48% were not fully stable through time.
• 27% were not perfectly reliable through time.
• New archival retrieval policy:
• Policy one: successfully retrieved mementos for 17 out of 77.
• Policy two: Expanded the TimeMap for 58% of cases.
URI Service  Retrieval Policies
134
URI Reliability
• 23% of the mementos did not lead to a successful
memento at the end.
URI Service  Experiment and Results
Figure panels: Reliability in semi-log scale; Reliability for |TM(R)| < 300
135
Experiment
Archive Service  Percentage  Experiment
• For each sample set, we used Memento
Aggregator to get all the possible archived
copies (Mementos).
• For each URI, Memento Aggregator
responded with TimeMap for this URI.
Example
<http://memento.waybackmachine.org/memento/20010819194233/http://jcdl2002.org>;rel="first memento";datetime="Sun, 19 Aug 2001 19:42:33 GMT",
<http://memento.waybackmachine.org/memento/20011216220248/http://jcdl2002.org>;rel="memento";datetime="Sun, 16 Dec 2001 22:02:48 GMT",
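A TimeMap fragment like the example above can be read with a small parser. This is a minimal sketch that handles only entries of the form `<URI>;rel="...";datetime="..."` as they appear in the example, not the complete link-format grammar; the function name `parse_timemap` is hypothetical.

```python
import re
from email.utils import parsedate_to_datetime

# Matches entries shaped like the example: <URI>;rel="...";datetime="..."
LINK_PATTERN = re.compile(
    r'<(?P<uri>[^>]+)>\s*;\s*rel="(?P<rel>[^"]+)"\s*;\s*datetime="(?P<dt>[^"]+)"'
)


def parse_timemap(text):
    """Return a list of (uri_m, rel_values, memento_datetime) tuples."""
    entries = []
    for match in LINK_PATTERN.finditer(text):
        entries.append(
            (
                match.group("uri"),
                match.group("rel").split(),            # e.g. ["first", "memento"]
                parsedate_to_datetime(match.group("dt")),
            )
        )
    return entries


timemap_text = (
    '<http://memento.waybackmachine.org/memento/20010819194233/http://jcdl2002.org>;'
    'rel="first memento";datetime="Sun, 19 Aug 2001 19:42:33 GMT",\n'
    '<http://memento.waybackmachine.org/memento/20011216220248/http://jcdl2002.org>; '
    'rel="memento"; datetime="Sun, 16 Dec 2001 22:02:48 GMT",'
)
entries = parse_timemap(timemap_text)
for uri_m, rels, when in entries:
    print(when.isoformat(), rels, uri_m)
```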
136
1000 URIs Ordered by First Observation Date
Archive Service  Percentage  Results
See also: http://ws-dl.blogspot.com/2011/06/2011-06-23-how-much-of-web-is-archived.html
137
2010
Archive Service  Percentage  Results
2013
138
Archive Service  Percentage  Results
2010 2013
139
Archive Service  Percentage  Results
2010 2013
140
Archive Service  Percentage  Results
2010 2013
141
URIs Samples Sources –
Live Web
1. DMOZ – Random sample
• 10,000 URIs randomly sampled from the DMOZ directory (~5M URIs).
2. DMOZ – TLD: 200 URIs for each TLD
• 80 TLDs.
3. DMOZ – Languages: 100 URIs for each language
• 40 languages.
Archive Service  Distribution
142
URIs Samples Sources –
Web Archive
• Query the full-text search interface of the web archives with two sets of query terms.
4. Top 1-grams from Bing
• Most of them are English.
5. Top 1,000 query terms from Yahoo in 9 languages
• We excluded general keywords such as Obama and Facebook.
Archive Service  Distribution
143
URIs Samples Sources –
User requests
• Sampling from user requests for archived web materials
6. Sample from IA Wayback Machine log files
• 10,000 URIs randomly sampled from Feb 22, 2012 to Feb 26, 2012.
7. Sample from Memento Aggregator log files
• 1,000 URIs randomly sampled from the LANL Memento Aggregator between 2011 and 2013.
Archive Service  Distribution
144
General Coverage
Archive Service  Distribution
145
Web Archive Selection Evaluation
Archive Service  Distribution
146
Web Archive Selection Evaluation
Archive Service  Distribution
147
Future Work
148
iTunes cover application
Metadata Service  ArcThumb  Motivation
149

"Web Archive services framework for tighter integration between the past and the present web", PhD defense presentation.

  • 1. WEB ARCHIVE SERVICES FRAMEWORK FOR TIGHTER INTEGRATION BETWEEN THE PAST AND PRESENT WEB Ahmed AlSum PhD Defense February 2014 Committee Members: • Michael L. Nelson • Michele C. Weigle • Hussein M. Abdel-Wahab • M'Hammad Abdous • Herbert Van de Sompel Old Dominion University Computer Science Department 1
  • 2. Domain Contribution Goal WEB ARCHIVE SERVICES FRAMEWORK FOR TIGHTER INTEGRATION BETWEEN THE PAST AND PRESENT WEB Ahmed AlSum PhD Defense February 2014 Committee Members: • Michael L. Nelson • Michele C. Weigle • Hussein M. Abdel-Wahab • M'Hammad Abdous • Herbert Van de Sompel Old Dominion University Computer Science Department 2
  • 3. Outline • Introduction • Web Archiving Services Framework • Content Service • Metadata Service • URI Service • Archive Service • Conclusions 3
  • 5. What is a Web Archive? Introduction → Motivation http://www.cs.odu.edu 5
  • 6. Who uses Web Archives, and how? • Politicians • Journalists • Web designers • Historians • Researchers • Social scientists • Curious users Introduction → Motivation 6 *IIPC Access Working Group 2006, Costa 2010, Dougherty 2010, Stirling 2011, Smith 2009
  • 7. Web archive interfaces are limited Introduction → Motivation 7
  • 8. Web Archiving Use Cases • Ponguru asked on the Internet Archive forum on May 17, 2010*: • "Hi All - I am new to Archive.org. A few quick questions (1) Is there any API or tools available to access the Archive.org contents programmatically? (2) Are there any research papers where Archive.org was used for data collection / analysis (e.g. studying a particular topic over time, etc.)? I digged a little bit, could not find much, so checking with the group." Introduction → Motivation *http://archive.org/post/306799/api-or-tools-to-access-research-publications-on-archiveorg 8
  • 9. Lack of APIs • Popular websites provide APIs to third-party developers. Introduction → Motivation 9
  • 10. Limited and non-standard APIs • Current Web Archives have a limited set of APIs that don't cover the user's needs. Introduction → Motivation 10
  • 11. Wayback Machine API Introduction → Motivation • It returns a JSON list of the available Mementos. 11
  • 12. Croatian Web Archive Introduction → Motivation Full-text search web interface Full-text search APIs in JSON 12
  • 13. Memento Introduction → Motivation • Memento provides TimeMaps in the CoRE link format (application/link-format). 13
  • 14. Memento Terminology Introduction → Motivation Original Resource (URI-R, R): http://www.amazon.com; Memento (URI-M, M): http://web.archive.org/web/20110411070244/http://amazon.com; TimeMap (URI-T, TM) 14 Van de Sompel, H., Nelson, M. L., & Sanderson, R. (2013). RFC 7089 - HTTP framework for time-based access to resource states -- Memento. Internet Engineering Task Force (IETF). Retrieved from http://tools.ietf.org/html/rfc7089
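The relationship between a URI-M and its URI-R can be illustrated with a minimal Python sketch. It assumes the Wayback-style URI-M layout seen in the example above (archive prefix, `/web/`, a 14-digit timestamp, then the original URI) and assumes the timestamp is in UTC; the function name `split_uri_m` is hypothetical, and other archives may use different URI-M layouts.

```python
import re
from datetime import datetime, timezone

# Assumed layout: <archive prefix>/web/<14-digit timestamp>/<URI-R>
URI_M_PATTERN = re.compile(
    r"^(?P<prefix>https?://[^/]+/web)/(?P<ts>\d{14})/(?P<uri_r>.+)$"
)

def split_uri_m(uri_m):
    """Return (memento_datetime, uri_r) for a Wayback-style URI-M."""
    match = URI_M_PATTERN.match(uri_m)
    if match is None:
        raise ValueError(f"not a Wayback-style URI-M: {uri_m}")
    # The timestamp encodes YYYYMMDDhhmmss; UTC is an assumption here.
    memento_datetime = datetime.strptime(
        match.group("ts"), "%Y%m%d%H%M%S"
    ).replace(tzinfo=timezone.utc)
    return memento_datetime, match.group("uri_r")

memento_datetime, uri_r = split_uri_m(
    "http://web.archive.org/web/20110411070244/http://amazon.com"
)
print(memento_datetime.isoformat(), uri_r)
```

A TimeMap (URI-T) would then be the list of all such URI-Ms known for one URI-R.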
  • 15. Memento Aggregator • Merges TimeMaps from various archives. Introduction → Motivation 15
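Merging TimeMaps can be pictured as merging several datetime-ordered lists into one. The sketch below is only an illustration under stated assumptions: each TimeMap is modelled as a list of `(datetime, URI-M)` tuples already sorted by datetime, and the archive host names are hypothetical. It does not describe how the actual aggregator is implemented.

```python
import heapq
from datetime import datetime

def merge_timemaps(*timemaps):
    """Merge per-archive TimeMaps into one list ordered by memento datetime.

    Each input is a list of (memento_datetime, uri_m) tuples that is
    already sorted by datetime, which is what heapq.merge requires.
    """
    return list(heapq.merge(*timemaps, key=lambda entry: entry[0]))

# Hypothetical TimeMaps for the same original resource from two archives.
archive_a = [
    (datetime(2010, 1, 5), "http://archive-a.example/web/20100105000000/http://amazon.com"),
    (datetime(2011, 4, 11, 7, 2, 44), "http://archive-a.example/web/20110411070244/http://amazon.com"),
]
archive_b = [
    (datetime(2010, 6, 1), "http://archive-b.example/20100601000000/http://amazon.com"),
]

merged = merge_timemaps(archive_a, archive_b)
for memento_datetime, uri_m in merged:
    print(memento_datetime.isoformat(), uri_m)
```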
  • 16. Web Archiving as Big Data • The Internet Archive corpus has reached 5 petabytes. • Bibliotheca Alexandrina needs one year to recompute the checksums for its corpus. • Tools: Apache Pig Introduction → Motivation 16
  • 17. Research Question: How can we enrich the web archive access interface in conjunction with the live web? Introduction → Research Questions 17
  • 18. Research Questions • What are the required services for the web archiving user community? • Shall we work on the web archive collection as one entity or on different levels? • How can we use the web archive content beyond full-text search? • What are the metadata fields that could enhance user browsing? • How can we develop an access interface to the temporal web graph? • How can we optimize the creation of thumbnails? • How can we use HTTP redirection to enhance the URI-lookup query? • How can we optimize the query routing mechanism across the web archives? Introduction → Research Questions 18
  • 20. Web Archive Service Framework Web Archive Service Framework 20
  • 21. • Archive level • Web Archive profiling to optimize the query routing. • URI level • URI HTTP redirection in the web archive URI-lookup. • Metadata level • ArcLink • ArcThumb • Content level • ArcContent Web Archive Service Framework ArcSys 21
  • 22. IIPC 2010 Winter Olympics Web Archive Service Framework → Datasets * http://olympics.us.archive.org/olympics2010/ Size: 700+ GB; From: Nov 2009; To: Mar 2010; #URI-R: 6.4M; #URI-M: 23.7M 22
  • 23. Fortune 500 • 499,540 mementos from 488 TimeMaps. • For each memento, we download the HTML and capture the thumbnail using PhantomJS. Web Archive Service Framework → Datasets 23
  • 24. DMOZ Web Archive Service Framework → Datasets • URI Open Directory based on user submissions. 24
  • 26. Wayback Machine URI Rewriting Original Rewritten Content Service 26
  • 27. Response Types Raw Response Modified Response Extracted Response Content Service 27
  • 29. Extracted Response Filters Content Service TextContent TFContent 29
  • 30. Extracted Response Formats Content Service XML JSON 30
  • 32. METADATA SERVICE ArcLink & ArcThumb 32 Archive URI Metadata Content
  • 33. Metadata Access Service Metadata Service • Metadata is data about data. • Metadata layer is data about mementos. 33
      Type | Field | Description | Example
      Technical | Content-type | Entity mimetype. | text/html
      Technical | Content-length | Size of the entity-body. | 90883
      Extracted | Title | Title of the page. | Egypt rejoices at Mubarak departure
      Extracted | Description | Description of the content of the entity-body. | The BBC World Affairs Editor John Simpson reflects on how Egypt brought about the overthrow of President Hosni Mubarak.
      Extracted | Outgoing Links | A list of all the outlinks that the page pointed to. |
      Derived | Thumbnail | Thumbnail of the representation of the web page. |
      Derived | Incoming Links | A list of all the inlinks that pointed to the page. |
  • 34. ArcLink Motivation, Stages, Cost Model, Applications 34
  • 35. ArcLink: Optimization Techniques to Build and Retrieve the Temporal Web Graph. A. AlSum and M. L. Nelson. In Proceedings of the 13th Annual International ACM/IEEE Joint Conference on Digital Libraries, JCDL '13, Indianapolis, Indiana, 2013. See also: http://arxiv.org/abs/1305.5959
  • 36. Easily Solved Questions Q: What are the available mementos for www.vancouver2010.com? Metadata Service  ArcLink  Motivation 36
  • 37. Solved Questions, but hard Q. What are the HTML titles for www.vancouver2010.com through time? A. Page scraping for all mementos Metadata Service  ArcLink  Motivation 37
  • 38. Impossible Questions
      Q: What anchor texts have pointed to www.vancouver2010.com through time?
      … <a href=www.vancouver2010.com> Vancouver Olympics </a> …
      … <a href=www.vancouver2010.com> Winter Olympics </a> …
      … <a href=www.vancouver2010.com> Vancouver 2010 </a> …
  • 39. Outlinks Metadata Service  ArcLink  Motivation 39
  • 40. ArcLink and Temporal Web Graph
      What is ArcLink?
      • ArcLink is a complete system to extract, preserve, and provide access to the Temporal Web Graph.
      What is the Temporal Web Graph?
      • The link structure through time, including inlinks and outlinks.
      (Figure labels: WG @t1, WG @t2, TWG)
  • 41. System Stages Metadata Service  ArcLink  Stages 41
  • 42. Filtering
      • Use the CDX files to select the mementos that will contribute to the web graph. For example:
          • Exclude non-200 HTTP status codes.
          • Exclude images, style sheets, videos, etc.
          • Exclude duplicate mementos.
      • Technique: a Pig Latin script over the CDX files.
      • Results: the CDX was reduced to 25% of its original size, from 23.8M mementos to 6.7M mementos.
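The three exclusion rules on the filtering slide can be sketched in a few lines. The talk used a Pig Latin script; the Python below is a stand-in for illustration only, and the CDX field order it assumes (urlkey, timestamp, original URI, MIME type, status, digest) is an assumption, since real CDX files declare their own layout in a header line.

```python
from typing import Iterable, Iterator

# Assumed CDX field order (hypothetical; check the header line of a real file):
# urlkey timestamp original mimetype status digest ...
URLKEY, TIMESTAMP, ORIGINAL, MIMETYPE, STATUS, DIGEST = range(6)


def filter_cdx(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the CDX lines for mementos that can contribute to the web graph."""
    seen = set()  # (urlkey, digest) pairs that were already emitted
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # malformed line
        if fields[STATUS] != "200":
            continue  # exclude non-200 HTTP status codes
        if not fields[MIMETYPE].startswith("text/html"):
            continue  # exclude images, style sheets, videos, etc.
        key = (fields[URLKEY], fields[DIGEST])
        if key in seen:
            continue  # exclude duplicates: same URI key with the same content digest
        seen.add(key)
        yield line
```

Treating "duplicate" as "same URI key and same content digest" is one possible reading of the slide; other definitions (for example, same digest within a time window) would change only the `key` line.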
  • 43. Extraction
      • Technique: Hadoop
      • Step 1: URI-ID generation
          • Canonicalize the URI into SURT format.
          • Hash the canonicalized form using SimHash.
          • Completely distributed.
      • Step 2: Define data sources
          • WARC
          • Web archive UI
      Tasks | Input Source | Map (sec) | Reduce (sec) | Total (sec)
      2 | Wayback | 21,422 | 4,194 | 25,616
      2 | WARC | 13,327 | 2,770 | 16,098 (62%)
      5 | Wayback | 13,721 | 2,257 | 15,978
      5 | WARC | 8,304 | 1,746 | 10,051 (62%)
  • 44. Storage
      • ArcLink uses a database to store the web graph.
      (Figures: insertion performance; update performance)
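Step 1 (canonicalize to SURT, then hash with SimHash) can be illustrated as follows. This is a minimal sketch, not ArcLink's code: `to_surt` is a simplified, hypothetical canonicalizer (production SURT rules handle ports, schemes, and more), and the character 3-gram tokenization and MD5-derived token hashes inside `simhash64` are assumptions chosen only to keep the example self-contained.

```python
import hashlib
from urllib.parse import urlsplit


def to_surt(uri: str) -> str:
    """Simplified SURT-style key: reversed host labels, then path and query."""
    parts = urlsplit(uri if "://" in uri else "http://" + uri)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # assumption: www.example.com and example.com share a key
    path = (parts.path or "/").lower()
    query = "?" + parts.query.lower() if parts.query else ""
    return ",".join(reversed(host.split("."))) + ")" + path + query


def simhash64(text: str, ngram: int = 3) -> int:
    """64-bit SimHash: every n-gram votes on every bit of the fingerprint."""
    votes = [0] * 64
    grams = [text[i:i + ngram] for i in range(max(1, len(text) - ngram + 1))]
    for gram in grams:
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for bit in range(64):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if votes[bit] > 0)


def uri_id(uri: str) -> int:
    """URI-ID as described on the slide: SimHash of the canonicalized URI."""
    return simhash64(to_surt(uri))
```

Because the ID depends only on the URI string, every mapper can compute it independently, which fits the "completely distributed" note on the slide.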
  • 45. ArcLink Response Metadata Service  ArcLink  Stages 45
  • 46. ArcLink Response Metadata Service  ArcLink  Stages 46
  • 47. ArcLink Response Metadata Service  ArcLink  Stages 47
  • 48. Impossible Questions
      Q: What anchor texts have pointed to www.vancouver2010.com through time?
  • 49. Temporal Page Rank
      Rank | Nov-2009 | Dec-2009 | Jan-2010
      1 | vancouver2010.com/code | - | topsport.com/sportch/liveticker/
      2 | vancouver2010.com/en/langpolicy | - | vancouver2010.com/code
      3 | vancouver2010.com/forgotpassword | - | canadacode.vancouver2010.com/user/register
      4 | vancouver2010.com/store | - | canadacode.vancouver2010.com
      5 | vancouver2010.com/store/index.html | - | canadacode.vancouver2010.com/explore
      6 | vancouver2010.com/ | - | canadacode.vancouver2010.com/user/login?destination=node/add/image
      7 | canadacode.vancouver2010.com | - | canadacode.vancouver2010.com/pulse
      8 | canadacode.vancouver2010.com/nfb-onf | - | canadacode.vancouver2010.com/challenge
      9 | canadacode.vancouver2010.com/contact | - | i-credible.nl
      10 | canadacode.vancouver2010.com/resources | - | vpzschaatsteam.nl

      Rank | Feb-2010 | Mar-2010 | Collection (Nov-09 to Mar-10)
      1 | monlibe.liberation.fr | monlibe.liberation.fr | monlibe.liberation.fr
      2 | topsport.com/sportch/liveticker/ | laprovence.com/la-provence-le-faq-de-la-moderation | vancouver2010.com/code
      3 | lefigaro.fr | get.adobe.com/flashplayer | lefigaro.fr
      4 | laprovence.com/la-provence-le-faq-de-la-moderation | vancouver2010.teamgb.com/teamgb/team-behind-team-gb/filenotfound.aspx | laprovence.com/la-provence-le-faq-de-la-moderation
      5 | lefigaro.fr/sport | ledauphine.com | lefigaro.fr/sport
      6 | get.adobe.com/flashplayer | lefigaro.fr/economie | get.adobe.com/flashplayer
      7 | lefigaro.fr/meteo | lefigaro.fr/sport | lefigaro.fr/meteo
      8 | lefigaro.fr/le-talk | lefigaro.fr/actualites-a-la-une | lefigaro.fr/le-talk
      9 | dosb.de/de/vancouver-2010/vancouver-ticker/detail/printer.html | lemonde.fr/cgv | topsport.com/sportch/liveticker/
      10 | ledauphine.com | ffs.fr/index.php | vancouver2010.com/en/langpolicy
  • 51. Thumbnail Summarization Techniques for Web Archives. A. AlSum and M. L. Nelson. In Proceedings of the 36th European Conference on Information Retrieval, ECIR 2014, Amsterdam, Netherlands, 2014.
  • 52. Thumbnails in Web Archive Metadata Service  ArcThumb  Motivation Internet Archive UK Web Archive 52
  • 53. Thumbnails Creation Challenges
      • Scalability in time: IA may need 361 years to create a thumbnail for each memento using one hundred machines.
      • Scalability in space: IA will need 355 TB to store one thumbnail per memento.
      • Page quality
  • 54. Thumbnails Usage Challenges
      • This is a partial view of 700 thumbnails out of the 10,500 available mementos for www.apple.com
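A quick back-of-the-envelope check shows what per-memento costs these totals imply. The sketch assumes the 360B-memento figure for the Wayback Machine (the October 2013 statistic quoted on the cost-model backup slide) and decimal terabytes; neither assumption is stated on this slide.

```python
MEMENTOS = 360e9         # assumption: Wayback Machine size, Oct 2013 figure
MACHINES = 100           # from the slide
YEARS = 361              # from the slide
STORAGE_BYTES = 355e12   # 355 TB, assuming decimal terabytes

SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Total machine-seconds divided by the number of mementos.
seconds_per_thumbnail = YEARS * SECONDS_PER_YEAR * MACHINES / MEMENTOS
# Total storage divided by the number of mementos.
bytes_per_thumbnail = STORAGE_BYTES / MEMENTOS

print(f"{seconds_per_thumbnail:.1f} s and {bytes_per_thumbnail:.0f} bytes per thumbnail")
```

Under those assumptions the totals correspond to roughly three seconds of capture time and about one kilobyte of storage per thumbnail.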
  • 55. From 10,500 Mementos to 69 Thumbnails. Metadata Service  ArcThumb  Motivation 55
  • 56. How many thumbnails do we need? Metadata Service  ArcThumb  Methodology www.unfi.com on the live Web 56
  • 57. How many thumbnails do we need? Metadata Service  ArcThumb  Methodology www.unfi.com on the live Web 57
  • 58. 40 Thumbnails are good. Metadata Service  ArcThumb  Methodology 58
  • 59. Visual Similarity and Text Similarity (Figure panels: Similar, Different; HTML, Text)
  • 60. Correlation between Visual Similarity and Text Similarity Metadata Service  ArcThumb  Feature Exploration SimHash DOM tree Embedded resources Memento Datetime 60 SimHash [Charikar 2002], DOM tree [Pawlik 2011], Memento Datetime [Van de Sompel 2013]
  • 61. Threshold Grouping Metadata Service  ArcThumb  Selection Algorithms 61
  • 62. Threshold Grouping Metadata Service  ArcThumb  Selection Algorithms 62
  • 63. Clustering technique Metadata Service  ArcThumb  Selection Algorithms SimHash Feature SimHash and Datetime Features 63 Park, H.-S., & Jun, C.-H. (2009). A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications, 36(2, Part 2), 3336–3341.
  • 64. Time Normalization Metadata Service  ArcThumb  Selection Algorithms 64
  • 65. Selection Algorithms Comparison
      Criterion | Threshold Grouping | K clustering | Time Normalization
      TimeMap reduction | 27% | 9% to 12% | 23%
      Image loss | 28 | 78 to 101 | 109
      # Features | 1 feature | 1 or more | 1 feature
      Preprocessing required | Yes | Yes | No
      Efficient processing | Medium | Extensive | Light
      Incremental | Yes | No | Yes
      Online/offline | Both | Both | Both
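The slides name threshold grouping without spelling out its steps. One plausible reading, consistent with the comparison above (a single feature, incremental), is sketched below: walk the TimeMap in datetime order and pick a new thumbnail only when a memento's SimHash fingerprint differs from the last selected one by more than a threshold. The function names and the threshold value are hypothetical.

```python
from typing import List, Sequence


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def threshold_grouping(fingerprints: Sequence[int], threshold: int) -> List[int]:
    """Return the indices of the mementos chosen for thumbnails.

    `fingerprints` holds one SimHash per memento, in Memento-Datetime order.
    """
    selected: List[int] = []
    for i, fp in enumerate(fingerprints):
        # Start a new group when the page drifted far enough from the last pick.
        if not selected or hamming(fp, fingerprints[selected[-1]]) > threshold:
            selected.append(i)
    return selected


# Toy 4-bit fingerprints; real SimHash fingerprints are typically 64 bits.
print(threshold_grouping([0b0000, 0b0001, 0b0011, 0b1111, 0b1110], threshold=1))
```

Since each decision depends only on the last selected memento, new captures can be appended without revisiting earlier ones, which matches the "Incremental: Yes" entry in the comparison.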
  • 67. ARCHIVAL HTTP REDIRECTION RETRIEVAL POLICIES A. AlSum, M. L. Nelson, R. Sanderson, and H. Van de Sompel In Proceedings of 3rd Temporal Web Analytics Workshop. TempWeb 2013, Rio de Janeiro, Brazil 67
  • 68. Live Web Redirect
      http://bit.ly/r9kIfC redirects to http://www.cs.odu.edu
      % curl -I http://bit.ly/r9kIfC
      HTTP/1.1 301 Moved
      ….
      Location: http://www.cs.odu.edu/
      …
  • 69. Live Web Redirect URI Service R http://bit.ly/r9kIfC R http://www.cs.odu.edu redirects to 69
  • 70. Archived Web Redirect
      • Live web: R1 www.draculathemusical.co.uk redirects to R2 www.mosaicstudio.co.uk
      • Web archive: the memento of R1, http://web.archive.org/web/20020212194020/http://www.draculathemusical.co.uk/, redirects to the memento of R3, http://web.archive.org/web/20020212194020/http://www.geocities.com/draculathemusical
  • 71. Experiment
      • Dataset: 10,000 sample URIs from
      • The dataset includes neither bit.ly nor DOI URIs.
      • The experiment focused on the root page (no embedded resources).
      Live HTTP status code (10,000 URI-R)
          OK (200) | 82.83%
          Redirection (3xx) | 14.71%
              Redirection (301) | 8.4%
              Redirection (302) | 6.1%
              Redirection (others) | 0.2%
          Not Found (4xx) | 1.18%
          Others | 1.28%
      Memento HTTP status code (894,717 URI-M)
          OK (200) | 93.46%
          Redirection (3xx) | 5.69%
          Not Found (4xx) | 0.26%
          Others | 0.59%
  • 72. URI Stability
      • A URI's stability is based on a count of the changes in its HTTP responses across time (200, 3xx, or 4xx) and the number of different URIs in the "Location" header for 3xx status codes.
      • High stability = 1; no stability = 0.
  • 74. Timemap Redirection Categories
      • All mementos have a 200 HTTP status code.
      • All mementos redirect to the same URI.
      • All mementos redirect to different URIs.
      • Mementos have different HTTP status codes.
  • 75. URI Stability
      TimeMap Category | Percentage | Stability
      All mementos have OK | 52% | 1
      Mementos have mixed status codes | 36% | 0.91
      All mementos have redirection | 0.92% | 0.85
          Redirection to the same URI | 0.62% |
          Redirection to different URIs | 0.30% |
      URI has no mementos at all | 10.97% | 0
      (Figures: stability in semi-log scale; stability for |TM(R)| < 300)
  • 76. Current Wayback Machine Policy • URI Service  Retrieval Policies 76
  • 77. Policy one: URI-R with HTTP redirection (Flowchart: retrieve the memento M for R; decision nodes include Status(M) = 200 and Status(M) = 3xx; outcomes are Stop or Go to Policy 2.)
  • 78. Policy one: URI-R with HTTP redirection
      • Evaluation:
          • Policy scope: 1471 URIs (those that have a live redirection).
          • 77 out of the 1471 have no mementos at all.
          • For 17 of those 77, mementos were retrieved by following the live redirection.
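The evaluation suggests the shape of policy one: when the archive holds no mementos for a URI-R, follow the URI's live-web redirection and look up the target instead. The sketch below is an interpretation under that assumption, not the authors' implementation; `get_timemap` and `live_redirect_target` are hypothetical callables standing in for an archive lookup and a live HTTP HEAD request.

```python
from typing import Callable, List, Optional


def policy_one(
    uri_r: str,
    get_timemap: Callable[[str], List[str]],
    live_redirect_target: Callable[[str], Optional[str]],
    max_hops: int = 5,
) -> List[str]:
    """Return mementos for uri_r, following live redirects if the archive has none."""
    current = uri_r
    for _ in range(max_hops + 1):
        mementos = get_timemap(current)
        if mementos:
            return mementos  # the archive knows this URI: stop here
        target = live_redirect_target(current)
        if target is None:
            return []  # no mementos and no live redirection to follow
        current = target  # retry the lookup with the redirection target
    return []  # gave up after max_hops redirections


# Toy data: the old URI was never archived, but its live redirect target was.
archive = {"http://new.example/": ["M1", "M2"]}
live = {"http://old.example/": "http://new.example/"}
print(policy_one("http://old.example/", lambda u: archive.get(u, []), live.get))
```

The hop limit is an added safeguard against redirect loops and is not part of the policy as described on the slides.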
  • 79. Policy two: URI-M with HTTP redirection • URI Service  Retrieval Policies http://www.cnn.com/ Accept-Datetime: Sun, 13 May 2006 http://www.cnn.com/ 79
  • 80. Policy two: URI-M with HTTP redirection
      • Evaluation:
          • Policy scope: 2980 TimeMaps (those that showed an HTTP redirection status code in at least one memento).
          • Success criterion: using policy two contributed to the original TimeMap.
          • Success rate: 58% of the cases.
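One way to read policy two: when a memento in the TimeMap of R carries a 3xx status, the TimeMap of the redirection target is fetched and merged into the original. The sketch below follows that reading; the `Memento` record, its field names, and the `get_timemap` callable are hypothetical and exist only for illustration.

```python
from typing import Callable, List, NamedTuple, Optional


class Memento(NamedTuple):
    datetime: str                      # e.g. "20020212194020"
    uri_r: str                         # original resource the memento belongs to
    status: int                        # archived HTTP status code
    redirect_to: Optional[str] = None  # original URI named in an archived Location header


def policy_two(uri_r: str, get_timemap: Callable[[str], List[Memento]]) -> List[Memento]:
    """Expand the TimeMap of uri_r with the TimeMaps of its archived redirect targets."""
    base = list(get_timemap(uri_r))
    targets = {
        m.redirect_to
        for m in base
        if 300 <= m.status < 400 and m.redirect_to and m.redirect_to != uri_r
    }
    extra = [m for target in sorted(targets) for m in get_timemap(target)]
    # De-duplicate, then order by capture time (ties broken by original URI).
    return sorted(set(base + extra), key=lambda m: (m.datetime, m.uri_r))
```

With this reading, the success criterion on the slide corresponds to `extra` being non-empty, that is, the redirect targets contributing mementos the original TimeMap did not have.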
  • 81. ARCHIVE SERVICE Percentage and Distribution 81 Archive URI Metadata Content
  • 82. How Much Of The Web Is Archived? S. G. Ainsworth, A. AlSum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson In Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries JCDL '11, Ottawa, Canada 2011 See also: http://arxiv.org/abs/1212.6177 82
  • 83. Experiment • 4 Sample sets – 1000 URIs each • For each URI, we used Memento Aggregator to record the TimeMap for this URI. Archive Service  Percentage  Experiment 83
  • 84. Archives Under Experiment 2010 2010 and 2013 2013 Archive Service  Percentage  Experiment U K 84
  • 85. How Much of the Web is Archived?
      • It depends on which Web…
      (one row per sample set)
      2010, including SE cache | 2010, excluding SE cache | 2013, general
      90% | 79% | 90%
      97% | 68% | 95%
      88% | 19% | 52%
      35% | 16% | 33%
  • 86. Profiling Web Archive Coverage For Top-level Domain And Content Language A. AlSum, M. C. Weigle, M. L. Nelson, and H. Van de Sompel In Proceedings of the 17th International Conference on Theory and Practice of Digital Libraries TPDL 2013, Valletta, Malta, 2013 Extended version is invited to special edition in IJDL. See also: http://arxiv.org/abs/1309.4008 86
  • 87. Memento Aggregator Archive Service  Distribution 87
  • 88. Where can you find? Archive Service  Distribution http://www.google.com/ 88
  • 89. Where can you find? Archive Service  Distribution http://www.google.com/ 89
  • 90. Where can you find? Archive Service  Distribution http://www.japantimes.co.jp/ 90
  • 91. Where can you find? Archive Service  Distribution http://www.japantimes.co.jp/ 91
  • 92. Research Question Problem • We need to profile the web archives around the world with these characteristics: • Age • Top-level domains • Languages • Growth rate Goal • To optimize the query routing for Memento Aggregator. • To determine the missing parts of the web. Archive Service  Distribution 92
  • 93. URIs Samples Sources
      Web
          1. DMOZ – Random sample
          2. DMOZ – TLD: 200 URIs for each TLD from DMOZ (80 TLDs)
          3. DMOZ – Languages: 100 URIs for each language (40 languages)
      Web Archives
          4. Top 1-grams from Bing
          5. Top 1000 query terms by Yahoo in 9 languages
      User requests
          6. IA Wayback Machine log files
          7. Memento Aggregator log files
      * We used hostnames only
  • 94. TLD Coverage
      Legend: IA = Internet Archive; AIT = Archive-It; LoC = Library of Congress; UK = UK National Library; BL = British Library; SG = Singapore; CAT = Web Archive of Catalonia; CR = Croatian Web Archive; CZ = Archive of the Czech Web; TW = National Taiwan University; IC = Icelandic Web Archive; CAN = Library and Archives Canada; SI = Slovenia; WEB = WebCite; AIS = Archive.is
  • 95. Language Coverage
      Legend: IA = Internet Archive; AIT = Archive-It; LoC = Library of Congress; UK = UK National Library; BL = British Library; SG = Singapore; CAT = Web Archive of Catalonia; CR = Croatian Web Archive; CZ = Archive of the Czech Web; TW = National Taiwan University; IC = Icelandic Web Archive; CAN = Library and Archives Canada; PO = Portuguese Web Archive; SI = Slovenia; WEB = WebCite; AIS = Archive.is
  • 96. Growth Rate
      Chart annotations: stopped archiving in 2008; steady growth; stopped getting new URIs, but still crawling.
      Legend: IA = Internet Archive; AIT = Archive-It; LoC = Library of Congress; UK = UK National Library; BL = British Library; SG = Singapore; CAT = Web Archive of Catalonia; CR = Croatian Web Archive; CZ = Archive of the Czech Web; TW = National Taiwan University; IC = Icelandic Web Archive; CAN = Library and Archives Canada; PO = Portuguese Web Archive; SI = Slovenia; WEB = WebCite; AIS = Archive.is
  • 97. Building Web Archive Profile Archive Service  Distribution 97
  • 98. Web Archive Selection Evaluation
      • TM(R) spans five archives: A1 holds M1, M2, M3; A2 holds M4, M5; A3 holds M6; A4 holds M7; A5 holds M8.
      • RecallTM@1 = 3/8 = 0.375
      • RecallTM@2 = 5/8 = 0.625
  • 99. Web Archive Selection Evaluation
      Number of archives | Including IA | Excluding IA
      RecallTM@3 | 0.96 | 0.647
      RecallTM@6 | 0.98 | 0.83
      RecallTM@9 | 0.998 | 0.983
      RecallTM@12 | 0.999 | 0.987
      • Total number of archives N = 15
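The worked numbers on this slide follow from a simple ratio: the mementos held by the top-k ranked archives divided by all mementos in the full TimeMap. A small sketch reproduces them; the function name and the list-of-counts input are illustrative choices, not taken from the talk.

```python
from typing import Sequence


def recall_tm_at_k(mementos_per_archive: Sequence[int], k: int) -> float:
    """Fraction of the full TimeMap recovered by querying only the top-k archives.

    `mementos_per_archive` lists memento counts in the order the archives are ranked.
    """
    total = sum(mementos_per_archive)
    return sum(mementos_per_archive[:k]) / total if total else 0.0


# Slide example: A1 holds 3 mementos, A2 holds 2, A3 to A5 hold 1 each (8 in total).
ranked = [3, 2, 1, 1, 1]
print(recall_tm_at_k(ranked, 1))  # 0.375
print(recall_tm_at_k(ranked, 2))  # 0.625
```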
  • 101. Conclusions
      • We proposed a new service framework that divides the web archive corpus into four levels: Content, Metadata, URI, and Archive.
      • We developed ArcContent, which supports the web archive interface with extracted versions of the mementos based on a set of predefined filters.
      • We studied the challenges of building the temporal web graph and developed ArcLink, a distributed system to extract, preserve, and expose the temporal web graph.
      • We studied optimization and summarization techniques to create thumbnails for the web graph collections based on SimHash fingerprints.
      • We extended the concept of URI-lookup in the web archive to include the HTTP redirection status code.
      • We defined the concept of a "Web Archive Profile" to characterize the web archive corpus, with an application to distributed search in the Memento Aggregator.
  • 102. Publications
      • S. G. Ainsworth, A. AlSum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson. "How Much of the Web Is Archived?" In Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, JCDL '11, 2011.
      • A. AlSum, M. L. Nelson, R. Sanderson, and H. Van de Sompel. "Archival HTTP Redirection Retrieval Policies." In Proceedings of the 3rd Temporal Web Analytics Workshop, TempWeb '13, 2013.
      • A. AlSum and M. L. Nelson. "ArcLink: Optimization Techniques to Build and Retrieve the Temporal Web Graph." In Proceedings of the 13th Annual International ACM/IEEE Joint Conference on Digital Libraries, JCDL '13, 2013.
      • A. AlSum, M. C. Weigle, M. L. Nelson, and H. Van de Sompel. "Profiling Web Archive Coverage for Top-Level Domain and Content Language." In Proceedings of the 17th International Conference on Theory and Practice of Digital Libraries, TPDL 2013, 2013.
      • A. AlSum and M. L. Nelson. "Thumbnail Summarization Techniques for Web Archives." In Proceedings of the 36th European Conference on Information Retrieval, ECIR '14, 2014.
  • 103. What's next? • Web Archiving Engineer at Stanford University.
  • 104. WEB ARCHIVE SERVICES FRAMEWORK FOR TIGHTER INTEGRATION BETWEEN THE PAST AND PRESENT WEB Ahmed AlSum PhD Defense February 2014 Old Dominion University Computer Science Department @aalsum
  • 106. Memento
      • Memento is an HTTP extension to integrate the past and the current Web.
      (Timeline figure: T1, T2, T3, Now)
      I. Jacobs and N. Walsh. Architecture of the World Wide Web. Technical report, W3C, 2004. http://www.w3.org/TR/webarch/
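In practice, the "HTTP extension" is datetime negotiation: a client asks a TimeGate for a past version of a resource by sending an `Accept-Datetime` request header, as specified in RFC 7089. The sketch below only builds that header and sends nothing over the network; the helper name is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def memento_request_headers(when: datetime) -> dict:
    """Build the Accept-Datetime header for a timezone-aware datetime."""
    # RFC 7089 expects an HTTP-date expressed in GMT.
    stamp = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return {"Accept-Datetime": stamp}


headers = memento_request_headers(datetime(2002, 2, 12, 19, 40, 20, tzinfo=timezone.utc))
print(headers)  # {'Accept-Datetime': 'Tue, 12 Feb 2002 19:40:20 GMT'}
```

These headers would accompany a GET or HEAD request to a TimeGate, which answers with a redirect to the memento closest to the requested datetime.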
  • 107. Memento • Developer and administrator for Memento aggregator and proxies 107
  • 108. Memento Clients • Memento is now an RFC (RFC 7089).
  • 109. Lack of APIs • Famous websites provide APIs to the third-party developer. Introduction  Motivation 109
  • 110. Lack of APIs • US Agencies started to support APIs to data access. Introduction  Motivation 110
  • 111. Web Archiving Use Cases • Temporal navigation. • Full text search. • Use language filters. • Provide raw WARC. • Import of metadata records into other repositories. Introduction  Motivation *IIPC Access Working Group. Use cases for Access to Internet Archives. International Internet Preservation Consortium Publications, http://www.netpreserve.org/resources/use-cases-access-internet-archives, 2006. 111
  • 112. Related Projects Data analysis for the web data Tools and Methods to access the web archive Enable the user to do experiments on the raw crawled data on Amazon S3 Enable the user to browse the present and the past web Introduction 112
  • 113. Selection
      • Decide what to capture:
          • Everything, any domain
          • National domains
          • Delegate selection to partners
          • Users' favorites
      • We studied what is already captured.
  • 114. URI-Based WayBack Machine Web Archiving Trends  Accessing Web Archive • Textbox to enter the requested URI. • BubbleMap to show you the available mementos. 114
  • 115. Collection-Based Web Archiving Trends  Accessing Web Archive • In addition to browsing the collection, you can browse the URIs in this collection. 115
  • 116. Full-text search Web Archiving Trends  Accessing Web Archive • BL interface provides different filtering techniques for the results. 116
  • 117. Past Web Browser Web Archiving Trends  Accessing Web Archive • You can replay the pages with different controls to forward, backward, pause and stop. 117
  • 118. Zoetrope Web Archiving Trends  Accessing Web Archive • Different Views • Comparison between different Mementos • Not feasible on the current web archiving infrastructure 118
  • 119. DiffIE Web Archiving Trends  Accessing Web Archive • A browser plug-in that caches the pages a person visits and highlights how those pages have changed when the person returns to them • It is possible on the personal archiving. 119
  • 120. Synchronicity Web Archiving Trends  Accessing Web Archive • Mozilla Firefox add-on supports internet user in (re-)discovering missing web pages in real time 120
  • 121. Warrick Web Archiving Trends  Accessing Web Archive • It’s a utility for reconstructing or recovering a website when a back-up is not available 121
  • 122. ArcSys Architecture Diagram Web Archive Service Framework 122
  • 123. WAT files • WAT files are metadata files for WARC files • WAT files are used to create data analysis reports based on large datasets. Metadata Service 123
  • 124. It's More than WAT files
      WAT | ArcLink
      Batch process on a set of WARCs | Batch process on a set of URIs
      For internal use | For public use
      No way to integrate with other WAT files in other locations | Can be aggregated with other graphs
      No incremental update | Supports incremental update
      Access at the WAT file level using Pig | Access at the URI level using a web service
  • 125. Cost of Scaling Up (Internet Archive)
      • Filtering: 88 hrs (108 × 10^9 mementos)
      • Extraction: 247 days
      • Storage: 500 TB
      * Numbers based on the Wayback Machine's published statistics in Oct 2013: 360B mementos with a total size of 5 PB.
  • 126. Time-Indexed Inlinks Information
      Date | Anchor Text
      04-Nov-09 | vancouver2010.com
      11-Nov-09 | vancouver2010.com
      18-Nov-09 | vancouver2010.com
      16-Jan-10 | Vancouver 2010 Olympic Games
      16-Jan-10 | Vancouver 2010 Olympic Games
      23-Jan-10 | vancouver2010.com
      23-Jan-10 | 2010 Vancouver Olympic Games Medals Results Schedule Sports
      30-Jan-10 | 2010 Vancouver Olympic Games Medals Results Schedule Sports
      30-Jan-10 | vancouver2010.com
      30-Jan-10 | Vancouver 2010 Olympic Games
      13-Feb-10 | Vancouver 2010 Olympic Winter Games
      15-Feb-10 | Vancouver 2010 Olympic Games
      18-Feb-10 | Official Vancouver Games site
      19-Feb-10 | vancouver2010.com
      20-Feb-10 | Official Vancouver Games site
      21-Feb-10 | VANOC 2010
  • 127. HTTP Redirection: Relationship between URI-R & URI-M
      (rows: web archive URI-M status; columns: live web URI-R status)
      URI-M \ URI-R | OK | Redirection
      OK | Case 1 | Case 5
      Redirection | Case 2 | Cases 3, 4
      Case 1: 80.8% | Case 2: 2.74% | Case 3: 1.34% | Case 4: 1.33% | Case 5: 13.7%
  • 128. Timemap Redirection Categories • Category 1 URI Service All Mementos have 200 HTTP status code 128
  • 129. Timemap Redirection Categories • Category 2 URI Service All Mementos have redirection to the same URI. 129
  • 130. Timemap Redirection Categories • Category 3 URI Service All Mementos have redirection to different URIs. 130
  • 131. Timemap Redirection Categories • Category 4 URI Service Mementos have different HTTP status code. 131
  • 132. HTTP Redirection: Relationship between URI-R & URI-M
      (rows: web archive URI-M status; columns: live web URI-R status)
      URI-M \ URI-R | OK | Redirection
      OK | Case 1 | Case 5
      Redirection | Case 2 | Cases 3, 4
  • 134. Summary • Quantitative study with 10,000 URIs. • 48% were not fully stable through time. • 27% were not perfectly reliable through time. • New archival retrieval policy: • Policy one: successfully retrieved mementos for 17 out of 77. • Policy two: Expanded the TimeMap for 58% of cases. URI Service  Retrieval Policies 134
  • 135. URI Reliability
      • 23% of the mementos did not lead to a successful memento at the end.
      (Figures: reliability in semi-log scale; reliability for |TM(R)| < 300)
  • 136. Experiment
      • For each sample set, we used the Memento Aggregator to get all the possible archived copies (mementos).
      • For each URI, the Memento Aggregator responded with the TimeMap for this URI.
      Example:
      <http://memento.waybackmachine.org/memento/20010819194233/http://jcdl2002.org>; rel="first memento"; datetime="Sun, 19 Aug 2001 19:42:33 GMT",
      <http://memento.waybackmachine.org/memento/20011216220248/http://jcdl2002.org>; rel="memento"; datetime="Sun, 16 Dec 2001 22:02:48 GMT",
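A TimeMap in this link format is easy to process mechanically. The sketch below pulls out each memento's URI and datetime with a regular expression; it handles only entries shaped like the example (URI, then `rel`, then `datetime`) and is not a general link-format parser. The function name is hypothetical.

```python
import re
from email.utils import parsedate_to_datetime

# Matches entries of the form: <URI>; rel="..."; datetime="..."
LINK_RE = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"\s*;\s*datetime="([^"]+)"')


def parse_timemap(text: str) -> list:
    """Return (datetime, URI-M) pairs for every memento entry, oldest first."""
    entries = []
    for uri, rel, stamp in LINK_RE.findall(text):
        if "memento" in rel.split():  # rel may hold several values, e.g. "first memento"
            entries.append((parsedate_to_datetime(stamp), uri))
    return sorted(entries)


sample = (
    '<http://memento.waybackmachine.org/memento/20010819194233/http://jcdl2002.org>;'
    'rel="first memento";datetime="Sun, 19 Aug 2001 19:42:33 GMT",\n'
    '<http://memento.waybackmachine.org/memento/20011216220248/http://jcdl2002.org>; '
    'rel="memento"; datetime="Sun, 16 Dec 2001 22:02:48 GMT",'
)
timemap = parse_timemap(sample)
print(len(timemap), timemap[0][0].isoformat())
```

Counting the entries returned for each sampled URI gives the per-URI memento counts that the coverage experiment reports.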
  • 137. 1000 URIs Ordered by First Observation Date Archive Service  Percentage  Results See also: http://ws-dl.blogspot.com/2011/06/2011-06-23-how-much-of-web-is-archived.html 137
  • 138. 2010 Archive Service  Percentage  Results 2013 138
  • 139. Archive Service  Percentage  Results 2010 2013 139
  • 140. Archive Service  Percentage  Results 2010 2013 140
  • 141. Archive Service  Percentage  Results 2010 2013 141
  • 142. URIs Samples Sources – Live Web
      1. DMOZ – Random sample: 10,000 URIs randomly sampled from the DMOZ directory (~5M URIs).
      2. DMOZ – TLD: 200 URIs for each TLD (80 TLDs).
      3. DMOZ – Languages: 100 URIs for each language (40 languages).
  • 143. URIs Samples Sources – Web Archive
      • Query the full-text search interface of the web archives with two sets of query terms.
      4. Top 1-grams from Bing: most of them are English.
      5. Top 1000 query terms by Yahoo in 9 languages: we excluded general keywords such as Obama and Facebook.
  • 144. URIs Samples Sources – User requests
      • Sampling from the users' requests for web archived materials.
      6. Sample from IA Wayback Machine log files: 10,000 URIs randomly sampled from Feb 22, 2012 to Feb 26, 2012.
      7. Sample from Memento Aggregator log files: 1,000 URIs randomly sampled from the LANL Memento Aggregator between 2011 and 2013.
  • 145. General Coverage
      Legend: IA = Internet Archive; AIT = Archive-It; LoC = Library of Congress; UK = UK National Library; BL = British Library; SG = Singapore; CAT = Web Archive of Catalonia; CR = Croatian Web Archive; CZ = Archive of the Czech Web; TW = National Taiwan University; IC = Icelandic Web Archive; CAN = Library and Archives Canada; SI = Slovenia; WEB = WebCite; AIS = Archive.is
  • 146. Web Archive Selection Evaluation Archive Service  Distribution 146
  • 147. Web Archive Selection Evaluation Archive Service  Distribution 147
  • 149. iTunes cover application Metadata Service  ArcThumb  Motivation 149

Editor's notes

  1. Filters and Extracted
  2. Verbally show this is the end. Explain this is an initial step in this area.