WEB ARCHIVING CHALLENGES & OPPORTUNITIES PRESENTATION

WEB ARCHIVING
CHALLENGES & OPPORTUNITIES
PRESENTATION FOR WEB ARCHIVING ENGINEERING POSITION
Ahmed AlSum
PhD Candidate
Old Dominion University

Outline
• Engineering Experience
• IBM
• Old Dominion University
• Internet Archive
• Web Archiving Challenges & Opportunities
• Selection
• Harvesting
• Storage
• Access
• Community
• Conclusions

CCSP Project
• An internal IBM support portal that provides client-facing
audiences a by-client, holistic view of client situations
• Technologies: WebSphere Portal, DB2, deployed on
zLinux machines

Responsibilities
• Software Engineer
• Enterprise Applications with J2EE platform technologies for
frontend (Servlets, JSP, Portlet APIs), and backend tasks based on
EJB
• Front-end components based on Web 2.0 technologies (AJAX
based on Dojo 1.0, and JavaScript)
• Lotus Sametime (Plugins and Bot development)
• Software engineer team leader
• Support project quality activities
• Lead code review and static analysis activities

Responsibilities
• Administrator
• Deploying Portal solutions on WebSphere Portal
• WebSphere Portal Administration for standalone and clustered
environment
• Administration on Linux and Windows OS
• DB2 server administration for single instance and multiple
instances with HADR support
• Customer support team lead
• Leading customer support activities

Sharing IBM Internal Solutions
with Broader Community

Memento
• Memento is an HTTP
extension to integrate the
Past and the Current
Web
I. Jacobs and N. Walsh, Architecture of the World Wide Web, Technical report, W3C, 2004. http://www.w3.org/TR/webarch/
Timeline labels: T1, T2, T3, Now

Memento
• Developer and administrator for Memento aggregator and proxies
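
As a rough illustration of how a client asks for the past version of a resource, the sketch below builds (without sending) a datetime-negotiation request carrying the Accept-Datetime header. The TimeGate address and the convention of appending the original URI to it are assumptions made for the example, not a description of the aggregator deployed here.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Hypothetical TimeGate endpoint, used only for illustration.
TIMEGATE = "http://timegate.example.org/timegate/"


def accept_datetime_value(when: datetime) -> str:
    """Format a timezone-aware datetime as an HTTP-date in GMT."""
    return format_datetime(when.astimezone(timezone.utc), usegmt=True)


def build_timegate_request(original_uri: str, when: datetime) -> Request:
    """Build (but do not send) a datetime-negotiation request."""
    return Request(
        TIMEGATE + original_uri,
        headers={"Accept-Datetime": accept_datetime_value(when)},
    )


if __name__ == "__main__":
    request = build_timegate_request(
        "http://www.vancouver2010.com/",
        datetime(2010, 2, 12, tzinfo=timezone.utc),
    )
    print(request.full_url)
    print(request.header_items())
```

A TimeGate that supports the protocol would normally answer such a request by redirecting to the memento closest to the requested datetime.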

Memento Clients
• Memento is currently an Internet-Draft (I-D) and is expected to
advance to an RFC soon.

WAT Extraction
• Web Archive Transformation (WAT) is a specification for
structuring metadata generated by Web crawls
• Technologies:
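
To make the idea of structured crawl metadata concrete, here is a minimal sketch that reads one WAT-style JSON record and pulls out the target URI, the page title, and the outgoing anchors. The sample record and its field names (Envelope, Payload-Metadata, HTML-Metadata, Links) are assumptions based on the commonly seen layout, and the helper name is hypothetical.

```python
import json

# Illustrative WAT-style record; the field layout is an assumption.
SAMPLE_WAT_JSON = """
{
  "Envelope": {
    "WARC-Header-Metadata": {
      "WARC-Target-URI": "http://example.org/",
      "WARC-Date": "2010-02-12T00:00:00Z"
    },
    "Payload-Metadata": {
      "HTTP-Response-Metadata": {
        "HTML-Metadata": {
          "Head": {"Title": "Example page"},
          "Links": [
            {"path": "A@/href",
             "url": "http://www.vancouver2010.com/",
             "text": "Vancouver Olympics"},
            {"path": "IMG@/src", "url": "/logo.png"}
          ]
        }
      }
    }
  }
}
"""


def summarize_wat_record(raw_json):
    """Return the URI, capture date, title, and anchor links of one record."""
    envelope = json.loads(raw_json)["Envelope"]
    header = envelope.get("WARC-Header-Metadata", {})
    html = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    # Keep only <a href> links; other paths (IMG@/src, ...) are skipped.
    anchors = [
        (link["url"], link.get("text", ""))
        for link in html.get("Links", [])
        if link.get("path", "").startswith("A@")
    ]
    return {
        "uri": header.get("WARC-Target-URI"),
        "date": header.get("WARC-Date"),
        "title": html.get("Head", {}).get("Title"),
        "anchors": anchors,
    }


if __name__ == "__main__":
    print(summarize_wat_record(SAMPLE_WAT_JSON))
```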

WEB ARCHIVING
Challenges and Opportunities

Web Archive Life Cycle
Hockx-Yu, H., 2011. The Past Issue of the Web. In Proceedings of the 3rd International Conference on Web Science, pp. 1–8.

Selection
• Decide what to capture
Everything, any domain
National domains
Delegate selection to partners
Users’ favorites
• We studied what is already captured

How Much Of The Web Is
Archived?
S. G. Ainsworth, A. AlSum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson
In Proceedings of the 11th annual international ACM/IEEE joint
conference on Digital libraries, JCDL '11, Ottawa, Canada, 2011
See also: http://arxiv.org/abs/1212.6177

Archive categories
We have 3 categories of archives
• Internet Archive (classic interface)
• Search engine
• Other archives
Selection
Public Archives (UK, US), ca. Late 2010 / Early 2011

1000 URIs Ordered by First Observation Date
Selection
See also: http://ws-dl.blogspot.com/2011/06/2011-06-23-how-much-of-web-is-archived.html

Memento Distribution, ordered by the first observation date

How Much of the Web is Archived?
It Depends on Which Web…
Selection
Including SE cache    Excluding SE cache
90%                   79%
97%                   68%
88%                   19%
35%                   16%
Changes since 2011: no more free SE APIs;
greatly reduced IA quarantine period; 15 public web archives
2013: 95%, 92%, 23%, 26%

Profiling Web Archive
Coverage For
Top-level Domain And
Content Language
A. AlSum, M. C. Weigle, M. L. Nelson, and H. Van de Sompel
In Proceedings of the 17th International Conference on
Theory and Practice of Digital Libraries, TPDL 2013, 2013

Where is it archived?
Selection
IA    Internet Archive
CAN   Library and Archives Canada
PO    Portuguese Web Archive
CZ    Archive of the Czech Web
LoC   Library of Congress
BL    British Library
CAT   Web Archive of Catalonia
TW    National Taiwan University
IC    Icelandic Web Archive
UK    UK National Library
CR    Croatian Web Archive
AIT   Archive-It

Language Coverage
Selection

Growth Rate
Selection
Borrowed Portuguese material from IA
Stopped archiving since 2008
Steady growth
Stopped getting new URIs, but still crawling

Selection Research Output
• Some portions of the Web, such as
India and Africa, are not well
archived.
• Profiling helps with Memento
query routing.
• IIPC proposal with Herbert Van
de Sompel (LANL) and David
Rosenthal (SUL).
Selection

Selection at SUL
• Focus on the missing parts of the Web
• Twitter - Crowdsource:
• UK Web archive: Twittervana
• Internet Memory: Collect URIs from Twitter APIs
• VA Tech: CTRNET project
• Stanford Community
• World News collection: 10 news websites from each country
• Tools:
Selection

Harvesting
• Services
• Archive-It
• WAS @ CDLib
• Dedicated servers
• New tools
See also: http://ws-dl.blogspot.com/2013/07/2013-07-10-warcreate-and-wail-warc.html

Special Harvesting Techniques
• Borrow old materials from other web archives
• Ex.: Stanford WebBase Project*
• 260 TB
• 7 Billion webpages
Harvesting
*http://www-diglib.stanford.edu/~testbed/doc2/WebBase/

• Social Media
• Focus on shared resources in the social media
Harvesting
Hany M. SalahEldeen, Michael L. Nelson, Losing My Revolution: How Many Resources Shared on Social Media Have Been
Lost?, Proceedings of TPDL 2012
http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html

• SiteStory - Transactional Archive
Harvesting
Justin F. Brunelle, Michael L. Nelson, Lyudmila Balakireva, Robert Sanderson, Herbert Van de Sompel, Evaluating the SiteStory
Transactional Web Archive With the ApacheBench Tool, Proceedings of TPDL 2013
SiteStory: http://mementoweb.github.io/SiteStory/

Harvesting
• Challenges
• Ajax and Web 2.0/3.0
• Streaming Media
• URI challenges
• Mobile
Harvesting
http://blog.dshr.org/2012/05/harvesting-and-preserving-future-web.html
http://netpreserve.org/sites/default/files/resources/OverviewFutureWebWorkshop.pdf

Storage (Format)
• Flat files:
• WARC files (ISO standard)
• No-SQL db:
• HBase at Internet Memory*
• Storage at SUL:
• We need to use both
Storage
*Philippe Rigaux, Understanding HBase - The data model, IM technology blog
http://internetmemory.org/en/index.php/synapse/understanding_the_hbase_data_model/

Storage (Infrastructure)
• The wrong solution could be a disaster
Storage

Accessing Web Archive
URI-Based
Wayback Machine
• Textbox to enter the
requested URI
• BubbleMap to show
you the available
mementos

Full-text search
• Challenges: temporal
PageRank, rank per
site or memento, date
filtering

• Thumbnail View
• Trade-off between building the thumbnail in real time
or pre-building it
• Trade-off between representing the thumbnail by URI
or by embedded binary data
• Can we build a partial thumbnail map?

• Title View
• Trade-off between extracting all the titles and keeping them as
metadata about the memento, and extracting the title from the HTML
content in real time
Implemented using Simile: http://www.simile-widgets.org/timeline/

• Wayback Machine API
• XML interface for the
list of available
Mementos

• Web Page Snapshot Replay
• URI rewriting, JavaScript, and embedded resources

• Page Completeness Degree
• The completeness degree could be calculated in real time
by using the preserved HTTP status of the
embedded resources
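
One simple way to turn the preserved statuses into a number is sketched below. The definition used here (share of embedded resources archived with a 2xx status, and a page with no embedded resources counted as complete) is an assumption for illustration, as is the function name.

```python
def completeness_degree(embedded_statuses):
    """Fraction of embedded resources preserved with a 2xx HTTP status.

    embedded_statuses maps each embedded resource URI to the HTTP status
    recorded in the archive. Assumption: a page without embedded
    resources counts as fully complete.
    """
    if not embedded_statuses:
        return 1.0
    preserved = sum(
        1 for status in embedded_statuses.values() if 200 <= status < 300
    )
    return preserved / len(embedded_statuses)


if __name__ == "__main__":
    memento_resources = {
        "http://example.org/style.css": 200,
        "http://example.org/logo.png": 404,
        "http://example.org/app.js": 200,
        "http://example.org/banner.gif": 503,
    }
    print(completeness_degree(memento_resources))
```

A weighted variant (for example, counting a missing stylesheet more heavily than a missing icon) would fit the same interface.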

• Reconstructing web site
• Current approach is
using the web archive
public interface.

• Wayback Annotator
• Create collections
• Select and save
relevant content to
their collections
• Annotate & mark
important parts of
archived web pages
• Share their work and
collaborate on
archived content use
http://netpreserve.org/sites/default/files/resources/Predstavitev_07.pdf
http://netpreserve.org/sites/default/files/resources/Wayback_annotator_06.pdf

Collection-Based
• In addition to
browsing the
collection, you can
browse the URIs in
this collection
• Research questions:
Collection overview

• Collection visualization
• Term frequency
algorithms should be
normalized to take
memento density into
consideration
http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
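
A minimal sketch of the normalization idea: raw term counts grow with the number of mementos captured in a period, so each period's count is divided by its memento count. Yearly buckets, whitespace tokenization, and the function name are assumptions chosen to keep the example short.

```python
from collections import Counter, defaultdict


def normalized_term_frequency(mementos, term):
    """Occurrences of `term` per memento, grouped by year.

    mementos is an iterable of (year, text) pairs.
    """
    occurrences = defaultdict(int)
    density = Counter()  # number of mementos captured in each year
    needle = term.lower()
    for year, text in mementos:
        density[year] += 1
        occurrences[year] += text.lower().split().count(needle)
    return {year: occurrences[year] / density[year] for year in sorted(density)}


if __name__ == "__main__":
    collection = [
        (2009, "olympics bid"),
        (2010, "olympics olympics"),
        (2010, "olympics news"),
        (2010, "weather"),
        (2010, "olympics"),
    ]
    # Raw counts are 1 (2009) versus 4 (2010), but 2010 also has four
    # times as many mementos, so the normalized values are equal.
    print(normalized_term_frequency(collection, "olympics"))
```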

• Web Archive analytics
See also: http://ilpubs.stanford.edu:8090/1037/1/arcspread.pdf
• ArcSpread took a
query from the
user, extracted related
information, and
displayed the results
in spreadsheet style.

Who And What Links To The
Internet Archive
Y. Alnoamany, A. AlSum, M. C. Weigle, M. L. Nelson
In Proceedings of 17th International Conference on
Theory and Practice of Digital Libraries, TPDL
2013, 2013 (Best Student Paper)

Serving Robots!
• Log file analysis using Apache Pig
• In accesses to the IA Wayback Machine,
robots outnumber humans
• 10:1 in terms of sessions
• 5:4 in terms of raw HTTP accesses
• 4:1 in terms of megabytes transferred
Access

Where do Wayback Machine Users
Come From?
Website                     Percentage   Description
en.wikipedia.org            12.9%        Wikipedia
archive.org                 11.9%        IA Home Page
reddit.com                  10.2%        Social News Web Site
google.TLD                  9.9%         Search Engine
info-poland.buffalo.edu     1.5%         Polish Studies
de.wikipedia.org            1.4%         Wikipedia
cracked.com                 1.2%         Humor Site
snopes.com                  1.1%         Urban Legends Reference Pages
facebook.com                0.9%         Social Media
crochetpatterncentral.com   0.9%         Crocheting Hobbies
Access

Most Languages Self-Link
Access

ArcLink:
Optimization Techniques To Build And Retrieve
The Temporal Web Graph
A. AlSum, M. L. Nelson
IIPC GA 2013, Ljubljana, Slovenia
In Proceedings of the 13th international ACM/IEEE joint
conference on Digital libraries, JCDL '13, 2013

Easy Solved Questions
Q: What are the available mementos for
vancouver2010.com?
Access
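
This question is easy because the list of mementos can be read directly from a Memento TimeMap. The sketch below parses a small link-format TimeMap and keeps the entries whose relation includes "memento". The archive host and the sample entries are made up for illustration, and the regular expressions cover only the simple, well-formed case.

```python
import re

# One link-value: <uri> followed by ;name="value" parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*\w+="[^"]*")*)')
PARAM_RE = re.compile(r'(\w+)="([^"]*)"')

# Made-up TimeMap body; archive.example.org is a hypothetical archive.
SAMPLE_TIMEMAP = """\
<http://www.vancouver2010.com/>; rel="original",
<http://archive.example.org/web/20080101000000/http://www.vancouver2010.com/>; rel="first memento"; datetime="Tue, 01 Jan 2008 00:00:00 GMT",
<http://archive.example.org/web/20100212000000/http://www.vancouver2010.com/>; rel="last memento"; datetime="Fri, 12 Feb 2010 00:00:00 GMT"
"""


def parse_timemap(body):
    """Return (datetime, memento URI) pairs found in a link-format TimeMap."""
    mementos = []
    for uri, params in LINK_RE.findall(body):
        attrs = dict(PARAM_RE.findall(params))
        # rel may hold several space-separated relation types.
        if "memento" in attrs.get("rel", "").split():
            mementos.append((attrs.get("datetime"), uri))
    return mementos


if __name__ == "__main__":
    for captured_at, memento_uri in parse_timemap(SAMPLE_TIMEMAP):
        print(captured_at, memento_uri)
```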

Solved Questions, but hard
Q: What are the HTML titles for vancouver2010.com
through time?
A: Page scraping for all mementos
Access

Impossible Questions
Q: What anchor texts pointed to
www.vancouver2010.com through time?
Access
…
<a href=www.vancouver2010.com>
Vancouver Olympics
</a>
…
…
<a href=www.vancouver2010.com>
Winter Olympics
</a>
…
…
<a href=www.vancouver2010.com>
Vancouver 2010
</a>
…
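
For a single archived page, collecting the anchor text that points to a target is straightforward, as the sketch below shows with the standard-library HTML parser. The hard part is doing this over every archived page that might link to the target, which is why the question counts as impossible through the public interface. The sample pages, the crude URI normalization, and the helper names are assumptions for illustration; this is plain page scraping, not the ArcLink approach.

```python
from html.parser import HTMLParser


def normalize(uri):
    """Crude URI normalization: drop scheme, case, and trailing slash."""
    uri = uri.strip().lower()
    for scheme in ("http://", "https://"):
        if uri.startswith(scheme):
            uri = uri[len(scheme):]
    return uri.rstrip("/")


class AnchorTextParser(HTMLParser):
    """Collect the text of every <a> whose href points at the target URI."""

    def __init__(self, target):
        super().__init__()
        self.target = normalize(target)
        self.anchor_texts = []
        self._inside = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        if tag == "a" and normalize(href) == self.target:
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            # Collapse line breaks and repeated spaces in the anchor text.
            self.anchor_texts.append(" ".join("".join(self._buffer).split()))
            self._inside = False


def anchor_texts_through_time(pages, target):
    """pages: iterable of (capture_label, html). Returns (label, text) pairs."""
    results = []
    for label, html in pages:
        parser = AnchorTextParser(target)
        parser.feed(html)
        results.extend((label, text) for text in parser.anchor_texts)
    return results


if __name__ == "__main__":
    pages = [
        ("2008", '<a href="http://www.vancouver2010.com/">\nVancouver Olympics\n</a>'),
        ("2009", '<a href="http://example.org/">Other</a> '
                 '<a href="www.vancouver2010.com">Winter Olympics</a>'),
        ("2010", '<a href="http://www.vancouver2010.com">Vancouver 2010</a>'),
    ]
    print(anchor_texts_through_time(pages, "www.vancouver2010.com"))
```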

ArcLink
Access
Google code: https://code.google.com/p/arcsys/

Impossible Questions
• Q: What anchor texts pointed to
www.vancouver2010.com through time?
Access

Thumbnail Summarization
Techniques For Web
Archives
A. AlSum, and M. L. Nelson
Submitted for publication.

Thumbnails
Access
Internet Archive UK Web archive

Thumbnail Creation Challenges
• Scalability in Time
• IA may need 361 years to create one thumbnail per memento
using one hundred machines
• Scalability in Space
• IA will need 355 TB to store one thumbnail per memento
• Page quality
Access

How many thumbnails do we need?
Access
www.unfi.com on the live Web

40 Thumbnails are good.
Access

Same technique applied to apple.com
Access

From 8000 Mementos to 69 Thumbnails.
Access

iTunes cover application
Access

Community
• I suggest becoming a member of the IIPC
• Join the open Wayback Machine team
• Join the Winter Olympics 2014 collaborative project, even as an
observer

Community
• Web Archiving Workshops
WAC 2011, Ottawa, Canada
WAC 2012, Stanford, CA, USA
WADL 2013, Indianapolis, IN, USA
TempWeb 2013, Rio de Janeiro, Brazil

Tools to SUL Web Archive
• Selection
• Harvest
• Analysis
• Access

Conclusions
• Be Selective: Cover missing parts of the Web
• Be Older: Include WebBase
• Be Smart: Innovative services
• Be Helpful: Researcher Framework/Dataset
• Be Active: Participate in the WA communities
• Make a difference
aalsum@cs.odu.edu
@aalsum

What is missing?
LoC   Library of Congress
BL    British Library
CAT   Web Archive of Catalonia
TW    National Taiwan University

Thumbnail Features
SimHash DOM tree
Embedded resources Datetime
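
To show how one of these features could reduce thousands of mementos to a handful of thumbnails, the sketch below computes a 64-bit SimHash over page text and keeps a memento only when its fingerprint differs enough from the last one kept. This is an illustrative stand-in, not the selection algorithm from the paper; the tokenization, the MD5-based token hash, and the threshold value are all assumptions.

```python
import hashlib

BITS = 64


def simhash(text):
    """64-bit SimHash over whitespace-separated, lowercased tokens."""
    weights = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for bit in range(BITS):
            weights[bit] += 1 if (token_hash >> bit) & 1 else -1
    return sum(1 << bit for bit, weight in enumerate(weights) if weight > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def select_thumbnails(mementos, threshold=8):
    """Pick indices of mementos that need their own thumbnail.

    mementos is a chronologically ordered list of (datetime, text).
    A memento is selected when its SimHash differs from the last
    selected one by more than `threshold` bits (assumed value).
    """
    selected = []
    last_fingerprint = None
    for index, (_, text) in enumerate(mementos):
        fingerprint = simhash(text)
        if last_fingerprint is None or hamming(fingerprint, last_fingerprint) > threshold:
            selected.append(index)
            last_fingerprint = fingerprint
    return selected


if __name__ == "__main__":
    captures = [
        ("2010-01-01", "welcome to our store"),
        ("2010-01-02", "welcome to our store"),
        ("2010-06-01", "completely redesigned site with new products and prices"),
    ]
    print(select_thumbnails(captures))
```

The other listed features (DOM tree, embedded resources, datetime) could replace or complement the text fingerprint behind the same interface.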
