Aggregating Private and Public Web Archives
Using the Mementity Framework
PhD Dissertation Defense for:
Mat Kelly
Advisor:
Michele C. Weigle
Committee Members:
Michele C. Weigle, Michael L. Nelson, Danella Zhao,
Justin F. Brunelle, and Dimitrie C. Popescu
Department of Computer Science
Norfolk, Virginia 23529 USA
May 7, 2019
PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
The Web
The Web is Ephemeral
Web Archives to the Rescue: Typical Access
1. Go to archive.org in your browser
2. Enter the URL you want to see in the past in the form field
3. Submit your query
4. Locate the archived Web page on the calendar or histogram view
5. Select the year/selection for the day
6. Repeat until you find the closest date and time
7. Finally, View the Archived Web Page
Web Archiving - Live Web odu.edu
Now
Web Archiving: View The Web of the Past
May 4, 2019
August 2012 (when I started my PhD)
August 2010 (when I started at ODU)
June 2006 (when I finished my BS)
August 2001 (when I started college)
Web Archiving
2019
2012
2010
2006
2001
Associate a live Web URI with its archived representations
Web Archives Provide Access to the Web that was
What did odu.edu look like in the past?
Multiple Archival Efforts
from Alam et al., IJDL 2016.
https://git.io/archives
Consulting More Archives Produces a More Comprehensive Picture
Even Then, Not Everything is Preserved
What did obscuresite.com look like in the past?
0 mementos for obscuresite.com
What a User Sees on the Live Web May Not Be What Is Captured
What did facebook.com look like in the past?
Figures 10a and 10b, Chapter 1
...And Oftentimes That is For the Best
Have you preserved my online banking?
Figure 8, Chapter 1
Private Live Web Content is Prescriptively Ephemeral
Figures 9a & 9b, Chapter 1
Other Times, We May Want Our Content Archived
Have you archived my daughter’s pictures on Google Photos?
0 mementos for that URI
...Especially When It Has Disappeared
Have you archived my daughter’s pictures on Google Photos!?!?
0 mementos for that URI
“It was there yesterday, where did it go?”
“Save this, but only for me.”
● Screenshots of Web pages are insufficient
○ Not interactive/representative, do not integrate, lose context otherwise provided in metadata
● Large-scale archives’ tools are open source
● Individuals can archive, but there are still technical barriers
Individuals, Too, Can Archive The Web
“I want to share this but control who can see it.”
Archives from Institutional and Personal Sources
at tA at tC at tD→Z
Today’s Memento Aggregation
Archives Queried (A0)
Desire: Include Personal Archives
Archives Queried (A0)
Desire: Include Other Non-Aggregated Archives
Archives Queried (A0)
Rapidly Changing Pages May Not Be Comprehensively Captured
More Archives Provide a Better Picture of the Web
Before aggregating, we must first consider the ramifications, enumerated in the RESEARCH QUESTIONS:
Research Questions
RQ1: What sort of content is difficult to capture and replay for preservation from the perspective of a Web browser?
RQ2: How do Web browser APIs compare in potential functionality to the capabilities of archival crawlers?
RQ3: What issues exist for capturing and replaying content behind authentication?
RQ4: How can content that was captured behind authentication signal to Web archive replay systems that it requires special handling?
RQ5: How can Memento aggregators indicate that private Web archive content requires special handling to be replayed, despite being aggregated with publicly available Web archive content?
RQ6: What kinds of access control do users who create private Web archives need to regulate access to their archives?
To answer these questions, we first have to be familiar with
BACKGROUND INFORMATION
Outline
● Introduction/Motivation
● Background
● Related Work
● The Mementity Framework
● Applying Framework to Scenarios
● Contributions, Conclusions, and Future Work
Hypertext Transfer Protocol (HTTP)
HTTP Request
> GET /~mkelly/dissertation.pdf HTTP/1.1
> Host: www.cs.odu.edu
> Connection: close
(Host and Connection are request headers)
HTTP Response
HTTP/1.1 200 OK
Server: nginx
Date: Sun, 05 May 2019 18:31:31 GMT
Content-Type: application/pdf
Content-Length: 40067283

%PDF-1.5
...
(Server through Content-Length are response headers, i.e., metadata; %PDF-1.5 is the start of the entity body)
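The split between response headers and entity body can be illustrated with a minimal Python sketch. It assumes the raw bytes of the response are already in hand; `RAW_RESPONSE` and `split_response` are hypothetical names used only for this illustration, and the body is truncated to its first line. The header block is read with the standard library's `http.client.parse_headers`, which stops at the blank line that separates metadata from the body.

```python
import http.client
import io

# Raw response modeled on the slide; the PDF body is truncated to one line.
RAW_RESPONSE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Server: nginx\r\n"
    b"Date: Sun, 05 May 2019 18:31:31 GMT\r\n"
    b"Content-Type: application/pdf\r\n"
    b"Content-Length: 40067283\r\n"
    b"\r\n"
    b"%PDF-1.5\n"
)


def split_response(raw):
    """Return (status line, headers, body) from raw HTTP response bytes."""
    stream = io.BytesIO(raw)
    # The first line is the status line, not a header.
    status_line = stream.readline().decode("iso-8859-1").rstrip("\r\n")
    # parse_headers consumes lines up to and including the blank line.
    headers = http.client.parse_headers(stream)
    # Whatever remains in the stream is the entity body.
    body = stream.read()
    return status_line, headers, body


status_line, headers, body = split_response(RAW_RESPONSE)
print(status_line)
print(headers["Content-Type"])
print(body[:8])
```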
Background: Memento
Adapted from Memento Guide: Introduction. http://www.mementoweb.org/guide/quick-intro/, January 2015, in Public Domain
Figure 21, Chapter 2
* H. Van de Sompel et al. HTTP Framework for Time-Based Access to Resource States – Memento. IETF RFC 7089, December 2013.
Background: Memento Request Example
Request cnn.com at Aug 2, 2017 at 6:15pm EST
HTTP Request to the TimeGate (URI-G)
• Accept-Datetime: Wed, 02 Aug 2017 23:15:00 GMT
• GET: http://web.archive.org/web/http://www.cnn.com
HTTP Response (302)
• Memento-Datetime: Wed, 02 Aug 2017 23:18:04 GMT
• Location: http://web.archive.org/web/20170802231804/http://www.cnn.com/
• Link: relations for original (URI-R), timegate (URI-G), timemap (URI-T), and memento (URI-M)
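The datetime negotiation in this example can be sketched in Python without touching the network. The sketch below formats the `Accept-Datetime` value for the request and recovers the capture time from the `Location` value of the 302 response. It assumes a Wayback-style URI-M that embeds a 14-digit timestamp, which holds for this example but is not required by RFC 7089; the helper names are hypothetical.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumption: the URI-M embeds a 14-digit timestamp (Wayback-style layout).
URI_M_PATTERN = re.compile(r"/(\d{14})/(.+)$")


def accept_datetime(moment):
    """Format a timezone-aware datetime as an HTTP date in GMT."""
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


def capture_time_from_uri_m(uri_m):
    """Extract the capture datetime from a Wayback-style URI-M."""
    match = URI_M_PATTERN.search(uri_m)
    if match is None:
        raise ValueError("no 14-digit timestamp found in URI-M")
    stamp = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return stamp.replace(tzinfo=timezone.utc)


requested = datetime(2017, 8, 2, 23, 15, 0, tzinfo=timezone.utc)
request_headers = {"Accept-Datetime": accept_datetime(requested)}

# Location value taken from the 302 response in the example.
location = "http://web.archive.org/web/20170802231804/http://www.cnn.com/"
captured = capture_time_from_uri_m(location)
offset_seconds = (captured - requested).total_seconds()

print(request_headers)
print(captured.isoformat(), offset_seconds)
```

Here the TimeGate selected a memento captured 184 seconds after the requested datetime.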
Background: Dereferencing a TimeMap at URI-T
Request URI-T
● Date-based pagination
● Other formats for TimeMap
Diagram: the TimeMap at URI-T links to the first through last mementos (URI-M1 ... URI-Mm ... URI-Mn), to further paginated TimeMaps (URI-T1 ... URI-Tt), to the TimeGate (URI-G), and to the original resource (URI-R).
Background - Privacy and Security
● Web users question trusting institutions to preserve private Web contents¹
● OAuth 2.0² facilitates authentication cohesion of entities
¹ Marshall and Shipman, “On the Institutional Archiving of Social Media”, JCDL 2012
² D. Hardt. The OAuth 2.0 Authorization Framework. IETF RFC 6749, October 2012.
RQ3: What issues exist for capturing and replaying content behind authentication?
RQ6: What kinds of access control do users who create private Web archives need to regulate access to their archives?
Figure 27, Chapter 2
Role-based Delegation and Authentication
A familiar paradigm used for authentication on the live Web
RQ3: content behind auth
RQ6: access control
Private vs. Public Web archives
Both personal and institutional Web archiving can be either public or private.
Table I, Chapter 1
[49] Brunelle et al., “Leveraging Heritrix and the Wayback Machine on a Corporate Intranet…”, D-Lib Magazine, 22(1/2), 2016
OUR FOCUS
HTTP Prefer
● HTTP negotiation already available via Accept-* headers
● Prefer syntax provides a mechanism for client to specify preferences
○ ...with which servers may or may not comply
* J. Snell. Prefer Header for HTTP. IETF RFC 7240, June 2014.
Prefer: foo; bar="baz"
Preference-Applied: bar="baz"
RQ5: How can Memento aggregators indicate that private Web archive content requires special
handling to be replayed, despite being aggregated with publicly available Web archive content?
RQ4: How can content that was captured behind authentication
signal to Web archive replay systems that it requires special handling?
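To make the Prefer syntax concrete, the following Python sketch splits a `Prefer` header value into preferences and their parameters. It is a simplified reading of the RFC 7240 grammar rather than a complete parser: it assumes no commas or semicolons appear inside quoted strings, and `parse_prefer` is a hypothetical helper name.

```python
def _split_pair(text):
    """Split 'key=value' into (key, value); value is None when absent."""
    key, separator, value = text.partition("=")
    value = value.strip().strip('"') if separator else None
    return key.strip().lower(), value


def parse_prefer(header_value):
    """Parse a Prefer header value into (name, value, parameters) tuples.

    Simplification: quoted strings containing ',' or ';' are not handled.
    """
    preferences = []
    for item in header_value.split(","):
        parts = [part.strip() for part in item.split(";") if part.strip()]
        if not parts:
            continue
        name, value = _split_pair(parts[0])
        parameters = dict(_split_pair(part) for part in parts[1:])
        preferences.append((name, value, parameters))
    return preferences


parsed = parse_prefer('foo; bar="baz"')
print(parsed)
```

A server that honors a preference can echo it back in `Preference-Applied`; a server that ignores it simply omits that header, so a client should be ready for either outcome.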
RQ5: How can Memento aggregators indicate that private Web archive
content requires special handling to be replayed, despite being aggregated
with publicly available Web archive content?
Memento Aggregation* State of the Art
* Querying multiple Web archives for a URI from a single endpoint at a Memento aggregator
Time Travel Service, MemGator
Memento Aggregation - MementoWeb
Also available via CLI:
$ curl http://timetravel.mementoweb.org/timemap/link/http://odu.edu
Figures 25a and 25b, Chapter 2
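The curl call above follows a simple URI pattern: the aggregator's base, the path `/timemap/`, a format, and then the URI-R. A small Python sketch of building such URIs is shown below; `timemap_uri` is a hypothetical helper, and the assumption that other aggregators accept the same path layout should be checked against each service.

```python
# Base URI of the Time Travel aggregator, as used in the curl example.
TIME_TRAVEL_BASE = "http://timetravel.mementoweb.org"


def timemap_uri(uri_r, fmt="link", base=TIME_TRAVEL_BASE):
    """Build an aggregator TimeMap URI for an original resource (URI-R).

    Assumption: the aggregator exposes TimeMaps at /timemap/<format>/<URI-R>.
    """
    return f"{base.rstrip('/')}/timemap/{fmt}/{uri_r}"


print(timemap_uri("http://odu.edu"))
```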
Memento Aggregation - MemGator
● Open Source Memento Aggregator - github.com/oduwsdl/memgator
● Easy personal/local deployment
● Specify archive list on launch
○ Easily configurable JSON
○ Use default collection if not specified
● TimeMap Formats:
○ Link
○ JSON
○ CDXJ
* Alam and Nelson, “MemGator - A Portable Concurrent Memento Aggregator: Cross-Platform CLI and Server Binaries in Go”, JCDL 2016
CDXJ: An Alternative TimeMap Format
Memento entry in Link (RFC 7089) TimeMap
...
<http://localhost:8080/20101116060516/http://facebook.com/>; rel="memento";
datetime="Tue, 16 Nov 2010 06:05:16 GMT",
...
Memento entry in CDXJ TimeMap
...
20101116060516 {
"uri": "http://localhost:8080/20101116060516/http://facebook.com/",
"rel": "memento",
"datetime": "Tue, 16 Nov 2010 06:05:16 GMT"
}
...
Annotated fields: Memento (URI-M), Relative Relations, Memento-Datetime
See Alam, “CDXJ: An Object Resource Stream Serialization Format”, 2015
Figure 26, Chapter 2
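The correspondence between the two formats can be sketched as a small conversion in Python. This is an illustration under simplifying assumptions, not MemGator's implementation: the regular expressions handle only entries shaped like the example (a URI in angle brackets followed by quoted parameters), and `link_to_cdxj` is a hypothetical name.

```python
import json
import re
from email.utils import parsedate_to_datetime

# One Link entry: <URI>; followed by parameters up to the next "<".
LINK_ENTRY = re.compile(r"<(?P<uri>[^>]+)>\s*;\s*(?P<params>[^<]*)")
PARAM = re.compile(r'(\w+)="([^"]*)"')


def link_to_cdxj(link_timemap):
    """Convert memento entries of a Link TimeMap into CDXJ lines."""
    lines = []
    for match in LINK_ENTRY.finditer(link_timemap):
        params = dict(PARAM.findall(match.group("params")))
        relations = params.get("rel", "").split()
        # Skip entries that are not mementos (original, timegate, timemap).
        if "memento" not in relations or "datetime" not in params:
            continue
        when = parsedate_to_datetime(params["datetime"])
        key = when.strftime("%Y%m%d%H%M%S")  # 14-digit sort key
        record = {
            "uri": match.group("uri"),
            "rel": params["rel"],
            "datetime": params["datetime"],
        }
        lines.append(f"{key} {json.dumps(record)}")
    return lines


SAMPLE = (
    '<http://facebook.com/>; rel="original",\n'
    "<http://localhost:8080/20101116060516/http://facebook.com/>; "
    'rel="memento"; datetime="Tue, 16 Nov 2010 06:05:16 GMT",\n'
)
cdxj_lines = link_to_cdxj(SAMPLE)
print(cdxj_lines[0])
```

Because each CDXJ line starts with the 14-digit datetime, a plain lexical sort of the lines also orders the mementos chronologically.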
Mink: Visual User Interaction with Aggregators
● Browser extension we developed*
● Bridges gap between live and archived Webs
● Leverages Memento aggregator’s capability, returns TimeMaps
● Indicates # of captures for a URI while you browse
● Provides navigation of mementos while browsing live Web
● Single-click submission of URI-R to multiple Web archives
* Kelly et al., “Mink: Integrating the Live and Archived Web Viewing Experience Using Web Browsers and Memento”, JCDL 2014
Figure 36, Chapter 4
Outline
● Introduction/Motivation
● Background
● Related Work
● The Mementity Framework
● Applying Framework to Scenarios
● Contributions, Conclusions, and Future Work
Related Work
Memento Aggregation (RQ 5):
● Bornand et al. (JCDL 2016) - routing using classifier
● Nelson and Alam (JCDL 2016) - routing using probability
● Rosenthal et al. (DLib 2015) - aggregation w/ restricted access
● AlSum et al. (IJDL 2014) - routing URI lookups
● Brunelle and Nelson (JCDL 2016) - TimeMap caching
● Alam et al. (IJDL 2016) - efficient querying through profiling
● Rosenthal (2013) - aggregator scalability
Prefer and Web Archives (RQ 5):
● Jones et al. (2016) - introduced “raw” mementos
● Van de Sompel et al. (2016) - Prefer usage for raw
● Rosenthal (2016) - other Prefer usage in web archives
Privacy in Web Archives (RQ 3-6):
● Marshall and Shipman (JCDL 2014) - Preserving FB, social media content
● Lindley et al. (WWW 2013) - Users archiving social media
Access Control in Web Archives (RQ 4-6):
● Webber (2018) - HTTP 451 @ UKWA
● Rao et al. (SIGOPS 2001) iProxy - access control in web archives
Collaboration in Web Archives (RQ 5):
● Potts (UXPA 2015) - creating collections with tags
● Stine (2018) Cobweb - collection building between institutions at Archive-It
● Phillips (2003) PANDORA dynamics
● Niu (DLib 2012) - personalized features at PANDORA
Public Web Archiving (RQ 2):
● Brunelle et al. (JCDL 2017) - JS & deferred representations
● Ben-David and Huurdeman (Alexandria 2014) - archives retrieval beyond URI
● AlNoamany et al. (JCDL 2013) Access Patterns
Private Web Archiving (RQ 1-4):
● Brunelle et al. (DLib 2016) - archiving corporate intranet
● Marshall and Shipman (JCDL 2012) - Institutions archiving social media
Outline
● Introduction/Motivation
● Background
● Related Work
● The Mementity Framework
○ Archival Negotiation Beyond Time
○ Query Precedence & Short-Circuiting
○ Mementities
● Applying Framework to Scenarios
● Contributions, Conclusions, and Future Work
More Expressive TimeMaps
● Memento Quality (e.g., Damage)¹
● How Many Captures?²
● How Many Are Identical?²,³
● Other Attributes of Mementos...
¹ Brunelle, Kelly et al., JCDL 2014, IJDL 2015
² Kelly et al., JCDL 2017
³ AlSum and Nelson, ECIR 2014
Figure 7, Chapter 1
Section 6.2
Additional TimeMap Attributes
Content-based Attributes
Derived Attributes
Access Attributes
TimeMap Enrichment: Content-Based Attributes
● Status Code¹
● Content-Digest
○ In WARC & CDX
○ Not all archives expose CDX
● Would allow more info about mementos without requiring comprehensive dereferencing
¹ Kelly et al., “Impact of URI Canonicalization on Memento Count”, JCDL 2017, arXiv 1703.03302
Section 6.2.1
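As an illustration of why a content digest in the TimeMap is useful, the Python sketch below computes a digest for each capture and groups captures with identical payloads, so duplicates can be spotted without comparing the payloads themselves. It assumes the common CDX convention of a Base32-encoded SHA-1 prefixed with `sha1:`; the sample captures and the function names are made up for the example.

```python
import base64
import hashlib
from collections import defaultdict


def payload_digest(payload):
    """Return a 'sha1:'-prefixed Base32 digest of the payload bytes."""
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def group_identical(captures):
    """Group capture timestamps by the digest of their payload."""
    groups = defaultdict(list)
    for timestamp, payload in captures:
        groups[payload_digest(payload)].append(timestamp)
    return dict(groups)


# Hypothetical captures: the first two have byte-identical payloads.
captures = [
    ("20190501000000", b"<html>v1</html>"),
    ("20190502000000", b"<html>v1</html>"),
    ("20190503000000", b"<html>v2</html>"),
]
groups = group_identical(captures)
for digest, timestamps in groups.items():
    print(digest, timestamps)
```

With the digest already present in a TimeMap entry, a client could perform the same grouping from the TimeMap alone, without dereferencing every memento.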
TimeMap Enrichment: Derived Attributes
● Thumbnails (e.g., via SimHash)¹
○ Calculation based on root memento’s HTML
● Memento Damage (JCDL 2014, IJDL)²
○ Requires dereferencing embedded resources
apple.com, many duplicate mementos!
¹ AlSum and Nelson, Thumbnail Summarization Techniques for Web Archives, ECIR 2014, pp. 299-310.
² Brunelle, Kelly et al., “The Impact of JavaScript on Archivability,” IJDL, 17(2), pp. 95-117. January 2016.
Section 6.2.2
TimeMap Enrichment: Access Attributes
How to distinguish private captures from public captures in a TimeMap?
Diagram: the TimeMap at URI-T links to the first through last mementos (URI-M1 ... URI-Mm, URI-Mm+1 ... URI-Mn), to further paginated TimeMaps (URI-T1 ... URI-Tt), to the TimeGate (URI-G), and to the original resource (URI-R).
RQ5: How can Memento aggregators
indicate that private Web archive content
requires special handling to be replayed,
despite being aggregated with publicly
available Web archive content?
Section 6.2.3
TimeMap Enrichment - in a CDXJ TimeMap
74
20101116060516 {
"uri": "http://localhost:8080/20101116060516/http://facebook.com/",
"rel": "memento",
"datetime": "Tue, 16 Nov 2010 06:05:16 GMT",
"status_code": 200,
"digest": "sha1:LK26DRRQJ4WATC6LBVF3B3Z4P2CP5ZZ7",
"damage": 0.24,
"simhash": "6551110622422153488",
"content-language": "en-US",
"access": {
"type": "Blake2b",
"token": "c6ed419e74907d220c69858614d86...ef0a3a88a41"
}
}
Line breaks added for clarity; each CDXJ record occupies a single line
Content-based
attributes
Derived
Attributes
Access
Attributes
Figure 64, Chapter 6
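A minimal sketch of how a client might read the enriched record above, assuming the record is a datetime key followed by a space and a JSON object. The helper name `parse_cdxj_line` is hypothetical, and treating the presence of an `access` block as the marker of a private capture is an assumption made for illustration.

```python
import json

def parse_cdxj_line(line):
    """Split one CDXJ record into its leading datetime key and JSON attributes."""
    key, _, payload = line.strip().partition(" ")
    return key, json.loads(payload)

# One record on a single line, as CDXJ requires; values copied from the slide.
record = (
    '20101116060516 {"uri": "http://localhost:8080/20101116060516/http://facebook.com/", '
    '"rel": "memento", "datetime": "Tue, 16 Nov 2010 06:05:16 GMT", '
    '"status_code": 200, "damage": 0.24, "simhash": "6551110622422153488", '
    '"access": {"type": "Blake2b", "token": "c6ed419e74907d220c69858614d86...ef0a3a88a41"}}'
)

key, attrs = parse_cdxj_line(record)
# Assumption: a record that carries an "access" block came from a private archive.
needs_token = "access" in attrs
```

Because the attributes travel with each record, a client can decide how to handle a memento without dereferencing it first.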
TimeMap Enrichment with Additional Attributes: “StarMap”
75
Outline
● Introduction/Motivation
● Background
● Related Work
● The Mementity Framework
○ Archival Negotiation Beyond Time
○ Query Precedence & Short-Circuiting
○ Mementities
● Applying Framework to Scenarios
● Contributions, Conclusions, and Future Work
76
Query Precedence
77
“Check my archive first, then Carol’s, then all
public archives.”
1
2
3 3
● More control of querying in series and parallel
See Atkins, “Paywalls in the Internet Archive”, March 2018
Precedence in Archival Specifications
● MemGator’s internal archival specification
N1: [{A0},{A1},{A2}]
N2: {{A0},{A1},{A2}}
N3: {{A0},{A4,A2,A5},{A3}}
N4: {"id0":{A0},"id1":{A4},"id2":{A3}}
N5: [{"id0"∶{A0}},{"id1"∶{A1}},{"id2"∶ {A2}}]
N6: [{"id0"∶{A0}},
{"id1a"∶{A4},"id1b"∶{A2},"id1c"∶{A5}},
{"id2"∶{A3}}]
● Despite JSON array, archives queried in “pseudo-parallel” for efficiency
78
1
1
1
Precedence in Archival Specifications
● MemGator’s internal archival specification
N1: [{A0},{A1},{A2}]
N2: {{A0},{A1},{A2}}
N3: {{A0},{A4,A2,A5},{A3}}
N4: {"id0":{A0},"id1":{A4},"id2":{A3}}
N5: [{"id0"∶{A0}},{"id1"∶{A1}},{"id2"∶{A2}}]
N6: [{"id0"∶{A0}},
{"id1a"∶{A4},"id1b"∶{A2},"id1c"∶{A5}},
{"id2"∶{A3}}]
● N2 more descriptive of behavior despite notation interpretation
● JSON object members require identifiers
79
1
1
1
● MemGator’s internal archival specification
N1: [{A0},{A1},{A2}]
N2: {{A0},{A1},{A2}}
N3: {{A0},{A4,A2,A5},{A3}}
N4: {"id0":{A0},"id1":{A4},"id2":{A3}}
N5: [{"id0"∶{A0}},{"id1"∶{A1}},{"id2"∶{A2}}]
N6: [{"id0"∶{A0}},
{"id1a"∶{A4},"id1b"∶{A2},"id1c"∶{A5}},
{"id2"∶{A3}}]
● Can introduce “sets” of URIs to archive, but still query in pseudo-parallel
● JSON object members require identifiers
Precedence in Archival Specifications
80
1
1
1
A4
A5
1
1
● MemGator’s internal archival specification
N1: [{A0},{A1},{A2}]
N2: {{A0},{A1},{A2}}
N3: {{A0},{A4,A2,A5},{A3}}
N4: {"id0":{A0},"id1":{A4},"id2":{A3}}
N5: [{"id0"∶{A0}},{"id1"∶{A1}},{"id2"∶{A2}}]
N6: [{"id0"∶{A0}},
{"id1a"∶{A4},"id1b"∶{A2},"id1c"∶{A5}},
{"id2"∶{A3}}]
● Add identifiers for JSON compliance
● Still pseudo-parallel
Precedence in Archival Specifications
81
1
1
1
id0
id1
id2
● MemGator’s internal archival specification
N1: [{A0},{A1},{A2}]
N2: {{A0},{A1},{A2}}
N3: {{A0},{A4,A2,A5},{A3}}
N4: {"id0":{A0},"id1":{A4},"id2":{A3}}
N5: [{"id0"∶{A0}},{"id1"∶{A1}},{"id2"∶{A2}}]
N6: [{"id0"∶{A0}},
{"id1a"∶{A4},"id1b"∶{A2},"id1c"∶{A5}},
{"id2"∶{A3}}]
● Query archives in-series
Precedence in Archival Specifications
82
3
2
1
id0
id1
id2
● MemGator’s internal archival specification
N1: [{A0},{A1},{A2}]
N2: {{A0},{A1},{A2}}
N3: {{A0},{A4,A2,A5},{A3}}
N4: {"id0":{A0},"id1":{A4},"id2":{A3}}
N5: [{"id0"∶{A0}},{"id1"∶{A1}},{"id2"∶{A2}}]
N6: [{"id0"∶{A0}},
{"id1a"∶{A4},"id1b"∶{A2},"id1c"∶{A5}},
{"id2"∶{A3}}]
● Query archival sets in-series
Precedence in Archival Specifications
83
3
2
1
A4
A5
2
2
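The in-series versus pseudo-parallel reading of the specification can be sketched as a small recursive interpreter. This is an illustration under stated assumptions, not MemGator's code: archives are represented by plain strings, JSON arrays are walked in order, JSON object members are dispatched concurrently, and `query` and `fetch` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def query(spec, fetch):
    """Collect mementos for a nested specification.

    str  -> a single archive, queried with fetch()
    list -> members queried in-series, each finishing before the next starts
    dict -> members queried concurrently (the "pseudo-parallel" case)
    """
    if isinstance(spec, str):
        return fetch(spec)
    if isinstance(spec, list):
        results = []
        for member in spec:
            results.extend(query(member, fetch))
        return results
    if isinstance(spec, dict):
        with ThreadPoolExecutor() as pool:
            parts = list(pool.map(lambda member: query(member, fetch), spec.values()))
        return [memento for part in parts for memento in part]
    raise TypeError("unsupported specification node: %r" % (spec,))

# N6 from the slide, with string stand-ins for the archive objects A0..A5.
n6 = [{"id0": "A0"}, {"id1a": "A4", "id1b": "A2", "id1c": "A5"}, {"id2": "A3"}]

order = []  # records when each archive was actually contacted

def fetch(archive):
    order.append(archive)
    return [archive + "-m"]

mementos = query(n6, fetch)
```

With this reading, A0 is always contacted first and A3 last, while A4, A2, and A5 may be contacted in any order relative to one another.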
Precedence Using Prefer
[{"id0"∶{A0}},
{"id1a"∶{A4},"id1b"∶{A2},"id1c"∶{A5}},
{"id2"∶{A3}}]
ENCODES TO
Prefer: archives="data:application/json;charset=utf-8;base64,Ww0KICB7...NCn0="
(no aggregator prior to this dissertation supports this)
84
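A sketch of the encoding step, assuming the specification is serialized as JSON and wrapped in a base64 data URI as on the slide. The function name `archives_preference` and the string stand-ins for archives are hypothetical, and the exact base64 text depends on how the JSON is formatted, so it will not match the slide's value byte for byte.

```python
import base64
import json

def archives_preference(spec):
    """Wrap an archival specification in the archives="data:..." preference."""
    payload = json.dumps(spec).encode("utf-8")
    encoded = base64.b64encode(payload).decode("ascii")
    return 'archives="data:application/json;charset=utf-8;base64,' + encoded + '"'

spec = [{"id0": "A0"}, {"id1a": "A4", "id1b": "A2", "id1c": "A5"}, {"id2": "A3"}]
prefer_value = archives_preference(spec)  # value to send in a Prefer header
```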
Query Precedence
87
“Check my archive first, then Carol’s, then all
public archives.”
1
2
3 3
● More control of querying in series and parallel
See Atkins, “Paywalls in the Internet Archive”, March 2018
Query Short-Circuiting
89
“Check private archives first. Iff you find no
captures, only then check the public archives.”
1
1
2 2
● May give priority to archive preference.
● Series halt when threshold met.
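Short-circuiting can be sketched as a loop over stages that stops once enough captures are found. The threshold semantics here (a count of mementos, checked after each stage) are an assumption for illustration, and the archive names are hypothetical.

```python
def query_with_short_circuit(stages, fetch, threshold=1):
    """Query stages in order; stop once at least `threshold` mementos are found."""
    found = []
    for stage in stages:
        for archive in stage:
            found.extend(fetch(archive))
        if len(found) >= threshold:
            break  # later stages (e.g., public archives) are never contacted
    return found

holdings = {"alice-private": ["m1"], "carol-private": [], "public-a": ["m2", "m3"]}
stages = [["alice-private", "carol-private"], ["public-a"]]

queried = []

def fetch(archive):
    queried.append(archive)
    return holdings[archive]

result = query_with_short_circuit(stages, fetch)
```

In this run the private stage yields a capture, so `public-a` is not queried at all; if the private stage had returned nothing, the loop would fall through to the public stage.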
Outline
● Introduction/Motivation
● Background
● Related Work
● The Mementity Framework
○ Archival Negotiation Beyond Time
○ Query Precedence & Short-Circuiting
○ Mementities
● Applying Framework to Scenarios
● Contributions, Conclusions, and Future Work
90
Mementities
● Memento + Entity (entity term already overused)
91
[Figure: mementities introduced in this framework alongside conventional Memento entities such as the TimeGate]
92
MEMENTITY FRAMEWORK
Mementities
Section 7.1.1
Memento Meta-Aggregator (MMA)
93
functional
⊇
Figure 80, Chapter 7
Section 7.3.3
MMA: Archive Selection
94
Figure 55, Chapter 6
Section 6.1
MMA: User-Driven Archival Specification
95
✓ ✓
Section 6.1
MMAα:
from MA2, MA1 and WA6
MMAβ:
from WA7 and WA8
MMAγ:
from MMAβ, MA5, and WA1
96
MMA Aggregation sources
Figure 67, Chapter 7
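The chaining of aggregation sources can be sketched with two small classes that share a `timemap` method, so a meta-aggregator can list another meta-aggregator as a source. The class names, capture dates, and URI-M layout are hypothetical, and MA5 is modelled as a plain archive stand-in for brevity.

```python
class Archive:
    """A Web archive holding captures as 14-digit datetime strings per URI-R."""
    def __init__(self, name, captures):
        self.name = name
        self.captures = captures

    def timemap(self, uri_r):
        return [(dt, f"{self.name}/{dt}/{uri_r}") for dt in self.captures.get(uri_r, [])]

class MetaAggregator:
    """Merges the TimeMaps of its sources; a source may itself be an aggregator."""
    def __init__(self, name, sources):
        self.name = name
        self.sources = sources

    def timemap(self, uri_r):
        merged = [entry for source in self.sources for entry in source.timemap(uri_r)]
        return sorted(merged)  # temporal order, since the datetime leads each tuple

uri = "http://cnn.com/"
wa7 = Archive("WA7", {uri: ["20190101000000"]})
wa8 = Archive("WA8", {uri: ["20170101000000"]})
wa1 = Archive("WA1", {uri: ["20180101000000"]})
ma5 = Archive("MA5", {uri: ["20160101000000"]})  # stand-in for a Memento aggregator

mma_beta = MetaAggregator("MMAβ", [wa7, wa8])
mma_gamma = MetaAggregator("MMAγ", [mma_beta, ma5, wa1])
timemap = mma_gamma.timemap(uri)
```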
Costs of more expressive TimeMaps (StarMaps)
and Link header enrichment
● Computational
○ Mostly server-side, potential to further leverage client
● Temporal
○ Required on variant generation
● Spatial
○ Permutation variant storage
● Access
○ Variant negotiation
97
Temporal Costs: Conventional Aggregation
T = t_{req} + t_{MA} + t_{AGG} + t_{resp}
t_{MA} = \max_i RTT(A_i, URI-R)
t_{req}: Client→Aggregator request
t_{MA}: Aggregator⇄∀archives communication
t_{AGG}: Aggregator combining, sorting, etc.
t_{resp}: Aggregator→Client response
Equation 9, Chapter 8
Equation 10, Chapter 8
98
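As a worked instance of the cost model, the sketch below evaluates the total for made-up timings in milliseconds. Because archives are queried concurrently, only the slowest archive's round trip contributes. The function name and all numbers are hypothetical.

```python
def conventional_aggregation_time(t_req, archive_rtts, t_agg, t_resp):
    """Total time: request + slowest archive round trip + aggregation + response."""
    t_ma = max(archive_rtts)  # archives are contacted concurrently
    return t_req + t_ma + t_agg + t_resp

# Hypothetical timings in milliseconds for three archives.
total_ms = conventional_aggregation_time(50, [400, 1200, 700], 100, 50)
```

Here the 1200 ms archive dominates, so speeding up the other two archives would not change the total.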
Impact of MemGator→MMA
● The MMA’s capabilities beyond MemGator add temporal cost:
○ t_{prefer} → time and computation required to modify the base set of archives and associate it
with the client’s session
○ t_{enrich} → time required for external services to parse/calculate additional attributes
per URI-M
○ t_{filter} → time required to filter URI-Ms based on the client’s preference (potentially 0)
Section 8.2
99
● Personal Archive Aggregation
● MMA Chaining
● Client-Side Aggregation Preference
MMA Dynamics By-Example
100
ALICE
BOB
CAROL
MALCOLM
Personal Archive Aggregation: Alice Saves the Web
101
Personal Archive Aggregation
Alice Wants to See Her Captures Temporally Inline
102
at tA
at tD→Z
Section 7.3.5
Personal Archive Aggregation
Mementity Dynamics - Alice & Her Archives (WAA)
103
Section 7.3.5
Personal Archive Aggregation
→{ }
Alice Deploys MMAA
104
Section 7.3.5
Carol Asks MMAA for CNN
105
→{ }
MMA Chaining
Section 7.3.5
MMAA Returns CNN Memento {MA, MIA}
106
→{ }→{ }
MMA Chaining
Section 7.3.5
Carol Wants to Aggregate Her Own Captures
107
→{ }
at
tC
at tD→Z,A
→{ }
(M(WAC))
MMA Chaining
Section 7.3.5
Carol Creates MMAC to Access WAC and MMAA
108
→{ }→{ }
→{ }
MMA Chaining
Section 7.3.5
Carol Asks MMAC For CNN
109
→{ }→{ }
→{ }
MMA Chaining
Section 7.3.5
MMAA Returns CNN Memento {MA, MIA,MC}
110
→{ }→{ }
→{ }
Section 7.3.5
Client-Side Aggregation Preference
Bob May Request M(CNN) From MMAA or
MMAC
111
→{ }→{ }
→{ }
Section 6.1
Client-Side Aggregation Preference
Bob Prefers to Exclude IA Captures
112
✓ ✓
...and does not want to set up his own MMA
Section 6.1
GET /archives/
Bob Requests Supported Archives
113
→{ }
Section 6.1
Client-Side Aggregation Preference
Bob Customizes the Set in the JSON
114
→{ }
✓ ✓
Section 6.1
Client-Side Aggregation Preference
Bob Requests CNN for His Custom Set
115
→{ }
( )
Section 6.1
Client-Side Aggregation Preference
MMA Complies or Ignores Preference
116
→{ }
→{ }
✓
Section 6.1
Client-Side Aggregation Preference
Hooray, Aggregation!
117
!
118
MEMENTITY FRAMEWORK
Mementities
Section 7.1.2
Hooray, Aggregation!
119
HTTP 401
Consult URI-P
✓
RQ4: How can content that was captured behind authentication signal to
Web archive replay systems that it requires special handling?
Section 7.3.4
Private Web Archive Adapter (PWAA)
● Auth layer to encourage private
Web archive aggregation
● Typical OAuth 2.0 flow
● Auth role cohesive to PWAA
● Persistent access through tokenization
120
RQ6: What kinds of access control do users who create
private Web archives need to regulate access to their
archives?
Figure 63, Chapter 6
PWAA - Sharing Tokens
121
RQ6: What kinds of access control do users
who create private Web archives need to
regulate access to their archives?
Figure 81a, Chapter 7
Section 7.3.4
PWAA - Previously Authorized
122
Figure 81a, Chapter 7
Section 7.3.4
PWAA - Unauthorized Request
123
Figure 81a, Chapter 7
Section 7.3.4
PWAA - Sharing Tokens
124
RQ6: What kinds of access control do users
who create private Web archives need to
regulate access to their archives?
Figure 81b, Chapter 7
Section 7.3.4
Alice Passes Associative Token to MMA
125
RQ5: How can Memento aggregators indicate that private Web archive
content requires special handling to be replayed, despite being aggregated
with publicly available Web archive content?
Figure 84a, Chapter 7
Section 7.3.4
MMA requests URI-R...
126
...relays token where applicable
Figure 84b, Chapter 7
Section 7.3.4
RQ5: How can Memento aggregators indicate that private Web archive
content requires special handling to be replayed, despite being aggregated
with publicly available Web archive content?
Private Archive Validates with PWAA
127
Figure 84c, Chapter 7
Section 7.3.4
RQ5: How can Memento aggregators indicate that private Web archive
content requires special handling to be replayed, despite being aggregated
with publicly available Web archive content?
PWAA Confirms Token
128
Figure 84d, Chapter 7
Section 7.3.4
RQ5: How can Memento aggregators indicate that private Web archive
content requires special handling to be replayed, despite being aggregated
with publicly available Web archive content?
Private Archive Returns Captures
129
Figure 84e, Chapter 7
Section 7.3.4
RQ5: How can Memento aggregators indicate that private Web archive
content requires special handling to be replayed, despite being aggregated
with publicly available Web archive content?
MMA Aggregates, Associates Token
130
Figure 84f, Chapter 7
Section 7.3.4
RQ5: How can Memento aggregators indicate that private Web archive
content requires special handling to be replayed, despite being aggregated
with publicly available Web archive content?
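The token exchange walked through above (issue, relay, validate, confirm) can be sketched with a toy adapter. This is not the dissertation's scheme: deriving tokens with a keyed BLAKE2b hash is an assumption suggested only by the "Blake2b" access type shown in the CDXJ example, and the class and method names are hypothetical.

```python
import hashlib
import hmac

class PrivateWebArchiveAdapter:
    """Toy PWAA: issues, validates, and revokes access tokens for an archive."""
    def __init__(self, secret):
        self._secret = secret      # bytes, at most 64 for blake2b's key parameter
        self._revoked = set()

    def issue_token(self, user, archive):
        message = f"{user}|{archive}".encode("utf-8")
        return hashlib.blake2b(message, key=self._secret, digest_size=32).hexdigest()

    def validate(self, user, archive, token):
        """What a private archive would ask the PWAA before returning captures."""
        if token in self._revoked:
            return False
        return hmac.compare_digest(self.issue_token(user, archive), token)

    def revoke(self, token):
        self._revoked.add(token)

pwaa = PrivateWebArchiveAdapter(secret=b"demo-secret")
token = pwaa.issue_token("alice", "WA_A")            # Alice passes this to the MMA
accepted = pwaa.validate("alice", "WA_A", token)     # archive checks the relayed token
rejected = pwaa.validate("malcolm", "WA_A", token)   # token does not match this user
```

Keeping validation in the adapter means the archive and the aggregator never need to share the secret; they only relay the token.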
MMA Aggregates, Associates Token
131
Section 7.3.4
We still need a way to
distinguish private archives
from public archives
when querying an aggregator
132
MEMENTITY FRAMEWORK
Mementities
StarGate
● Content negotiation in Web archives beyond time
● “Star” ~ wildcard (*) → any dimension of negotiation
● Allow for queries like: Only show me mementos…
○ That are not redirects (content-based attribute: HTTP status ≠ 3XX)
○ That are of sufficient quality (derived attribute: Memento Damage < 0.4)
○ That are from personal Web archives (access attribute: indicates that Alice’s
facebook.com mementos are not login pages)
133
TimeGate ⊆ StarGate (functional)
Section 7.1.3
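The three example queries can be sketched as a filter over StarMap entries. The attribute names follow the CDXJ example (`status_code`, `damage`, `access`); the function name, the sample entries, and the handling of missing attributes (treated as non-redirect and undamaged) are assumptions.

```python
def starmap_filter(entries, max_damage=None, exclude_redirects=False, private_only=False):
    """Keep only the StarMap entries that satisfy the requested dimensions."""
    kept = []
    for entry in entries:
        if exclude_redirects and 300 <= entry.get("status_code", 200) < 400:
            continue  # content-based attribute
        if max_damage is not None and entry.get("damage", 0.0) >= max_damage:
            continue  # derived attribute
        if private_only and "access" not in entry:
            continue  # access attribute
        kept.append(entry)
    return kept

entries = [
    {"uri": "m1", "status_code": 200, "damage": 0.24, "access": {"type": "Blake2b"}},
    {"uri": "m2", "status_code": 302, "damage": 0.10},
    {"uri": "m3", "status_code": 200, "damage": 0.55},
]

usable = starmap_filter(entries, max_damage=0.4, exclude_redirects=True)
personal = starmap_filter(entries, private_only=True)
```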
Implicit Filtering via MMA or Directly
134
→{ }
→{ }
✓
filtering
filtering
Section 7.2.1
Negotiation in the Privacy Dimension
(via short circuiting)
135
Get URI-Ms for URI-R only from
personal Web archives
privateOnly
1
4
2
3
ACCESS ATTRIBUTE
Figure 85, Chapter 7
Negotiation on Content-Based or Derived
Attributes
(with response filtering) Retrieved StarMap
136
Get URI-Ms for URI-R of good
quality that are unique
MD < 0.25, unique(simhash)
Abbreviated StarMap
with filtering applied
1
1
2
3
DERIVED ATTRIBUTE
CONTENT-BASED
ATTRIBUTE
&
Figure 86, Chapter 7
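The "MD < 0.25, unique(simhash)" request can be sketched as a filter followed by de-duplication that keeps the earliest entry per SimHash. Exact string equality is a simplification chosen here; SimHash values are often compared by Hamming distance instead. The function name and sample values are hypothetical.

```python
def good_and_unique(entries, max_damage=0.25):
    """Drop damaged mementos, then keep the first memento for each SimHash."""
    seen = set()
    kept = []
    for entry in entries:
        if entry["damage"] >= max_damage:
            continue
        if entry["simhash"] in seen:
            continue  # identical to an earlier capture
        seen.add(entry["simhash"])
        kept.append(entry)
    return kept

entries = [
    {"uri": "m1", "damage": 0.10, "simhash": "111"},
    {"uri": "m2", "damage": 0.12, "simhash": "111"},
    {"uri": "m3", "damage": 0.30, "simhash": "222"},
    {"uri": "m4", "damage": 0.05, "simhash": "333"},
]

abbreviated = good_and_unique(entries)
```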
Outline
● Introduction/Motivation
● Background
● Related Work
● The Mementity Framework
● Applying Framework to Scenarios
○ Reconsidering Framework Scenarios
○ Framework Implementation
● Contributions, Conclusions, and Future Work
137
“It was there yesterday, where did it go?”
● We created browser-based tools to mitigate live Web
ephemerality
○ e.g., WARCreate1 (Section 4.2)
● ...and personal Web archival replay systems to re-experience the
captures and facilitate collaboration and propagation
○ Web Archiving Integration Layer (WAIL)2,3, Section 4.3
○ InterPlanetary Wayback (ipwb)4,5, Section 4.5
139
1 Kelly and Weigle, “WARCreate - Create Wayback-Consumable WARC Files from Any Webpage”, JCDL 2012
2 Kelly et al., “Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving”, PDA 2013
3 Berlin, Kelly et al., “WAIL: Collection-Based Personal Web Archiving”, JCDL 2017
4 Kelly et al., “InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives”, TPDL 2016
5 Alam, Kelly et al., “InterPlanetary Wayback: The Permanent Web Archive”, JCDL 2016
“Save this, but only for me”
● Our personal Web archiving tools (e.g., previous slide) help with personal Web
archiving of potentially private live Web content.
● The Private Web Archive Adapter (Section 7.1.2) in the Mementity Framework
allows for systematic access regulation
HTTP 401
Consult URI-P
✓
140
“I want to share this but control who can see it.”
● Access regulation: PWAA (Sections 7.1.2, 7.2.2)
● Collaboration and propagation of captures: ipwb1,2 (Section 4.5)
● Symmetric encryption for privacy: ipwb
● Negotiation in dimension of privacy / access attributes3 (Section 6.2.3)
Figure 75, Chapter 7
141
1 Kelly et al., “InterPlanetary Wayback: Peer-To-Peer
Permanence of Web Archives”, TPDL 2016
2 Alam, Kelly et al., “InterPlanetary Wayback: The Permanent
Web Archive”, JCDL 2016
3 Kelly et al., “A Framework for Aggregating Private and
Public Web Archives”, JCDL 2018
Outline
● Introduction/Motivation
● Background
● Related Work
● The Mementity Framework
● Applying Framework to Scenarios
○ Reconsidering Framework Scenarios
○ Framework Implementation
● Contributions, Conclusions, and Future Work
142
Reference Implementation
New Software Packages
Adapt to Framework with Additional Capabilities
143
Adaptations for MMA
MMA deliverables (MMADi)
1. Query precedence
2. Short-circuiting
3. Client-side Archival
Specification
4. ⇄ PWAA for access control
5. ⇄ StarGate for negotiation in
dimensions beyond time
Section 8.3.1
144
MMAD1: Query Precedence in MemGator
1. Adjusted the base reference specification to reflect MemGator’s intended pseudo-parallel
querying model (JSON array → JSON objects).
○ Figure 90, Chapter 8
2. Implemented MemGator interpretation of JSON arrays to query in-series, waiting
for the previous array entry to finish before proceeding.
○ Figure 87, Chapter 8
3. Implement interpretation of JSON “sets” of archives to recursively allow
querying a subset in-series.
○ Figure 91, Chapter 8
Section 8.3.1
145
MMAD2: Short-Circuiting in MemGator
MemGator has pre-existing functionality to short-circuit on archival count,
based on predefined order in server-set archival specification.
We leverage this existing conditional in MemGator, modifying it to call a function
at the same point in the code so that iteration can break early.
Section 8.3.1
146
Figure 91, Chapter 8
MMAD3: Client-Side Archival Specification
1. Associate archives list with session, pre-populate with default set.
2. Read Prefer header on request received. Overwrite archives set and associate
with session.
3. When querying archives, refer to this set (passed along by-reference) instead of
a global set defined on startup and read from a local config.
Section 8.3.1
147
Figure 89, Chapter 8
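The three steps above can be sketched on the server side as follows. This is an illustration rather than MemGator's implementation: the session is a plain dictionary, the default archive list is made up, and the regular expression assumes the `archives` preference uses the base64 data URI form shown earlier in the deck.

```python
import base64
import json
import re

DEFAULT_ARCHIVES = ["public-a", "public-b"]  # hypothetical startup configuration

ARCHIVES_PREFERENCE = re.compile(
    r'archives="data:application/json;charset=utf-8;base64,([A-Za-z0-9+/=]+)"'
)

def archives_for_session(session, prefer_header=None):
    """Return the archive set for this session, honoring an archives preference."""
    # Step 1: associate a list with the session, pre-populated with the defaults.
    session.setdefault("archives", list(DEFAULT_ARCHIVES))
    # Step 2: if the Prefer header carries a specification, overwrite the set.
    match = ARCHIVES_PREFERENCE.search(prefer_header or "")
    if match:
        session["archives"] = json.loads(base64.b64decode(match.group(1)))
    # Step 3: callers query this per-session set instead of a global one.
    return session["archives"]

custom = [{"id0": "alice-private"}, {"id1": "public-a"}]
encoded = base64.b64encode(json.dumps(custom).encode("utf-8")).decode("ascii")
header = 'archives="data:application/json;charset=utf-8;base64,' + encoded + '"'

session = {}
before = list(archives_for_session(session))
after = archives_for_session(session, header)
```

Once a preference has been stored, later requests in the same session that omit the header keep using the customized set.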
PWAA Implementation
● We created a web-based GUI to
assist in initialization
● Handles interaction from client to
private Web archive and requests
from aggregators (MMAD4)
Section 8.3.2
148
Preference from Client’s Perspective
149
HTTP Response
HTTP Request
GET /tm/cdxj/https://cnn.com HTTP/1.1
Host: mma.example.com
Prefer: damage="<0.4"
Prefer: archives="data:application/json;charset=utf-8;base64,Ww0KICB7...NCn0="
Prefer: publicOnly
HTTP/1.1 200 OK
Content-Length: 1207
Content-Type: application/cdxj+ors
Preference-Applied: damage="<0.4"
CDXJ
TM
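From the client's side, the exchange above can be sketched with the standard library. RFC 7240 permits several preferences in one comma-separated Prefer header, which is the form used here. The `http` scheme, the helper name, and the substring comparison against Preference-Applied are assumptions; the `archives` preference is left out because its value is elided on the slide, and no request is actually sent.

```python
import urllib.request

requested = ['damage="<0.4"', "publicOnly"]

# Build (but do not send) the TimeMap request with a combined Prefer header.
request = urllib.request.Request(
    "http://mma.example.com/tm/cdxj/https://cnn.com",
    headers={"Prefer": ", ".join(requested)},
)

def unapplied_preferences(requested, preference_applied):
    """List requested preferences that the server did not report as applied."""
    return [pref for pref in requested if pref not in preference_applied]

# The response on the slide reports only the damage preference as applied.
ignored = unapplied_preferences(requested, 'damage="<0.4"')
```

Checking Preference-Applied lets the client tell a filtered TimeMap apart from one where the aggregator ignored a preference.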
StarGate Implementation
Section 8.3.3
150
HTTP Request
> GET /https://cnn.com HTTP/1.1
> Host: mma.alice.org
> Prefer: damage="<0.4"
>
Section 8.3.3
151
> GET /timemap/https://cnn.com
> Host: web.archive.org
>
> GET /tm/https://cnn.com
> Host: webarchive.org.uk/tm/
>
> GET /timemap/https://cnn.com
> Host: archive.md
>
StarGate Implementation
Section 8.3.3
152
HTTP/1.1 200 OK
Content-Type:application/link-format
(TimeMap)
HTTP/1.1 200 OK
Content-Type:application/link-format
(TimeMap)
HTTP/1.1 200 OK
Content-Type:application/link-format
(TimeMap)
StarGate Implementation
Section 8.3.3
153
HTTP Request
POST / HTTP/1.1
Host: stargate.alice.org
Content-Length: 1500
Content-Type: application/cdxj+ors
Prefer: damage="<0.4"
CDXJ
TM
StarGate Implementation
Section 8.3.3
154
HTTP Request
GET /api/{URI-M}
Host: memento-damage.cs.odu.edu
Content-Type: application/cdxj+ors
StarGate Implementation
For each URI-M in TimeMap
Long running processes can be
executed asynchronously and use
the “respond-async” syntax per
RFC7240 §4.1
(not shown here for simplicity)
Section 8.3.3
155
HTTP Response
HTTP/1.1 200 OK
Content-Type:application/json
(damage metadata)
StarGate Implementation
For each URI-M in TimeMap
Section 8.3.3
156
StarGate Implementation
Parse damage from JSON
returned from API, associate
with URI-M
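The per-URI-M enrichment step can be sketched with the damage lookup passed in as a function, so the sketch makes no claim about the damage service's actual response format. `enrich_with_damage`, `fetch_damage`, and the sample values are hypothetical.

```python
def enrich_with_damage(entries, fetch_damage):
    """Return copies of the TimeMap entries with a "damage" attribute attached."""
    enriched = []
    for entry in entries:
        annotated = dict(entry)  # leave the original TimeMap entry untouched
        annotated["damage"] = fetch_damage(entry["uri"])
        enriched.append(annotated)
    return enriched

# Stand-in for the external service: a lookup table keyed by URI-M.
known_damage = {"m1": 0.24, "m2": 0.61}
timemap = [{"uri": "m1"}, {"uri": "m2"}]

starmap = enrich_with_damage(timemap, known_damage.__getitem__)
```

Since each lookup is independent, an implementation could issue them concurrently or asynchronously.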
Section 8.3.3
157
HTTP Response
HTTP/1.1 200 OK
Content-Length: 1207
Content-Type: application/cdxj+ors
Preference-Applied: damage="<0.4"
CDXJ
TM
StarGate Implementation
Smaller size
Indicative of filtering
Section 8.3.3
158
HTTP Response
HTTP/1.1 200 OK
Content-Length: 1207
Content-Type: application/cdxj+ors
Preference-Applied: damage="<0.4"
CDXJ
TM
StarGate Implementation
Addresses MMAD5:
⇄ StarGate for negotiation
in dimensions beyond time
Mink as a Client-Side Memento Meta-Aggregator
Figure 94, Chapter 8
159
Archival Specification and Query Precedence in
Mink
Figure 95, Chapter 8
160
Outline
● Introduction/Motivation
● Background
● Related Work
● The Mementity Framework
● Applying Framework to Scenarios
● Contributions, Conclusions, and Future Work
161
RQ1: What sort of content is difficult to capture and replay for preservation
from the perspective of a Web browser?
● JavaScript largely the culprit
● Content behind authentication neglected for preservation
● We created a novel approach to preserving content behind authentication using browser
APIs (WARCreate); previously inaccessible content is preserved to WARC.
● Studies on challenges in preservation:
○ Change in Archivability over Time (TPDL 2013)
○ Archival Acid Test (JCDL 2014)
○ Measuring the Impact of Missing Resources (JCDL 2014, IJDL 2015)
○ Impact of JS on Archivability (IJDL 2016)
● Tools to mitigate:
○ WARCreate (JCDL 2012)
○ ArchiveNow (JCDL 2018)
○ Replay Banners (JCDL 2018)
Chapter 9
162
RQ2: How do Web browser APIs compare in potential functionality to the
capabilities of archival crawlers?
Non-browser-based preservation tools miss representations that use latest Web technologies.
● Browser preservation vs. Crawlers:
○ Identifying Personalized Representation in the Archive (D-Lib 2013)
○ Archival Acid Test (JCDL 2014)
○ Service Workers for Client-side Replay (JCDL 2017)
○ Extensible Replay Banners (JCDL 2018)
● Tools to mitigate:
○ WARCreate (JCDL 2012)
○ Mink (JCDL 2014)
○ Mobile Mink (JCDL 2015)
○ WAIL (JCDL 2017, PDA 2013)
Chapter 9
163
RQ3: What issues exist for capturing and replaying content behind
authentication?
URI clash issue, need something to distinguish private captures on replay.
Key in preservation is to rely on native tools used to view the Web.
● Studies on Replay:
○ Impact of URI Canonicalization (JCDL 2017)
○ Framework for Aggregating Private & Public Web Archives (JCDL 2018)
Chapter 9
Figure 10, Chapter 1
Capturing
vs
164
RQ4: How can content that was captured behind authentication signal to Web
archive replay systems that it requires special handling?
● Utilization of CDXJ, enrichment of TimeMaps to StarMaps with access attributes.
● We introduced a standard, extensible mechanism for regulating access to private Web
archives using the PWAA abstraction.
● Studies on auth+replay:
○ InterPlanetary Wayback (JCDL 2016, TPDL 2016)
○ Framework for Aggregating Private & Public Web Archives (JCDL 2018)
Chapter 9
19981212013921 {
"uri": "http://localhost:8080/20101116060516/http://facebook.com/",
...
"access": {
"type": "Blake2b",
"token": "c6ed419e74907d220c69858614d86...ef0a3a88a41"
}
}
Figure 64, Chapter 6
165
RQ5: How can Memento aggregators indicate that private Web archive content
requires special handling to be replayed, despite being aggregated with publicly
available Web archive content?
Aforementioned access attributes, interoperability with PWAA for acquiring and representing
tokens in StarMaps
● Aggregation + Private Archives:
○ Mink (JCDL 2014)
○ Framework for Aggregating Private & Public Web Archives (JCDL 2018)
Chapter 9
166
RQ6: What kinds of access control do users who create private Web archives
need to regulate access to their archives?
● We integrated IPFS with the WARC standard via InterPlanetary Wayback to allow for
client-side archival replay without needing to possess the archival holdings.
● We enabled symmetric encryption of personal archival holdings within InterPlanetary
Wayback.
● Studies on Access Control + private archives:
○ InterPlanetary Wayback (JCDL 2016, TPDL 2016)
○ Framework for Aggregating Private & Public Web Archives (JCDL 2018)
Chapter 9
167
Future Work
● Prefer with archives instead of more generic config keyword
○ Potential name clash at benefit of extensibility
● More elegant/semantic/syntactic expressions of negotiation attributes
○ Prefer: damage="<0.5" is limited by RFC/HTTP syntax
● Public-private inter-archive leakage
● Inference of private-public based on holdings beyond boolean classification
● Further investigations of private archives’ contents, inherently inaccessible
○ Which the Mementity Framework mitigates
Section 9.2
168
Conclusions
Content behind authentication
on the live Web can now be
preserved and systematically
replayed
Section 9.3
169
Hierarchy of Aggregated Public-Private Archives
170
Conclusions
171
→{ }→{ }
→{ }
Collective Experience of the past Web
Can be Aggregated
Section 9.3
Conclusions
Access to Private Archives Directly or Through Sharing Can Be Systematic
172
Section 9.3
Figure 81, Chapter 7
Conclusions
Enabled Client-Side Aggregation and Archival Supplementation
173
Figures 94 and 95, Chapter 8
Section 9.3
Research Dissemination
● 17 publications that address 6 RQs →
● 5 Best Paper/Poster awards/nominations
● 28 research blog posts for ODU WS-DL
● Collaborations with 8 different institutions (inc. ODU)
● 4 first-author, open-source, publicly released
software packages (collab. on 3 others from ODU)
Aggregating Private and Public Web Archives Using the Mementity Framework

More Related Content

Similar to Aggregating Private and Public Web Archives Using the Mementity Framework

Beyond MARC: BIBFRAME and the Future of Bibliographic Data
Beyond MARC: BIBFRAME and the Future of Bibliographic DataBeyond MARC: BIBFRAME and the Future of Bibliographic Data
Beyond MARC: BIBFRAME and the Future of Bibliographic DataEmily Nimsakont
 
The promises of web scrapping: Mining the web for relational data about artists
The promises of web scrapping: Mining the web for relational data about artistsThe promises of web scrapping: Mining the web for relational data about artists
The promises of web scrapping: Mining the web for relational data about artistsGuillaume Cabanac
 
Skb web2.0
Skb web2.0Skb web2.0
Skb web2.0animove
 
Exploring the Use of Linked Data to Bridge State and Federal Archives
Exploring the Use of Linked Data to Bridge State and Federal ArchivesExploring the Use of Linked Data to Bridge State and Federal Archives
Exploring the Use of Linked Data to Bridge State and Federal ArchivesJon Voss
 
Web 1.0: The Web as Resource
Web 1.0:  The Web as ResourceWeb 1.0:  The Web as Resource
Web 1.0: The Web as ResourceJohan Koren
 
Web 1.0: The Web as Resource
Web 1.0:  The Web as ResourceWeb 1.0:  The Web as Resource
Web 1.0: The Web as ResourceJohan Koren
 
Web 2.0 In a Nutshell : A Librarian Guide to the World of Web 2.0
Web 2.0 In a Nutshell: A Librarian Guide to the World of Web 2.0Web 2.0 In a Nutshell: A Librarian Guide to the World of Web 2.0
Web 2.0 In a Nutshell : A Librarian Guide to the World of Web 2.0teaguese
 
Marc and beyond: 3 Linked Data Choices
 Marc and beyond: 3 Linked Data Choices  Marc and beyond: 3 Linked Data Choices
Marc and beyond: 3 Linked Data Choices Richard Wallis
 
Deepak semantic web_iitd
Deepak semantic web_iitdDeepak semantic web_iitd
Deepak semantic web_iitdDeepak Shevani
 
Archiving Web-Based #musetech for Institutional Memory
Archiving Web-Based #musetech for Institutional MemoryArchiving Web-Based #musetech for Institutional Memory
Archiving Web-Based #musetech for Institutional MemorySamantha Norling
 
Cinf flash v2 final
Cinf flash v2 finalCinf flash v2 final
Cinf flash v2 finalSean Ekins
 
What Is Linked Data, and What Does it Mean for Libraries? ALAO TEDSIG Spring ...
What Is Linked Data, and What Does it Mean for Libraries? ALAO TEDSIG Spring ...What Is Linked Data, and What Does it Mean for Libraries? ALAO TEDSIG Spring ...
What Is Linked Data, and What Does it Mean for Libraries? ALAO TEDSIG Spring ...Emily Nimsakont
 
A tour of the library of the future
A tour of the library of the futureA tour of the library of the future
A tour of the library of the futureBethan Ruddock
 
Bootstrapping Web Archive Collections of Stories from Micro-collections in S...
Bootstrapping Web Archive Collections  of Stories from Micro-collections in S...Bootstrapping Web Archive Collections  of Stories from Micro-collections in S...
Bootstrapping Web Archive Collections of Stories from Micro-collections in S...Alexander Nwala
 
Supporting Account-based Queries for Archived Instagram Posts
Supporting Account-based Queries for Archived Instagram PostsSupporting Account-based Queries for Archived Instagram Posts
Supporting Account-based Queries for Archived Instagram PostsHimarsha Jayanetti
 
Information sharing about Columbia University Library’s recent web archiving ...
Information sharing about Columbia University Library’s recent web archiving ...Information sharing about Columbia University Library’s recent web archiving ...
Information sharing about Columbia University Library’s recent web archiving ...Anna Perricci
 
Social Discovery, Social Access
Social Discovery, Social AccessSocial Discovery, Social Access
Social Discovery, Social AccessStephen Francoeur
 
Building an Accessible Digital Institution
Building an Accessible Digital InstitutionBuilding an Accessible Digital Institution
Building an Accessible Digital Institutionlisbk
 

Aggregating Private and Public Web Archives Using the Mementity Framework

  • 1. Aggregating Private and Public Web Archives Using the Mementity Framework. PhD Dissertation Defense for: Mat Kelly. Advisor: Michele C. Weigle. Committee Members: Michele C. Weigle, Michael L. Nelson, Danella Zhao, Justin F. Brunelle, and Dimitrie C. Popescu. Department of Computer Science, Norfolk, Virginia 23529 USA. May 7, 2019.
  • 2. PhD Defense: Mat Kelly, May 7, 2019. @machawk1 • @WebSciDL. The Web
  • 3. The Web is Ephemeral
  • 4. The Web is Ephemeral
  • 5. The Web is Ephemeral
  • 6. The Web is Ephemeral
  • 7. Web Archives to the Rescue: Typical Access. (1) Go to archive.org in your browser. (2) Enter the URL you want to see in the past in the form field. (3) Submit your query.
  • 8. Web Archives to the Rescue: Typical Access. (4) Locate the archived Web page on the calendar or histogram view. (5) Select the year/selection for the day. (6) Repeat until you find the closest date and time.
  • 9. Web Archives to the Rescue: Typical Access. (7) Finally, view the archived Web page.
  • 10. Web Archiving - Live Web: odu.edu. Timeline: Now.
  • 11. Web Archiving: View The Web of the Past. Timeline: May 4, 2019; Now.
  • 12. Web Archiving: View The Web of the Past. Timeline: August 2012 (when I started my PhD); Now.
  • 13. Web Archiving: View The Web of the Past. Timeline: August 2010 (when I started at ODU); Now.
  • 14. Web Archiving: View The Web of the Past. Timeline: June 2006 (when I finished my BS); Now.
  • 15. Web Archiving: View The Web of the Past. Timeline: August 2001 (when I started college); Now.
  • 16. Web Archiving. Timeline: 2019, 2012, 2010, 2006, 2001. Associate a live Web URI with its archived representations.
  • 17. Web Archives Provide Access to the Web That Was. What did odu.edu look like in the past?
  • 18. Multiple Archival Efforts (from Alam et al., IJDL 2016, https://git.io/archives)
  • 19. Consulting More Archives Produces a More Comprehensive Picture
  • 20. Even Then, Not Everything is Preserved. What did obscuresite.com look like in the past? 0 mementos for obscuresite.com.
  • 21. What a User Sees on the Live Web May Not Be What is Captured. What did facebook.com look like in the past? (Figures 10a and 10b, Chapter 1)
  • 22. ...And Oftentimes That is For the Best. Have you preserved my online banking? (Figure 8, Chapter 1)
  • 23. Private Live Web Content Prescriptively Ephemeral (Figures 9a & 9b, Chapter 1)
  • 24. Other Times, We May Want Our Content Archived. Have you archived my daughter’s pictures on Google Photos? 0 mementos for that URI.
  • 25. ...Especially When It Has Disappeared. Have you archived my daughter’s pictures on Google Photos!?!? 0 mementos for that URI! “It was there yesterday, where did it go?”
  • 26. “Save this, but only for me.” ● Screenshots of Web pages are insufficient ○ Not interactive/representative, do not integrate, lose context otherwise provided in metadata ● Large-scale archives’ tools are open source ● Individuals can archive, but there are still technical barriers
  • 27. Individuals, Too, Can Archive The Web. “I want to share this but control who can see it.”
  • 28. Archives from Institutional and Personal Sources (figure labels: at tA, at tC, at tD→Z)
  • 29. Today’s Memento Aggregation. Archives Queried (A0)
  • 30. Today’s Memento Aggregation. Archives Queried (A0)
  • 31. Desire: Include Personal Archives. Archives Queried (A0)
  • 32. Desire: Include Other Non-Aggregated Archives. Archives Queried (A0)
  • 33. Rapidly Changing Pages May Not Be Comprehensively Captured
  • 34. More Archives Provides a Better Picture of the Web. Before aggregating, we must first consider the ramifications, enumerated in the RESEARCH QUESTIONS:
  • 35. Research Questions
        RQ1: What sort of content is difficult to capture and replay for preservation from the perspective of a Web browser?
        RQ2: How do Web browser APIs compare in potential functionality to the capabilities of archival crawlers?
        RQ3: What issues exist for capturing and replaying content behind authentication?
        RQ4: How can content that was captured behind authentication signal to Web archive replay systems that it requires special handling?
        RQ5: How can Memento aggregators indicate that private Web archive content requires special handling to be replayed, despite being aggregated with publicly available Web archive content?
        RQ6: What kinds of access control do users who create private Web archives need to regulate access to their archives?
  • 36. Research Questions: RQ1 (same six questions as slide 35)
  • 37. Research Questions: RQ2 (same six questions as slide 35)
  • 38. Research Questions: RQ3 (same six questions as slide 35)
  • 39. Research Questions: RQ4 (same six questions as slide 35)
  • 40. Research Questions: RQ5 (same six questions as slide 35)
  • 41. Research Questions: RQ6 (same six questions as slide 35). To answer these questions, we first have to be familiar with BACKGROUND INFORMATION
  • 42. Outline ● Introduction/Motivation ● Background ● Related Work ● The Mementity Framework ● Applying Framework to Scenarios ● Contributions, Conclusions, and Future Work
  • 43. Hypertext Transfer Protocol (HTTP): HTTP Request. Request line and request headers: > GET /~mkelly/dissertation.pdf HTTP/1.1 > Host: www.cs.odu.edu > Connection: close >
  • 44. Hypertext Transfer Protocol (HTTP): HTTP Response. Status line and response headers (metadata): HTTP/1.1 200 OK Server: nginx Date: Sun, 05 May 2019 18:31:31 GMT Content-Type: application/pdf Content-Length: 40067283. Start of entity body: %PDF-1.5 ...
  • 45. Background: Memento. Figure 21, Chapter 2, adapted from Memento Guide: Introduction, http://www.mementoweb.org/guide/quick-intro/, January 2015, in Public Domain. * H. Van de Sompel et al. HTTP Framework for Time-Based Access to Resource States – Memento. IETF RFC 7089, December 2013.
  • 46. Background: Memento Request Example. Request cnn.com at Aug 2, 2017 at 6:15pm EST. HTTP Request to the TimeGate (URI-G): • GET: http://web.archive.org/web/http://www.cnn.com • Accept-Datetime: Wed, 02 Aug 2017 23:15:00 GMT
  • 47. Background: Memento Request Example. Request cnn.com at Aug 2, 2017 at 6:15pm EST. HTTP Request to the TimeGate (URI-G): • GET: http://web.archive.org/web/http://www.cnn.com • Accept-Datetime: Wed, 02 Aug 2017 23:15:00 GMT. HTTP Response (302): • Memento-Datetime: Wed, 02 Aug 2017 23:18:04 GMT • Location: http://web.archive.org/web/20170802231804/http://www.cnn.com/ • Link: relations memento (URI-M), timegate (URI-G), original (URI-R), timemap (URI-T)
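The TimeGate exchange on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the dissertation: the helper names `accept_datetime` and `parse_link_header` are hypothetical, the `SAMPLE_LINK` value is an illustrative Link header rather than a captured response, and the parser only handles quoted parameter values.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime(dt):
    """Format an aware datetime as the HTTP date used in Accept-Datetime."""
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)


def parse_link_header(value):
    """Map each rel value in a Link header to the target URIs that carry it.

    Simplified: only parameters with quoted values are recognized.
    """
    rels = {}
    pattern = r'<([^>]*)>((?:\s*;\s*[^;,=\s]+="[^"]*")*)'
    for uri, params in re.findall(pattern, value):
        match = re.search(r'rel="([^"]*)"', params)
        if match:
            # A single link may carry several space-separated relations.
            for rel in match.group(1).split():
                rels.setdefault(rel, []).append(uri)
    return rels


# Illustrative Link header (assumed shape, not a recorded response).
SAMPLE_LINK = (
    '<http://www.cnn.com/>; rel="original", '
    '<http://web.archive.org/web/http://www.cnn.com/>; rel="timegate", '
    '<http://web.archive.org/web/timemap/link/http://www.cnn.com/>; '
    'rel="timemap"; type="application/link-format", '
    '<http://web.archive.org/web/20170802231804/http://www.cnn.com/>; '
    'rel="memento"; datetime="Wed, 02 Aug 2017 23:18:04 GMT"'
)

request_headers = {
    "Accept-Datetime": accept_datetime(
        datetime(2017, 8, 2, 23, 15, tzinfo=timezone.utc)
    )
}
relations = parse_link_header(SAMPLE_LINK)
```

A client would send `request_headers` with the GET to URI-G and then follow the `Location` header of the 302 response to reach the memento (URI-M).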
  • 48. Background: Dereferencing a TimeMap at URI-T. Request URI-T ● Date-based pagination ● Other formats for TimeMap. Figure labels: TimeMap (URI-T), paginated TimeMaps (URI-T 1 ... URI-T t), first memento (URI-M 1), memento (URI-M m), last memento (URI-M n), TimeGate (URI-G), original resource (URI-R)
  • 49. Background - Privacy and Security ● Web users question trusting institutions to preserve private Web contents [1] ● OAuth 2.0 [2] facilitates authentication cohesion of entities. [1] Marshall and Shipman, “On the Institutional Archiving of Social Media”, JCDL 2012. [2] D. Hardt. The OAuth 2.0 Authorization Framework. IETF RFC 6749, October 2012. RQ3: What issues exist for capturing and replaying content behind authentication? RQ6: What kinds of access control do users who create private Web archives need to regulate access to their archives? Figure 27, Chapter 2
  • 50. Role-based Delegation and Authentication: a familiar paradigm used for authentication on the live Web (figure steps 1-5). RQ3: content behind auth. RQ6: access control.
  • 51. Private vs. Public Web archives. Both personal and institutional Web archiving can be either public or private. Table I, Chapter 1. [49] Brunelle et al., “Leveraging Heritrix and the Wayback Machine on a Corporate Intranet…”, D-Lib Magazine, 22(1/2), 2016
  • 52. Private vs. Public Web archives. Both personal and institutional Web archiving can be either public or private. Table I, Chapter 1. [49] Brunelle et al., “Leveraging Heritrix and the Wayback Machine on a Corporate Intranet…”, D-Lib Magazine, 22(1/2), 2016. OUR FOCUS
  • 53. HTTP Prefer ● HTTP negotiation already available via Accept-* headers ● Prefer syntax provides a mechanism for a client to specify preferences ○ ...with which servers may or may not comply. Request header: Prefer: foo; bar="baz". Response header: Preference-Applied: bar="baz". * J. Snell. Prefer Header for HTTP. IETF RFC 7240, June 2014. RQ4: How can content that was captured behind authentication signal to Web archive replay systems that it requires special handling? RQ5: How can Memento aggregators indicate that private Web archive content requires special handling to be replayed, despite being aggregated with publicly available Web archive content?
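To make the Prefer syntax concrete, here is a small, simplified parser sketch in Python. It is an illustration only: the function names are hypothetical, and it covers just the comma- and semicolon-separated token structure shown on the slide, not every corner of RFC 7240.

```python
def _split_outside_quotes(text, sep):
    """Split text on sep, ignoring separators inside double quotes."""
    parts, current, in_quotes = [], [], False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
        if ch == sep and not in_quotes:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def _key_value(token):
    """Return (lowercased name, unquoted value or None) for one token."""
    if "=" in token:
        name, value = token.split("=", 1)
        return name.strip().lower(), value.strip().strip('"')
    return token.strip().lower(), None


def parse_prefer(header_value):
    """Parse a Prefer header value into {preference: (value, parameters)}."""
    preferences = {}
    for item in _split_outside_quotes(header_value, ","):
        tokens = [t.strip() for t in _split_outside_quotes(item, ";") if t.strip()]
        if not tokens:
            continue
        name, value = _key_value(tokens[0])
        params = dict(_key_value(t) for t in tokens[1:])
        # Keep only the first occurrence of a repeated preference.
        preferences.setdefault(name, (value, params))
    return preferences


parsed = parse_prefer('foo; bar="baz"')
```

A server that honors a preference would echo what it applied in a `Preference-Applied` response header; a server that ignores it simply omits that header.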
  • 54. Memento Aggregation* State of the Art: Time Travel Service, MemGator. * Querying multiple Web archives for a URI from a single endpoint at a Memento aggregator. RQ5: How can Memento aggregators indicate that private Web archive content requires special handling to be replayed, despite being aggregated with publicly available Web archive content?
  • 55. Memento Aggregation - MementoWeb. Figures 25a and 25b, Chapter 2. Also available via CLI: $ curl http://timetravel.mementoweb.org/timemap/link/http://odu.edu
  • 56. Memento Aggregation - MemGator ● Open Source Memento Aggregator - github.com/oduwsdl/memgator ● Easy personal/local deployment ● Specify archive list on launch ○ Easily configurable JSON ○ Use default collection if not specified ● TimeMap Formats: ○ Link ○ JSON ○ CDXJ. * Alam and Nelson, “MemGator - A Portable Concurrent Memento Aggregator: Cross-Platform CLI and Server Binaries in Go”, JCDL 2016
  • 57. CDXJ: An Alternative TimeMap Format. Memento entry in Link (RFC 7089) TimeMap: ... <http://localhost:8080/20101116060516/http://facebook.com/>; rel="memento"; datetime="Tue, 16 Nov 2010 06:05:16 GMT", ... Memento entry in CDXJ TimeMap: ... 20101116060516 { "uri": "http://localhost:8080/20101116060516/http://facebook.com/", "rel": "memento", "datetime": "Tue, 16 Nov 2010 06:05:16 GMT" } ... Figure labels: Memento (URI-M), Relative Relations, Memento-Datetime. See Alam, “CDXJ: An Object Resource Stream Serialization Format”, 2015. Figure 26, Chapter 2
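The correspondence between the two serializations can be sketched as follows. This is a minimal Python illustration that assumes a CDXJ memento record is a lookup key, one space, and a single-line JSON object, as in the slide's example; the function names are hypothetical.

```python
import json


def parse_cdxj_line(line):
    """Split a CDXJ record into its key and its JSON attribute block."""
    key, _, payload = line.strip().partition(" ")
    return key, json.loads(payload)


def to_link_entry(attrs):
    """Render the same memento as a Link-format (RFC 7089) TimeMap entry."""
    return '<{uri}>; rel="{rel}"; datetime="{datetime}"'.format(**attrs)


# The record from the slide, written on one line as CDXJ requires.
RECORD = (
    '20101116060516 {"uri": "http://localhost:8080/20101116060516/'
    'http://facebook.com/", "rel": "memento", '
    '"datetime": "Tue, 16 Nov 2010 06:05:16 GMT"}'
)

key, attrs = parse_cdxj_line(RECORD)
link_entry = to_link_entry(attrs)
```

Because the attribute block is plain JSON, extra members can be added to a record without changing how existing consumers split the key from the payload.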
  • 58. Mink: Visual User Interaction with Aggregators ● Browser extension we developed* ● Bridges gap between live and archived Webs ● Leverages Memento aggregator’s capability, returns TimeMaps ● Indicates # of captures for a URI while you browse ● Provides navigation of mementos while browsing live Web ● Single-click submission of URI-R to multiple Web archives. * Kelly et al., “Mink: Integrating the Live and Archived Web Viewing Experience Using Web Browsers and Memento”, JCDL 2014. Figure 36, Chapter 4
  • 59. Outline ● Introduction/Motivation ● Background ● Related Work ● The Mementity Framework ● Applying Framework to Scenarios ● Contributions, Conclusions, and Future Work
  • 60. Related Work. Memento Aggregation (RQ 5): ● Bornand et al. (JCDL 2016) - routing using classifier ● Nelson and Alam (JCDL 2016) - routing using probability ● Rosenthal et al. (DLib 2015) - aggregation w/ restricted access ● AlSum et al. (IJDL 2014) - routing URI lookups ● Brunelle and Nelson (JCDL 2016) - TimeMap caching ● Alam et al. (IJDL 2016) - efficient querying through profiling ● Rosenthal (2013) - aggregator scalability. Prefer and Web Archives (RQ 5): ● Jones et al. (2016) - introduced “raw” mementos ● Van de Sompel et al. (2016) - Prefer usage for raw ● Rosenthal (2016) - other Prefer usage in web archives. Privacy in Web Archives (RQ 3-6): ● Marshall and Shipman (JCDL 2014) - Preserving FB, social media content ● Lindley et al. (WWW 2013) - Users archiving social media. Access Control in Web Archives (RQ 4-6): ● Webber (2018) - HTTP 451 @ UKWA ● Rao et al. (SIGOPS 2001) iProxy - access control in web archives. Collaboration in Web Archives (RQ 5): ● Potts (UXPA 2015) - creating collections with tags ● Stine (2018) Cobweb - collection building between institutions at Archive-It ● Phillips (2003) PANDORA dynamics ● Niu (DLib 2012) - personalized features at PANDORA. Public Web Archiving (RQ 2): ● Brunelle et al. (JCDL 2017) - JS & deferred representations ● Ben-David and Huurdeman (Alexandria 2014) - archives retrieval beyond URI ● AlNoamany et al. (JCDL 2013) - access patterns. Private Web Archiving (RQ 1-4): ● Brunelle et al. (DLib 2016) - archiving corporate intranet ● Marshall and Shipman (JCDL 2012) - Institutions archiving social media
  • 61. Related Work (RQ1). RQ1: What sort of content is difficult to capture and replay for preservation from the perspective of a Web browser?
  • 62. Related Work (RQ2). RQ2: How do Web browser APIs compare in potential functionality to the capabilities of archival crawlers?
  • 63. Related Work (RQ3). RQ3: What issues exist for capturing and replaying content behind authentication?
  • 64. Related Work (RQ4). RQ4: How can content that was captured behind authentication signal to Web archive replay systems that it requires special handling?
  • 65. Related Work (RQ5). RQ5: How can Memento aggregators indicate that private Web archive content requires special handling to be replayed, despite being aggregated with publicly available Web archive content?
  • 66. Related Work (RQ6). RQ6: What kinds of access control do users who create private Web archives need to regulate access to their archives?
  • 68. Outline ● Introduction/Motivation ● Background ● Related Work ● The Mementity Framework ○ Archival Negotiation Beyond Time ○ Query Precedence & Short-Circuiting ○ Mementities ● Applying Framework to Scenarios ● Contributions, Conclusions, and Future Work
  • 69. More Expressive TimeMaps ● Memento Quality (e.g., Damage) [1] ● How Many Captures? [2] ● How Many Are Identical? [2,3] ● Other Attributes of Mementos... [1] Brunelle, Kelly et al., JCDL 2014, IJDL 2015. [2] Kelly et al., JCDL 2017. [3] AlSum and Nelson, ECIR 2014. Figure 7, Chapter 1. Section 6.2
  • 70. Additional TimeMap Attributes: Content-based Attributes, Derived Attributes, Access Attributes
  • 71. TimeMap Enrichment: Content-Based Attributes ● Status Code [1] ● Content-Digest ○ In WARC & CDX ○ Not all archives expose CDX ● Would allow more info about mementos without requiring comprehensive dereferencing. [1] Kelly et al., “Impact of URI Canonicalization on Memento Count”, JCDL 2017, arXiv 1703.03302. Section 6.2.1
  • 72. TimeMap Enrichment: Derived Attributes ● Thumbnails (e.g., via SimHash) [1] ○ Calculation based on root memento’s HTML ● Memento Damage (JCDL 2014, IJDL) [2] ○ Requires dereferencing embedded resources. Example: apple.com, many duplicate mementos! [1] AlSum and Nelson, Thumbnail Summarization Techniques for Web Archives, ECIR 2014, pp. 299-310. [2] Brunelle, Kelly et al., “The Impact of JavaScript on Archivability,” IJDL, 17(2), pp. 95-117. January 2016. Section 6.2.2
  • 73. TimeMap Enrichment: Access Attributes. How to distinguish Private captures from Public captures in a TimeMap? RQ5: How can Memento aggregators indicate that private Web archive content requires special handling to be replayed, despite being aggregated with publicly available Web archive content? Section 6.2.3
  • 74. TimeMap Enrichment - in a CDXJ TimeMap. 20101116060516 { "uri": "http://localhost:8080/20101116060516/http://facebook.com/", "rel": "memento", "datetime": "Tue, 16 Nov 2010 06:05:16 GMT", "status_code": 200, "digest": "sha1:LK26DRRQJ4WATC6LBVF3B3Z4P2CP5ZZ7", "damage": 0.24, "simhash": "6551110622422153488", "content-language": "en-US", "access": { "type": "Blake2b", "token": "c6ed419e74907d220c69858614d86...ef0a3a88a41" } } Line breaks added for clarity; CDXJ records occupy a single line. Attribute groups marked in the figure: Content-based attributes, Derived Attributes, Access Attributes. Figure 64, Chapter 6
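One benefit of enriched records is that a client can select mementos without dereferencing each URI-M. The Python sketch below illustrates that idea using the attribute names from the slide (`status_code`, `digest`, `damage`); the sample records, the function name, and the 0.5 damage threshold are assumptions made for illustration.

```python
def select_mementos(records, max_damage=0.5):
    """Keep HTTP 200 captures under a damage threshold, one per digest.

    records is an iterable of (key, attributes) pairs, as produced by
    parsing CDXJ lines; the threshold default is arbitrary.
    """
    seen_digests = set()
    kept = []
    for key, attrs in records:
        if attrs.get("status_code") != 200:
            continue  # skip redirects and errors
        if attrs.get("damage", 0.0) > max_damage:
            continue  # skip heavily damaged captures
        digest = attrs.get("digest")
        if digest is not None and digest in seen_digests:
            continue  # identical payload already selected
        if digest is not None:
            seen_digests.add(digest)
        kept.append((key, attrs))
    return kept


# Hypothetical enriched records (values invented for the example).
RECORDS = [
    ("20101116060516", {"status_code": 200, "digest": "sha1:AAA", "damage": 0.24}),
    ("20101117060516", {"status_code": 200, "digest": "sha1:AAA", "damage": 0.24}),
    ("20101118060516", {"status_code": 302, "digest": "sha1:BBB", "damage": 0.0}),
    ("20101119060516", {"status_code": 200, "digest": "sha1:CCC", "damage": 0.9}),
    ("20101120060516", {"status_code": 200, "digest": "sha1:DDD", "damage": 0.1}),
]

selected_keys = [key for key, _ in select_mementos(RECORDS)]
```

The same loop could equally run on the aggregator side, which is the filtering step the cost discussion later attributes to the client's preference.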
  • 75. TimeMap Enrichment with Additional Attributes: “StarMap”
  • 76. Outline ● Introduction/Motivation ● Background ● Related Work ● The Mementity Framework ○ Archival Negotiation Beyond Time ○ Query Precedence & Short-Circuiting ○ Mementities ● Applying Framework to Scenarios ● Contributions, Conclusions, and Future Work
  • 77. Query Precedence: “Check my archive first, then Carol’s, then all public archives.” (figure query order: 1, 2, 3, 3) ● More control of querying in series and parallel. See Atkins, “Paywalls in the Internet Archive”, March 2018
  • 78. Precedence in Archival Specifications ● MemGator’s internal archival specification N1: [{A0},{A1},{A2}] N2: {{A0},{A1},{A2}} N3: {{A0},{A4,A2,A5},{A3}} N4: {"id0":{A0},"id1":{A4},"id2":{A3}} N5: [{"id0":{A0}},{"id1":{A1}},{"id2":{A2}}] N6: [{"id0":{A0}}, {"id1a":{A4},"id1b":{A2},"id1c":{A5}}, {"id2":{A3}}] ● N1: despite JSON array, archives queried in “pseudo-parallel” for efficiency
  • 79. Precedence in Archival Specifications ● N2 more descriptive of behavior despite notation interpretation ● JSON object members require identifiers
  • 80. Precedence in Archival Specifications ● N3: can introduce “sets” of URIs to archive, but still query in pseudo-parallel ● JSON object members require identifiers
  • 81. Precedence in Archival Specifications ● N4: add identifiers for JSON compliance ● Still pseudo-parallel
  • 82. Precedence in Archival Specifications ● N5: query archives in-series
  • 83. Precedence in Archival Specifications ● N6: query archival sets in-series
  • 84. Precedence Using Prefer. The specification [{"id0":{A0}}, {"id1a":{A4},"id1b":{A2},"id1c":{A5}}, {"id2":{A3}}] ENCODES TO Prefer: archives="data:application/json;charset=utf-8;base64,Ww0KICB7...NCn0=" (no aggregator prior to this dissertation supports this)
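The encoding step can be sketched in Python as below. This is an illustration under stated assumptions: the archive descriptions (`{"name": ...}`) are hypothetical stand-ins for the slide's A0...A5 placeholders, the function names are invented, and the exact JSON whitespace (and therefore the exact base64 text) will differ from the value elided on the slide.

```python
import base64
import json

PREFIX = 'archives="data:application/json;charset=utf-8;base64,'


def archives_preference(spec):
    """Serialize an archival specification as a Prefer 'archives' value."""
    raw = json.dumps(spec).encode("utf-8")
    return PREFIX + base64.b64encode(raw).decode("ascii") + '"'


def decode_archives_preference(value):
    """Recover the archival specification from a Prefer 'archives' value."""
    if not (value.startswith(PREFIX) and value.endswith('"')):
        raise ValueError("not an archives data URI preference")
    return json.loads(base64.b64decode(value[len(PREFIX):-1]))


# Hypothetical archive descriptions standing in for A0, A2, A3, A4, A5.
SPEC = [
    {"id0": {"name": "A0"}},
    {"id1a": {"name": "A4"}, "id1b": {"name": "A2"}, "id1c": {"name": "A5"}},
    {"id2": {"name": "A3"}},
]

prefer_value = archives_preference(SPEC)
round_trip = decode_archives_preference(prefer_value)
```

A client would send the result as `Prefer: <prefer_value>`; per the Prefer semantics described earlier, the aggregator is free to comply or to ignore it.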
  • 87. Query Precedence: “Check my archive first, then Carol’s, then all public archives.” (figure query order: 1, 2, 3, 3) ● More control of querying in series and parallel. See Atkins, “Paywalls in the Internet Archive”, March 2018
  • 89. Query Short-Circuiting: “Check private archives first. Iff you find no captures, only then check the public archives.” (figure query order: 1, 1, 2, 2) ● May give priority to archive preference. ● Series halts when threshold is met.
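Precedence and short-circuiting together amount to querying archive sets in series, querying the members of each set in parallel, and halting once enough captures are found. The Python sketch below illustrates that control flow; the function name, the `lookup` callback, and the in-memory `HOLDINGS` table are hypothetical and stand in for real TimeMap requests.

```python
from concurrent.futures import ThreadPoolExecutor


def query_with_precedence(spec, lookup, urir, threshold=1):
    """Query archive sets in series, members in parallel, with short-circuit.

    spec is a list of {identifier: archive} sets in precedence order.
    lookup(archive, urir) returns a list of mementos. Querying stops
    after the first set that brings the total to at least threshold.
    """
    found = []
    for archive_set in spec:
        workers = max(1, len(archive_set))
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for mementos in pool.map(lambda a: lookup(a, urir), archive_set.values()):
                found.extend(mementos)
        if len(found) >= threshold:
            break  # short-circuit: later sets are never contacted
    return found


# Hypothetical holdings: only Carol's archive has a capture.
HOLDINGS = {"carol": ["memento-from-carol"]}
queried = []


def fake_lookup(archive, urir):
    queried.append(archive)
    return HOLDINGS.get(archive, [])


SPEC = [{"id0": "alice"}, {"id1": "carol"}, {"id2a": "public-1", "id2b": "public-2"}]
result = query_with_precedence(SPEC, fake_lookup, "http://www.cnn.com/")
```

With `threshold=1` this matches the "only then check the public archives" behavior: the public set is skipped because Carol's archive already satisfied the query.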
  • 90. Outline ● Introduction/Motivation ● Background ● Related Work ● The Mementity Framework ○ Archival Negotiation Beyond Time ○ Query Precedence & Short-Circuiting ○ Mementities ● Applying Framework to Scenarios ● Contributions, Conclusions, and Future Work
  • 91. Mementities ● Memento + Entity (entity term already overused). Figure labels: Time Gate, Conventional Memento, Introduced in this Framework, Mementities
  • 92. MEMENTITY FRAMEWORK: Mementities. Section 7.1.1
  • 93. Memento Meta-Aggregator (MMA): functional ⊇ (Figure 80, Chapter 7). Section 7.3.3
  • 94. MMA: Archive Selection. Figure 55, Chapter 6. Section 6.1
  • 95. MMA: User-Driven Archival Specification. Section 6.1
  • 96. MMA Aggregation sources. MMAα: from MA2, MA1 and WA6. MMAβ: from WA7 and WA8. MMAγ: from MMAβ, MA5, and WA1. Figure 67, Chapter 7
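Because an MMA can list other aggregators among its sources, the set of Web archives it ultimately reaches is found by following those sources recursively. The Python sketch below illustrates this with the three MMAs named on the slide; the contents of MA1, MA2, and MA5 are not given in the slide and are invented here, and the function name is hypothetical.

```python
SOURCES = {
    # From the slide.
    "MMAα": ["MA2", "MA1", "WA6"],
    "MMAβ": ["WA7", "WA8"],
    "MMAγ": ["MMAβ", "MA5", "WA1"],
    # Assumed for illustration only: what the plain aggregators cover.
    "MA1": ["WA1", "WA2"],
    "MA2": ["WA2", "WA3"],
    "MA5": ["WA4", "WA5"],
}


def resolve_archives(name, sources, _seen=None):
    """Return the set of Web archives reachable from an aggregator.

    Names without an entry in sources are treated as Web archives.
    Already-visited names are skipped, so cyclic chains terminate.
    """
    seen = set() if _seen is None else _seen
    if name in seen:
        return set()
    seen.add(name)
    if name not in sources:
        return {name}
    archives = set()
    for child in sources[name]:
        archives |= resolve_archives(child, sources, seen)
    return archives


gamma_archives = resolve_archives("MMAγ", SOURCES)
```

In a deployed chain each aggregator would only know its direct sources, so this global view is a reasoning aid rather than something a single MMA could compute.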
  • 97. Costs of more expressive TimeMaps (StarMaps) and Link header enrichment ● Computational ○ Mostly server-side, potential to further leverage client ● Temporal ○ Required on variant generation ● Spatial ○ Permutation variant storage ● Access ○ Variant negotiation
  • 98. Temporal Costs: Conventional Aggregation. $T = t_{req} + t_{MA} + t_{AGG} + t_{resp}$, where $t_{req}$ is the Client→Aggregator request, $t_{MA}$ is the Aggregator⇄∀archives communication, $t_{AGG}$ is the Aggregator combining, sorting, etc., and $t_{resp}$ is the Aggregator→Client response. $t_{MA} = \max_i \mathrm{RTT}(A_i, \text{URI-R})$. Equations 9 and 10, Chapter 8
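A short worked example may help show why the slowest archive dominates the total. The Python sketch below evaluates the two equations directly; the function name and all millisecond values are invented for illustration and are not measurements from the dissertation.

```python
def aggregation_time(t_req, rtts, t_agg, t_resp):
    """Total time for conventional aggregation.

    rtts maps each archive to its round-trip time for the URI-R.
    Archives are queried in parallel, so only the slowest one counts.
    """
    t_ma = max(rtts.values())
    return t_req + t_ma + t_agg + t_resp


# Hypothetical timings in milliseconds.
RTTS_MS = {"A0": 800, "A1": 2400, "A2": 1100}
total_ms = aggregation_time(t_req=50, rtts=RTTS_MS, t_agg=200, t_resp=50)
```

Here the total is 50 + 2400 + 200 + 50 = 2700 ms: the 2400 ms archive sets the pace, and the faster archives add nothing to the wait.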
  • 99. Impact of MemGator→MMA ● Capability of MMA beyond MemGator adds temporal complexity: ○ $t_{prefer}$ → time and computation required to modify base set of archives and associate with client’s session ○ $t_{enrich}$ → time required for external services to parse/calculate additional attributes per URI-M ○ $t_{filter}$ → time required to filter URI-Ms based on client’s preference (potentially 0). Section 8.2
  • 100. MMA Dynamics By-Example ● Personal Archive Aggregation ● MMA Chaining ● Client-Side Aggregation Preference. Actors: ALICE, BOB, CAROL, MALCOLM
  • 101. Personal Archive Aggregation: Alice Saves the Web
  • 102. Personal Archive Aggregation: Alice Wants to See Her Captures Temporally Inline (at tA, at tD→Z). Section 7.3.5
  • 103. Personal Archive Aggregation: Mementity Dynamics - Alice & Her Archives (WAA). Section 7.3.5
  • 104. Alice Deploys MMAA. Section 7.3.5
  • 105. MMA Chaining: Carol Asks MMAA for CNN. Section 7.3.5
  • 106. MMA Chaining: MMAA Returns CNN Memento {MA, MIA}. Section 7.3.5
  • 107. MMA Chaining: Carol Wants to Aggregate Her Own Captures (M(WAC)) (at tC, at tD→Z,A). Section 7.3.5
  • 108. MMA Chaining: Carol Creates MMAC to Access WAC and MMAA. Section 7.3.5
  • 109. MMA Chaining: Carol Asks MMAC For CNN. Section 7.3.5
  • 110. MMA Chaining: MMAC Returns CNN Memento {MA, MIA, MC}. Section 7.3.5
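The chaining scenario can be pictured as merging already-sorted TimeMaps: what Carol's aggregator receives from Alice's MMA, plus Carol's own captures. The Python sketch below illustrates that merge; the timestamps and URI-Ms are invented, the function name is hypothetical, and real aggregators operate on full TimeMap records rather than pairs.

```python
import heapq


def merge_timemaps(*timemaps):
    """Merge sorted (timestamp, URI-M) lists into one temporally ordered list.

    Each input must already be sorted by its 14-digit timestamp.
    A URI-M reported by more than one source is kept only once.
    """
    merged, seen = [], set()
    for stamp, urim in heapq.merge(*timemaps):
        if urim not in seen:
            seen.add(urim)
            merged.append((stamp, urim))
    return merged


# Hypothetical captures: MA (Alice's archive), MIA (Internet Archive),
# and MC (Carol's archive).
FROM_MMA_A = [
    ("20170801120000", "http://alice.example/20170801120000/http://www.cnn.com/"),
    ("20170802231804", "http://web.archive.org/web/20170802231804/http://www.cnn.com/"),
]
FROM_WA_C = [
    ("20170802090000", "http://carol.example/20170802090000/http://www.cnn.com/"),
]

combined = merge_timemaps(FROM_MMA_A, FROM_WA_C)
```

The merged list places Carol's capture between Alice's and the Internet Archive's, which is the "temporally inline" view the scenario asks for.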
  • 111. Client-Side Aggregation Preference: Bob May Request M(CNN) From MMAA or MMAC. Section 6.1
  • 112. Client-Side Aggregation Preference: Bob Prefers to Exclude IA Captures ...and does not want to set up his own MMA. Section 6.1
  • 113. Client-Side Aggregation Preference: Bob Requests Supported Archives (GET /archives/). Section 6.1
  • 114. Client-Side Aggregation Preference: Bob Customizes the Set in the JSON. Section 6.1
  • 115. Client-Side Aggregation Preference: Bob Requests CNN for His Custom Set. Section 6.1
  • 116. Client-Side Aggregation Preference: MMA Complies or Ignores Preference. Section 6.1
  • 117. Hooray, Aggregation!
  • 118. MEMENTITY FRAMEWORK: Mementities. Section 7.1.2
  • 119. Hooray, Aggregation! HTTP 401: Consult URI-P. RQ4: How can content that was captured behind authentication signal to Web archive replay systems that it requires special handling? Section 7.3.4
  • 120. Private Web Archive Adapter (PWAA) ● Auth layer to encourage Private Web archive aggregation ● Typical OAuth 2.0 flow ● Auth role cohesive to PWAA ● Persistent access through tokenization. RQ6: What kinds of access control do users who create private Web archives need to regulate access to their archives? Figure 63, Chapter 6
  • 121. PWAA - Sharing Tokens. RQ6: What kinds of access control do users who create private Web archives need to regulate access to their archives? Figure 81a, Chapter 7. Section 7.3.4.
  • 122. PWAA - Previously Authorized. Figure 81a, Chapter 7. Section 7.3.4.
  • 123. PWAA - Unauthorized Request. Figure 81a, Chapter 7. Section 7.3.4.
  • 124. PWAA - Sharing Tokens. Figure 81b, Chapter 7. Section 7.3.4.
  • 125. Alice Passes Associative Token to MMA. RQ5: How can Memento aggregators indicate that private Web archive content requires special handling to be replayed, despite being aggregated with publicly available Web archive content? Figure 84a, Chapter 7. Section 7.3.4.
  • 126. MMA requests URI-R... ...relays token where applicable. Figure 84b, Chapter 7. Section 7.3.4.
  • 127. Private Archive Validates with PWAA. Figure 84c, Chapter 7. Section 7.3.4.
  • 128. PWAA Confirms Token. Figure 84d, Chapter 7. Section 7.3.4.
  • 129. Private Archive Returns Captures. Figure 84e, Chapter 7. Section 7.3.4.
  • 130. MMA Aggregates, Associates Token. Figure 84f, Chapter 7. Section 7.3.4.
  • 131. MMA Aggregates, Associates Token. We still need a way to distinguish private archives from public archives when querying an aggregator. Section 7.3.4.
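The token relay on slides 125 to 130 can be illustrated with a toy, in-memory model. The class and method names here (`PWAA`, `PrivateArchive`, `issue`, `is_valid`, `timemap`) are hypothetical stand-ins for this sketch; the dissertation describes an OAuth 2.0 style flow over HTTP, which this example does not attempt to reproduce.

```python
import secrets


class PWAA:
    """Toy stand-in for a Private Web Archive Adapter that issues and checks tokens."""

    def __init__(self):
        self._tokens = {}

    def issue(self, user):
        # Hand out an opaque token the user can later share with an aggregator.
        token = secrets.token_hex(16)
        self._tokens[token] = user
        return token

    def is_valid(self, token):
        return token in self._tokens


class PrivateArchive:
    """Toy private archive that asks the PWAA before returning captures."""

    def __init__(self, pwaa, captures):
        self._pwaa = pwaa
        self._captures = captures

    def timemap(self, uri_r, token=None):
        # Without a token that the PWAA confirms, respond like an HTTP 401.
        if token is None or not self._pwaa.is_valid(token):
            return 401, []
        return 200, self._captures.get(uri_r, [])


pwaa = PWAA()
archive = PrivateArchive(pwaa, {"https://cnn.com": ["alice-capture-1"]})
alice_token = pwaa.issue("alice")

# The aggregator relays Alice's token; a request without one is refused.
authorized = archive.timemap("https://cnn.com", alice_token)
unauthorized = archive.timemap("https://cnn.com")
```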
  • 132. MEMENTITY FRAMEWORK: Mementities
  • 133. StarGate (TimeGate functional ⊆ StarGate) ● Content negotiation in Web archives beyond time ● “Star” ~ wildcard (*) → any dimension of negotiation ● Allows for queries like: Only show me mementos… ○ That are not redirects (content-based attribute: HTTP status ≠ 3XX) ○ That are of sufficient quality (derived attribute: Memento Damage < 0.4) ○ That are from personal Web archives (access attribute: indicates that Alice’s facebook.com mementos are not login pages). Section 7.1.3.
  • 134. Implicit Filtering via MMA or Directly. Section 7.2.1.
  • 135. Negotiation in the Privacy Dimension (via short circuiting): Get URI-Ms for URI-R only from personal Web archives (privateOnly; access attribute). Figure 85, Chapter 7.
  • 136. Negotiation on Content-Based or Derived Attributes (with response filtering): Get URI-Ms for URI-R of good quality that are unique (MD < 0.25, unique(simhash); derived attribute and content-based attribute). Retrieved StarMap → abbreviated StarMap with filtering applied. Figure 86, Chapter 7.
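The response filtering on slide 136 (MD < 0.25, unique(simhash)) can be sketched as below. This is a minimal illustration, not the StarGate implementation: the entry layout with `uri_m`, `damage`, and `simhash` keys and the sample values are assumptions made for the example.

```python
def filter_starmap(entries, max_damage=0.25):
    """Keep entries under the damage threshold, first occurrence per simhash."""
    seen_hashes = set()
    kept = []
    for entry in entries:
        if entry["damage"] >= max_damage:
            continue  # derived attribute: too damaged to be "good quality"
        if entry["simhash"] in seen_hashes:
            continue  # content-based attribute: duplicate of an earlier capture
        seen_hashes.add(entry["simhash"])
        kept.append(entry)
    return kept


# Hypothetical StarMap entries (values are illustrative only).
starmap = [
    {"uri_m": "m1", "damage": 0.10, "simhash": "f3a1"},
    {"uri_m": "m2", "damage": 0.10, "simhash": "f3a1"},
    {"uri_m": "m3", "damage": 0.60, "simhash": "09bc"},
    {"uri_m": "m4", "damage": 0.20, "simhash": "77e0"},
]
abbreviated = filter_starmap(starmap)
```

Exact-match deduplication on the simhash string is the simplest reading of unique(simhash); a near-duplicate comparison over hash distance would be a different design.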
  • 137. Outline ● Introduction/Motivation ● Background ● Related Work ● The Mementity Framework ● Applying Framework to Scenarios ○ Reconsidering Framework Scenarios ○ Framework Implementation ● Contributions, Conclusions, and Future Work
  • 139. “It was there yesterday, where did it go?” ● We created browser-based tools to mitigate live Web ephemerality ○ e.g., WARCreate [1] (Section 4.2) ● ...and personal Web archival replay systems to re-experience the captures and facilitate collaboration and propagation ○ Web Archiving Integration Layer (WAIL) [2,3] (Section 4.3) ○ InterPlanetary Wayback (ipwb) [4,5] (Section 4.5). References: [1] Kelly and Weigle, “WARCreate - Create Wayback-Consumable WARC Files from Any Webpage”, JCDL 2012. [2] Kelly et al., “Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving”, PDA 2013. [3] Berlin, Kelly et al., “WAIL: Collection-Based Personal Web Archiving”, JCDL 2017. [4] Kelly et al., “InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives”, TPDL 2016. [5] Alam, Kelly et al., “InterPlanetary Wayback: The Permanent Web Archive”, JCDL 2016.
  • 140. “Save this, but only for me” ● Our personal Web archiving tools (e.g., previous slide) help with personal Web archiving of potentially private live Web content. ● The Private Web Archive Adapter (Section 7.1.2) in the Mementity Framework allows for systematic access regulation (HTTP 401: Consult URI-P).
  • 141. “I want to share this but control who can see it.” ● Access regulation: PWAA (Sections 7.1.2, 7.2.2) ● Collaboration and propagation of captures: ipwb [1,2] (Section 4.5) ● Symmetric encryption for privacy: ipwb ● Negotiation in dimension of privacy / access attributes [3] (Section 6.2.3). Figure 75, Chapter 7. References: [1] Kelly et al., “InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives”, TPDL 2016. [2] Alam, Kelly et al., “InterPlanetary Wayback: The Permanent Web Archive”, JCDL 2016. [3] Kelly et al., “A Framework for Aggregating Private and Public Web Archives”, JCDL 2018.
  • 142. Outline ● Introduction/Motivation ● Background ● Related Work ● The Mementity Framework ● Applying Framework to Scenarios ○ Reconsidering Framework Scenarios ○ Framework Implementation ● Contributions, Conclusions, and Future Work
  • 143. Reference Implementation: New Software Packages; Adapt to Framework with Additional Capabilities
  • 144. Adaptations for MMA: MMA deliverables (MMADi) 1. Query precedence 2. Short-circuiting 3. Client-side Archival Specification 4. ⇄ PWAA for access control 5. ⇄ StarGate for negotiation in dimensions beyond time. Section 8.3.1.
  • 145. MMAD1: Query Precedence in MemGator 1. Adjusted base reference specification to be MemGator’s intended pseudo-parallel querying model (JSON array → JSON objects). ○ Figure 90, Chapter 8 2. Implement MemGator interpretation of JSON arrays to query in series, waiting for the previous array entry to finish before proceeding. ○ Figure 87, Chapter 8 3. Implement interpretation of JSON “sets” of archives to recursively allow querying a subset in series. ○ Figure 91, Chapter 8. Section 8.3.1.
  • 146. MMAD2: Short-Circuiting in MemGator. MemGator has pre-existing functionality to short-circuit on archival count, based on the predefined order in the server-set archival specification. We leverage this conditional in MemGator, modifying it to depend on a function that is checked at the same point and can prematurely break iteration. Figure 91, Chapter 8. Section 8.3.1.
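In-series querying with a function-based short circuit, as described on slides 145 and 146, might look like the following. MemGator itself is written in Go; this Python sketch only illustrates the control flow, and `query_in_series`, `fake_fetch`, and the sample holdings are hypothetical names and data.

```python
def query_in_series(archives, fetch, should_stop):
    """Query archives one at a time, in order, stopping early when should_stop agrees."""
    collected = []
    for archive in archives:
        collected.extend(fetch(archive))
        # The stop function is checked at the same point where a fixed
        # archive-count limit would otherwise be checked.
        if should_stop(collected):
            break
    return collected


# Hypothetical holdings keyed by archive id (illustrative data only).
HOLDINGS = {"alice": [], "ia": ["ia-1", "ia-2"], "uk": ["uk-1"]}
queried = []


def fake_fetch(archive_id):
    # Record the call order so the short circuit is observable.
    queried.append(archive_id)
    return HOLDINGS[archive_id]


# Stop as soon as any archive has returned at least one memento.
mementos = query_in_series(
    ["alice", "ia", "uk"], fake_fetch, lambda found: len(found) > 0
)
```

Here the third archive is never contacted because the second one already satisfied the stop function.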
  • 147. MMAD3: Client-Side Archival Specification 1. Associate archives list with session, pre-populate with default set. 2. Read Prefer header on request received. Overwrite archives set and associate with session. 3. When querying archives, refer to this set (passed along by reference) instead of a global set defined on startup and read from a local config. Figure 89, Chapter 8. Section 8.3.1.
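Steps 1 and 2 of MMAD3 can be approximated on the server side as follows. This is a sketch under assumptions: the regular expression matches only the exact data-URI form shown on slide 149, the function name `archives_for_session` is hypothetical, and falling back to the default set on a malformed preference is one possible policy rather than the documented one.

```python
import base64
import json
import re

# Matches the archives preference in the form used on slide 149.
ARCHIVES_PREF = re.compile(
    r'archives="data:application/json;charset=utf-8;base64,([A-Za-z0-9+/=]+)"'
)


def archives_for_session(prefer_values, default_archives):
    """Return the client's archive list if one was sent, else a copy of the default set."""
    for value in prefer_values:
        match = ARCHIVES_PREF.search(value)
        if match is None:
            continue
        try:
            parsed = json.loads(base64.b64decode(match.group(1)))
        except ValueError:
            continue  # bad base64 or bad JSON: ignore this preference
        if isinstance(parsed, list):
            return parsed
    return list(default_archives)


DEFAULTS = [{"id": "ia"}, {"id": "uk"}]
client_set = [{"id": "alice"}]
encoded = base64.b64encode(json.dumps(client_set).encode("utf-8")).decode("ascii")
prefer_headers = [
    'damage="<0.4"',
    f'archives="data:application/json;charset=utf-8;base64,{encoded}"',
]
session_archives = archives_for_session(prefer_headers, DEFAULTS)
```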
  • 148. PWAA Implementation ● We created a web-based GUI to assist in initialization ● Handles interaction from client to private Web archive and requests from aggregators (MMAD4). Section 8.3.2.
  • 149. Preference from Client’s Perspective
      HTTP Request:
        GET /tm/cdxj/https://cnn.com HTTP/1.1
        Host: mma.example.com
        Prefer: damage="<0.4"
        Prefer: archives="data:application/json;charset=utf-8;base64,Ww0KICB7...NCn0="
        Prefer: publicOnly
      HTTP Response:
        HTTP/1.1 200 OK
        Content-Length: 1207
        Content-Type: application/cdxj+ors
        Preference-Applied: damage="<0.4"
        (CDXJ TimeMap body)
  • 150. StarGate Implementation (Section 8.3.3). HTTP Request:
      > GET /https://cnn.com HTTP/1.1
      > Host: mma.alice.org
      > Prefer: damage="<0.4"
      >
  • 151. StarGate Implementation (Section 8.3.3).
      > GET /timemap/https://cnn.com
      > Host: web.archive.org
      >
      > GET /tm/https://cnn.com
      > Host: webarchive.org.uk/tm/
      >
      > GET /timemap/https://cnn.com
      > Host: archive.md
      >
  • 152. StarGate Implementation (Section 8.3.3). Each of the three archives responds:
      HTTP/1.1 200 OK
      Content-Type: application/link-format
      (TimeMap)
  • 153. StarGate Implementation (Section 8.3.3). HTTP Request:
      POST / HTTP/1.1
      Host: stargate.alice.org
      Content-Length: 1500
      Content-Type: application/cdxj+ors
      Prefer: damage="<0.4"
      (CDXJ TimeMap body)
  • 154. StarGate Implementation (Section 8.3.3). HTTP Request, for each URI-M in TimeMap:
      GET /api/{URI-M}
      Host: memento-damage.cs.odu.edu
      Content-Type: application/cdxj+ors
      Long-running processes can be executed asynchronously and use the “respond-async” syntax per RFC 7240 §4.1 (not shown here for simplicity).
  • 155. StarGate Implementation (Section 8.3.3). HTTP Response, for each URI-M in TimeMap:
      HTTP/1.1 200 OK
      Content-Type: application/json
      (damage metadata)
  • 156. StarGate Implementation (Section 8.3.3). Parse damage from JSON returned from API, associate with URI-M.
  • 157. StarGate Implementation (Section 8.3.3). HTTP Response (smaller size, indicative of filtering):
      HTTP/1.1 200 OK
      Content-Length: 1207
      Content-Type: application/cdxj+ors
      Preference-Applied: damage="<0.4"
      (CDXJ TimeMap body)
  • 158. StarGate Implementation (Section 8.3.3). The same HTTP response as on slide 157 addresses MMAD5: ⇄ StarGate for negotiation in dimensions beyond time.
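The filtering step in this walkthrough, turning `Prefer: damage="<0.4"` into a smaller TimeMap plus a `Preference-Applied` header, can be sketched as below. The function names and the supported comparison operators are assumptions made for illustration; only the `damage="<0.4"` form itself appears on the slides.

```python
import operator
import re

_COMPARATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}
_DAMAGE_PREF = re.compile(r'damage="(<=|>=|<|>)\s*([0-9]*\.?[0-9]+)"')


def damage_predicate(prefer_value):
    """Turn a value such as damage="<0.4" into a predicate and its matched text."""
    match = _DAMAGE_PREF.search(prefer_value)
    if match is None:
        return None, None
    compare = _COMPARATORS[match.group(1)]
    threshold = float(match.group(2))
    return (lambda damage: compare(damage, threshold)), match.group(0)


def apply_damage_preference(damage_by_urim, prefer_value):
    """Filter URI-Ms by damage and report which preference was applied."""
    keep, applied = damage_predicate(prefer_value)
    if keep is None:
        return list(damage_by_urim), {}
    kept = [uri_m for uri_m, damage in damage_by_urim.items() if keep(damage)]
    return kept, {"Preference-Applied": applied}


# Hypothetical damage scores already associated with each URI-M (slide 156).
scores = {"m1": 0.1, "m2": 0.4, "m3": 0.9}
kept_urims, response_headers = apply_damage_preference(scores, 'damage="<0.4"')
```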
  • 159. Mink as a Client-Side Memento Meta-Aggregator. Figure 94, Chapter 8.
  • 160. Archival Specification and Query Precedence in Mink. Figure 95, Chapter 8.
  • 161. Outline ● Introduction/Motivation ● Background ● Related Work ● The Mementity Framework ● Applying Framework to Scenarios ● Contributions, Conclusions, and Future Work
  • 162. RQ1: What sort of content is difficult to capture and replay for preservation from the perspective of a Web browser? ● JavaScript is largely the culprit ● Content behind authentication is neglected for preservation ● We created a novel approach to preserving content behind authentication using browser APIs (WARCreate); previously inaccessible content is preserved to WARC. ● Studies on challenges in preservation: ○ Change in Archivability over Time (TPDL 2013) ○ Archival Acid Test (JCDL 2014) ○ Measuring the Impact of Missing Resources (JCDL 2014, IJDL 2015) ○ Impact of JS on Archivability (IJDL 2016) ● Tools to mitigate: ○ WARCreate (JCDL 2012) ○ ArchiveNow (JCDL 2018) ○ Replay Banners (JCDL 2018). Chapter 9.
  • 163. RQ2: How do Web browser APIs compare in potential functionality to the capabilities of archival crawlers? Non-browser-based preservation tools miss representations that use the latest Web technologies. ● Browser preservation vs. crawlers: ○ Identifying Personalized Representation in the Archive (D-Lib 2013) ○ Archival Acid Test (JCDL 2014) ○ Service Workers for Client-side Replay (JCDL 2017) ○ Extensible Replay Banners (JCDL 2018) ● Tools to mitigate: ○ WARCreate (JCDL 2012) ○ Mink (JCDL 2014) ○ Mobile Mink (JCDL 2015) ○ WAIL (JCDL 2017, PDA 2013). Chapter 9.
  • 164. RQ3: What issues exist for capturing and replaying content behind authentication? URI clash issue: something is needed to distinguish private captures on replay. The key in preservation is to rely on the native tools used to view the Web. ● Studies on Replay: ○ Impact of URI Canonicalization (JCDL 2017) ○ Framework for Aggregating Private & Public Web Archives (JCDL 2018). Chapter 9. Figure 10, Chapter 1.
  • 165. RQ4: How can content that was captured behind authentication signal to Web archive replay systems that it requires special handling? ● Utilization of CDXJ, enrichment of TimeMaps to StarMaps with access attributes. ● We introduced a standard, extensible mechanism for regulating access to private Web archives using the PWAA abstraction. ● Studies on auth+replay: ○ InterPlanetary Wayback (JCDL 2016, TPDL 2016) ○ Framework for Aggregating Private & Public Web Archives (JCDL 2018). Chapter 9. Example CDXJ record (Figure 64, Chapter 6):
      19981212013921 {
        "uri": "http://localhost:8080/20101116060516/http://facebook.com/",
        ...
        "access": {
          "type": "Blake2b",
          "token": "c6ed419e74907d220c69858614d86...ef0a3a88a41"
        }
      }
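A CDXJ line carrying an access attribute like the one on this slide could be assembled as shown here. How the framework actually derives the token is not stated on the slide, so hashing a secret with Python's `hashlib.blake2b` is purely an assumption for illustration, as are the function names and the sample secret.

```python
import hashlib
import json


def access_attribute(secret: bytes) -> dict:
    """Build an access attribute whose token is a BLAKE2b hex digest of a secret."""
    token = hashlib.blake2b(secret).hexdigest()
    return {"type": "Blake2b", "token": token}


def cdxj_line(datetime14: str, uri_m: str, secret: bytes) -> str:
    """Serialize one CDXJ record: a 14-digit datetime followed by a JSON object."""
    fields = {"uri": uri_m, "access": access_attribute(secret)}
    return f"{datetime14} {json.dumps(fields)}"


# Illustrative values; the secret is a placeholder, not a real credential.
line = cdxj_line(
    "19981212013921",
    "http://localhost:8080/20101116060516/http://facebook.com/",
    b"example-shared-secret",
)
```

A replay system or aggregator reading such a line can check for the presence of the `access` key to decide that the capture needs special handling before dereferencing it.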
  • 166. RQ5: How can Memento aggregators indicate that private Web archive content requires special handling to be replayed, despite being aggregated with publicly available Web archive content? The aforementioned access attributes, plus interoperability with the PWAA for acquiring and representing tokens in StarMaps. ● Aggregation + Private Archives: ○ Mink (JCDL 2014) ○ Framework for Aggregating Private & Public Web Archives (JCDL 2018). Chapter 9.
  • 167. RQ6: What kinds of access control do users who create private Web archives need to regulate access to their archives? ● We integrated IPFS with the WARC standard via InterPlanetary Wayback to allow for client-side archival replay without needing to possess the archival holdings. ● We enabled symmetric encryption of personal archival holdings within InterPlanetary Wayback. ● Studies on Access Control + private archives: ○ InterPlanetary Wayback (JCDL 2016, TPDL 2016) ○ Framework for Aggregating Private & Public Web Archives (JCDL 2018). Chapter 9.
  • 168. Future Work ● Prefer with archives instead of a more generic config keyword ○ Potential name clash, at the benefit of extensibility ● More elegant/semantic/syntactic expressions of negotiation attributes ○ Prefer: damage="<0.5" is limited by RFC/HTTP syntax ● Public-private inter-archive leakage ● Inference of private-public based on holdings, beyond boolean classification ● Further investigations of private archives’ contents, which are inherently inaccessible ○ Which the Mementity Framework mitigates. Section 9.2.
  • 169. Conclusions: Content behind authentication on the live Web can now be preserved and systematically replayed. Section 9.3.
  • 170. Hierarchy of Aggregated Public-Private Archives
  • 171. Conclusions: Collective Experience of the Past Web Can Be Aggregated. Section 9.3.
  • 172. Conclusions: Access to Private Archives Directly or Through Sharing Can Be Systematic. Section 9.3. Figure 81, Chapter 7.
  • 173. Conclusions: Enabled Client-Side Aggregation and Archival Supplementation. Figures 94 and 95, Chapter 8. Section 9.3.
  • 174. Research Dissemination ● 17 publications that address 6 RQs ● 5 Best Paper/Poster awards/nominations ● 28 research blog posts for ODU WS-DL ● Collaborations with 8 different institutions (incl. ODU) ● 4 first-author, open-source, publicly released software packages (collaborated on 3 others from ODU). Aggregating Private and Public Web Archives Using the Mementity Framework

Editor's Notes

  1. MLNTODO: Put more emphasis on going from 1 to 3 archives
  2. A lot going on here. Do we want to split it and/or discuss content and derived attributes separately?