Aggregating Private and Public Web Archives Using the Mementity Framework

Aggregating Private and Public Web Archives
Using the Mementity Framework
PhD Dissertation Defense for:
Mat Kelly
Advisor:
Michele C. Weigle
Committee Members:
Michele C. Weigle, Michael L. Nelson, Danella Zhao,
Justin F. Brunelle, and Dimitrie C. Popescu
Department of Computer Science
Norfolk, Virginia 23529 USA
May 7, 2019

PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
The Web
2

May 7, 2019
The Web is Ephemeral
3

May 7, 2019
4

May 7, 2019
5

May 7, 2019
6

@machawk1 • @WebSciDL 7
Web Archives to the Rescue: Typical Access
1. Go to archive.org in your
browser
2. Enter the URL you want to see
in the past in the form field
3. Submit your query

Web Archives to the Rescue: Typical Access
4. Locate the archived
Web page on the calendar
or histogram view
5. Select the year/selection
for the day
6. Repeat until you find
the closest date and time

May 7, 2019
7. Finally, View the Archived Web Page
9

May 7, 2019
Web Archiving - Live Web odu.edu
10
Now

May 7, 2019
Web Archiving: View The Web of the Past
11
May 4, 2019Now

May 7, 2019
12
August 2012
(when I started my PhD)
Now

May 7, 2019
13
August 2010
(when I started at ODU)
Now

May 7, 2019
14
June 2006
(when I finished my BS)
Now

May 7, 2019
15
August 2001
(when I started college)
Now

May 7, 2019
Web Archiving
16
2019
2012
2010
2006
2001
Associate a live Web URI
with their archived representations

May 7, 2019
Web Archives Provide Access to the Web that was
17
What did odu.edu look like
in the past?

May 7, 2019
Multiple Archival Efforts
18from Alam et al. IJDL 2016.
https://git.io/archives

May 7, 2019
Consulting More Archives Produces a More
Comprehensive Picture
19

May 7, 2019
Even Then, Not Everything is Preserved
20
What did obscuresite.com
look like in the past?
0 mementos for
obscuresite.com

May 7, 2019
User Sees on Live Web May Not Be What is Captured
21
What did facebook.com look
like in the past?
Figure 10b, Chapter 1
Figure 10a, Chapter 1

May 7, 2019
...And Oftentimes That is For the Best
22
Have you preserved my
online banking?
Figure 8, Chapter 1

May 7, 2019
Private Live Web Content Prescriptively
Ephemeral
Figures 9a & 9b, Chapter 1
23

May 7, 2019
Other Times, We May Want Our Content Archived
24
Have you archived my
daughter’s pictures on
Google Photos?
0 mementos for that URI

May 7, 2019
...Especially When It Has Disappeared
25
0 mementos for that URI
!
Have you archived my
daughter’s pictures on
Google Photos!?!?
“It was there yesterday, where did it go?”

May 7, 2019
“Save this, but only for me.”
● Screenshots of Web pages are insufficient
○ Not interactive/representative, do not integrate, lose context otherwise provided in metadata
● Large-scale archives’ tools are open source
● Individuals can archive, but there are still technical barriers
26
at tA

May 7, 2019
Individuals, Too, Can Archive The Web
.com
27
“I want to share this but control who can see it.”

May 7, 2019
Archives from Institutional and Personal Sources
at tA at tC at tD→Z
28

May 7, 2019
Today’s Memento Aggregation
Archives Queried (A0 )
29

May 7, 2019
Today’s Memento Aggregation
30

May 7, 2019
Desire: Include Personal Archives
31

May 7, 2019
Desire: Include Other Non-
Aggregated Archives
32

May 7, 2019
Rapidly Changing Pages May Not Be Comprehensively
Captured
33

May 7, 2019
More Archives Provides a Better Picture of the Web
34
Before aggregating, we must first consider the ramifications,
enumerated in the RESEARCH QUESTIONS:

May 7, 2019
RQ1: What sort of content is difficult to capture and replay for preservation from the perspective
of a Web browser?
RQ2: How do Web browser APIs compare in potential functionality to the capabilities of archival
crawlers?
RQ3: What issues exist for capturing and replaying content behind authentication?
RQ4: How can content that was captured behind authentication signal to Web archive replay
systems that it requires special handling?
RQ5: How can Memento aggregators indicate that private Web archive content requires special
handling to be replayed, despite being aggregated with publicly available Web archive content?
RQ6: What kinds of access control do users who create private Web archives need to regulate
access to their archives?
Research Questions
35

May 7, 2019
of a Web browser?
crawlers?
Research Questions: RQ1
36

May 7, 2019
of a Web browser?
crawlers?
37

May 7, 2019
of a Web browser?
crawlers?
38

May 7, 2019
of a Web browser?
crawlers?
39

May 7, 2019
of a Web browser?
crawlers?
40

May 7, 2019
of a Web browser?
crawlers?
41
To answer these questions, we first have to be familiar with
BACKGROUND INFORMATION

May 7, 2019
Outline
● Introduction/Motivation
● Background
● Related Work
● The Mementity Framework
● Applying Framework to Scenarios
● Contributions, Conclusions, and Future Work
42

May 7, 2019
Hypertext Transfer Protocol (HTTP)
HTTP Request
> GET /~mkelly/dissertation.pdf HTTP/1.1
> Host: www.cs.odu.edu
> Connection: close
> request headers
43

May 7, 2019
Hypertext Transfer Protocol (HTTP)
HTTP Response
HTTP/1.1 200 OK
Server: nginx
Date: Sun, 05 May 2019 18:31:31 GMT
Content-Type: application/pdf
Content-Length: 40067283
%PDF-1.5
...
response headers
(metadata)
start of entity body
44

Background: Memento
Adapted from Memento Guide: Introduction. http://www.mementoweb.org/guide/quick-intro/, January 2015, in Public Domain
Figure 21, Chapter 2
45
* H. Van de Sompel et al. HTTP Framework for Time-Based Access to Resource States – Memento. IETF RFC 7089,
December 2013.

May 7, 2019
Background: Memento Request Example
• Accept-Datetime: Wed, 02 Aug 2017 23:15:00 GMT
• GET: http://web.archive.org/web/http://www.cnn.com
Request cnn.com at
Aug 2, 2017 at 6:15pm EST
URI-G
G
HTTP Request
46

May 7, 2019
Background: Memento Request Example
• Memento-Datetime: Wed, 02 Aug 2017 23:18:04 GMT
• Location: http://web.archive.org/web/20170802231804/http://www.cnn.com/
• Link:
• Accept-Datetime: Wed, 02 Aug 2017 23:15:00 GMT
• GET: http://web.archive.org/web/http://www.cnn.com
URI-G
G
mementotimegateoriginaltimemap
HTTP Request
HTTP Response (302)
URI-T
T
URI-R
R
URI-G
G
URI-M
M
47
Request cnn.com at
Aug 2, 2017 at 6:15pm EST

May 7, 2019
Background: Dereferencing a TimeMap at URI-T
Request URI-T
● Date-based pagination
● Other formats for TimeMap
first
memento
last
mementomemento
......
......
timemap timemap timemap
TimeMapURI-T
T
URI-T
T1
URI-T
T
URI-T
Tt
URI-M
1
URI-M
m
URI-M
n
URI-G
G
URI-R
R
48

Background - Privacy and Security
● Web users question trusting
institutions to preserve
private Web contents1
● OAuth 2.02 facilitates
authentication cohesion of entities
1 Marshall and Shipman, “On the Institutional Archiving of Social Media”, JCDL 2012
2 D. Hardt. The OAuth 2.0 Authorization Framework. IETF RFC 6749, October 2012. 49
RQ3: What issues exist for capturing and replaying
content behind authentication?
RQ6: What kinds of access control do users who
create private Web archives need to regulate

May 7, 2019
A familiar paradigm used for authentication on the live Web
Role-based Delegation and Authentication
50
1
2
3
RQ3: content behind auth
RQ6: access control
4
5

May 7, 2019
Private vs. Public Web archives
Both personal and institutional Web archiving can be either public are private.
Table I, Chapter 1
[49] Brunelle et al., “Leveraging Heritrix and the Wayback Machine on a Corporate Intranet…”, D-Lib Magazine, 22(1/2), 2016
51

May 7, 2019
Private vs. Public Web archives
Both personal and institutional Web archiving can be either public are private.
Table I, Chapter 1
[49] Brunelle et al., “Leveraging Heritrix and the Wayback Machine on a Corporate Intranet…”, D-Lib Magazine, 22(1/2), 2016
OUR FOCUS
52

May 7, 2019
HTTP Prefer
● HTTP negotiation already available via Accept-* headers
● Prefer syntax provide mechanism for client to specify preferences
○ ...with which servers may or may not comply
53
* J. Snell. Prefer Header for HTTP. IETF RFC 7240, June 2014.
Prefer: foo; bar="baz"
Preference Applied: bar="baz"
RQ4: How can content that was captured behind authentication
signal to Web archive replay systems that it requires special handling?

May 7, 2019
RQ5: How can Memento aggregators indicate that private Web archive
content requires special handling to be replayed, despite being aggregated
with publicly available Web archive content?
Memento Aggregation* State of the Art
54
* Querying multiple Web archives for a URI from a single
endpoint at a Memento aggregator
Time Travel Service MemGator

May 7, 2019
Memento Aggregation - MementoWeb
55
Also available via CLI:
$ curl http://timetravel.mementoweb.org/timemap/link/http://odu.edu
Figures 25a and 25b, Chapter 2

May 7, 2019
● Open Source Memento Aggregator - github.com/oduwsdl/memgator
● Easy personal/local deployment
● Specify archive list on launch
○ Easily configurable JSON →
○ Use default collection if not specified
● TimeMap Formats:
○ Link
○ JSON
○ CDXJ
Memento Aggregation - MemGator
56
* Alam and Nelson, “MemGator - A Portable Concurrent Memento Aggregator: Cross-Platform CLI and Server
Binaries in Go”, JCDL 2016

...
<http://localhost:8080/20101116060516/http://facebook.com/>; rel="memento";
datetime="Tue, 16 Nov 2010 06:05:16 GMT",
...
...
20101116060516 {
"uri": "http://localhost:8080/20101116060516/http://facebook.com/",
"rel": "memento",
"datetime": "Tue, 16 Nov 2010 06:05:16 GMT",
}
...
57
CDXJ: An Alternative TimeMap Format
Memento entry in Link (RFC 7089) TimeMap
Memento entry in CDXJ TimeMap
Memento (URI-M)
See Alam, “CDXJ: An Object Resource Stream Serialization Format”, 2015
Relative Relations Memento-Datetime

● Browser extension we developed*
● Bridges gap between live and
archived Webs
● Leverages Memento aggregator’s
capability, returns TimeMaps
● Indicates # of captures for a
URI while you browse
● Provides navigation of mementos
while browsing live Web
● Single-click submission of URI-R to
multiple Web archives
: Visual User Interaction with Aggregators
58
* Kelly et al., “Mink: Integrating the Live and Archived Web Viewing Experience Using Web Browsers and Memento”, JCDL 2014

May 7, 2019
Outline
● Background
● Related Work
59

May 7, 2019
Related Work
Memento Aggregation (RQ 5):
● Bornand et al. (JCDL 2016) - routing using classifier
● Nelson and Alam (JCDL 2016) - routing using probability
● Rosenthal et al. (DLib 2015) - aggregation w/ restricted access
● AlSum et al. (IJDL 2014) - routing URI lookups
● Brunelle and Nelson (JCDL 2016) - TImeMap caching
● Alam et al. (IJDL 2016) - efficient querying through profiling
● Rosenthal (2013) - aggregator scalability
Prefer and Web Archives (RQ 5):
● Jones et al. (2016) - introduced “raw” mementos
● Van de Sompel et al. (2016) - Prefer usage for raw
● Rosenthal (2016) - other Prefer usage in web archives
Privacy in Web Archives (RQ 3-6):
● Marshall and Shipman (JCDL 2014) - Preserving FB, social
media content
● Lindley et al. (WWW 2013) - Users archiving social media
Access Control in Web Archives (RQ 4-6):
● Webber (2018) - HTTP 451 @ UKWA
● Rao et al. (SIGOPS 2001) iProxy - access control in web archives
Collaboration in Web Archives (RQ 5):
● Potts (UXPA 2015) - creating collections with tags
● Stine (2018) Cobweb - collection building between institutions at
Archive-It.
● Phillips (2003) PANDORA dynamics
● Niu (DLib 2012) - personalized features at PANDORA
Public Web Archiving (RQ 2):
● Brunelle et al. (JCDL 2017) - JS & deferred representations
● Ben-David and Huurdeman (Alexandria 2014) - archives
retrieval beyond URI
● AlNoamany et al. (JCDL 2013) Access Patterns
Private Web Archiving (RQ 1-4):
● Brunelle et al. (DLib 2016) - archiving corporate intranet
● Marshall and Shipman (JCDL 2012) - Institutions archiving
social media
60

May 7, 2019
Related Work (RQ1)
media content
Archive-It.
Public Web Archive Access (RQ 2):
social media
RQ1: What sort of content is difficult to capture and
replay for preservation from the perspective of a Web
browser?
61

May 7, 2019
Related Work (RQ2)
media content
Archive-It.
social media
RQ2: How do Web browser APIs compare in potential
functionality to the capabilities of archival crawlers?
62

May 7, 2019
Related Work (RQ3)
media content
Archive-It.
social media
RQ3: What issues exist for capturing and replaying
content behind authentication?
63

May 7, 2019
Related Work (RQ4)
media content
Archive-It.
social media
RQ4: How can content that was captured behind
authentication signal to Web archive replay systems
that it requires special handling?
64

May 7, 2019
Related Work (RQ5)
media content
Archive-It.
social media
RQ5: How can Memento aggregators indicate that
private Web archive content requires special handling
to be replayed, despite being aggregated with publicly
available Web archive content?
65

May 7, 2019
Related Work (RQ6)
media content
Archive-It.
social media
RQ6: What kinds of access control do users who create
private Web archives need to regulate access to their
archives?
66

May 7, 2019
Outline
● Background
● Related Work
○ Archival Negotiation Beyond Time
○ Query Precedence & Short-Circuiting
○ Mementities
67

May 7, 2019
Outline
● Background
● Related Work
○ Mementities
68

May 7, 2019
More Expressive TimeMaps
● Memento Quality (e.g., Damage)1
● How Many Captures?2
● How Many Are Identical?2,3
● Other Attributes of Mementos...
69
1 Brunelle, Kelly et al., JCDL 2014, IJDL 2015
2 Kelly et al., JCDL 2017
3 AlSum and Nelson, ECIR 2014
Figure 7, Chapter 1
Section 6.2

May 7, 2019
Additional TimeMap Attributes
Content-based Attributes
Derived Attributes
Access Attributes
70

May 7, 2019
TimeMap Enrichment: Content-Based Attributes
● Status Code1
● Content-Digest
○ In WARC & CDX
○ Not all archives expose CDX
● Would allow more info about mementos without
requiring comprehensive dereferencing
71
1 Kelly et al., “Impact of URI Canonicalization on Memento Count”, JCDL 2017, arXiv 1703.03302
Section 6.2.1

May 7, 2019
● Thumbnails (e.g., via SimHash)1
○ Calculation based on root memento’s HTML
● Memento Damage (JCDL 2014, IJDL)2
○ Requires dereferencing embedded resources
TimeMap Enrichment: Derived Attributes
72
1 AlSum and Nelson, Thumbnail Summarization Techniques for Web Archives, ECIR 2014, pp. 299-310.
2 Brunelle, Kelly et al., “The Impact of JavaScript on Archivability,” IJDL, 17(2), pp. 95-117. January 2016.
apple.com, many duplicate mementos!
Section 6.2.2

How to distinguish
Private captures
from
Public captures
in a TimeMap?
TimeMap Enrichment: Access Attributes
first
memento mementomemento
......
......
timemap timemap timemap
TimeMapURI-T
T
URI-T
T1
URI-T
T
URI-T
Tt
URI-M
1
URI-M
m
URI-M
n
URI-G
G
URI-R
R
URI-M
m+1
last
memento
...
RQ5: How can Memento aggregators
indicate that private Web archive content
requires special handling to be replayed,
despite being aggregated with publicly
Section 6.2.3

May 7, 2019
TimeMap Enrichment - in a CDXJ TimeMap
74
20101116060516 {
"rel": "memento",
"datetime": "Tue, 16 Nov 2010 06:05:16 GMT",
"status_code": 200,
"digest": "sha1:LK26DRRQJ4WATC6LBVF3B3Z4P2CP5ZZ7",
"damage": 0.24,
"simhash": "6551110622422153488",
"content-language": "en-US",
"access": {
"type": "Blake2b",
"token": "c6ed419e74907d220c69858614d86...ef0a3a88a41"
}
}
Line breaks added for clarity, CDXJ records occupy a single line
Content-based
attributes
Derived
Attributes
Access
Attributes

TimeMap
Enrichment with Additional Attributes
75
“StarMap”

May 7, 2019
Outline
● Background
● Related Work
○ Mementities
76

Query Precedence
77
“Check my archive first, then Carol’s, then all
public archives.”
1
2
3 3
● More control of querying in series and parallel
See Atkins’, “Paywalls in the Internet Archive”, March 2018

May 7, 2019
Precedence in Archival Specifications
● MemGator’s internal archival specification
N1: [{A0},{A1},{A2}]
N2: {{A0},{A1},{A2}}
N3: {{A0},{A4,A2,A5},{A3}}
N4: {"id0":{A0},"id1":{A4},"id2":{A3}}
N5: [{"id0"∶{A0}},{"id1"∶{A1}},{"id2"∶ {A2}}]
N6: [{"id0"∶{A0}},
{"id1a"∶{A4},"id1b"∶{A2},"id1c"∶{A5}},
{"id2"∶ {A3}]
● Despite JSON array, archives queried in “pseudo-parallel” for efficiency
78
1
1
1

May 7, 2019
N1: [{A0},{A1},{A2}]
N2: {{A0},{A1},{A2}}
N3: {{A0},{A4,A2,A5},{A3}}
N4: {"id0":{A0},"id1":{A4},"id2":{A3}}
N5: [{"id0"∶{A0}},{"id1"∶{A1}},{"id2"∶{A2}}]
N6: [{"id0"∶{A0}},
{"id1a"∶{A4},"id1b"∶{A2},"id1c"∶{A5}},
{"id2"∶ {A3}]
● N2 more descriptive of behavior despite notation interpretation
● JSON object members require identifiers
79
1
1
1

May 7, 2019
N1: [{A0},{A1},{A2}]
N2: {{A0},{A1},{A2}}
N3: {{A0},{A4,A2,A5},{A3}}
N4: {"id0":{A0},"id1":{A4},"id2":{A3}}
N5: [{"id0"∶{A0}},{"id1"∶{A1}},{"id2"∶{A2}}]
N6: [{"id0"∶{A0}},
{"id1a"∶{A4},"id1b"∶{A2},"id1c"∶{A5}},
{"id2"∶ {A3}]
● Can introduce “sets” of URIs to archive, but still query in pseudo-parallel
● JSON object members require identifiers
80
1
1
1
A4
A5
1
1

May 7, 2019
N1: [{A0},{A1},{A2}]
N2: {{A0},{A1},{A2}}
N3: {{A0},{A4,A2,A5},{A3}}
N4: {"id0":{A0},"id1":{A4},"id2":{A3}}
N5: [{"id0"∶{A0}},{"id1"∶{A1}},{"id2"∶{A2}}]
N6: [{"id0"∶{A0}},
{"id1a"∶{A4},"id1b"∶{A2},"id1c"∶{A5}},
{"id2"∶ {A3}]
● Add identifiers for JSON compliance
● Still pseudo-parallel
81
1
1
1
id0
id1
id2

May 7, 2019
N1: [{A0},{A1},{A2}]
N2: {{A0},{A1},{A2}}
N3: {{A0},{A4,A2,A5},{A3}}
N4: {"id0":{A0},"id1":{A4},"id2":{A3}}
N5: [{"id0"∶{A0}},{"id1"∶{A1}},{"id2"∶{A2}}]
N6: [{"id0"∶{A0}},
{"id1a"∶{A4},"id1b"∶{A2},"id1c"∶{A5}},
{"id2"∶ {A3}]
● Query archives in-series
82
3
2
1
id0
id1
id2

May 7, 2019
N1: [{A0},{A1},{A2}]
N2: {{A0},{A1},{A2}}
N3: {{A0},{A4,A2,A5},{A3}}
N4: {"id0":{A0},"id1":{A4},"id2":{A3}}
N5: [{"id0"∶{A0}},{"id1"∶{A1}},{"id2"∶{A2}}]
N6: [{"id0"∶{A0}},
{"id1a"∶{A4},"id1b"∶{A2},"id1c"∶{A5}},
{"id2"∶ {A3}]
● Query archival sets in-series
83
3
2
1
A4
A5
2
2

May 7, 2019
Precedence Using Prefer
[{"id0"∶{A0}},
{"id1a"∶{A4},"id1b"∶{A2},"id1c"∶{A5}},
{"id2"∶ {A3}]
Prefer:archives="data:application/json;c
harset=utf-8;base64,Ww0KICB7...NCn0="
ENCODES TO
(no aggregator prior to this dissertation supports this)
84

Query Precedence
85
public archives.”
1

Query Precedence
86
public archives.”
1
2

Query Precedence
87
public archives.”
1
2
3 3

Query Short-Circuiting
88
“Check private archives first. Iff you find no
captures, only then check the public archives.
1
1
2 2
● May give priority to archive preference
● Series halt when threshold met.

Query Short-Circuiting
89
“Check private archives first. Iff you find no
captures, only then check the public archives.
1
1
2 2
● May give priority to archive preference.
● Series halt when threshold met.

May 7, 2019
Outline
● Background
● Related Work
○ Mementities
90

May 7, 2019
Mementities
● Memento + Entity (entity term already overused)
91
Time
Gate
Introduced in this Framework Conventional Memento
Mementities

May 7, 2019
MEMENTITY FRAMEWORK
Mementities
Section 7.1.1

May 7, 2019
Memento Meta-Aggregator (MMA)
93
functional
⊇
Section 7.3.3

May 7, 2019
MMA: Archive Selection
94
Section 6.1

May 7, 2019
MMA: User-Driven Archival Specification
95
✓ ✓
Section 6.1

May 7, 2019
MMAα:
from MA2, MA1 and WA6
MMAβ:
from WA7 and WA8
MMAγ:
from MMAβ, MA5, and WA1
96
MMA Aggregation sources

May 7, 2019
Costs of more expressive TimeMaps (StarMaps)
and Link header enrichment
● Computational
○ Mostly server-side, potential to further leverage client
● Temporal
○ Required on variant generation
● Spatial
○ Permutation variant storage
● Access
○ Variant negotiation
97

May 7, 2019
Temporal Costs: Conventional Aggregation
T = treq + tMA + tAGG+ tresp
tMA=max(RTT(Ai ,URI-R))
Client→Aggregator
request
Aggregation⇄∀archives
communication
Aggregator combining,
sorting, etc.
Aggregator→Client
request
Equation 9, Chapter 8
Equation 10, Chapter 8
98

May 7, 2019
Impact of MemGator→MMA
● Additional capability of MMA beyond MemGator adds additional
temporal complexity:
○ tprefer → time and computation required to modify base set of archives and associate
with client’s session
○ tenrich → time required for external services to parse/calculate additional attributes
per URI-M
○ tfilter→ time required to filter URI-Ms based on client’s preference (potentially 0)
Section 8.2
99

May 7, 2019
● Personal Archive Aggregation
● MMA Chaining
● Client-Side Aggregation Preference
MMA Dynamics By-Example
100
ALICE
BOB
CAROL
MALCOLM

Personal Archive Aggregation: Alice Saves the Web
101
Personal Archive Aggregation

Alice Wants to See Her Captures Temporally Inline
102
at tA
at tD→Z
Section 7.3.5

Mementity Dynamics - Alice & Her Archives (WAA)
103
Section 7.3.5

→{ }
Alice Deploys MMAA
104
Section 7.3.5

May 7, 2019
Carol Asks MMAA for CNN
105
→{ }
MMA Chaining
Section 7.3.5

May 7, 2019
MMAA Returns CNN Memento {MA, MIA}
106
→{ }→{ }
MMA Chaining
Section 7.3.5

May 7, 2019
Carol Wants to Aggregate Her Own Captures
107
→{ }
at
tC
at tD→Z,A
→{ }
(M(WAC))
MMA Chaining
Section 7.3.5

May 7, 2019
Carol Creates MMAC to Access WAC and MMAA
108
→{ }→{ }
→{ }
MMA Chaining
Section 7.3.5

Carol Asks MMAC For CNN
109
→{ }→{ }
→{ }
MMA Chaining
Section 7.3.5

MMAA Returns CNN Memento {MA, MIA,MC}
110
→{ }→{ }
→{ }
Section 7.3.5

Client-Side Aggregation Preference
Bob May Request M(CNN) From MMAA or
MMAC
111
→{ }→{ }
→{ }
Section 6.1

May 7, 2019
Bob Prefers to Exclude IA Captures
112
✓ ✓
...and does not want to setup his own MMA
Section 6.1

GET /archives/
Bob Requests Supported Archives
113
→{ }
Section 6.1

Bob Customizes the Set in the JSON
114
→{ }
✓ ✓
Section 6.1

Bob Requests CNN for His Custom Set
115
→{ }
( )
Section 6.1

MMA Complies or Ignores Preference
116
→{ }
→{ }
✓
Section 6.1

May 7, 2019
Hooray, Aggregation!
117
!

May 7, 2019
MEMENTITY FRAMEWORK
Mementities
Section 7.1.2

May 7, 2019
Hooray, Aggregation!
119
HTTP 401
Consult URI-P
✓
RQ4: How can content that was captured behind authentication signal to
Web archive replay systems that it requires special handling?
Section 7.3.4

May 7, 2019
Private Web Archive Adapter (PWAA)
● Auth Layer for to encourage Private
Web archive aggregation
● Typical OAuth 2.0 flow
● Auth role cohesive to PWAA
● Persistent access through tokenization
120
RQ6: What kinds of access control do users who create
private Web archives need to regulate access to their
archives?

May 7, 2019
PWAA - Sharing Tokens
121
RQ6: What kinds of access control do users
who create private Web archives need to
regulate access to their archives?
Section 7.3.4

May 7, 2019
PWAA - Previously Authorized
122Figure 81a, Chapter 7
Section 7.3.4

May 7, 2019
PWAA - Unauthorized Request
123Figure 81a, Chapter 7
Section 7.3.4

May 7, 2019
PWAA - Sharing Tokens
124
RQ6: What kinds of access control do users
who create private Web archives need to
regulate access to their archives?
Section 7.3.4

May 7, 2019
Alice Passes Associative Token to MMA
125
Section 7.3.4

May 7, 2019
MMA requests URI-R...
126
...relays token where applicable
Section 7.3.4

May 7, 2019
Private Archive Validates with PWAA
127Figure 84c, Chapter 7
Section 7.3.4

May 7, 2019
PWAA Confirms Token
128Figure 84d, Chapter 7
Section 7.3.4

May 7, 2019
Private Archive Returns Captures
129Figure 84e, Chapter 7
Section 7.3.4

May 7, 2019
MMA Aggregates, Associates Token
130Figure 84f, Chapter 7
Section 7.3.4

May 7, 2019
MMA Aggregates, Associates Token
131
Section 7.3.4
We still need a way to
distinguish private archives
from public archives
when querying an aggregator

May 7, 2019
MEMENTITY FRAMEWORK
Mementities

May 7, 2019
⊆StarGate
● Content negotiation in Web archives beyond time
● “Star” ~ wildcard (*) → any dimension of negotiation
● Allow for queries like: Only show me mementos…
○ That are not redirects (content-based attribute HTTP Status ≠ 3XX)
○ Of a sufficient quality (derived attribute Memento Damage < 0.4)
○ Are from personal Web archives (access attribute indicate Alice’s
facebook.com mementos are not login pages)
133
Time
Gate
functio
nal
Section 7.1.3

Implicit Filtering via MMA or Directly
134
→{ }
→{ }
✓
filtering
filtering
Section 7.2.1

May 7, 2019
Negotiation in the Privacy Dimension
(via short circuiting)
135
Get URI-Ms for URI-R only from
personal Web archives
privateOnly
1
4
2
3
ACCESS ATTRIBUTE

May 7, 2019
Negotiation on Content-Based or Derived
Attributes
(with response filtering) Retrieved StarMap
136
Get URI-Ms for URI-R of good
quality that are unique
MD < 0.25, unique(simhash)
Abbreviated StarMap
with filtering applied
1
1
2
3
DERIVED ATTRIBUTE
CONTENT-BASED
ATTRIBUTE
&

May 7, 2019
Outline
● Background
● Related Work
○ Reconsidering Framework Scenarios
○ Framework Implementation
137

May 7, 2019
Outline
● Background
● Related Work
138

May 7, 2019
“It was there yesterday, where did it go?”
● We created browser-based tools to mitigate live Web
ephemerality
○ e.g., WARCreate1 (Section 4.2)
● ...and personal Web archival replay systems to re-experience the
captures and facilitate collaboration and propagation
○ Web Archiving Integration Layer (WAIL)2,3, Section 4.3
○ InterPlanetary Wayback (ipwb)4,5, Section 4.5
139
1 Kelly and Weigle, “WARCreate - Create Wayback-Consumable WARC Files from Any Webpage”, JCDL 2012
2 Kelly et al., “Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving”, PDA 2013
3 Berlin, Kelly et al., “WAIL: Collection-Based Personal Web Archiving”, JCDL 2017
4 Kelly et al., “InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives”, TPDL 2016
5 Alam, Kelly et al., “InterPlanetary Wayback: The Permanent Web Archive”, JCDL 2016

May 7, 2019
“Save this, but only for me”
● Our personal Web archiving tools (e.g., previous slide) help with personal Web
archiving of potentially private live Web content.
● The Private Web Archive Adapter (Section 7.1.2) in the Mementity Framework
allows for systematic access regulation
HTTP 401
Consult URI-P
✓
140

May 7, 2019
“I want to share this but control who can see it.”
● Access regulation: PWAA (Sections 7.1.2, 7.2.2)
● Collaboration and propagation of captures: ipwb1,2 (Section 4.5)
● Symmetric encryption for privacy: ipwb
● Negotiation in dimension of privacy / access attributes3 (Section 6.2.3)
Figure 75, Chapter 7 141
1 Kelly et al., “InterPlanetary Wayback: Peer-To-Peer
Permanence of Web Archives”, TPDL 2016
2 Alam, Kelly et al., “InterPlanetary Wayback: The Permanent
Web Archive”, JCDL 2016
3 Kelly et al., “A Framework for Aggregating Private and
Public Web Archives”, JCDL 2018

May 7, 2019
Outline
● Background
● Related Work
142

May 7, 2019
Reference Implementation
New Software Packages
Adapt to Framework with Additional Capabilities
143

May 7, 2019
Adaptations for MMA
MMA deliverables (MMADi)
1. Query precedence
2. Short-circuiting
3. Client-side Archival
Specification
4. ⇄ PWAA for access control
5. ⇄ StarGate for negotiation in
dimensions beyond time
Section 8.3.1
144

May 7, 2019
MMAD1: Query Precedence in MemGator
1. Adjusted base reference specification to be MemGator’s intended psuedo-
parallel querying model (JSON array → JSON objects).
○ Figure 90, Chapter 8
2. Implement MemGator interpretation of JSON arrays to query in-series, waiting
for a previous array entry to finish before proceedings.
3. Implement interpretation of JSON “sets” of archives to recursively allow
querying a subset in-series.
Section 8.3.1
145

May 7, 2019
MMAD2: Short-Circuiting in MemGator
MemGator has pre-existing functionality to short-circuit on archival count,
based on predefined order in server-set archival specification.
Leverage condition in MemGator to modify this conditional based on a function
checked at same point as conditional to prematurely break iteration.
Section 8.3.1
146

May 7, 2019
MMAD3: Client-Side Archival Specification
1. Associate archives list with session, pre-populate with default set.
2. Read Prefer header on request received. Overwrite archives set and associate
with session.
3. When querying archives, refer to this set (passed along by-reference) instead of
a global set defined on startup and read from a local config.
Section 8.3.1
147

May 7, 2019
PWAA Implementation
● We created a web-based GUI to
assist in initialization
● Handles interaction from client to
private Web archive and requests
from aggregators (MMAD4)
Section 8.3.2
148

May 7, 2019
Preference from Client’s Perspective
149
HTTP Response
HTTP Request
GET /tm/cdxj/https://cnn.com HTTP/1.1
Host: mma.example.com
Prefer: damage="<0.4"
Prefer:archives="data:application/json;charset=utf-8;
base64,Ww0KICB7...NCn0="
Prefer: publicOnly
HTTP/1.1 200 OK
Content-Type: application/cdxj+ors
Preference-Applied: damage="<0.4"
CDXJ
TM

May 7, 2019
StarGate Implementation
Section 8.3.3
150
HTTP Request
> GET /https://cnn.com HTTP/1.1
> Host: mma.alice.org
> Prefer: damage="<0.4"
>

May 7, 2019
Section 8.3.3
151
> GET /timemap/https://cnn.com
> Host: web.archive.org
>
> GET /tm/https://cnn.com
> Host: webarchive.org.uk/tm/
>
> GET /timemap/https://cnn.com
> Host: archive.md
>

May 7, 2019
Section 8.3.3
152
HTTP/1.1 200 OK
Content-Type:application/link-format
(TimeMap)
HTTP/1.1 200 OK
(TimeMap)
HTTP/1.1 200 OK
(TimeMap)

May 7, 2019
Section 8.3.3
153
HTTP Request
POST / HTTP/1.1
Host: stargate.alice.org
Prefer: damage="<0.4"
CDXJ
TM

May 7, 2019
Section 8.3.3
154
HTTP Request
GET /api/{URI-M}
Host: memento-damage.cs.odu.edu
For each URI-M in TimeMap
Long running processes can be
executed asynchronously and use
the “respond-async” syntax per
RFC7240 §4.1
(not shown here for simplicity)

May 7, 2019
Section 8.3.3
155
HTTP Response
HTTP/1.1 200 OK
Content-Type:application/json
(damage metadata)
For each URI-M in TimeMap

May 7, 2019
Section 8.3.3
156
Parse damage from JSON
returned from API, associate
with URI-M

May 7, 2019
Section 8.3.3
157
HTTP Response
HTTP/1.1 200 OK
CDXJ
TM
Smaller size
Indicative of filtering

May 7, 2019
Section 8.3.3
158
HTTP Response
HTTP/1.1 200 OK
CDXJ
TM
Addresses MMAD5:
⇄ StarGate for negotiation
in dimensions beyond time

May 7, 2019
Mink as a Client-Side Memento Meta-Aggregator
159

May 7, 2019
Archival Specification and Query Precedence in
Mink
160

May 7, 2019
Outline
● Background
● Related Work
161

May 7, 2019
RQ1: What sort of content is difficult to capture and replay for preservation
from the perspective of a Web browser?
● JavaScript largely the culprit
● Content behind authentication neglected for preservation
● We created novel approach to preserving content behind authentication using browser
APIs (WARCreate); prev. inaccessible content preserved to WARC.
● Studies on challenges in preservation:
○ Change in Archivability over Time (TPDL 2013)
○ Archival Acid Test (JCDL 2014)
○ Measuring the Impact of Missing Resources (JCDL 2014, IJDL 2015)
○ Impact of JS on Archivability (IJDL 2016)
● Tools to mitigate:
○ WARCreate (JCDL 2012)
○ ArchiveNow (JCDL 2018)
○ Replay Banners (JCDL 2018)
Chapter 9
162

May 7, 2019
RQ2: How do Web browser APIs compare in potential functionality to the
capabilities of archival crawlers?
Non-browser-based preservation tools miss representations that use latest Web technologies.
● Browser preservation vs. Crawlers:
○ Identifying Personalized Representation in the Archive (D-Lib 2013)
○ Archival Acid Test (JCDL 2014)
○ Service Workers for Client-side Replay (JCDL 2017)
○ Extensible Replay Banners (JCDL 2018)
● Tools to mitigate:
○ WARCreate (JCDL 2012)
○ Mink (JCDL 2014)
○ Mobile Mink (JCDL 2015)
○ WAIL (JCDL 2017, PDA 2013)
Chapter 9
163

May 7, 2019
RQ3: What issues exist for capturing and replaying content behind
authentication?
URI clash issue, need something to distinguish private captures on replay.
Key in preservation is to rely on native tools used to view the Web.
● Studies on Replay:
○ Impact of URI Canonicalization (JCDL 2017)
○ Framework for Aggregating Private & Public Web Archives (JCDL 2018)
Chapter 9
Capturing
vs
164

May 7, 2019
RQ4: How can content that was captured behind authentication signal to Web
archive replay systems that it requires special handling?
● Utilization of CDXJ, enrichment of TimeMaps to StarMaps with access attributes.
● We introduced a standard, extensible mechanism for regulating access to private Web
archives using the PWAA abstraction.
● Studies on auth+replay:
○ InterPlanetary Wayback (JCDL 2016, TPDL 2016)
Chapter 9
19981212013921 {
...
"access": {
"type": "Blake2b",
"token": "c6ed419e74907d220c69858614d86...ef0a3a88a41"
}
}
165

May 7, 2019
RQ5: How can Memento aggregators indicate that private Web archive content
requires special handling to be replayed, despite being aggregated with publicly
Aforementioned access attributes, interoperability with PWAA for acquiring and representing
tokens in StarMaps
● Aggregation + Private Archives:
○ Mink (JCDL 2014)
Chapter 9
166

May 7, 2019
RQ6: What kinds of access control do users who create private Web archives
need to regulate access to their archives?
● Integrated the IPFS with WARC standard via InterPlanetary Wayback to allow for client-
side archival replay without needing to possess the archival holdings.
● We enabled symmetric encryption of personal archival holdings within InterPlanetary
Wayback
● Studies on Access Control + private archives:
○ InterPlanetary Wayback (JCDL 2016, TPDL 2016)
Chapter 9
167

May 7, 2019
Future Work
● Prefer with archives instead of more generic config keyword
○ Potential name clash at benefit of extensibility
● More elegant/semantic/syntactic expressions of negotiation attributes
○ Prefer : damage="<0.5" is limited by RFC/HTTP syntax
● Public-private inter-archive leakage
● Inferrerence of private-public based on holdings beyond boolean classification
● Further investigations of private archives’ contents, inherently inaccessible
○ Which the Mementity Framework mitigates
Section 9.2
168

May 7, 2019
Conclusions
Content behind authentication
on the live Web can now be
preserved and systematically
replayed
Section 9.3
169

May 7, 2019
Hierarchy of Aggregated Public-Private Archives
170

May 7, 2019
Conclusions
171
→{ }→{ }
→{ }
Collective Experience of the past Web
Can be Aggregated
Section 9.3

May 7, 2019
Conclusions Access to Private Archives Directly or
Through Sharing Can Be Systematic
172
Section 9.3

May 7, 2019
Conclusions Enabled Client-Side Aggregation
and Archival Supplementation
173
Figures 94 and 95, Chapter 8
Section 9.3

May 7, 2019
Research Dissemination
● 17 publications that address 6 RQs →
● 5 Best Paper/Poster awards/nominations
● 28 research blog posts for ODU WS-DL
● Collaborations with 8 different institutions (inc. ODU)
● 4 first-author, open-source, publicly released
software packages (collab. on 3 others from ODU)
Aggregating Private and Public Web Archives Using the Mementity Framework

Aggregating Private and Public Web Archives Using the Mementity Framework

Recommended

Recommended

More Related Content

Similar to Aggregating Private and Public Web Archives Using the Mementity Framework

Similar to Aggregating Private and Public Web Archives Using the Mementity Framework (20)

More from Mat Kelly

More from Mat Kelly (19)

Recently uploaded

Recently uploaded (20)

Aggregating Private and Public Web Archives Using the Mementity Framework

Editor's Notes