Aggregating Private and Public Web Archives Using the Mementity Framework
1. Aggregating Private and Public Web Archives
Using the Mementity Framework
PhD Dissertation Defense for:
Mat Kelly
Advisor:
Michele C. Weigle
Committee Members:
Michele C. Weigle, Michael L. Nelson, Danella Zhao,
Justin F. Brunelle, and Dimitrie C. Popescu
Department of Computer Science
Norfolk, Virginia 23529 USA
May 7, 2019
2. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
The Web
2
3. The Web is Ephemeral
7. Web Archives to the Rescue: Typical Access
1. Go to archive.org in your browser
2. Enter the URL you want to see in the past in the form field
3. Submit your query
8. Web Archives to the Rescue: Typical Access
4. Locate the archived Web page on the calendar or histogram view
5. Select the year/selection for the day
6. Repeat until you find the closest date and time
9. Step 7: Finally, View the Archived Web Page
10. Web Archiving - Live Web odu.edu
Now
11. Web Archiving: View The Web of the Past
May 4, 2019
Now
12. Web Archiving: View The Web of the Past
August 2012
(when I started my PhD)
Now
13. Web Archiving: View The Web of the Past
August 2010
(when I started at ODU)
Now
14. Web Archiving: View The Web of the Past
June 2006
(when I finished my BS)
Now
15. Web Archiving: View The Web of the Past
August 2001
(when I started college)
Now
16. Web Archiving
2019
2012
2010
2006
2001
Associate a live Web URI
with their archived representations
17. Web Archives Provide Access to the Web that was
What did odu.edu look like
in the past?
18. Multiple Archival Efforts
from Alam et al., IJDL 2016.
https://git.io/archives
19. Consulting More Archives Produces a More Comprehensive Picture
20. Even Then, Not Everything is Preserved
What did obscuresite.com
look like in the past?
0 mementos for
obscuresite.com
21. What a User Sees on the Live Web May Not Be What Is Captured
What did facebook.com look
like in the past?
Figure 10b, Chapter 1
Figure 10a, Chapter 1
22. ...And Oftentimes That is For the Best
Have you preserved my
online banking?
Figure 8, Chapter 1
23. Private Live Web Content Prescriptively Ephemeral
Figures 9a & 9b, Chapter 1
24. Other Times, We May Want Our Content Archived
Have you archived my
daughter’s pictures on
Google Photos?
0 mementos for that URI
25. ...Especially When It Has Disappeared
0 mementos for that URI
Have you archived my
daughter’s pictures on
Google Photos!?!?
“It was there yesterday, where did it go?”
26. “Save this, but only for me.”
● Screenshots of Web pages are insufficient
○ Not interactive/representative, do not integrate, lose context otherwise provided in metadata
● Large-scale archives’ tools are open source
● Individuals can archive, but there are still technical barriers
27. Individuals, Too, Can Archive The Web
“I want to share this but control who can see it.”
28. Archives from Institutional and Personal Sources
[Figure: archives holding captures at tA, at tC, and at tD→Z]
29. Today’s Memento Aggregation
Archives Queried (A0)
31. Desire: Include Personal Archives
Archives Queried (A0)
32. Desire: Include Other Non-Aggregated Archives
Archives Queried (A0)
33. Rapidly Changing Pages May Not Be Comprehensively Captured
34. More Archives Provide a Better Picture of the Web
Before aggregating, we must first consider the ramifications,
enumerated in the RESEARCH QUESTIONS:
35. Research Questions
RQ1: What sort of content is difficult to capture and replay for preservation from the perspective of a Web browser?
RQ2: How do Web browser APIs compare in potential functionality to the capabilities of archival crawlers?
RQ3: What issues exist for capturing and replaying content behind authentication?
RQ4: How can content that was captured behind authentication signal to Web archive replay systems that it requires special handling?
RQ5: How can Memento aggregators indicate that private Web archive content requires special handling to be replayed, despite being aggregated with publicly available Web archive content?
RQ6: What kinds of access control do users who create private Web archives need to regulate access to their archives?
To answer these questions, we first have to be familiar with
BACKGROUND INFORMATION
42. Outline
● Introduction/Motivation
● Background
● Related Work
● The Mementity Framework
● Applying Framework to Scenarios
● Contributions, Conclusions, and Future Work
43. Hypertext Transfer Protocol (HTTP)
HTTP Request (request headers)
> GET /~mkelly/dissertation.pdf HTTP/1.1
> Host: www.cs.odu.edu
> Connection: close
44. Hypertext Transfer Protocol (HTTP)
HTTP Response
Response headers (metadata):
HTTP/1.1 200 OK
Server: nginx
Date: Sun, 05 May 2019 18:31:31 GMT
Content-Type: application/pdf
Content-Length: 40067283
Start of entity body:
%PDF-1.5
...
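The split between response headers and entity body shown above can be sketched in a few lines. This is a minimal illustration, assuming the raw response is already in memory; `parse_http_response` is a hypothetical helper name, and the body bytes are shortened for the example.

```python
def parse_http_response(raw: bytes):
    """Split a raw HTTP/1.1 response into (status_code, headers, body)."""
    # Headers and entity body are separated by an empty line (CRLF CRLF).
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    # Status line: protocol version, status code, reason phrase.
    _version, code, _reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), headers, body


# Shortened stand-in for the response on the slide (assumption: body truncated).
raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Server: nginx\r\n"
    b"Content-Type: application/pdf\r\n"
    b"\r\n"
    b"%PDF-1.5"
)
status, headers, body = parse_http_response(raw)
```

Header names are lower-cased here because HTTP header field names are case-insensitive.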
45. Background: Memento
Adapted from Memento Guide: Introduction. http://www.mementoweb.org/guide/quick-intro/, January 2015, in Public Domain
Figure 21, Chapter 2
* H. Van de Sompel et al. HTTP Framework for Time-Based Access to Resource States – Memento. IETF RFC 7089, December 2013.
46. Background: Memento Request Example
Request cnn.com at Aug 2, 2017 at 6:15pm EST
HTTP Request to the TimeGate (URI-G):
• Accept-Datetime: Wed, 02 Aug 2017 23:15:00 GMT
• GET: http://web.archive.org/web/http://www.cnn.com
47. Background: Memento Request Example
Request cnn.com at Aug 2, 2017 at 6:15pm EST
HTTP Request to the TimeGate (URI-G):
• Accept-Datetime: Wed, 02 Aug 2017 23:15:00 GMT
• GET: http://web.archive.org/web/http://www.cnn.com
HTTP Response (302):
• Memento-Datetime: Wed, 02 Aug 2017 23:18:04 GMT
• Location: http://web.archive.org/web/20170802231804/http://www.cnn.com/ (the memento, URI-M)
• Link: relations for original (URI-R), timegate (URI-G), timemap (URI-T), and memento (URI-M)
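The header values in this exchange can be produced with the Python standard library. The sketch below is illustrative only: `accept_datetime` and `wayback_uri_m` are hypothetical helper names, and the URI-M pattern is simply copied from the Location header shown on the slide rather than taken from any archive's documentation.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def accept_datetime(dt: datetime) -> str:
    """Format an aware datetime as an Accept-Datetime value (always in GMT)."""
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)


def wayback_uri_m(memento_datetime: str, uri_r: str) -> str:
    """Build a URI-M from a Memento-Datetime value and a URI-R.

    Assumption: the 14-digit timestamp path layout seen in the slide's
    Location header.
    """
    stamp = parsedate_to_datetime(memento_datetime).strftime("%Y%m%d%H%M%S")
    return f"http://web.archive.org/web/{stamp}/{uri_r}"


requested = accept_datetime(datetime(2017, 8, 2, 23, 15, tzinfo=timezone.utc))
uri_m = wayback_uri_m("Wed, 02 Aug 2017 23:18:04 GMT", "http://www.cnn.com/")
```

The requested datetime (23:15:00) and the returned Memento-Datetime (23:18:04) differ because a TimeGate returns the capture closest to the requested time, not necessarily an exact match.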
48. Background: Dereferencing a TimeMap at URI-T
Request URI-T
● Date-based pagination
● Other formats for TimeMap
[Figure: the TimeMap at URI-T references the original resource (URI-R), the TimeGate (URI-G), the first through last mementos (URI-M 1, ..., URI-M m, ..., URI-M n), and paginated TimeMaps (URI-T 1 to URI-T t)]
49. Background - Privacy and Security
● Web users question trusting institutions to preserve private Web contents [1]
● OAuth 2.0 [2] facilitates authentication cohesion of entities
[1] Marshall and Shipman, “On the Institutional Archiving of Social Media”, JCDL 2012
[2] D. Hardt. The OAuth 2.0 Authorization Framework. IETF RFC 6749, October 2012.
RQ3: What issues exist for capturing and replaying
content behind authentication?
RQ6: What kinds of access control do users who
create private Web archives need to regulate
access to their archives?
Figure 27, Chapter 2
50. Role-based Delegation and Authentication
A familiar paradigm used for authentication on the live Web
RQ3: content behind auth
RQ6: access control
51. Private vs. Public Web archives
Both personal and institutional Web archiving can be either public or private.
Table I, Chapter 1
[49] Brunelle et al., “Leveraging Heritrix and the Wayback Machine on a Corporate Intranet…”, D-Lib Magazine, 22(1/2), 2016
52. Private vs. Public Web archives
Both personal and institutional Web archiving can be either public or private.
Table I, Chapter 1
[49] Brunelle et al., “Leveraging Heritrix and the Wayback Machine on a Corporate Intranet…”, D-Lib Magazine, 22(1/2), 2016
OUR FOCUS
53. HTTP Prefer
● HTTP negotiation already available via Accept-* headers
● Prefer syntax provides a mechanism for the client to specify preferences
○ ...with which servers may or may not comply
Prefer: foo; bar="baz"
Preference-Applied: bar="baz"
* J. Snell. Prefer Header for HTTP. IETF RFC 7240, June 2014.
RQ4: How can content that was captured behind authentication signal to Web archive replay systems that it requires special handling?
RQ5: How can Memento aggregators indicate that private Web archive content requires special handling to be replayed, despite being aggregated with publicly available Web archive content?
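To make the Prefer syntax concrete, the following is a minimal parsing sketch. `parse_prefer` is a hypothetical helper, not part of any library, and it deliberately simplifies RFC 7240: quoted strings containing a comma or semicolon are not handled.

```python
def parse_prefer(value: str) -> dict:
    """Parse a Prefer header value into {preference: (value, params)}.

    Simplification (assumption): quoted strings containing ',' or ';'
    are not supported.
    """
    prefs = {}
    for item in value.split(","):
        parts = [p.strip() for p in item.split(";") if p.strip()]
        if not parts:
            continue
        # First part is the preference token, optionally with "=value".
        token, _, pref_value = parts[0].partition("=")
        params = {}
        for part in parts[1:]:
            name, _, pval = part.partition("=")
            params[name.strip().lower()] = pval.strip().strip('"')
        prefs[token.strip().lower()] = (pref_value.strip().strip('"') or None, params)
    return prefs


parsed = parse_prefer('foo; bar="baz"')
```

For the slide's example, `foo` is a preference without a value and `bar="baz"` is a parameter attached to it. A server that honors a preference may report it back in Preference-Applied.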
54. Memento Aggregation* State of the Art
* Querying multiple Web archives for a URI from a single endpoint at a Memento aggregator
Time Travel Service, MemGator
RQ5: How can Memento aggregators indicate that private Web archive content requires special handling to be replayed, despite being aggregated with publicly available Web archive content?
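The core of aggregation, merging per-archive results into one chronologically ordered list, can be sketched as follows. This is an illustration under stated assumptions, not how the Time Travel Service or MemGator is implemented: `aggregate` is a hypothetical function, and the archive names and URI-Ms are made-up placeholders.

```python
from email.utils import parsedate_to_datetime


def aggregate(timemaps: dict) -> list:
    """Merge per-archive memento lists into one list ordered by datetime.

    `timemaps` maps an archive name to a list of (uri_m, datetime) pairs,
    where datetime is an HTTP-date string.
    """
    merged = [
        {"archive": archive, "uri": uri_m, "datetime": dt}
        for archive, mementos in timemaps.items()
        for uri_m, dt in mementos
    ]
    merged.sort(key=lambda m: parsedate_to_datetime(m["datetime"]))
    return merged


# Hypothetical archives and captures, for illustration only.
combined = aggregate({
    "archive-a": [("http://archive-a.example/2012/http://odu.edu/",
                   "Wed, 01 Aug 2012 00:00:00 GMT")],
    "archive-b": [("http://archive-b.example/2006/http://odu.edu/",
                   "Thu, 01 Jun 2006 00:00:00 GMT"),
                  ("http://archive-b.example/2019/http://odu.edu/",
                   "Sat, 04 May 2019 00:00:00 GMT")],
})
```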
55. Memento Aggregation - MementoWeb
Also available via CLI:
$ curl http://timetravel.mementoweb.org/timemap/link/http://odu.edu
Figures 25a and 25b, Chapter 2
56. Memento Aggregation - MemGator
● Open Source Memento Aggregator - github.com/oduwsdl/memgator
● Easy personal/local deployment
● Specify archive list on launch
○ Easily configurable JSON
○ Use default collection if not specified
● TimeMap Formats:
○ Link
○ JSON
○ CDXJ
* Alam and Nelson, “MemGator - A Portable Concurrent Memento Aggregator: Cross-Platform CLI and Server Binaries in Go”, JCDL 2016
57. CDXJ: An Alternative TimeMap Format
Memento entry in Link (RFC 7089) TimeMap:
...
<http://localhost:8080/20101116060516/http://facebook.com/>; rel="memento";
datetime="Tue, 16 Nov 2010 06:05:16 GMT",
...
Memento entry in CDXJ TimeMap:
...
20101116060516 {
"uri": "http://localhost:8080/20101116060516/http://facebook.com/",
"rel": "memento",
"datetime": "Tue, 16 Nov 2010 06:05:16 GMT"
}
...
Annotated parts: Memento (URI-M), Relative Relations, Memento-Datetime
See Alam, “CDXJ: An Object Resource Stream Serialization Format”, 2015
Figure 26, Chapter 2
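The correspondence between the two entries can be sketched as a conversion. This is a minimal illustration: `link_to_cdxj` and the regular expression are hypothetical, cover only the `uri; rel; datetime` shape shown on the slide, and assume the 14-digit datetime key used in the slide's CDXJ example.

```python
import json
import re
from email.utils import parsedate_to_datetime

# Matches one entry of the form <uri>; rel="..."; datetime="..."
LINK_ENTRY = re.compile(
    r'<(?P<uri>[^>]+)>\s*;\s*rel="(?P<rel>[^"]+)"\s*;\s*datetime="(?P<dt>[^"]+)"'
)


def link_to_cdxj(entry: str) -> str:
    """Convert one Link-format memento entry to a single-line CDXJ record."""
    match = LINK_ENTRY.search(entry)
    if match is None:
        raise ValueError("not a memento Link entry")
    # CDXJ key: the memento datetime as a 14-digit timestamp.
    key = parsedate_to_datetime(match["dt"]).strftime("%Y%m%d%H%M%S")
    attrs = {"uri": match["uri"], "rel": match["rel"], "datetime": match["dt"]}
    return f"{key} {json.dumps(attrs)}"


entry = (
    '<http://localhost:8080/20101116060516/http://facebook.com/>; rel="memento";\n'
    'datetime="Tue, 16 Nov 2010 06:05:16 GMT",'
)
record = link_to_cdxj(entry)
```

Because the value after the key is JSON, additional attributes can be added to a record without changing how the key and existing fields are read.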
58. Mink: Visual User Interaction with Aggregators
● Browser extension we developed*
● Bridges gap between live and archived Webs
● Leverages Memento aggregator’s capability, returns TimeMaps
● Indicates # of captures for a URI while you browse
● Provides navigation of mementos while browsing live Web
● Single-click submission of URI-R to multiple Web archives
* Kelly et al., “Mink: Integrating the Live and Archived Web Viewing Experience Using Web Browsers and Memento”, JCDL 2014
Figure 36, Chapter 4
60. Related Work
Memento Aggregation (RQ 5):
● Bornand et al. (JCDL 2016) - routing using classifier
● Nelson and Alam (JCDL 2016) - routing using probability
● Rosenthal et al. (DLib 2015) - aggregation w/ restricted access
● AlSum et al. (IJDL 2014) - routing URI lookups
● Brunelle and Nelson (JCDL 2016) - TimeMap caching
● Alam et al. (IJDL 2016) - efficient querying through profiling
● Rosenthal (2013) - aggregator scalability
Prefer and Web Archives (RQ 5):
● Jones et al. (2016) - introduced “raw” mementos
● Van de Sompel et al. (2016) - Prefer usage for raw
● Rosenthal (2016) - other Prefer usage in web archives
Privacy in Web Archives (RQ 3-6):
● Marshall and Shipman (JCDL 2014) - Preserving FB, social
media content
● Lindley et al. (WWW 2013) - Users archiving social media
Access Control in Web Archives (RQ 4-6):
● Webber (2018) - HTTP 451 @ UKWA
● Rao et al. (SIGOPS 2001) iProxy - access control in web archives
Collaboration in Web Archives (RQ 5):
● Potts (UXPA 2015) - creating collections with tags
● Stine (2018) Cobweb - collection building between institutions at
Archive-It.
● Phillips (2003) PANDORA dynamics
● Niu (DLib 2012) - personalized features at PANDORA
Public Web Archiving (RQ 2):
● Brunelle et al. (JCDL 2017) - JS & deferred representations
● Ben-David and Huurdeman (Alexandria 2014) - archives
retrieval beyond URI
● AlNoamany et al. (JCDL 2013) Access Patterns
Private Web Archiving (RQ 1-4):
● Brunelle et al. (DLib 2016) - archiving corporate intranet
● Marshall and Shipman (JCDL 2012) - Institutions archiving
social media
67. Outline
● Introduction/Motivation
● Background
● Related Work
● The Mementity Framework
○ Archival Negotiation Beyond Time
○ Query Precedence & Short-Circuiting
○ Mementities
● Applying Framework to Scenarios
● Contributions, Conclusions, and Future Work
69. More Expressive TimeMaps
● Memento Quality (e.g., Damage) [1]
● How Many Captures? [2]
● How Many Are Identical? [2, 3]
● Other Attributes of Mementos...
[1] Brunelle, Kelly et al., JCDL 2014, IJDL 2015
[2] Kelly et al., JCDL 2017
[3] AlSum and Nelson, ECIR 2014
Figure 7, Chapter 1
Section 6.2
70. Additional TimeMap Attributes
Content-based Attributes
Derived Attributes
Access Attributes
71. TimeMap Enrichment: Content-Based Attributes
● Status Code [1]
● Content-Digest
○ In WARC & CDX
○ Not all archives expose CDX
● Would allow more info about mementos without requiring comprehensive dereferencing
[1] Kelly et al., “Impact of URI Canonicalization on Memento Count”, JCDL 2017, arXiv 1703.03302
Section 6.2.1
72. TimeMap Enrichment: Derived Attributes
● Thumbnails (e.g., via SimHash) [1]
○ Calculation based on root memento’s HTML
● Memento Damage (JCDL 2014, IJDL) [2]
○ Requires dereferencing embedded resources
apple.com, many duplicate mementos!
[1] AlSum and Nelson, Thumbnail Summarization Techniques for Web Archives, ECIR 2014, pp. 299-310.
[2] Brunelle, Kelly et al., “The Impact of JavaScript on Archivability,” IJDL, 17(2), pp. 95-117. January 2016.
Section 6.2.2
73. TimeMap Enrichment: Access Attributes
How to distinguish private captures from public captures in a TimeMap?
[Figure: Memento framework diagram linking the original resource (URI-R), TimeGate (URI-G), TimeMaps (URI-T), and mementos (URI-M, first through last)]
RQ5: How can Memento aggregators
indicate that private Web archive content
requires special handling to be replayed,
despite being aggregated with publicly
available Web archive content?
Section 6.2.3
74. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
TimeMap Enrichment - in a CDXJ TimeMap
74
20101116060516 {
"uri": "http://localhost:8080/20101116060516/http://facebook.com/",
"rel": "memento",
"datetime": "Tue, 16 Nov 2010 06:05:16 GMT",
"status_code": 200,
"digest": "sha1:LK26DRRQJ4WATC6LBVF3B3Z4P2CP5ZZ7",
"damage": 0.24,
"simhash": "6551110622422153488",
"content-language": "en-US",
"access": {
"type": "Blake2b",
"token": "c6ed419e74907d220c69858614d86...ef0a3a88a41"
}
}
Line breaks added for clarity; CDXJ records occupy a single line.
Callouts: content-based attributes, derived attributes, access attributes
Figure 64, Chapter 6
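A minimal sketch of reading one such record, assuming the layout shown on this slide (a 14-digit datetime key, a space, then a single-line JSON object). The function name `parse_cdxj_line` is hypothetical and the record is abbreviated from the slide.

```python
import json


def parse_cdxj_line(line):
    """Split one CDXJ TimeMap record into its datetime key and JSON attributes."""
    key, _, payload = line.strip().partition(" ")
    return key, json.loads(payload)


# Abbreviated single-line form of the record shown on the slide.
record = (
    '20101116060516 {"uri": "http://localhost:8080/20101116060516/http://facebook.com/", '
    '"rel": "memento", "datetime": "Tue, 16 Nov 2010 06:05:16 GMT", '
    '"status_code": 200, "damage": 0.24, "simhash": "6551110622422153488"}'
)

key, attrs = parse_cdxj_line(record)
print(key, attrs["status_code"], attrs["damage"])
```

Because the enriched attributes are ordinary JSON members, a client that does not understand `damage` or `access` can ignore them.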
76. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Outline
● Introduction/Motivation
● Background
● Related Work
● The Mementity Framework
○ Archival Negotiation Beyond Time
○ Query Precedence & Short-Circuiting
○ Mementities
● Applying Framework to Scenarios
● Contributions, Conclusions, and Future Work
76
77. Query Precedence
77
“Check my archive first, then Carol’s, then all
public archives.”
[Figure: query order: 1 = my archive, 2 = Carol’s archive, 3 = all public archives]
● More control of querying in series and parallel
See Atkins’, “Paywalls in the Internet Archive”, March 2018
78. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Precedence in Archival Specifications
● MemGator’s internal archival specification
N1: [{A0},{A1},{A2}]
N2: {{A0},{A1},{A2}}
N3: {{A0},{A4,A2,A5},{A3}}
N4: {"id0":{A0},"id1":{A4},"id2":{A3}}
N5: [{"id0":{A0}},{"id1":{A1}},{"id2":{A2}}]
N6: [{"id0":{A0}},
{"id1a":{A4},"id1b":{A2},"id1c":{A5}},
{"id2":{A3}}]
● Despite the JSON array, archives are queried in “pseudo-parallel” for efficiency
79. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Precedence in Archival Specifications
● N2 more descriptive of behavior despite notation interpretation
● JSON object members require identifiers
79
1
1
1
80. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
● Can introduce “sets” of URIs to archive, but still query in pseudo-parallel
● JSON object members require identifiers
Precedence in Archival Specifications
80
1
1
1
A4
A5
1
1
81. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
● Add identifiers for JSON compliance
● Still pseudo-parallel
Precedence in Archival Specifications
81
1
1
1
id0
id1
id2
84. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Precedence Using Prefer
[{"id0":{A0}},
{"id1a":{A4},"id1b":{A2},"id1c":{A5}},
{"id2":{A3}}]
ENCODES TO
Prefer: archives="data:application/json;charset=utf-8;base64,Ww0KICB7...NCn0="
(no aggregator prior to this dissertation supports this)
84
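The encoding step can be sketched as follows, assuming the archival specification is serialized as JSON and wrapped in a base64 data URI as on this slide. The function name and the archive URIs standing in for A0 through A5 are hypothetical.

```python
import base64
import json


def archives_preference(spec):
    """Encode an archival specification as an archives="data:..." preference value."""
    body = json.dumps(spec).encode("utf-8")
    encoded = base64.b64encode(body).decode("ascii")
    return 'archives="data:application/json;charset=utf-8;base64,' + encoded + '"'


# Hypothetical archive URIs standing in for A0..A5 in notation N6.
spec = [
    {"id0": "https://archive.alice.example/"},
    {
        "id1a": "https://archive-a.example/",
        "id1b": "https://archive-b.example/",
        "id1c": "https://archive-c.example/",
    },
    {"id2": "https://archive-d.example/"},
]

header_value = archives_preference(spec)
print("Prefer: " + header_value)
```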
87. Query Precedence
87
“Check my archive first, then Carol’s, then all
public archives.”
1
2
3 3
● More control of querying in series and parallel
See Atkins’, “Paywalls in the Internet Archive”, March 2018
88. Query Short-Circuiting
88
“Check private archives first. Iff you find no
captures, only then check the public archives.”
1
1
2 2
● May give priority to archive preference
● Series halt when threshold met.
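A minimal sketch of short-circuiting, assuming the threshold is a capture count and that groups of archives are tried in order. The names `query_in_series`, `fetch`, and the archive identifiers are hypothetical; members of a group are queried one after another here only to keep the sketch short.

```python
def query_in_series(groups, fetch, threshold=1):
    """Query groups of archives in order, halting once enough captures are found."""
    captures = []
    for group in groups:
        for archive in group:
            captures.extend(fetch(archive))
        if len(captures) >= threshold:
            break  # short-circuit: later groups are never contacted
    return captures


# Hypothetical holdings keyed by archive identifier.
holdings = {
    "alice-private": ["m1"],
    "carol-private": [],
    "public-a": ["m2", "m3"],
    "public-b": ["m4"],
}
groups = [["alice-private", "carol-private"], ["public-a", "public-b"]]

found = query_in_series(groups, lambda archive: holdings.get(archive, []))
print(found)
```

If the private archives held nothing, the loop would fall through to the public group, which matches the quoted preference on this slide.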
90. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Outline
● Introduction/Motivation
● Background
● Related Work
● The Mementity Framework
○ Archival Negotiation Beyond Time
○ Query Precedence & Short-Circuiting
○ Mementities
● Applying Framework to Scenarios
● Contributions, Conclusions, and Future Work
90
91. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Mementities
● Memento + Entity (entity term already overused)
91
Time
Gate
Introduced in this Framework Conventional Memento
Mementities
92. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL 92
MEMENTITY FRAMEWORK
Mementities
Section 7.1.1
93. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Memento Meta-Aggregator (MMA)
93
functional
⊇
Figure 80, Chapter 7
Section 7.3.3
94. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
MMA: Archive Selection
94
Figure 55, Chapter 6
Section 6.1
95. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
MMA: User-Driven Archival Specification
95
✓ ✓
Section 6.1
96. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
MMAα:
from MA2, MA1 and WA6
MMAβ:
from WA7 and WA8
MMAγ:
from MMAβ, MA5, and WA1
96
MMA Aggregation sources
Figure 67, Chapter 7
97. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Costs of more expressive TimeMaps (StarMaps)
and Link header enrichment
● Computational
○ Mostly server-side, potential to further leverage client
● Temporal
○ Required on variant generation
● Spatial
○ Permutation variant storage
● Access
○ Variant negotiation
97
98. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Temporal Costs: Conventional Aggregation
$T = t_{req} + t_{MA} + t_{AGG} + t_{resp}$ (Equation 9, Chapter 8)
$t_{MA} = \max_i \mathrm{RTT}(A_i, \text{URI-R})$ (Equation 10, Chapter 8)
● $t_{req}$: Client→Aggregator request
● $t_{MA}$: Aggregator⇄∀archives communication
● $t_{AGG}$: Aggregator combining, sorting, etc.
● $t_{resp}$: Aggregator→Client response
98
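A worked instance of the cost model above, under the slide's assumption that archives are contacted in parallel so only the slowest round trip counts. The function name and all timings are made-up illustrative values in seconds.

```python
def conventional_aggregation_time(t_req, archive_rtts, t_agg, t_resp):
    """Total time for conventional aggregation with archives queried in parallel."""
    t_ma = max(archive_rtts)  # the slowest archive bounds the parallel phase
    return t_req + t_ma + t_agg + t_resp


# Illustrative values only: three archives with different round-trip times.
total = conventional_aggregation_time(
    t_req=0.05, archive_rtts=[0.30, 1.20, 0.45], t_agg=0.10, t_resp=0.05
)
print(round(total, 2))
```

Adding a faster archive to the set leaves the total unchanged, while adding a slower one raises it, which is why archive selection matters for response time.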
99. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Impact of MemGator→MMA
● The additional capabilities of the MMA beyond MemGator add temporal cost:
○ $t_{prefer}$ → time and computation required to modify the base set of archives and associate it
with the client’s session
○ $t_{enrich}$ → time required for external services to parse/calculate additional attributes
per URI-M
○ $t_{filter}$ → time required to filter URI-Ms based on the client’s preference (potentially 0)
Section 8.2
99
100. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
● Personal Archive Aggregation
● MMA Chaining
● Client-Side Aggregation Preference
MMA Dynamics By-Example
100
ALICE
BOB
CAROL
MALCOLM
112. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Client-Side Aggregation Preference
Bob Prefers to Exclude IA Captures
112
✓ ✓
...and does not want to setup his own MMA
Section 6.1
117. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Hooray, Aggregation!
117
!
118. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL 118
MEMENTITY FRAMEWORK
Mementities
Section 7.1.2
119. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Hooray, Aggregation!
119
HTTP 401
Consult URI-P
✓
RQ4: How can content that was captured behind authentication signal to
Web archive replay systems that it requires special handling?
Section 7.3.4
120. PhD Defense: Mat Kelly
May 7, 2019
Private Web Archive Adapter (PWAA)
● Auth layer to encourage private
Web archive aggregation
● Typical OAuth 2.0 flow
● Auth role cohesive to PWAA
● Persistent access through tokenization
120
RQ6: What kinds of access control do users who create
private Web archives need to regulate access to their
archives?
Figure 63, Chapter 6
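One way tokenization could work is sketched below. This is an assumption for illustration only: the slides name Blake2b as the access type but do not say how tokens are derived. The functions, the secret, and the user and archive identifiers are all hypothetical.

```python
import hashlib
import hmac


def issue_token(secret, user, archive):
    """Derive a keyed BLAKE2b token binding a user to an archive (hypothetical scheme)."""
    message = f"{user}|{archive}".encode("utf-8")
    return hashlib.blake2b(message, key=secret, digest_size=32).hexdigest()


def token_is_valid(secret, user, archive, token):
    """Recompute the token and compare in constant time."""
    return hmac.compare_digest(issue_token(secret, user, archive), token)


secret = b"pwaa-demo-secret"  # hypothetical; a real deployment would protect this key
token = issue_token(secret, "bob", "https://archive.alice.example/")
print(token_is_valid(secret, "bob", "https://archive.alice.example/", token))
```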
121. PhD Defense: Mat Kelly
May 7, 2019
PWAA - Sharing Tokens
121
Figure 81a, Chapter 7
Section 7.3.4
122. PhD Defense: Mat Kelly
May 7, 2019
PWAA - Previously Authorized
Figure 81a, Chapter 7
Section 7.3.4
123. PhD Defense: Mat Kelly
May 7, 2019
PWAA - Unauthorized Request
Figure 81a, Chapter 7
Section 7.3.4
124. PhD Defense: Mat Kelly
May 7, 2019
PWAA - Sharing Tokens
124
Figure 81b, Chapter 7
Section 7.3.4
125. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Alice Passes Associative Token to MMA
125
RQ5: How can Memento aggregators indicate that private Web archive
content requires special handling to be replayed, despite being aggregated
with publicly available Web archive content?
Figure 84a, Chapter 7
Section 7.3.4
126. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
MMA requests URI-R...
126
...relays token where applicable
Figure 84b, Chapter 7
Section 7.3.4
127. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Private Archive Validates with PWAA
Figure 84c, Chapter 7
Section 7.3.4
128. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
PWAA Confirms Token
Figure 84d, Chapter 7
Section 7.3.4
129. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Private Archive Returns Captures
Figure 84e, Chapter 7
Section 7.3.4
130. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
MMA Aggregates, Associates Token
Figure 84f, Chapter 7
Section 7.3.4
131. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
MMA Aggregates, Associates Token
131
Section 7.3.4
We still need a way to
distinguish private archives
from public archives
when querying an aggregator
132. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL 132
MEMENTITY FRAMEWORK
Mementities
133. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
StarGate (TimeGate ⊆ StarGate, functionally)
● Content negotiation in Web archives beyond time
● “Star” ~ wildcard (*) → any dimension of negotiation
● Allows for queries like: Only show me mementos…
○ That are not redirects (content-based attribute: HTTP status ≠ 3XX)
○ That are of sufficient quality (derived attribute: Memento Damage < 0.4)
○ That are from personal Web archives (access attribute indicates that Alice’s
facebook.com mementos are not login pages)
Section 7.1.3
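The three example queries can be read as predicates over the enriched TimeMap attributes. The sketch below assumes each memento is a dictionary carrying `status_code`, `damage`, and an optional `access` member as in the CDXJ example; the function names and sample data are hypothetical.

```python
def not_redirect(memento):
    """Content-based attribute: HTTP status is not 3XX."""
    return not 300 <= memento["status_code"] < 400


def sufficient_quality(memento):
    """Derived attribute: Memento Damage below 0.4."""
    return memento["damage"] < 0.4


def from_private_archive(memento):
    """Access attribute: the record carries access information."""
    return "access" in memento


def negotiate(mementos, *predicates):
    """Keep only the mementos that satisfy every requested predicate."""
    return [m for m in mementos if all(p(m) for p in predicates)]


mementos = [
    {"uri": "m1", "status_code": 200, "damage": 0.24, "access": {"type": "Blake2b"}},
    {"uri": "m2", "status_code": 302, "damage": 0.10},
    {"uri": "m3", "status_code": 200, "damage": 0.55},
    {"uri": "m4", "status_code": 200, "damage": 0.05},
]

public_view = negotiate(mementos, not_redirect, sufficient_quality)
private_view = negotiate(mementos, not_redirect, sufficient_quality, from_private_archive)
print([m["uri"] for m in public_view], [m["uri"] for m in private_view])
```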
135. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Negotiation in the Privacy Dimension
(via short circuiting)
135
Get URI-Ms for URI-R only from
personal Web archives
privateOnly
1
4
2
3
ACCESS ATTRIBUTE
Figure 85, Chapter 7
136. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Negotiation on Content-Based or Derived Attributes
(with response filtering)
Get URI-Ms for URI-R of good quality that are unique
MD < 0.25, unique(simhash)
[Figure: a retrieved StarMap is reduced to an abbreviated StarMap, with filtering applied on a derived attribute and a content-based attribute]
Figure 86, Chapter 7
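A sketch of this response filtering, assuming that "unique" means keeping the first memento for each exact SimHash value (comparing SimHashes by Hamming distance would be an alternative that this sketch does not attempt). The function name and sample records are hypothetical.

```python
def unique_good_quality(mementos, max_damage=0.25):
    """Drop damaged mementos, then keep the first memento per SimHash value."""
    seen = set()
    kept = []
    for memento in mementos:
        if memento["damage"] >= max_damage or memento["simhash"] in seen:
            continue
        seen.add(memento["simhash"])
        kept.append(memento)
    return kept


captures = [
    {"uri": "m1", "damage": 0.10, "simhash": "a"},
    {"uri": "m2", "damage": 0.10, "simhash": "a"},  # duplicate content of m1
    {"uri": "m3", "damage": 0.30, "simhash": "b"},  # too damaged
    {"uri": "m4", "damage": 0.20, "simhash": "c"},
]

abbreviated = unique_good_quality(captures)
print([m["uri"] for m in abbreviated])
```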
137. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Outline
● Introduction/Motivation
● Background
● Related Work
● The Mementity Framework
● Applying Framework to Scenarios
○ Reconsidering Framework Scenarios
○ Framework Implementation
● Contributions, Conclusions, and Future Work
137
139. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
“It was there yesterday, where did it go?”
● We created browser-based tools to mitigate live Web
ephemerality
○ e.g., WARCreate1 (Section 4.2)
● ...and personal Web archival replay systems to re-experience the
captures and facilitate collaboration and propagation
○ Web Archiving Integration Layer (WAIL)2,3, Section 4.3
○ InterPlanetary Wayback (ipwb)4,5, Section 4.5
139
1 Kelly and Weigle, “WARCreate - Create Wayback-Consumable WARC Files from Any Webpage”, JCDL 2012
2 Kelly et al., “Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving”, PDA 2013
3 Berlin, Kelly et al., “WAIL: Collection-Based Personal Web Archiving”, JCDL 2017
4 Kelly et al., “InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives”, TPDL 2016
5 Alam, Kelly et al., “InterPlanetary Wayback: The Permanent Web Archive”, JCDL 2016
140. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
“Save this, but only for me”
● Our personal Web archiving tools (e.g., previous slide) help with personal Web
archiving of potentially private live Web content.
● The Private Web Archive Adapter (Section 7.1.2) in the Mementity Framework
allows for systematic access regulation
HTTP 401
Consult URI-P
✓
140
141. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
“I want to share this but control who can see it.”
● Access regulation: PWAA (Sections 7.1.2, 7.2.2)
● Collaboration and propagation of captures: ipwb1,2 (Section 4.5)
● Symmetric encryption for privacy: ipwb
● Negotiation in dimension of privacy / access attributes3 (Section 6.2.3)
Figure 75, Chapter 7
1 Kelly et al., “InterPlanetary Wayback: Peer-To-Peer
Permanence of Web Archives”, TPDL 2016
2 Alam, Kelly et al., “InterPlanetary Wayback: The Permanent
Web Archive”, JCDL 2016
3 Kelly et al., “A Framework for Aggregating Private and
Public Web Archives”, JCDL 2018
142. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Outline
● Introduction/Motivation
● Background
● Related Work
● The Mementity Framework
● Applying Framework to Scenarios
○ Reconsidering Framework Scenarios
○ Framework Implementation
● Contributions, Conclusions, and Future Work
142
143. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Reference Implementation
New Software Packages
Adapt to Framework with Additional Capabilities
143
144. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Adaptations for MMA
MMA deliverables (MMADi)
1. Query precedence
2. Short-circuiting
3. Client-side Archival
Specification
4. ⇄ PWAA for access control
5. ⇄ StarGate for negotiation in
dimensions beyond time
Section 8.3.1
144
145. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
MMAD1: Query Precedence in MemGator
1. Adjusted the base reference specification to match MemGator’s intended pseudo-parallel
querying model (JSON array → JSON objects).
○ Figure 90, Chapter 8
2. Implement MemGator interpretation of JSON arrays to query in series, waiting
for a previous array entry to finish before proceeding.
○ Figure 87, Chapter 8
3. Implement interpretation of JSON “sets” of archives to recursively allow
querying a subset in-series.
○ Figure 91, Chapter 8
Section 8.3.1
145
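The three steps above can be sketched as a recursive interpreter, assuming a specification built from JSON arrays (queried in series), JSON objects (members queried in parallel), and strings naming archives. This is an illustration in Python, not MemGator's own code; `query`, `fetch`, and the archive names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def query(spec, fetch):
    """Interpret a specification: arrays run in series, object members in parallel."""
    if isinstance(spec, str):
        return list(fetch(spec))
    if isinstance(spec, list):
        results = []
        for entry in spec:  # each entry finishes before the next one starts
            results.extend(query(entry, fetch))
        return results
    if isinstance(spec, dict):
        with ThreadPoolExecutor(max_workers=max(1, len(spec))) as pool:
            parts = list(pool.map(lambda member: query(member, fetch), spec.values()))
        return [memento for part in parts for memento in part]
    raise TypeError("unsupported specification element: %r" % (spec,))


# Notation N6 with archive names as plain strings.
spec = [{"id0": "A0"}, {"id1a": "A4", "id1b": "A2", "id1c": "A5"}, {"id2": "A3"}]
result = query(spec, lambda archive: [archive + "-memento"])
print(result)
```

Because object members may themselves be arrays, a subset of archives can be queried in series inside a parallel group, which is the recursion described in step 3.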
146. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
MMAD2: Short-Circuiting in MemGator
MemGator has pre-existing functionality to short-circuit on archival count,
based on the predefined order in the server-set archival specification.
We leverage this conditional in MemGator, replacing it with a function
checked at the same point to break the iteration early.
Section 8.3.1
146
Figure 91, Chapter 8
147. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
MMAD3: Client-Side Archival Specification
1. Associate archives list with session, pre-populate with default set.
2. Read Prefer header on request received. Overwrite archives set and associate
with session.
3. When querying archives, refer to this set (passed along by-reference) instead of
a global set defined on startup and read from a local config.
Section 8.3.1
147
Figure 89, Chapter 8
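Steps 1 and 2 can be sketched as follows, assuming request headers arrive as a mapping from header name to a list of values and that the archives preference uses the data-URI form shown earlier in the deck. The default set, function name, and URIs are hypothetical.

```python
import base64
import json

# Hypothetical default set, standing in for the one read from local config.
DEFAULT_ARCHIVES = [{"default": "https://archive-default.example/"}]
PREFIX = 'archives="data:application/json;charset=utf-8;base64,'


def archives_for_request(headers, default=DEFAULT_ARCHIVES):
    """Return the archive set for this request, honoring a Prefer archives value."""
    for value in headers.get("Prefer", []):
        value = value.strip()
        if value.startswith(PREFIX) and value.endswith('"'):
            payload = value[len(PREFIX):-1]
            return json.loads(base64.b64decode(payload).decode("utf-8"))
    return default


client_spec = [{"mine": "https://archive.bob.example/"}]
encoded = base64.b64encode(json.dumps(client_spec).encode("utf-8")).decode("ascii")
request_headers = {"Prefer": ['damage="<0.4"', PREFIX + encoded + '"']}

session_archives = archives_for_request(request_headers)
print(session_archives)
```

Returning a per-request value rather than mutating a global keeps one client's preference from leaking into another client's session.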
148. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
PWAA Implementation
● We created a web-based GUI to
assist in initialization
● Handles interaction from client to
private Web archive and requests
from aggregators (MMAD4)
Section 8.3.2
148
149. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Preference from Client’s Perspective
149
HTTP Request
GET /tm/cdxj/https://cnn.com HTTP/1.1
Host: mma.example.com
Prefer: damage="<0.4"
Prefer: archives="data:application/json;charset=utf-8;base64,Ww0KICB7...NCn0="
Prefer: publicOnly
HTTP Response
HTTP/1.1 200 OK
Content-Length: 1207
Content-Type: application/cdxj+ors
Preference-Applied: damage="<0.4"
CDXJ
TM
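A server receiving such a request has to turn a preference like damage="<0.4" into a comparison. The sketch below assumes the value is a quoted comparison operator followed by a number, as in the example; the function name and the accepted operator set are hypothetical.

```python
import operator
import re

OPERATORS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}
PATTERN = re.compile(r'^\s*(\w[\w-]*)\s*=\s*"(<=|>=|<|>)\s*([0-9]*\.?[0-9]+)"\s*$')


def parse_threshold_preference(value):
    """Parse name="<op><number>" into (name, comparison function, threshold)."""
    match = PATTERN.match(value)
    if match is None:
        return None  # not a threshold preference, e.g. a bare token
    name, symbol, number = match.groups()
    return name, OPERATORS[symbol], float(number)


name, compare, threshold = parse_threshold_preference('damage="<0.4"')
print(name, compare(0.24, threshold), compare(0.5, threshold))
```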
150. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
StarGate Implementation
Section 8.3.3
150
HTTP Request
> GET /https://cnn.com HTTP/1.1
> Host: mma.alice.org
> Prefer: damage="<0.4"
>
151. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Section 8.3.3
151
> GET /timemap/https://cnn.com
> Host: web.archive.org
>
> GET /tm/https://cnn.com
> Host: webarchive.org.uk/tm/
>
> GET /timemap/https://cnn.com
> Host: archive.md
>
StarGate Implementation
152. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Section 8.3.3
152
HTTP/1.1 200 OK
Content-Type:application/link-format
(TimeMap)
HTTP/1.1 200 OK
Content-Type:application/link-format
(TimeMap)
HTTP/1.1 200 OK
Content-Type:application/link-format
(TimeMap)
StarGate Implementation
153. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Section 8.3.3
153
HTTP Request
POST / HTTP/1.1
Host: stargate.alice.org
Content-Length: 1500
Content-Type: application/cdxj+ors
Prefer: damage="<0.4"
CDXJ
TM
StarGate Implementation
154. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Section 8.3.3
154
HTTP Request
GET /api/{URI-M}
Host: memento-damage.cs.odu.edu
Content-Type: application/cdxj+ors
StarGate Implementation
For each URI-M in TimeMap
Long running processes can be
executed asynchronously and use
the “respond-async” syntax per
RFC7240 §4.1
(not shown here for simplicity)
155. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Section 8.3.3
155
HTTP Response
HTTP/1.1 200 OK
Content-Type:application/json
(damage metadata)
StarGate Implementation
For each URI-M in TimeMap
156. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Section 8.3.3
156
StarGate Implementation
Parse damage from JSON
returned from API, associate
with URI-M
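This association step might look like the sketch below. The JSON field name `total_damage` is an assumption made for illustration, since the slides do not show the response body returned by the damage service; the function name and sample responses are likewise hypothetical.

```python
import json


def attach_damage(timemap, damage_responses, field="total_damage"):
    """Copy each memento record, adding a damage value when a response is available."""
    enriched = []
    for memento in timemap:
        entry = dict(memento)
        body = damage_responses.get(memento["uri"])
        if body is not None:
            entry["damage"] = json.loads(body)[field]  # field name is hypothetical
        enriched.append(entry)
    return enriched


timemap = [{"uri": "https://archive.example/1/https://cnn.com"},
           {"uri": "https://archive.example/2/https://cnn.com"}]
responses = {"https://archive.example/1/https://cnn.com": '{"total_damage": 0.24}'}

enriched = attach_damage(timemap, responses)
print(enriched)
```

Mementos without a response are left unenriched rather than dropped, so a later filter can decide how to treat missing values.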
157. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Section 8.3.3
157
HTTP Response
HTTP/1.1 200 OK
Content-Length: 1207
Content-Type: application/cdxj+ors
Preference-Applied: damage="<0.4"
CDXJ
TM
StarGate Implementation
Smaller size
Indicative of filtering
158. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Section 8.3.3
158
StarGate Implementation
Addresses MMAD5:
⇄ StarGate for negotiation
in dimensions beyond time
159. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Mink as a Client-Side Memento Meta-Aggregator
Figure 94, Chapter 8
159
160. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Archival Specification and Query Precedence in
Mink
Figure 95, Chapter 8
160
161. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Outline
● Introduction/Motivation
● Background
● Related Work
● The Mementity Framework
● Applying Framework to Scenarios
● Contributions, Conclusions, and Future Work
161
162. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
RQ1: What sort of content is difficult to capture and replay for preservation
from the perspective of a Web browser?
● JavaScript largely the culprit
● Content behind authentication neglected for preservation
● We created a novel approach to preserving content behind authentication using browser
APIs (WARCreate); previously inaccessible content is preserved to WARC.
● Studies on challenges in preservation:
○ Change in Archivability over Time (TPDL 2013)
○ Archival Acid Test (JCDL 2014)
○ Measuring the Impact of Missing Resources (JCDL 2014, IJDL 2015)
○ Impact of JS on Archivability (IJDL 2016)
● Tools to mitigate:
○ WARCreate (JCDL 2012)
○ ArchiveNow (JCDL 2018)
○ Replay Banners (JCDL 2018)
Chapter 9
162
163. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
RQ2: How do Web browser APIs compare in potential functionality to the
capabilities of archival crawlers?
Non-browser-based preservation tools miss representations that use latest Web technologies.
● Browser preservation vs. Crawlers:
○ Identifying Personalized Representation in the Archive (D-Lib 2013)
○ Archival Acid Test (JCDL 2014)
○ Service Workers for Client-side Replay (JCDL 2017)
○ Extensible Replay Banners (JCDL 2018)
● Tools to mitigate:
○ WARCreate (JCDL 2012)
○ Mink (JCDL 2014)
○ Mobile Mink (JCDL 2015)
○ WAIL (JCDL 2017, PDA 2013)
Chapter 9
163
164. PhD Defense: Mat Kelly
May 7, 2019
RQ3: What issues exist for capturing and replaying content behind
authentication?
URI clash issue, need something to distinguish private captures on replay.
Key in preservation is to rely on native tools used to view the Web.
● Studies on Replay:
○ Impact of URI Canonicalization (JCDL 2017)
○ Framework for Aggregating Private & Public Web Archives (JCDL 2018)
Chapter 9
Figure 10, Chapter 1
Capturing
vs
164
165. PhD Defense: Mat Kelly
May 7, 2019
RQ4: How can content that was captured behind authentication signal to Web
archive replay systems that it requires special handling?
● Utilization of CDXJ, enrichment of TimeMaps to StarMaps with access attributes.
● We introduced a standard, extensible mechanism for regulating access to private Web
archives using the PWAA abstraction.
● Studies on auth+replay:
○ InterPlanetary Wayback (JCDL 2016, TPDL 2016)
○ Framework for Aggregating Private & Public Web Archives (JCDL 2018)
Chapter 9
19981212013921 {
"uri": "http://localhost:8080/20101116060516/http://facebook.com/",
...
"access": {
"type": "Blake2b",
"token": "c6ed419e74907d220c69858614d86...ef0a3a88a41"
}
}
Figure 64, Chapter 6
165
166. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
RQ5: How can Memento aggregators indicate that private Web archive content
requires special handling to be replayed, despite being aggregated with publicly
available Web archive content?
Aforementioned access attributes, interoperability with PWAA for acquiring and representing
tokens in StarMaps
● Aggregation + Private Archives:
○ Mink (JCDL 2014)
○ Framework for Aggregating Private & Public Web Archives (JCDL 2018)
Chapter 9
166
167. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
RQ6: What kinds of access control do users who create private Web archives
need to regulate access to their archives?
● Integrated the IPFS with WARC standard via InterPlanetary Wayback to allow for client-
side archival replay without needing to possess the archival holdings.
● We enabled symmetric encryption of personal archival holdings within InterPlanetary
Wayback
● Studies on Access Control + private archives:
○ InterPlanetary Wayback (JCDL 2016, TPDL 2016)
○ Framework for Aggregating Private & Public Web Archives (JCDL 2018)
Chapter 9
167
168. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Future Work
● Prefer with archives instead of more generic config keyword
○ Potential name clash at benefit of extensibility
● More elegant/semantic/syntactic expressions of negotiation attributes
○ Prefer : damage="<0.5" is limited by RFC/HTTP syntax
● Public-private inter-archive leakage
● Inference of private-public based on holdings beyond boolean classification
● Further investigations of private archives’ contents, inherently inaccessible
○ Which the Mementity Framework mitigates
Section 9.2
168
169. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Conclusions
Content behind authentication
on the live Web can now be
preserved and systematically
replayed
Section 9.3
169
170. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Hierarchy of Aggregated Public-Private Archives
170
171. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Conclusions
171
→{ }→{ }
→{ }
Collective Experience of the past Web
Can be Aggregated
Section 9.3
172. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Conclusions Access to Private Archives Directly or
Through Sharing Can Be Systematic
172
Section 9.3
Figure 81, Chapter 7
173. PhD Defense: Mat Kelly
May 7, 2019
@machawk1 • @WebSciDL
Conclusions Enabled Client-Side Aggregation
and Archival Supplementation
173
Figures 94 and 95, Chapter 8
Section 9.3
174. PhD Defense: Mat Kelly
May 7, 2019
Research Dissemination
● 17 publications that address 6 RQs →
● 5 Best Paper/Poster awards/nominations
● 28 research blog posts for ODU WS-DL
● Collaborations with 8 different institutions (inc. ODU)
● 4 first-author, open-source, publicly released
software packages (collab. on 3 others from ODU)