Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
1. Scripts in a Frame:
A Two-Tiered Approach for Archiving
Deferred Representations
Justin F. Brunelle
Dissertation Defense
February 5, 2016
Committee Members:
Michael L. Nelson
Michele C. Weigle
Elizabeth J. Vincelette
Irwin B. Levinstein
8. New ones are annoying…for now.
8
“Why are your parents wrestling?”
9. Today’s ads are
missing from the
archives
9
http://adserver.adtechus.com/addyn/3.0/5399.1/2394397/0/-
1/QUANTCAST;;size=300x250;target=_blank;alias=p36-
17b4f9us2qmzc8bn;kvp36=p36-17b4f9us2qmzc8bn;sub1=p-
4UZr_j7rCm_Aj;kvl=172802;kvc=794676;kvs=300x250;kvi=c052a80
3d0b5476f0bd2f2043ef237e27cd48019;kva=p-
4UZr_j7rCm_Aj;rdclick=http://exch.quantserve.com/r?a=p-
4UZr_j7rCm_Aj;labels=_qc.clk,_click.adserver.rtb,_click.rand.85854;
rtbip=192.184.64.144;rtbdata2=EAQaFUhSQmxvY2tfMjAxNlRheFNlY
XNvbiCZiRcogsYKMLTAMDoSaHR0cDovL3d3dy5jbm4uY29tWihUUEh
wYlUzM3ZqeFU5LTA1SGZEMk1SXzE0anBVcGU0d0dxTG10STFUdUs2I
ECAAb_JicoFoAEBqAGhy7YCugEoVFBIcGJVMzN2anhVOS0wNUhmR
DJNUl8xNGpwVXBlNHdHcUxtdEkxVMAB3ed3yAGUp7GUqSraAShjM
DUyYTgwM2QwYjU0NzZmMGJkMmYyMDQzZWYyMzdlMjdjZDQ4M
DE55QHvEWs-
6AFkmAK2wQqoAgWoAgawAgi6AgTAuECQwAICyAIA0ALe9baMj4Co
s-oB
10. JavaScript is hard to replay
What happens when things are completely lost?
http://ws-dl.blogspot.com/2013/11/2013-11-28-replaying-sopa-protest.html
10
11. Remember SOPA? And the protest?
11
https://en.wikipedia.org/wiki/Stop_Online_Piracy_Act
https://en.wikipedia.org/wiki/Protests_against_SOPA_and_PIPA
15. Why archive?
The Internet Archive has everything!
Why didn’t you back it up?
Participating institutions can hand over their databases.
15
16. Crimean Conflict
Russian troops captured the Crimean Center for Investigative Journalism
Gunman: “We will try to agree on the correct truthful coverage of events.”
16
http://gijn.org/2014/03/02/masked-gunmen-seize-crimean-investigative-journalism-center/
18. How?
Masked gunmen have your servers
Where are your backups?
Transactional archive? Too late!
Preservation over HTTP
20. Any future discussion of the 21st
century will involve the web and
the web archives
20
22. Any future discussion of the 21st
century will involve the web and
the web archives
But JavaScript is hard to archive, resulting in archives of
content as seen by crawlers rather than as seen by users
22
Goal: Mitigate the impact of JavaScript on the archives
by making crawlers behave like users
30.
URI-R: Original Resource Identifier (page on the live web)
URI-M: Memento Identifier (archived version of a page)
URI-T: TimeMap Identifier (list of archived pages)
35. Web Browsing Process
35
User-controlled interaction
Environment variables → content negotiation
Client-controlled representation changes
HTTP GET Request for Resource R
HTTP 200 OK Response: R Content
Browser renders and displays R
JavaScript requests embedded resources
Server returns embedded resources
R updates its representation
36. Web Browsing Process
36
There is no longer “the” representation.
At any given time, users get “a” representation.
GeoIP: Washington, D.C.
URI-R: http://www.wunderground.com/
GeoIP: Suffolk, VA
URI-R: http://www.wunderground.com/
39. Web Browsing Process
HTTP GET Request for Resource R
HTTP 200 OK Response: R Content
Browser renders and displays R
JavaScript requests embedded resources
Server returns embedded resources
R updates its representation
Archival Tools stop here
41. Web Browsing Process
HTTP GET Request for Resource R
HTTP 200 OK Response: R Content
Browser renders and displays R
JavaScript requests embedded resources
Server returns embedded resources
R updates its representation
Archival Tools stop here
Still not solved!
43. Research Questions
RQ1. To what extent does JavaScript impact archival tools?
RQ2. How do we measure memento quality?
RQ3. How can we crawl, archive, and play back deferred
representations?
43
46. Measuring JavaScript
1,000 URIs from Twitter
1,000 URIs from Archive-it
Dataset available at http://www.cs.odu.edu/~jbrunelle/jsDataSet.txt
Capture with tools
Study the archivability
46
“The impact of JavaScript on archivability”, 2015, International Journal of Digital Libraries
60. Leakage increasing over time
Increased JavaScript -> increases in missing embedded resources
61. Leakage increasing over time
• 73.1% of all missing embedded mementos are loaded via JavaScript
• 33% increase in missing embedded mementos from JavaScript between 2005-2012
63.
“Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources”, JCDL 2014, International Journal of Digital Libraries, 2015
65. “Live” XKCD
• Take three resources:
• Logo
• Main Comic
• Navigation Strip
• Relative importance?
• All present in “Live” XKCD
65
66. Damaging XKCD
• Created a local memento
• Removed the logo and navigation strip
• Now missing 29% of embedded resources
• Human assessment: looks OK
66
68. Damaging XKCD
• From our local memento
• Removed the Main Comic
• Now missing 24% of embedded resources
• Human assessment: Not a usable memento
• Percent of missing embedded resources is not a suitable metric for memento quality
68
72. Missing CSS
• More important than thought
• Calculated the amount of content in each vertical third
• If >=80% of the content is in the left column and CSS is missing, CSS is important
• Only performed if stylesheets are missing
72
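The stylesheet heuristic above can be sketched in a few lines. This is a minimal illustration, not the dissertation's implementation: it assumes that "amount of content" is measured as the rendered area of content bounding boxes, and the function names `content_share_by_third` and `css_is_important` are hypothetical.

```python
def content_share_by_third(boxes, page_width):
    """Return the fraction of content area in the left, middle, and right thirds.

    boxes is a list of (x_left, width, height) rectangles for rendered content.
    Measuring content by rectangle area is an assumption made for this sketch.
    """
    third = page_width / 3
    areas = [0.0, 0.0, 0.0]
    for x, w, h in boxes:
        for k in range(3):
            lo, hi = k * third, (k + 1) * third
            # Horizontal overlap between the box and the k-th vertical third.
            overlap = max(0.0, min(x + w, hi) - max(x, lo))
            areas[k] += overlap * h
    total = sum(areas)
    if total == 0:
        return [0.0, 0.0, 0.0]
    return [a / total for a in areas]


def css_is_important(boxes, page_width, stylesheet_missing, threshold=0.8):
    """Apply the >=80% left-column rule, only when a stylesheet is missing."""
    if not stylesheet_missing:
        return False
    left_share = content_share_by_third(boxes, page_width)[0]
    return left_share >= threshold


# A page whose content has collapsed into the left edge (no layout applied).
collapsed = [(0, 250, 100), (0, 280, 50)]
print(css_is_important(collapsed, 900, stylesheet_missing=True))
```

The intuition is that a page rendered without its stylesheet tends to stack all content against the left margin, so a heavily left-weighted layout signals that the missing CSS mattered.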
73. Methodology
• Defined Dm and Mm metrics
$$ M_m = \frac{\text{Missing Embedded Resources}}{\text{All Embedded Resources}} $$
$$ D_m = \frac{\sum_{i=1}^{n_{\text{missing resources}}} w_i}{\sum_{j=1}^{n_{\text{all resources}}} w_j} $$
• Used Amazon Mechanical Turkers to assess web user perception of
quality
• Assessed Dm versus Mm in manually damaged pages
• Assessed Dm versus Mm in the archives
73
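To make the difference between the two metrics concrete, here is a small sketch that computes both for one page. The weights are invented for illustration only (the slides do not give the actual weighting), and the names `EmbeddedResource`, `m_m`, and `d_m` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EmbeddedResource:
    uri: str
    weight: float  # importance weight w; values below are made up
    missing: bool  # True if the memento failed to capture this resource


def m_m(resources):
    """Unweighted metric: missing embedded resources / all embedded resources."""
    if not resources:
        return 0.0
    return sum(1 for r in resources if r.missing) / len(resources)


def d_m(resources):
    """Weighted metric: summed weight of missing resources / summed weight of all."""
    total = sum(r.weight for r in resources)
    if total == 0:
        return 0.0
    return sum(r.weight for r in resources if r.missing) / total


# Two low-importance images are missing; the high-importance one survives.
page = [
    EmbeddedResource("logo.png", 1.0, True),
    EmbeddedResource("nav.png", 1.0, True),
    EmbeddedResource("comic.png", 8.0, False),
]
print(m_m(page), d_m(page))
```

With these illustrative weights the page is missing two of three resources by count, but only a fifth of its total weight, which mirrors the observation that a raw percentage of missing resources does not track perceived quality.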
74. Turk Results
[Figure panels: Live vs. Manually Damaged (Dm); Mementos from Internet Archive, Agreement with Dm; Mementos from Internet Archive, Agreement with Mm. Reference line: 50/50 Chance]
75. Damage in the Archives
75
Internet Archive WebCite
Mementos with deferred representations have 13.5% higher
damage rating
78. Two-Tiered Crawling
“Archiving Deferred Representations Using a Two-Tiered Crawling Approach”, iPRES2015
“Adapting the Hypercube Model to Archive Deferred Representations at Web-Scale”, Technical Report, arXiv:1601.05142, 2016
80.
<script> tags alone are not indicative of a deferred representation. JavaScript can be played back in the archives!
Current workflow not suitable for deferred representations
Use PhantomJS to run JavaScript, interact with the representation
Two-tiered crawling approach to optimize performance
More URI-Rs in the
crawl frontier
Runs more slowly but
more deeply
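The two-tiered idea can be sketched as a crawl loop that routes each URI-R to one of two fetchers. This is a toy model under stated assumptions: `is_deferred` stands in for the classifier, `fast_fetch` for a conventional crawler that does not run JavaScript, and `browser_fetch` for a headless browser such as PhantomJS. All names and the toy link graph are hypothetical, and no real crawler API is used.

```python
from collections import deque


def two_tiered_crawl(seeds, is_deferred, fast_fetch, browser_fetch, limit=100):
    """Crawl from seeds, sending each URI-R to one of two tiers.

    is_deferred(uri)   -> bool, stand-in for the classifier
    fast_fetch(uri)    -> links found without running JavaScript
    browser_fetch(uri) -> links found after running JavaScript
    Returns a list of (uri, tier) pairs in crawl order.
    """
    frontier = deque(dict.fromkeys(seeds))  # de-duplicated, order preserved
    seen = set(frontier)
    visited = []
    while frontier and len(visited) < limit:
        uri = frontier.popleft()
        if is_deferred(uri):
            tier, links = "browser", browser_fetch(uri)
        else:
            tier, links = "fast", fast_fetch(uri)
        visited.append((uri, tier))
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Toy web: "a" has one link in its HTML and one added only by JavaScript.
STATIC_LINKS = {"a": ["b"], "b": [], "c": []}
SCRIPTED_LINKS = {"a": ["c"]}


def toy_fast(uri):
    return list(STATIC_LINKS.get(uri, []))


def toy_browser(uri):
    return toy_fast(uri) + SCRIPTED_LINKS.get(uri, [])


print(two_tiered_crawl(["a"], lambda u: u in SCRIPTED_LINKS, toy_fast, toy_browser))
```

In this toy graph, a crawl that never uses the browser tier does not discover "c" at all, which is the frontier-size effect the slide describes; the cost is that the browser tier is the slower path.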
81. Comparing Performance
• Crawled 10,000 URI-Rs
Dataset available at http://www.cs.odu.edu/~jbrunelle/wsdl/10kuris.txt
• Compare crawl speed & discovered frontier size
• With and without classifier
• Code available at https://github.com/jbrunelle/classifyDeferred/
81
84. Classifier
We are omitting a discussion about the classifier
for deferred vs. nondeferred representations
Please see Section 7.4 in the
dissertation for a detailed discussion
84
85. Descendants = States of deferred representations
reached through client-side events
85
Click Pan Zoom
86. Crawling descendants
• Interactions represented as N-ary tree G
• FSM: M = (S, s0, Σ, δ)
‒ S is the finite set of client states
‒ s0 ∈ S is the initial state reached by dereferencing the URI-R and executing the initial on-load events
‒ e ∈ Σ defines the client-side event e as a member of the set of all events Σ
‒ δ : S × Σ → S is the transition function in which a client-side event is executed and leads to a new state
si, sj ∈ S
δ(si, e) = sj
e = client-side event
j = i + 1
86
“Adapting the Hypercube Model to Archive Deferred Representations at Web-Scale”, Technical Report, arXiv:1601.05142, 2016
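The state machine above suggests a breadth-first exploration of descendants. The following is a minimal sketch, not the method from the technical report: it models δ as a Python callable that returns `None` when an event does not apply in a state (an assumption, since the formal δ is total), and the state and event names are invented.

```python
from collections import deque


def crawl_descendants(s0, events, delta, max_depth=2):
    """Breadth-first walk from s0, firing every event in every known state.

    Returns {state: depth}, where depth counts the events needed to reach
    the state from s0 (so s0 has depth 0).
    """
    depth = {s0: 0}
    queue = deque([s0])
    while queue:
        state = queue.popleft()
        if depth[state] == max_depth:
            continue  # do not expand states at the depth limit
        for event in events:
            nxt = delta(state, event)
            if nxt is not None and nxt not in depth:
                depth[nxt] = depth[state] + 1
                queue.append(nxt)
    return depth


# Hypothetical transition table for one page.
TRANSITIONS = {
    ("s0", "click"): "s1a",
    ("s0", "pan"): "s1b",
    ("s1a", "click"): "s2a",
    ("s2a", "zoom"): "s3",
}


def toy_delta(state, event):
    return TRANSITIONS.get((state, event))


print(crawl_descendants("s0", ["click", "pan", "zoom"], toy_delta, max_depth=2))
```

The `max_depth` parameter bounds how far the crawler follows chains of events; in the toy table, the state `s3` is never reached with a limit of 2.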
89. 89
Interaction Trees are 2 Levels Deep
http://www.bloomberg.com/bw/articles/2014-06-16/open-plan-offices-for-people-who-hate-open-plan-offices
s0
s1
s2
92. Expanding the Crawl Frontier
92
Level s1 provides the greatest benefit to the crawl frontier
Nondeferred
Deferred
95. Storage Impact of Two-Tiered Crawling
IIPC-proposed JSON metadata of interactions, resulting descendants
–Potentially used to resolve URI-M collisions
–16.5KB WARC metadata
–143MB for total dataset
11.4 times larger for deferred vs nondeferred
Totals 5.12 times more storage per URI-R for total dataset
95
2013
97. Future Work
• Modeling user interactions, tendencies, and simulation
– Form filling
– Click and navigation likelihood
• Evaluating success of crawling deferred representations
– Random walks through the archives
– Dm vs Mm of mementos of deferred representations
• Archival Halting Problem: How much is enough?
– Mapping applications: how many pans and zooms get all the Norfolk, VA Google map tiles?
– How many CNN.com pages get all the Google Ads?
• Playing back WARCs with IIPC metadata of deferred
representations and descendants
97
99. RQ1. To what extent does JavaScript impact
archival tools?
Contributions:
• Defined and identified zombie resources
• Adoption of JavaScript correlates with
missing embedded resources in mementos
• Defined deferred representations
• Showed that deferred representations have
reduced archivability
99
For more information, reference:
2012: ws-dl.blogspot.com
2013: TPDL2013
2015: iPRES2015
2015: IJDL
2015: IJDL
Section 4.3
Ch. 5
Ch. 2
Ch. 5
100. RQ2. How do we measure memento quality?
Contributions:
• Mm is not accurate (worse than coin-flip)
• Created Dm metric
• Dm is closer to user perception than Mm
• Mementos of deferred representations
have higher Dm than
nondeferred representations
100
For more information, reference:
2015: JCDL2015
2015: IJDL Special Issue
Ch. 6
Section 6.6
101. RQ3. How can we crawl, archive, and play
back deferred representations?
Contributions:
• Defined a framework for archiving deferred
representations
• Showed that the framework will crawl more
slowly but more thoroughly
• Defined descendants, showed that they are
2-levels deep
• Showed the storage impact of crawling
descendants and deferred representations
101
For more information, reference:
2015: iPRES2015
2016: arXiv:1601.05142
Ch. 7
Ch. 7
102. Summary
• Measured the impact of JavaScript on the archives
• Quantified damage caused by JavaScript
• Measured the cost in time and space to archive JavaScript
Provides policy makers information to make decisions regarding
JavaScript handling in crawling and archiving
Quantified an intuitive understanding of crawling deferred
representations at web scale
102
104. Publications
Year | RQ | Venue | Abbreviated Title | Notes
2012 | - | JCDL2012 Doctoral Consortium | Capturing Dynamic Web | -
2013 | - | JCDL2013 | TimeMap Caching | -
2013 | RQ1 | TPDL2013 | Archivability Over Time | -
2013 | - | TPDL2013 | Transactional Archiving | -
2013 | RQ1 | DLib Magazine 19(11/12) | Identifying Mementos | -
2014 | RQ2 | JCDL2014 | Measuring Memento Damage | Best Student Paper
2015 | RQ1 | International Journal of Digital Libraries | Measuring Impact of JavaScript | -
2015 | RQ2 | International Journal of Digital Libraries | Measuring Memento Damage | JCDL2015 Special Issue
2015 | - | JCDL2015 | Merging Mobile and Desktop | Best Poster
2015 | RQ3 | iPRES2015 | Two-Tiered Crawling | -
2016 | RQ3 | Technical Report, arXiv:1601.05142 | Hypercube Model for Archiving | -
2016 | - | DLib Magazine 22(1/2) | Archiving Corporate Intranets | -
105. Publications
• Justin F. Brunelle “Filling in the Blanks: Capturing the
Dynamic Web”, JCDL 2012 Doctoral Consortium
• Justin F. Brunelle, Michael L. Nelson “An Evaluation of
Caching Policies for Memento TimeMaps”, JCDL 2013
• Mat Kelly, Justin F. Brunelle, Michele C. Weigle, Michael L.
Nelson, “On the Change in Archivability of Websites Over
Time”, TPDL 2013
• Justin F. Brunelle, Michael L. Nelson, Lyudmila Balakireva,
Robert Sanderson, Herbert Van de Sompel, “Evaluating the
SiteStory Transactional Web Archive With the
ApacheBench Tool”, TPDL 2013
• Mat Kelly, Justin F. Brunelle, Michele C. Weigle, and Michael
L. Nelson, “A Method for Identifying Personalized
Representations in Web Archives”, D-Lib Magazine,
19(11/12), 2013.
• Justin F. Brunelle, Mat Kelly, Hany SalahEldeen, Michele C.
Weigle, and Michael L. Nelson “Not All Mementos Are
Created Equal: Measuring The Impact Of Missing
Resources”, JCDL 2014
• Justin F. Brunelle, Mat Kelly, Michele C. Weigle, and Michael
L. Nelson “The impact of JavaScript on archivability”, 2015,
IJDL
• Wesley Jordan, Mat Kelly, Justin F. Brunelle, Laura Vobrak,
Michele C. Weigle, and Michael L. Nelson “Mobile Mink:
Merging Mobile and Desktop Archived Webs”, JCDL 2015
• Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson,
“Archiving Deferred Representations Using a Two-Tiered
Crawling Approach”, iPRES2015
• Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson,
“Adapting the Hypercube Model to Archive Deferred
Representations at Web-Scale”, Technical Report,
arXiv:1601.05142, 2016
• Justin F. Brunelle, Krista Ferrante, Eliot Wilczek, Michele C.
Weigle, and Michael L. Nelson, “Leveraging Heritrix and the
Wayback Machine on a corporate intranet: A case study on
improving corporate archives”, DLib Magazine, 22(1/2)
2016
105
106. Mobile Mink: Merging Mobile and
Desktop Archived Webs
Wesley Jordan, Mat Kelly, Justin F. Brunelle,
Laura Vobrak, Michele C. Weigle, Michael L. Nelson
Acknowledgements
This work was supported in part by the NEH (HK-50181). This work was performed as part of Wesley Jordan’s mentorship at The MITRE Corporation. The author’s affiliation with The MITRE Corporation is provided for identification purposes only, and is not intended to convey or imply MITRE’s concurrence with, or support for, the positions, opinions or viewpoints expressed by the author.
http://bitly.com/MobileMink/
More about Mobile Mink
Desktop URIs are much
more prevalent than their
mobile counterparts in the
archives because crawlers
use desktop user-agent
strings.
Corresponding Mobile URIs
are archived less frequently
even though the
representations are different
than their desktop
counterparts.
http://espn.go.com/ http://m.espn.go.com/
Same
ESPN,
different
URIs,
different
HTML,
different
TimeMaps.
Browse to a URI-R
Potential content-
negotiation from
user-agent
Access tool from the
“Share” menu
MobileMink merges TimeMaps of
http://espn.go.com & http://m.espn.go.com/
Desktop and mobile webs differ and
the linkage between them is lost in the
archives
Discovers mobile and
desktop URI-Rs
Uses Memento to get
all available
TimeMaps
Provides integrated
TimeMap
Offers users ability
to submit mobile
and desktop URI-Rs
to archives
Increases
coverage of mobile
URI-Rs in the
archives
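The TimeMap merge that Mobile Mink performs can be illustrated with a short sketch. It assumes each TimeMap has already been fetched and reduced to a list of (14-digit timestamp, URI-M) pairs sorted by timestamp; this representation and the name `merge_timemaps` are assumptions for illustration, not Mobile Mink's actual code. The two URI-Ms are the ESPN examples that appear elsewhere in this deck.

```python
import heapq

# One memento each for the desktop and the mobile URI-R.
desktop_timemap = [
    ("20140330125315",
     "http://web.archive.org/web/20140330125315/http://espn.go.com/"),
]
mobile_timemap = [
    ("20140330125414",
     "http://web.archive.org/web/20140330125414/http://m.espn.go.com/wireless/"),
]


def merge_timemaps(*timemaps):
    """Merge already-sorted TimeMaps into one list ordered by timestamp."""
    return list(heapq.merge(*timemaps))


for timestamp, uri_m in merge_timemaps(mobile_timemap, desktop_timemap):
    print(timestamp, uri_m)
```

Because the timestamps are fixed-width digit strings, plain string comparison orders them chronologically, so no date parsing is needed for the merge itself.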
114. Damage in the Internet Archive
• Measured Internet Archive mementos
• Damage generally improves over time
• Despite missing more resources over time
114
120. Publications
Master’s:
• Kyle Dempsey, Justin Brunelle, G. Tanner Jackson, Chutima Boonthum, Irwin
Levinstein, Danielle McNamara. “MiBoard: Multiplayer Interactive Board
Game”, AIED2009
• Justin F. Brunelle, Irwin B. Levinstein, Chutima Boonthum. “MiBoard:
Metacognitive Training Through Gaming in iSTART”, 2009 VMASC Capstone
Conference
• Best paper in track
• Justin F. Brunelle, Kyle B Dempsey, G. Tanner Jackson, Chutima Boonthum,
Irwin B. Levinstein, Danielle S. McNamara. “MiBoard: Metacognitive Training
Through Gaming”, SCiP2009
• Justin F. Brunelle, G. Tanner Jackson, Kyle Dempsey, Chutima Boonthum,
Irwin B. Levinstein, Danielle S. McNamara. “Analysis of MiBoard as an iSTART
Practice Tool”, FLAIRS-24, 2010
• Kyle Dempsey, G. Tanner Jackson, Justin Brunelle, Michael Rowe, Danielle
McNamara. “MiBoard: Assessing Collaborative Learning Through Game-
Based Practice”, FLAIRS-24, 2010
PhD:
• Justin F. Brunelle “Filling in the Blanks: Capturing the Dynamic Web”, JCDL
2012 Doctoral Consortium
• Justin F. Brunelle, Michael L. Nelson “An Evaluation of Caching Policies for
Memento TimeMaps”, JCDL 2013
• Mat Kelly, Justin F. Brunelle, Michele C. Weigle, Michael L. Nelson, “On the
Change in Archivability of Websites Over Time”, TPDL 2013
• Justin F. Brunelle, Michael L. Nelson, Lyudmila Balakireva, Robert Sanderson,
Herbert Van de Sompel, “Evaluating the SiteStory Transactional Web Archive
With the ApacheBench Tool”, TPDL 2013
• Mat Kelly, Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson, “A
Method for Identifying Personalized Representations in Web Archives”, D-
Lib Magazine, 19(11/12), 2013.
• Justin F. Brunelle, Mat Kelly, Hany SalahEldeen, Michele C. Weigle, and
Michael L. Nelson “Not All Mementos Are Created Equal: Measuring The
Impact Of Missing Resources”, JCDL 2014
• Best Student Paper, International Journal of Digital Libraries: JCDL2015 Special
Issue
• Justin F. Brunelle, Mat Kelly, Michele C. Weigle, and Michael L. Nelson “The
impact of JavaScript on archivability”, 2015, IJDL
• Wesley Jordan, Mat Kelly, Justin F. Brunelle, Laura Vobrak, Michele C. Weigle,
and Michael L. Nelson “Mobile Mink: Merging Mobile and Desktop Archived
Webs”, JCDL 2015
• Best Poster
• Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson, “Archiving
Deferred Representations Using a Two-Tiered Crawling Approach”,
iPRES2015
• Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson, “Adapting the
Hypercube Model to Archive Deferred Representations at Web-Scale”,
Technical Report, arXiv:1601.05142, 2016
• Justin F. Brunelle, Krista Ferrante, Eliot Wilczek, Michele C. Weigle, and
Michael L. Nelson, “Leveraging Heritrix and the Wayback Machine on a
corporate intranet: A case study on improving corporate archives”, DLib
Magazine, 2016
120
122. Mobile Sites in the Archives
122
http://m.espn.go.com/wireless/ http://espn.go.com/
“A Method for Identifying Personalized Representations in Web Archives”, D-Lib Magazine, 2013
123. Mobile Sites in the Archives
123
http://m.espn.go.com/wireless/ http://espn.go.com/
URI-M: http://web.archive.org/web/20140330125315/http://espn.go.com/
URI-M: http://web.archive.org/web/20140330125414/http://m.espn.go.com/wireless/
124. Collisions in the Archives
124
http://www.cnn.com/
URI-M? URI-T?
http://web.archive.org/web/[DATETIME]/http://www.cnn.com/
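A short sketch shows why this URI-M pattern leads to collisions. It builds and parses URI-Ms of the form shown above, assuming [DATETIME] is a 14-digit YYYYMMDDhhmmss timestamp as in the ESPN examples in this deck; the helper names `make_uri_m` and `split_uri_m` and the capture time are hypothetical.

```python
import re
from datetime import datetime

WAYBACK_PREFIX = "http://web.archive.org/web/"
URI_M_PATTERN = re.compile(r"^http://web\.archive\.org/web/(\d{14})/(.+)$")


def make_uri_m(uri_r, captured_at):
    """Build a URI-M from a URI-R and a capture datetime."""
    return WAYBACK_PREFIX + captured_at.strftime("%Y%m%d%H%M%S") + "/" + uri_r


def split_uri_m(uri_m):
    """Recover (capture datetime, URI-R) from a URI-M."""
    match = URI_M_PATTERN.match(uri_m)
    if match is None:
        raise ValueError("not a Wayback-style URI-M: " + uri_m)
    stamp, uri_r = match.groups()
    return datetime.strptime(stamp, "%Y%m%d%H%M%S"), uri_r


# Two captures of the same URI-R at the same second (illustrative time),
# one made with a desktop user-agent and one with a mobile user-agent.
when = datetime(2014, 3, 30, 12, 53, 15)
desktop_capture = make_uri_m("http://www.cnn.com/", when)
mobile_capture = make_uri_m("http://www.cnn.com/", when)
print(desktop_capture == mobile_capture)
```

The identifier carries only the datetime and the URI-R, so the two captures are indistinguishable by URI-M even though the representations differ; nothing in the key records the user-agent, GeoIP, or interaction that produced them.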
125. Need a better way to index mementos
• URI-R is no longer enough
• Environmental factors:
‒ Content negotiation
‒ Interaction
‒ Personalization
‒ GeoIP
125
126. Content Negotiation
Server-side interpretation of client-provided parameters
Multiple representations, single resource
126
[Diagram: a URI identifies a Resource; content negotiation on the user-agent selects between Representation 1 and Representation 2 (Mobile, Desktop), each of which represents the Resource]