Scripts in a Frame:
A Two-Tiered Approach for Archiving
Deferred Representations
Justin F. Brunelle
Dissertation Defense
February 5, 2016
Committee Members:
• Michael L. Nelson
• Michele C. Weigle
• Elizabeth J. Vincelette
• Irwin B. Levinstein
A simpler time…
2
Mass hysteria. Human sacrifices. Dogs and
cats living together.
3
<iframe><script>…</script></iframe>
4
5
http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html
Missing resources (bad)
2008
6
http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html
2008
2012
Missing resources (bad) and
Temporal violations (worse)
Old ads are interesting
7
New ones are annoying…for now.
8
“Why are your parents wrestling?”
Today’s ads are
missing from the
archives
9
http://adserver.adtechus.com/addyn/3.0/5399.1/2394397/0/-
1/QUANTCAST;;size=300x250;target=_blank;alias=p36-
17b4f9us2qmzc8bn;kvp36=p36-17b4f9us2qmzc8bn;sub1=p-
4UZr_j7rCm_Aj;kvl=172802;kvc=794676;kvs=300x250;kvi=c052a80
3d0b5476f0bd2f2043ef237e27cd48019;kva=p-
4UZr_j7rCm_Aj;rdclick=http://exch.quantserve.com/r?a=p-
4UZr_j7rCm_Aj;labels=_qc.clk,_click.adserver.rtb,_click.rand.85854;
rtbip=192.184.64.144;rtbdata2=EAQaFUhSQmxvY2tfMjAxNlRheFNlY
XNvbiCZiRcogsYKMLTAMDoSaHR0cDovL3d3dy5jbm4uY29tWihUUEh
wYlUzM3ZqeFU5LTA1SGZEMk1SXzE0anBVcGU0d0dxTG10STFUdUs2I
ECAAb_JicoFoAEBqAGhy7YCugEoVFBIcGJVMzN2anhVOS0wNUhmR
DJNUl8xNGpwVXBlNHdHcUxtdEkxVMAB3ed3yAGUp7GUqSraAShjM
DUyYTgwM2QwYjU0NzZmMGJkMmYyMDQzZWYyMzdlMjdjZDQ4M
DE55QHvEWs-
6AFkmAK2wQqoAgWoAgawAgi6AgTAuECQwAICyAIA0ALe9baMj4Co
s-oB
JavaScript is hard to replay
What happens when things are completely lost?
http://ws-dl.blogspot.com/2013/11/2013-11-28-replaying-sopa-protest.html
10
Remember SOPA? And the protest?
11
https://en.wikipedia.org/wiki/Stop_Online_Piracy_Act
https://en.wikipedia.org/wiki/Protests_against_SOPA_and_PIPA
http://en.wikipedia.org/wiki/Main_Page (January 18th, 2012)
12
http://web.archive.org/web/20120118110520/http://en.wikipedia.org/wiki/Main_Page (January 18th, 2012)
13
14
Problem!
The archives contain the Web as
seen by crawlers
Why archive?
The Internet Archive has everything!
Why didn’t you back it up?
Participating institutions can hand over their databases.
15
Crimean Conflict
Russian troops captured the Crimean Center for Investigative
Journalism
Gunman: “We will try to agree on the correct truthful coverage of events.”
16
http://gijn.org/2014/03/02/masked-gunmen-seize-crimean-investigative-journalism-center/
Archive-It to the rescue!
17
How?
• Masked gunmen have your servers
• Where are your backups?
• Transactional archive? Too late!
18
Preservation over HTTP
Any future discussion of the 21st
century will involve the web and
the web archives
20
But JavaScript is hard to archive, resulting in archives of
content as seen by crawlers rather than as seen by users
21
Goal: Mitigate the impact of JavaScript on the archives
by making crawlers behave like users
23
Motivating Examples
Background Information
Research Questions
Measuring the Impact of JavaScript
Measuring Memento Quality
Crawling Deferred Representations
Future Work
Conclusions
Some Institutional Archives
24
Some Page-at-a-time Archivers
25
Some Archival Tools
26
1: http://warcreate.com/
2: http://matkelly.com/wail/
Memento Framework
27
http://mementoweb.org/guide/rfc/
Machine-readable bidirectional link between the past and present web
28
29
30
• URI-R: Original Resource Identifier (page on the live web)
• URI-M: Memento Identifier (archived version of a page)
• URI-T: TimeMap Identifier (list of archived pages)
Web Architecture
31
Dereference a URI, get a
representation
JavaScript makes requests for new resources
after the initial page load
32
http://maps.google.com
Identifies
Represents
Deferred Representation
33
http://maps.google.com
Identifies
Represents
JavaScript != Deferred
34
Deferred
HTTP GET, HTTP GET, HTTP GET, HTTP GET
onload
Nondeferred
HTTP GET
Web Browsing Process
35
• User-controlled
• Interaction
• Environment variables → content negotiation
• Client-controlled representation changes
HTTP GET Request for Resource R
HTTP 200 OK Response: R Content
Browser renders and displays R
JavaScript requests embedded resources
Server returns embedded resources
R updates its representation
Web Browsing Process
36
There is no longer “the” representation.
At any given time, users get “a” representation.
GeoIP: Washington, D.C.
URI-R: http://www.wunderground.com/
GeoIP: Suffolk, VA
URI-R: http://www.wunderground.com/
The Internet Archive got everything, right?
37
Missing tiles, not interactive
38
HTTP GET Request for Resource R
HTTP 200 OK Response: R Content
Browser renders and displays R
JavaScript requests embedded resources
Server returns embedded resources
R updates its representation
Web Browsing Process
39
Archival Tools stop here
Still not solved!
42
Research Questions
RQ1. To what extent does JavaScript impact archival tools?
RQ2. How do we measure memento quality?
RQ3. How can we crawl, archive, and play back deferred
representations?
43
44
Measuring the Impact of JavaScript
2015 2013
Zombies!
45
2008
2012
Measuring JavaScript
• 1,000 URIs from Twitter
• 1,000 URIs from Archive-It
Dataset available at http://www.cs.odu.edu/~jbrunelle/jsDataSet.txt
• Capture with tools
• Study the archivability
46
“The impact of JavaScript on archivability”, 2015, International Journal of Digital Libraries
Good
47
Good
48
Good
49
Meh
50
Meh
51
Bad
52
Bad
53
Bad
54
Bad
55
Bad
56
Leakage by archival tool
57
Twitter has more leakage than Archive-It
Leakage by archival tool
58
Wayback reduces leakage the most
Leakage -> Zombies
59
12% increase in embedded mementos loaded via JavaScript
Leakage increasing over time
60
Increased JavaScript -> increases in missing embedded resources
61
• 73.1% of all missing embedded mementos are loaded via JavaScript
• 33% increase in missing embedded mementos from JavaScript between 2005 and 2012
Leakage increasing over time
62
Measuring Memento Quality
2015
2014
63
“Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources”, JCDL 2014,
International Journal of Digital Libraries, 2015
VS.
63
“Live” XKCD
• Missing 17% of embedded resources
• Looks complete
64
“Live” XKCD
• Take three resources:
• Logo
• Main Comic
• Navigation Strip
• Relative importance?
• All present in “Live” XKCD
65
Damaging XKCD
• Created a local memento
• Removed the logo and navigation strip
• Now missing 29% of embedded resources
• Human assessment: looks OK
66
Damaging XKCD
• From our local memento
• Removed the Main Comic
• Now missing 24% of embedded resources
• Human assessment: not a usable memento
67
• Percent of missing embedded resources is not a suitable metric for memento quality
68
Image Importance
• Size (as percentage of all pixels)
• Position (in viewport?)
• Centrality (in the vertical or horizontal center?)
71
Missing CSS
• More important than thought
• Calculated the amount of content in each vertical third
• If >=80% in left column and missing CSS, CSS is important
• Only performed if stylesheets are missing
72
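The missing-CSS heuristic above can be sketched in a few lines. This is a minimal illustration, not the dissertation's implementation: it assumes "amount of content" is measured as bounding-box area, that each box is a hypothetical (x, width, height) tuple, and that a box is assigned to the third containing its horizontal center. The function names are invented for this example.

```python
def content_share_by_third(boxes, page_width):
    """Return the share of content area in the left, middle, and right thirds.

    boxes is a list of (x, width, height) tuples (an assumed input format).
    """
    totals = [0.0, 0.0, 0.0]
    for x, width, height in boxes:
        center = x + width / 2
        # Clamp the index so boxes centered outside the page still count.
        third = min(max(int(3 * center / page_width), 0), 2)
        totals[third] += width * height
    area = sum(totals)
    return [t / area if area else 0.0 for t in totals]


def css_is_important(boxes, page_width, stylesheet_missing, threshold=0.8):
    """Flag missing CSS as important when content collapses into the left third."""
    if not stylesheet_missing:
        # The check is only performed when stylesheets are missing.
        return False
    left_share = content_share_by_third(boxes, page_width)[0]
    return left_share >= threshold


# Hypothetical page: two large boxes on the left, one small box on the right.
example_boxes = [(0, 300, 100), (10, 200, 50), (700, 100, 10)]
```

With `example_boxes` on a 900-pixel-wide page, nearly all content area sits in the left third, so a missing stylesheet would be flagged as important.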
Methodology
• Defined Dm and Mm metrics
\[ M_m = \frac{\text{Missing Embedded Resources}}{\text{All Embedded Resources}} \]
\[ D_m = \frac{\sum_{i=1}^{n_{\text{missing resources}}} w_i}{\sum_{j=1}^{n_{\text{all resources}}} w_j} \]
• Used Amazon Mechanical Turkers to assess web user perception of quality
• Assessed Dm versus Mm in manually damaged pages
• Assessed Dm versus Mm in the archives
73
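A small worked example may help contrast the two metrics. The sketch below assumes each embedded resource is a dictionary with a `missing` flag and an importance `weight`; the weights are made up for illustration and are not the dissertation's measured values.

```python
def percent_missing(resources):
    """Mm: fraction of embedded resources that are missing."""
    if not resources:
        return 0.0
    return sum(1 for r in resources if r["missing"]) / len(resources)


def damage_rating(resources):
    """Dm: weight of missing resources over the weight of all resources."""
    total = sum(r["weight"] for r in resources)
    if total == 0:
        return 0.0
    return sum(r["weight"] for r in resources if r["missing"]) / total


# Hypothetical weights: the main comic dominates the page.
page = [
    {"name": "logo", "weight": 1, "missing": False},
    {"name": "main comic", "weight": 8, "missing": True},
    {"name": "navigation strip", "weight": 1, "missing": False},
]
mm = percent_missing(page)  # one of three resources is missing
dm = damage_rating(page)    # but it carries most of the weight
```

Here Mm is one third while Dm is 0.8, which mirrors the point of the XKCD example: losing one important resource can matter more than losing several unimportant ones.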
Turk Results
74
[Figure: Turk results. Series: live vs. manually damaged pages (Dm); mementos from the Internet Archive, agreement with Dm; mementos from the Internet Archive, agreement with Mm; 50/50 chance baseline]
Damage in the Archives
75
Internet Archive WebCite
Mementos with deferred representations have 13.5% higher damage rating
76
Crawling Deferred Representations
2015 2016
77
Current Workflow
• Dereference URI-Rs
• Archive representation
• Extract embedded URI-Rs
• Repeat
78
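The four workflow steps can be sketched as a loop. This is an illustrative sketch using only the Python standard library, not how Heritrix is implemented; `fetch` is a hypothetical callable that returns HTML for a URI-R (or None), so the loop can be exercised without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class EmbeddedURIExtractor(HTMLParser):
    """Collect href/src attribute values as absolute URI-Rs."""

    def __init__(self, base_uri):
        super().__init__()
        self.base_uri = base_uri
        self.uris = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.uris.append(urljoin(self.base_uri, value))


def crawl(seeds, fetch, limit=100):
    """Dereference, archive, extract embedded URI-Rs, repeat."""
    frontier = deque(seeds)
    archive = {}
    while frontier and len(archive) < limit:
        uri_r = frontier.popleft()
        if uri_r in archive:
            continue
        body = fetch(uri_r)      # dereference the URI-R
        if body is None:
            continue
        archive[uri_r] = body    # archive the representation
        parser = EmbeddedURIExtractor(uri_r)
        parser.feed(body)        # extract embedded URI-Rs from the markup
        frontier.extend(u for u in parser.uris if u not in archive)
    return archive
```

Note that the loop only reads URIs that appear in the markup. It never executes JavaScript, so any resource requested by a script after page load never reaches the frontier.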
Two-Tiered Crawling
“Archiving Deferred Representations Using a Two-Tiered Crawling Approach”, iPRES2015
“Adapting the Hypercube Model to Archive Deferred Representations at Web-Scale”, Technical Report, arXiv:1601.05142, 2016
79
<script> tags alone are not indicative of a deferred
representation. JavaScript can be played back in the
archives!
Current workflow not suitable for deferred representations
Use PhantomJS to run JavaScript, interact with the representation
Two-tiered crawling approach to optimize performance
More URI-Rs in the crawl frontier
Runs more slowly but more deeply
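One way to picture the two tiers is as a dispatcher: a fast crawler handles every URI-R, and only representations classified as deferred are handed to the slower, JavaScript-executing crawler. The sketch below is an assumption-laden illustration, not the dissertation's code. All three callables (`fetch_static`, `is_deferred`, `fetch_with_browser`) are hypothetical stand-ins for a Heritrix-like crawler, the deferred/nondeferred classifier, and a PhantomJS-like crawler.

```python
def two_tiered_crawl(uri_rs, fetch_static, is_deferred, fetch_with_browser):
    """Build a crawl frontier, using the slow tier only where it is needed.

    fetch_static(uri_r)       -> (html, iterable of URI-Rs found in markup)
    is_deferred(html)         -> bool, a classifier for deferred representations
    fetch_with_browser(uri_r) -> iterable of URI-Rs requested while running JavaScript
    """
    frontier = set()
    second_tier = []
    for uri_r in uri_rs:
        html, links = fetch_static(uri_r)   # tier 1: fast, no JavaScript
        frontier.update(links)
        if is_deferred(html):
            second_tier.append(uri_r)
            frontier.update(fetch_with_browser(uri_r))  # tier 2: slow, deeper
    return frontier, second_tier
```

The design choice is a trade of speed for depth: nondeferred pages pay only the cost of the first tier, while deferred pages contribute the extra URI-Rs that only appear once scripts run.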
Comparing Performance
• Crawled 10,000 URI-Rs
Dataset available at http://www.cs.odu.edu/~jbrunelle/wsdl/10kuris.txt
• Compare crawl speed & discovered frontier size
• With and without classifier
• Code available at https://github.com/jbrunelle/classifyDeferred/
81
Performance: Frontier Size
82
PhantomJS creates a 1.5x larger crawl frontier than Heritrix
Performance: Crawl Speed
83
Heritrix: ~2 URIs/second
PhantomJS: ~4 seconds/URI
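To put the two rates on the same scale, the arithmetic below applies them to the 10,000 URI-R dataset. It is a back-of-the-envelope sketch that assumes the approximate rates hold uniformly and that each crawl runs serially.

```python
URI_COUNT = 10_000
HERITRIX_URIS_PER_SECOND = 2   # ~2 URIs/second
PHANTOMJS_SECONDS_PER_URI = 4  # ~4 seconds/URI

heritrix_seconds = URI_COUNT / HERITRIX_URIS_PER_SECOND    # 5,000 s, about 83 minutes
phantomjs_seconds = URI_COUNT * PHANTOMJS_SECONDS_PER_URI  # 40,000 s, about 11 hours
slowdown = phantomjs_seconds / heritrix_seconds            # PhantomJS is ~8x slower per URI
```

Under these assumptions, crawling everything with PhantomJS costs roughly eight times the wall-clock time of Heritrix, which is the motivation for sending only deferred representations to the slower tier.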
Classifier
We are omitting a discussion about the classifier for deferred vs. nondeferred representations.
Please see Section 7.4 in the dissertation for a detailed discussion.
84
Descendants = States of deferred representations reached through client-side events
85
Click Pan Zoom
Crawling descendants
• Interactions represented as N-ary tree G
• FSM: M = (S, s0, Σ, δ)
‒ S is the finite set of client states
‒ s0 ∈ S is the initial state reached by dereferencing the URI-R and executing the initial onload events
‒ e ∈ Σ defines the client-side event e as a member of the set of all events Σ
‒ δ : S × Σ → S is the transition function in which a client-side event is executed and leads to a new state
si, sj ∈ S
δ(si, e) = sj
e = client-side event
j = i + 1
86
“Adapting the Hypercube Model to Archive Deferred Representations at Web-Scale”, Technical Report, arXiv:1601.05142, 2016
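The state machine above can be explored breadth-first, where each level of the tree is one more client-side event away from s0. The sketch below is a minimal illustration under stated assumptions: states are hashable labels, `delta` is a hypothetical transition function returning the next state (or None when an event changes nothing), and the default depth limit of 2 is borrowed from the deck's observation that interaction trees are two levels deep.

```python
from collections import deque


def crawl_descendants(s0, events, delta, max_depth=2):
    """Return a dict mapping each reachable client state to its tree level."""
    level = {s0: 0}
    queue = deque([s0])
    while queue:
        state = queue.popleft()
        if level[state] == max_depth:
            continue  # do not expand past the depth limit
        for event in events:
            nxt = delta(state, event)  # delta(si, e) = sj
            if nxt is not None and nxt not in level:
                level[nxt] = level[state] + 1  # j = i + 1
                queue.append(nxt)
    return level


# Toy transition table (hypothetical): keys are (state, event) pairs.
transitions = {
    ("s0", "click"): "s1a",
    ("s0", "pan"): "s1b",
    ("s1a", "click"): "s2a",
    ("s2a", "click"): "s3a",
}
levels = crawl_descendants(
    "s0", ["click", "pan", "zoom"], lambda s, e: transitions.get((s, e))
)
```

In this toy example the state `s3a` is never visited because it lies below the depth limit.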
87
http://www.bloomberg.com/bw/articles/2014-06-16/open-plan-offices-for-people-who-hate-open-plan-offices
s0
s1
s2
Interaction Trees are 2 Levels Deep
Expanding the Crawl Frontier
92
Level s1 provides the greatest benefit to the crawl frontier
Nondeferred
Deferred
Crawling Descendants
93
New embedded resources at level s1 are largely unarchived
Crawling Descendants
94
Level s1 has the highest cost-benefit Return on Investment
Storage Impact of Two-Tiered Crawling
• IIPC-proposed JSON metadata of interactions, resulting descendants
– Potentially used to resolve URI-M collisions
– 16.5KB WARC metadata
– 143MB for total dataset
• 11.4 times larger for deferred vs nondeferred
• Totals 5.12 times more storage per URI-R for total dataset
95
2013
96
Future Work
• Modeling user interactions, tendencies, and simulation
– Form filling
– Click and navigation likelihood
• Evaluating success of crawling deferred representations
– Random walks through the archives
– Dm vs Mm of mementos of deferred representations
• Archival Halting Problem: How much is enough?
– Mapping applications: how many pans and zooms get all the Norfolk, VA Google map tiles?
– How many CNN.com pages get all the Google Ads?
• Playing back WARCs with IIPC metadata of deferred representations and descendants
97
98
Conclusions
RQ1. To what extent does JavaScript impact archival tools?
Contributions:
• Defined and identified zombie resources
• Adoption of JavaScript correlates with missing embedded resources in mementos
• Defined deferred representations
• Showed that deferred representations have reduced archivability
99
For more information, reference:
2012: ws-dl.blogspot.com
2013: TPDL2013
2015: iPRES2015
2015: IJDL
2015: IJDL
Section 4.3
Ch. 5
Ch. 2
Ch. 5
RQ2. How do we measure memento quality?
Contributions:
• Mm is not accurate (worse than coin-flip)
• Created Dm metric
• Dm is closer to user perception than Mm
• Mementos of deferred representations have higher Dm than nondeferred representations
100
For more information, reference:
2015: JCDL2015
2015: IJDL Special Issue
Ch. 6
Section 6.6
RQ3. How can we crawl, archive, and play back deferred representations?
Contributions:
• Defined a framework for archiving deferred representations
• Showed that the framework will crawl more slowly but more thoroughly
• Defined descendants, showed that they are 2 levels deep
• Showed the storage impact of crawling descendants and deferred representations
101
For more information, reference:
2015: iPRES2015
2016: arXiv:1601.05142
Ch. 7
Ch. 7
Summary
• Measured the impact of JavaScript on the archives
• Quantified damage caused by JavaScript
• Measured the cost in time and space to archive JavaScript
Provides policy makers information to make decisions regarding JavaScript handling in crawling and archiving
Quantified an intuitive understanding of crawling deferred representations at web scale
102
Backups
103
104
Publications
Year | RQ | Venue | Abbreviated Title | Notes
2012 | - | JCDL2012 Doctoral Consortium | Capturing Dynamic Web | -
2013 | - | JCDL2013 | TimeMap Caching | -
2013 | RQ1 | TPDL2013 | Archivability Over Time | -
2013 | - | TPDL2013 | Transactional Archiving | -
2013 | RQ1 | DLib Magazine 19(11/12) | Identifying Mementos | -
2014 | RQ2 | JCDL2014 | Measuring Memento Damage | Best Student Paper
2015 | RQ1 | International Journal of Digital Libraries | Measuring Impact of JavaScript | -
2015 | RQ2 | International Journal of Digital Libraries | Measuring Memento Damage | JCDL2015 Special Issue
2015 | - | JCDL2015 | Merging Mobile and Desktop | Best Poster
2015 | RQ3 | iPRES2015 | Two-Tiered Crawling | -
2016 | RQ3 | Technical Report, arXiv:1601.05142 | Hypercube Model for Archiving | -
2016 | - | DLib Magazine 22(1/2) | Archiving Corporate Intranets | -
• Justin F. Brunelle “Filling in the Blanks: Capturing the
Dynamic Web”, JCDL 2012 Doctoral Consortium
• Justin F. Brunelle, Michael L. Nelson “An Evaluation of
Caching Policies for Memento TimeMaps”, JCDL 2013
• Mat Kelly, Justin F. Brunelle, Michele C. Weigle, Michael L.
Nelson, “On the Change in Archivability of Websites Over
Time”, TPDL 2013
• Justin F. Brunelle, Michael L. Nelson, Lyudmila Balakireva,
Robert Sanderson, Herbert Van de Sompel, “Evaluating the
SiteStory Transactional Web Archive With the
ApacheBench Tool”, TPDL 2013
• Mat Kelly, Justin F. Brunelle, Michele C. Weigle, and Michael
L. Nelson, “A Method for Identifying Personalized
Representations in Web Archives”, D-Lib Magazine,
19(11/12), 2013.
• Justin F. Brunelle, Mat Kelly, Hany SalahEldeen, Michele C.
Weigle, and Michael L. Nelson “Not All Mementos Are
Created Equal: Measuring The Impact Of Missing
Resources”, JCDL 2014
• Justin F. Brunelle, Mat Kelly, Michele C. Weigle, and Michael
L. Nelson “The impact of JavaScript on archivability”, 2015,
IJDL
• Wesley Jordan, Mat Kelly, Justin F. Brunelle, Laura Vobrak,
Michele C. Weigle, and Michael L. Nelson “Mobile Mink:
Merging Mobile and Desktop Archived Webs”, JCDL 2015
• Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson,
“Archiving Deferred Representations Using a Two-Tiered
Crawling Approach”, iPRES2015
• Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson,
“Adapting the Hypercube Model to Archive Deferred
Representations at Web-Scale”, Technical Report,
arXiv:1601.05142, 2016
• Justin F. Brunelle, Krista Ferrante, Eliot Wilczek, Michele C.
Weigle, and Michael L. Nelson, “Leveraging Heritrix and the
Wayback Machine on a corporate intranet: A case study on
improving corporate archives”, DLib Magazine, 22(1/2)
2016
105
Mobile Mink: Merging Mobile and
Desktop Archived Webs
Wesley Jordan, Mat Kelly, Justin F. Brunelle,
Laura Vobrak, Michele C. Weigle, Michael L. Nelson
This work supported in part by the NEH HK-50181. This work
was performed as part of Wesley Jordan’s mentorship at The
MITRE Corporation. The author’s affiliation with The MITRE
Corporation is provided for identification purposes only, and is
not intended to convey or imply MITRE’s concurrence with, or
support for, the positions, opinions or viewpoints expressed by
the author.
Acknowledgements
http://bitly.com/MobileMink/
More about Mobile Mink
Desktop URIs are much more prevalent than their mobile counterparts in the archives because crawlers use desktop user-agent strings.
Corresponding mobile URIs are archived less frequently even though the representations are different than their desktop counterparts.
http://espn.go.com/ http://m.espn.go.com/
Same ESPN, different URIs, different HTML, different TimeMaps.
• Browse to a URI-R
• Potential content negotiation from user-agent
• Access tool from the “Share” menu
MobileMink merges TimeMaps of
http://espn.go.com & http://m.espn.go.com/
Desktop and mobile webs differ and
the linkage between them is lost in the
archives
• Discovers mobile and desktop URI-Rs
• Uses Memento to get all available TimeMaps
• Provides integrated TimeMap
• Offers users ability to submit mobile and desktop URI-Rs to archives
• Increases coverage of mobile URI-Rs in the archives
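The "integrated TimeMap" step amounts to merging two time-ordered lists of mementos. The sketch below is not Mobile Mink's code (which is not shown in this deck); it assumes each TimeMap has already been fetched and reduced to a list of hypothetical (Memento-Datetime, URI-M) tuples, and uses `heapq.merge` from the standard library.

```python
import heapq
from datetime import datetime


def merge_timemaps(*timemaps):
    """Merge TimeMaps into one list ordered by Memento-Datetime.

    Each TimeMap is a list of (datetime, uri_m) tuples (an assumed format).
    """
    # heapq.merge expects sorted inputs, so sort each TimeMap first.
    return list(heapq.merge(*(sorted(t) for t in timemaps)))


desktop = [
    (datetime(2014, 3, 30, 12, 53, 15),
     "http://web.archive.org/web/20140330125315/http://espn.go.com/"),
]
mobile = [
    (datetime(2014, 3, 30, 12, 54, 14),
     "http://web.archive.org/web/20140330125414/http://m.espn.go.com/wireless/"),
]
integrated = merge_timemaps(mobile, desktop)
```

The merged list interleaves desktop and mobile mementos by capture time, so a user sees one timeline for what is conceptually the same site.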
HTTP Request
$ curl -i -v http://www.cs.odu.edu/
> GET / HTTP/1.1
> User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7
NSS/3.13.1.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2
> Host: www.cs.odu.edu
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: nginx
< Date: Tue, 25 Mar 2014 23:42:38 GMT
< Content-Type: text/html
< Transfer-Encoding: chunked
< Connection: keep-alive
<
107
HTTP Response
HTTP/1.1 200 OK
Server: nginx
Date: Tue, 25 Mar 2014 23:40:09 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: keep-alive
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<!-- saved from url=(0036)http://www.cs.odu.edu/newcssite/new/ -->
<!-- saved from url=(0019)http://sci.odu.edu/ -->
<HTML xmlns:st1 = "urn:schemas-microsoft-com:office:smarttags">
<HEAD>
<meta name="verify-v1" content="CXMn8RoyhZpl9fsKpbgxtiFw3kIdHD51r/ntbf1Rrcw=" >
<TITLE>Department Of Computer Science</TITLE>
108
Client-side code modifies
the DOM
109
Internet Archive URI-M
110
http://web.archive.org/web/20140314130018/http://espn.go.com/
Archive Prefix Memento-DateTime URI-R
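The three parts of an Internet Archive URI-M can be separated with plain string handling. The sketch below assumes the `http://web.archive.org/web/` prefix and a 14-digit timestamp, as in the example above; the function name is invented for illustration, and other archives or URI-M variants are out of scope.

```python
from datetime import datetime

ARCHIVE_PREFIX = "http://web.archive.org/web/"


def parse_uri_m(uri_m):
    """Split a URI-M into (archive prefix, Memento-Datetime, URI-R)."""
    if not uri_m.startswith(ARCHIVE_PREFIX):
        raise ValueError("not an Internet Archive URI-M: " + uri_m)
    # The timestamp ends at the first slash after the prefix.
    stamp, _, uri_r = uri_m[len(ARCHIVE_PREFIX):].partition("/")
    memento_datetime = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    return ARCHIVE_PREFIX, memento_datetime, uri_r


prefix, when, uri_r = parse_uri_m(
    "http://web.archive.org/web/20140314130018/http://espn.go.com/"
)
```

For the example URI-M this yields a Memento-Datetime of 14 March 2014, 13:00:18, and the URI-R `http://espn.go.com/`.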
Deferred Representations
Representation is incomplete
Client-side code execution completes the build of the representation
111
Web Browsing Process
112
Deferred
representations
Percent Missing vs. Weighted Damage
• M_M = Percent of embedded resources missing
\[ M_M = \frac{\text{Embedded Resources Missing}}{\text{Total Embedded Resources}} \]
• D_M = Damage rating of missing embedded resources
\[ D_M = \frac{D_{M_{\text{Actual}}}}{D_{M_{\text{Potential}}}} \]
\[ D_{M_{\text{Potential}}} = \sum_{i=1}^{n_{[I|MM]}} \frac{D_{[I|MM]}(i)}{n_{[I|MM]}} + \sum_{i=1}^{n_{[C]}} \frac{D_{[C]}(i)}{n_{[C]}} \]
113
I = Image
MM = MultiMedia
C = CSS
Damage in the Internet Archive
• Measured Internet Archive mementos
• Damage generally improves over time
• Despite missing more resources over time
114
Expanding the crawl frontier
115
Click events lead to the most descendants
Related Work
116
Deep Web
• Deferred=Deep (Bergman, 2001)
• Mobile requires context (Schneider, 2013)
• Static → Dynamic Web (Rosenthal, 2011)(IIPC, 2012)
• Crawlers & deep Web (Ast, 2008) (B. He, 2007) (Y. He, 2013)
• Google’s deep Web crawler (Madhavan, 2008)
• Forms (Ntoulas, 2005)
117
Archive Quality
• SHARC, Quality Conscious Archiving (Spaniol, 2009)
• Quality of archives (Spaniol, 2009, 2009)
• Archiveready (Banos, 2013, 2015)
• Acid test (Kelly, 2014)
• Block Importance (Ye, 2003) (Fersini, 2008) (Kohlschutter, 2010)
118
Monitoring for Security
• Ripley (Vikram, 2009)
• Mugshot (Mickens, 2010)
• ActionShot (Li, 2010)
• Ajax testing and states (Mesbah, 2007, 2008, 2009, 2009, 2012)
• Crawling Ajax (Dincturk, 2013, 2014)
119
Publications
Master’s:
• Kyle Dempsey, Justin Brunelle, G. Tanner Jackson, Chutima Boonthum, Irwin
Levinstein, Danielle McNamara. “MiBoard: Multiplayer Interactive Board
Game”, AIED2009
• Justin F. Brunelle, Irwin B. Levinstein, Chutima Boonthum. “MiBoard:
Metacognitive Training Through Gaming in iSTART”, 2009 VMASC Capstone
Conference
• Best paper in track
• Justin F. Brunelle, Kyle B Dempsey, G. Tanner Jackson, Chutima Boonthum,
Irwin B. Levinstein, Danielle S. McNamara. “MiBoard: Metacognitive Training
Through Gaming”, SCiP2009
• Justin F. Brunelle, G. Tanner Jackson, Kyle Dempsey, Chutima Boonthum,
Irwin B. Levinstein, Danielle S. McNamara. “Analysis of MiBoard as an iSTART
Practice Tool”, FLAIRS-24, 2010
• Kyle Dempsey, G. Tanner Jackson, Justin Brunelle, Michael Rowe, Danielle
McNamara. “MiBoard: Assessing Collaborative Learning Through Game-
Based Practice”, FLAIRS-24, 2010
120
Performance with classifier
121
Mobile Sites in the Archives
122
http://m.espn.go.com/wireless/ http://espn.go.com/
“A Method for Identifying Personalized Representations in Web Archives”, D-Lib Magazine, 2013
Mobile Sites in the Archives
123
http://m.espn.go.com/wireless/ http://espn.go.com/
URI-M: http://web.archive.org/web/20140330125315/http://espn.go.com/
URI-M: http://web.archive.org/web/20140330125414/http://m.espn.go.com/wireless/
Collisions in the Archives
124
http://www.cnn.com/
URI-M? URI-T?
http://web.archive.org/web/[DATETIME]/http://www.cnn.com/
Need a better way to index mementos
• URI-R is no longer enough
• Environmental factors:
‒ Content negotiation
‒ Interaction
‒ Personalization
‒ GeoIP
125
Content Negotiation
• Server-side interpretation of client-provided parameters
• Multiple representations, single resource
126
[Diagram: a URI identifies a Resource; Representation 1 and Representation 2 each represent it; content negotiation on the user-agent selects the mobile or desktop representation]
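Server-side content negotiation on the user-agent can be sketched as a function from a request header to one of several representations of the same resource. This is a simplified illustration: the token list is an assumption made for this example, and real servers use far more elaborate detection or redirect to a separate mobile URI instead.

```python
# Hypothetical substrings that mark a mobile user-agent (an assumption).
MOBILE_TOKENS = ("mobile", "android", "iphone")


def negotiate(user_agent, representations):
    """Pick the representation of a single resource based on the User-Agent.

    representations maps "mobile" and "desktop" to response bodies.
    """
    ua = user_agent.lower()
    kind = "mobile" if any(token in ua for token in MOBILE_TOKENS) else "desktop"
    return kind, representations[kind]


representations = {
    "mobile": "<html>mobile layout</html>",
    "desktop": "<html>desktop layout</html>",
}
crawler_view = negotiate("curl/7.19.7 (x86_64-redhat-linux-gnu)", representations)
phone_view = negotiate("Mozilla/5.0 (iPhone; CPU iPhone OS 7_0) Mobile/11A465", representations)
```

A crawler that always sends a desktop user-agent only ever receives the desktop representation, which is why the mobile representation of the same URI-R tends to be absent from the archives.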

Quantitative Digital Backchannel: Developing a Web-Based Audience Response Sy...Educational Technology
 
Software Analytics: Towards Software Mining that Matters (2014)
Software Analytics:Towards Software Mining that Matters (2014)Software Analytics:Towards Software Mining that Matters (2014)
Software Analytics: Towards Software Mining that Matters (2014)Tao Xie
 
Introduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and PythonIntroduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and PythonJo-fai Chow
 
Silk Data - Recommendations
Silk Data - RecommendationsSilk Data - Recommendations
Silk Data - RecommendationsNikolay Karelin
 
Semantic Web: In Quest for the Next Generation Killer Apps
Semantic Web: In Quest for the Next Generation Killer AppsSemantic Web: In Quest for the Next Generation Killer Apps
Semantic Web: In Quest for the Next Generation Killer AppsJie Bao
 
How HTML5 missed its graduation - #TrondheimDC
How HTML5 missed its graduation - #TrondheimDCHow HTML5 missed its graduation - #TrondheimDC
How HTML5 missed its graduation - #TrondheimDCChristian Heilmann
 
Software Architecture and Predictive Models in R
Software Architecture and Predictive Models in RSoftware Architecture and Predictive Models in R
Software Architecture and Predictive Models in RHarlan Harris
 
Making Great User Experiences at Cleveland C# .Net Meetup July 27 2017
Making Great User Experiences at Cleveland C# .Net Meetup July 27 2017Making Great User Experiences at Cleveland C# .Net Meetup July 27 2017
Making Great User Experiences at Cleveland C# .Net Meetup July 27 2017Carol Smith
 
Enhancing performance in an open-source CMS ecosystem
Enhancing performance in an open-source CMS ecosystemEnhancing performance in an open-source CMS ecosystem
Enhancing performance in an open-source CMS ecosystemFelix Arntz
 
Applications for Social Networking Strategies in an Agency Context: Exploitin...
Applications for Social Networking Strategies in an Agency Context: Exploitin...Applications for Social Networking Strategies in an Agency Context: Exploitin...
Applications for Social Networking Strategies in an Agency Context: Exploitin...BoaB Team
 
Analysing image collections with the computer vision network approach
Analysing image collections with  the computer vision network approachAnalysing image collections with  the computer vision network approach
Analysing image collections with the computer vision network approachJanna Joceli Omena
 
1,2,3 … Testing : Is this thing on(line)? with Mike Martin
1,2,3 … Testing : Is this thing on(line)? with Mike Martin1,2,3 … Testing : Is this thing on(line)? with Mike Martin
1,2,3 … Testing : Is this thing on(line)? with Mike MartinNETUserGroupBern
 
Chasing web-based malware
Chasing web-based malwareChasing web-based malware
Chasing web-based malwareFACE
 
Class 39: ...and the World Wide Web
Class 39: ...and the World Wide WebClass 39: ...and the World Wide Web
Class 39: ...and the World Wide WebDavid Evans
 
2014 01-ticosa
2014 01-ticosa2014 01-ticosa
2014 01-ticosaPharo
 
Architecting an ASP.NET MVC Solution
Architecting an ASP.NET MVC SolutionArchitecting an ASP.NET MVC Solution
Architecting an ASP.NET MVC SolutionAndrea Saltarello
 

Similar a Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations (20)

Filling in the Blanks: Capturing Dynamically Generated Content
Filling in the Blanks: Capturing Dynamically Generated ContentFilling in the Blanks: Capturing Dynamically Generated Content
Filling in the Blanks: Capturing Dynamically Generated Content
 
Javascript library toolbox
Javascript library toolboxJavascript library toolbox
Javascript library toolbox
 
W3 C Intro And Beyond - Eyal Sela
W3 C Intro And Beyond - Eyal SelaW3 C Intro And Beyond - Eyal Sela
W3 C Intro And Beyond - Eyal Sela
 
Quantitative Digital Backchannel: Developing a Web-Based Audience Response Sy...
Quantitative Digital Backchannel: Developing a Web-Based Audience Response Sy...Quantitative Digital Backchannel: Developing a Web-Based Audience Response Sy...
Quantitative Digital Backchannel: Developing a Web-Based Audience Response Sy...
 
Software Analytics: Towards Software Mining that Matters (2014)
Software Analytics:Towards Software Mining that Matters (2014)Software Analytics:Towards Software Mining that Matters (2014)
Software Analytics: Towards Software Mining that Matters (2014)
 
Introduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and PythonIntroduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and Python
 
Silk Data - Recommendations
Silk Data - RecommendationsSilk Data - Recommendations
Silk Data - Recommendations
 
Semantic Web: In Quest for the Next Generation Killer Apps
Semantic Web: In Quest for the Next Generation Killer AppsSemantic Web: In Quest for the Next Generation Killer Apps
Semantic Web: In Quest for the Next Generation Killer Apps
 
How HTML5 missed its graduation - #TrondheimDC
How HTML5 missed its graduation - #TrondheimDCHow HTML5 missed its graduation - #TrondheimDC
How HTML5 missed its graduation - #TrondheimDC
 
Software Architecture and Predictive Models in R
Software Architecture and Predictive Models in RSoftware Architecture and Predictive Models in R
Software Architecture and Predictive Models in R
 
Making Great User Experiences at Cleveland C# .Net Meetup July 27 2017
Making Great User Experiences at Cleveland C# .Net Meetup July 27 2017Making Great User Experiences at Cleveland C# .Net Meetup July 27 2017
Making Great User Experiences at Cleveland C# .Net Meetup July 27 2017
 
Enhancing performance in an open-source CMS ecosystem
Enhancing performance in an open-source CMS ecosystemEnhancing performance in an open-source CMS ecosystem
Enhancing performance in an open-source CMS ecosystem
 
Applications for Social Networking Strategies in an Agency Context: Exploitin...
Applications for Social Networking Strategies in an Agency Context: Exploitin...Applications for Social Networking Strategies in an Agency Context: Exploitin...
Applications for Social Networking Strategies in an Agency Context: Exploitin...
 
Analysing image collections with the computer vision network approach
Analysing image collections with  the computer vision network approachAnalysing image collections with  the computer vision network approach
Analysing image collections with the computer vision network approach
 
1,2,3 … Testing : Is this thing on(line)? with Mike Martin
1,2,3 … Testing : Is this thing on(line)? with Mike Martin1,2,3 … Testing : Is this thing on(line)? with Mike Martin
1,2,3 … Testing : Is this thing on(line)? with Mike Martin
 
Chasing web-based malware
Chasing web-based malwareChasing web-based malware
Chasing web-based malware
 
Class 39: ...and the World Wide Web
Class 39: ...and the World Wide WebClass 39: ...and the World Wide Web
Class 39: ...and the World Wide Web
 
2014 01-ticosa
2014 01-ticosa2014 01-ticosa
2014 01-ticosa
 
October 28, 2015 NISO Virtual Conference Interacting with Content: Improving ...
October 28, 2015 NISO Virtual Conference Interacting with Content: Improving ...October 28, 2015 NISO Virtual Conference Interacting with Content: Improving ...
October 28, 2015 NISO Virtual Conference Interacting with Content: Improving ...
 
Architecting an ASP.NET MVC Solution
Architecting an ASP.NET MVC SolutionArchitecting an ASP.NET MVC Solution
Architecting an ASP.NET MVC Solution
 

Último

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 

Último (20)

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 

Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations

  • 1. Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations. Justin F. Brunelle. Dissertation Defense, February 5, 2016. Committee Members: Michael L. Nelson, Michele C. Weigle, Elizabeth J. Vincelette, Irwin B. Levinstein
  • 3. Mass hysteria. Human sacrifices. Dogs and cats living together. <iframe><script>…</script></iframe>
  • 7. Old ads are interesting
  • 8. New ones are annoying…for now. “Why are your parents wrestling?”
  • 9. Today’s ads are missing from the archives 9 http://adserver.adtechus.com/addyn/3.0/5399.1/2394397/0/- 1/QUANTCAST;;size=300x250;target=_blank;alias=p36- 17b4f9us2qmzc8bn;kvp36=p36-17b4f9us2qmzc8bn;sub1=p- 4UZr_j7rCm_Aj;kvl=172802;kvc=794676;kvs=300x250;kvi=c052a80 3d0b5476f0bd2f2043ef237e27cd48019;kva=p- 4UZr_j7rCm_Aj;rdclick=http://exch.quantserve.com/r?a=p- 4UZr_j7rCm_Aj;labels=_qc.clk,_click.adserver.rtb,_click.rand.85854; rtbip=192.184.64.144;rtbdata2=EAQaFUhSQmxvY2tfMjAxNlRheFNlY XNvbiCZiRcogsYKMLTAMDoSaHR0cDovL3d3dy5jbm4uY29tWihUUEh wYlUzM3ZqeFU5LTA1SGZEMk1SXzE0anBVcGU0d0dxTG10STFUdUs2I ECAAb_JicoFoAEBqAGhy7YCugEoVFBIcGJVMzN2anhVOS0wNUhmR DJNUl8xNGpwVXBlNHdHcUxtdEkxVMAB3ed3yAGUp7GUqSraAShjM DUyYTgwM2QwYjU0NzZmMGJkMmYyMDQzZWYyMzdlMjdjZDQ4M DE55QHvEWs- 6AFkmAK2wQqoAgWoAgawAgi6AgTAuECQwAICyAIA0ALe9baMj4Co s-oB
  • 10. JavaScript is hard to replay. What happens when things are completely lost? http://ws-dl.blogspot.com/2013/11/2013-11-28-replaying-sopa-protest.html
  • 11. Remember SOPA? And the protest? https://en.wikipedia.org/wiki/Stop_Online_Piracy_Act https://en.wikipedia.org/wiki/Protests_against_SOPA_and_PIPA
  • 14. Problem! The archives contain the Web as seen by crawlers
  • 15. Why archive? The Internet Archive has everything! Why didn’t you back it up? Participating institutions can hand over their databases.
  • 16. Crimean Conflict: Russian troops captured the Crimean Center for Investigative Journalism. Gunman: “We will try to agree on the correct truthful coverage of events.” http://gijn.org/2014/03/02/masked-gunmen-seize-crimean-investigative-journalism-center/
  • 17. Archive-It to the rescue!
  • 18. How? Masked gunmen have your servers. Where are your backups? Transactional archive? Too late! Preservation over HTTP
  • 22. Any future discussion of the 21st century will involve the web and the web archives. But JavaScript is hard to archive, resulting in archives of content as seen by crawlers rather than as seen by users. Goal: Mitigate the impact of JavaScript on the archives by making crawlers behave like users
  • 23. Outline: Motivating Examples; Background Information; Research Questions; Measuring the Impact of JavaScript; Measuring Memento Quality; Crawling Deferred Representations; Future Work; Conclusions
  • 26. Some Archival Tools: (1) http://warcreate.com/ (2) http://matkelly.com/wail/
  • 27. Memento Framework: machine-readable bidirectional link between the past and present web. http://mementoweb.org/guide/rfc/
  • 30. URI-R: Original Resource Identifier (page on the live web). URI-M: Memento Identifier (archived version of a page). URI-T: TimeMap Identifier (list of archived pages).
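
To make the URI-R / URI-M relationship concrete, here is a minimal sketch that splits a Wayback-style URI-M into its capture datetime and the URI-R it archives. It assumes the `<archive>/web/<14-digit timestamp>/<URI-R>` layout used by the Internet Archive's Wayback Machine (as in the Wikipedia example earlier in the deck); that layout is an archive convention, not something the Memento framework requires, and the function name `split_uri_m` is hypothetical.

```python
import re
from datetime import datetime

# Assumed Wayback-style layout: <scheme>://<host>/web/<YYYYMMDDhhmmss>/<URI-R>
URI_M_PATTERN = re.compile(
    r"^(?P<prefix>https?://[^/]+/web)/(?P<ts>\d{14})/(?P<uri_r>.+)$"
)


def split_uri_m(uri_m):
    """Return (uri_r, capture_datetime) for a Wayback-style URI-M."""
    match = URI_M_PATTERN.match(uri_m)
    if match is None:
        raise ValueError("not a Wayback-style URI-M: " + uri_m)
    # The 14-digit path segment encodes when the memento was captured.
    captured = datetime.strptime(match.group("ts"), "%Y%m%d%H%M%S")
    return match.group("uri_r"), captured
```

Other archives may lay out their URI-Ms differently, so a general client would rely on Memento's link relations rather than on URL parsing.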
  • 31. Web Architecture: dereference a URI, get a representation
  • 32. JavaScript makes requests for new resources after the initial page load. [Figure: http://maps.google.com identifies a resource; a representation represents it]
  • 34. JavaScript != Deferred. [Figure: HTTP GET requests relative to onload for deferred and nondeferred representations]
  • 35. Web Browsing Process: user-controlled; interaction; environment variables → content negotiation; client-controlled representation changes. Steps: (1) HTTP GET request for resource R; (2) HTTP 200 OK response with R's content; (3) browser renders and displays R; (4) JavaScript requests embedded resources; (5) server returns embedded resources; (6) R updates its representation.
  • 36. Web Browsing Process: there is no longer “the” representation. At any given time, users get “a” representation. Example: URI-R http://www.wunderground.com/ with GeoIP Washington, D.C. versus GeoIP Suffolk, VA.
  • 37. The Internet Archive got everything, right?
  • 38. Missing tiles, not interactive
  • 41. Web Browsing Process: (1) HTTP GET request for resource R; (2) HTTP 200 OK response with R's content; (3) browser renders and displays R; (4) JavaScript requests embedded resources; (5) server returns embedded resources; (6) R updates its representation. [Annotations: “Archival Tools stop here”; “Still not solved!”]
  • 43. Research Questions. RQ1: To what extent does JavaScript impact archival tools? RQ2: How do we measure memento quality? RQ3: How can we crawl, archive, and play back deferred representations?
  • 46. Measuring JavaScript: 1,000 URIs from Twitter; 1,000 URIs from Archive-It (dataset available at http://www.cs.odu.edu/~jbrunelle/jsDataSet.txt); capture with tools; study the archivability. “The impact of JavaScript on archivability”, 2015, International Journal of Digital Libraries
  • 57. Leakage by archival tool: Twitter has more leakage than Archive-It
  • 58. Leakage by archival tool: Wayback reduces leakage the most
  • 59. Leakage -> Zombies: 12% increase in embedded mementos loaded via JavaScript
  • 60. Leakage increasing over time: increased JavaScript -> increases in missing embedded resources
  • 61. Leakage increasing over time: 73.1% of all missing embedded mementos are loaded via JavaScript; 33% increase in missing embedded mementos from JavaScript between 2005 and 2012
  • 63. “Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources”, JCDL 2014; International Journal of Digital Libraries, 2015
  • 64. “Live” XKCD: missing 17% of embedded resources; looks complete
  • 65. “Live” XKCD: take three resources (logo, main comic, navigation strip). Relative importance? All present in “Live” XKCD
  • 66. Damaging XKCD: created a local memento; removed the logo and navigation strip; now missing 29% of embedded resources. Human assessment: looks OK
  • 67. Damaging XKCD: from our local memento, removed the main comic; now missing 24% of embedded resources. Human assessment: not a usable memento
  • 68. Damaging XKCD, conclusion: percent of missing embedded resources is not a suitable metric for memento quality
  • 69. Image Importance: size (as percentage of all pixels)
  • 70. Image Importance: size; position (in viewport?)
  • 71. Image Importance: size; position; centrality (in the vertical or horizontal center?)
  • 72. Missing CSS: more important than thought. Calculated the amount of content in each vertical third; if >=80% is in the left column and CSS is missing, CSS is important. Only performed if stylesheets are missing
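
The missing-CSS check on slide 72 can be sketched as a small predicate. This is only an illustration of the stated rule: how "amount of content" is measured is not given on the slide, so the three numbers passed in (content per vertical third, for example pixel counts) and the function name `css_is_important` are assumptions.

```python
def css_is_important(content_by_third, stylesheets_missing, threshold=0.80):
    """Apply the slide's rule: missing CSS matters when content piles up on the left.

    content_by_third: (left, middle, right) amounts of rendered content,
    in any consistent unit (an assumption; the slide does not name one).
    """
    # The check is only performed when stylesheets are actually missing.
    if not stylesheets_missing:
        return False
    left, middle, right = content_by_third
    total = left + middle + right
    if total == 0:
        return False
    # At least 80% of content in the left third suggests the layout collapsed.
    return left / total >= threshold
```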
  • 73. Methodology: defined Dm and Mm metrics, $M_m = \frac{\text{Missing Embedded Resources}}{\text{All Embedded Resources}}$ and $D_m = \frac{\sum_{i=1}^{n_{\text{missing resources}}} w_i}{\sum_{j=1}^{n_{\text{all resources}}} w_j}$; used Amazon Mechanical Turkers to assess web user perception of quality; assessed Dm versus Mm in manually damaged pages; assessed Dm versus Mm in the archives
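
As a rough illustration of why the two metrics can disagree, the sketch below computes Mm as an unweighted fraction and Dm as a weighted fraction over the same resource list. The dictionary layout, the function names, and any weights passed in are hypothetical; the real weights come from the importance factors (size, position, centrality, CSS) described on the preceding slides.

```python
def missing_fraction(resources):
    """M_m: share of embedded resources that are missing, all counted equally."""
    if not resources:
        return 0.0
    missing = sum(1 for r in resources if r["missing"])
    return missing / len(resources)


def damage_rating(resources):
    """D_m: summed weight of missing resources over summed weight of all resources."""
    total_weight = sum(r["weight"] for r in resources)
    if total_weight == 0:
        return 0.0
    missing_weight = sum(r["weight"] for r in resources if r["missing"])
    return missing_weight / total_weight
```

With illustrative weights of 1, 8, and 1 for a logo, a main comic, and a navigation strip, losing the logo and the strip gives a higher Mm but a lower Dm than losing only the comic, which mirrors the XKCD example.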
  • 74. Turk Results. [Figure: Turker agreement with Dm on live vs. manually damaged pages, and agreement with Dm and with Mm on mementos from the Internet Archive, each against a 50/50 chance line]
  • 75. Damage in the Archives (Internet Archive, WebCite): mementos with deferred representations have a 13.5% higher damage rating
  • 77. Current Workflow: dereference URI-Rs; archive representation; extract embedded URI-Rs; repeat
  • 78. Two-Tiered Crawling: “Archiving Deferred Representations Using a Two-Tiered Crawling Approach”, iPRES2015; “Adapting the Hypercube Model to Archive Deferred Representations at Web-Scale”, Technical Report, arXiv:1601.05142, 2016
  • 80. <script> tags alone are not indicative of a deferred representation. JavaScript can be played back in the archives! Current workflow not suitable for deferred representations. Use PhantomJS to run JavaScript and interact with the representation. Two-tiered crawling approach to optimize performance: more URI-Rs in the crawl frontier; runs more slowly but more deeply
  • 81. Comparing Performance: crawled 10,000 URI-Rs (dataset available at http://www.cs.odu.edu/~jbrunelle/wsdl/10kuris.txt); compare crawl speed and discovered frontier size, with and without classifier; code available at https://github.com/jbrunelle/classifyDeferred/
  • 82. Performance, Frontier Size: PhantomJS creates a 1.5x larger crawl frontier than Heritrix
  • 83. Performance, Crawl Speed: Heritrix ~2 URIs/second; PhantomJS ~4 seconds/URI
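
The two crawl speeds are quoted in different units (URIs per second versus seconds per URI), so a back-of-the-envelope sketch helps compare them. It uses only the approximate rates from the slide; the simplification that each URI-R is handled by exactly one crawler, and the function name `estimated_crawl_seconds`, are assumptions rather than the dissertation's cost model.

```python
HERITRIX_URIS_PER_SECOND = 2.0   # approximate rate quoted on the slide
PHANTOMJS_SECONDS_PER_URI = 4.0  # approximate rate quoted on the slide


def estimated_crawl_seconds(total_uris, deferred_fraction):
    """Estimate crawl time when only deferred URI-Rs are sent to PhantomJS.

    Assumes (for illustration) that nondeferred URI-Rs cost one Heritrix
    fetch each and deferred URI-Rs cost one PhantomJS fetch each.
    """
    if not 0.0 <= deferred_fraction <= 1.0:
        raise ValueError("deferred_fraction must be between 0 and 1")
    deferred = total_uris * deferred_fraction
    nondeferred = total_uris - deferred
    heritrix_time = nondeferred / HERITRIX_URIS_PER_SECOND
    phantomjs_time = deferred * PHANTOMJS_SECONDS_PER_URI
    return heritrix_time + phantomjs_time
```

Under these rates PhantomJS spends about eight times as long per URI as Heritrix, which is the motivation for routing only deferred representations to the slower tier.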
  • 84. Classifier: we are omitting a discussion of the classifier for deferred vs. nondeferred representations. Please see Section 7.4 in the dissertation for a detailed discussion
  • 85. Descendants = states of deferred representations reached through client-side events (for example click, pan, zoom)
  • 86. Crawling descendants: interactions represented as an N-ary tree G. FSM: $M = (S, s_0, \Sigma, \delta)$, where $S$ is the finite set of client states; $s_0 \in S$ is the initial state reached by dereferencing the URI-R and executing the initial onload events; $e \in \Sigma$ defines the client-side event $e$ as a member of the set of all events $\Sigma$; $\delta : S \times \Sigma \to S$ is the transition function in which a client-side event is executed and leads to a new state. For $s_i, s_j \in S$: $\delta(s_i, e) = s_j$, where $e$ is a client-side event and $j = i + 1$. “Adapting the Hypercube Model to Archive Deferred Representations at Web-Scale”, Technical Report, arXiv:1601.05142, 2016
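
One way to read the state machine above is as a breadth-first walk from s0 in which every client-side event is tried on every known state. The sketch below does that over an abstract transition function; `crawl_descendants`, the `transition` callback, and the default depth limit are hypothetical stand-ins, since a real crawler would fire the events in a headless browser and compare the resulting representations.

```python
from collections import deque


def crawl_descendants(initial_state, events, transition, max_depth=2):
    """Breadth-first exploration of client states reachable via events.

    transition(state, event) plays the role of delta and returns the next
    state, or None when the event changes nothing. max_depth is an
    illustrative cutoff, not a value prescribed by the model.
    Returns a dict mapping each discovered state to its tree level.
    """
    levels = {initial_state: 0}
    queue = deque([initial_state])
    while queue:
        state = queue.popleft()
        depth = levels[state]
        if depth >= max_depth:
            continue  # do not expand states at the depth limit
        for event in events:
            next_state = transition(state, event)
            # Only states not seen before count as new descendants.
            if next_state is not None and next_state not in levels:
                levels[next_state] = depth + 1
                queue.append(next_state)
    return levels
```

Tracking visited states keeps the walk finite even when different event sequences lead back to the same representation.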
  • 89. Interaction Trees are 2 Levels Deep (states s0, s1, s2). http://www.bloomberg.com/bw/articles/2014-06-16/open-plan-offices-for-people-who-hate-open-plan-offices
  • 92. Expanding the Crawl Frontier: level s1 provides the greatest benefit to the crawl frontier (nondeferred vs. deferred)
  • 93. Crawling Descendants: new embedded resources at level s1 are largely unarchived
  • 94. Crawling Descendants: level s1 has the highest cost-benefit return on investment
  • 95. Storage Impact of Two-Tiered Crawling: IIPC-proposed JSON metadata of interactions and resulting descendants (potentially used to resolve URI-M collisions; 16.5KB WARC metadata; 143MB for total dataset); 11.4 times larger for deferred vs. nondeferred; totals 5.12 times more storage per URI-R for total dataset
  • 96. Motivating Examples • Background Information • Research Questions • Measuring the Impact of JavaScript • Measuring Memento Quality • Crawling Deferred Representations • Future Work • Conclusions
  • 97. Future Work • Modeling user interactions, tendencies, and simulation – Form filling – Click and navigation likelihood • Evaluating success of crawling deferred representations – Random walks through the archives – Dm vs. Mm of mementos of deferred representations • Archival Halting Problem: How much is enough? – Mapping applications – How many pans and zooms get all the Norfolk, VA Google map tiles? – How many CNN.com pages get all the Google Ads? • Playing back WARCs with IIPC metadata of deferred representations and descendants
  • 99. RQ1. To what extent does JavaScript impact archival tools? Contributions: • Defined and identified zombie resources • Adoption of JavaScript correlates with missing embedded resources in mementos • Defined deferred representations • Showed that deferred representations have reduced archivability. For more information, reference: 2012: ws-dl.blogspot.com; 2013: TPDL2013; 2015: iPRES2015; 2015: IJDL; 2015: IJDL. Dissertation: Section 4.3, Ch. 5, Ch. 2, Ch. 5.
  • 100. RQ2. How do we measure memento quality? Contributions: • Mm is not accurate (worse than coin-flip) • Created Dm metric • Dm is closer to user perception than Mm • Mementos of deferred representations have higher Dm than nondeferred representations. For more information, reference: 2015: JCDL2015; 2015: IJDL Special Issue. Dissertation: Ch. 6, Section 6.6.
  • 101. RQ3. How can we crawl, archive, and play back deferred representations? Contributions: • Defined a framework for archiving deferred representations • Showed that the framework will crawl more slowly but more thoroughly • Defined descendants, showed that they are 2 levels deep • Showed the storage impact of crawling descendants and deferred representations. For more information, reference: 2015: iPRES2015; 2016: arXiv:1601.05142. Dissertation: Ch. 7, Ch. 7.
  • 102. Summary • Measured the impact of JavaScript on the archives • Quantified damage caused by JavaScript • Measured the cost in time and space to archive JavaScript. Provides policy makers information to make decisions regarding JavaScript handling in crawling and archiving. Quantified an intuitive understanding of crawling deferred representations at web scale.
  • 104. Publications
      Year | RQ  | Venue | Abbreviated Title | Notes
      -----|-----|-------|-------------------|------
      2012 |     | JCDL2012 Doctoral Consortium | Capturing Dynamic Web |
      2013 |     | JCDL2013 | TimeMap Caching |
      2013 | RQ1 | TPDL2013 | Archivability Over Time |
      2013 |     | TPDL2013 | Transactional Archiving |
      2013 | RQ1 | DLib Magazine 19(11/12) | Identifying Mementos |
      2014 | RQ2 | JCDL2014 | Measuring Memento Damage | Best Student Paper
      2015 | RQ1 | International Journal of Digital Libraries | Measuring Impact of JavaScript |
      2015 | RQ2 | International Journal of Digital Libraries | Measuring Memento Damage | JCDL2015 Special Issue
      2015 |     | JCDL2015 | Merging Mobile and Desktop | Best Poster
      2015 | RQ3 | iPRES2015 | Two-Tiered Crawling |
      2016 | RQ3 | Technical Report, arXiv:1601.05142 | Hypercube Model for Archiving |
      2016 |     | DLib Magazine 22(1/2) | Archiving Corporate Intranets |
  • 105. Publications • Justin F. Brunelle, “Filling in the Blanks: Capturing the Dynamic Web”, JCDL 2012 Doctoral Consortium • Justin F. Brunelle, Michael L. Nelson, “An Evaluation of Caching Policies for Memento TimeMaps”, JCDL 2013 • Mat Kelly, Justin F. Brunelle, Michele C. Weigle, Michael L. Nelson, “On the Change in Archivability of Websites Over Time”, TPDL 2013 • Justin F. Brunelle, Michael L. Nelson, Lyudmila Balakireva, Robert Sanderson, Herbert Van de Sompel, “Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool”, TPDL 2013 • Mat Kelly, Justin F. Brunelle, Michele C. Weigle, Michael L. Nelson, “A Method for Identifying Personalized Representations in Web Archives”, D-Lib Magazine, 19(11/12), 2013 • Justin F. Brunelle, Mat Kelly, Hany SalahEldeen, Michele C. Weigle, Michael L. Nelson, “Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources”, JCDL 2014 • Justin F. Brunelle, Mat Kelly, Michele C. Weigle, Michael L. Nelson, “The impact of JavaScript on archivability”, IJDL, 2015 • Wesley Jordan, Mat Kelly, Justin F. Brunelle, Laura Vobrak, Michele C. Weigle, Michael L. Nelson, “Mobile Mink: Merging Mobile and Desktop Archived Webs”, JCDL 2015 • Justin F. Brunelle, Michele C. Weigle, Michael L. Nelson, “Archiving Deferred Representations Using a Two-Tiered Crawling Approach”, iPRES2015 • Justin F. Brunelle, Michele C. Weigle, Michael L. Nelson, “Adapting the Hypercube Model to Archive Deferred Representations at Web-Scale”, Technical Report, arXiv:1601.05142, 2016 • Justin F. Brunelle, Krista Ferrante, Eliot Wilczek, Michele C. Weigle, Michael L. Nelson, “Leveraging Heritrix and the Wayback Machine on a corporate intranet: A case study on improving corporate archives”, D-Lib Magazine, 22(1/2), 2016
  • 106. Mobile Mink: Merging Mobile and Desktop Archived Webs. Wesley Jordan, Mat Kelly, Justin F. Brunelle, Laura Vobrak, Michele C. Weigle, Michael L. Nelson. Desktop and mobile webs differ and the linkage between them is lost in the archives. Desktop URIs are much more prevalent than their mobile counterparts in the archives because crawlers use desktop user-agent strings. Corresponding mobile URIs are archived less frequently even though their representations differ from their desktop counterparts. http://espn.go.com/ and http://m.espn.go.com/: same ESPN, different URIs, different HTML, different TimeMaps. • Browse to a URI-R • Potential content negotiation from user-agent • Access tool from the “Share” menu. MobileMink merges TimeMaps of http://espn.go.com & http://m.espn.go.com/ • Discovers mobile and desktop URI-Rs • Uses Memento to get all available TimeMaps • Provides integrated TimeMap • Offers users ability to submit mobile and desktop URI-Rs to archives • Increases coverage of mobile URI-Rs in the archives. More about Mobile Mink: http://bitly.com/MobileMink/ Acknowledgements: This work supported in part by the NEH HK-50181. This work was performed as part of Wesley Jordan’s mentorship at The MITRE Corporation. The author’s affiliation with The MITRE Corporation is provided for identification purposes only, and is not intended to convey or imply MITRE’s concurrence with, or support for, the positions, opinions or viewpoints expressed by the author.
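The "integrated TimeMap" idea can be sketched as a merge of two datetime-ordered lists. This is an illustrative sketch only, not Mobile Mink's actual implementation: the function name `merge_timemaps` and the representation of a TimeMap as a list of `(datetime, URI-M)` pairs are assumptions. The two sample URI-Ms are the ESPN mementos shown elsewhere in this deck.

```python
import heapq
from datetime import datetime


def merge_timemaps(*timemaps):
    """Merge several TimeMaps into one list ordered by memento datetime.

    Each TimeMap is assumed to be a list of (datetime, uri_m) pairs.
    Inputs are sorted defensively before the k-way merge.
    """
    ordered = [sorted(tm, key=lambda entry: entry[0]) for tm in timemaps]
    return list(heapq.merge(*ordered, key=lambda entry: entry[0]))


DESKTOP = [
    (datetime(2014, 3, 30, 12, 53, 15),
     "http://web.archive.org/web/20140330125315/http://espn.go.com/"),
]
MOBILE = [
    (datetime(2014, 3, 30, 12, 54, 14),
     "http://web.archive.org/web/20140330125414/http://m.espn.go.com/wireless/"),
]

INTEGRATED = merge_timemaps(MOBILE, DESKTOP)
```

Keeping the URI-M in each entry means the merged list still records whether a given memento came from the desktop or the mobile URI-R.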
  • 107. HTTP Request
      $ curl -i -v http://www.cs.odu.edu/
      > GET / HTTP/1.1
      > User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.13.1.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2
      > Host: www.cs.odu.edu
      > Accept: */*
      >
      < HTTP/1.1 200 OK
      < Server: nginx
      < Date: Tue, 25 Mar 2014 23:42:38 GMT
      < Content-Type: text/html
      < Transfer-Encoding: chunked
      < Connection: keep-alive
      <
  • 108. HTTP Response
      HTTP/1.1 200 OK
      Server: nginx
      Date: Tue, 25 Mar 2014 23:40:09 GMT
      Content-Type: text/html
      Transfer-Encoding: chunked
      Connection: keep-alive

      <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
      <!-- saved from url=(0036)http://www.cs.odu.edu/newcssite/new/ -->
      <!-- saved from url=(0019)http://sci.odu.edu/ -->
      <HTML xmlns:st1 = "urn:schemas-microsoft-com:office:smarttags">
      <HEAD>
      <meta name="verify-v1" content="CXMn8RoyhZpl9fsKpbgxtiFw3kIdHD51r/ntbf1Rrcw=" >
      <TITLE>Department Of Computer Science</TITLE>
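To make the structure of the response concrete, the status line and headers shown on this slide can be split apart with a few lines of Python. This is a simplified sketch for illustration: the helper name `parse_response_head` is hypothetical, and it ignores details a real HTTP parser must handle (repeated headers, chunked bodies, obsolete line folding).

```python
def parse_response_head(raw):
    """Split a raw HTTP response head into (status_code, reason, headers).

    Header names are lower-cased because HTTP header names are
    case-insensitive. Parsing stops at the first blank line, which
    separates the headers from the entity body.
    """
    lines = raw.strip().splitlines()
    _version, status, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        if not line.strip():
            break  # blank line: end of headers
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(status), reason, headers


RAW_HEAD = """HTTP/1.1 200 OK
Server: nginx
Date: Tue, 25 Mar 2014 23:40:09 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: keep-alive

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
"""

STATUS, REASON, HEADERS = parse_response_head(RAW_HEAD)
```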
  • 111. Deferred Representations • Representation is incomplete • Client-side code execution completes the build of the representation
  • 113. Percent Missing vs. Weighted Damage
      • $M_m$ = Percent of embedded resources missing:
        $M_m = \frac{\text{Embedded Resources Missing}}{\text{Total Embedded Resources}}$
      • $D_m$ = Damage rating of missing embedded resources:
        $D_m = \frac{D_{m_{\text{Actual}}}}{D_{m_{\text{Potential}}}}$
        $D_{m_{\text{Potential}}} = \frac{\sum_{i=1}^{n_{[I|MM]}} D_{[I|MM]}(i)}{n_{[I|MM]}} + \frac{\sum_{i=1}^{n_{[C]}} D_{[C]}(i)}{n_{[C]}}$
      where $I$ = Image, $MM$ = Multimedia, $C$ = CSS
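A small worked example may help contrast the two metrics. The sketch below follows the slide's formula for potential damage; the per-resource damage values are made-up inputs, the `Resource` class and function names are hypothetical, and the way actual damage is computed here (the same expression, summing only over missing resources while keeping the same per-class counts) is an assumption for illustration, since the slide does not spell it out.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    kind: str       # "image_mm" (image or multimedia) or "css"
    damage: float   # per-resource damage D(i); illustrative values only
    missing: bool   # True if the embedded resource is missing


def percent_missing(resources):
    """M_m: missing embedded resources over total embedded resources."""
    return sum(r.missing for r in resources) / len(resources)


def class_damage(resources, kind, only_missing):
    """Sum of D(i) for one resource class, divided by that class's count n."""
    members = [r for r in resources if r.kind == kind]
    if not members:
        return 0.0
    total = sum(r.damage for r in members if r.missing or not only_missing)
    return total / len(members)


def damage_rating(resources):
    """D_m = actual damage / potential damage (actual is an assumed form)."""
    potential = sum(class_damage(resources, k, False) for k in ("image_mm", "css"))
    actual = sum(class_damage(resources, k, True) for k in ("image_mm", "css"))
    return actual / potential


PAGE = [
    Resource("image_mm", 0.5, missing=True),   # large, central image
    Resource("image_mm", 0.1, missing=False),  # small icon
    Resource("css", 0.4, missing=False),       # stylesheet
]

M_M = percent_missing(PAGE)   # 1 of 3 resources missing
D_M = damage_rating(PAGE)     # (0.5 / 2) / (0.6 / 2 + 0.4 / 1)
```

In this toy page one of three resources is missing, yet the damage rating is higher than one third because the missing image carries most of the image weight, which is the kind of divergence between the two metrics that the slide is pointing at.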
  • 114. Damage in the Internet Archive • Measured Internet Archive mementos • Damage generally improves over time • Despite missing more resources over time
  • 115. Expanding the crawl frontier: Click events lead to the most descendants
  • 117. Deep Web • Deferred=Deep (Bergman, 2001) • Mobile requires context (Schneider, 2013) • Static → Dynamic Web (Rosenthal, 2011)(IIPC, 2012) • Crawlers & deep Web (Ast, 2008) (B. He, 2007) (Y. He, 2013) • Google’s deep Web crawler (Madhavan, 2008) • Forms (Ntoulas, 2005)
  • 118. Archive Quality • SHARC, Quality Conscious Archiving (Spaniol, 2009) • Quality of archives (Spaniol, 2009, 2009) • Archiveready (Banos, 2013, 2015) • Acid test (Kelly, 2014) • Block Importance (Ye, 2003) (Fersini, 2008) (Kohlschutter, 2010)
  • 119. Monitoring for Security • Ripley (Vikram, 2009) • Mugshot (Mickens, 2010) • ActionShot (Li, 2010) • Ajax testing and states (Mesbah, 2007, 2008, 2009, 2009, 2012) • Crawling Ajax (Dincturk, 2013, 2014)
  • 120. Publications
      Master’s: • Kyle Dempsey, Justin Brunelle, G. Tanner Jackson, Chutima Boonthum, Irwin Levinstein, Danielle McNamara, “MiBoard: Multiplayer Interactive Board Game”, AIED2009 • Justin F. Brunelle, Irwin B. Levinstein, Chutima Boonthum, “MiBoard: Metacognitive Training Through Gaming in iSTART”, 2009 VMASC Capstone Conference (Best paper in track) • Justin F. Brunelle, Kyle B. Dempsey, G. Tanner Jackson, Chutima Boonthum, Irwin B. Levinstein, Danielle S. McNamara, “MiBoard: Metacognitive Training Through Gaming”, SCiP2009 • Justin F. Brunelle, G. Tanner Jackson, Kyle Dempsey, Chutima Boonthum, Irwin B. Levinstein, Danielle S. McNamara, “Analysis of MiBoard as an iSTART Practice Tool”, FLAIRS-24, 2010 • Kyle Dempsey, G. Tanner Jackson, Justin Brunelle, Michael Rowe, Danielle McNamara, “MiBoard: Assessing Collaborative Learning Through Game-Based Practice”, FLAIRS-24, 2010
      PhD: • Justin F. Brunelle, “Filling in the Blanks: Capturing the Dynamic Web”, JCDL 2012 Doctoral Consortium • Justin F. Brunelle, Michael L. Nelson, “An Evaluation of Caching Policies for Memento TimeMaps”, JCDL 2013 • Mat Kelly, Justin F. Brunelle, Michele C. Weigle, Michael L. Nelson, “On the Change in Archivability of Websites Over Time”, TPDL 2013 • Justin F. Brunelle, Michael L. Nelson, Lyudmila Balakireva, Robert Sanderson, Herbert Van de Sompel, “Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool”, TPDL 2013 • Mat Kelly, Justin F. Brunelle, Michele C. Weigle, Michael L. Nelson, “A Method for Identifying Personalized Representations in Web Archives”, D-Lib Magazine, 19(11/12), 2013 • Justin F. Brunelle, Mat Kelly, Hany SalahEldeen, Michele C. Weigle, Michael L. Nelson, “Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources”, JCDL 2014 (Best Student Paper; International Journal of Digital Libraries: JCDL2015 Special Issue) • Justin F. Brunelle, Mat Kelly, Michele C. Weigle, Michael L. Nelson, “The impact of JavaScript on archivability”, IJDL, 2015 • Wesley Jordan, Mat Kelly, Justin F. Brunelle, Laura Vobrak, Michele C. Weigle, Michael L. Nelson, “Mobile Mink: Merging Mobile and Desktop Archived Webs”, JCDL 2015 (Best Poster) • Justin F. Brunelle, Michele C. Weigle, Michael L. Nelson, “Archiving Deferred Representations Using a Two-Tiered Crawling Approach”, iPRES2015 • Justin F. Brunelle, Michele C. Weigle, Michael L. Nelson, “Adapting the Hypercube Model to Archive Deferred Representations at Web-Scale”, Technical Report, arXiv:1601.05142, 2016 • Justin F. Brunelle, Krista Ferrante, Eliot Wilczek, Michele C. Weigle, Michael L. Nelson, “Leveraging Heritrix and the Wayback Machine on a corporate intranet: A case study on improving corporate archives”, D-Lib Magazine, 2016
  • 122. Mobile Sites in the Archives: http://m.espn.go.com/wireless/ and http://espn.go.com/ • Source: “A Method for Identifying Personalized Representations in Web Archives”, D-Lib Magazine, 2013
  • 123. Mobile Sites in the Archives: http://m.espn.go.com/wireless/ and http://espn.go.com/ • URI-M: http://web.archive.org/web/20140330125315/http://espn.go.com/ • URI-M: http://web.archive.org/web/20140330125414/http://m.espn.go.com/wireless/
  • 124. Collisions in the Archives: http://www.cnn.com/ • URI-M? URI-T? • http://web.archive.org/web/[DATETIME]/http://www.cnn.com/
  • 125. Need a better way to index mementos • URI-R is no longer enough • Environmental factors: ‒ Content negotiation ‒ Interaction ‒ Personalization ‒ GeoIP
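The collision problem can be illustrated with a toy index. This is a sketch under stated assumptions, not a proposal from the dissertation: the `MementoKey` class, the `index_captures` helper, the environment tuples, and the sample datetime are all hypothetical. It only shows that a key built from the URI-R and datetime alone cannot distinguish two captures made under different conditions, while a key extended with environmental factors can.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MementoKey:
    uri_r: str
    datetime: str            # 14-digit capture datetime
    environment: tuple = ()  # hypothetical (factor, value) pairs


def index_captures(captures, use_environment):
    """Build an index; later captures overwrite earlier ones on key collision."""
    index = {}
    for uri_r, dt, env, payload in captures:
        key = MementoKey(uri_r, dt, env if use_environment else ())
        index[key] = payload
    return index


# Two captures of the same URI-R at the same (made-up) datetime,
# differing only in the user-agent used by the client.
CAPTURES = [
    ("http://www.cnn.com/", "20160205120000",
     (("user-agent", "desktop"),), "desktop representation"),
    ("http://www.cnn.com/", "20160205120000",
     (("user-agent", "mobile"),), "mobile representation"),
]

NAIVE = index_captures(CAPTURES, use_environment=False)
RICH = index_captures(CAPTURES, use_environment=True)
```

With the naive key the second capture silently replaces the first; with the environment included both survive.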
  • 126. Content Negotiation • Server-side interpretation of client-provided parameters • Multiple representations, single resource (diagram labels: URI identifies Resource; Representation 1 and Representation 2 each represent the Resource; Content Negotiation on user-agent: Mobile, Desktop)
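Server-side negotiation on the user-agent can be sketched in a few lines. This is a deliberately naive illustration: the token list, the function names, and the placeholder representations are assumptions, and production servers use far more robust device detection.

```python
# Illustrative substrings that often appear in mobile user-agent strings;
# this list is an assumption for the example and is not exhaustive.
MOBILE_TOKENS = ("Mobi", "Android", "iPhone")

# One resource (one URI), two hypothetical representations.
REPRESENTATIONS = {
    "desktop": "<html><!-- desktop layout --></html>",
    "mobile": "<html><!-- mobile layout --></html>",
}


def negotiate(user_agent):
    """Pick a representation name from the client-provided User-Agent."""
    ua = user_agent or ""
    if any(token in ua for token in MOBILE_TOKENS):
        return "mobile"
    return "desktop"


def respond(uri, user_agent):
    """Return the representation of `uri` chosen for this client."""
    return REPRESENTATIONS[negotiate(user_agent)]


CURL_UA = "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7"
PHONE_UA = "Mozilla/5.0 (iPhone; CPU iPhone OS 9_0 like Mac OS X)"
```

Because the URI passed to `respond` is the same in both cases, a crawler that always sends a desktop user-agent never observes the mobile representation.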