Site story wadl2013
1. WADL 2013
July 25-26th Indianapolis, IN
SiteStory
Archiving Done Differently
http://mementoweb.github.io/SiteStory/
Martin Klein
@mart1nkle1n
martinklein0815@gmail.com
Justin F. Brunelle
jbrunelle@cs.odu.edu
3. WADL 2013
Archiving - the traditional way
• Actively crawl the web
• For example, using Heritrix
4. WADL 2013
Archiving - the traditional way
• Issues with crawler-based archiving:
o Request can be rejected (robots.txt, user-agent, IP)
o Can be deceived (geo-location, user-agent)
o Can be trapped (crawl my calendar!)
o Requires constant and massive bandwidth
o Implied timing problem: when to crawl?
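The first issue above, a request rejected via robots.txt, can be illustrated with Python's standard `urllib.robotparser`. This is only a minimal sketch: the robots.txt content, the user-agent names, and the example.org URLs are hypothetical and not taken from the slides.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only; not from any real site.
ROBOTS_TXT = """\
User-agent: archive-crawler
Disallow: /

User-agent: *
Disallow: /calendar/
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if the robots.txt above permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("archive-crawler", "http://example.org/index.html"),
        ("Mozilla/5.0", "http://example.org/index.html"),
        ("Mozilla/5.0", "http://example.org/calendar/2013-07"),
    ]
    for agent, url in checks:
        print(agent, url, allowed(agent, url))
```

A polite crawler that honors these rules never sees the page, while a regular browser request for the same URI is served normally.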
5. WADL 2013
Archiving - the traditional way
Timing problem:
• Update 1 viewed but not archived
t1: R created
t2: browser visit1
t3: crawler visit1
t4: R update1
t5: browser visit2
t6: R update2
6. WADL 2013
Archiving - the SiteStory way
• Transactional Web archiving
• Archive accepts HTTP transaction between browser and server
7. WADL 2013
Archiving - the traditional way
Timing problem:
• Update 1 viewed and archived
t1: R created
t2: browser visit1
t3: crawler visit1
t4: R update1
t5: browser visit2
t6: R update2
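The two timing slides can be replayed in a few lines of Python. The sketch below uses the t1 to t6 timeline from the slides. The function name and the simplification that a version is captured the first time the capturing actor sees it are assumptions made for illustration.

```python
from typing import List, Tuple

# Timeline from the slides: (time step, actor, event).
TIMELINE: List[Tuple[int, str, str]] = [
    (1, "R", "created"),
    (2, "browser", "visit1"),
    (3, "crawler", "visit1"),
    (4, "R", "update1"),
    (5, "browser", "visit2"),
    (6, "R", "update2"),
]


def archived_versions(timeline: List[Tuple[int, str, str]],
                      capturing_actor: str) -> List[str]:
    """Versions of resource R captured when capturing_actor's visits
    are the ones that feed the archive."""
    current = None          # version of R currently served
    archived: List[str] = []
    for _, actor, event in timeline:
        if actor == "R":
            current = event  # R was created or updated
        elif actor == capturing_actor and current not in archived:
            archived.append(current)
    return archived


if __name__ == "__main__":
    # Crawler-based: only the crawler's visit at t3 is archived.
    print("crawler:", archived_versions(TIMELINE, "crawler"))
    # Transactional: every browser visit is archived.
    print("transactional:", archived_versions(TIMELINE, "browser"))
```

With the crawler as the capturing actor, update1 is viewed at t5 but never archived. With browser transactions feeding the archive, update1 is captured at t5. In both cases update2 is not archived, because nobody requests it.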
9. WADL 2013
Archiving - the SiteStory way
• Challenges with transactional archiving:
o To be archived, the server has to cooperate
o Transfer data to archive, batch mode or real-time
o Archive must trust transmission to be authentic
o Resources from external servers have to be archived out-of-band
o Deduplication challenges
- Alias: different URI, same response
- Conneg (content negotiation): same URI, different response
o Determine “significant” content change
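The two deduplication cases named above can be made concrete with a small sketch that groups observed responses by body digest and by URI. The use of SHA-256, the function names, and the example URIs are assumptions for illustration and do not describe how SiteStory itself handles these cases.

```python
import hashlib
from collections import defaultdict
from typing import Iterable, List, Tuple


def digest(body: bytes) -> str:
    """Content digest of a response body (SHA-256 is an assumed choice)."""
    return hashlib.sha256(body).hexdigest()


def classify(observations: Iterable[Tuple[str, bytes]]):
    """Given (uri, body) pairs, return (aliases, varying_uris).

    aliases:      groups of different URIs that returned the same body
    varying_uris: URIs that returned more than one distinct body
    """
    uris_by_digest = defaultdict(set)
    digests_by_uri = defaultdict(set)
    for uri, body in observations:
        d = digest(body)
        uris_by_digest[d].add(uri)
        digests_by_uri[uri].add(d)
    aliases: List[List[str]] = [
        sorted(uris) for uris in uris_by_digest.values() if len(uris) > 1
    ]
    varying_uris = sorted(
        uri for uri, digests in digests_by_uri.items() if len(digests) > 1
    )
    return aliases, varying_uris


if __name__ == "__main__":
    seen = [
        ("http://example.org/", b"home"),
        ("http://example.org/index.html", b"home"),   # alias of "/"
        ("http://example.org/doc", b"<html>doc</html>"),
        ("http://example.org/doc", b'{"doc": true}'),  # negotiated variant
    ]
    print(classify(seen))
```

Note that a URI with several distinct bodies may be content negotiation or a genuine update over time. Telling the two apart would need more information, such as the request headers, which this sketch does not model.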
10. WADL 2013
SiteStory Status Quo
• mod_sitestory sends HTTP PUT to SiteStory Web Archive upon client’s GET request
o not for POST, DELETE, etc.
o for HTTP response codes 200, 302, 303
• Client IP can be included in stored headers, configurable
• Header info stored in BerkeleyDB, response body in the file system
• Dedup via hash(body)
• Offloading content as WARC files possible (read: recommended)
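The rules on this slide can be sketched as a small in-memory model: a filter for which transactions are forwarded, plus a store that keeps header records separately from bodies and deduplicates bodies by hash. This is not the mod_sitestory or SiteStory Web Archive code. `ToyArchive`, `should_archive`, the SHA-256 digest, and the choice to keep one header record per transaction are all assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List

# From the slide: only GET requests with these response codes are archived.
ARCHIVED_METHODS = {"GET"}
ARCHIVED_STATUSES = {200, 302, 303}


def should_archive(method: str, status: int) -> bool:
    """True if a transaction would be forwarded to the archive."""
    return method.upper() in ARCHIVED_METHODS and status in ARCHIVED_STATUSES


@dataclass
class ToyArchive:
    # Stand-in for the header database (BerkeleyDB in SiteStory).
    headers: List[dict] = field(default_factory=list)
    # Stand-in for bodies on the file system, keyed by body digest.
    bodies: Dict[str, bytes] = field(default_factory=dict)

    def put(self, uri: str, response_headers: dict, body: bytes) -> bool:
        """Record one transaction; return True if the body was new."""
        key = hashlib.sha256(body).hexdigest()  # hash function is assumed
        is_new = key not in self.bodies
        if is_new:
            self.bodies[key] = body
        # Assumption: a header record is kept for every transaction.
        self.headers.append(
            {"uri": uri, "headers": dict(response_headers), "body": key}
        )
        return is_new


if __name__ == "__main__":
    archive = ToyArchive()
    if should_archive("GET", 200):
        print(archive.put("http://example.org/", {"Content-Type": "text/html"}, b"v1"))
        print(archive.put("http://example.org/", {"Content-Type": "text/html"}, b"v1"))
    print(len(archive.headers), len(archive.bodies))
```

Keying bodies by digest means an unchanged page costs only a small header record per request rather than a full copy of the response.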
11. WADL 2013
To Appear: TPDL 2013
• SiteStory benchmark with ab & wget
o ApacheBench (ab): server stress test tool
o wget: Web page download
- All content: -p
• Local network
• Negligible difference between SiteStory and No SiteStory
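For readers who want to try a comparable measurement, the sketch below builds the two command lines named on the slide. Only `ab` and `wget -p` come from the slide. The request count, concurrency level, and target URL are placeholder assumptions, and the exact parameters used in the benchmark are not given here.

```python
import shlex
from typing import List


def ab_command(url: str, requests: int = 1000, concurrency: int = 10) -> List[str]:
    """ApacheBench invocation: -n total requests, -c concurrent requests.
    The default numbers are placeholders, not the benchmark's settings."""
    return ["ab", "-n", str(requests), "-c", str(concurrency), url]


def wget_command(url: str) -> List[str]:
    """wget invocation: -p also downloads page requisites
    (images, stylesheets), i.e. 'all content'."""
    return ["wget", "-p", url]


if __name__ == "__main__":
    target = "http://localhost/"  # hypothetical server under test
    print(shlex.join(ab_command(target)))
    print(shlex.join(wget_command(target)))
```

Running each command once with mod_sitestory enabled and once with it disabled gives the on/off comparison the slide refers to.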
12. WADL 2013
Re-executed on testbed
[Diagram labels: ws-dl-03.cs.odu.edu, x99 …, megalodon.lanl.gov, @AWS]
16. WADL 2013
Results
• Distributed: Higher variance
• Increased delay due to network
• SiteStory on vs. off: results still comparable
• Viable solution without crippling service
17. WADL 2013
SiteStory Installation
• Apache module mod_sitestory
• Option to exclude a list of directories
• SiteStory Web Archive
• Trivial for existing Tomcat environments
• Tanuki Java wrapper (stand-alone) available
• Configure, open ports, go!
Or…
18. WADL 2013
SiteStory Testbed
We have a SiteStory Web Archive installed for you!
1. Install and configure mod_sitestory
2. Send an email containing:
   1. Your contact info
   2. Web server IP address
   3. Server domain name used
3. Happy SiteStory’ing!
mailto: SiteStory-Testbed@googlegroups.com
http://mementoweb.github.io/SiteStory/