Site story wadl2013

Martin Klein

WADL 2013
July 25-26th Indianapolis, IN
LANL SiteStory Team lead developer

Archiving - the traditional way
• Actively crawl the web
• For example, using Heritrix

Archiving - the traditional way
• Issues with crawler-based archiving:
  o Requests can be rejected (robots.txt, user-agent, IP)
  o Crawler can be deceived (geo-location, user-agent)
  o Crawler can be trapped (crawl my calendar!)
  o Requires constant and massive bandwidth
  o Implied timing problem: when to crawl?

Archiving - the traditional way
Timing problem:
• Update 1 viewed but not archived
Timeline of resource R:
• t1: R created
• t2: browser visit 1
• t3: crawler visit 1
• t4: R update 1
• t5: browser visit 2
• t6: R update 2

Archiving - the SiteStory way
• Transactional Web archiving
• Archive accepts HTTP transaction between browser and server

Archiving - the SiteStory way
Timing problem:
• Update 1 viewed and archived
Timeline of resource R:
• t1: R created
• t2: browser visit 1
• t3: crawler visit 1
• t4: R update 1
• t5: browser visit 2
• t6: R update 2
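
The two timelines above can be replayed in a short Python sketch. The event list mirrors t1 to t6 from the slides. The version counter and the rule that a transactional archive records every served response are simplifying assumptions of this sketch, not SiteStory code.

```python
# Events t1..t6 as listed on the timing slides: (time, actor, action).
EVENTS = [
    ("t1", "R", "created"),
    ("t2", "browser", "visit1"),
    ("t3", "crawler", "visit1"),
    ("t4", "R", "update1"),
    ("t5", "browser", "visit2"),
    ("t6", "R", "update2"),
]


def replay(events):
    """Return the sets of R versions viewed, crawled, and transactionally archived."""
    version = -1  # becomes 0 when R is created, 1 after update1, 2 after update2
    viewed, crawled, transactional = set(), set(), set()
    for _time, actor, _action in events:
        if actor == "R":
            version += 1
            continue
        # Assumption: every served response is pushed to the transactional archive.
        transactional.add(version)
        if actor == "browser":
            viewed.add(version)
        elif actor == "crawler":
            crawled.add(version)
    return viewed, crawled, transactional


viewed, crawled, transactional = replay(EVENTS)
print("viewed but not crawled:", sorted(viewed - crawled))
print("viewed but not in transactional archive:", sorted(viewed - transactional))
```

Version 1 (update 1) is seen by the browser at t5 but never by the crawler, so only the transactional approach captures it. Update 2 is never requested in this timeline, so neither approach archives it.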

Archiving - the SiteStory way
• Challenges with transactional archiving:
  o To be archived, the server has to cooperate
  o Data must be transferred to the archive, in batch mode or in real time
  o Archive must trust the transmission to be authentic
  o Resources from external servers have to be archived out-of-band
  o Deduplication challenges
    - Alias: different URI, same response
    - Conneg: same URI, different response
    - Determine “significant” content change

SiteStory Status Quo
• mod_sitestory sends HTTP PUT to SiteStory Web Archive upon client’s GET request
  o not for POST, DELETE, etc.
  o for HTTP response codes 200, 302, 303
• Client IP can be included in stored headers, configurable
• Header info stored in BerkeleyDB, response body in FS
• Dedup via hash(body)
• Offloading content as WARC files possible (read: recommended)
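
The filtering and deduplication bullets above can be approximated as follows. This is a minimal sketch, not mod_sitestory or the SiteStory Web Archive: SHA-256 as the hash function, the in-memory containers standing in for BerkeleyDB and the file system, and the names `should_archive` and `ArchiveSketch` are all assumptions made for illustration.

```python
import hashlib

ARCHIVED_METHODS = {"GET"}          # not POST, DELETE, etc.
ARCHIVED_STATUS = {200, 302, 303}   # response codes named on the slide


def should_archive(method, status):
    """True if a transaction with this method and status would be sent to the archive."""
    return method.upper() in ARCHIVED_METHODS and status in ARCHIVED_STATUS


class ArchiveSketch:
    """Hypothetical in-memory stand-in for the header database and body store."""

    def __init__(self):
        self.header_records = []  # stand-in for the BerkeleyDB header store
        self.bodies = {}          # stand-in for the file system: digest -> body

    def put(self, uri, headers, body):
        """Record one transaction; return True only if the body was not stored before."""
        digest = hashlib.sha256(body).hexdigest()  # assumed hash function
        is_new = digest not in self.bodies
        if is_new:
            self.bodies[digest] = body
        self.header_records.append((uri, digest, dict(headers)))
        return is_new


archive = ArchiveSketch()
if should_archive("GET", 200):
    print(archive.put("/index.html", {"Content-Type": "text/html"}, b"<p>hello</p>"))
    print(archive.put("/home", {"Content-Type": "text/html"}, b"<p>hello</p>"))
print(len(archive.bodies), len(archive.header_records))
```

The second `put` uses a different URI with the same body, which is the alias case: the body is stored once while both transactions keep their own header record (keeping a header record per transaction is also an assumption of this sketch).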

To Appear: TPDL 2013
• SiteStory benchmark with ab and wget
  o ApacheBench (ab): server stress test tool
  o wget: Web page download
    - All content: -p
• Local network
• Negligible difference between SiteStory and No SiteStory
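
For readers who want to repeat a comparable measurement, the two tools can be driven as sketched below. The `-n` (number of requests) and `-c` (concurrency) options of ab and the `-p` (page requisites), `-q` and `--delete-after` options of wget are standard. The target URL, the request count and the concurrency level are placeholder assumptions, not the values used in the benchmark. The sketch only builds and prints the command lines, and `timed_run` is a hypothetical helper for executing them.

```python
import shlex
import shutil
import subprocess
import time


def ab_command(url, requests=100, concurrency=10):
    """ApacheBench: issue `requests` GETs against url, `concurrency` at a time."""
    return ["ab", "-n", str(requests), "-c", str(concurrency), url]


def wget_command(url):
    """wget: download the page plus all content needed to render it (-p)."""
    return ["wget", "-q", "-p", "--delete-after", url]


def timed_run(cmd):
    """Run cmd and return elapsed seconds, or None if the tool is not installed."""
    if shutil.which(cmd[0]) is None:
        return None
    start = time.perf_counter()
    subprocess.run(cmd, check=False,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


target = "http://localhost/"  # hypothetical test server
for cmd in (ab_command(target), wget_command(target)):
    print(shlex.join(cmd))
```

Calling `timed_run` on each command once with mod_sitestory enabled and once with it disabled gives a rough on/off comparison of the kind the slide describes.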

Re-executed on testbed
[Figure: testbed diagram; labels: ws-dl-03.cs.odu.edu, x99, megalodon.lanl.gov, @AWS]

Testing with ab

Testing with wget

Round Trip Time -- Distributed

Results
• Distributed: higher variance
• Increased delay due to network
• SiteStory on vs. off: performance still comparable
• Viable solution without crippling service

SiteStory Installation
• Apache module mod_sitestory
  o Option to exclude a list of directories
• SiteStory Web Archive
  o Trivial for existing Tomcat environments
  o Tanuki Java wrapper (stand-alone) available
• Configure, open ports, go!
Or…

SiteStory Testbed
We have a SiteStory Web Archive installed for you!
1. Install and configure mod_sitestory
2. Send an email containing:
   1. Your contact info
   2. Web server IP address
   3. Server domain name used
3. Happy SiteStory’ing!
Email: SiteStory-Testbed@googlegroups.com
http://mementoweb.github.io/SiteStory/

SiteStory
Archiving Done Differently
http://mementoweb.github.io/SiteStory/
Martin Klein
@mart1nkle1n
martinklein0815@gmail.com
Justin F. Brunelle
jbrunelle@cs.odu.edu
