Visualizing Digital Collections
of Web Archives
Mat Kelly, Michael L. Nelson, Michele C. Weigle
Old Dominion University
Web Archiving Collaboration: New Tools and Models
Columbia University, New York, NY
June 4, 2015
@machawk1
http://ws-dl.cs.odu.edu
Motivation for Thumbnail Summarization
• Change over time - aboutness
Apple.com has > 17k mementos
Many Nearly Identical
(apple.com)
Methods of Summarization
• Including all mementos
– many redundant thumbnails
– temporally/spatially/cognitively expensive
• Naively excluding images
– missing important captures in summary
• Compare image thumbnails
– temporally expensive for identifying unique thumbnails
Comparing mementos’ markup can identify sufficiently unique mementos
Analyzing Markup
apple.com at Mar 17, 2008
HTML for memento:
<title>Apple</title>
<meta property="analytics-track" content="Apple - Index/Tab" />
<meta property="analytics-s-channel" content="homepage" />
<meta property="analytics-s-bucket-0" content="appleglobal,applehome" />
<meta property="analytics-s-bucket-1" content="apple{COUNTRY_CODE}global,apple{COUNTRY_CODE}home" />
SimHash for HTML:
8664ee964799c38c156d8f039dae8330
SimHash?
• HTML snippet for memento is split into 64 pieces of k characters each
– k = markup length / 64
– First k characters of markup, second k characters of markup, …, 63rd k characters of markup, 64th k characters of markup
• Each piece is hashed to a character
– c, 3, 9, f, …
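The chunk-and-hash picture above can be sketched in a few lines of Python. This is only an illustration of the diagram, not the service's code: the function name `chunked_fingerprint` is hypothetical, and using the first hex digit of MD5 as the "hash to a character" step is an assumption, since the slide does not say which per-chunk hash is used.

```python
import hashlib
import math


def chunked_fingerprint(markup: str, n_chunks: int = 64) -> str:
    """Hash each of n_chunks consecutive slices of the markup to one hex character."""
    # k = markup length / 64, rounded up so that every character lands in a chunk.
    k = max(1, math.ceil(len(markup) / n_chunks))
    chars = []
    for i in range(n_chunks):
        chunk = markup[i * k:(i + 1) * k]
        # Assumption: the first hex digit of MD5 stands in for "hash to a character".
        chars.append(hashlib.md5(chunk.encode("utf-8")).hexdigest()[0])
    return "".join(chars)


page = "<title>Apple</title>" * 40
edited = page[:100] + "X" + page[101:]  # change a single character

a = chunked_fingerprint(page)
b = chunked_fingerprint(edited)
differing = sum(x != y for x, y in zip(a, b))
print(len(a), differing)
```

Because a one-character edit touches only one chunk, at most one of the 64 output characters can change, which is the property the later slides rely on.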
SimHash vs. Other Hashes
• md5(“aaaaaaaaaaaaaaa”)
12f9cf6998d52dbe773b06f848bb3608
• md5(“aaaaaaabaaaaaaa”)
e984cee68697eb77577717b532171493
• simhash(“aaaaaaaaaaaaaaa”)
8664ee964799c38c156d8f039dae8330
• simhash(“aaaaaaabaaaaaaa”)
8664ee964799a48c156d8f039dae8330
Why SimHash?
• SimHash identifies similarities between documents
• Conventional hashing algorithms are for identifying differences
– Drastically different output from similar content
• To remove redundancies, we want to detect when temporally adjacent mementos are sufficiently dissimilar
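A small Python sketch of the contrast, assuming the two test strings from the previous slide. The helper name `differing_positions` is hypothetical; the MD5 digests are computed with the standard `hashlib` module, while the two SimHash values are copied from the slide rather than recomputed.

```python
import hashlib


def differing_positions(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("inputs must have equal length")
    return sum(x != y for x, y in zip(a, b))


# Conventional hash: a one-character change in the input scrambles the digest.
d1 = hashlib.md5(b"aaaaaaaaaaaaaaa").hexdigest()
d2 = hashlib.md5(b"aaaaaaabaaaaaaa").hexdigest()
md5_diff = differing_positions(d1, d2)

# SimHash values as printed on the slide: only a small region differs.
s1 = "8664ee964799c38c156d8f039dae8330"
s2 = "8664ee964799a48c156d8f039dae8330"
simhash_diff = differing_positions(s1, s2)

print(md5_diff, "of", len(d1), "MD5 hex digits differ")
print(simhash_diff, "of", len(s1), "SimHash hex digits differ")
```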
SimHashes for Mementos
• HTML of apple.com, March 3, 2008: c39f0abc...b9
• HTML of apple.com, March 5, 2008: c39d0abc...c9
• HTML of apple.com, April 12, 2008: c39d0abc...b9
• HTML of apple.com, October 4, 2008: c770ad1b...b9
Identifying Similarity by Calculating Hamming Distance
• HTML of apple.com, March 3, 2008: c39f0abc...b9 (Hamming distance: 2)
• HTML of apple.com, March 5, 2008: c39d0abc...c9 (Hamming distance: 1)
• HTML of apple.com, April 12, 2008: c39d0abc...b9 (Hamming distance: 7)
• HTML of apple.com, October 4, 2008: c770ad1b...b9 (Hamming distance: N/A)
[Diagram label: pivot]
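The distances on this slide can be reproduced with a character-wise comparison. The sketch below is an assumption about how the slide counts: it compares the truncated strings exactly as printed (so the elided middle contributes nothing), whereas a full SimHash comparison is commonly done bit by bit on the complete fingerprint.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))


# Truncated SimHashes as shown on the slide, in temporal order.
simhashes = ["c39f0abc...b9", "c39d0abc...c9", "c39d0abc...b9", "c770ad1b...b9"]

# Distance from each memento to the temporally next one.
adjacent = [hamming_distance(x, y) for x, y in zip(simhashes, simhashes[1:])]
print(adjacent)  # [2, 1, 7]
```

The last memento has no successor, which matches the N/A entry on the slide.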
Sliding Hamming Distance
• Selection based on previously selected memento
• Sliding pivot
[Diagram labels: ΔM0, ΔM3, ΔM6]
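A minimal sketch of selection with a sliding pivot, under stated assumptions: the first memento is always kept and serves as the initial pivot, a memento is selected when its distance from the current pivot reaches a threshold, and each newly selected memento becomes the pivot. The function name and the threshold value of 3 are hypothetical; the slides do not state the threshold the service uses.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))


def select_with_sliding_pivot(simhashes, threshold):
    """Return the indices of mementos kept in the summary."""
    if not simhashes:
        return []
    selected = [0]          # assumption: the first memento is always kept
    pivot = simhashes[0]
    for index, candidate in enumerate(simhashes[1:], start=1):
        # Compare against the most recently selected memento, not the neighbour.
        if hamming_distance(pivot, candidate) >= threshold:
            selected.append(index)
            pivot = candidate   # the pivot slides forward
    return selected


simhashes = ["c39f0abc...b9", "c39d0abc...c9", "c39d0abc...b9", "c770ad1b...b9"]
print(select_with_sliding_pivot(simhashes, threshold=3))  # [0, 3]
```

With these four truncated hashes, only the first and the last memento differ enough from the pivot to be kept.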
Project Goals
Develop tools that implement thumbnail summarization for TimeMaps
• Web Service
– Allows anyone to view a TimeMap using thumbnail summarization
• Wayback add-on
– Allows any archive using Wayback to provide this service to users
• Embeddable version
– Allows web page authors to embed an overview of past versions of a page on the live web page
AlSummarization
• SimHash-based summarization scheme
created by Ahmed AlSum
• AlSum + Summarization = AlSummarization
A. AlSum and M. L. Nelson. “Thumbnail Summarization Techniques for Web Archives.” In Proceedings of the 36th European Conference on Information Retrieval, ECIR 2014, 2014.
Dr. Nelson’s Homepage
• URI-R: http://www.cs.odu.edu/~mln
• Append onto service URI for summary
– http://service/http://www.cs.odu.edu/~mln
Anatomy of the Visualization
3 presentations of the Summary
Temporally sorted
mementos
Memento metadata
Additional (optional) Endpoint Parameters
• Access – tailors user interface
– Interactive, Embed, Wayback
• Strategy – to use alternative summarization
– alSummarization, yearly, skipListed, random
• http://service/?
o access=wayback&URI-R=http://www.cs.odu.edu/~mln
o access=wayback&strategy=random&URI-R=http://www.cs.odu.edu/~mln
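The two request forms (URI-R appended to the service URI, or passed as query parameters) could be assembled as follows. This is a sketch with a hypothetical helper `summary_uri`; the lowercase spelling of the access values is an assumption based on `access=wayback` in the example, and the URI-R is left unencoded because that is how the slides show it.

```python
ACCESS_MODES = ("interactive", "embed", "wayback")                 # assumed lowercase
STRATEGIES = ("alSummarization", "yearly", "skipListed", "random")


def summary_uri(service, uri_r, access=None, strategy=None):
    """Build a summary request URI in the form shown on the slides."""
    if access is None and strategy is None:
        # Simple form: append the URI-R onto the service URI.
        return f"{service}/{uri_r}"
    params = []
    if access is not None:
        if access not in ACCESS_MODES:
            raise ValueError(f"unknown access mode: {access}")
        params.append(f"access={access}")
    if strategy is not None:
        if strategy not in STRATEGIES:
            raise ValueError(f"unknown strategy: {strategy}")
        params.append(f"strategy={strategy}")
    params.append(f"URI-R={uri_r}")  # left unencoded, as on the slide
    return f"{service}/?" + "&".join(params)


print(summary_uri("http://service", "http://www.cs.odu.edu/~mln"))
print(summary_uri("http://service", "http://www.cs.odu.edu/~mln",
                  access="wayback", strategy="random"))
```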
Programmatic Flow
[Diagram components: User’s Browser, Thumbnails Service, Memento-Compliant Archive]
User Requests URI-R Summary
Service Relays URI-R to Archive
• Service queries archive for all mementos for URI-R
URI-Ms returned to Service
• Archive returns TimeMap (TM) with URI-Ms to thumbnail service
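Memento TimeMaps are commonly serialized in link format, with each entry carrying a `rel` value and, for mementos, a `datetime`. The sketch below pulls the URI-Ms out of such a TimeMap. It is an illustration only: the function name is hypothetical, the `archive.example` entries are made-up sample data, and the regular expressions assume every parameter value is double-quoted.

```python
import re

# One link entry: <URI> followed by any number of ; name="value" parameters.
LINK = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM = re.compile(r'([\w-]+)="([^"]*)"')


def mementos_from_timemap(timemap: str):
    """Return (URI-M, datetime) pairs for every link whose rel includes 'memento'."""
    results = []
    for uri, raw_params in LINK.findall(timemap):
        params = dict(PARAM.findall(raw_params))
        if "memento" in params.get("rel", "").split():
            results.append((uri, params.get("datetime")))
    return results


# Made-up TimeMap for illustration; archive.example is not a real archive.
sample_timemap = (
    '<http://www.cs.odu.edu/~mln>; rel="original",\n'
    '<http://archive.example/timemap/link/http://www.cs.odu.edu/~mln>'
    '; rel="self"; type="application/link-format",\n'
    '<http://archive.example/20080303000000/http://www.cs.odu.edu/~mln>'
    '; rel="first memento"; datetime="Mon, 03 Mar 2008 00:00:00 GMT",\n'
    '<http://archive.example/20081004000000/http://www.cs.odu.edu/~mln>'
    '; rel="last memento"; datetime="Sat, 04 Oct 2008 00:00:00 GMT"'
)

for uri_m, when in mementos_from_timemap(sample_timemap):
    print(when, uri_m)
```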
Service fetches HTML for each Memento
Service generates SimHash for Each Memento’s HTML
• c39f0abc...b9
• c39d0abc...c9
• c39d0abc...b9
• c770ad1b...b9
• c770ad1b...b9
Service Calculates Hamming Distance
• Mementos in summary selected based on Hamming distance
• Hd() between adjacent SimHashes above: 2, 1, 7, 0
Preliminary UI returned to user
• Templated HTML interface is returned to user with placeholders for thumbnails
[Diagram labels: HTML interface; c39d0abc...c9, c39d0abc...b9, c770ad1b...b9; 2, 1, 0]
Service Generates Thumbnails for Mementos in Summary
[Diagram labels: c39f0abc...b9, c770ad1b...b9; Hd() = 7]
Thumbnails Served to User
• Asynchronous polling from HTML pages populates placeholder images once available
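The polling idea can be sketched independently of the browser. In the service the polling happens from the returned HTML page; the Python below only illustrates the control flow, and the function name, interval, and attempt limit are all hypothetical.

```python
import time


def poll_until_ready(is_ready, interval_seconds=1.0, max_attempts=30, sleep=time.sleep):
    """Call is_ready() repeatedly; return the attempt number that succeeded, or None."""
    for attempt in range(1, max_attempts + 1):
        if is_ready():
            return attempt
        if attempt < max_attempts:
            sleep(interval_seconds)  # wait before asking again
    return None


# Stand-in for "is the thumbnail available yet?": ready on the third check.
answers = iter([False, False, True])
attempts = poll_until_ready(lambda: next(answers), sleep=lambda seconds: None)
print(attempts)  # 3
```

Passing `sleep` as a parameter keeps the example instantaneous; a real caller would leave the default in place.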
Core Implementation
•
• for thumbnail generation
• abstractions preserved for code reuse
and extensibility
• Code documented to facilitate extensibility,
usage, and fixes
http://github.com/machawk1/ArchiveThumbnails
Initializing the service
User/Service Administrator simply enters:
$ npm install
$ node alSummarization.js
Service responds and is ready for query:
* Local resource (css, js, etc.) server listening on Port 1338...
* Thumbnails service started on Port 15421
> Try localhost:15421/?URI-R=http://matkelly.com in your web browser for sample execution.
Online vs. Offline Generation
• Online Thumbnail Summarization
– Fetch each memento’s HTML
– Calculate SimHashes
– Calculate Hamming Distance (HD)
– Select Mementos That Pass HD threshold
– Generate Thumbnails of Mementos
• Offline Thumbnail Summarization
– All of the above performed a priori
– Data potentially updated on access
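One way to picture the offline variant is a cache keyed by URI-M, so that fetching and hashing happen once and later requests reuse the stored value. The class below is a sketch under that assumption; `SimHashCache`, `fake_fetch`, and the stand-in hash function are hypothetical and only show the caching pattern.

```python
class SimHashCache:
    """Keeps one SimHash per URI-M so repeated requests skip fetching and hashing."""

    def __init__(self, fetch_html, compute_simhash):
        self._fetch_html = fetch_html
        self._compute_simhash = compute_simhash
        self._store = {}

    def get(self, uri_m):
        # Only fetch and hash on a cache miss.
        if uri_m not in self._store:
            self._store[uri_m] = self._compute_simhash(self._fetch_html(uri_m))
        return self._store[uri_m]


fetches = []


def fake_fetch(uri_m):
    """Stand-in for an HTTP request; records that a fetch happened."""
    fetches.append(uri_m)
    return "<title>Apple</title>"


# Stand-in hash (hex length of the markup), used only to exercise the cache.
cache = SimHashCache(fake_fetch, compute_simhash=lambda html: format(len(html), "x"))
uri_m = "http://archive.example/20080317000000/http://apple.com/"  # made-up URI-M
first = cache.get(uri_m)
second = cache.get(uri_m)
print(first, second, len(fetches))
```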
Adaptive Strategies
• Very large TimeMaps are temporally
expensive to generate
• Default behavior:
if(timeRequirement == tooLong):
use(naiveStrategy)
• User can explicitly override behavior
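The default behavior and the explicit override could look like the sketch below. Everything numeric here is an assumption: the slides do not say how the time requirement is estimated, what counts as too long, or which strategy is the naive one, so `seconds_per_memento`, `time_limit_seconds`, and the `"random"` fallback are placeholders.

```python
def choose_strategy(memento_count, seconds_per_memento, time_limit_seconds,
                    override=None, naive_strategy="random"):
    """Pick a summarization strategy, falling back to a naive one for large TimeMaps."""
    if override is not None:
        return override  # the user explicitly asked for a strategy
    estimated_seconds = memento_count * seconds_per_memento
    if estimated_seconds > time_limit_seconds:
        return naive_strategy
    return "alSummarization"


print(choose_strategy(17000, 0.5, 60))                              # large TimeMap
print(choose_strategy(10, 0.5, 60))                                 # small TimeMap
print(choose_strategy(17000, 0.5, 60, override="alSummarization"))  # user override
```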
Other Summarization Strategies
• Random Selection
– k mementos, uniform selection
• Interval
– every mth memento, m = n/k
• Temporal Interval
– One memento/year, reverse chronological
monthly back-fill
• Temporally Uniform Trimming when k > 15
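A sketch of the first three strategies, with hypothetical function names. The monthly back-fill and the trimming step for k > 15 are left out because the slides do not give enough detail to reproduce them; taking the earliest memento of each year is likewise an assumption.

```python
import random
from datetime import datetime


def random_selection(mementos, k, seed=None):
    """k mementos chosen uniformly at random, returned in their original order."""
    rng = random.Random(seed)
    count = max(0, min(k, len(mementos)))
    indices = sorted(rng.sample(range(len(mementos)), count))
    return [mementos[i] for i in indices]


def interval_selection(mementos, k):
    """Every m-th memento, with m = n / k (integer division, at least 1)."""
    if k <= 0 or not mementos:
        return []
    m = max(1, len(mementos) // k)
    return mementos[::m][:k]


def one_per_year(mementos):
    """First memento of each year; expects (datetime, uri_m) pairs in temporal order."""
    seen_years = set()
    picked = []
    for when, uri_m in mementos:
        if when.year not in seen_years:
            seen_years.add(when.year)
            picked.append((when, uri_m))
    return picked


print(interval_selection(list(range(10)), 5))  # [0, 2, 4, 6, 8]

# Made-up capture dates and placeholder URI-M names for illustration.
captures = [
    (datetime(2007, 5, 1), "m1"),
    (datetime(2008, 3, 3), "m2"),
    (datetime(2008, 10, 4), "m3"),
    (datetime(2009, 1, 15), "m4"),
]
print([uri_m for _, uri_m in one_per_year(captures)])  # ['m1', 'm2', 'm4']
```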
Grid View: AlSummarization vs Random
[Screenshots of Dr. Nelson’s Homepage: Random Strategy and AlSummarization Strategy]
Grid View: AlSummarization vs Interval
[Screenshots of Dr. Nelson’s Homepage: Interval Strategy and AlSummarization Strategy]
Grid View: AlSummarization vs Temporal Interval
[Screenshots of Dr. Nelson’s Homepage: Temporal Interval Strategy and AlSummarization Strategy]
Asynchronous Polling
Server-side SimHash Caching
Four Summarization Strategies
OpenWayback Integration
Service Embedding
• <object data="http://service/http://yoururl.com" type="text/html">
</object>
-or-
• <iframe src="http://service/http://yoururl.com">
</iframe>
Visualizing Digital Collections
of Web Archives
• Codebase:
– github.com/machawk1/ArchiveThumbnails
• Service URI:
– http://wsdl-docker.cs.odu.edu:15421
Live
Demo

 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 

Visualizing Digital Collections of Web Archives from Columbia Web Archiving Collaboration Conference

  • 1. Visualizing Digital Collections of Web Archives Mat Kelly, Michael L. Nelson, Michele C. Weigle Old Dominion University Web Archiving Collaboration: New Tools and Models Columbia University, New York, NY June 4, 2015 @machawk1 http://ws-dl.cs.odu.edu
  • 2. Motivation for Thumbnail Summarization • Change over time - aboutness
  • 3. Apple.com has > 17k mementos
  • 5. Methods of Summarization • Including all mementos – many redundant thumbnails – temporally/spatially/cognitively expensive • Naively excluding images – missing important captures in summary • Compare image thumbnails – temporally expensive for identifying unique thumbnails Comparing mementos’ markup can identify sufficiently unique mementos
  • 6. Analyzing Markup HTML for memento (apple.com at Mar 17, 2008): <title>Apple</title> <meta property="analytics-track" content="Apple - Index/Tab" /> <meta property="analytics-s-channel" content="homepage" /> <meta property="analytics-s-bucket-0" content="appleglobal,applehome" /> <meta property="analytics-s-bucket-1" content="apple{COUNTRY_CODE}global,apple{COUNTRY_CODE}home" /> SimHash for HTML: 8664ee964799c38c156d8f039dae8330
  • 7. SimHash? HTML snippet for memento, split into 64 chunks of k characters each (k = markup length / 64): first k characters of markup, second k characters of markup, …, 63rd k characters of markup, 64th k characters of markup. Each chunk: hash to a character (c, 3, 9, f, …)
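The chunking scheme drawn on slide 7 can be sketched as follows. This is only an illustration of the diagram, not the AlSummarization code: the function name `chunk_hash` is hypothetical, and taking the first hex digit of MD5 as the "hash to a character" step is an assumption, since the slide does not name the per-chunk hash.

```python
import hashlib


def chunk_hash(markup: str, n_chunks: int = 64) -> str:
    """Hash each of n_chunks consecutive slices of markup to one hex character."""
    # k = markup length / 64, rounded up so every character lands in some chunk.
    k = max(1, -(-len(markup) // n_chunks))
    digest_chars = []
    for i in range(n_chunks):
        chunk = markup[i * k:(i + 1) * k]
        # Assumption: the first hex digit of MD5 stands in for "hash to a character".
        digest_chars.append(hashlib.md5(chunk.encode("utf-8")).hexdigest()[0])
    return "".join(digest_chars)
```

Under this sketch, replacing one character in place can change at most one output character, which is the locality property the next slide contrasts with MD5. An insertion or deletion shifts every later chunk, so it can change many characters.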
  • 8. SimHash vs. Other Hashes • md5(“aaaaaaaaaaaaaaa”) 12f9cf6998d52dbe773b06f848bb3608 • md5(“aaaaaaabaaaaaaa”) e984cee68697eb77577717b532171493 • simhash(“aaaaaaaaaaaaaaa”) 8664ee964799c38c156d8f039dae8330 • simhash(“aaaaaaabaaaaaaa”) 8664ee964799a48c156d8f039dae8330
  • 9. Why SimHash? • SimHash identifies similarities between documents • Conventional hashing algorithms are for identifying differences – Drastically different output from similar content • To remove redundancies, we want to detect when temporally adjacent mementos are sufficiently dissimilar
  • 10. SimHashes for Mementos • HTML of apple.com March 3, 2008: c39f0abc...b9 • HTML of apple.com March 5, 2008: c39d0abc...c9 • HTML of apple.com April 12, 2008: c39d0abc...b9 • HTML of apple.com October 4, 2008: c770ad1b...b9
  • 11. Identifying Similarity by Calculating Hamming Distance • HTML of apple.com March 3, 2008: c39f0abc...b9, Hamming distance N/A (pivot) • HTML of apple.com March 5, 2008: c39d0abc...c9, Hamming distance 2 • HTML of apple.com April 12, 2008: c39d0abc...b9, Hamming distance 1 • HTML of apple.com October 4, 2008: c770ad1b...b9, Hamming distance 7
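A minimal sketch of the Hamming distance calculation, applied to the two complete SimHash values from slide 8. Counting differing characters is an assumption about how the slides measure distance; a bit-level count over the hex digits is shown as an alternative. Both function names are hypothetical.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length hash strings differ."""
    if len(a) != len(b):
        raise ValueError("hashes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def bit_hamming_distance(a: str, b: str) -> int:
    """Number of differing bits between two hex-encoded hashes."""
    return bin(int(a, 16) ^ int(b, 16)).count("1")


# The two simhash values shown on slide 8.
h1 = "8664ee964799c38c156d8f039dae8330"
h2 = "8664ee964799a48c156d8f039dae8330"

print(hamming_distance(h1, h2))      # 2 (characters "c3" vs "a4")
print(bit_hamming_distance(h1, h2))  # 5
```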
  • 13. Sliding Hamming Distance • Selection based on previously selected memento • Sliding pivot ΔM3 ΔM3 ΔM3 ΔM6 ΔM6 ΔM6 ΔM6 ΔM0 ΔM0 ΔM0
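The sliding pivot can be sketched as a single pass over the temporally sorted hashes. The function name `select_with_sliding_pivot` and the threshold value are hypothetical, since the slides do not state the threshold used. The sample hashes are only the visible 8-character prefixes from slides 10 and 11, so the distances differ slightly from those on the slides.

```python
def select_with_sliding_pivot(hashes, threshold):
    """Return indices of mementos kept in the summary.

    A memento is kept when its hash differs from the most recently
    kept memento (the pivot) by at least `threshold` characters.
    """
    if not hashes:
        return []
    selected = [0]          # the first memento is always the initial pivot
    pivot = hashes[0]
    for i in range(1, len(hashes)):
        distance = sum(x != y for x, y in zip(pivot, hashes[i]))
        if distance >= threshold:
            selected.append(i)
            pivot = hashes[i]   # slide the pivot to the newly selected memento
    return selected


# Visible prefixes only; the full hashes are elided on the slides.
prefixes = ["c39f0abc", "c39d0abc", "c39d0abc", "c770ad1b", "c770ad1b"]
print(select_with_sliding_pivot(prefixes, threshold=4))  # [0, 3]
```

Comparing against the last selected memento, rather than the immediately preceding one, keeps a run of small incremental changes from being skipped indefinitely once they add up.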
  • 14. Project Goals Develop tools that implement thumbnail summarization for TimeMaps • Web Service – Allows anyone to view TimeMap using thumbnail summarization • Wayback add-on – Allows any archive using wayback to provide this service to users • Embeddable version – Allow web page authors to embed overview of past versions of page on live web page
  • 15. AlSummarization • SimHash-based summarization scheme created by Ahmed AlSum • AlSum + Summarization = AlSummarization A. AlSum and M. L. Nelson. “Thumbnail Summarization Techniques for Web Archives.” In Proceedings of the 36th European Conference on Information Retrieval, ECIR 2014, 2014.
  • 16. Dr. Nelson’s Homepage • URI-R: http://www.cs.odu.edu/~mln • Append onto service URI for summary – http://service/http://www.cs.odu.edu/~mln
  • 17. Anatomy of the Visualization 3 presentations of the Summary Temporally sorted mementos Memento metadata
  • 18. Additional (optional) Endpoint Parameters • Access – tailors user interface – Interactive, Embed, Wayback • Strategy – to use alternative summarization – alSummarization, yearly, skipListed, random • http://service/? – access=wayback&URI-R=http://www.cs.odu.edu/~mln – access=wayback&strategy=random&URI-R=http://www.cs.odu.edu/~mln
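A small sketch of building the query URIs from slide 18. The helper name `summary_url` is hypothetical and `http://service` is the placeholder host used on the slides. Percent-encoding any `?` or `&` inside the URI-R is a precaution added here; whether the service decodes it is an assumption.

```python
from urllib.parse import urlencode


def summary_url(service, uri_r, access=None, strategy=None):
    """Build a summary request URI with the optional access/strategy parameters."""
    params = {}
    if access:
        params["access"] = access
    if strategy:
        params["strategy"] = strategy
    params["URI-R"] = uri_r
    # Keep ':' '/' '~' literal so the URI-R reads as on the slide;
    # '?' and '&' inside the URI-R are still percent-encoded.
    return f"{service}/?{urlencode(params, safe=':/~')}"


print(summary_url("http://service", "http://www.cs.odu.edu/~mln",
                  access="wayback", strategy="random"))
# http://service/?access=wayback&strategy=random&URI-R=http://www.cs.odu.edu/~mln
```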
  • 19. Programmatic Flow User’s Browser Thumbnails Service Memento-Compliant Archive
  • 20. User Requests URI-R Summary User’s Browser Thumbnails Service Memento-Compliant Archive
  • 21. Service Relays URI-R to Archive User’s Browser Thumbnails Service Memento-Compliant Archive Service queries archive for all mementos for URI-R
  • 22. URI-Ms returned to Service User’s Browser Thumbnails Service Memento-Compliant Archive Archive returns TimeMap with URI-Ms to thumbnail service TM
  • 23. Service fetches HTML for each Memento Thumbnails Service
  • 24. Service Generates SimHash for Each Memento’s HTML Thumbnails Service c39f0abc...b9 c39d0abc...c9 c39d0abc...b9 c770ad1b...b9 c770ad1b...b9
  • 25. Service Calculates Hamming Distance Thumbnails Service Mementos in summary selected based on Hamming distance c39f0abc...b9 c39d0abc...c9 c39d0abc...b9 c770ad1b...b9 c770ad1b...b9 Hd() 2 1 7 0
  • 26. Preliminary UI returned to user User’s Browser Thumbnails Service Templated HTML interface is returned to user with placeholders for thumbnails HTML interface
  • 27. Service Generates Thumbnails for Mementos in Summary Thumbnails Service c39f0abc...b9 c770ad1b...b9 Hd() 7 (not selected: c39d0abc...c9 c39d0abc...b9 c770ad1b...b9 Hd() 2 1 0)
  • 28. Thumbnails Served to User User’s Browser Thumbnails Service Asynchronous polling from HTML pages populates placeholder images once available HTML interface
  • 29. Core Implementation • • for thumbnail generation • abstractions preserved for code reuse and extensibility • Code documented to facilitate extensibility, usage, and fixes http://github.com/machawk1/ArchiveThumbnails
  • 30. Initializing the service User/Service Administrator simply enters: $ npm install $ node alSummarization.js Service responds and is ready for query: * Local resource (css, js, etc.) server listening on Port 1338... * Thumbnails service started on Port 15421 > Try localhost:15421/?URI-R=http://matkelly.com in your web browser for sample execution.
  • 31. Online vs. Offline Generation • Online Thumbnail Summarization – Fetch each memento’s HTML – Calculate SimHashes – Calculate Hamming Distance (HD) – Select mementos that pass HD threshold – Generate thumbnails of mementos • Offline Thumbnail Summarization – All of the above performed a priori – Data potentially updated on access
  • 32. Adaptive Strategies • Very large TimeMaps are temporally expensive to generate • Default behavior: if(timeRequirement == tooLong): use(naiveStrategy) • User can explicitly override behavior
  • 33. Other Summarization Strategies • Random Selection – k mementos, uniform selection • Interval – every mth memento, m = n/k • Temporal Interval – One memento/year, reverse chronological monthly back-fill • Temporally Uniform Trimming when k > 15
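The random and interval strategies can be sketched as below. Function names are hypothetical; `mementos` is assumed to be a temporally sorted list, `n` is its length and `k` the number of thumbnails wanted, following the slide's notation.

```python
import random


def random_selection(mementos, k, seed=None):
    """Pick k mementos uniformly at random, returned in temporal order."""
    rng = random.Random(seed)
    k = min(k, len(mementos))
    indices = sorted(rng.sample(range(len(mementos)), k))
    return [mementos[i] for i in indices]


def interval_selection(mementos, k):
    """Pick every m-th memento, with m = n / k (integer division, at least 1)."""
    n = len(mementos)
    if k <= 0 or n == 0:
        return []
    m = max(1, n // k)
    # Trim to k items in case n is not an exact multiple of k.
    return mementos[::m][:k]


timemap = list(range(100))  # stand-in for 100 temporally sorted mementos
print(interval_selection(timemap, 10))  # [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
```

Neither strategy needs the mementos' HTML, which is why they are cheap compared with SimHash-based selection.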
  • 34. Grid View AlSummarization vs Random Dr. Nelson’s Homepage Random Strategy Dr. Nelson’s Homepage AlSummarization Strategy
  • 35. Grid View AlSummarization vs Interval Dr. Nelson’s Homepage Interval Strategy Dr. Nelson’s Homepage AlSummarization Strategy
  • 36. Grid View AlSummarization vs Temporal Interval Dr. Nelson’s Homepage Temporal Interval Strategy Dr. Nelson’s Homepage AlSummarization Strategy
  • 41. Service Embedding • <object data="http://service/http://yoururl.com" type="text/html"> </object> -or- • <iframe src="http://service/http://yoururl.com"> </iframe>
  • 42. Visualizing Digital Collections of Web Archives • Codebase: – github.com/machawk1/ArchiveThumbnails • Service URI: – http://wsdl-docker.cs.odu.edu:15421