Thumbnail Summarization Techniques For Web Archives
Ahmed AlSum*
Stanford University Libraries
Stanford CA, USA
aalsum@stanford.edu
Michael L. Nelson
Old Dominion University
Norfolk VA, USA
mln@cs.odu.edu
The 36th European Conference on Information Retrieval.
ECIR 2014, Amsterdam, Netherlands, 2014
* The research was conducted while Ahmed AlSum was at Old Dominion University.
ECIR 2014 Amsterdam, Netherlands
What is a Web Archive?
http://www.cs.odu.edu
Memento Terminology
• Original Resource (URI-R, R): http://www.amazon.com
• Memento (URI-M, M): http://web.archive.org/web/20110411070244/http://amazon.com
• TimeMap (URI-T, TM)
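The URI-M shown above embeds both the capture time and the original URI-R. A minimal Python sketch of splitting them apart, assuming the Wayback-style layout `/web/<YYYYMMDDhhmmss>/<URI-R>`; the function name `split_uri_m` is hypothetical and not taken from the slides:

```python
from datetime import datetime


def split_uri_m(uri_m: str) -> tuple[datetime, str]:
    """Split a Wayback-style URI-M into (capture datetime, URI-R).

    Assumes the path contains '/web/<14-digit timestamp>/<URI-R>'.
    """
    marker = "/web/"
    start = uri_m.index(marker) + len(marker)
    # Split only on the first slash so the URI-R keeps its own slashes.
    timestamp, uri_r = uri_m[start:].split("/", 1)
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S"), uri_r


captured, original = split_uri_m(
    "http://web.archive.org/web/20110411070244/http://amazon.com"
)
print(captured.isoformat(), original)
```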
Thumbnails in Web Archive
[Figure panels: Internet Archive; UK Web Archive]
Thumbnail Creation Challenges
• Scalability in Time
  • IA may need 361 years to create a thumbnail for each memento using one hundred machines.
• Scalability in Space
  • IA will need 355 TB to store one thumbnail per memento.
• Page quality
Thumbnail Usage Challenges
• This is a partial view of the first 700 thumbnails out of 10,500 available mementos for www.apple.com.
From 10,500 Mementos to 69 Thumbnails.
How many thumbnails do we need?
www.unfi.com on the live Web
40 Thumbnails are good.
METHODOLOGY
Visual Similarity and Text Similarity
[Figure labels: Similar; Different; HTML Text]
Correlation between Visual Similarity and Text Similarity
• Text Similarity
  • SimHash
  • DOM Tree
  • Embedded resources
  • Memento Datetime (Capture time)
• Visual Similarity
  • Number of different pixels
Text Similarity
SimHash
• Compute 64-bit SimHash fingerprints with k = 4 for the two pages, then calculate the distance between them using Hamming distance.
Example:
  12 Sep 2012 - 00:12:27   SimHash: 147EDAA9977E9400
  12 Sep 2012 - 19:54:05   SimHash: 157EFAAC97189100
  Distance: 12 bits
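The SimHash step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: whitespace tokenization and the MD5-based 64-bit token hash are assumptions, and the k = 4 parameter from the slide is not modeled because the slide does not spell out its role.

```python
import hashlib


def simhash(text: str) -> int:
    """Return a 64-bit SimHash fingerprint of whitespace-separated tokens."""
    weights = [0] * 64
    for token in text.split():
        # Assumption: the first 8 bytes of MD5 serve as the 64-bit token hash.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(64):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


old_page = simhash("welcome to our store browse new products today")
new_page = simhash("welcome to our store browse holiday deals today")
print(hamming_distance(old_page, new_page))
```

Identical texts always yield a distance of 0, and the distance tends to grow as the token sets diverge.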
Text Similarity
DOM Tree
• Transform each web page into a DOM tree.
• Calculate the difference using Levenshtein distance.
  • Levenshtein distance is the number of insert, update, and delete operations needed to turn one tree into the other.
Pawlik, M., & Augsten, N. (2011). RTED: a robust algorithm for the tree edit distance. Proceedings of the VLDB Endowment, 5(4), 334–345.
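As a rough illustration of the edit-distance idea, the Python sketch below flattens each page into its sequence of opening tags and computes the classic Levenshtein distance between the two sequences. This is a deliberate simplification: it is a sequence edit distance, not the tree edit distance (RTED) cited on the slide, and the helper names are hypothetical.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect opening tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html: str) -> list[str]:
    collector = TagCollector()
    collector.feed(html)
    return collector.tags


def levenshtein(a: list[str], b: list[str]) -> int:
    """Minimum number of insert, update, and delete operations turning a into b."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # update x to y (free if equal)
            ))
        previous = current
    return previous[-1]


old = "<html><body><p>hello</p></body></html>"
new = "<html><body><div><p>hello</p></div></body></html>"
print(levenshtein(tag_sequence(old), tag_sequence(new)))
```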
Text Similarity
Embedded resources
• Extract the embedded resources from each page.
• Calculate the total number of new resources that have been added and the number of resources that have been removed.
Example (12 Sep 2012 - 00:12:27 vs. 12 Sep 2012 - 19:54:05):
          Addition  Removal
Total     4         11
Images    1         9
JS        1         0
CSS       2         2
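The added/removed counts can be sketched in Python with set differences. Which tags count as embedded resources (here `img`, `script`, and stylesheet `link` elements) is an assumption for illustration, and the helper names are hypothetical:

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Collect URLs of images, scripts, and stylesheets referenced by a page."""

    def __init__(self):
        super().__init__()
        self.resources = set()

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag in ("img", "script") and attr_map.get("src"):
            self.resources.add(attr_map["src"])
        elif tag == "link" and attr_map.get("rel") == "stylesheet" and attr_map.get("href"):
            self.resources.add(attr_map["href"])


def embedded_resources(html: str) -> set[str]:
    collector = ResourceCollector()
    collector.feed(html)
    return collector.resources


def resource_changes(old_html: str, new_html: str) -> tuple[int, int]:
    """Return (number of added resources, number of removed resources)."""
    old, new = embedded_resources(old_html), embedded_resources(new_html)
    return len(new - old), len(old - new)


old_page = '<img src="a.png"><img src="b.png"><script src="x.js"></script>'
new_page = '<img src="a.png"><link rel="stylesheet" href="s.css">'
print(resource_changes(old_page, new_page))
```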
Text Similarity
Memento datetime
• Calculate the difference, in seconds, between the capture times of the two pages.
Example: 12 Sep 2012 - 00:12:27 vs. 12 Sep 2012 - 19:54:05
Difference: 70942 sec
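A small Python sketch of the capture-time feature, assuming each memento's datetime is available as a 14-digit `YYYYMMDDhhmmss` string (the form used in Wayback-style URI-Ms). The timestamps below are hypothetical examples, not values from the dataset:

```python
from datetime import datetime


def capture_gap_seconds(ts_a: str, ts_b: str) -> int:
    """Absolute difference in seconds between two 14-digit capture timestamps."""
    fmt = "%Y%m%d%H%M%S"
    delta = datetime.strptime(ts_b, fmt) - datetime.strptime(ts_a, fmt)
    return abs(int(delta.total_seconds()))


# Hypothetical pair of captures taken exactly one hour apart.
print(capture_gap_seconds("20110411070244", "20110411080244"))
```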
Visual Similarity
• Visual similarity is measured as the number of different pixels between two thumbnails. We resize the thumbnails to different dimensions (e.g., 64x64 and 128x128) and calculate the Manhattan distance between each pair.
Example: 12 Sep 2012 - 00:12:27 vs. 12 Sep 2012 - 19:54:05
Distance: 0.65
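The pixel comparison can be sketched in Python as below. Thumbnails are modeled as equally sized grids of grayscale values, and the distance is normalized to the range 0 to 1. Both the grayscale representation and the normalization are assumptions for illustration; the slide does not state how the reported distance is scaled.

```python
def manhattan_distance(img_a, img_b, max_value=255):
    """Normalized Manhattan (L1) distance between two same-sized grayscale grids.

    Returns 0.0 for identical thumbnails and 1.0 for maximally different ones.
    """
    if len(img_a) != len(img_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(img_a, img_b)
    ):
        raise ValueError("thumbnails must have identical dimensions")
    total = sum(
        abs(a - b)
        for row_a, row_b in zip(img_a, img_b)
        for a, b in zip(row_a, row_b)
    )
    pixels = sum(len(row) for row in img_a)
    return total / (pixels * max_value)


black = [[0, 0], [0, 0]]
half = [[255, 255], [0, 0]]
print(manhattan_distance(black, half))
```

In practice the grids would come from thumbnails resized to a common dimension such as 64x64, so that every pair is comparable pixel by pixel.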
EXPERIMENT
Calculate the correlation between Visual Similarity and Text Similarity
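One way to quantify such a correlation is the Pearson coefficient between a text-similarity feature and the visual distance, computed over pairs of mementos. The slides do not say which correlation measure was used, so Pearson is an assumption here, and the sample values are made up for illustration:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


# Hypothetical feature values for four memento pairs.
simhash_bits = [2, 5, 12, 20]
visual_distance = [0.05, 0.10, 0.40, 0.65]
print(round(pearson(simhash_bits, visual_distance), 3))
```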
Dataset
Fortune 500
• 499,540 mementos from 488 TimeMaps.
• For each memento, we download the HTML and capture the thumbnail using PhantomJS.
Correlation between Visual Similarity and Text Similarity
[Figure panels: SimHash; DOM tree; Embedded resources; Memento Datetime]
SimHash [Charikar 2002], DOM tree [Pawlik 2011], Memento Datetime [Van de Sompel 2013]
SELECTION ALGORITHMS
Using text similarity features to predict the visual similarity.
#1: Threshold Grouping
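The slides only name this technique and show it as figures. One plausible reading, assumed here rather than taken from the slides, is to walk the TimeMap in chronological order and select a new thumbnail whenever the SimHash distance from the last selected memento exceeds a threshold. A minimal Python sketch under that assumption, with a hypothetical function name and made-up fingerprints:

```python
def threshold_grouping(fingerprints, threshold):
    """Return indices of mementos selected as thumbnails.

    fingerprints: SimHash values in chronological order.
    A memento is selected when its Hamming distance from the most
    recently selected memento is greater than the threshold.
    """
    if not fingerprints:
        return []
    selected = [0]  # Assumption: the first memento is always kept.
    for i in range(1, len(fingerprints)):
        last = fingerprints[selected[-1]]
        if bin(fingerprints[i] ^ last).count("1") > threshold:
            selected.append(i)
    return selected


timemap = [0b0000, 0b0001, 0b1111, 0b1110, 0b0000]
print(threshold_grouping(timemap, threshold=2))
```

Because each memento is compared only with the last selected one, a new capture can be processed without revisiting earlier ones, which is consistent with the technique being listed as incremental in the comparison table.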
#2: Clustering technique
• Input:
  • TimeMap with n mementos
  • A set of features
    • For example, F = {SimHash, Memento-Datetime}
• Task:
  • Cluster n mementos into K clusters.
[Figure panels: SimHash Feature; SimHash and Datetime Features]
Park, H.-S., & Jun, C.-H. (2009). A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications, 36(2, Part 2), 3336–3341.
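The clustering task can be illustrated with a simplified K-medoids loop in Python, operating on a precomputed pairwise distance matrix (for example, SimHash Hamming distances between mementos). This is a sketch of the general assign-and-update idea only: seeding with the first k mementos is an assumption and not the initialization of the cited algorithm, and the function names are hypothetical.

```python
def assign(dist, medoids):
    """Map each medoid index to the list of point indices nearest to it."""
    clusters = {m: [m] for m in medoids}
    for i in range(len(dist)):
        if i not in clusters:
            nearest = min(medoids, key=lambda m: dist[i][m])
            clusters[nearest].append(i)
    return clusters


def k_medoids(dist, k, max_iter=100):
    """Cluster points into k groups; returns {medoid index: member indices}."""
    medoids = list(range(k))  # Assumption: seed with the first k points.
    for _ in range(max_iter):
        clusters = assign(dist, medoids)
        # New medoid of a cluster: the member with the smallest total
        # distance to all other members of that cluster.
        updated = sorted(
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            for members in clusters.values()
        )
        if updated == medoids:
            break
        medoids = updated
    return assign(dist, medoids)


values = [0, 1, 2, 10, 11, 12]  # Hypothetical one-dimensional feature values.
dist = [[abs(a - b) for b in values] for a in values]
print(k_medoids(dist, k=2))
```

Each medoid is itself a memento, so the k medoids can serve directly as the k thumbnails chosen for the TimeMap.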
#3: Time Normalization
Selection Algorithms Comparison
                        Threshold Grouping   K clustering   Time Normalization
TimeMap Reduction       27%                  9% to 12%      23%
Image Loss              28                   78 - 101       109
# Features              1 feature            1 or more      1 feature
Preprocessing required  Yes                  Yes            No
Efficient processing    Medium               Extensive      Light
Incremental             Yes                  No             Yes
Online/offline          Both                 Both           Both
Generalization outside the Web Archive
• Summarize a website of n pages with only k thumbnails
Conclusions
• We explored the similarity between the text and the visual appearance of the web page.
• We found that the SimHash distance between HTML texts and the Levenshtein distance between HTML DOM trees have the highest correlation with visual similarity.
• We presented three algorithms to select k thumbnails from n mementos per TimeMap.
aalsum@stanford.edu
@aalsum

Editor's Notes

  1. Verbally show this is the end. Explain that this is an initial step in this area.