SlideShare una empresa de Scribd logo
1 de 33
Descargar para leer sin conexión
Web Archive Profiling
Through Fulltext Search
Sawood Alam and Michael L. Nelson
Computer Science Department, Old Dominion University
Norfolk, Virginia - 23529
Herbert Van de Sompel
Los Alamos National Laboratory, Los Alamos, NM
David S. H. Rosenthal
Stanford University Libraries, Stanford, CA
Supported in part by the IIPC and NSF 1526700
Unorganized Collections
2
Organized Collections
3
Collection Understanding
4
Memento Aggregator
5
Memento Aggregator
6
Memento Aggregator
7
Memento Aggregator
8
Memento Aggregator
9
Memento Aggregator
10
From: Michael Nelson [mailto:mln@cs.odu.edu]
Sent: Wednesday, December 02, 2015 12:33 PM
To: Jones, Gina
Cc: Rourke, Patrick; Grotke, Abigail
Subject: Re: WebSciDL
Hi Gina, I'll investigate. memgator is software that one my students wrote,
but I suspect the traffic you're seeing is b/c it is deployed in
http://oldweb.today/ can you share the IP addr from where you're seeing
the traffic? I presume the requests are for Memento TimeMaps? It should
not being actually scraping HTML pages.
regards,
Michael
On Wed, 2 Dec 2015, Jones, Gina wrote:
> Hi Michael, we have a slight configuration issue with the current OW
> set up for our webarchives. I think, from looking at the logs, that
> "MemGator:1.0-rc3 <@WebSciDL>" is really causing some issues
on our wayback.
> Do you know who is running this scraper? Itʼs not part of memento is it?
>
> Gina Jones
> Web Archiving Team
> Library of Congress
From: Ilya Kreymer <ikreymer@gmail.com>
Date: Wed, 2 Dec 2015 10:33:56 -0800
Subject: high traffic on oldweb!
To: Herbert Van de Sompel <hvdsomp@gmail.com>, Sawood Alam
<ibnesayeed@gmail.com>
Hi Herbert, Sawood,
Herbert: Perhaps you are lucky that I am not using the LANL aggregator,
as the traffic has gotten really high, and also I was asked to remove an
archive due to the traffic it was causing temporarily..
I am thinking that ability to remove source archives quickly is an
important aspect of an aggregator.
Sawood: Hopefully yours will support something like this so I don't need
to restart the container to change the archivelist ;)
Ilya
Broadcasting is Bad
11
Availability and Overlap
● Archives are sparse
● Broadcasting is wasteful, both clients and archives suffer
12
Memento Routing
13
Routing Pros & Cons
● Pros
○ Minimizes traffic and resources consumption
○ Improves throughput
● Cons
○ Upfront profile maintenance cost
○ May miss Mementos (false negatives)
14
Why Small Archives Matter?
15
Why Small Archives Matter?
● 400B+ web pages at IA do not cover
everything
● Top three archives after IA produce full
TimeMap 52% of the time (AlSum, et al., TPDL 2013)
● Targeted crawls
● Special focus archives
● Restricted resources
● Private archives
● Censorship
16
While the Internet Archive was Down...
$ memgator -f cdxj example.org | cut -c-4 | grep -v "^@" | uniq -c
2 2002
1 2005
1 2008
6 2009
67 2010
17 2011
64 2012
108 2013
108 2014
186 2015
51 2016 17
Archive Profile
● High-level summary of an archive
● Predicts presence of mementos of a URI-R in
an archive
● Provides various statistics about the holdings
● Small in size
● Publicly available
● Easy to update and partially patch
● Useful for Memento query routing and other
things
18
Profiling Strategies
● Sample URI Profiling (AlSum, et al., TPDL 2013)
● CDX Profiling (Alam, et al., TPDL 2015)
● Response Cache Profiling (Bornand, et al., JCDL 2016)
● Fulltext Search Profiling
19
Methodology
Top Nouns
time
year
people
way
man
day
thing
child
mr
government 20
Random Dict
analogies
unbolt
consonant
coils
stolidly
cigar
decrepit
rhododendron
cannibal
honeydew
Dynamic Words Discovery
the ‫وﻛﺎﻟﺔ‬ war
angry ‫أﻧﺑﺎء‬ the
arab ‫اﻟﻌرﺑﻲ‬ middle
news ‫اﻟﻐﺎﺿب‬ east
service on arabic
a politics poetry
source war art
Random Searcher Model (RSM)
21
START
STOP
Seed Vocabulary
NextWord()
ExtractWords()
Search()
Select a random link
from the search results
Vocabulary
seeding
needed?
Termination
condition
reached?
GenerateProfile()
Store search results
No
Yes
YesNo
Fetch the contents of the
selected document
RSM Illustration
Teaching Resources Adjunct Toolkit NC NET Academy PD Planning Tools Regional
Centers Campus Liaisons Nontraditional Careers College Tech Prep NC ACCESS Co op
Education Green Technology You are here NC NET Teaching Resources Discipline Specific
English English Self Paced Modules Writing Across the Curriculum NC NET Western Center
Incorporating Visuals in Workplace Documents Sections 1 2 Wake Tech Community College
Incorporating Visuals in Workplace Documents Section 3 Wake Tech Community College
All self paced modules can be accessed through the NC NET Blackboard server Log in with
the user name faculty and the password nc net Once connected you can view the courses
by topic or alphabetically by title English Webliography North Carolina Community College
System 2012
RSM Modes
● Static: Externally supplied static word list
● PopularityBiased: Refresh Vocabulary after
every search attempt and consider term
frequency for selecting next search keyword
● EqualOpportunity: Refresh Vocabulary
after every search attempt and ignore term
frequency for selecting next search keyword
● Conservative: Discover new words only
when the Vocabulary is exhausted
23
Profiling Policies & Archive-It Dataset
Policy # Keys Example
URIR 30,800,406 uk,co,bbc,news,)/Images/Logo.png?height=80&width=200
HxP1 1,724,284 uk,co,bbc,news,)/Images
DDom 91,629 uk,co,bbc,)/
H1P0 212 uk,)/
Sample URI: https://www.news.BBC.co.uk/Images/Logo.png?width=80&height=40
24
For a detailed list of profiling policies please refer to:
Alam, et al.: Web Archive Profiling Through CDX Summarization. IJDL (2016) 17: 223-238
Searches vs Coverage
25
100% in 11K searches
100% in 27K searches
100% in 337K searches 100% in 1.9M searches
RSM Operation Mode Costs
Mode
Query
Cost
HTTP
Cost
Remarks
Static C C
Suitable for specialized collection with known top
keywords
PopularityBiased C 2 * C Human like model, but costly
EqualOpportunity C 2 * C Human like model, but costly
Conservative C C +
(where << C)
Suitable for any collection and works without any
supplementary materials with very little overhead
26
Routing Confusion Matrix
Predicted  Actual Present in the Archive Not in the Archive
Routed to the Archive True Positive (TP) False Positive (FP)
Not Routed to the Archive False Negative (FN) True Negative (TN)
Routing Confusion Matrix Recall Accuracy
27
Accuracy, Recall, & Coverage (10-100%)
28
DMOZ IA Wayback
UK WaybackMemento Proxy
Low Accuracy (high FP) =>
Archives & Aggregator suffer
Low Recall (high FN) =>
Users suffer
Profile Policy Recommendations
● IF complete CDX is available THEN
○ Generate HxP1 profile
● ELSE IF fulltext search is available THEN
○ Generate DDom profile
● ELSE
○ Generate H1P0 or other smaller profiles using
Sample URIs
Note: It is possible to perform less detailed queries on more
specific (higher order) profiles, but not the other way
29
RSM Mode Recommendations
● IF the collection is about a specific topic in a
specific language AND a suitable top
keywords list is available THEN
○ Use Static mode
● ELSE
○ Use Conservative mode
30
Who Knows Term Frequency for
Estonian Nouns?
31
https://en.wiktionary.org/wiki/Category:Estonian_nouns
Future Work
● Evaluation of combination profiles such as
URI-Key along with Datetime
● Utilize archive profile to generate rank
ordered list of archive
● Profiles for usage other than Memento
routing, such as, site classification based
profiles (e.g., news, wiki, social media, blog
etc.)
32
Conclusions
● Evaluated the search cost as a function of archive holdings’
coverage and profiling policy
● Developed the Random Searcher Model
● Correctly route 80% requests while maintaining 0.9 Recall
by only discovering 10% of the archive holdings and
generating a profile that costs less than 1% of the complete
knowledge profile
33

Más contenido relacionado

La actualidad más candente

PhD thesis defense: Large-scale multilingual knowledge extraction, publishin...
PhD thesis defense:  Large-scale multilingual knowledge extraction, publishin...PhD thesis defense:  Large-scale multilingual knowledge extraction, publishin...
PhD thesis defense: Large-scale multilingual knowledge extraction, publishin...Dimitris Kontokostas
 
VALA Tech Camp 2017: Intro to Wikidata & SPARQL
VALA Tech Camp 2017: Intro to Wikidata & SPARQLVALA Tech Camp 2017: Intro to Wikidata & SPARQL
VALA Tech Camp 2017: Intro to Wikidata & SPARQLJane Frazier
 
Graph databases & data integration v2
Graph databases & data integration v2Graph databases & data integration v2
Graph databases & data integration v2Dimitris Kontokostas
 
Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011
Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011
Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011François Scharffe
 
Flagis linked open_data_stijn_goedertier
Flagis linked open_data_stijn_goedertierFlagis linked open_data_stijn_goedertier
Flagis linked open_data_stijn_goedertierFlagis VZW
 
Semantic Web introduction
Semantic Web introductionSemantic Web introduction
Semantic Web introductionGraphity
 
Dirk Goldhahn: Introduction to the German Wortschatz Project
Dirk Goldhahn: Introduction to the German Wortschatz ProjectDirk Goldhahn: Introduction to the German Wortschatz Project
Dirk Goldhahn: Introduction to the German Wortschatz Projectmbruemmer
 
Vocabulary for Linked Data Visualization Model - Dateso 2015
Vocabulary for Linked Data Visualization Model - Dateso 2015Vocabulary for Linked Data Visualization Model - Dateso 2015
Vocabulary for Linked Data Visualization Model - Dateso 2015Jiří Helmich
 
SWIB14 Weaving repository contents into the Semantic Web
SWIB14 Weaving repository contents into the Semantic WebSWIB14 Weaving repository contents into the Semantic Web
SWIB14 Weaving repository contents into the Semantic WebPascal-Nicolas Becker
 
Insight Data Engineering project
Insight Data Engineering projectInsight Data Engineering project
Insight Data Engineering projectHoa Nguyen
 
Web Data Management with RDF
Web Data Management with RDFWeb Data Management with RDF
Web Data Management with RDFM. Tamer Özsu
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commonsJesse Wang
 
Ontology, Semantic Web and DBpedia
Ontology, Semantic Web and DBpediaOntology, Semantic Web and DBpedia
Ontology, Semantic Web and DBpediaRichard Kuo
 

La actualidad más candente (20)

PhD thesis defense: Large-scale multilingual knowledge extraction, publishin...
PhD thesis defense:  Large-scale multilingual knowledge extraction, publishin...PhD thesis defense:  Large-scale multilingual knowledge extraction, publishin...
PhD thesis defense: Large-scale multilingual knowledge extraction, publishin...
 
VALA Tech Camp 2017: Intro to Wikidata & SPARQL
VALA Tech Camp 2017: Intro to Wikidata & SPARQLVALA Tech Camp 2017: Intro to Wikidata & SPARQL
VALA Tech Camp 2017: Intro to Wikidata & SPARQL
 
Memento 101
Memento 101Memento 101
Memento 101
 
Graph databases & data integration v2
Graph databases & data integration v2Graph databases & data integration v2
Graph databases & data integration v2
 
Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011
Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011
Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011
 
20110728 datalift-rpi-troy
20110728 datalift-rpi-troy20110728 datalift-rpi-troy
20110728 datalift-rpi-troy
 
GitHubGraph
GitHubGraphGitHubGraph
GitHubGraph
 
Flagis linked open_data_stijn_goedertier
Flagis linked open_data_stijn_goedertierFlagis linked open_data_stijn_goedertier
Flagis linked open_data_stijn_goedertier
 
Semantic Web introduction
Semantic Web introductionSemantic Web introduction
Semantic Web introduction
 
Converting GHO to RDF
Converting GHO to RDFConverting GHO to RDF
Converting GHO to RDF
 
Data quality in Real Estate
Data quality in Real EstateData quality in Real Estate
Data quality in Real Estate
 
Linked Open Data stuff
Linked Open Data stuffLinked Open Data stuff
Linked Open Data stuff
 
Dirk Goldhahn: Introduction to the German Wortschatz Project
Dirk Goldhahn: Introduction to the German Wortschatz ProjectDirk Goldhahn: Introduction to the German Wortschatz Project
Dirk Goldhahn: Introduction to the German Wortschatz Project
 
Vocabulary for Linked Data Visualization Model - Dateso 2015
Vocabulary for Linked Data Visualization Model - Dateso 2015Vocabulary for Linked Data Visualization Model - Dateso 2015
Vocabulary for Linked Data Visualization Model - Dateso 2015
 
Introduction to W3C Linked Data Platform
Introduction to W3C Linked Data PlatformIntroduction to W3C Linked Data Platform
Introduction to W3C Linked Data Platform
 
SWIB14 Weaving repository contents into the Semantic Web
SWIB14 Weaving repository contents into the Semantic WebSWIB14 Weaving repository contents into the Semantic Web
SWIB14 Weaving repository contents into the Semantic Web
 
Insight Data Engineering project
Insight Data Engineering projectInsight Data Engineering project
Insight Data Engineering project
 
Web Data Management with RDF
Web Data Management with RDFWeb Data Management with RDF
Web Data Management with RDF
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commons
 
Ontology, Semantic Web and DBpedia
Ontology, Semantic Web and DBpediaOntology, Semantic Web and DBpedia
Ontology, Semantic Web and DBpedia
 

Destacado

InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web ArchivesInterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web ArchivesSawood Alam
 
Libyan digital newspapers_after_revolution
Libyan digital newspapers_after_revolutionLibyan digital newspapers_after_revolution
Libyan digital newspapers_after_revolutionmaturban
 
10 Ways to Win at SlideShare SEO & Presentation Optimization
10 Ways to Win at SlideShare SEO & Presentation Optimization10 Ways to Win at SlideShare SEO & Presentation Optimization
10 Ways to Win at SlideShare SEO & Presentation OptimizationOneupweb
 
Using Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through StorytellingUsing Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through StorytellingYasmin AlNoamany, PhD
 
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...Michael Nelson
 
Social Feed Manager presentation at Archives Unleashed 3.0
Social Feed Manager presentation at Archives Unleashed 3.0Social Feed Manager presentation at Archives Unleashed 3.0
Social Feed Manager presentation at Archives Unleashed 3.0Justin Littman
 
Twitter Analysis: Fake News
Twitter Analysis: Fake  NewsTwitter Analysis: Fake  News
Twitter Analysis: Fake NewsErika Siregar
 
Good News/ Bad News
Good News/ Bad NewsGood News/ Bad News
Good News/ Bad NewsLulwahMA
 
My Presentation to SFIMA Summit 2010 - Social Media Strategy, YouTube, and Vi...
My Presentation to SFIMA Summit 2010 - Social Media Strategy, YouTube, and Vi...My Presentation to SFIMA Summit 2010 - Social Media Strategy, YouTube, and Vi...
My Presentation to SFIMA Summit 2010 - Social Media Strategy, YouTube, and Vi...Murray Izenwasser
 
02 תואר בוגר וגליון ציונים
02 תואר בוגר וגליון ציונים02 תואר בוגר וגליון ציונים
02 תואר בוגר וגליון ציוניםEvyatar Glatzer
 
I sociedades de inversion
I sociedades de inversionI sociedades de inversion
I sociedades de inversionAGHATA1236
 
5 Things You Should Do Before Job Interview-by Jubaer
5 Things You Should Do Before Job Interview-by Jubaer 5 Things You Should Do Before Job Interview-by Jubaer
5 Things You Should Do Before Job Interview-by Jubaer Slide Gen
 
”C”は何の”C”
”C”は何の”C””C”は何の”C”
”C”は何の”C”gipwest
 
Props music video pp
Props music video ppProps music video pp
Props music video ppeloisesmith98
 

Destacado (18)

InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web ArchivesInterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
 
Libyan digital newspapers_after_revolution
Libyan digital newspapers_after_revolutionLibyan digital newspapers_after_revolution
Libyan digital newspapers_after_revolution
 
10 Ways to Win at SlideShare SEO & Presentation Optimization
10 Ways to Win at SlideShare SEO & Presentation Optimization10 Ways to Win at SlideShare SEO & Presentation Optimization
10 Ways to Win at SlideShare SEO & Presentation Optimization
 
Using Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through StorytellingUsing Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through Storytelling
 
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
 
Social Feed Manager presentation at Archives Unleashed 3.0
Social Feed Manager presentation at Archives Unleashed 3.0Social Feed Manager presentation at Archives Unleashed 3.0
Social Feed Manager presentation at Archives Unleashed 3.0
 
Twitter Analysis: Fake News
Twitter Analysis: Fake  NewsTwitter Analysis: Fake  News
Twitter Analysis: Fake News
 
Good News/ Bad News
Good News/ Bad NewsGood News/ Bad News
Good News/ Bad News
 
My Presentation to SFIMA Summit 2010 - Social Media Strategy, YouTube, and Vi...
My Presentation to SFIMA Summit 2010 - Social Media Strategy, YouTube, and Vi...My Presentation to SFIMA Summit 2010 - Social Media Strategy, YouTube, and Vi...
My Presentation to SFIMA Summit 2010 - Social Media Strategy, YouTube, and Vi...
 
FINAL.LosOjos
FINAL.LosOjosFINAL.LosOjos
FINAL.LosOjos
 
02 תואר בוגר וגליון ציונים
02 תואר בוגר וגליון ציונים02 תואר בוגר וגליון ציונים
02 תואר בוגר וגליון ציונים
 
Operatingsystems 4grade
Operatingsystems 4gradeOperatingsystems 4grade
Operatingsystems 4grade
 
I sociedades de inversion
I sociedades de inversionI sociedades de inversion
I sociedades de inversion
 
5 Things You Should Do Before Job Interview-by Jubaer
5 Things You Should Do Before Job Interview-by Jubaer 5 Things You Should Do Before Job Interview-by Jubaer
5 Things You Should Do Before Job Interview-by Jubaer
 
”C”は何の”C”
”C”は何の”C””C”は何の”C”
”C”は何の”C”
 
Webles10
Webles10Webles10
Webles10
 
Props music video pp
Props music video ppProps music video pp
Props music video pp
 
Evidencias 2013
Evidencias 2013Evidencias 2013
Evidencias 2013
 

Similar a Web Archive Profiling Through Fulltext Search

dipLODocus[RDF]: Short and Long-Tail RDF Analytics for Massive Webs of Data
dipLODocus[RDF]: Short and Long-Tail RDF Analytics for Massive Webs of DatadipLODocus[RDF]: Short and Long-Tail RDF Analytics for Massive Webs of Data
dipLODocus[RDF]: Short and Long-Tail RDF Analytics for Massive Webs of DataeXascale Infolab
 
A Practical Approach to Design, Implementation, and Management A Practical Ap...
A Practical Approach to Design, Implementation, and Management A Practical Ap...A Practical Approach to Design, Implementation, and Management A Practical Ap...
A Practical Approach to Design, Implementation, and Management A Practical Ap...Cynthia Velynne
 
Five Ways To Do Data Analytics "The Wrong Way"
Five Ways To Do Data Analytics "The Wrong Way"Five Ways To Do Data Analytics "The Wrong Way"
Five Ways To Do Data Analytics "The Wrong Way"Discover Pinterest
 
How to get started in Big Data for master's students
How to get started in Big Data for master's studentsHow to get started in Big Data for master's students
How to get started in Big Data for master's studentsMohamed Nadjib MAMI
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriDemi Ben-Ari
 
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko GlobalLogic Ukraine
 
Release webinar: Sansa and Ontario
Release webinar: Sansa and OntarioRelease webinar: Sansa and Ontario
Release webinar: Sansa and OntarioBigData_Europe
 
Graph basedrdf storeforapachecassandra
Graph basedrdf storeforapachecassandraGraph basedrdf storeforapachecassandra
Graph basedrdf storeforapachecassandraRavindra Ranwala
 
PLNOG 6: Piotr Modzelewski, Bartłomiej Rymarski - Product Catalogue - Case Study
PLNOG 6: Piotr Modzelewski, Bartłomiej Rymarski - Product Catalogue - Case StudyPLNOG 6: Piotr Modzelewski, Bartłomiej Rymarski - Product Catalogue - Case Study
PLNOG 6: Piotr Modzelewski, Bartłomiej Rymarski - Product Catalogue - Case StudyPROIDEA
 
Efficient top-k queries processing in column-family distributed databases
Efficient top-k queries processing in column-family distributed databasesEfficient top-k queries processing in column-family distributed databases
Efficient top-k queries processing in column-family distributed databasesRui Vieira
 
On the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream ProcessingOn the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream ProcessingPlanetData Network of Excellence
 
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...Oscar Corcho
 
Handling the growth of data
Handling the growth of dataHandling the growth of data
Handling the growth of dataPiyush Katariya
 
Lightweight Collection and Storage of Software Repository Data with DataRover
Lightweight Collection and Storage of  Software Repository Data with DataRoverLightweight Collection and Storage of  Software Repository Data with DataRover
Lightweight Collection and Storage of Software Repository Data with DataRoverChristoph Matthies
 
Challenges with Gluster and Persistent Memory with Dan Lambright
Challenges with Gluster and Persistent Memory with Dan LambrightChallenges with Gluster and Persistent Memory with Dan Lambright
Challenges with Gluster and Persistent Memory with Dan LambrightGluster.org
 
Data management for Quantitative Biology -Basics and challenges in biomedical...
Data management for Quantitative Biology -Basics and challenges in biomedical...Data management for Quantitative Biology -Basics and challenges in biomedical...
Data management for Quantitative Biology -Basics and challenges in biomedical...QBiC_Tue
 
Complex Ephemeral Caching With Redis: Jeff Pollard
Complex Ephemeral Caching With Redis: Jeff PollardComplex Ephemeral Caching With Redis: Jeff Pollard
Complex Ephemeral Caching With Redis: Jeff PollardRedis Labs
 

Similar a Web Archive Profiling Through Fulltext Search (20)

dipLODocus[RDF]: Short and Long-Tail RDF Analytics for Massive Webs of Data
dipLODocus[RDF]: Short and Long-Tail RDF Analytics for Massive Webs of DatadipLODocus[RDF]: Short and Long-Tail RDF Analytics for Massive Webs of Data
dipLODocus[RDF]: Short and Long-Tail RDF Analytics for Massive Webs of Data
 
A Practical Approach to Design, Implementation, and Management A Practical Ap...
A Practical Approach to Design, Implementation, and Management A Practical Ap...A Practical Approach to Design, Implementation, and Management A Practical Ap...
A Practical Approach to Design, Implementation, and Management A Practical Ap...
 
Five Ways To Do Data Analytics "The Wrong Way"
Five Ways To Do Data Analytics "The Wrong Way"Five Ways To Do Data Analytics "The Wrong Way"
Five Ways To Do Data Analytics "The Wrong Way"
 
How to get started in Big Data for master's students
How to get started in Big Data for master's studentsHow to get started in Big Data for master's students
How to get started in Big Data for master's students
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-Ari
 
NISO/DCMI September 25 Webinar: Implementing Linked Data in Developing Countr...
NISO/DCMI September 25 Webinar: Implementing Linked Data in Developing Countr...NISO/DCMI September 25 Webinar: Implementing Linked Data in Developing Countr...
NISO/DCMI September 25 Webinar: Implementing Linked Data in Developing Countr...
 
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
 
Release webinar: Sansa and Ontario
Release webinar: Sansa and OntarioRelease webinar: Sansa and Ontario
Release webinar: Sansa and Ontario
 
Graph basedrdf storeforapachecassandra
Graph basedrdf storeforapachecassandraGraph basedrdf storeforapachecassandra
Graph basedrdf storeforapachecassandra
 
PLNOG 6: Piotr Modzelewski, Bartłomiej Rymarski - Product Catalogue - Case Study
PLNOG 6: Piotr Modzelewski, Bartłomiej Rymarski - Product Catalogue - Case StudyPLNOG 6: Piotr Modzelewski, Bartłomiej Rymarski - Product Catalogue - Case Study
PLNOG 6: Piotr Modzelewski, Bartłomiej Rymarski - Product Catalogue - Case Study
 
cyclades eswc2016
cyclades eswc2016cyclades eswc2016
cyclades eswc2016
 
Efficient top-k queries processing in column-family distributed databases
Efficient top-k queries processing in column-family distributed databasesEfficient top-k queries processing in column-family distributed databases
Efficient top-k queries processing in column-family distributed databases
 
On the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream ProcessingOn the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream Processing
 
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
 
Handling the growth of data
Handling the growth of dataHandling the growth of data
Handling the growth of data
 
Lightweight Collection and Storage of Software Repository Data with DataRover
Lightweight Collection and Storage of  Software Repository Data with DataRoverLightweight Collection and Storage of  Software Repository Data with DataRover
Lightweight Collection and Storage of Software Repository Data with DataRover
 
Challenges with Gluster and Persistent Memory with Dan Lambright
Challenges with Gluster and Persistent Memory with Dan LambrightChallenges with Gluster and Persistent Memory with Dan Lambright
Challenges with Gluster and Persistent Memory with Dan Lambright
 
Data Science as Scale
Data Science as ScaleData Science as Scale
Data Science as Scale
 
Data management for Quantitative Biology -Basics and challenges in biomedical...
Data management for Quantitative Biology -Basics and challenges in biomedical...Data management for Quantitative Biology -Basics and challenges in biomedical...
Data management for Quantitative Biology -Basics and challenges in biomedical...
 
Complex Ephemeral Caching With Redis: Jeff Pollard
Complex Ephemeral Caching With Redis: Jeff PollardComplex Ephemeral Caching With Redis: Jeff Pollard
Complex Ephemeral Caching With Redis: Jeff Pollard
 

Más de Sawood Alam

TrendMachine: Temporal Resilience of Web Pages
TrendMachine: Temporal Resilience of Web PagesTrendMachine: Temporal Resilience of Web Pages
TrendMachine: Temporal Resilience of Web PagesSawood Alam
 
CDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection InsightsCDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection InsightsSawood Alam
 
Video Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback MachineVideo Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback MachineSawood Alam
 
Profiling Web Archival Voids for Memento Routing
Profiling Web Archival Voids for Memento RoutingProfiling Web Archival Voids for Memento Routing
Profiling Web Archival Voids for Memento RoutingSawood Alam
 
Readying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web BundlesReadying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web BundlesSawood Alam
 
Summarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMapSummarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMapSawood Alam
 
MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento RoutingMementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento RoutingSawood Alam
 
Supporting Web Archiving via Web Packaging
Supporting Web Archiving via Web PackagingSupporting Web Archiving via Web Packaging
Supporting Web Archiving via Web PackagingSawood Alam
 
MementoMap: An Archive Profile Dissemination Framework
MementoMap: An Archive Profile Dissemination FrameworkMementoMap: An Archive Profile Dissemination Framework
MementoMap: An Archive Profile Dissemination FrameworkSawood Alam
 
Impact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web ArchivesImpact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web ArchivesSawood Alam
 
Archive Assisted Archival Fixity Verification Framework
Archive Assisted Archival Fixity Verification FrameworkArchive Assisted Archival Fixity Verification Framework
Archive Assisted Archival Fixity Verification FrameworkSawood Alam
 
MementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive ProfilingMementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive ProfilingSawood Alam
 
Web ARChive (WARC) File Format
Web ARChive (WARC) File FormatWeb ARChive (WARC) File Format
Web ARChive (WARC) File FormatSawood Alam
 
InterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving
InterPlanetary Wayback: The Next Step Towards Decentralized Web ArchivingInterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving
InterPlanetary Wayback: The Next Step Towards Decentralized Web ArchivingSawood Alam
 
MemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in GoMemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in GoSawood Alam
 
Dockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to ContainerizationDockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to ContainerizationSawood Alam
 
Avoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorkerAvoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorkerSawood Alam
 
Client-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorkerClient-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorkerSawood Alam
 
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...Sawood Alam
 
Profile Serialization IIPC GA 2015
Profile Serialization IIPC GA 2015Profile Serialization IIPC GA 2015
Profile Serialization IIPC GA 2015Sawood Alam
 

Más de Sawood Alam (20)

TrendMachine: Temporal Resilience of Web Pages
TrendMachine: Temporal Resilience of Web PagesTrendMachine: Temporal Resilience of Web Pages
TrendMachine: Temporal Resilience of Web Pages
 
CDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection InsightsCDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection Insights
 
Video Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback MachineVideo Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback Machine
 
Profiling Web Archival Voids for Memento Routing
Profiling Web Archival Voids for Memento RoutingProfiling Web Archival Voids for Memento Routing
Profiling Web Archival Voids for Memento Routing
 
Readying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web BundlesReadying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web Bundles
 
Summarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMapSummarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMap
 
MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento RoutingMementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
 
Supporting Web Archiving via Web Packaging
Supporting Web Archiving via Web PackagingSupporting Web Archiving via Web Packaging
Supporting Web Archiving via Web Packaging
 
MementoMap: An Archive Profile Dissemination Framework
MementoMap: An Archive Profile Dissemination FrameworkMementoMap: An Archive Profile Dissemination Framework
MementoMap: An Archive Profile Dissemination Framework
 
Impact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web ArchivesImpact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web Archives
 
Archive Assisted Archival Fixity Verification Framework
Archive Assisted Archival Fixity Verification FrameworkArchive Assisted Archival Fixity Verification Framework
Archive Assisted Archival Fixity Verification Framework
 
MementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive ProfilingMementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive Profiling
 
Web ARChive (WARC) File Format
Web ARChive (WARC) File FormatWeb ARChive (WARC) File Format
Web ARChive (WARC) File Format
 
InterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving
InterPlanetary Wayback: The Next Step Towards Decentralized Web ArchivingInterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving
InterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving
 
MemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in GoMemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in Go
 
Dockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to ContainerizationDockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to Containerization
 
Avoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorkerAvoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorker
 
Client-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorkerClient-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorker
 
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
 
Profile Serialization IIPC GA 2015
Profile Serialization IIPC GA 2015Profile Serialization IIPC GA 2015
Profile Serialization IIPC GA 2015
 

Último

pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flyPRADYUMMAURYA1
 
Introduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptxIntroduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptxBhagirath Gogikar
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.Nitya salvi
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxFarihaAbdulRasheed
 
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Servicemonikaservice1
 
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Joonhun Lee
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Servicenishacall1
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformationAreesha Ahmad
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptxAlMamun560346
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedDelhi Call girls
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...dkNET
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxRizalinePalanog2
 

Último (20)

Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
 
Introduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptxIntroduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptx
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
 
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 

Web Archive Profiling Through Fulltext Search

  • 1. Web Archive Profiling Through Fulltext Search Sawood Alam and Michael L. Nelson Computer Science Department, Old Dominion University Norfolk, Virginia - 23529 Herbert Van de Sompel Los Alamos National Laboratory, Los Alamos, NM David S. H. Rosenthal Stanford University Libraries, Stanford, CA Supported in part by the IIPC and NSF 1526700
  • 11. From: Michael Nelson [mailto:mln@cs.odu.edu] Sent: Wednesday, December 02, 2015 12:33 PM To: Jones, Gina Cc: Rourke, Patrick; Grotke, Abigail Subject: Re: WebSciDL Hi Gina, I'll investigate. memgator is software that one my students wrote, but I suspect the traffic you're seeing is b/c it is deployed in http://oldweb.today/ can you share the IP addr from where you're seeing the traffic? I presume the requests are for Memento TimeMaps? It should not being actually scraping HTML pages. regards, Michael On Wed, 2 Dec 2015, Jones, Gina wrote: > Hi Michael, we have a slight configuration issue with the current OW > set up for our webarchives. I think, from looking at the logs, that > "MemGator:1.0-rc3 <@WebSciDL>" is really causing some issues on our wayback. > Do you know who is running this scraper? Itʼs not part of memento is it? > > Gina Jones > Web Archiving Team > Library of Congress From: Ilya Kreymer <ikreymer@gmail.com> Date: Wed, 2 Dec 2015 10:33:56 -0800 Subject: high traffic on oldweb! To: Herbert Van de Sompel <hvdsomp@gmail.com>, Sawood Alam <ibnesayeed@gmail.com> Hi Herbert, Sawood, Herbert: Perhaps you are lucky that I am not using the LANL aggregator, as the traffic has gotten really high, and also I was asked to remove an archive due to the traffic it was causing temporarily.. I am thinking that ability to remove source archives quickly is an important aspect of an aggregator. Sawood: Hopefully yours will support something like this so I don't need to restart the container to change the archivelist ;) Ilya Broadcasting is Bad 11
  • 12. Availability and Overlap ● Archives are sparse ● Broadcasting is wasteful, both clients and archives suffer 12
  • 14. Routing Pros & Cons ● Pros ○ Minimizes traffic and resources consumption ○ Improves throughput ● Cons ○ Upfront profile maintenance cost ○ May miss Mementos (false negatives) 14
  • 15. Why Small Archives Matter? 15
  • 16. Why Small Archives Matter? ● 400B+ web pages at IA do not cover everything ● Top three archives after IA produce full TimeMap 52% of the time (AlSum, et al., TPDL 2013) ● Targeted crawls ● Special focus archives ● Restricted resources ● Private archives ● Censorship 16
  • 17. While the Internet Archive was Down... $ memgator -f cdxj example.org | cut -c-4 | grep -v "^@" | uniq -c 2 2002 1 2005 1 2008 6 2009 67 2010 17 2011 64 2012 108 2013 108 2014 186 2015 51 2016 17
  • 18. Archive Profile ● High-level summary of an archive ● Predicts presence of mementos of a URI-R in an archive ● Provides various statistics about the holdings ● Small in size ● Publicly available ● Easy to update and partially patch ● Useful for Memento query routing and other things 18
  • 19. Profiling Strategies ● Sample URI Profiling (AlSum, et al., TPDL 2013) ● CDX Profiling (Alam, et al., TPDL 2015) ● Response Cache Profiling (Bornand, et al., JCDL 2016) ● Fulltext Search Profiling 19
  • 20. Methodology Top Nouns time year people way man day thing child mr government 20 Random Dict analogies unbolt consonant coils stolidly cigar decrepit rhododendron cannibal honeydew Dynamic Words Discovery the ‫وﻛﺎﻟﺔ‬ war angry ‫أﻧﺑﺎء‬ the arab ‫اﻟﻌرﺑﻲ‬ middle news ‫اﻟﻐﺎﺿب‬ east service on arabic a politics poetry source war art
  • 21. Random Searcher Model (RSM) 21 START STOP Seed Vocabulary NextWord() ExtractWords() Search() Select a random link from the search results Vocabulary seeding needed? Termination condition reached? GenerateProfile() Store search results No Yes YesNo Fetch the contents of the selected document
  • 22. RSM Illustration Teaching Resources Adjunct Toolkit NC NET Academy PD Planning Tools Regional Centers Campus Liaisons Nontraditional Careers College Tech Prep NC ACCESS Co op Education Green Technology You are here NC NET Teaching Resources Discipline Specific English English Self Paced Modules Writing Across the Curriculum NC NET Western Center Incorporating Visuals in Workplace Documents Sections 1 2 Wake Tech Community College Incorporating Visuals in Workplace Documents Section 3 Wake Tech Community College All self paced modules can be accessed through the NC NET Blackboard server Log in with the user name faculty and the password nc net Once connected you can view the courses by topic or alphabetically by title English Webliography North Carolina Community College System 2012
  • 23. RSM Modes ● Static: Externally supplied static word list ● PopularityBiased: Refresh Vocabulary after every search attempt and consider term frequency for selecting next search keyword ● EqualOpportunity: Refresh Vocabulary after every search attempt and ignore term frequency for selecting next search keyword ● Conservative: Discover new words only when the Vocabulary is exhausted 23
  • 24. Profiling Policies & Archive-It Dataset Policy # Keys Example URIR 30,800,406 uk,co,bbc,news,)/Images/Logo.png?height=80&width=200 HxP1 1,724,284 uk,co,bbc,news,)/Images DDom 91,629 uk,co,bbc,)/ H1P0 212 uk,)/ Sample URI: https://www.news.BBC.co.uk/Images/Logo.png?width=80&height=40 24 For a detailed list of profiling policies please refer to: Alam, et al.: Web Archive Profiling Through CDX Summarization. IJDL (2016) 17: 223-238
  • 25. Searches vs Coverage 25 100% in 11K searches 100% in 27K searches 100% in 337K searches 100% in 1.9M searches
  • 26. RSM Operation Mode Costs Mode Query Cost HTTP Cost Remarks Static C C Suitable for specialized collection with known top keywords PopularityBiased C 2 * C Human like model, but costly EqualOpportunity C 2 * C Human like model, but costly Conservative C C + (where << C) Suitable for any collection and works without any supplementary materials with very little overhead 26
  • 27. Routing Confusion Matrix Predicted Actual Present in the Archive Not in the Archive Routed to the Archive True Positive (TP) False Positive (FP) Not Routed to the Archive False Negative (FN) True Negative (TN) Routing Confusion Matrix Recall Accuracy 27
  • 28. Accuracy, Recall, & Coverage (10-100%) 28 DMOZ IA Wayback UK WaybackMemento Proxy Low Accuracy (high FP) => Archives & Aggregator suffer Low Recall (high FN) => Users suffer
  • 29. Profile Policy Recommendations ● IF complete CDX is available THEN ○ Generate HxP1 profile ● ELSE IF fulltext search is available THEN ○ Generate DDom profile ● ELSE ○ Generate H1P0 or other smaller profiles using Sample URIs Note: It is possible to perform less detailed queries on more specific (higher order) profiles, but not the other way 29
  • 30. RSM Mode Recommendations ● IF the collection is about a specific topic in a specific language AND a suitable top keywords list is available THEN ○ Use Static mode ● ELSE ○ Use Conservative mode 30
  • 31. Who Knows Term Frequency for Estonian Nouns? 31 https://en.wiktionary.org/wiki/Category:Estonian_nouns
  • 32. Future Work ● Evaluation of combination profiles such as URI-Key along with Datetime ● Utilize archive profile to generate rank ordered list of archive ● Profiles for usage other than Memento routing, such as, site classification based profiles (e.g., news, wiki, social media, blog etc.) 32
  • 33. Conclusions ● Evaluated the search cost as a function of archive holdings’ coverage and profiling policy ● Developed the Random Searcher Model ● Correctly route 80% requests while maintaining 0.9 Recall by only discovering 10% of the archive holdings and generating a profile that costs less than 1% of the complete knowledge profile 33