This presentation on using the Heritrix crawler is part of the ARCOMEM training curriculum. Feel free to roam around or contact us on Twitter via @arcomem to learn more about ARCOMEM training on archiving Social Media.
Arcomem training heritrix_advanced
1. Adaptive Heritrix
ATHENA – Research and Innovation Center in Information,
Communication and Knowledge Technologies
2. ARCOMEM Requirements for crawling
• ARCOMEM aims to guide crawling based on
– Advanced semantic link extraction
– Use of social media
– Analysis of crawled content in large-scale distributed
environment
• These aims require a crawler to
– Adaptively update priorities
– Operate as a service
3. Adaptive Prioritization
• New Heritrix frontier class
– Plug & Play with open source Heritrix
– Minimal configuration
• Adding forward index for URLs
– locates a link already scheduled for crawling
• Moves scheduled link to the place corresponding to
the updated priority
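The forward-index idea above can be sketched in a few lines. This is only an illustration in Python, not the ARCOMEM frontier class or Heritrix code: the class name `AdaptiveFrontier` and its methods are hypothetical, and a single in-memory heap stands in for the real frontier's queues. The dictionary plays the role of the forward index: it locates the entry of an already scheduled URL so that the URL can be moved to the position matching its updated priority.

```python
import heapq
import itertools


class AdaptiveFrontier:
    """Hypothetical sketch: a URL queue whose priorities can change after scheduling."""

    def __init__(self):
        self._heap = []                    # entries: [negated priority, sequence number, url]
        self._index = {}                   # forward index: url -> its live heap entry
        self._counter = itertools.count()  # tie-breaker so entries never compare on url

    def schedule(self, url, priority):
        """Schedule a new URL, or move an already scheduled one to a new priority."""
        old = self._index.get(url)
        if old is not None:
            old[2] = None                  # mark the old entry as stale (lazy deletion)
        entry = [-priority, next(self._counter), url]
        self._index[url] = entry
        heapq.heappush(self._heap, entry)

    def next_url(self):
        """Return the scheduled URL with the highest priority, or None if empty."""
        while self._heap:
            _, _, url = heapq.heappop(self._heap)
            if url is not None:            # skip entries invalidated by an update
                del self._index[url]
                return url
        return None

    def __len__(self):
        return len(self._index)


frontier = AdaptiveFrontier()
frontier.schedule("http://example.org/a", 0.2)
frontier.schedule("http://example.org/b", 0.5)
frontier.schedule("http://example.org/a", 0.9)  # an updated score arrives later
print(frontier.next_url())
```

In this sketch the rescored URL `http://example.org/a` is returned before `http://example.org/b`, even though it was first scheduled with a lower priority.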
4. Heritrix as a crawling service
• Decoupled fetching and link prioritization
• Writing crawled data to modified WARC files
– WARCs are loaded into HBase by a separate process
• Efficient URL injection end-point
– Receives scored links from online analysis and API crawler
– ARCOMEM-specific JSON format of outlinks
– External-memory queue to handle large volumes of links
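A rough sketch of how an injection end-point might buffer scored links outside main memory is given below. Everything specific here is an assumption made for illustration: the slides do not show the ARCOMEM JSON format, so the field names `outlinks`, `url` and `score` are hypothetical, and SQLite merely stands in for whatever external-memory queue the project actually uses.

```python
import json
import sqlite3


class DiskBackedLinkQueue:
    """Hypothetical FIFO queue of scored links kept in SQLite rather than in RAM."""

    def __init__(self, path):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS links ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "url TEXT NOT NULL, score REAL NOT NULL)"
        )

    def inject(self, payload):
        """Parse one JSON message and append its links; return how many were queued."""
        message = json.loads(payload)
        # Assumed message shape: {"outlinks": [{"url": ..., "score": ...}, ...]}
        rows = [(link["url"], float(link["score"])) for link in message["outlinks"]]
        with self._db:  # commits on success, rolls back on error
            self._db.executemany("INSERT INTO links (url, score) VALUES (?, ?)", rows)
        return len(rows)

    def drain(self, batch_size):
        """Remove and return up to batch_size links in arrival order."""
        with self._db:
            rows = self._db.execute(
                "SELECT id, url, score FROM links ORDER BY id LIMIT ?", (batch_size,)
            ).fetchall()
            self._db.executemany(
                "DELETE FROM links WHERE id = ?", [(row[0],) for row in rows]
            )
        return [(url, score) for _, url, score in rows]


# ":memory:" keeps the demo self-contained; a file path would put the queue on disk.
queue = DiskBackedLinkQueue(":memory:")
queue.inject('{"outlinks": [{"url": "http://example.org/a", "score": 0.7},'
             ' {"url": "http://example.org/b", "score": 0.4}]}')
print(queue.drain(10))
```

The crawler side would call `drain` in batches and hand each (URL, score) pair to the frontier, which keeps fetching decoupled from link scoring.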
5. Assessing the impact of adaptive prioritization
• Simulations to evaluate how adaptive prioritization affects
performance of a focused crawler
– Simulation on 3 DMOZ topics: Genetics, Recycling, Oceanography
• Running simulated crawl
– Start from set of 20 randomly selected seeds (repeated 3 times)
– Topic vector is the sum of the seed vectors
– Crawl 10,000 web pages
• Compare the effectiveness of a best-first crawler to
– Adaptive prioritization: priorities are updated using MAX, MIN, AVG,
SUM, FIRST, LAST functions
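One plausible reading of the six update functions is sketched below. It assumes that every time a link to the same URL is discovered it comes with a score, and that the URL's priority is obtained by combining those scores in discovery order; the slides do not spell this out, so the function `combined_priority` and the sample scores are illustrative only. Under this reading FIRST never changes a priority once it is set, which matches the plain best-first baseline.

```python
# Each function maps the list of scores seen so far for one URL to its priority.
UPDATE_FUNCTIONS = {
    "MAX": max,
    "MIN": min,
    "AVG": lambda scores: sum(scores) / len(scores),
    "SUM": sum,
    "FIRST": lambda scores: scores[0],   # keep the initial score: no adaptation
    "LAST": lambda scores: scores[-1],   # always take the most recent score
}


def combined_priority(scores, function_name):
    """Combine the scores observed for one URL (in discovery order) into a priority."""
    if not scores:
        raise ValueError("at least one score is required")
    return UPDATE_FUNCTIONS[function_name](scores)


# Hypothetical scores for a URL that was linked from three crawled pages.
observed = [0.2, 0.8, 0.5]
for name in UPDATE_FUNCTIONS:
    print(name, combined_priority(observed, name))
```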
6. Adaptive Prioritization results
Update function   Harvest Ratio   Average Similarity   DMOZ topics
FIRST             0.3317          0.2945               0.4979
AVG               0.3609          0.3024               0.5779
MAX               0.3388          0.2967               0.5270
SUM               0.2679          0.2759               0.4650
LAST              0.3404          0.2961               0.5985
• AVG and LAST have the highest harvest ratios and find the most
pages from DMOZ topics
• Adaptive prioritization with AVG, MAX, or LAST is more effective than
FIRST (i.e. the best-first crawler); SUM is the exception
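For readers unfamiliar with the metrics, the sketch below shows one common way such numbers are computed in focused-crawling experiments. The slides only state that the topic vector is the sum of the seed vectors; the use of cosine similarity, the relevance threshold, and the definition of harvest ratio as the fraction of crawled pages at or above that threshold are assumptions here, not details confirmed by the presentation. All function names and sample values are hypothetical.

```python
import math
from collections import Counter


def topic_vector(seed_vectors):
    """Sum the term-weight vectors of the seed pages into one topic vector."""
    total = Counter()
    for vector in seed_vectors:
        total.update(vector)  # adds weights term by term
    return total


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (assumed measure)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def crawl_metrics(similarities, relevance_threshold):
    """Return (harvest ratio, average similarity) for the crawled pages."""
    relevant = sum(1 for s in similarities if s >= relevance_threshold)
    return relevant / len(similarities), sum(similarities) / len(similarities)


# Toy data: two seed pages and three crawled pages.
topic = topic_vector([{"gene": 1.0, "dna": 1.0}, {"gene": 1.0, "cell": 1.0}])
pages = [{"gene": 1.0}, {"dna": 1.0, "cell": 1.0}, {"ocean": 1.0}]
similarities = [cosine(topic, page) for page in pages]
print(crawl_metrics(similarities, relevance_threshold=0.5))
```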