This presentation on using the Heritrix crawler is part of the ARCOMEM training curriculum. Feel free to roam around or contact us on Twitter via @arcomem to learn more about ARCOMEM training on archiving Social Media.
1. Adaptive Heritrix
ATHENA – Research and Innovation Center in Information, Communication and Knowledge Technologies
2. ARCOMEM Requirements for crawling
• ARCOMEM aims to guide crawling based on
– Advanced semantic link extraction
– Use of social media
– Analysis of crawled content in a large-scale distributed environment
• These aims require a crawler to
– Adaptively update priorities
– Operate as a service
3. Adaptive Prioritization
• New Heritrix frontier class
– Plug & Play with open source Heritrix
– Minimal configuration
• Adding forward index for URLs
– locates a link already scheduled for crawling
• Moves a scheduled link to the position corresponding to its updated priority
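The idea of a forward index over the crawl queue can be illustrated with a minimal Python sketch. This is not the Heritrix frontier class itself (Heritrix is written in Java, and its internals are not shown in these slides); the class name `AdaptiveFrontier` and its methods are hypothetical, and the sketch assumes that a lower number means a higher priority. It pairs a priority queue with a dictionary from URL to queue entry, so that an already scheduled link can be located directly and moved when its priority changes.

```python
import heapq
import itertools


class AdaptiveFrontier:
    """Toy frontier (hypothetical, not Heritrix code): a min-heap of
    scheduled URLs plus a forward index from URL to its heap entry."""

    def __init__(self):
        self._heap = []                # entries: [priority, seq, url, alive]
        self._index = {}               # forward index: url -> live entry
        self._seq = itertools.count()  # tie-breaker: FIFO among equal priorities

    def schedule(self, url, priority):
        """Schedule a URL, or move it if it is already scheduled."""
        old = self._index.get(url)     # forward-index lookup, no queue scan
        if old is not None:
            old[3] = False             # mark the stale entry as dead
        entry = [priority, next(self._seq), url, True]
        self._index[url] = entry
        heapq.heappush(self._heap, entry)

    def next_url(self):
        """Return (url, priority) for the best live entry, or None if empty."""
        while self._heap:
            priority, _, url, alive = heapq.heappop(self._heap)
            if alive:
                del self._index[url]
                return url, priority
        return None

    def __len__(self):
        return len(self._index)        # number of distinct scheduled URLs


# Assumed convention: a smaller number is crawled earlier.
frontier = AdaptiveFrontier()
frontier.schedule("http://example.org/a", 5)
frontier.schedule("http://example.org/b", 3)
frontier.schedule("http://example.org/a", 1)  # re-prioritize an existing link
```

After the third call, `http://example.org/a` is still scheduled only once, but it is now returned before `http://example.org/b`. The sketch "moves" an entry by invalidating the old one and pushing a new one, which is a common way to re-prioritize in a binary heap; a real frontier might instead use a sorted, disk-backed structure.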