This presentation on the ARCOMEM system is part of the ARCOMEM training curriculum. Feel free to roam around or contact us on Twitter via @arcomem to learn more about ARCOMEM training on archiving Social Media.
2. Overview
Beginner Level
• Approach of current crawlers
• What’s new in ARCOMEM?
• The ARCOMEM Approach
– Overview of the phases
– Overview of the processing levels
• Handling of Preservation in ARCOMEM
Advanced Level
• Overview of the system architecture
• Possible ARCOMEM System Configurations
7. What's new in ARCOMEM?
• Intelligent Crawler
– Semantically Enhanced Crawl Specification
– "Understands" the crawl intention
– Crawler guidance using social and semantic information
– Stops crawling at irrelevant pages
– Two-stage crawling strategy: Web → ARCOMEM Storage → Archive
• Advanced Web Archive Enrichment
– Semantic Information: Entities, Topics, Opinions, Events (ETOE)
– Social Context: interlinking Web ↔ Social Web, trustworthiness of information and users
• Archivist and End User Support
– Archivist Tool
– Searching and browsing Web archives with different facets
8. ARCOMEM Phases: Crawl Specification
1. Intelligent Crawl Specification (ICS)
The ICS describes the intended crawl by specifying keywords, entities, topics, etc., together with reference pages and starting points. Reference pages match the intended crawl content 100% and are used by the crawler to learn more about the crawl.
Entities
Obama, Romney, Biden, Ryan, Republicans, Democrats, …
Keywords
US Election, CommitToMitt, Teaparty, Budget deficit, …
Reference Seedlist
https://twitter.com/whitehouse, https://twitter.com/blog44, https://twitter.com/BarackObama, ...
Seedlist
http://news.bbc.co.uk/, http://telegraph.co.uk/, ...
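The example specification above can be written down as a small data structure. The following is only a sketch: the class name `CrawlSpecification` and its field names are hypothetical and are not taken from the ARCOMEM code base; only the example values come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlSpecification:
    """Hypothetical container for the four ICS parts shown on the slide."""

    entities: list[str] = field(default_factory=list)
    keywords: list[str] = field(default_factory=list)
    # Reference seeds: pages known to match the crawl intention.
    reference_seeds: list[str] = field(default_factory=list)
    # Ordinary seeds: starting points whose relevance is not yet known.
    seeds: list[str] = field(default_factory=list)

    def terms(self) -> set[str]:
        """Lower-cased entities and keywords, for simple text matching."""
        return {t.lower() for t in self.entities + self.keywords}


ics = CrawlSpecification(
    entities=["Obama", "Romney", "Biden", "Ryan", "Republicans", "Democrats"],
    keywords=["US Election", "CommitToMitt", "Teaparty", "Budget deficit"],
    reference_seeds=[
        "https://twitter.com/whitehouse",
        "https://twitter.com/blog44",
        "https://twitter.com/BarackObama",
    ],
    seeds=["http://news.bbc.co.uk/", "http://telegraph.co.uk/"],
)
```

Keeping reference seeds separate from ordinary seeds mirrors the distinction on the slide: the former are trusted examples of relevant content, the latter are merely places to start.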
9. ARCOMEM Phases: Crawling & Online Processing
2. Crawling & Online Processing
In this phase, web pages and social web content are collected and a first semantic analysis is applied. The result of the analysis guides the crawler by ranking the extracted links by importance. All information is stored in the ARCOMEM Storage.
[Diagram: the example crawl specification (entities, keywords, reference seed list, seed list) together with the components Internet, Crawling, Online Processing, and ARCOMEM Storage.]
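Ranking extracted links by importance can be illustrated with a priority queue. This is a minimal sketch under stated assumptions, not the ARCOMEM crawler: the `relevance` function is a toy term-overlap measure, the `Frontier` class is hypothetical, and the `example.org` URLs and anchor texts are invented for illustration.

```python
import heapq


def relevance(text: str, terms: set[str]) -> float:
    """Fraction of specification terms that occur in the text (toy measure)."""
    if not terms:
        return 0.0
    lowered = text.lower()
    return sum(term in lowered for term in terms) / len(terms)


class Frontier:
    """Priority queue of URLs; the highest-scoring link is fetched first."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, int, str]] = []
        self._seen: set[str] = set()
        self._counter = 0  # tie-breaker: equal scores keep insertion order

    def push(self, url: str, score: float) -> None:
        if url in self._seen:
            return  # never queue the same URL twice
        self._seen.add(url)
        # heapq is a min-heap, so the score is stored negated.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


terms = {"obama", "romney", "us election"}
frontier = Frontier()
# Anchor text stands in for the context of each extracted link.
links = {
    "http://example.org/a": "Obama and Romney debate ahead of the US election",
    "http://example.org/b": "Football results",
    "http://example.org/c": "Romney rally",
}
for url, anchor in links.items():
    frontier.push(url, relevance(anchor, terms))

order = [frontier.pop() for _ in range(len(frontier))]
```

A crawler that also wants to stop at irrelevant pages could simply skip links whose score falls below a chosen threshold instead of queuing them.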
10. ARCOMEM Phases: Offline Processing
3. Offline Processing
Offline processing runs after the collection of content has finished. The aim of this phase is to enrich the crawled pages with meta-information extracted from the content. The enrichment helps in selecting content for the final Web archive. It also makes searching and browsing the final Web archive easier.
[Diagram: the example crawl specification (entities, keywords, reference seed list, seed list) together with the components Internet, Crawling, Online Processing, Offline Processing, and ARCOMEM Storage.]
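As a rough illustration of enriching a page with meta-information, the sketch below tags a text with the entities it mentions. It uses plain dictionary (gazetteer) lookup as a stand-in; the slides do not say how ARCOMEM extracts entities, and the function name `extract_entities` is hypothetical.

```python
import re


def extract_entities(text: str, gazetteer: list[str]) -> set[str]:
    """Return the gazetteer entries that occur in the text as whole words."""
    found = set()
    for name in gazetteer:
        # \b anchors prevent matches inside longer words; re.escape keeps
        # any special characters in the name literal.
        if re.search(rf"\b{re.escape(name)}\b", text, flags=re.IGNORECASE):
            found.add(name)
    return found


page_text = "Obama met Ryan to discuss the budget deficit."
entities = extract_entities(page_text, ["Obama", "Romney", "Biden", "Ryan"])
```

The resulting set is the kind of metadata that can later be used for selection and for faceted search.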
11. ARCOMEM Phases: Appraisal & Selection
4. Appraisal & Selection
Based on the information given in the Intelligent Crawl Specification (ICS) and on the enrichment of the content, the most interesting content items are selected for storage in the final Web archive. The final Web archive consists of WARC files, which contain the crawled pages and, in RDF format, all enrichments produced during offline processing.
[Diagram: the example crawl specification (entities, keywords, reference seed list, seed list) together with the components Internet, Crawling, Online Processing, Offline Processing, ARCOMEM Storage, Appraisal, Selection, and Archive.]
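Selection can be pictured as a filter over the enriched items. The sketch below keeps an item when it mentions enough of the entities named in the crawl specification. The `EnrichedItem` type, the `min_overlap` parameter, and the `example.org` data are assumptions made for illustration; the actual ARCOMEM selection criteria are not described on the slide.

```python
from dataclasses import dataclass


@dataclass
class EnrichedItem:
    url: str
    entities: set[str]  # entities found during offline processing


def select_for_archive(
    items: list[EnrichedItem],
    wanted_entities: list[str],
    min_overlap: int = 1,
) -> list[EnrichedItem]:
    """Keep items that mention at least `min_overlap` of the wanted entities."""
    wanted = {e.lower() for e in wanted_entities}
    return [
        item
        for item in items
        if len({e.lower() for e in item.entities} & wanted) >= min_overlap
    ]


items = [
    EnrichedItem("http://example.org/politics", {"Obama", "Romney"}),
    EnrichedItem("http://example.org/sport", {"Cricket"}),
]
selected = select_for_archive(items, ["Obama", "Biden"])
```

Raising `min_overlap` makes the selection stricter, which trades archive size against coverage.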
12. ARCOMEM Phases: Applications
[Diagram: the example crawl specification (entities, keywords, reference seed list, seed list) together with the components Internet, Crawling, Online Processing, Offline Processing, ARCOMEM Storage, Appraisal, Selection, Archive, and SARA for broadcasters and parliaments.]
5. Applications
The Search and Retrieval Application (SARA) allows end users to search and browse the archive in different ways, e.g. by keywords, entities, topics, or opinions.
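Searching by different facets can be sketched with one inverted index per facet. This is not the SARA implementation; the function names, the facet names `entity` and `topic`, and the document ids are hypothetical.

```python
from collections import defaultdict


def build_facet_index(docs: dict[str, dict[str, list[str]]]):
    """Map facet -> value -> set of document ids."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, facets in docs.items():
        for facet, values in facets.items():
            for value in values:
                index[facet][value.lower()].add(doc_id)
    return index


def search(index, **criteria: str) -> set[str]:
    """Return ids of documents matching every given facet=value pair."""
    result = None
    for facet, value in criteria.items():
        matches = index[facet][value.lower()]
        result = matches if result is None else result & matches
    # Copy so callers cannot modify the index through the returned set.
    return set(result) if result else set()


docs = {
    "d1": {"entity": ["Obama"], "topic": ["Budget deficit"]},
    "d2": {"entity": ["Obama", "Romney"], "topic": ["US Election"]},
}
index = build_facet_index(docs)
by_entity = search(index, entity="Obama")
by_both = search(index, entity="Obama", topic="US Election")
```

Combining facets is a set intersection, so each additional criterion can only narrow the result.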
13. ARCOMEM Phases: Cross Crawl Analytics
[Diagram: the example crawl specification (entities, keywords, reference seed list, seed list) together with the components Internet, Crawling, Online Processing, Offline Processing, ARCOMEM Storage, Appraisal, Selection, Archive, SARA for broadcasters and parliaments, and Cross Crawl Processing.]
6. Cross Crawl Analytics
Cross-crawl analysis allows content analytics across archives. This makes it possible to combine Web archives into a larger collection of documents or to study how content evolves over time, for example the evolution of language or of opinions.
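A very simple form of studying evolution over time is to compare how often a term appears in each crawl. The sketch below assumes each archive is available as a list of plain-text documents keyed by a crawl label; the function name `term_trend` and the sample texts are invented for illustration.

```python
def term_trend(archives: dict[str, list[str]], term: str) -> dict[str, float]:
    """Map each crawl label to the share of its documents mentioning `term`."""
    needle = term.lower()
    trend = {}
    for label, documents in archives.items():
        if not documents:
            trend[label] = 0.0  # avoid division by zero for empty crawls
            continue
        hits = sum(needle in doc.lower() for doc in documents)
        trend[label] = hits / len(documents)
    return trend


archives = {
    "crawl-2008": ["Obama wins", "Budget talks"],
    "crawl-2012": ["Obama and Romney", "Obama again"],
}
trend = term_trend(archives, "obama")
```

Using a share rather than a raw count keeps crawls of different sizes comparable.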
14. Preservation in ARCOMEM
Content Preservation in ARCOMEM
• Selection and appraisal of Web and Social Web content
• Preparation of WARC files for preservation
• Provides access to preserved Web content
• Not part of ARCOMEM:
– Long-term preservation of WARC files
– Format handling, etc.
Semantic Preservation in ARCOMEM
• Extraction of Entities, Events, Topics, Opinions
• Enrichment with Linked Data
• Created WARC files contain
– Raw Web Data
– RDF triples of enrichment
• Preservation of Linked Data
– Not part of ARCOMEM
– See EU Projects: DIACHRON (IP), PRELIDA (CA)
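To give a feel for what "RDF triples of enrichment" stored next to the raw Web data might look like, the sketch below renders one enrichment statement in N-Triples syntax. The predicate URI under `example.org` and the helper `ntriple` are hypothetical; the slides do not specify the vocabulary ARCOMEM uses.

```python
def ntriple(subject: str, predicate: str, obj: str, literal: bool = False) -> str:
    """Render one statement as an N-Triples line.

    Only backslashes and double quotes are escaped in literals, which is
    enough for short single-line labels such as entity names.
    """
    if literal:
        escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
        rendered = f'"{escaped}"'
    else:
        rendered = f"<{obj}>"
    return f"<{subject}> <{predicate}> {rendered} ."


line = ntriple(
    "http://example.org/page/1",
    "http://example.org/vocab#mentions",  # hypothetical predicate
    "Obama",
    literal=True,
)
```

Each such line is self-describing, which is one reason a line-based RDF serialization is convenient to store alongside the crawled pages.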