This presentation on the ARCOMEM System is part of the ARCOMEM training curriculum. Feel free to roam around or contact us on Twitter via @arcomem to learn more about ARCOMEM training on archiving Social Media.
1. ARCOMEM System Overview
(Advanced Level)
Prerequisite: ARCOMEM System Overview (Beginner Level)
Thomas Risse
L3S Research Center
Hannover, Germany
risse@L3s.de
2. Architecture Overview
[Architecture diagram: the Crawler / Online Processing level (Intelligent Crawl Definition, Queue Management, Application-Aware Helper, Resource Selection & Prioritization, Resource Fetching, GATE Online Analysis / Social Web Analysis, Relevance Analysis & Prioritization), the Offline Processing level (Consolidation, Enrichment, GATE Offline Analysis, Social Web Analysis, Image/Video Analysis, Extracted Social Web Information), the Cross Crawl Analysis level (Named Entity Evolution Recognition, Twitter Dynamics), the central ARCOMEM Storage (HBase, H2RDF), and the applications: Crawler Cockpit, WARC Export (producing WARC Files), and SARA with a SOLR index for the Broadcaster and Parliament scenarios.]
The levels in the architecture represent the phases as described in the ARCOMEM overview. Details about the components can be found in the other courses.
3. Crawler and Online Analysis
• The Intelligent Crawl Specification (ICS) specifies the crawl intention
• Resource Fetching
– Heritrix or the IMF Large Scale Crawler can be used for collecting Web pages
– API crawling supports the collection of content from the Social Web via the APIs of the sites
• The Application-Aware Helper extracts links from Web and Social Web content by taking application-specific functionality into account, e.g. for Twitter or YouTube
• Simple content analysis (e.g. keyword detection) in the online phase allows an efficient relevance ranking of extracted links
• All results are stored in the ARCOMEM Storage
• Crawler and Online Analysis are tightly coupled
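As a rough illustration of the online-phase relevance ranking described above, the following sketch scores extracted links by simple keyword detection against the terms of the crawl intention. The function and parameter names (`score_link`, `prioritize`, `ics_keywords`) are illustrative assumptions, not part of the ARCOMEM API:

```python
# Illustrative sketch of keyword-based link ranking in the online phase.
# ics_keywords stands in for terms taken from the Intelligent Crawl
# Specification; score_link/prioritize are hypothetical names.

def score_link(anchor_text: str, ics_keywords: set[str]) -> float:
    """Fraction of ICS keywords that appear in the link's anchor text."""
    words = set(anchor_text.lower().split())
    if not ics_keywords:
        return 0.0
    return len(words & ics_keywords) / len(ics_keywords)

def prioritize(links: list[tuple[str, str]], ics_keywords: set[str]) -> list[str]:
    """Order (url, anchor_text) pairs by descending relevance score."""
    scored = [(score_link(text, ics_keywords), url) for url, text in links]
    return [url for _, url in sorted(scored, key=lambda s: -s[0])]
```

In the real system this role is played by the GATE Online Analysis and the Resource Selection & Prioritization components feeding the crawl queue.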
[Diagram: the Crawler and Online Processing components of the overall architecture, grouped around the ARCOMEM Storage (HBase, H2RDF).]
4. Offline Processing Level
Thorough analysis of crawled Web objects
• GATE-based extraction of entities, opinions, and events from text
• Topic extraction
• Analysis of images and videos
– Extraction of entities, locations, etc.
– Identification of duplicates
• Social Web analysis
– Identification of cultural differences in the Social Web
– Domain expert detection
– Social search

Archive Enrichment
• Enrichment of all crawled content items with semantic information about topics, entities, and events
• Interlinking entities and events with Linked Data
• Sentiments of user content in the Social Web

Support for Appraisal and Selection
• Learn more about the crawl intention
• Feedback to the crawl specification
• Ranking of content for WARC export
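The enrichment step can be pictured as attaching RDF-style triples to each crawled item and interlinking extracted entities with Linked Data. This minimal sketch uses hypothetical predicate names and a hypothetical lookup table; it is not the ARCOMEM data model:

```python
# Illustrative sketch of archive enrichment: a crawled item is annotated
# with extracted entities as (subject, predicate, object) triples, and
# each entity is linked to a Linked Data URI when one is known.
# Predicate names and the linked_data mapping are assumptions.

def enrich(item_uri: str, entities: list[str], linked_data: dict[str, str]):
    triples = []
    for entity in entities:
        triples.append((item_uri, "mentionsEntity", entity))
        if entity in linked_data:
            # Interlink the entity with its Linked Data counterpart.
            triples.append((entity, "sameAs", linked_data[entity]))
    return triples
```

In ARCOMEM, such triples would end up in the RDF part of the ARCOMEM Storage as the knowledge base.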
[Diagram: the Offline Processing components (Consolidation, Enrichment, GATE Offline Analysis, Social Web Analysis, Image/Video Analysis, Extracted Social Web Information) connected to the ARCOMEM Storage (HBase, H2RDF).]
5. Cross Crawl Analysis (CCA) Level
Analyzing several crawls
• Temporal analytics to get a better understanding of changes that occur over time
• Combination of content to obtain a larger collection, e.g. combining several Twitter crawls

Understanding the dynamics of the Web content
• Evolution of entities over time, e.g. Joseph Ratzinger → Pope Benedict XVI → Pope Emeritus Benedict XVI
• Evolution of opinions

Better understanding of the public perception
• Dynamics of Twitter hashtags
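The entity-evolution example above can be sketched as following a chain of name successions recorded across crawls. The function `evolution_chain` and its input format are hypothetical, purely to illustrate the idea:

```python
# Illustrative sketch of named-entity evolution across crawls: given a
# mapping of recorded successions (earlier name -> later name), follow
# the chain from a starting name to the most recent known variant.
# evolution_chain and the successions mapping are assumptions.

def evolution_chain(start: str, successions: dict[str, str]) -> list[str]:
    chain = [start]
    seen = {start}
    # Guard against cycles in noisy data by tracking visited names.
    while chain[-1] in successions and successions[chain[-1]] not in seen:
        nxt = successions[chain[-1]]
        chain.append(nxt)
        seen.add(nxt)
    return chain
```

With the successions from the slide's example, the chain resolves from "Joseph Ratzinger" through "Pope Benedict XVI" to "Pope Emeritus Benedict XVI".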
[Diagram: the Cross Crawl Analysis components (Named Entity Evolution Recognition, Twitter Dynamics, GATE CCA Analysis, Opinion Dynamics) connected to the ARCOMEM Storage (HBase, H2RDF).]
6. Technologies
Technical Framework
• Scalability is important due to the large amount of analysis
• An Apache Hadoop-based environment serves as the framework for the implementation

ARCOMEM Storage
• Central component of ARCOMEM
• HBase for scalability
– Web Object Store
– RDF Store as Knowledge Base
• An ARCOMEM data model has been specified
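To illustrate the split between the Web Object Store and the RDF Knowledge Base, here is a minimal in-memory stand-in. The real system uses HBase and H2RDF on Hadoop; this class, its method names, and its dictionary layout are purely illustrative:

```python
# Minimal in-memory stand-in for the two stores named on the slide:
# an HBase-style web object store (row key -> column values) and an
# RDF triple store acting as the knowledge base. Not the real API.

class ArcomemStorage:
    def __init__(self):
        self.web_objects = {}   # row key (URL) -> {column: value}
        self.triples = set()    # (subject, predicate, object)

    def put_object(self, url: str, column: str, value):
        """Store a column value for a crawled Web object."""
        self.web_objects.setdefault(url, {})[column] = value

    def add_triple(self, s: str, p: str, o: str):
        """Add a fact to the knowledge base."""
        self.triples.add((s, p, o))

    def query(self, s=None, p=None, o=None):
        """Simple triple-pattern match; None acts as a wildcard."""
        return [(ts, tp, to) for ts, tp, to in self.triples
                if (s is None or ts == s)
                and (p is None or tp == p)
                and (o is None or to == o)]
```

The point of the sketch is the division of responsibilities: raw crawled content goes into the object store, while extracted semantic information is queried as triples.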
7. Crawler Cockpit & Applications
• Crawler Cockpit
– Adapted interfaces to the crawler
– Allows the specification and refinement of the Intelligent Crawl Specification (ICS)
• WARC Export
– Semi-automatic selection of the content to be preserved
– Selection is based on the ICS and the extracted meta information
– Raw content and RDF metadata are exported as WARC files
• Search And Retrieval Application (SARA)
– End-user access to exported Web archives (incl. index)
– One application, two scenarios (Broadcaster and Parliament)
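The semi-automatic selection for WARC export can be sketched as ranking crawled items by their extracted relevance and keeping those above a threshold, with a curator then confirming the shortlist. The function name, the `relevance` field, and the threshold are assumptions for illustration:

```python
# Illustrative sketch of semi-automatic selection for WARC export:
# items carry a relevance score derived from the ICS and extracted
# metadata; those above a curator-chosen threshold are proposed for
# export. Field and function names are hypothetical.

def select_for_export(items: list[dict], threshold: float) -> list[str]:
    """Return URLs of items proposed for export, best first."""
    ranked = sorted(items, key=lambda it: it["relevance"], reverse=True)
    return [it["url"] for it in ranked if it["relevance"] >= threshold]
```

In the real workflow, the selected raw content and its RDF metadata would then be written out as WARC files.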
[Diagram: Crawler Cockpit, WARC Export (producing WARC files), and SARA (with SOLR index) for the Broadcaster and Parliament scenarios, grouped around the ARCOMEM Storage (HBase, H2RDF).]
8. ARCOMEM System Configurations (1/3)
• The ARCOMEM System is complex
– The development aim was to be generic enough to serve as many Web archiving goals as possible
– Large number of phases and components
– Complex handling and maintenance of the whole system
• But not every user needs all functionalities
– A subset is often enough
– Phases can be used separately
9. ARCOMEM System Configurations (2/3)
Crawler Configurations
• Heritrix + Online Analysis
– Simple configuration
– Completely Open Source
– Runs on standard servers
– Interesting for a broad group of organizations that do small to medium-sized focused crawls
• Large Scale Crawler + Online Analysis + Offline Analysis (+ Cross Crawl Analysis)
– Complex high-throughput system
– Requires big clusters or server farms
– Analysis steps have to be selected carefully depending on user requirements, e.g. not every crawl requires video analysis
– Mainly interesting for service providers (e.g. the Internet Memory Foundation) or other organizations with large-scale crawl requirements (e.g. national libraries)
10. ARCOMEM System Configurations (3/3)
Offline Analysis / Cross Crawl Processing
• Analysis modules can be used independently of the crawler to analyze and enrich existing Web crawls
• Analysis steps have to be selected carefully depending on user requirements
• Depending on the analysis, this requires big clusters or server farms
• Interesting for service providers (e.g. the Internet Memory Foundation, university computing centers)

Applications / User Interfaces
• Crawler Cockpit
– Easy user interface for crawler control
– Interesting for all crawler users
• SARA
– Generic tool for content exploration
– Interesting for all end users of Web archives
– Interesting for service providers to deliver results