ARCOMEM System Overview
(Advanced Level)
Prerequisite: ARCOMEM System Overview (Beginner Level)
Thomas Risse
L3S Research Center
Hannover, Germany
risse@L3s.de
Architecture Overview
[Figure: architecture diagram with four levels (Crawler, Online Processing, Offline Processing, Cross Crawl Analysis) around the central ARCOMEM Storage (HBase, H2RDF), plus Crawler Cockpit, WARC Export, WARC Files and SARA (SOLR Index; Broadcaster, Parliament).]
The levels in the architecture represent the phases as described in the ARCOMEM overview.
Details about the components can be found in the other courses.
Crawler and Online Analysis
• Intelligent Crawl Specification (ICS) specifies the crawl intention
• Resource Fetching
– Heritrix or the IMF Large Scale Crawler can be used for collecting Web pages
– API crawling supports the collection of Social Web content via the APIs of the sites
• The Application-Aware Helper extracts links from Web and Social Web content,
taking application-specific functionality into account, e.g. for Twitter and YouTube.
• Simple content analysis (e.g. keyword detection) in the online phase allows
an efficient relevance ranking of extracted links
• All results are stored in the ARCOMEM Storage
• Crawler and Online Analysis are tightly coupled
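The keyword-based relevance ranking mentioned above can be illustrated with a minimal sketch. It is not the ARCOMEM implementation: the scoring function, the `UrlQueue` class, the keywords and the example URLs are all assumptions made for illustration. The idea is that each extracted link gets a cheap score from keyword matches in its surrounding text, and the crawler fetches the highest-scoring link first.

```python
import heapq
import re


def keyword_score(text, keywords):
    """Fraction of crawl keywords that occur in the text (case-insensitive)."""
    words = set(re.findall(r"\w+", text.lower()))
    hits = sum(1 for kw in keywords if kw.lower() in words)
    return hits / len(keywords) if keywords else 0.0


class UrlQueue:
    """Hypothetical priority queue that returns the highest-scoring URL first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker: equal scores keep insertion order

    def push(self, url, score):
        if url in self._seen:  # never queue the same URL twice
            return
        self._seen.add(url)
        # heapq is a min-heap, so the score is stored negated
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


# Invented example: keywords taken from a crawl specification, and links
# paired with the text found around them.
keywords = ["election", "parliament"]
links = [
    ("http://example.org/a", "Parliament debates the election law"),
    ("http://example.org/b", "Weather forecast for the weekend"),
    ("http://example.org/c", "Election results announced"),
]

queue = UrlQueue()
for url, context in links:
    queue.push(url, keyword_score(context, keywords))

ranked = [queue.pop() for _ in range(len(queue))]
for url, score in ranked:
    print(f"{score:.2f}  {url}")
```

A cheap score like this fits the online phase because it must keep up with the crawler; the thorough analysis is left to the offline phase.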
[Figure: Crawler and Online Processing components: Queue Management, Application-Aware Helper, Resource Selection & Prioritization, Resource Fetching, Intelligent Crawl Definition, GATE Online Analysis, Social Web Analysis, URLs, Relevance Analysis & Prioritization, ARCOMEM Storage (HBase, H2RDF).]
Offline Processing Level
Thorough analysis of
crawled Web objects
• GATE based extraction of
Entities, Opinions, Events
from text
• Topic Extraction
• Analysis of images and videos
– Extraction of entities, locations, etc.
– Identification of duplicates
• Social Web Analysis
– Identification of cultural differences in the Social Web
– Domain Expert detection
– Social Search
Archive Enrichment
• Enrichment of all crawled content items with semantic information about topics, entities and events
• Interlinking entities and events with Linked Data
• Sentiments of user content in the Social Web
Support for Appraisal and Selection
• Learn more about the crawl intention
• Feedback to the crawl specification
• Ranking of content for WARC export
[Figure: Offline Processing components: Consolidation, Enrichment, GATE Offline Analysis, Social Web Analysis, Image/Video Analysis, Extracted Social Web Information, ARCOMEM Storage (HBase, H2RDF).]
Cross Crawl Analysis (CCA) Level
Analyzing several crawls
• Temporal analytics to get a better understanding of changes that
occur over time.
• Combination of content to get a larger collection of content, e.g.
combining several Twitter crawls
Understanding the dynamics of the Web content
• Evolution of entities over time, e.g. Joseph Ratzinger → Pope
Benedict XVI → Pope Emeritus Benedict XVI
• Evolution of opinions
Better understanding of the public perception
• Dynamics of Twitter hashtags
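Two of the ideas on this slide, combining several Twitter crawls and following hashtag dynamics over time, can be sketched in a few lines. This is a simplified illustration, not the ARCOMEM Twitter Dynamics module: the tweet representation (a `(timestamp, text)` tuple), the function names and the sample tweets are all invented for the example.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

HASHTAG = re.compile(r"#(\w+)")


def merge_crawls(*crawls):
    """Combine several crawls into one collection, dropping exact duplicates."""
    seen = set()
    merged = []
    for crawl in crawls:
        for tweet in crawl:
            if tweet not in seen:
                seen.add(tweet)
                merged.append(tweet)
    return sorted(merged)  # tuples sort by timestamp first


def hashtag_counts_per_day(tweets):
    """Map each day (YYYY-MM-DD) to a Counter of lower-cased hashtags."""
    per_day = defaultdict(Counter)
    for timestamp, text in tweets:
        tags = [tag.lower() for tag in HASHTAG.findall(text)]
        per_day[timestamp.date().isoformat()].update(tags)
    return dict(per_day)


# Invented sample data: two overlapping crawls of the same topic.
crawl_a = [
    (datetime(2013, 2, 11, 12, 0), "Breaking news #Election"),
    (datetime(2013, 2, 11, 13, 0), "Turnout is high #election #vote"),
]
crawl_b = [
    (datetime(2013, 2, 11, 13, 0), "Turnout is high #election #vote"),  # duplicate
    (datetime(2013, 2, 12, 9, 0), "Results are in #election #results"),
]

merged = merge_crawls(crawl_a, crawl_b)
dynamics = hashtag_counts_per_day(merged)
for day in sorted(dynamics):
    print(day, dynamics[day].most_common())
```

Comparing the per-day counters shows which hashtags appear, grow or fade between crawls. A real analysis would need a more robust notion of duplicate than exact tuple equality.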
[Figure: Cross Crawl Analysis components: Named Entity Evol. Recog., Twitter Dynamics, GATE CCA Analysis, Opinion Dynamics, ARCOMEM Storage (HBase, H2RDF).]
Technologies
Technical Framework
• Scalability is important due to the
large amount of analysis
• Apache Hadoop based
environment as framework for the
implementation
ARCOMEM Storage
• Central Component of ARCOMEM
• HBase for scalability
– Web Object Store
– RDF Store as Knowledge Base
• ARCOMEM data model has been
specified
[Figure: full architecture diagram, repeated from the Architecture Overview slide.]
Crawler Cockpit & Applications
• Crawler Cockpit
– Adapted interfaces to the crawler
– Allows the specification and refinement of the
Intelligent Crawl Specification (ICS)
• WARC Export
– Semi-automatic selection of content to
be preserved
– Selection is based on the ICS and the extracted
meta information
– Raw content and RDF Metadata are exported
as WARC Files
• Search And Retrieval Application (SARA)
– End user access to exported
Web archives (incl. index)
– One Application, Two Scenarios
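To make the WARC export step more concrete, the sketch below serializes raw content and accompanying RDF metadata as two WARC/1.0 records. It uses only the Python standard library and follows the general record layout of the WARC format (version line, named fields, blank line, payload, two CRLFs). The helper name `warc_record`, the choice of the `resource` and `metadata` record types, the N-Triples payload and the example URI are assumptions for illustration; the slides do not specify how ARCOMEM lays out its exported records.

```python
import uuid
from datetime import datetime, timezone


def warc_record(record_type, target_uri, content_type, payload):
    """Serialize one WARC/1.0 record as bytes (hypothetical helper)."""
    date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    fields = [
        "WARC/1.0",
        f"WARC-Type: {record_type}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {date}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    # The header block ends with an empty line; the payload is followed
    # by two CRLF pairs that terminate the record.
    header = "\r\n".join(fields).encode("utf-8") + b"\r\n\r\n"
    return header + payload + b"\r\n\r\n"


# Invented example content and metadata.
uri = "http://example.org/page.html"
html = b"<html><body>Example page</body></html>"
rdf = (
    b"<http://example.org/page.html> "
    b"<http://purl.org/dc/terms/subject> \"election\" .\n"
)

records = [
    warc_record("resource", uri, "text/html", html),
    warc_record("metadata", uri, "application/n-triples", rdf),
]

with open("export.warc", "wb") as out:
    for record in records:
        out.write(record)
```

In practice a maintained WARC library would be preferable to hand-written serialization; the point here is only that raw content and metadata can travel together in one file.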
[Figure: application components: Crawler Cockpit, ARCOMEM Storage (HBase, H2RDF), WARC Export, WARC Files, SARA with SOLR Index, Broadcaster, Parliament.]
ARCOMEM System Configurations (1/3)
• The ARCOMEM system is complex
– Development aim was to be generic to serve as
many Web archive goals as possible
– Large number of phases and components
– Complex handling and maintenance of the
whole system
• But not every user needs all functionalities
– A subset is often enough
– Phases can be used separately
ARCOMEM System Configurations (2/3)
Crawler Configurations
• Heritrix + Online Analysis
– Simple configuration
– Completely Open Source
– Runs on standard servers
– Interesting for a broad group of organizations that do
small to medium sized focused crawls
• Large Scale Crawler + Online Analysis + Offline
Analysis (+ Cross Crawl Analysis)
– Complex high-throughput system
– Requires big clusters or server farms
– Analysis steps have to be well selected depending on
user requirements, e.g. not every crawl requires video
analysis
– Mainly interesting for service providers (e.g. Internet
Memory Foundation) or other organizations with large
scale crawl requirements (e.g. national libraries)
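The two crawler configurations above differ mainly in which phases and analysis modules are switched on. The sketch below expresses that as data. The `Configuration` class, its field names and the module names are hypothetical and do not correspond to actual ARCOMEM configuration files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Configuration:
    """Hypothetical description of one ARCOMEM deployment."""
    crawler: str
    online_analysis: bool = True
    offline_modules: tuple = ()       # e.g. ("text", "social-web", "video")
    cross_crawl_analysis: bool = False

    def phases(self):
        """Return the phases that this configuration actually runs."""
        active = ["crawler"]
        if self.online_analysis:
            active.append("online")
        if self.offline_modules:
            active.append("offline")
        if self.cross_crawl_analysis:
            active.append("cross-crawl")
        return active


# Small to medium sized focused crawls on standard servers.
small = Configuration(crawler="Heritrix")

# High-throughput setup; video analysis is left out on purpose,
# since not every crawl requires it.
large = Configuration(
    crawler="Large Scale Crawler",
    offline_modules=("text", "social-web"),
    cross_crawl_analysis=True,
)

print(small.phases())
print(large.phases())
```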
ARCOMEM System Configurations (3/3)
Offline Analysis / Cross Crawl processing
• Analysis modules can be used independently
from the crawler to analyze and enrich existing Web crawls
• Analysis steps have to be well selected depending
on user requirements
• Depending on the analysis, this requires
big clusters / server farms
• Interesting for service providers (e.g. Internet Memory
Foundation, university computing centers)
Applications / User Interfaces
• Crawler Cockpit
– Easy user interface for crawler control
– Interesting for all crawler users
• SARA
– Generic tool for content exploration
– Interesting for all end users of Web Archives
– Interesting for Service Providers to deliver results
THANK YOU
CONTACT DETAILS
Dr. Thomas Risse
+49 511 762 17764
risse@L3S.de
www.arcomem.eu

How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

ARCOMEM System Overview Architecture

  • 1. ARCOMEM System Overview (Advanced Level)
      Prerequisite: ARCOMEM System Overview (Beginner Level)
      Thomas Risse
      L3S Research Center, Hannover, Germany
      risse@L3s.de
  • 2. Architecture Overview
      The levels in the architecture represent the phases as described in the ARCOMEM overview. Details about the components can be found in the other courses.
      [Figure: architecture diagram with the following components]
      • Crawler and Online Processing: Intelligent Crawl Definition, Queue Management, Resource Fetching, Application-Aware Helper, Resource Selection & Prioritization, URLs, Relevance Analysis & Prioritization, GATE Online Analysis, Social Web Analysis
      • Offline Processing: GATE Offline Analysis, Social Web Analysis, Image/Video Analysis, Extracted Social Web Information, Consolidation, Enrichment
      • Cross Crawl Analysis: Named Entity Evol. Recog., Twitter Dynamics
      • ARCOMEM Storage (HBase, H2RDF)
      • Applications: Crawler Cockpit, WARC Export (WARC Files), SARA (SOLR Index), Broadcaster Application, Parliament Application
  • 3. Crawler and Online Analysis
      • The Intelligent Crawl Specification (ICS) specifies the crawl intention
      • Resource Fetching
          – Heritrix or the IMF Large Scale Crawler can be used for collecting Web pages
          – API Crawling supports the collection of content in the Social Web via the APIs of the sites
      • The Application-Aware Helper extracts links from Web and Social Web content by taking application-specific functionalities into account, e.g. for Twitter and YouTube
      • Simple content analysis (e.g. keyword detection) in the online phase allows an efficient relevance ranking of extracted links
      • All results are stored in the ARCOMEM Storage
      • Crawler and Online Analysis are tightly coupled
      [Figure: Crawler and Online Processing part of the architecture diagram: Intelligent Crawl Definition, Queue Management, Resource Fetching, Application-Aware Helper, Resource Selection & Prioritization, URLs, Relevance Analysis & Prioritization, GATE Online Analysis, Social Web Analysis, ARCOMEM Storage (HBase, H2RDF)]
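The keyword-based relevance ranking mentioned on slide 3 can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the keyword weights, the `relevance` and `rank_links` names, and the use of the link context as input are hypothetical and are not the actual ARCOMEM ICS format or API.

```python
import heapq
import re

# Hypothetical crawl intention: keywords with weights (not the real ICS format).
ICS_KEYWORDS = {"election": 2.0, "parliament": 1.5, "vote": 1.0}


def relevance(text, keywords=ICS_KEYWORDS):
    """Sum the weights of all keyword occurrences in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(keywords.get(token, 0.0) for token in tokens)


def rank_links(links):
    """Return URLs ordered from most to least relevant.

    `links` is an iterable of (url, context_text) pairs, where the context is
    e.g. the anchor text or the surrounding paragraph.
    """
    heap = []
    for order, (url, context) in enumerate(links):
        # heapq is a min-heap, so the score is negated; `order` breaks ties.
        heapq.heappush(heap, (-relevance(context), order, url))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

A priority queue is used here because a crawler keeps adding links while it fetches the most promising ones; a plain sort would work equally well for a fixed batch.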
  • 4. Offline Processing Level
      Thorough analysis of crawled Web objects
      • GATE-based extraction of Entities, Opinions and Events from text
      • Topic Extraction
      • Analysis of images and videos
          – Extraction of entities, locations, etc.
          – Identification of duplicates
      • Social Web Analysis
          – Identification of cultural differences in the Social Web
          – Domain Expert detection
          – Social Search
      Archive Enrichment
      • Enrichment of all crawled content items with semantic information about Topics, Entities and Events
      • Interlinking entities and events with Linked Data
      • Sentiments of user content in the Social Web
      Support for Appraisal and Selection
      • Learn more about the crawl intention
      • Feedback to the crawl specification
      • Ranking of content for WARC export
      [Figure: Offline Processing part of the architecture diagram: GATE Offline Analysis, Social Web Analysis, Image/Video Analysis, Extracted Social Web Information, Consolidation, Enrichment, ARCOMEM Storage (HBase, H2RDF)]
  • 5. Cross Crawl Analysis (CCA) Level
      Analyzing several crawls
      • Temporal analytics to get a better understanding of changes that occur over time
      • Combination of content to get a larger collection, e.g. combining several Twitter crawls
      Understanding the dynamics of the Web content
      • Evolution of entities over time, e.g. Joseph Ratzinger → Pope Benedict XVI → Pope Emeritus Benedict XVI
      • Evolution of opinions
          – Better understanding of the public perception
      • Dynamics of Twitter hashtags
      [Figure: Cross Crawl Analysis part of the architecture diagram: Named Entity Evol. Recog., Twitter Dynamics, GATE CCA Analysis, Opinion Dynamics, ARCOMEM Storage (HBase, H2RDF)]
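As an illustration of what "dynamics of Twitter hashtags" across several crawls could mean in practice, the following sketch counts hashtag occurrences per crawl and lines the counts up over time. The function names and the input shape (a list of labelled tweet lists) are hypothetical; the actual ARCOMEM Twitter Dynamics module is not described in these slides and may work quite differently.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")


def hashtag_counts(tweets):
    """Count case-insensitive hashtag occurrences in one crawl."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tag.lower() for tag in HASHTAG.findall(tweet))
    return counts


def hashtag_dynamics(crawls):
    """Map each hashtag to its count per crawl, in crawl order.

    `crawls` is a list of (label, tweets) pairs sorted by crawl time.
    """
    per_crawl = [hashtag_counts(tweets) for _, tweets in crawls]
    tags = set().union(*per_crawl) if per_crawl else set()
    return {tag: [counts[tag] for counts in per_crawl] for tag in sorted(tags)}
```

Each hashtag ends up with one count per crawl, so a tag that appears only in later crawls shows leading zeros, which makes rising and fading topics easy to spot.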
  • 6. Technologies
      Technical Framework
      • Scalability is important due to the large amount of analysis
      • Apache Hadoop based environment as framework for the implementation
      ARCOMEM Storage
      • Central component of ARCOMEM
      • HBase for scalability
          – Web Object Store
          – RDF Store as Knowledge Base
      • ARCOMEM data model has been specified
      [Figure: full architecture diagram as on slide 2, with the Broadcaster Application and Parliament Application shown under Applications]
  • 7. Crawler Cockpit & Applications
      • Crawler Cockpit
          – Adapted interfaces to the crawler
          – Allows the specification and refinement of the Intelligent Crawl Specification (ICS)
      • WARC Export
          – Semi-automatic selection of content to be preserved
          – Selection is based on the ICS and the extracted meta information
          – Raw content and RDF metadata are exported as WARC files
      • Search And Retrieval Application (SARA)
          – End user access to exported Web archives (incl. index)
          – One Application, Two Scenarios
      [Figure: application part of the architecture diagram: Crawler Cockpit, ARCOMEM Storage (HBase, H2RDF), WARC Export, WARC Files, SARA with SOLR Index, Broadcaster and Parliament applications]
  • 8. ARCOMEM System Configurations (1/3)
      • The ARCOMEM System is complex
          – Development aim was to be generic to serve as many Web archive goals as possible
          – Large number of phases and components
          – Complex handling and maintenance of the whole system
      • But not every user needs all functionalities
          – A subset is often enough
          – Phases can be used separately
  • 9. ARCOMEM System Configurations (2/3)
      Crawler Configurations
      • Heritrix + Online Analysis
          – Simple configuration
          – Completely Open Source
          – Runs on standard servers
          – Interesting for a broad group of organizations that do small to medium sized focused crawls
      • Large Scale Crawler + Online Analysis + Offline Analysis (+ Cross Crawl Analysis)
          – Complex high-throughput system
          – Requires big clusters or server farms
          – Analysis steps have to be well selected depending on user requirements, e.g. not every crawl requires video analysis
          – Mainly interesting for Service Providers (e.g. Internet Memory Foundation) or other organizations with large scale crawl requirements (e.g. National Libraries)
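The choice between the two crawler configurations on slide 9 can be pictured as a small decision function. This is only a sketch: `CrawlRequirements`, `select_components` and their fields are hypothetical names, and the rule that cross crawl or video analysis implies offline analysis is an assumption read off the slide's "+ Offline Analysis (+ Cross Crawl Analysis)" wording, not a documented ARCOMEM constraint.

```python
from dataclasses import dataclass


@dataclass
class CrawlRequirements:
    """Hypothetical description of what a crawl campaign needs."""
    large_scale: bool = False
    needs_offline_analysis: bool = False
    needs_cross_crawl: bool = False
    needs_video_analysis: bool = False


def select_components(req):
    """Return the component names to deploy for the given requirements."""
    crawler = "Large Scale Crawler" if req.large_scale else "Heritrix"
    components = [crawler, "Online Analysis"]
    # Assumption: cross crawl and video analysis build on the offline phase.
    if req.needs_offline_analysis or req.needs_cross_crawl or req.needs_video_analysis:
        components.append("Offline Analysis")
    if req.needs_video_analysis:
        components.append("Image/Video Analysis")
    if req.needs_cross_crawl:
        components.append("Cross Crawl Analysis")
    return components
```

With the default requirements this yields the simple Heritrix + Online Analysis setup; the heavier components are only added when explicitly requested, in line with the remark that not every crawl requires video analysis.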
  • 10. ARCOMEM System Configurations (3/3)
      Offline Analysis / Cross Crawl processing
      • Analysis modules can be used independently from the crawler to analyze and enrich existing Web crawls
      • Analysis steps have to be well selected depending on user requirements
      • Depending on the analysis, this requires big clusters or server farms
      • Interesting for Service Providers (e.g. Internet Memory Foundation, University Computing Centers)
      Applications / User Interfaces
      • Crawler Cockpit
          – Easy user interface for crawler control
          – Interesting for all crawler users
      • SARA
          – Generic tool for content exploration
          – Interesting for all end users of Web Archives
          – Interesting for Service Providers to deliver results
  • 11. Thank you
      Contact details
      Dr. Thomas Risse
      +49 511 762 17764
      risse@L3S.de
      www.arcomem.eu

Editor's notes

  1. Entity evolution: Obama as senator → president