ARCOMEM System Overview
(Beginner Level)
Thomas Risse
L3S Research Center
Hannover, Germany
risse@L3s.de
Overview
Beginner Level
• Approach of current crawlers
• What’s new in ARCOMEM?
• The ARCOMEM Approach
– Overview of the phases
– Overview of the processing levels
• Handling of Preservation in ARCOMEM
Advanced Level
• Overview of the system architecture
• Possible ARCOMEM System Configurations
Slide 2
Standard Crawlers
Seedlist
http://www.economist.com/node/21534849
http://www.ekathimerini.com/ekathi/comment
http://www.bbc.co.uk/news/world-europe-15589568
http://www.bbc.co.uk/search/news/?q=Greek%20crisis
http://www.guardian.co.uk/business/blog
http://www.kathimerini.gr/
http://twitter.com/#!/EU_Commission
Web Crawler (e.g. Heritrix, HTTrack)
1. A seedlist is specified as input for the crawler. This
specification might also contain a limited set of crawling
parameters, such as the crawl depth or the maximum crawl time.
Blacklists of domains can also be given to reduce spam.
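
The inputs of a standard crawl described above could be captured in a small configuration object. The following is a minimal sketch, not the configuration format of Heritrix or HTTrack: the class name `CrawlConfig`, its field names, the default values and the domain `spam.example` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class CrawlConfig:
    """Hypothetical container for the inputs of a standard crawl."""
    seedlist: list[str]
    max_depth: int = 2          # assumed default, not taken from the slides
    max_seconds: int = 3600     # assumed default, not taken from the slides
    blacklist: set[str] = field(default_factory=set)  # blocked domains

    def is_allowed(self, url: str) -> bool:
        """Return False if the URL's host is a blacklisted domain or one of its subdomains."""
        host = urlparse(url).hostname or ""
        return not any(host == d or host.endswith("." + d) for d in self.blacklist)


config = CrawlConfig(
    seedlist=[
        "http://www.kathimerini.gr/",
        "http://www.guardian.co.uk/business/blog",
    ],
    blacklist={"spam.example"},  # hypothetical spam domain
)
```

A crawler would consult `config.is_allowed(url)` before fetching a URL and stop once `max_depth` or `max_seconds` is exceeded.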
2. The Web crawler collects the content from the Web and
follows links up to the specified crawl depth.
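
The depth-limited link following in step 2 can be illustrated with a breadth-first traversal. This is a sketch under simplifying assumptions: the dictionary `LINKS` is a made-up in-memory link graph standing in for the Web, and the function `crawl` is hypothetical, not the algorithm of any particular crawler.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: page -> outgoing links.
LINKS = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "seed"],
    "c": ["d"],
    "d": [],
}


def crawl(seeds, max_depth, get_links):
    """Breadth-first crawl; returns the visited pages in fetch order."""
    visited = []
    seen = set(seeds)
    queue = deque((s, 0) for s in seeds)
    while queue:
        page, depth = queue.popleft()
        visited.append(page)          # stands in for "collect the content"
        if depth == max_depth:
            continue                  # do not follow links beyond the depth limit
        for link in get_links(page):
            if link not in seen:      # fetch every page at most once
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


pages = crawl(["seed"], max_depth=2, get_links=lambda p: LINKS.get(p, []))
```

With a depth limit of 2, page `d` is never fetched because it is three links away from the seed.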
3. The results of the crawl are stored directly in the Web
archive, typically in WARC or ARC format.
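
To give an impression of what a stored crawl result looks like, the sketch below serializes a single WARC/1.0 `response` record using only the Python standard library. It is deliberately simplified: the helper name `warc_response_record` is an assumption, only a handful of header fields are written, and production crawlers normally rely on a dedicated WARC library and usually compress each record.

```python
import uuid
from datetime import datetime, timezone


def warc_response_record(target_uri: str, http_payload: bytes) -> bytes:
    """Serialize one WARC/1.0 'response' record: header, blank line, block, two CRLFs."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {now}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + http_payload + b"\r\n\r\n"


# Made-up HTTP response used as the record block.
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>Hello</html>"
record = warc_response_record("http://www.kathimerini.gr/", payload)
```

A WARC file is a concatenation of such records, so appending `record` to a file opened in binary mode yields a small, uncompressed archive.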
4. Quality assurance is applied as the last step to ensure that
all information has been collected and that the pages are fully
stored in the archive. Missing URLs are handed back to the
Web crawler for re-crawling.
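
At its simplest, the quality assurance step amounts to comparing the URLs that should be in the archive with those that actually are. The sketch below assumes both lists are already available; the function name `missing_urls` and the `example.org` URLs are hypothetical.

```python
def missing_urls(expected, archived):
    """Return the expected URLs that are absent from the archive, in input order."""
    archived_set = set(archived)
    return [url for url in expected if url not in archived_set]


# Hypothetical example: a page together with the resources it embeds.
expected = [
    "http://example.org/",
    "http://example.org/style.css",
    "http://example.org/logo.png",
]
archived = ["http://example.org/", "http://example.org/logo.png"]

recrawl_queue = missing_urls(expected, archived)
```

The resulting `recrawl_queue` is what would be handed back to the crawler.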
What's new in ARCOMEM?
• Intelligent Crawler
– Semantically Enhanced Crawl Specification
– "Understands" the crawl intention
– Crawler guidance by using social and semantic information
– Stops crawling at irrelevant pages
– Two-stage crawling strategy: Web → ARCOMEM Storage → Archive
• Advanced Web Archive Enrichment
– Semantic Information: Entities, Topics, Opinions, Events (ETOE)
– Social Context: Interlinking of Web and Social Web, trustworthiness of
information and users
• Archivist and End User Support
– Archivist Tool
– Searching and browsing Web archives with different facets
Slide 7
ARCOMEM Phases: Crawl Specification
1. Intelligent Crawl Specification (ICS)
The ICS describes the intended crawl by specifying keywords,
entities, topics, etc., together with reference pages and
starting points. Reference pages match the intended crawl
content 100% and are used by the crawler to learn more about
the crawl.
Slide 8
Entities
Obama, Romney, Biden, Ryan, Republicans, Democrats, …
Keywords
US Election, CommitToMitt, Teaparty, Budget deficit, …
Reference Seedlist
https://twitter.com/whitehouse, https://twitter.com/blog44,
https://twitter.com/BarackObama, ...
Seedlist
http://news.bbc.co.uk/, http://telegraph.co.uk/, ...
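
The example specification above can be pictured as a small data structure. This is only an illustrative sketch: the class `IntelligentCrawlSpec` and its field names are assumptions and do not reflect the actual ICS format used by ARCOMEM; the values are the ones listed on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class IntelligentCrawlSpec:
    """Hypothetical in-memory form of an ICS; field names are illustrative."""
    entities: list[str] = field(default_factory=list)
    keywords: list[str] = field(default_factory=list)
    reference_seedlist: list[str] = field(default_factory=list)
    seedlist: list[str] = field(default_factory=list)

    def terms(self) -> set[str]:
        """Lower-cased entities and keywords, usable for simple text matching."""
        return {t.lower() for t in self.entities + self.keywords}


ics = IntelligentCrawlSpec(
    entities=["Obama", "Romney", "Biden", "Ryan", "Republicans", "Democrats"],
    keywords=["US Election", "CommitToMitt", "Teaparty", "Budget deficit"],
    reference_seedlist=[
        "https://twitter.com/whitehouse",
        "https://twitter.com/blog44",
        "https://twitter.com/BarackObama",
    ],
    seedlist=["http://news.bbc.co.uk/", "http://telegraph.co.uk/"],
)
```

Keeping the reference seedlist separate from the ordinary seedlist mirrors the distinction made on the slide between pages the crawler learns from and pages it merely starts from.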
ARCOMEM Phases: Crawling & Online Processing
Slide 9
2. Crawling & Online Processing
In this phase, Web pages and Social Web content are collected
and a first semantic analysis is applied. The analysis result is
used to guide the crawler by ranking extracted links by their
importance. All information is stored in the ARCOMEM Storage.
[Diagram: Internet; Crawling; Online Processing; ARCOMEM Storage; with the crawl specification example from Slide 8]
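
Ranking extracted links by importance can be sketched with a priority queue. The scoring below is a toy stand-in for the semantic analysis of the online processing phase: it simply counts how many crawl-specification terms occur in the text surrounding a link. The functions `relevance` and `rank_links` and the `example.org` URLs are hypothetical.

```python
import heapq


def relevance(text: str, terms: set[str]) -> int:
    """Toy score: number of crawl-specification terms that occur in the text."""
    lowered = text.lower()
    return sum(1 for term in terms if term in lowered)


def rank_links(links: dict[str, str], terms: set[str]) -> list[str]:
    """Order extracted links from most to least relevant link context."""
    # Scores are negated because heapq implements a min-heap.
    heap = [(-relevance(context, terms), url) for url, context in links.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


terms = {"obama", "romney", "budget deficit"}
extracted = {
    "http://example.org/sports": "Football results from the weekend",
    "http://example.org/debate": "Obama and Romney clash over the budget deficit",
    "http://example.org/profile": "A profile of Romney",
}
ordered = rank_links(extracted, terms)
```

A crawler guided this way would fetch the debate page first and could skip links whose score is zero, which corresponds to stopping at irrelevant pages.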
ARCOMEM Phases: Offline Processing
Slide 10
3. Offline Processing
The offline processing runs after the collection of content has
finished. The aim of this phase is to enrich the crawled pages
with meta-information extracted from the content. The
enrichment helps in selecting content for the final Web
archive. Furthermore, it eases searching and browsing within
the final Web archive.
[Diagram: Internet; Crawling; Online Processing; Offline Processing; ARCOMEM Storage; with the crawl specification example from Slide 8]
ARCOMEM Phases: Appraisal & Selection
Slide 11
4. Based on the information given in the Intelligent Crawl
Specification (ICS) and the enrichment of the content, the most
interesting content items are selected to be stored in the final
Web archive. The final Web archive consists of WARC files, which
include the crawled pages and, in RDF format, all enrichments
made during the offline processing.
[Diagram: Internet; Crawling; Online Processing; Offline Processing; ARCOMEM Storage; Appraisal; Selection; Archive; with the crawl specification example from Slide 8]
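
Selection and RDF export can be illustrated as a threshold filter followed by a serialization of the enrichments as N-Triples lines. Everything specific here is assumed for illustration: the relevance scores, the threshold, the predicate URI `http://example.org/vocab#mentions` and the helper names are hypothetical and not ARCOMEM's actual appraisal logic or vocabulary.

```python
def select_items(items, threshold):
    """Keep the items whose relevance score reaches the threshold."""
    return [item for item in items if item["score"] >= threshold]


def to_ntriples(item, predicate="http://example.org/vocab#mentions"):
    """Render an item's entity enrichments as N-Triples lines (no literal escaping)."""
    return [f'<{item["url"]}> <{predicate}> "{entity}" .' for entity in item["entities"]]


# Hypothetical enriched items as they might look after offline processing.
items = [
    {"url": "http://example.org/debate", "score": 0.9, "entities": ["Obama", "Romney"]},
    {"url": "http://example.org/sports", "score": 0.1, "entities": []},
]

selected = select_items(items, threshold=0.5)
triples = [line for item in selected for line in to_ntriples(item)]
```

The selected pages and their triples are the two kinds of content that end up side by side in the final WARC files.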
ARCOMEM Phases: Applications
Slide 12
[Diagram: Internet; Crawling; Online Processing; Offline Processing; ARCOMEM Storage; Appraisal; Selection; Archive; SARA for Broadcaster, Parliaments; with the crawl specification example from Slide 8]
5. The Search and Retrieval Application
(SARA) allows end users to search
and browse the archive in different
ways, e.g. based on keywords,
entities, topics, opinions.
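
Searching by facets such as entities, topics and opinions can be sketched with a small inverted index. This is not SARA's implementation; the functions `build_index` and `search`, the facet names and the two example documents are assumptions for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map (facet, value) pairs to the set of document ids carrying them."""
    index = defaultdict(set)
    for doc_id, facets in docs.items():
        for facet, values in facets.items():
            for value in values:
                index[(facet, value.lower())].add(doc_id)
    return index


def search(index, **filters):
    """Return the sorted ids of documents matching every facet=value filter."""
    results = None
    for facet, value in filters.items():
        hits = index.get((facet, value.lower()), set())
        results = hits if results is None else results & hits
    return sorted(results or [])


# Hypothetical enriched documents.
docs = {
    "doc1": {"entity": ["Obama"], "topic": ["Budget deficit"], "opinion": ["negative"]},
    "doc2": {"entity": ["Obama", "Romney"], "topic": ["US Election"], "opinion": ["positive"]},
}
index = build_index(docs)
hits = search(index, entity="Obama", opinion="positive")
```

Combining filters narrows the result set, which is the behaviour a user sees when ticking several facets in a browsing interface.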
ARCOMEM Phases: Cross Crawl Analytics
Slide 13
[Diagram: Internet; Crawling; Online Processing; Offline Processing; Cross Crawl Processing; ARCOMEM Storage; Appraisal; Selection; Archive; SARA for Broadcaster, Parliaments; with the crawl specification example from Slide 8]
6. Cross-crawl analysis allows content analytics across
archives. This makes it possible to combine Web archives into a
larger collection of documents or to study evolution over time,
for example the evolution of languages, opinions, etc.
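
Studying evolution over time across crawls can be as simple as counting how often an entity is mentioned in each crawl. The sketch below uses made-up data; the function `entity_trend` and the crawl labels are hypothetical, and the labels are assumed to sort chronologically as strings.

```python
from collections import Counter


def entity_trend(crawls, entity):
    """Count mentions of an entity per crawl, ordered by crawl label."""
    return {
        label: Counter(mentions)[entity]
        for label, mentions in sorted(crawls.items())
    }


# Hypothetical data: crawl label -> entity mentions extracted during enrichment.
crawls = {
    "2012-10": ["Obama", "Romney", "Obama"],
    "2012-09": ["Romney", "Obama"],
    "2012-11": ["Obama", "Obama", "Obama", "Romney"],
}
trend = entity_trend(crawls, "Obama")
```

The same pattern applies to topics or opinions: replace the list of entity mentions with the corresponding enrichment values of each crawl.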
Preservation in ARCOMEM
Content Preservation in ARCOMEM
• Selection and appraisal of Web and Social Web content
• Preparation of WARC files for preservation
• Provides access to preserved Web content
• Not part of ARCOMEM:
– Long-term preservation of WARC files
– Format handling, etc.
Semantic Preservation in ARCOMEM
• Extraction of Entities, Events, Topics, Opinions
• Enrichment with Linked Data
• Created WARC files contain
– Raw Web Data
– RDF triples of enrichment
• Preservation of Linked Data
– Not part of ARCOMEM
– See EU Projects: DIACHRON (IP), PRELIDA (CA)
Slide 14
THANK YOU
CONTACT DETAILS
Dr. Thomas Risse
+49 511 762 17764
risse@L3S.de
www.arcomem.eu
