Crawling the Web 
Fabrizio Celli 
Rome, 25th September 2014
Outline 
• Purpose of this Webinar 
• The Web Crawler 
• The AgroTagger 
• The AGRIS use case 
– What’s next? 
Purpose of this Webinar 
• SemaGrow is a project funded by the Seventh 
Framework Programme (FP7) of the European 
Commission 
• It develops algorithms, infrastructures and methodologies to 
cope with large data volumes and real-time 
performance requirements 
• http://www.semagrow.eu 
• One of the SemaGrow demonstrators is the 
“Web Crawler + AgroTagger” component, 
the subject of this Webinar 
The demonstrator 
• It is based on two command line applications 
(no user interface): 
– Web Crawler 
– AgroTagger 
• Goal: 
– discover resources on the Web 
– tag resources with AGROVOC URIs 
– keep only the resources about agriculture and 
interlink them with AGRIS 
What we expect from the Webinar 
• Comments, suggestions, opinions 
• Other real-world scenarios for the 
demonstrator 
• You can send your feedback to agris@fao.org 
THE WEB-CRAWLER 
Apache Nutch 
• http://nutch.apache.org/ 
• Highly extensible and scalable open source 
Web crawler 
• Configurable 
• Input: a list of pre-selected URLs 
• Output: a list of discovered URLs 
How it works 
• The user defines a list of Web sites (URLs) 
• Each URL is a ROOT 
• The user defines the “depth”: the maximum number of 
"hops" a discovered link may be away from the 
ROOT 
– Links very "far away" from the ROOT are unlikely 
to hold much information 
• Start to crawl the Web! 
Example: depth = 3 
ROOT (URL) 
depth = 1: URL_1_1, URL_1_2, …, URL_1_n 
depth = 2: URL_2_2_1, …, URL_2_2_m (links found on URL_1_2) 
depth = 3: URL_3_2_1_1, …, URL_3_2_1_p (links found on URL_2_2_1) 
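The depth rule can be sketched as a breadth-first traversal that stops following links once a page is `max_depth` hops from a ROOT. This is only an illustration of the idea, not how Apache Nutch is implemented: the `outlinks` dictionary is a hypothetical stand-in for fetching and parsing pages, and the URL names are made up to mirror the example.

```python
from collections import deque

def crawl(roots, outlinks, max_depth):
    """Breadth-first discovery of URLs up to max_depth hops from each ROOT.

    `outlinks` is a hypothetical stand-in for fetching and parsing a page:
    it maps a URL to the list of links found on it.
    """
    depth_of = {url: 0 for url in roots}    # URL -> hops from the nearest ROOT
    queue = deque(roots)
    while queue:
        url = queue.popleft()
        if depth_of[url] == max_depth:
            continue                         # do not follow links beyond the limit
        for link in outlinks.get(url, []):
            if link not in depth_of:         # first visit is the shortest path
                depth_of[link] = depth_of[url] + 1
                queue.append(link)
    return depth_of

# Toy link graph (illustrative names only).
graph = {
    "ROOT": ["URL_1_1", "URL_1_2"],
    "URL_1_2": ["URL_2_2_1"],
    "URL_2_2_1": ["URL_3_2_1_1"],
    "URL_3_2_1_1": ["URL_4_too_far"],
}
found = crawl(["ROOT"], graph, max_depth=3)
```

With `max_depth=3` the page at four hops is never discovered, which is the trade-off the depth parameter controls: a larger value finds more pages, but they are less likely to be relevant.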
The application 
• https://github.com/agrisfao/agrotagger/tree/master/crawler/application 
• Command line application 
• Provided with bash scripts to run in Linux 
environments 
• Example of usage: 
– depth = 5 
– output directory = work/output 
– directory with source URLs = work/urls 
crawler_exec.sh 5 work/output work/urls 
The output 
URL:: http:/ 
URL:: http://%20www.umabroad.umn.edu/students/healthsafety/emergency.php 
URL:: http://10-29-2013-tfic-luncheon.eventbrite.com/ 
URL:: http://1z8jbr3nz90837simd2d2fwoktj.wpengine.netdna-cdn.com/wp-content/uploads/2014/05/Nina-Hale-Inc-FactSheet.pdf 
URL:: http://2014.northernspark.org/ 
URL:: http://2014.northernspark.org/project/chimera 
outlink: toUrl: http://media2.northernspark.org/wp-includes/wlwmanifest.xml anchor: 
outlink: toUrl: http://2014.northernspark.org/partners/arts-culture-and-the-creative-economy-program-of-the-city-of-minneapolis anchor: 
outlink: toUrl: http://2014.northernspark.org/project/bell-museum-staff anchor: 
URL:: http://aaea.execinc.com/edibo/JobMarketCandidates 
outlink: toUrl: http://www.aaea.org/ anchor: AAEA 
outlink: toUrl: http://aaea.execinc.com/edibo/LoginHelp anchor: Create an Account / Need Help Logging In 
outlink: toUrl: http://www.aaea.org/about-aaea/aaea-sections anchor: AAEA Sections 
outlink: toUrl: http://www.aaea.org/about-aaea/aaea-committees anchor: AAEA Committees 
outlink: toUrl: http://www.aaea.org/about-aaea/awards-and-honors anchor: Awards and Honors 
... 
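A listing like this can be read back into a simple structure by grouping each `outlink:` line under the `URL::` line that precedes it. The sketch below assumes only the two line shapes visible in the listing; the function name `parse_crawl_dump` is hypothetical, and lines that were wrapped on the slide would need to be re-joined first.

```python
def parse_crawl_dump(lines):
    """Group 'outlink:' lines under the preceding 'URL::' line.

    Returns a dict mapping each crawled URL to a list of
    (target_url, anchor_text) tuples.
    """
    pages = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("URL:: "):
            current = line[len("URL:: "):]
            pages.setdefault(current, [])
        elif line.startswith("outlink: toUrl: ") and current is not None:
            rest = line[len("outlink: toUrl: "):]
            # The anchor text may be empty, so split on the label itself.
            target, _, anchor = rest.partition(" anchor:")
            pages[current].append((target.strip(), anchor.strip()))
    return pages

sample = """\
URL:: http://2014.northernspark.org/
URL:: http://aaea.execinc.com/edibo/JobMarketCandidates
outlink: toUrl: http://www.aaea.org/ anchor: AAEA
outlink: toUrl: http://aaea.execinc.com/edibo/LoginHelp anchor: Create an Account / Need Help Logging In
outlink: toUrl: http://media2.northernspark.org/wp-includes/wlwmanifest.xml anchor:
""".splitlines()
pages = parse_crawl_dump(sample)
```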
THE AGROTAGGER 
AGROVOC 
• FAO multilingual vocabulary 
• Over 32 000 concepts in up to 21 languages 
• Part of the LOD cloud 
• Extensively used by cataloguers for indexing 
data in agricultural information systems 
• http://202.45.139.84:10035/catalogs/fao/repositories/agrovoc 
The AgroTagger 
• At a high level of abstraction, AgroTagger is a 
keyword extractor that uses the AGROVOC 
thesaurus to extract keywords from the resources 
behind a list of URLs 
• More precisely, it extracts AGROVOC URIs rather than plain keywords 
• It is based on MAUI 
MAUI 
• Maui is named after the Polynesian 
mythological hero and demi-god, who would 
transform himself into different kinds of birds 
to perform many of his exploits 
• Maui automatically identifies the main topics in 
text documents 
• It builds on the Kea keyphrase extraction algorithm and the 
Weka machine learning toolkit (both named after birds native to New Zealand) 
• https://code.google.com/p/maui-indexer 
How it works 
• Input, one of: 
– A text file with a list of URLs 
– The output file of an Apache Nutch crawl 
• Output: 
– A set of triples 
<URL> dcterms:subject <AGROVOC_URI> 
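As an illustration of the output shape, the sketch below writes one triple per AGROVOC URI in N-Triples syntax, expanding `dcterms:subject` to its full IRI. The helper name `to_ntriples` and the concept identifier `c_0000` are placeholders, and the sketch assumes the URLs are already valid IRIs (no escaping is attempted).

```python
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"

def to_ntriples(resource_url, agrovoc_uris):
    """Serialize one dcterms:subject triple per AGROVOC URI as N-Triples lines."""
    return [
        f"<{resource_url}> <{DCTERMS_SUBJECT}> <{uri}> ."
        for uri in agrovoc_uris
    ]

lines = to_ntriples(
    "http://ageconsearch.umn.edu/",
    ["http://aims.fao.org/aos/agrovoc/c_0000"],  # placeholder concept id
)
```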
The algorithm 
• For each URL in the input file 
– Download the resource 
– Run the MAUI indexer trained with AGROVOC 
– Create a set of triples 
• Multi-threaded 
• Currently, MAUI is trained only for English 
– It can be trained for other languages that use the Latin 
alphabet 
– Other solutions are needed for Chinese, Arabic, 
Russian, etc. 
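The loop above might be sketched as follows, with a thread pool providing the multi-threading. This is not the AgroTagger code: `download` and `index` are hypothetical callables standing in for the HTTP fetch and the AGROVOC-trained MAUI indexer, and the stub vocabulary uses made-up identifiers.

```python
from concurrent.futures import ThreadPoolExecutor

def tag_all(urls, download, index, workers=4):
    """Download each URL, index its text, and collect (url, uri) pairs.

    `download` and `index` are hypothetical callables standing in for the
    HTTP fetch and the AGROVOC-trained MAUI indexer.
    """
    def process(url):
        try:
            text = download(url)
        except Exception:
            return []                        # skip resources that cannot be fetched
        return [(url, uri) for uri in index(text)]

    triples = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for result in pool.map(process, urls):   # results arrive in input order
            triples.extend(result)
    return triples

# Stubs for demonstration only.
pages = {"http://example.org/a": "maize yield", "http://example.org/b": "rice"}
vocabulary = {"maize": "agrovoc:maize", "rice": "agrovoc:rice"}  # made-up identifiers

def fake_download(url):
    return pages[url]                        # raises KeyError for unknown URLs

def fake_index(text):
    return [vocabulary[word] for word in text.split() if word in vocabulary]

triples = tag_all(
    ["http://example.org/a", "http://example.org/missing", "http://example.org/b"],
    fake_download,
    fake_index,
)
```

Threads suit this workload because most of the time per URL is presumably spent waiting on the network rather than computing.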
The application 
• https://github.com/agrisfao/agrotagger 
• Command line application 
• Written entirely in Java 
• Provided with bash scripts 
• Example of usage: 
– directory with source files = work/source 
– output directory = work/output 
– type of source files = nutchOutput 
– output format = rdfnt 
taggerDir.sh work/source work/output nutchOutput rdfnt 
The output 
[Figure: Input, AgroTagger, Output]
THE AGRIS USE CASE 
AGRIS 
• http://agris.fao.org 
• A collection of more than 7.8 million 
bibliographic references in agriculture 
• AGRIS records come with AGROVOC descriptors 
• An RDF-aware system 
– the AGRIS database is publicly exposed as RDF 
– AGROVOC is the backbone to interlink to external 
sources of information (statistics, distribution maps, 
country profiles, germplasm data…) 
SemaGrow demonstrator 
• The core idea is to harvest the Web 
– Input: pre-selected sources of information about 
agriculture 
• Crawl and assign AGROVOC URIs 
– Store triples in the “crawler” database 
• Define combinations between the 
“crawler” database and the AGRIS database 
• New widget in AGRIS mashup pages! 
[Figure: the new AGRIS widget, “Related resources available on the Web”, listing links (http://..., https://...)]
Current status 
• The Web Crawler gathers data from the Web 
• The AgroTagger computes triples that assign 
AGROVOC URIs to the discovered URLs 
• A “crawler” triplestore is ready for computations 
What’s next 
• Processing phase 
• Discover meaningful combinations between 
the AGRIS core database and “crawler” 
database 
• A triplestore of combinations will be set up 
and used by AGRIS to generate a widget in the 
mashup page 
• Evaluation of the quality of the widget 
• What does “meaningful combinations” mean? 
Naïve Algorithm 
• Just for testing purposes 
• Meaningful combinations = at least N 
common AGROVOC URIs 
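A minimal sketch of this rule, assuming each AGRIS record and each crawled resource has already been reduced to a set of AGROVOC URIs. The function name, the record identifiers and the single-letter URIs are illustrative only.

```python
def meaningful_combinations(agris, crawled, n):
    """Pair AGRIS records with Web resources sharing at least n AGROVOC URIs.

    Both arguments map an identifier to the set of AGROVOC URIs assigned to it.
    Returns (record_id, url, number_of_shared_uris) tuples.
    """
    pairs = []
    for record_id, record_uris in agris.items():
        for url, url_uris in crawled.items():
            shared = record_uris & url_uris      # set intersection
            if len(shared) >= n:
                pairs.append((record_id, url, len(shared)))
    return pairs

# Toy data: letters stand in for AGROVOC URIs.
agris = {"rec1": {"a", "b", "c", "d"}, "rec2": {"a", "x"}}
crawled = {"u1": {"a", "b", "c"}, "u2": {"a", "x", "y"}}

strict = meaningful_combinations(agris, crawled, n=3)
loose = meaningful_combinations(agris, crawled, n=2)
```

Raising `n` shrinks the result, which matches the intent of the threshold. The nested loop compares every record with every resource, so at the scale of AGRIS an index from URI to records would likely be needed instead.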
Example 
• http://ageconsearch.umn.edu/ 
• 101,000 distinct Web resources discovered by the 
Web Crawler (depth = 5) 
• ~1 million triples generated by the AgroTagger 
(“crawler” database) 
Number of AGRIS records | N (common AGROVOC URIs between AGRIS and the output of the Crawler) | Number of associations 
900 K | 3 | 17 MLN 
900 K | 4 | 3.2 MLN 
1 MLN | 5 | 0.6 MLN 
Your feedback 
• Comments, suggestions, other real-world 
scenarios 
• Ideas about the meaning of “meaningful 
combinations” 
• If you test the application, any comments 
that could improve it 
• Can the demonstrator help to overcome 
data problems? 
• You can send your feedback to agris@fao.org 
谢谢 
Gracias 
σας ευχαριστώ

#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

SemaGrow demonstrator: “Web Crawler + AgroTagger”

  • 1. Crawling the Web. Fabrizio Celli, Rome, 25th September 2014
  • 2. Outline
      • Purpose of this Webinar
      • The Web Crawler
      • The AgroTagger
      • The AGRIS use case
          – What’s next?
  • 3. Purpose of this Webinar
      • SemaGrow is a project funded by the Seventh Framework Programme (FP7) of the European Commission
      • Algorithms, infrastructures and methodologies to cope with large data volumes and real-time performance
      • http://www.semagrow.eu
      • One of the SemaGrow demonstrators is the component “Web Crawler + AgroTagger”, the subject of this Webinar
  • 4. The demonstrator
      • It is based on two command-line applications (no user interface):
          – Web Crawler
          – AgroTagger
      • Goal:
          – discover resources on the Web
          – tag resources with AGROVOC URIs
          – keep only resources about agriculture and interlink them to AGRIS
  • 5. What we expect from the Webinar
      • Comments, suggestions, opinions
      • Other real-world scenarios for the demonstrator
      • You can send your feedback to agris@fao.org
  • 7. Apache Nutch
      • http://nutch.apache.org/
      • Highly extensible and scalable open source Web crawler
      • Configurable
      • Input: a list of pre-selected URLs
      • Output: a list of discovered URLs
  • 8. How it works
      • The user defines a list of Web sites (URLs)
      • Each URL is a ROOT
      • The user defines the “depth”: the maximum number of "hops" a discovered link may be away from the ROOT
          – Links very "far away" from the ROOT are unlikely to hold much information
      • Start to crawl the Web!
  • 9. Example: depth = 3
      ROOT (URL)
      depth = 1: URL_1_1, URL_1_2, …, URL_1_n
      depth = 2: URL_2_2_1, …, URL_2_2_m
      depth = 3: URL_3_2_1_1, …, URL_3_2_1_p
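The depth rule on slides 8 and 9 can be illustrated with a small breadth-first walk. This is only a sketch of the idea, not how Apache Nutch is implemented: the link graph is an in-memory dictionary standing in for real pages, and `crawl` and `LINKS` are hypothetical names introduced here.

```python
from collections import deque


def crawl(link_graph, roots, max_depth):
    """Breadth-first walk: return {url: depth} for URLs within max_depth hops of a root."""
    depth_of = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        url = queue.popleft()
        if depth_of[url] >= max_depth:
            continue  # links found here would be too far from the ROOT
        for target in link_graph.get(url, []):
            if target not in depth_of:  # skip URLs already discovered
                depth_of[target] = depth_of[url] + 1
                queue.append(target)
    return depth_of


# Toy link graph (assumption): keys are pages, values are the links found on them.
LINKS = {
    "ROOT": ["URL_1_1", "URL_1_2"],
    "URL_1_2": ["URL_2_2_1"],
    "URL_2_2_1": ["URL_3_2_1_1"],
    "URL_3_2_1_1": ["URL_4_too_far"],
}

if __name__ == "__main__":
    for url, depth in crawl(LINKS, ["ROOT"], 3).items():
        print(depth, url)
```

With `max_depth = 3`, the page four hops away is never queued, which is the cut-off the slide describes.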
  • 10. The application
      • https://github.com/agrisfao/agrotagger/tree/master/crawler/application
      • Command-line application
      • Provided with bash scripts to run in Linux environments
      • Example of usage:
          – depth = 5
          – output directory = work/output
          – directory with source URLs = work/urls
        crawler_exec.sh 5 work/output work/urls
  • 11. The output
      URL:: http:/
      URL:: http://%20www.umabroad.umn.edu/students/healthsafety/emergency.php
      URL:: http://10-29-2013-tfic-luncheon.eventbrite.com/
      URL:: http://1z8jbr3nz90837simd2d2fwoktj.wpengine.netdna-cdn.com/wp-content/uploads/2014/05/Nina-Hale-Inc-FactSheet.pdf
      URL:: http://2014.northernspark.org/
      URL:: http://2014.northernspark.org/project/chimera
      outlink: toUrl: http://media2.northernspark.org/wp-includes/wlwmanifest.xml anchor:
      outlink: toUrl: http://2014.northernspark.org/partners/arts-culture-and-the-creative-economy-program-of-the-city-of-minneapolis anchor:
      outlink: toUrl: http://2014.northernspark.org/project/bell-museum-staff anchor:
      URL:: http://aaea.execinc.com/edibo/JobMarketCandidates
      outlink: toUrl: http://www.aaea.org/ anchor: AAEA
      outlink: toUrl: http://aaea.execinc.com/edibo/LoginHelp anchor: Create an Account / Need Help Logging In
      outlink: toUrl: http://www.aaea.org/about-aaea/aaea-sections anchor: AAEA Sections
      outlink: toUrl: http://www.aaea.org/about-aaea/aaea-committees anchor: AAEA Committees
      outlink: toUrl: http://www.aaea.org/about-aaea/awards-and-honors anchor: Awards and Honors
      ...
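A downstream tool has to turn the crawler output of slide 11 back into structured data. The sketch below assumes the layout visible on the slide, with one `URL::` or `outlink: toUrl: … anchor: …` record per line; that layout is inferred from the sample only, and `parse_dump` is a hypothetical helper, not part of the crawler application.

```python
import re

# One outlink record: target URL, then the (possibly empty) anchor text.
OUTLINK = re.compile(r"^outlink: toUrl: (\S+) anchor:\s*(.*)$")


def parse_dump(text):
    """Return {page_url: [(target_url, anchor_text), ...]} for a dump in the assumed layout."""
    pages = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("URL:: "):
            current = line[len("URL:: "):].strip()
            pages.setdefault(current, [])
            continue
        match = OUTLINK.match(line)
        if match and current is not None:
            pages[current].append((match.group(1), match.group(2).strip()))
    return pages


SAMPLE = """\
URL:: http://2014.northernspark.org/
URL:: http://aaea.execinc.com/edibo/JobMarketCandidates
outlink: toUrl: http://www.aaea.org/ anchor: AAEA
outlink: toUrl: http://aaea.execinc.com/edibo/LoginHelp anchor: Create an Account / Need Help Logging In
"""

if __name__ == "__main__":
    for page, links in parse_dump(SAMPLE).items():
        print(page, len(links))
```

Pages without outlinks are kept with an empty list, so discovered URLs can still be passed on for tagging.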
  • 13. AGROVOC
      • FAO multilingual vocabulary
      • Over 32 000 concepts in up to 21 languages
      • Part of the LOD cloud
      • Extensively used by cataloguers for indexing data in agricultural information systems
      • http://202.45.139.84:10035/catalogs/fao/repositories/agrovoc
  • 14. The AgroTagger
      • At a high level of abstraction, AgroTagger is a keyword extractor that uses the AGROVOC thesaurus to extract keywords from the resources behind a set of URLs
      • Or better… to extract URIs
      • It is based on MAUI
  • 15. MAUI
      • Maui is named after the Polynesian mythological hero and demigod, who would transform himself into different kinds of birds to perform many of his exploits
      • Maui automatically identifies main topics in text documents
      • It uses different kinds of algorithms (Kea and Weka, named after New Zealand native birds)
      • https://code.google.com/p/maui-indexer
  • 16. How it works
      • Input:
          – A text file with a list of URLs
          – The output file of an Apache Nutch crawler
      • Output:
          – A set of triples <URL> dcterms:subject <AGROVOC_URI>
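The triple pattern on slide 16 can be written out as N-Triples lines. This is a minimal sketch: `dcterms:subject` is expanded to its standard IRI, while the page URL and the AGROVOC concept identifier in the example are placeholders, and `to_ntriples` is a hypothetical helper rather than AgroTagger code.

```python
# Full IRI behind the dcterms:subject prefix used on the slide.
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"


def to_ntriples(url, agrovoc_uris):
    """Format one '<URL> dcterms:subject <AGROVOC_URI>' line per URI, in N-Triples syntax."""
    return [f"<{url}> <{DCTERMS_SUBJECT}> <{uri}> ." for uri in agrovoc_uris]


if __name__ == "__main__":
    # Placeholder values (assumptions): not a real page or concept identifier.
    page = "http://example.org/page"
    concepts = ["http://aims.fao.org/aos/agrovoc/c_0000"]
    print("\n".join(to_ntriples(page, concepts)))
```

Each line is a complete statement, so output from many pages can simply be concatenated into one file.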
  • 17. The algorithm
      • For each URL in the input file
          – Download the resource
          – Run the MAUI indexer trained with AGROVOC
          – Create a set of triples
      • Multi-threaded
      • Currently, MAUI is trained only for English
          – It can be trained in other languages that use Latin characters
          – Other solutions are needed for Chinese, Arabic, Russian, etc.
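The loop on slide 17 (download, index, emit triples, in several threads) might look roughly like the sketch below. Everything here is a stand-in: `fetch_text` returns canned text instead of downloading, `suggest_uris` is a naive keyword lookup rather than the MAUI indexer, and the vocabulary identifiers are invented placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-ins (assumptions): a fake Web and a two-term vocabulary.
FAKE_WEB = {
    "http://example.org/a": "Maize yields and soil quality",
    "http://example.org/b": "Board meeting minutes",
}
TOY_VOCAB = {"maize": "agrovoc:c_maize", "soil": "agrovoc:c_soil"}


def fetch_text(url):
    """Hypothetical download step: look the text up instead of using the network."""
    return FAKE_WEB.get(url, "")


def suggest_uris(text):
    """Hypothetical indexing step: naive substring match, NOT the MAUI algorithm."""
    lowered = text.lower()
    return sorted(uri for term, uri in TOY_VOCAB.items() if term in lowered)


def tag_url(url):
    """Run the three per-URL steps and return (subject, predicate, object) triples."""
    return [(url, "dcterms:subject", uri) for uri in suggest_uris(fetch_text(url))]


def tag_all(urls, workers=4):
    """Tag URLs in a thread pool; pool.map keeps results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [triple for triples in pool.map(tag_url, urls) for triple in triples]


if __name__ == "__main__":
    for triple in tag_all(list(FAKE_WEB)):
        print(triple)
```

Threads suit this workload because most of the time per URL is spent waiting on the download, not computing.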
  • 18. The application
      • https://github.com/agrisfao/agrotagger
      • Command-line application
      • Entirely based on Java
      • Provided with bash scripts
      • Example of usage:
          – directory with source files = work/source
          – output directory = work/output
          – type of source files = nutchOutput
          – output format = rdfnt
        taggerDir.sh /work/source /work/output nutchOutput rdfnt
  • 19. The output
      [Figure: Input → AgroTagger → Output]
  • 20. THE AGRIS USE CASE
  • 21. AGRIS
      • http://agris.fao.org
      • A collection of more than 7.8 million bibliographic references in agriculture
      • AGRIS records come with AGROVOC descriptors
      • An RDF-aware system
          – the AGRIS database is publicly exposed as RDF
          – AGROVOC is the backbone to interlink to external sources of information (statistics, distribution maps, country profiles, germplasm data…)
  • 23. SemaGrow demonstrator
      • The core idea is to harvest the Web
          – Input: pre-selected sources of information about agriculture
      • Crawl and assign AGROVOC URIs
          – Store triples in the “crawler” database
      • Definition of combinations between the “crawler” database and the AGRIS database
      • New widget in AGRIS mashup pages!
  • 24. Related resources available on the Web
      • http://...
      • https://...
  • 25. Current status
      • The Web Crawler gathers data from the Web
      • The AgroTagger computes triples to assign AGROVOC URIs to discovered URLs
      • A “crawler” triplestore is ready for computations
  • 26. What’s next
      • Processing phase
      • Discover meaningful combinations between the AGRIS core database and the “crawler” database
      • A triplestore of combinations will be set up and used by AGRIS to generate a widget in the mashup page
      • Evaluation of the quality of the widget
      • What does “meaningful combinations” mean?
  • 27. Naïve Algorithm
      • Just for testing purposes
      • Meaningful combinations = at least N common AGROVOC URIs
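The naïve rule on slide 27 amounts to a set intersection with a threshold. The sketch below compares every AGRIS record with every crawled URL; the record identifiers, URLs and concept codes are toy values, and `meaningful_combinations` is a hypothetical name, not the project's code.

```python
def meaningful_combinations(agris, crawled, n):
    """Return (record_id, url) pairs whose AGROVOC URI sets share at least n members."""
    pairs = []
    for record_id, record_uris in agris.items():
        for url, url_uris in crawled.items():
            if len(record_uris & url_uris) >= n:
                pairs.append((record_id, url))
    return pairs


# Toy data (assumption): concept codes are placeholders, not real AGROVOC URIs.
AGRIS = {
    "rec1": {"c_1", "c_2", "c_3", "c_4"},
    "rec2": {"c_1", "c_9"},
}
CRAWLED = {
    "http://example.org/a": {"c_1", "c_2", "c_3"},
    "http://example.org/b": {"c_9", "c_1"},
}

if __name__ == "__main__":
    for threshold in (2, 3):
        print(threshold, meaningful_combinations(AGRIS, CRAWLED, threshold))
```

This all-pairs comparison is quadratic, so at the scale of millions of records an index from each AGROVOC URI to the items carrying it would probably be needed; raising N shrinks the result, as the threshold is stricter.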
  • 28. Example
      • http://ageconsearch.umn.edu/
      • 101,000 distinct Web resources discovered by the Web Crawler (depth = 5)
      • ~1 million triples generated by the AgroTagger (“crawler” database)

      N = number of common AGROVOC URIs between AGRIS and the output of the Crawler

      Number of AGRIS records    N    Number of associations
      900 K                      3    17 MLN
      900 K                      4    3.2 MLN
      1 MLN                      5    0.6 MLN
  • 29. Your feedback
      • Comments, suggestions, other real-world scenarios
      • Ideas about the meaning of “meaningful combinations”
      • If you test the application, any comments that would help improve it
      • Can the demonstrator help to overcome data problems?
      • You can send your feedback to agris@fao.org
  • 30. 谢谢 Gracias σας ευχαριστώ

Editor's notes

  1. Kea: keyphrase extraction algorithm. Weka: machine learning toolkit.