DB Group @ UNIMO 
Fabio Benedetti Sonia Bergamaschi Laura Po 
Department of Engineering “Enzo Ferrari” 
University of Modena & Reggio Emilia 
LD4IE 2014 – Riva Del Garda, Italy 
Online Index Extraction from Linked Open Data Sources 
Dot. Fabio Benedetti 
Dip. Ing. “Enzo Ferrari” – University of Modena and Reggio Emilia
• Selection of a relevant LOD source 
• Statistical indexes 
• Architecture Overview 
• Performance Evaluation 
• LODeX & Conclusions 
Schmachtenberg, Max, Christian Bizer, and Heiko Paulheim. "Adoption of the Linked Data Best Practices in 
Different Topical Domains." The Semantic Web–ISWC 2014. Springer International Publishing, 2014. 245-260. 
                          2009                2014*
Domain                    Number  %           Number  %
Cross-domain              41      13.95%      41      4.04%
Geographic                31      10.54%      21      2.07%
Government                49      16.67%      183     18.05%
Life sciences             41      13.95%      83      8.19%
Media                     25      8.50%       22      2.17%
Publications              87      29.59%      96      9.47%
Social web                0       0.00%       520     51.28%
User-generated content    20      6.80%       48      4.73%
Total                     294                 1014

*Only 570 datasets belong to the LOD cloud; the remaining datasets do not contain ingoing/outgoing links to the LOD Cloud.
1. The documentation of the dataset
– The documentation can be poor or absent
– There is no standard way to provide the documentation
– Sometimes it is provided as an RDF file in XML format
2. Search features of existing catalogs (e.g. Datahub)
– The metadata contain little information
– The search engine uses no information about the structure of the dataset
3. Manual exploration of the dataset
– It requires a good knowledge of the SPARQL language
– It is a time-consuming task
To automatically extract a set of indexes that describe the structure of a LOD dataset
How to describe the dataset
LOD datasets can have different purposes and structures:
• Ontology/Vocabulary (OWL & RDFS constraints)
• Open Data (e.g. generated from an existing RDBMS)
The indexes should maximize the value of the information extracted from heterogeneous datasets
Online & Automatic extraction
• It does not require any additional information from the user
• It works with SPARQL endpoints
– We have to handle the performance issues of these datasets
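
As a rough illustration of querying a SPARQL endpoint online while guarding against slow responses, the sketch below sends a query over HTTP with a client-side timeout. It uses only the Python standard library. The endpoint URL, the function names, and the 30-second default are assumptions made for illustration, not details taken from the slides.

```python
import json
import urllib.parse
import urllib.request


def build_request(endpoint_url, query):
    """Build a GET request that passes the query in the `query` parameter."""
    url = endpoint_url + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(endpoint_url, query, timeout_s=30.0):
    """Return the result bindings, or None if the endpoint fails or times out."""
    request = build_request(endpoint_url, query)
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            payload = json.load(response)
    except (OSError, ValueError):
        # OSError covers network errors and timeouts; ValueError covers bad JSON.
        return None
    return payload.get("results", {}).get("bindings", [])


# Hypothetical endpoint URL, used only to show the shape of the call.
COUNT_TRIPLES = "SELECT (COUNT(*) AS ?t) WHERE { ?s ?p ?o }"
request = build_request("http://example.org/sparql", COUNT_TRIPLES)
```

Returning None instead of raising lets a caller treat a slow endpoint as a recoverable case and retry with a cheaper query.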
We can think of the entire set of RDF triples as partitioned into:
• Intensional Knowledge
• Extensional Knowledge
The Intensional Knowledge
• It contains the RDFS or OWL constraints of the ontology
• It represents the T-Box component of the knowledge base
The Extensional Knowledge
• It contains the real-world entities described in the dataset
• It represents the A-Box component of the knowledge base
• Its triples cover most of the dataset
Instantiated classes act as a bridge between the two types of knowledge
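
A minimal sketch of this partition on an in-memory list of triples follows. The rule used to recognise intensional triples (a fixed set of RDFS predicates plus `rdf:type` statements pointing at schema-level types) is a simplifying assumption, and the toy triples and function names are hypothetical.

```python
# Assumed heuristic: these predicates and types mark schema-level (T-Box) triples.
SCHEMA_PREDICATES = {"rdfs:domain", "rdfs:range", "rdfs:subClassOf", "rdfs:subPropertyOf"}
SCHEMA_TYPES = {"owl:Class", "rdfs:Class", "rdf:Property",
                "owl:ObjectProperty", "owl:DatatypeProperty"}


def is_intensional(triple):
    """True if the triple states an RDFS/OWL constraint rather than instance data."""
    _, p, o = triple
    return p in SCHEMA_PREDICATES or (p == "rdf:type" and o in SCHEMA_TYPES)


def partition(triples):
    """Split triples into (intensional, extensional) lists."""
    intensional = [t for t in triples if is_intensional(t)]
    extensional = [t for t in triples if not is_intensional(t)]
    return intensional, extensional


def instantiated_classes(extensional):
    """Classes that have at least one instance: the bridge between the two parts."""
    return {o for _, p, o in extensional if p == "rdf:type"}


# Hypothetical toy graph; literals are written with surrounding double quotes.
TRIPLES = [
    ("ex:Organization", "rdf:type", "owl:Class"),
    ("ex:Sector", "rdf:type", "owl:Class"),
    ("ex:sector", "rdf:type", "owl:ObjectProperty"),
    ("ex:sector", "rdfs:domain", "ex:Organization"),
    ("ex:sector", "rdfs:range", "ex:Sector"),
    ("ex:organization1", "rdf:type", "ex:Organization"),
    ("ex:sector1", "rdf:type", "ex:Sector"),
    ("ex:organization1", "ex:sector", "ex:sector1"),
    ("ex:sector1", "dc:name", '"Energy"'),
]

intensional, extensional = partition(TRIPLES)
bridge = instantiated_classes(extensional)
```

On this toy graph the five schema statements land in the intensional part, the four instance statements in the extensional part, and the bridge is the pair of classes `ex:Organization` and `ex:Sector`.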
[Figure: example RDF graph split into Intensional Knowledge and Extensional Knowledge. The intensional part shows the property ex:sector (typed as rdf:Property and owl:ObjectProperty, with rdfs:domain, rdfs:range and a label) and the classes ex:Sector and ex:Organization (typed as owl:Class). The extensional part shows the instances sector1, organization1 and organization2, the property ex:sector between them, and dc:name “Energy”. The instantiated classes ex:Sector and ex:Organization connect the two parts through rdf:type.]
The Statistical Indexes are grouped into three categories:
• Generic
• Intensional
• Extensional

Name  Description                    Structure                    Category
t     Number of Triples              Integer                      Generic
c     Number of Classes              Integer                      Generic
I     Number of Instances            Integer                      Generic
Cl    Class List                     List(name, n. instances)     Generic
Pl    Property List                  List(name, n. occurrences)   Generic
IK    Intensional Knowledge triples  List(s, p, o)                Intensional
Sc    Subject Class                  List(c, p, n. occurrences)   Extensional
SCl   Subject Class to literal       List(c, p, n. occurrences)   Extensional
Oc    Object Class                   List(c, p, n. occurrences)   Extensional
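
To make the Generic indexes concrete, the sketch below computes them over a small in-memory list of triples. The exact definitions are assumptions: classes are taken to be the distinct objects of `rdf:type`, and instances the distinct subjects of `rdf:type`. The toy data and the function name are hypothetical.

```python
from collections import Counter


def generic_indexes(triples):
    """Compute t, c, I, Cl and Pl under the assumptions stated above."""
    # Distinct (instance, class) pairs, so duplicate type statements count once.
    type_pairs = {(s, o) for s, p, o in triples if p == "rdf:type"}
    class_counts = Counter(cls for _, cls in type_pairs)
    property_counts = Counter(p for _, p, _ in triples)
    return {
        "t": len(triples),
        "c": len(class_counts),
        "I": len({s for s, _ in type_pairs}),
        "Cl": sorted(class_counts.items()),     # (class, number of instances)
        "Pl": sorted(property_counts.items()),  # (property, number of occurrences)
    }


# Hypothetical extensional triples; literals carry surrounding double quotes.
TRIPLES = [
    ("ex:organization1", "rdf:type", "ex:Organization"),
    ("ex:organization2", "rdf:type", "ex:Organization"),
    ("ex:sector1", "rdf:type", "ex:Sector"),
    ("ex:organization1", "ex:sector", "ex:sector1"),
    ("ex:organization2", "ex:sector", "ex:sector1"),
    ("ex:sector1", "dc:name", '"Energy"'),
]

indexes = generic_indexes(TRIPLES)
```

Against a live endpoint each of these values would come from a SPARQL query rather than a local scan; the local version only shows what is being counted.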
[Figure: three views of the example graph (the classes ex:Organization and ex:Sector, the instances organization1, organization2 and sector1, the property ex:sector, and dc:name “Energy”), highlighting the Subject Class, Subject Class to literal, and Object Class patterns.]

     Sc - Subject Class   SCl - Subject Class to literal   Oc - Object Class
S    ex:Organization      ex:Sector                        ex:Sector
P    ex:sector            dc:name                          ex:sector
n    2                    1                                1
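
The three Extensional indexes can be sketched on an in-memory graph as follows. The counting convention (one occurrence per matching triple) and the toy triples are assumptions, so the resulting counts depend on them and need not coincide with the numbers on the slide.

```python
from collections import Counter, defaultdict

RDF_TYPE = "rdf:type"


def is_literal(term):
    """Toy convention: literals are written with a leading double quote."""
    return term.startswith('"')


def extensional_indexes(triples):
    """Return (Sc, SCl, Oc) as Counters keyed by (class, property)."""
    classes_of = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            classes_of[s].add(o)

    sc, scl, oc = Counter(), Counter(), Counter()
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        # Subject side: literal objects feed SCl, resource objects feed Sc.
        target = scl if is_literal(o) else sc
        for cls in classes_of.get(s, ()):
            target[(cls, p)] += 1
        # Object side: only typed resources contribute to Oc.
        if not is_literal(o):
            for cls in classes_of.get(o, ()):
                oc[(cls, p)] += 1
    return sc, scl, oc


# Hypothetical toy graph in which both organizations point to the same sector.
TRIPLES = [
    ("ex:organization1", RDF_TYPE, "ex:Organization"),
    ("ex:organization2", RDF_TYPE, "ex:Organization"),
    ("ex:sector1", RDF_TYPE, "ex:Sector"),
    ("ex:organization1", "ex:sector", "ex:sector1"),
    ("ex:organization2", "ex:sector", "ex:sector1"),
    ("ex:sector1", "dc:name", '"Energy"'),
]

sc, scl, oc = extensional_indexes(TRIPLES)
```

Keying every index by a (class, property) pair is what allows a schema-level picture of the data to be built without keeping the individual instances.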
It takes as input a list of SPARQL endpoint URLs
It outputs a set of Statistical Indexes for each endpoint
• The IE process dynamically generates the SPARQL queries used to extract the Statistical Indexes
• It queries different datasets in parallel
• Partial results and the Statistical Indexes are stored in a NoSQL DB
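
The overall flow can be sketched as a small parallel driver. This is only an illustration under assumptions: it uses a thread pool rather than separate processes, a plain dictionary stands in for the NoSQL store, and `extract_indexes`, `run_extraction` and the example URLs are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def extract_indexes(endpoint_url):
    """Placeholder extractor; a real one would send SPARQL queries to the endpoint."""
    return {"endpoint": endpoint_url}


def run_extraction(endpoint_urls, extractor=extract_indexes, max_workers=4):
    """Run the extractor on every endpoint in parallel and collect the outcomes."""
    store = {}  # stand-in for the NoSQL DB: endpoint URL -> stored document
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extractor, url): url for url in endpoint_urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                store[url] = {"status": "done", "indexes": future.result()}
            except Exception as error:
                # One failing endpoint must not stop the others.
                store[url] = {"status": "failed", "error": str(error)}
    return store


def fake_extractor(url):
    """Test double that simulates one unresponsive endpoint."""
    if "broken" in url:
        raise TimeoutError("endpoint did not answer")
    return {"t": len(url)}


results = run_extraction(
    ["http://a.example/sparql", "http://broken.example/sparql"],
    extractor=fake_extractor,
)
```

Threads suit this workload because almost all of the time is spent waiting on remote endpoints rather than computing locally.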
General Statistics Extraction
• It uses 6 different queries to extract the indexes of this group
Intensional Knowledge Extraction
• The Intensional Knowledge is extracted through an iterative algorithm
• The algorithm traverses the graph starting from the instantiated classes
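
One way to picture such an iterative traversal is the sketch below, which works on an in-memory list of triples. It is not the authors' algorithm: the expansion rule (follow all outgoing triples of a schema node, plus incoming `rdfs:domain`, `rdfs:range`, `rdfs:subClassOf` and `rdfs:subPropertyOf` links) and the toy data are assumptions.

```python
# Assumed set of predicates through which a schema node can be reached backwards.
SCHEMA_LINKS = {"rdfs:domain", "rdfs:range", "rdfs:subClassOf", "rdfs:subPropertyOf"}


def extract_intensional(triples, instantiated_classes):
    """Collect schema triples reachable from the instantiated classes."""
    collected = set()
    visited = set()
    frontier = set(instantiated_classes)
    while frontier:
        visited |= frontier
        next_frontier = set()
        # Online, each pass would be one or more SPARQL queries; here it is a scan.
        for s, p, o in triples:
            outgoing = s in frontier
            incoming = o in frontier and p in SCHEMA_LINKS
            if outgoing or incoming:
                collected.add((s, p, o))
                next_frontier.update((s, o))
        # Only expand nodes that have not been expanded before, so the loop ends.
        frontier = next_frontier - visited
    return collected


# Hypothetical toy graph mixing schema and instance triples.
TRIPLES = [
    ("ex:Organization", "rdf:type", "owl:Class"),
    ("ex:Sector", "rdf:type", "owl:Class"),
    ("ex:sector", "rdf:type", "owl:ObjectProperty"),
    ("ex:sector", "rdfs:domain", "ex:Organization"),
    ("ex:sector", "rdfs:range", "ex:Sector"),
    ("ex:organization1", "rdf:type", "ex:Organization"),
    ("ex:sector1", "rdf:type", "ex:Sector"),
    ("ex:organization1", "ex:sector", "ex:sector1"),
    ("ex:sector1", "dc:name", '"Energy"'),
]

ik = extract_intensional(TRIPLES, {"ex:Organization", "ex:Sector"})
```

Because instance triples are never followed, the traversal stays inside the schema part of the graph and stops once no new schema node is discovered.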
Extensional Schema Extraction
• It uses different SPARQL aggregation queries to extract Sc, SCl and Oc
• It uses a technique called Pattern Strategy (PS) to complete the extraction
– It produces a larger number of less complex SPARQL queries
– It is used when the endpoint cannot answer an aggregation query and returns a timeout error
A complete list of the 24 query patterns is available at http://dbgroup.unimo.it/lodexQueries
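
The fallback idea can be sketched as follows: try one global aggregation query first and, if the endpoint times out, issue one smaller query per class. The query texts here are illustrative guesses, not the 24 patterns referenced above, and `EndpointTimeout`, `subject_class_index` and the injected `execute` callable are hypothetical names.

```python
RDF_TYPE_IRI = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# One query over the whole dataset (illustrative, for the Sc index).
AGGREGATE_SC = (
    "SELECT ?c ?p (COUNT(*) AS ?n) "
    "WHERE { ?s a ?c . ?s ?p ?o . "
    "FILTER(!isLiteral(?o) && ?p != <" + RDF_TYPE_IRI + ">) } "
    "GROUP BY ?c ?p"
)

# Smaller query restricted to a single class; braces are doubled for str.format.
PER_CLASS_SC = (
    "SELECT ?p (COUNT(*) AS ?n) "
    "WHERE {{ ?s a <{cls}> . ?s ?p ?o . "
    "FILTER(!isLiteral(?o) && ?p != <" + RDF_TYPE_IRI + ">) }} "
    "GROUP BY ?p"
)


class EndpointTimeout(Exception):
    """Raised by the (hypothetical) executor when the endpoint times out."""


def subject_class_index(execute, classes):
    """Return {(class, property): n}, falling back to per-class queries on timeout."""
    try:
        rows = execute(AGGREGATE_SC)
        return {(r["c"], r["p"]): int(r["n"]) for r in rows}
    except EndpointTimeout:
        index = {}
        for cls in classes:
            for r in execute(PER_CLASS_SC.format(cls=cls)):
                index[(cls, r["p"])] = int(r["n"])
        return index


def fake_execute(query):
    """Test double: the global aggregation times out, per-class queries answer."""
    if "GROUP BY ?c ?p" in query:
        raise EndpointTimeout("aggregation too expensive")
    if "<http://example.org/Organization>" in query:
        return [{"p": "http://example.org/sector", "n": "2"}]
    return []


sc = subject_class_index(
    fake_execute,
    ["http://example.org/Organization", "http://example.org/Sector"],
)
```

The trade-off is more round trips in exchange for queries that a slow endpoint is more likely to answer within its time limit.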
The test was performed on a list of 469 datasets

Reachable datasets                 244
SPARQL 1.1 compatible              137
Extraction completed               107
Extraction completed without PS    33
Total triples (107 datasets)       3.45 billion
Average extraction time            6.12 min
Total time (single process)        11.15 h
Total time (9 processes)           3.35 h

• More than 90% completed the extraction in less than 500 s
• The PS technique has proved its worth
– Completed extractions rose from 33 to 107
• The IE process is scalable
– There is a linear correlation between the number of triples and the extraction time
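
The reported figures can be cross-checked with a little arithmetic. The sketch below assumes that the decimal commas on the slide denote decimal fractions (so 11,15 h is read as 11.15 hours); the variable names are illustrative.

```python
# Figures as reported on the slide.
sparql11_compatible = 137
completed = 107
completed_without_ps = 33
total_single_h = 11.15    # assumed to mean 11.15 hours
total_parallel_h = 3.35   # assumed to mean 3.35 hours
processes = 9

# Share of SPARQL 1.1 compatible endpoints whose extraction completed.
completion_rate = completed / sparql11_compatible

# How many times more extractions complete once Pattern Strategy is used.
gain_from_ps = completed / completed_without_ps

# Speedup from running 9 processes, and the resulting parallel efficiency.
speedup = total_single_h / total_parallel_h
efficiency = speedup / processes

print(f"completion rate: {completion_rate:.2f}")
print(f"gain from PS:    {gain_from_ps:.2f}x")
print(f"speedup:         {speedup:.2f}x")
print(f"efficiency:      {efficiency:.2f}")
```

Under these assumptions the run with 9 processes is a little over three times faster than the single-process run, which suggests that the remote endpoints, not the local workers, bound the total time.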
LODeX is an online tool that shows a visual Schema Summary for a LOD source
• We used the statistical indexes to generate the Schema Summary.
• Users can interact with the Schema Summary of the dataset and focus on the information they are most interested in.
The tool is accessible at: www.dbgroup.unimo.it/lodex
Come and see the LODeX demo at the ISWC demo session!
F. Benedetti, S. Bergamaschi, and L. Po, “A visual summary for linked open data sources”, International Semantic Web Conference (Posters & Demos), 2014.
Conclusion
• We are able to extract valuable indexes from a LOD dataset by taking advantage of the definition of Intensional and Extensional Knowledge
• The extraction process has been tested on a large number of datasets, and its efficiency and effectiveness have been proven
Future Work
• Extend the VoID vocabulary with our descriptors
• We want to propose LODeX as an assistance tool for LOD portals
• We are extending LODeX to support automatic SPARQL query generation
Thanks for your attention! 

Más contenido relacionado

La actualidad más candente

Enlighten research staff_conference_2010
Enlighten research staff_conference_2010Enlighten research staff_conference_2010
Enlighten research staff_conference_2010
elizadams
 
IOTA @ NASIG 2011: Measuring the Quality of OpenURL Links
IOTA @ NASIG 2011: Measuring the Quality of OpenURL LinksIOTA @ NASIG 2011: Measuring the Quality of OpenURL Links
IOTA @ NASIG 2011: Measuring the Quality of OpenURL Links
Rafal Kasprowski
 

La actualidad más candente (7)

Philippe Langlais - 2017 - Users and Data: The Two Neglected Children of Bili...
Philippe Langlais - 2017 - Users and Data: The Two Neglected Children of Bili...Philippe Langlais - 2017 - Users and Data: The Two Neglected Children of Bili...
Philippe Langlais - 2017 - Users and Data: The Two Neglected Children of Bili...
 
Enlighten research staff_conference_2010
Enlighten research staff_conference_2010Enlighten research staff_conference_2010
Enlighten research staff_conference_2010
 
EDS Web-scale Panel (Preprint), 2012 Charleston Conference
EDS Web-scale Panel (Preprint), 2012 Charleston ConferenceEDS Web-scale Panel (Preprint), 2012 Charleston Conference
EDS Web-scale Panel (Preprint), 2012 Charleston Conference
 
IOTA @ NASIG 2011: Measuring the Quality of OpenURL Links
IOTA @ NASIG 2011: Measuring the Quality of OpenURL LinksIOTA @ NASIG 2011: Measuring the Quality of OpenURL Links
IOTA @ NASIG 2011: Measuring the Quality of OpenURL Links
 
Data Wrangling Week 4
Data Wrangling Week 4Data Wrangling Week 4
Data Wrangling Week 4
 
Alan Cope (De Montfort University) – EXPLORER (create workflows and processes...
Alan Cope (De Montfort University) – EXPLORER (create workflows and processes...Alan Cope (De Montfort University) – EXPLORER (create workflows and processes...
Alan Cope (De Montfort University) – EXPLORER (create workflows and processes...
 
bonino
boninobonino
bonino
 

Destacado

Issues in Online Education
Issues in Online EducationIssues in Online Education
Issues in Online Education
Mike KEPPELL
 
The advantages and disadvantages of online learning
The advantages and disadvantages of online learningThe advantages and disadvantages of online learning
The advantages and disadvantages of online learning
Janna8482
 
Tutorial of Sentiment Analysis
Tutorial of Sentiment AnalysisTutorial of Sentiment Analysis
Tutorial of Sentiment Analysis
Fabio Benedetti
 

Destacado (14)

Visual Querying LOD sources with LODeX
 Visual Querying LOD sources with LODeX Visual Querying LOD sources with LODeX
Visual Querying LOD sources with LODeX
 
Introduction to British Education Index
Introduction to British Education IndexIntroduction to British Education Index
Introduction to British Education Index
 
The Competency Convergence: Core Skills and Knowledge of Library and Museum P...
The Competency Convergence: Core Skills and Knowledge of Library and Museum P...The Competency Convergence: Core Skills and Knowledge of Library and Museum P...
The Competency Convergence: Core Skills and Knowledge of Library and Museum P...
 
Linked Data and Knowledge Graphs -- Constructing and Understanding Knowledge ...
Linked Data and Knowledge Graphs -- Constructing and Understanding Knowledge ...Linked Data and Knowledge Graphs -- Constructing and Understanding Knowledge ...
Linked Data and Knowledge Graphs -- Constructing and Understanding Knowledge ...
 
The British education system
The British education systemThe British education system
The British education system
 
Natural Language Access to Data via Deduction
Natural Language Access to Data via DeductionNatural Language Access to Data via Deduction
Natural Language Access to Data via Deduction
 
Issues in Online Education
Issues in Online EducationIssues in Online Education
Issues in Online Education
 
Linked Open Data Principles, Technologies and Examples
Linked Open Data Principles, Technologies and ExamplesLinked Open Data Principles, Technologies and Examples
Linked Open Data Principles, Technologies and Examples
 
The advantages and disadvantages of online learning
The advantages and disadvantages of online learningThe advantages and disadvantages of online learning
The advantages and disadvantages of online learning
 
Online education vs regular education
Online education vs regular educationOnline education vs regular education
Online education vs regular education
 
Tutorial of Sentiment Analysis
Tutorial of Sentiment AnalysisTutorial of Sentiment Analysis
Tutorial of Sentiment Analysis
 
Create icons in PowerPoint
Create icons in PowerPointCreate icons in PowerPoint
Create icons in PowerPoint
 
10 Tips for Making Beautiful Slideshow Presentations by www.visuali.se
10 Tips for Making Beautiful Slideshow Presentations by www.visuali.se10 Tips for Making Beautiful Slideshow Presentations by www.visuali.se
10 Tips for Making Beautiful Slideshow Presentations by www.visuali.se
 
8 Tips for an Awesome Powerpoint Presentation
8 Tips for an Awesome Powerpoint Presentation8 Tips for an Awesome Powerpoint Presentation
8 Tips for an Awesome Powerpoint Presentation
 

Similar a Online Index Extraction from Linked Open Data Sources

Engaging Information Professionals in the Process of Authoritative Interlinki...
Engaging Information Professionals in the Process of Authoritative Interlinki...Engaging Information Professionals in the Process of Authoritative Interlinki...
Engaging Information Professionals in the Process of Authoritative Interlinki...
Lucy McKenna
 
CNI fall 2009 enhanced publications john_doove-SURFfoundation
CNI fall 2009 enhanced publications john_doove-SURFfoundationCNI fall 2009 enhanced publications john_doove-SURFfoundation
CNI fall 2009 enhanced publications john_doove-SURFfoundation
John Doove
 
“Publishing and Consuming Linked Data. (Lessons learnt when using LOD in an a...
“Publishing and Consuming Linked Data. (Lessons learnt when using LOD in an a...“Publishing and Consuming Linked Data. (Lessons learnt when using LOD in an a...
“Publishing and Consuming Linked Data. (Lessons learnt when using LOD in an a...
Marta Villegas
 

Similar a Online Index Extraction from Linked Open Data Sources (20)

Linked Open Graph: browsing multiple SPARQL entry points to build your own LO...
Linked Open Graph: browsing multiple SPARQL entry points to build your own LO...Linked Open Graph: browsing multiple SPARQL entry points to build your own LO...
Linked Open Graph: browsing multiple SPARQL entry points to build your own LO...
 
Linked Open Data Visualization
Linked Open Data VisualizationLinked Open Data Visualization
Linked Open Data Visualization
 
Engaging Information Professionals in the Process of Authoritative Interlinki...
Engaging Information Professionals in the Process of Authoritative Interlinki...Engaging Information Professionals in the Process of Authoritative Interlinki...
Engaging Information Professionals in the Process of Authoritative Interlinki...
 
Sparling and Cohen "BIBFRAME Implementation at the University of Alberta Libr...
Sparling and Cohen "BIBFRAME Implementation at the University of Alberta Libr...Sparling and Cohen "BIBFRAME Implementation at the University of Alberta Libr...
Sparling and Cohen "BIBFRAME Implementation at the University of Alberta Libr...
 
Linked Data at the OU - the story so far
Linked Data at the OU - the story so farLinked Data at the OU - the story so far
Linked Data at the OU - the story so far
 
The Experimental Project of DOI Registration for Research Data at Japan Link...
The Experimental Project of DOI Registration for Research Data at Japan Link...The Experimental Project of DOI Registration for Research Data at Japan Link...
The Experimental Project of DOI Registration for Research Data at Japan Link...
 
CNI fall 2009 enhanced publications john_doove-SURFfoundation
CNI fall 2009 enhanced publications john_doove-SURFfoundationCNI fall 2009 enhanced publications john_doove-SURFfoundation
CNI fall 2009 enhanced publications john_doove-SURFfoundation
 
Enhanced publications: an introduction – Arjan Hogenaar, DANS
Enhanced publications: an introduction – Arjan Hogenaar, DANSEnhanced publications: an introduction – Arjan Hogenaar, DANS
Enhanced publications: an introduction – Arjan Hogenaar, DANS
 
Integrating an electronic lab notebook with a data repository; American Chemi...
Integrating an electronic lab notebook with a data repository; American Chemi...Integrating an electronic lab notebook with a data repository; American Chemi...
Integrating an electronic lab notebook with a data repository; American Chemi...
 
Elns and repositories, American Chemical Society, Dallas, March 2014
Elns and repositories, American Chemical Society, Dallas, March 2014Elns and repositories, American Chemical Society, Dallas, March 2014
Elns and repositories, American Chemical Society, Dallas, March 2014
 
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
 
“Publishing and Consuming Linked Data. (Lessons learnt when using LOD in an a...
“Publishing and Consuming Linked Data. (Lessons learnt when using LOD in an a...“Publishing and Consuming Linked Data. (Lessons learnt when using LOD in an a...
“Publishing and Consuming Linked Data. (Lessons learnt when using LOD in an a...
 
DSpace-CRIS_An open source solution for Research_EDU15
DSpace-CRIS_An open source solution for Research_EDU15DSpace-CRIS_An open source solution for Research_EDU15
DSpace-CRIS_An open source solution for Research_EDU15
 
Semantic Similarity and Selection of Resources Published According to Linked ...
Semantic Similarity and Selection of Resources Published According to Linked ...Semantic Similarity and Selection of Resources Published According to Linked ...
Semantic Similarity and Selection of Resources Published According to Linked ...
 
Methodology for the publication of Linked Open Data from small and medium siz...
Methodology for the publication of Linked Open Data from small and medium siz...Methodology for the publication of Linked Open Data from small and medium siz...
Methodology for the publication of Linked Open Data from small and medium siz...
 
Camp 4-data workshop presentation
Camp 4-data workshop presentationCamp 4-data workshop presentation
Camp 4-data workshop presentation
 
NISO Webinar: Library Linked Data: From Vision to Reality
NISO Webinar: Library Linked Data: From Vision to RealityNISO Webinar: Library Linked Data: From Vision to Reality
NISO Webinar: Library Linked Data: From Vision to Reality
 
Retooling a Research Data Repository: data.depositar.io
Retooling a Research Data Repository: data.depositar.ioRetooling a Research Data Repository: data.depositar.io
Retooling a Research Data Repository: data.depositar.io
 
Describing Theses and Dissertations Using Schema.org
Describing Theses and Dissertations Using Schema.orgDescribing Theses and Dissertations Using Schema.org
Describing Theses and Dissertations Using Schema.org
 
Building Linked Data Applications
Building Linked Data ApplicationsBuilding Linked Data Applications
Building Linked Data Applications
 

Último

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Último (20)

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 

Online Index Extraction from Linked Open Data Sources

  • 1. DB Group @ UNIMO Fabio Benedetti Sonia Bergamaschi Laura Po Department of Engineering “Enzo Ferrari” University of Modena & Reggio Emilia LD4IE 2014 – Riva Del Garda, Italy Online Index Extraction from Linked Open Data Sources Dot. Fabio Benedetti Dip. Ing. “Enzo Ferrari” – University of Modena e Reggio Emilia 1
  • 2. DB Group @ UNIMO 2 • Selection of a relevant LOD source • Statistical indexes • Architecture Overview • Performance Evaluation • LODeX & Conclusions LD4IE 2014 – Riva Del Garda, Italy Dot. Fabio Benedetti Dip. Ing. “Enzo Ferrari” – University of Modena e Reggio Emilia Online Index Extraction from Linked Open Data Sources
  • 3. DB Group @ UNIMO 3 Schmachtenberg, Max, Christian Bizer, and Heiko Paulheim. "Adoption of the Linked Data Best Practices in Different Topical Domains." The Semantic Web–ISWC 2014. Springer International Publishing, 2014. 245-260. LD4IE 2014 – Riva Del Garda, Italy Dot. Fabio Benedetti Dip. Ing. “Enzo Ferrari” – University of Modena e Reggio Emilia Online Index Extraction from Linked Open Data Sources
  • 4. DB Group @ UNIMO 4 2009 2014* Domain Number % Number % Cross-domain 41 13.95% 41 4.04% Geographic 31 10.54% 21 2.07% Government 49 16.67% 183 18.05% Life sciences 41 13.95% 83 8.19% Media 25 8.50% 22 2.17% Publications 87 29.59% 96 9.47% Social web 0 0.00% 520 51.28% User-generated content 20 6.80% 48 4.73% Total 294 1014 *Only 570 datasets belong to the LOD cloud, the remaining datasets do not contain ingoing/outgoing links to the LOD Cloud. LD4IE 2014 – Riva Del Garda, Italy Dot. Fabio Benedetti Dip. Ing. “Enzo Ferrari” – University of Modena e Reggio Emilia Online Index Extraction from Linked Open Data Sources 2009 Domain Cross-domain Geographic Government Life sciences Media Publications Social web 2014
  • 5. DB Group @ UNIMO 5 1. The documentation of the dataset – The documentation can be poor or absent – There are no standard to provide the documentation – Sometime it is provided as an RDF file in XML format 2. Searching features of existing catalogs (i.e. Datahub) – The metadata contain poor information – None information about the structure of the dataset is used by the search engine 3. The manual exploration of the Dataset – It is required a good knowledge of SPARQL language – It is a time consuming task LD4IE 2014 – Riva Del Garda, Italy Dot. Fabio Benedetti Dip. Ing. “Enzo Ferrari” – University of Modena e Reggio Emilia Online Index Extraction from Linked Open Data Sources
  • 6. DB Group @ UNIMO 6 To automatically extract a set of indexes able to describe the structure of a LOD dataset How to describe the dataset LOD datasets can have different purpose and structure: • Ontology/Vocabulary (OWL & RDFS constraints) • Open Data (i.e. generated from existing RDBMS) The indexes should maximize the value of the information extraction from heterogeneous datasets Online & Automatic extraction • It does not require any additional information by the user • It works with SPARQL endpoints – We have to handle the bad performance issues of these Datasets LD4IE 2014 – Riva Del Garda, Italy Dot. Fabio Benedetti Dip. Ing. “Enzo Ferrari” – University of Modena e Reggio Emilia Online Index Extraction from Linked Open Data Sources
  • 7. DB Group @ UNIMO 7 We can think the entire set of RDF triples partitioned between: • Intensional Knowledge • Extensional Knowledge The Intensional knowledge • It contains the RDFS or OWL constraints of the Ontology • It represents the T-Box components of the knowledge base The Extensional knowledge • It contains the entities of the real word described in the dataset • It represents the A-Box components of the knowledge base • its triples cover most of the dataset Instantiated classes act as a bridge between the two type of knowledge LD4IE 2014 – Riva Del Garda, Italy Dot. Fabio Benedetti Dip. Ing. “Enzo Ferrari” – University of Modena e Reggio Emilia Online Index Extraction from Linked Open Data Sources
  • 8. DB Group @ UNIMO 8 ex:sector rdf:label rdf:Property owl:Class rdfs:domain rdf:type rdf:type ex:Sector ex:Organization sector rdf:type rdf:type rdf:type ex:sector Intensional Knowledge Instantiated LD4IE 2014 – Riva Del Garda, Italy Dot. Fabio Benedetti Dip. Ing. “Enzo Ferrari” – University of Modena e Reggio Emilia Online Index Extraction from Linked Open Data Sources rdfs:range rdf:label rdf:type owl:ObjectProperty rdf:type sector1 organization1 ex:sector dc:name “Energy” organization2 Classes Extensional Knowledge
  • 9. DB Group @ UNIMO 9 The Statistical Indexes are grouped in three categories: • Generic • Intensional • Extensional Name Description Structure Category t Number of Triples Integer LD4IE 2014 – Riva Del Garda, Italy Dot. Fabio Benedetti Dip. Ing. “Enzo Ferrari” – University of Modena e Reggio Emilia Online Index Extraction from Linked Open Data Sources Generic c Number of Classes Integer I Number of Instances Integer Cl Class List List(name, n. Instances) Pl Property List List(name, n. occurrence) IK Intensional K. triples List(s, p, o) Intensional Sc Subject Class List(c, p, n. occurrence) SCl Subject Class to literal List(c, p, n. occurrence) Extensional Oc Object Class List(c, p, n. occurrence)
  • 10. [Figure: the example graph (ex:Organization, ex:Sector, organization1, organization2, sector1, dc:name “Energy”) shown three times, highlighting the Subject Class, Subject Class to literal and Object Class patterns.]
    Index                           S                P          n
    Sc  - Subject Class             ex:Organization  ex:sector  2
    SCl - Subject Class to literal  ex:Sector        dc:name    1
    Oc  - Object Class              ex:Sector        ex:sector  1
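    The three extensional indexes can be illustrated with a short Python sketch over an in-memory list of triples. The triples below are an assumed reading of the example graph (both organizations linked to sector1), and the counting rule (distinct subject instances for Sc and SCl, distinct object instances for Oc) is an assumption chosen because it yields the counts 2, 1 and 1 of the example; the slides do not spell the rule out.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"

# Assumed reading of the example graph.
TRIPLES = [
    ("organization1", RDF_TYPE, "ex:Organization"),
    ("organization2", RDF_TYPE, "ex:Organization"),
    ("sector1", RDF_TYPE, "ex:Sector"),
    ("organization1", "ex:sector", "sector1"),
    ("organization2", "ex:sector", "sector1"),
    ("sector1", "dc:name", '"Energy"'),
]


def is_literal(term):
    # Convention of this sketch only: literals are written with surrounding quotes.
    return term.startswith('"')


def extensional_indexes(triples):
    """Return (Sc, SCl, Oc) as sorted lists of (class, property, count)."""
    classes_of = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            classes_of[s].add(o)

    sc, scl, oc = defaultdict(set), defaultdict(set), defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        # Subject side: the class of s, split by literal vs. non-literal object.
        for cls in classes_of.get(s, ()):
            (scl if is_literal(o) else sc)[(cls, p)].add(s)
        # Object side: the class of o, when o is a typed instance.
        if not is_literal(o):
            for cls in classes_of.get(o, ()):
                oc[(cls, p)].add(o)

    def count(index):
        return sorted((c, p, len(members)) for (c, p), members in index.items())

    return count(sc), count(scl), count(oc)


sc, scl, oc = extensional_indexes(TRIPLES)
```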
  • 11. Input: a list of URLs of SPARQL endpoints.
    Output: a set of Statistical Indexes for each endpoint.
    • The index extraction (IE) process dynamically generates the SPARQL queries used to extract the Statistical Indexes
    • It works in parallel, querying different datasets at the same time
    • Partial results and the Statistical Indexes are stored in a NoSQL DB
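    The overall flow (many endpoints in, one set of indexes per endpoint out, processed in parallel, results persisted) might be sketched as follows. Everything here is hypothetical: `run_extraction`, `fake_extract` and the example URLs are invented for the illustration, a plain dict stands in for the NoSQL DB, and a thread pool is used where the slides only say that the process works in parallel.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_extraction(endpoint_urls, extract, store, max_workers=4):
    """Run extract(url) for every endpoint in parallel and record the outcome in store."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract, url): url for url in endpoint_urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                store[url] = {"status": "completed", "indexes": future.result()}
            except Exception as error:
                # A failing endpoint must not stop the other extractions.
                store[url] = {"status": "failed", "error": str(error)}
    return store


def fake_extract(url):
    # Stand-in for the real extraction: no network access, just a dummy index.
    if "broken" in url:
        raise TimeoutError("endpoint did not answer")
    return {"t": len(url)}


GOOD = "http://example.org/a/sparql"
BAD = "http://example.org/broken/sparql"
store = run_extraction([GOOD, BAD], fake_extract, {})
```

    Keeping a per-endpoint status next to the indexes is one simple way to store partial results, so that an interrupted run can be inspected or resumed.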
  • 12. General Statistic Extraction
    • It uses 6 different queries to extract the indexes of this group
    Intensional Knowledge Extraction
    • The Intensional Knowledge is extracted through an iterative algorithm
    • The algorithm traverses the graph starting from the instantiated classes
    Extensional Schema Extraction
    • It uses different SPARQL aggregation queries to extract Sc, SCl and Oc
    • It uses a technique called Pattern Strategy (PS) to complete the extraction
      – PS replaces one complex query with a larger number of less complex SPARQL queries
      – It is used when the endpoint cannot answer an aggregation query and returns a timeout error
    A complete list of the 24 query patterns is available at http://dbgroup.unimo.it/lodexQueries
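    The idea behind the Pattern Strategy, as described above, can be sketched in Python: try one aggregation query first and, if the endpoint times out, fall back to many simpler queries (here, one per class). The SPARQL strings, the function names and the `run_query` callable are assumptions written for this illustration; they are not taken from the 24 published query patterns.

```python
# One query that computes the whole Subject Class index at once.
AGGREGATE_QUERY = """
SELECT ?c ?p (COUNT(DISTINCT ?s) AS ?n)
WHERE { ?s a ?c ; ?p ?o . FILTER(!isLiteral(?o)) }
GROUP BY ?c ?p
"""

# Simpler queries used by the fallback: list the classes, then aggregate per class.
CLASSES_QUERY = "SELECT DISTINCT ?c WHERE { ?s a ?c }"
PER_CLASS_TEMPLATE = """
SELECT ?p (COUNT(DISTINCT ?s) AS ?n)
WHERE {{ ?s a <{cls}> ; ?p ?o . FILTER(!isLiteral(?o)) }}
GROUP BY ?p
"""


def extract_subject_class(run_query):
    """run_query(sparql) returns a list of dict rows or raises TimeoutError."""
    try:
        rows = run_query(AGGREGATE_QUERY)
        return sorted((r["c"], r["p"], int(r["n"])) for r in rows)
    except TimeoutError:
        pass  # fall back to the larger number of less complex queries
    results = []
    for row in run_query(CLASSES_QUERY):
        cls = row["c"]
        for r in run_query(PER_CLASS_TEMPLATE.format(cls=cls)):
            results.append((cls, r["p"], int(r["n"])))
    return sorted(results)


def slow_endpoint(sparql):
    # Stand-in endpoint that times out on the single large aggregation.
    if "GROUP BY ?c ?p" in sparql:
        raise TimeoutError
    if sparql == CLASSES_QUERY:
        return [{"c": "http://example.org/Organization"}]
    return [{"p": "http://example.org/sector", "n": "2"}]


sc = extract_subject_class(slow_endpoint)
```

    The trade-off is the one the slide names: the fallback issues more requests, but each one is cheap enough for a weak endpoint to answer.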
  • 14. The test was performed on a list of 469 datasets.
    Reachable datasets                   244
    SPARQL 1.1 compatible                137
    Extraction completed                 107
    Extraction completed without PS      33
    Total triples (107 datasets)         3.45 b
    AVG extraction time                  6.12 m
    Total time (single process)          11.15 h
    Total time (9 processes)             3.35 h
    • More than 90% completed the extraction in less than 500 s
    • The PS technique has proved its worth
      – completed extractions rose from 33 to 107
    • The IE process is scalable
      – linear correlation between number of triples and time
  • 15. LODeX is an online tool that shows a visual Schema Summary for a LOD source.
    • We made use of the statistical indexes for the generation of the Schema Summary.
    • Users can interact with the Schema Summary of the dataset and focus on the information they are most interested in.
    The tool is accessible at: www.dbgroup.unimo.it/lodex
    Come and attend the LODeX demo at the ISWC demo session!
    F. Benedetti, S. Bergamaschi, and L. Po, “A visual summary for linked open data sources”, International Semantic Web Conference (Posters & Demos), 2014.
  • 16. Conclusion
    • We are able to extract valuable indexes from a LOD dataset by taking advantage of the definitions of Intensional and Extensional knowledge
    • The extraction process has been tested on a large number of datasets, and its efficiency and effectiveness have been demonstrated
    Future Work
    • To extend the VoID vocabulary with our descriptors
    • We want to propose LODeX as an assistance tool for LOD portals
    • We are extending LODeX to support automatic SPARQL query generation
  • 18. Thanks for your attention!