SlideShare a Scribd company logo
1 of 13
Download to read offline
Indexing Data on the Web: A Comparison
of Schema-level Indices for Data Search
Till Blume1
and Ansgar Scherp2
DEXA 2020, 16th September, Virtual Event
1
Kiel University, Germany
2
Ulm University, Germany
…
Data Source Search with Graph Indices
Index
1
foaf:Agent
dct:subject bibo:Book
dct:creator
3
2
2
Towards a clean air policy
dct:creator
Great Britain. Central Electricity
foaf:Agent
URI-1 URI-2
bibo:Book
dct:subjectURI-3
Graph Indices
● General Purpose Graph indices: path, tree, and subgraph indices.
● Instance-level indices: find specific data instances, e.g., the book “Towards a
clean air policy”.
● Schema-level indices: find all data instances of type “Book”.
3
Problem Definition & Research Question
● Schema-level indices (SLIs) are designed, implemented, and evaluated for
their individual task only, using different queries, datasets, and metrics.
● Limited work has been done on understanding them in different contexts.
● Our hypothesis is that there is no SLI model that fits all tasks and
that the performance of the specific SLI depends on the specific types
of queries and characteristics of the datasets.
4
Contribution
1. We conduct an extensive empirical evaluation of representative SLI models.
To this end, we have for the first time defined and implemented the features of
existing SLI models in a common framework available on GitHub.
2. We evaluate Schema-level Indices independent of their original application for
a data search scenario.
3. We extend the Schema-level Index SchemEX [9] with RDF Schema
inferencing and owl:sameAs inferencing.
5
Computing the Schema-level Index SchemEX
RDF data graphSchema-level index
Book
i-1 i-2
Subject
i-3
Agent
p-1
p-2
Schema Element SE-1
Book
s-1 o-2
Subject
o-3
Agent Source ds-4
p-1
p-2
Data Instance Is-1
Book
s-91 o-92
Subject
o-93
Agent Source ds-7
p-1
p-2
Data Instance Is-91
6
?x
p1
pl
pl+1
pn
… …
?x
?y1
?yn
…
p1
pn
(a) Characteristic Sets [1] (c) Sem Sets [7]
p1
pl
pl+1
pn
… …
……
?x
(b) Weak Property Cliques [10]
?x
?y1
?yn
…
p1
pn
?c1
?cm
…
rdf:type
?c1
rdf:type
?cm
…
?c1
?cm
rdf:type…
(d) LODex [2], Loupe [3],
ABSTAT [5], SchemEX [9]
?yi
?c1
?cm
…?x
?c1
?cm
rdf:type… rdf:type
p1
∪ … ∪ pn
(e) Term Picker [6]
Analysis of representative SLIs
7
Experimental Setup
Implementation of a single, parameterized algorithm 1
Datasets crawled from the Web of Data:
● TimBL-11M: 11 million triples crawl starting from a single seed URI
● DyLDo-127M: 127 million triples crawl starting from 95k seed URIs
Index Models:
● Characteristic Sets, SemSets, Weak Property Cliques, SchemEX, SchemEX+U+I
● Use sub-graph sizes of k = 0, 1, and 2 hop distances for each model
1. https://github.com/t-blume/fluid-framework 8
Experiment 1: Compression and Summarization
9
Experiment 2: Stream-based Index Computation
● scalable computation by observing the graph over a stream of edges with a fixed window size
● approximation errors due to incomplete data
● evaluate the F1-score of data search queries for approximate indices compared to exact indices
● change the window size (1k, 100k, and 200k instances)
Main results: the approximation quality of an index computed in a stream-based
approach depends on three factors:
a. characteristics of the crawled dataset
b. simple queries consistently outperform complex queries
c. a larger window size typically improves the quality only marginally
10
Main Observations
1. Significant correlation between compression ratio and summarization ratio
(intuition: superior summarization -> superior compression)
2. Significant negative correlation between summarization ratio and
approximation quality (intuition: superior summarization -> superior
approximation quality)
3. Dataset characteristic is important: Over all indices, indices have on average
a 0.15 lower F1-score on the DyLDO-127M dataset compared to the
TimBL-11M dataset.
11
Conclusion
● Huge variations in compression ratio, summarization ratio, and approximation
quality for different Schema-level Indices, queries, and datasets.
● Schema-level indices developed for only a specific application can also be
beneficial in different applications, e.g., Characteristic Sets or TermPicker.
● Inferencing over RDF Schema can lead to a superior summarization ratio.
Please find on GitHub our source code, the queries, and the detailed statistics
about the TimBL-11M and DyLDO-127M datasets:
http://github.com/t-blume/fluid-framework
12
References
1. Neumann, T., Moerkotte, G.: Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. In: ICDE. pp. 984–994.
IEEE (2011)
2. Benedetti, F., Bergamaschi, S., Po, L.: Exposing the underlying schema of LOD sources. In: Joint IEEE/WIC/ACM WI and IAT. pp. 301–304.
IEEE (2015)
3. Mihindukulasooriya, N., Poveda-Villal ́on, M., Garc ́ıa-Castro, R., G ́omez-P ́erez, A.:Loupe - an online tool for inspecting datasets in the
Linked Data cloud. In: ISWCPosters & Demos. vol. 1486. CEUR-WS.org (2015)
4. Pietriga, E., G ̈oz ̈ukan, H., Appert, C., Destandau, M., Cebiric, S., Goasdou ́e, F.,Manolescu, I.: Browsing linked data catalogs with
LODAtlas. In: ISWC. vol. 11137,pp. 137–153. Springer (2018)
5. Spahiu, B., Porrini, R., Palmonari, M., Rula, A., Maurino, A.: ABSTAT: ontology-driven linked data summaries with pattern minimalization. In:
ESWC Satellite Events, Revised Selected Papers. vol. 9989, pp. 381–395 (2016)
6. Schaible, J., Gottron, T., Scherp, A.: TermPicker: Enabling the reuse of vocabulary terms by exploiting data from the Linked Open Data
cloud. In: ESWC. vol. 9678,pp. 101–117. Springer (2016)
7. Ciglan, M., Nørv ̊ag, K., Hluch ́y, L.: The SemSets model for ad-hoc semantic list search. In: WWW. pp. 131–140. ACM (2012)
8. Gottron, T., Scherp, A., Krayer, B., Peters, A.: LODatio: using a schema-level index to support users in finding relevant sources of Linked
Data. In: K-CAP. pp.105–108. ACM (2013)
9. Konrath, M., Gottron, T., Staab, S., Scherp, A.: SchemEX - efficient construction of a data catalogue by stream-based indexing of Linked
Data. J. Web Sem.16,52–58 (2012)
10. Goasdou ́e, F., Guzewicz, P., Manolescu, I.: Incremental structural summarization of RDF graphs. In: EDBT. pp. 566–569.
OpenProceedings.org (2019)
13

More Related Content

What's hot

Using parallel hierarchical clustering to
Using parallel hierarchical clustering toUsing parallel hierarchical clustering to
Using parallel hierarchical clustering toBiniam Behailu
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and KnowledgeIan Foster
 
AN INFORMATION THEORY-BASED FEATURE SELECTIONFRAMEWORK FOR BIG DATA UNDER APA...
AN INFORMATION THEORY-BASED FEATURE SELECTIONFRAMEWORK FOR BIG DATA UNDER APA...AN INFORMATION THEORY-BASED FEATURE SELECTIONFRAMEWORK FOR BIG DATA UNDER APA...
AN INFORMATION THEORY-BASED FEATURE SELECTIONFRAMEWORK FOR BIG DATA UNDER APA...Nexgen Technology
 
Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Robert Grossman
 
Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...
Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...
Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...Robert Grossman
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for ScienceIan Foster
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...Paolo Missier
 
An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)Robert Grossman
 
Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)Robert Grossman
 
Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)Robert Grossman
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...Ian Foster
 
MINING USER-AWARE RARE SEQUENTIAL TOPIC PATTERNS IN DOCUMENT STREAMS
MINING USER-AWARE RARE SEQUENTIAL TOPIC PATTERNS IN DOCUMENT STREAMSMINING USER-AWARE RARE SEQUENTIAL TOPIC PATTERNS IN DOCUMENT STREAMS
MINING USER-AWARE RARE SEQUENTIAL TOPIC PATTERNS IN DOCUMENT STREAMSNexgen Technology
 
A study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismsA study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismseSAT Journals
 
Hattrick Simpers TMS Machine Learning Workshop Slides
Hattrick Simpers TMS Machine Learning Workshop SlidesHattrick Simpers TMS Machine Learning Workshop Slides
Hattrick Simpers TMS Machine Learning Workshop SlidesJason Hattrick-Simpers
 
My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataRobert Grossman
 
Optique presentation
Optique presentationOptique presentation
Optique presentationDBOnto
 
Gray-Box Models for Performance Assessment of Spark Applications
Gray-Box Models for Performance Assessment of Spark ApplicationsGray-Box Models for Performance Assessment of Spark Applications
Gray-Box Models for Performance Assessment of Spark ApplicationsATMOSPHERE .
 
Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar...
 Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar... Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar...
Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar...Lviv Data Science Summer School
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Paolo Missier
 

What's hot (20)

Using parallel hierarchical clustering to
Using parallel hierarchical clustering toUsing parallel hierarchical clustering to
Using parallel hierarchical clustering to
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 
AN INFORMATION THEORY-BASED FEATURE SELECTIONFRAMEWORK FOR BIG DATA UNDER APA...
AN INFORMATION THEORY-BASED FEATURE SELECTIONFRAMEWORK FOR BIG DATA UNDER APA...AN INFORMATION THEORY-BASED FEATURE SELECTIONFRAMEWORK FOR BIG DATA UNDER APA...
AN INFORMATION THEORY-BASED FEATURE SELECTIONFRAMEWORK FOR BIG DATA UNDER APA...
 
Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11
 
Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...
Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...
Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...
 
An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)
 
Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)
 
Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
 
MINING USER-AWARE RARE SEQUENTIAL TOPIC PATTERNS IN DOCUMENT STREAMS
MINING USER-AWARE RARE SEQUENTIAL TOPIC PATTERNS IN DOCUMENT STREAMSMINING USER-AWARE RARE SEQUENTIAL TOPIC PATTERNS IN DOCUMENT STREAMS
MINING USER-AWARE RARE SEQUENTIAL TOPIC PATTERNS IN DOCUMENT STREAMS
 
ML in materials discovery
ML in materials discovery ML in materials discovery
ML in materials discovery
 
A study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismsA study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanisms
 
Hattrick Simpers TMS Machine Learning Workshop Slides
Hattrick Simpers TMS Machine Learning Workshop SlidesHattrick Simpers TMS Machine Learning Workshop Slides
Hattrick Simpers TMS Machine Learning Workshop Slides
 
My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big Data
 
Optique presentation
Optique presentationOptique presentation
Optique presentation
 
Gray-Box Models for Performance Assessment of Spark Applications
Gray-Box Models for Performance Assessment of Spark ApplicationsGray-Box Models for Performance Assessment of Spark Applications
Gray-Box Models for Performance Assessment of Spark Applications
 
Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar...
 Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar... Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar...
Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar...
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 

Similar to Indexing data on the web a comparison of schema level indices for data search

Term Paper Presentation
Term Paper PresentationTerm Paper Presentation
Term Paper PresentationShubham Singh
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionSymeon Papadopoulos
 
A modified k means algorithm for big data clustering
A modified k means algorithm for big data clusteringA modified k means algorithm for big data clustering
A modified k means algorithm for big data clusteringSK Ahammad Fahad
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionSotiris Beis
 
NS-CUK Seminar: H.B.Kim, Review on "metapath2vec: Scalable representation le...
NS-CUK Seminar: H.B.Kim,  Review on "metapath2vec: Scalable representation le...NS-CUK Seminar: H.B.Kim,  Review on "metapath2vec: Scalable representation le...
NS-CUK Seminar: H.B.Kim, Review on "metapath2vec: Scalable representation le...ssuser4b1f48
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLAnubhav Jain
 
IRJET - Twitter Spam Detection using Cobweb
IRJET - Twitter Spam Detection using CobwebIRJET - Twitter Spam Detection using Cobweb
IRJET - Twitter Spam Detection using CobwebIRJET Journal
 
論文サーベイ(Sasaki)
論文サーベイ(Sasaki)論文サーベイ(Sasaki)
論文サーベイ(Sasaki)Hajime Sasaki
 
On Using Network Science in Mining Developers Collaboration in Software Engin...
On Using Network Science in Mining Developers Collaboration in Software Engin...On Using Network Science in Mining Developers Collaboration in Software Engin...
On Using Network Science in Mining Developers Collaboration in Software Engin...IJDKP
 
On Using Network Science in Mining Developers Collaboration in Software Engin...
On Using Network Science in Mining Developers Collaboration in Software Engin...On Using Network Science in Mining Developers Collaboration in Software Engin...
On Using Network Science in Mining Developers Collaboration in Software Engin...IJDKP
 
03 interlinking-dass
03 interlinking-dass03 interlinking-dass
03 interlinking-dassDiego Pessoa
 
Smart Data for Behavioural Change: Towards Energy Efficient Buildings
Smart Data for Behavioural Change: Towards Energy Efficient BuildingsSmart Data for Behavioural Change: Towards Energy Efficient Buildings
Smart Data for Behavioural Change: Towards Energy Efficient BuildingsAnna Fensel
 
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...IRJET Journal
 
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...Till Blume
 
PhD Consortium ADBIS presetation.
PhD Consortium ADBIS presetation.PhD Consortium ADBIS presetation.
PhD Consortium ADBIS presetation.Giuseppe Ricci
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET Journal
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET Journal
 

Similar to Indexing data on the web a comparison of schema level indices for data search (20)

Big Data and IOT
Big Data and IOTBig Data and IOT
Big Data and IOT
 
Term Paper Presentation
Term Paper PresentationTerm Paper Presentation
Term Paper Presentation
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detection
 
A modified k means algorithm for big data clustering
A modified k means algorithm for big data clusteringA modified k means algorithm for big data clustering
A modified k means algorithm for big data clustering
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detection
 
NS-CUK Seminar: H.B.Kim, Review on "metapath2vec: Scalable representation le...
NS-CUK Seminar: H.B.Kim,  Review on "metapath2vec: Scalable representation le...NS-CUK Seminar: H.B.Kim,  Review on "metapath2vec: Scalable representation le...
NS-CUK Seminar: H.B.Kim, Review on "metapath2vec: Scalable representation le...
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNL
 
IRJET - Twitter Spam Detection using Cobweb
IRJET - Twitter Spam Detection using CobwebIRJET - Twitter Spam Detection using Cobweb
IRJET - Twitter Spam Detection using Cobweb
 
論文サーベイ(Sasaki)
論文サーベイ(Sasaki)論文サーベイ(Sasaki)
論文サーベイ(Sasaki)
 
Cognitive data
Cognitive dataCognitive data
Cognitive data
 
On Using Network Science in Mining Developers Collaboration in Software Engin...
On Using Network Science in Mining Developers Collaboration in Software Engin...On Using Network Science in Mining Developers Collaboration in Software Engin...
On Using Network Science in Mining Developers Collaboration in Software Engin...
 
On Using Network Science in Mining Developers Collaboration in Software Engin...
On Using Network Science in Mining Developers Collaboration in Software Engin...On Using Network Science in Mining Developers Collaboration in Software Engin...
On Using Network Science in Mining Developers Collaboration in Software Engin...
 
03 interlinking-dass
03 interlinking-dass03 interlinking-dass
03 interlinking-dass
 
Smart Data for Behavioural Change: Towards Energy Efficient Buildings
Smart Data for Behavioural Change: Towards Energy Efficient BuildingsSmart Data for Behavioural Change: Towards Energy Efficient Buildings
Smart Data for Behavioural Change: Towards Energy Efficient Buildings
 
Keynote at AImWD
Keynote at AImWDKeynote at AImWD
Keynote at AImWD
 
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...
 
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
 
PhD Consortium ADBIS presetation.
PhD Consortium ADBIS presetation.PhD Consortium ADBIS presetation.
PhD Consortium ADBIS presetation.
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
 

Recently uploaded

High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑Damini Dixit
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLkantirani197
 
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....muralinath2
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticssakshisoni2385
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professormuralinath2
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxseri bangash
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Monika Rani
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.Nitya salvi
 
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flyPRADYUMMAURYA1
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learninglevieagacer
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Silpa
 
FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceAlex Henderson
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxFarihaAbdulRasheed
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .Poonam Aher Patil
 
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Servicemonikaservice1
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsOrtegaSyrineMay
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfrohankumarsinghrore1
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bSérgio Sacani
 

Recently uploaded (20)

High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
 
FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical Science
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its Functions
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 

Indexing data on the web a comparison of schema level indices for data search

  • 1. Indexing Data on the Web: A Comparison of Schema-level Indices for Data Search Till Blume1 and Ansgar Scherp2 DEXA 2020, 16th September, Virtual Event 1 Kiel University, Germany 2 Ulm University, Germany
  • 2. … Data Source Search with Graph Indices Index 1 foaf:Agent dct:subject bibo:Book dct:creator 3 2 2 Towards a clean air policy dct:creator Great Britain. Central Electricity foaf:Agent URI-1 URI-2 bibo:Book dct:subjectURI-3
  • 3. Graph Indices ● General Purpose Graph indices: path, tree, and subgraph indices. ● Instance-level indices: find specific data instances, e.g., the book “Towards a clean air policy”. ● Schema-level indices: find all data instances of type “Book”. 3
  • 4. Problem Definition & Research Question ● Schema-level indices (SLIs) are designed, implemented, and evaluated for their individual task only, using different queries, datasets, and metrics. ● Limited work has been done on understanding them in different contexts. ● Our hypothesis is that there is no SLI model that fits all tasks and that the performance of the specific SLI depends on the specific types of queries and characteristics of the datasets. 4
  • 5. Contribution 1. We conduct an extensive empirical evaluation of representative SLI models. To this end, we have for the first time defined and implemented the features of existing SLI models in a common framework available on GitHub. 2. We evaluate Schema-level Indices independent of their original application for a data search scenario. 3. We extend the Schema-level Index SchemEX [9] with RDF Schema inferencing and owl:sameAs inferencing. 5
  • 6. Computing the Schema-level Index SchemEX RDF data graphSchema-level index Book i-1 i-2 Subject i-3 Agent p-1 p-2 Schema Element SE-1 Book s-1 o-2 Subject o-3 Agent Source ds-4 p-1 p-2 Data Instance Is-1 Book s-91 o-92 Subject o-93 Agent Source ds-7 p-1 p-2 Data Instance Is-91 6
  • 7. ?x p1 pl pl+1 pn … … ?x ?y1 ?yn … p1 pn (a) Characteristic Sets [1] (c) Sem Sets [7] p1 pl pl+1 pn … … …… ?x (b) Weak Property Cliques [10] ?x ?y1 ?yn … p1 pn ?c1 ?cm … rdf:type ?c1 rdf:type ?cm … ?c1 ?cm rdf:type… (d) LODex [2], Loupe [3], ABSTAT [5], SchemEX [9] ?yi ?c1 ?cm …?x ?c1 ?cm rdf:type… rdf:type p1 ∪ … ∪ pn (e) Term Picker [6] Analysis of representative SLIs 7
  • 8. Experimental Setup Implementation of a single, parameterized algorithm 1 Datasets crawled from the Web of Data: ● TimBL-11M: 11 million triples crawl starting from a single seed URI ● DyLDo-127M: 127 million triples crawl starting from 95k seed URIs Index Models: ● Characteristic Sets, SemSets, Weak Property Cliques, SchemEX, SchemEX+U+I ● Use sub-graph sizes of k = 0, 1, and 2 hop distances for each model 1. https://github.com/t-blume/fluid-framework 8
  • 9. Experiment 1: Compression and Summarization 9
  • 10. Experiment 2: Stream-based Index Computation ● scalable computation by observing the graph over a stream of edges with a fixed window size ● approximation errors due to incomplete data ● evaluate the F1-score of data search queries for approximate indices compared to exact indices ● change the window size (1k, 100k, and 200k instances) Main results: the approximation quality of an index computed in a stream-based approach depends on three factors: a. characteristics of the crawled dataset b. simple queries consistently outperform complex queries c. a larger window size typically improves the quality only marginally 10
  • 11. Main Observations 1. Significant correlation between compression ratio and summarization ratio (intuition: superior summarization -> superior compression) 2. Significant negative correlation between summarization ratio and approximation quality (intuition: superior summarization -> superior approximation quality) 3. Dataset characteristic is important: Over all indices, indices have on average a 0.15 lower F1-score on the DyLDO-127M dataset compared to the TimBL-11M dataset. 11
  • 12. Conclusion ● Huge variations in compression ratio, summarization ratio, and approximation quality for different Schema-level Indices, queries, and datasets. ● Schema-level indices developed for only a specific application can also be beneficial in different applications, e.g., Characteristic Sets or TermPicker. ● Inferencing over RDF Schema can lead to a superior summarization ratio. Please find on GitHub our source code, the queries, and the detailed statistics about the TimBL-11M and DyLDO-127M datasets: http://github.com/t-blume/fluid-framework 12
  • 13. References 1. Neumann, T., Moerkotte, G.: Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. In: ICDE. pp. 984–994. IEEE (2011) 2. Benedetti, F., Bergamaschi, S., Po, L.: Exposing the underlying schema of LOD sources. In: Joint IEEE/WIC/ACM WI and IAT. pp. 301–304. IEEE (2015) 3. Mihindukulasooriya, N., Poveda-Villal ́on, M., Garc ́ıa-Castro, R., G ́omez-P ́erez, A.:Loupe - an online tool for inspecting datasets in the Linked Data cloud. In: ISWCPosters & Demos. vol. 1486. CEUR-WS.org (2015) 4. Pietriga, E., G ̈oz ̈ukan, H., Appert, C., Destandau, M., Cebiric, S., Goasdou ́e, F.,Manolescu, I.: Browsing linked data catalogs with LODAtlas. In: ISWC. vol. 11137,pp. 137–153. Springer (2018) 5. Spahiu, B., Porrini, R., Palmonari, M., Rula, A., Maurino, A.: ABSTAT: ontology-driven linked data summaries with pattern minimalization. In: ESWC Satellite Events, Revised Selected Papers. vol. 9989, pp. 381–395 (2016) 6. Schaible, J., Gottron, T., Scherp, A.: TermPicker: Enabling the reuse of vocabulary terms by exploiting data from the Linked Open Data cloud. In: ESWC. vol. 9678,pp. 101–117. Springer (2016) 7. Ciglan, M., Nørv ̊ag, K., Hluch ́y, L.: The SemSets model for ad-hoc semantic list search. In: WWW. pp. 131–140. ACM (2012) 8. Gottron, T., Scherp, A., Krayer, B., Peters, A.: LODatio: using a schema-level index to support users in finding relevant sources of Linked Data. In: K-CAP. pp.105–108. ACM (2013) 9. Konrath, M., Gottron, T., Staab, S., Scherp, A.: SchemEX - efficient construction of a data catalogue by stream-based indexing of Linked Data. J. Web Sem.16,52–58 (2012) 10. Goasdou ́e, F., Guzewicz, P., Manolescu, I.: Incremental structural summarization of RDF graphs. In: EDBT. pp. 566–569. OpenProceedings.org (2019) 13