Linked Data quality assessment applied to, and integrated into, the Linked Data generation and publication workflow. Presented at the Data Quality tutorial, a satellite event of SEMANTICS2016.
Mappings Validation
1. Mappings Validation
Data Quality Tutorial - SEMANTICS2016
Anastasia Dimou
Anastasia.Dimou@ugent.be ● @natadimou
Ghent University – iMinds
2. Linked (Open) Data
semantically annotated & interlinked data
using different vocabularies or ontologies
published in the form of RDF datasets
3. Linked (Open) Data
derive from originally heterogeneous
(semi-)structured data
e.g.
Eurostat from TSV
DBLP from DBLP database
DBpedia from Wikipedia
LinkedBrainz from MusicBrainz database
...
5. Linked Data Quality dimensions
Representational dimension
Intrinsic dimension
Accessibility dimension
Contextual dimension
A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer.
Quality Assessment for Linked Data: A Survey.
Semantic Web Journal, 2016.
6. Linked Data Quality dimensions
Representational dimension
data modeling
Intrinsic dimension
Linked Data generation
Accessibility dimension
Linked Data publication
Contextual dimension
Linked Data consumption
8. Linked Data Quality - Intrinsic Dimension
determines the RDF Dataset Quality
by assessing it for possible violations
with respect to
accuracy (e.g. malformed datatype literals)
consistency (e.g. disjoint classes/properties)
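To make the two kinds of violations on this slide concrete, here is a minimal Python sketch. Everything in it is hypothetical: the `ex:` names, the sample triples, the disjointness table and both helper functions. The datatype check covers only simplified lexical forms of xsd:integer and xsd:gYear. It illustrates the idea and is not how any particular assessment tool works.

```python
import re

# Hypothetical, simplified lexical-form checks for two XSD datatypes.
LEXICAL_FORMS = {
    "xsd:integer": re.compile(r"[+-]?\d+"),
    "xsd:gYear": re.compile(r"-?\d{4,}"),
}

# Hypothetical schema knowledge: pairs of classes declared disjoint.
DISJOINT = {frozenset({"ex:Person", "ex:Place"})}

def malformed_literals(literals):
    """Accuracy: return literals whose value does not match its datatype."""
    bad = []
    for subject, predicate, value, datatype in literals:
        pattern = LEXICAL_FORMS.get(datatype)
        if pattern is not None and not pattern.fullmatch(value):
            bad.append((subject, predicate, value, datatype))
    return bad

def disjoint_class_violations(types):
    """Consistency: return subjects typed with two disjoint classes."""
    by_subject = {}
    for subject, cls in types:
        by_subject.setdefault(subject, set()).add(cls)
    return sorted(
        s for s, classes in by_subject.items()
        if any(pair <= classes for pair in DISJOINT)
    )

# Hypothetical sample data.
literals = [
    ("ex:alice", "ex:birthYear", "1984", "xsd:gYear"),
    ("ex:bob", "ex:birthYear", "unknown", "xsd:gYear"),
]
types = [
    ("ex:alice", "ex:Person"),
    ("ex:ghent", "ex:Place"),
    ("ex:ghent", "ex:Person"),
]

print(malformed_literals(literals))
print(disjoint_class_violations(types))
```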
9. Instead of applying Quality Assessment
to the already published Linked Data
as part of Linked Data consumption
Apply Quality Assessment
to the Mappings
that generate the Linked Data
as part of Linked Data production
17. Linked Data Quality Assessment (DQA)
RDFUnit http://rdfunit.aksw.org
test-driven data-debugging framework
based on SPARQL-patterns
D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. J. Zaveri.
Test-driven evaluation of linked data quality.
In Proceedings of the 23rd International Conference on World Wide Web (WWW 2014), 2014.
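The idea of SPARQL patterns can be sketched as a generic query template that is instantiated once per schema axiom, yielding one concrete test case each. The sketch below is an assumption-laden illustration in Python: the placeholder syntax, the pattern itself, the `instantiate` helper and the example.org properties are all hypothetical, and none of it is RDFUnit's actual pattern format or API.

```python
from string import Template

# Hypothetical pattern: find literals of a property whose datatype
# differs from the one the schema declares for that property.
RANGE_DATATYPE_PATTERN = Template(
    "SELECT ?s WHERE {\n"
    "  ?s <$property> ?value .\n"
    "  FILTER (isLiteral(?value) && datatype(?value) != <$datatype>)\n"
    "}"
)

def instantiate(pattern, bindings):
    """Turn one generic pattern into one concrete test case per binding."""
    return [pattern.substitute(b) for b in bindings]

# Hypothetical schema axioms: (property, expected datatype).
schema_bindings = [
    {"property": "http://example.org/birthYear",
     "datatype": "http://www.w3.org/2001/XMLSchema#gYear"},
    {"property": "http://example.org/population",
     "datatype": "http://www.w3.org/2001/XMLSchema#integer"},
]

test_cases = instantiate(RANGE_DATATYPE_PATTERN, schema_bindings)
for query in test_cases:
    print(query, end="\n\n")
```

Each generated query returns the subjects that violate the corresponding axiom, so an empty result means the test case passes.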
21. Linked Data Quality Assessment (DQA)
Similar violations occur repeatedly
within a single Linked Data set
22. Linked Data Quality Assessment (DQA)
Sets of triples of a dataset have
repetitive patterns
24. DQA: Linked Data Quality Assessment
is applied by third parties
to already published Linked Data sets
violations
DQA
25. DQA: Linked Data Quality Assessment
Adjustments are NOT applied
at the root of the problem
26. DQA: Linked Data Quality Assessment
Adjustments are overwritten
if a new version of the original data
is annotated and published as Linked Data
27. Instead of applying Quality Assessment
to the already published Linked Data set
as part of data consumption
28. Apply Quality Assessment to the Mappings
that generate the Linked Data
A. Dimou, D. Kontokostas, M. Freudenberg, R. Verborgh, J. Lehmann, E. Mannens, S. Hellmann, R. Van de Walle
Assessing and Refining Mappings to RDF to Improve Dataset Quality.
In Proceedings of The Semantic Web - ISWC 2015
31. RDF Mapping Language (RML) http://rml.io
extends the W3C-recommended R2RML
specify the mapping rules to
generate Linked Data
from heterogeneous data sources
mapping rules are Linked Data sets too!
A. Dimou, M. Vander Sande, P. Colpaert, R. Verborgh, E. Mannens, and R. Van de Walle.
RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data.
In Proceedings of the 7th Workshop on Linked Data on the Web (LDOW2014), 2014.
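What "mapping rules that generate Linked Data" means can be sketched in a few lines of Python. This is a deliberately simplified stand-in: real RML rules are themselves written in RDF, whereas the dictionary layout, the `generate_triples` helper, the CSV content and the example.org IRIs here are hypothetical and only illustrate the principle of one rule applied to every record.

```python
import csv
import io

# Hypothetical, simplified mapping rule (not RML syntax).
mapping = {
    "subject_template": "http://example.org/person/{id}",
    "class": "http://example.org/Person",
    "predicate_object_maps": [
        {"predicate": "http://example.org/name", "column": "name"},
        {"predicate": "http://example.org/birthYear", "column": "born"},
    ],
}

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def generate_triples(rows, rule):
    """Apply one mapping rule to every input record."""
    for row in rows:
        subject = rule["subject_template"].format(**row)
        yield (subject, RDF_TYPE, rule["class"])
        for pom in rule["predicate_object_maps"]:
            yield (subject, pom["predicate"], row[pom["column"]])

# Hypothetical tabular source.
source = io.StringIO("id,name,born\n1,Ada,1815\n2,Alan,1912\n")
triples = list(generate_triples(csv.DictReader(source), mapping))

for triple in triples:
    print(triple)
```

Because the same rule produces the triples for every record, any flaw in the rule is reproduced in each generated entity.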
47. MQA: Mapping Quality Assessment
discover not only the violations
but also their origin
before they are even generated
48. MQA: Mapping Quality Assessment
easily apply structural adjustments
prevent the same violations from
appearing repeatedly over distinct entities
allow intuitively combining
different ontologies and vocabularies
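A rough Python sketch of why assessing the mappings finds a violation at its origin: the rule is checked against the schema once, before any triple is generated. The schema table, the rule layout and both functions are hypothetical illustrations, not the assessment procedure of the cited work.

```python
# Hypothetical schema: expected datatype (range) per property.
SCHEMA_RANGES = {
    "ex:birthYear": "xsd:gYear",
    "ex:name": "xsd:string",
}

# Hypothetical mapping rules: each states the datatype it will assign.
mapping_rules = [
    {"id": "rule-name", "predicate": "ex:name", "datatype": "xsd:string"},
    {"id": "rule-born", "predicate": "ex:birthYear", "datatype": "xsd:string"},
]

def assess_mapping(rules, ranges):
    """Report rules whose declared datatype contradicts the schema."""
    return [
        (r["id"], r["predicate"], r["datatype"], ranges[r["predicate"]])
        for r in rules
        if r["predicate"] in ranges and r["datatype"] != ranges[r["predicate"]]
    ]

def violations_in_dataset(rules, ranges, record_count):
    """Each faulty rule repeats its violation once per input record."""
    return len(assess_mapping(rules, ranges)) * record_count

mapping_violations = assess_mapping(mapping_rules, SCHEMA_RANGES)
print(mapping_violations)
print(violations_in_dataset(mapping_rules, SCHEMA_RANGES, 1000))
```

In this toy setting a single faulty rule shows up once in the mapping assessment, while the generated dataset would contain it once per record. Fixing the rule removes all of them at the same time.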
63. Mapping Quality Assessment: Limitations
certain test cases inevitably
require the complete Linked Data set
cardinality,
functionality,
symmetry
In defense of the Mappings:
these are rather data issues,
NOT affected by the mapping rules
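Why such test cases need the complete Linked Data set can be seen in a small Python sketch of a functional-property check: whether a subject ends up with more than one value depends on the source records, not on the rule that maps them. The data, the `ex:` names and the helper are hypothetical.

```python
from collections import defaultdict

def functional_property_violations(triples, functional_properties):
    """Return (subject, property) pairs that carry more than one value."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p in functional_properties:
            values[(s, p)].add(o)
    return sorted(key for key, objs in values.items() if len(objs) > 1)

# Hypothetical output of one and the same (correct) mapping rule:
# the conflict for ex:alan comes from the source data itself.
triples = [
    ("ex:ada", "ex:birthYear", "1815"),
    ("ex:alan", "ex:birthYear", "1912"),
    ("ex:alan", "ex:birthYear", "1913"),
]

print(functional_property_violations(triples, {"ex:birthYear"}))
```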
66. Dataset Vs Mapping Quality Assessment
Number of Violations
#violations per Quality Assessment
              Dataset Assessment    Mappings Assessment
DBpedia EN    3.2M                  160
DBLP          8.1M                  8
*DBpedia and DBLP D2RQ Mappings were translated to RML mappings
A. Dimou, D. Kontokostas, M. Freudenberg, R. Verborgh, J. Lehmann, E. Mannens, S. Hellmann, R. Van de Walle
Assessing and Refining Mappings to RDF to Improve Dataset Quality.
In Proceedings of The Semantic Web - ISWC 2015
67. Dataset Vs Mapping Quality Assessment
Time
              Dataset Quality Assessment    Mappings Quality Assessment
              size      time                size      time
DBpedia EN    62M       16h                 115K      11s
DBpedia NL    21M       1.5h                53K       6s
DBLP          12M       12h                 368       12s
A. Dimou, D. Kontokostas, M. Freudenberg, R. Verborgh, J. Lehmann, E. Mannens, S. Hellmann, R. Van de Walle
Assessing and Refining Mappings to RDF to Improve Dataset Quality.
In Proceedings of The Semantic Web - ISWC 2015
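The gap between the two timings on this slide is easier to read as a ratio. The short Python calculation below only converts the reported hours to seconds and divides. It adds no new measurements, and the variable names are arbitrary.

```python
# Timings copied from the slide: (dataset assessment, mapping assessment),
# both expressed in seconds.
timings = {
    "DBpedia EN": (16 * 3600, 11),
    "DBpedia NL": (1.5 * 3600, 6),
    "DBLP": (12 * 3600, 12),
}

# Ratio of dataset assessment time to mapping assessment time.
speedups = {name: dataset / mapping for name, (dataset, mapping) in timings.items()}

for name, ratio in speedups.items():
    print(f"{name}: about {ratio:,.0f}x")
```

On these figures, assessing the mappings is roughly 5,200 times faster for DBpedia EN, 900 times for DBpedia NL and 3,600 times for DBLP.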
68. Mapping Quality Assessment
* http://mappings.dbpedia.org/validation
Live update of DBpedia Mapping Quality Assessment results every night! ☺
Mapping Quality Assessment
               size    time
DBpedia EN     115K    11s
DBpedia NL     53K     6s
DBpedia All    511K    32s
A. Dimou, D. Kontokostas, M. Freudenberg, R. Verborgh, J. Lehmann, E. Mannens, S. Hellmann, R. Van de Walle
Assessing and Refining Mappings to RDF to Improve Dataset Quality.
In Proceedings of The Semantic Web - ISWC 2015
69. DBpedia Mappings Quality Assessment
* http://mappings.dbpedia.org/validation
A. Dimou, D. Kontokostas, M. Freudenberg, R. Verborgh, J. Lehmann, E. Mannens, S. Hellmann
DBpedia Mappings Quality Assessment.
To be published in Proceedings of the 15th International Semantic Web Conference: Posters and Demos 2016
Live update of DBpedia Mapping Quality Assessment results every night! ☺
71. Violations
are related to the dataset's schema
(vocabularies or ontologies)
occur repeatedly
within a single RDF dataset
The situation worsens the more
ontologies and vocabularies
are reused and combined
72. Linked Data Quality Assessment
shifted from data consumption
to data publication
integrated systematically
in the publishing workflow
violations are identified,
resolved and will not re-appear
Linked Data of higher Quality is generated!!!