Measuring Metadata Quality
and the Europeana use case
Péter Király
peter.kiraly@gwdg.de
4th Linked Data Quality Workshop,
Portorož, Slovenia
29th May, 2017
Measuring metadata quality. Glossary
★ Metadata here: cultural heritage metadata (descriptions of books etc.)
★ Europeana: a metadata aggregator collecting from 3500+ cultural heritage institutions, http://europeana.eu
★ Big Data here: 10-100 million metadata records, 100 GB - 1.5 TB
★ EDM: Europeana Data Model, Europeana’s metadata schema
★ MARC: MAchine-Readable Cataloging, a library metadata standard
Measuring metadata quality. Generic title and bad thumbnail
more examples in Report and Recommendations from the Task Force on Metadata Quality (2015)
Measuring metadata quality. Multilinguality problem
★ Mona Lisa → 456 results
★ La Gioconda → 365 results
★ La Joconde → 71 results
http://www.europeana.eu/portal/en/record/90402/RP_F_00_351.html
Measuring metadata quality. Problems with title
more examples in Report and Recommendations from the Task Force on Metadata Quality (2015)
★ Same title and description:
title: "VOETBAL-EREDIVISIE-FEYENOORD - GO AHEAD 3-1",
description: "VOETBAL-EREDIVISIE-FEYENOORD - GO AHEAD 3-1"
★ Machine-readable ID in title:
title: "NLD-820630-AMSTERDAM: Straatmuzikanten proberen geld te verdienen voor..."
★ Leftover:
title: "+++EMPTY+++"
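The three title problems on this slide could be turned into automatic checks roughly as follows. This is a minimal sketch, not the checks used by the framework: the record layout (a dict with `title` and `description`), the function name `title_problems`, and the ID regular expression are all assumptions made for illustration.

```python
import re

# Hypothetical pattern for a machine-readable ID prefix such as "NLD-820630-AMSTERDAM:".
# The regex is an assumption for illustration, not a rule taken from the slides.
ID_PREFIX = re.compile(r"^[A-Z]{2,}-\d{4,}-[A-Z]+:")

# Placeholder values left over from cataloging systems.
# Only "+++EMPTY+++" comes from the slide.
PLACEHOLDERS = {"+++EMPTY+++"}


def title_problems(record):
    """Return the names of the title problems found in one record (a dict)."""
    title = (record.get("title") or "").strip()
    description = (record.get("description") or "").strip()
    problems = []
    if title and title == description:
        problems.append("same title and description")
    if ID_PREFIX.match(title):
        problems.append("machine-readable ID in title")
    if title in PLACEHOLDERS:
        problems.append("leftover placeholder")
    return problems
```

Each check returns a label rather than a score, so the results can be counted per collection as entries of a problem catalog.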
Measuring metadata quality. Non-informative values
★ non-informative dc:title (bad): “photograph, framed”, “group photograph”, “photograph”
★ informative dc:title (good): “Photograph of Sir Dugald Clerk”, “Photograph of "Puffing Billy"”
Measuring metadata quality. Copy & paste cataloging
from a template?
more examples in Report and Recommendations from the Task Force on Metadata Quality (2015)
Measuring metadata quality. The problem
there are “good” and “bad” metadata records
but we don’t have clear metrics like this:
[Figure: scale of functional requirements, labeled good, acceptable, bad]
Measuring metadata quality. Why is data quality important?
“Fitness for purpose” (QA principle)
purpose: to access content
no metadata → no access to data → no data usage
more explanation:
Data on the Web Best Practices
W3C Working Draft, https://www.w3.org/TR/dwbp/
Measuring metadata quality. Hypothesis
by measuring structural elements we
can approximate metadata record quality
≃ metadata smell
Measuring metadata quality. Purposes
★ improve the metadata
★ services: good data → reliable functions
★ better metadata schema & documentation
★ propagate “good practice”
Measuring metadata quality. Proposal I.
Europeana Data Quality Committee
★ Analysing/revising metadata schema
★ Functional requirement analysis
★ Problem catalog
★ Multilinguality
Measuring metadata quality. Proposal II.
“Metadata Quality Assurance Framework”
a generic tool for measuring metadata quality
★ adaptable to different metadata schemes
★ scalable (to Big Data)
★ understandable reports for data curators
★ open source
Measuring metadata quality. Data processing workflow
★ ingest (OAI-PMH, Europeana API, Hadoop, NoSQL) → json
★ measure (Spark, Hadoop, Java, Apache Solr) → csv
★ statistical analysis (Spark, R) → json, png
★ web interface (PHP, D3.js, highchart.js, NoSQL) → html, svg
Measuring metadata quality. What to measure?
★ Structural and semantic features (generic metrics): completeness, cardinality, uniqueness, length, dictionary entry, data type conformance, multilinguality
★ Functional requirement analysis / Discovery scenarios: requirements of the most important functions
★ Problem catalog: known metadata problems
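Two of the generic metrics listed above, completeness and cardinality, can be sketched in a few lines. The record layout (a dict mapping field names to lists of string values), the field list `FIELDS` and the sample `RECORD` are hypothetical and serve only to show the arithmetic; the framework's own definitions may differ.

```python
def non_empty_values(record, field):
    """Values of a field with empty and whitespace-only strings dropped."""
    return [v for v in record.get(field, []) if v.strip()]


def completeness(record, fields):
    """Share of the given fields that have at least one non-empty value."""
    present = sum(1 for f in fields if non_empty_values(record, f))
    return present / len(fields)


def cardinality(record, fields):
    """Number of non-empty values per field."""
    return {f: len(non_empty_values(record, f)) for f in fields}


# Hypothetical field list and record, for illustration only.
FIELDS = ["dc:title", "dc:description", "dc:subject", "dc:creator"]
RECORD = {
    "dc:title": ["Photograph of Sir Dugald Clerk"],
    "dc:description": [""],
    "dc:subject": ["Ballet", "Opera"],
}

print(completeness(RECORD, FIELDS))  # 2 of 4 fields are filled
print(cardinality(RECORD, FIELDS))
```

Both functions work on one record at a time, which keeps them easy to run in parallel over a large collection and to aggregate afterwards.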
Measuring metadata quality. Metadata requirements / User scenario
“As a user I want to be able to filter by whether a person is the subject of a book, or its author, engraver, printer etc.”
Metadata analysis
Description of relevant metadata elements and their rules
Measurement rules
★ the relevant field values should be resolvable URIs
★ each URI should be associated with labels in multiple languages
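The two measurement rules above might be checked along these lines. This is a sketch under stated assumptions: `is_http_uri` only tests the syntax of the value (real resolvability would need an HTTP request), and `labels_by_uri` is a hypothetical lookup from a URI to its `(label, language)` pairs, for example filled by dereferencing beforehand.

```python
from urllib.parse import urlparse


def is_http_uri(value):
    """Syntactic check only: a real test of resolvability would need an HTTP request."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def check_rules(values, labels_by_uri, min_languages=2):
    """Check each field value against the two measurement rules."""
    report = {}
    for value in values:
        # Collect the distinct language tags of the labels known for this value.
        languages = {lang for _, lang in labels_by_uri.get(value, []) if lang}
        report[value] = {
            "is_uri": is_http_uri(value),
            "multilingual_labels": len(languages) >= min_languages,
        }
    return report
```

A literal such as "Ballet" fails both rules, while a concept URI with labels in at least two languages passes both.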
Measuring metadata quality. Metadata requirements / Supported functions
#1 Resource Discovery
★ Search: search for a resource corresponding to stated criteria (i.e., to search either a single entity or a set of entities using an attribute or relationship of the entity as the search criteria)
★ Identify: confirm that the entity described or located corresponds to the entity sought
★ Select: choose an entity that meets the user’s requirements
★ Obtain: access a resource either physically or electronically
#2 Resource Use
★ Restrict
★ Manage
★ Operate
★ Interpret
#3 Data Management
★ Identify
★ Process
★ Sort
★ Display
Functional Analysis of the MARC 21 Bibliographic and Holdings Formats
http://www.loc.gov/marc/marc-functional-analysis/source/analysis.pdf
Measuring metadata quality. Metadata requirements / element—function map
[Figures: Europeana sub-dimensions; MARC Summary of Mapping to User Tasks]
Measuring metadata quality. The data aggregation workflow (in Europeana)
Dublin Core, LIDO, EAD, MARC, EDM, custom, ... → data transformations → Europeana Data Model (EDM)
Measuring metadata quality. Measurement
Views (connected by links): overall view, collection view, record view
Metrics:
★ Completeness
★ Field cardinality
★ Uniqueness
★ Multilinguality
★ Language specification
★ Problem catalog
★ etc.
Data shown: aggregated statistics, measurements
Measuring metadata quality. Measurement - Field frequency per collections
no record has alternative title
every record has alternative title
filters
Measuring metadata quality. Measurement - Details of field cardinality
128 subjects in one record
median is 0, mean is close to 1
link to interesting records
Measuring metadata quality. Measurement - Multilinguality
@resource is a URI
@ = language notation in RDF
no language specification
Measuring metadata quality. Measurement - Language frequency
has language specification
has no language specification
Measuring metadata quality. Measurement - Encoding problems
same language, different encodings
Measuring metadata quality. Measurement - Multilinguality metrics
★ Number of (distinct) languages in the metadata
★ Number of tagged literals
★ Tagged literals per language
Requirement: language annotations / tags!
Measuring metadata quality. Measurement - Distinct Languages
★ Text w/o language annotation (dc.subject: Germany) → 0
★ Text with language annotation (dc.subject: Germany@en) → 1
★ Text with several language annotations (dc.subject: Germany@en, Deutschland@de) → 2
★ Link to (multilingual) vocabulary (http://www.geonames.org/2921044/federal-republic-of-germany) → n
Measuring metadata quality. Measurement - Record level
Literals without language tags: 0 distinct languages, 0 tagged literals
<#record> a ore:Proxy ;
  dc:subject "Ballet", "Opera" .
Links after dereferencing: 11 distinct languages, 19 tagged literals, 1.7 literals per language
<#record> a ore:Proxy ; edm:europeanaProxy true ;
  dc:subject <http://data.europeana.eu/concept/base/264>
  , <http://data.europeana.eu/concept/base/247> .
<http://data.europeana.eu/concept/base/264> a skos:Concept ;
  skos:prefLabel "Ballett"@no, "बैले"@hi, "Ballett"@de, "Балет"@be, "Балет"@ru
  , "Balé"@pt, "Балет"@bg, "Baletas"@lt, "Balet"@hr, "Balets"@lv .
<http://data.europeana.eu/concept/base/247>
  skos:prefLabel "Opera"@no, "ओपेरा (गीतिनाटक)"@hi, "Oper"@de, "Ooppera"@fi
  , "Опера"@be, "Опера"@ru, "Ópera"@pt, "Опера"@bg, "Opera"@lt .
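The record-level figures on this slide can be recomputed from the labels shown. The sketch below assumes each literal is represented as a `(text, language tag or None)` pair; the function name `multilinguality` and that representation are illustrative choices, not the framework's API.

```python
def multilinguality(literals):
    """Compute the three metrics from (text, language_tag_or_None) pairs."""
    tagged = [(text, lang) for text, lang in literals if lang]
    languages = {lang for _, lang in tagged}
    per_language = len(tagged) / len(languages) if languages else 0.0
    return {
        "distinct_languages": len(languages),
        "tagged_literals": len(tagged),
        "literals_per_language": round(per_language, 1),
    }


# The untagged literals of the first record on the slide.
plain = [("Ballet", None), ("Opera", None)]

# The dereferenced skos:prefLabel values of the two concepts on the slide.
ballet = [("Ballett", "no"), ("बैले", "hi"), ("Ballett", "de"), ("Балет", "be"),
          ("Балет", "ru"), ("Balé", "pt"), ("Балет", "bg"), ("Baletas", "lt"),
          ("Balet", "hr"), ("Balets", "lv")]
opera = [("Opera", "no"), ("ओपेरा (गीतिनाटक)", "hi"), ("Oper", "de"), ("Ooppera", "fi"),
         ("Опера", "be"), ("Опера", "ru"), ("Ópera", "pt"), ("Опера", "bg"),
         ("Opera", "lt")]

print(multilinguality(plain))
print(multilinguality(ballet + opera))
```

The ten labels of the first concept and the nine of the second give 19 tagged literals over 11 distinct languages, so 19 / 11 rounds to 1.7 literals per language.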
Measuring metadata quality. Measurement - Contributing to Multilinguality
[Figure: sources contributing to multilinguality. Labels: Data from Provider, Data added by Europeana, dereferenced, Quantifiable. Fields shown: dc:creator, dc:type, dc:subject. Example values: “subject”@en, “Subject”, <http://dbpedia.org/aSubjectID>, <http://vocab.getty.edu/aPersonNumber>, <http://udcdata.info/rdf/065280>]
Measuring metadata quality. Measurement - Multilingual saturation heatmap
Measuring metadata quality. Measurement - Good example
Descriptive fields
★ dc:title: "Die Mauer muß weg!"@de, "Die Mauer muß weg! (The Wall must go!)"@en
★ dc:description: "Kommentiertes Fotorama mit Bildern von 1989-1990 in Berlin"@de, "Annotated images from 1989-1990 in Berlin"@en
Subject headings
★ Place/skos:prefLabel: "Brandenburger Tor"@de, "Brandenburg Gate"@en
★ Place/skos:prefLabel: "Grenzübergang Potsdamer Platz"@de, "Postdamer Platz border crossing"@en
★ Place/skos:prefLabel: "Reichstag"@de, "Reichstag building"@en
Measuring metadata quality. Problems - Linked data: depth of iteration
Measuring metadata quality. Problems - Linked data: lost links
Measuring metadata quality. Problems - Outliers
bulk of records are close to zero
although 25% are between 0.05 and 1.25
Measuring metadata quality. Problems - Outliers
Measuring metadata quality. Problems - Outliers
[Figure: distribution split into zeros / lower outliers, “normal” values, high outliers]
Measuring metadata quality. Problems - Layers
:provider
  dc:subject "special relativity"@en ;
  dc:creator <http://vocab.getty.edu/ulan/500240971> ;   # dereferenceable vocabulary
  dc:type <http://udcdata.info/001684> .   # non-dereferenceable vocabulary
:enhancement
  dc:subject <http://dbpedia.org/resource/Physics> .   # dereferenceable vocabulary
<http://vocab.getty.edu/ulan/500240971>
  skos:prefLabel "Einstein, Albert"@de .
<http://dbpedia.org/resource/Physics>
  skos:prefLabel "Physics"@en .
<http://udcdata.info/001684>
  skos:prefLabel "Books in general"@en .
Measuring metadata quality. Problems - Layers
source | field | link | value | ① ② ③ ④
:provider | dc:subject | literal | "special relativity"@en | ① ② ③ ④
:provider | dc:creator | standard | "Einstein, Albert"@de | ① ② ③ ④
:provider | dc:type | non-std | "Books in general"@en | ② ④
:enhancement | dc:subject | standard | "Physics"@en | ③ ④
① provider's data and dereferenceable enrichments
② provider's data and all enrichments
③ all data and dereferenceable enrichments
④ all data and all enrichments
credit: Antoine Isaac
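One way to read the table is as two independent switches: whether a value comes from the provider, and whether its link type can be dereferenced. The sketch below encodes that reading; the function name `scopes` and the string codes for source and link type are assumptions taken from the table headers, not from the framework's code.

```python
def scopes(source, link):
    """Return the scenario numbers (1-4) in which a value is counted.

    source: "provider" or "enhancement"
    link:   "literal", "standard" or "non-std"
    """
    from_provider = source == "provider"
    dereferenceable = link != "non-std"
    result = []
    if from_provider and dereferenceable:
        result.append(1)  # provider's data and dereferenceable enrichments
    if from_provider:
        result.append(2)  # provider's data and all enrichments
    if dereferenceable:
        result.append(3)  # all data and dereferenceable enrichments
    result.append(4)      # all data and all enrichments
    return result


print(scopes("provider", "non-std"))      # the dc:type row
print(scopes("enhancement", "standard"))  # the enhancement dc:subject row
```

Under this reading the four rows of the table come out as ① ② ③ ④, ① ② ③ ④, ② ④ and ③ ④, matching the slide.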
Measuring metadata quality. Engineering - Measurement processing workflow
http://pkiraly.github.io/cheatsheet/
Measuring metadata quality. Engineering - Modules
metadata-qa-api
europeana-qa-api
europeana-qa-spark europeana-qa-rest
marc-qa-api* ddb-qa-api*
★ Metadata schema abstraction
★ Metrics definition
★ Iteration
★ Result data structure
★ ...
<dependencies>
  <dependency>
    <groupId>de.gwdg.metadataqa</groupId>
    <artifactId>metadata-qa-api</artifactId>
    <version>0.4</version>
  </dependency>
  <dependency>
    <groupId>de.gwdg.metadataqa</groupId>
    <artifactId>europeana-qa-api</artifactId>
    <version>0.4</version>
  </dependency>
  ...
</dependencies>
Measuring metadata quality. Engineering - Batch API
client → Metadata QA (request → response)
measurement
★ /batch/measuring/start → sessionID
★ /batch/[recordId] → csv (for each record)
★ /batch/measuring/stop → “success” | “failure”
analysis
★ /batch/analyzing/start → “success” | “failure”
★ /batch/analyzing/status → “in progress” | “ready” (periodically)
★ /batch/analyzing/retrieve → compressed package
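The call sequence on this slide could be driven by a client such as the one sketched here. Only the paths and response values come from the slide. The `send(path)` callable is a hypothetical transport that performs the HTTP request and returns the response body as text; the HTTP methods, the host, and how the session ID is passed back to the service are not stated in the slides and are left out.

```python
import time


def run_batch(send, record_ids, poll_seconds=1.0, max_polls=60):
    """Drive one measurement and analysis session through the batch endpoints."""
    session_id = send("/batch/measuring/start")
    # One request per record; each response is the record's measurement as CSV.
    rows = [send(f"/batch/{record_id}") for record_id in record_ids]
    if send("/batch/measuring/stop") != "success":
        raise RuntimeError("measuring could not be stopped")
    if send("/batch/analyzing/start") != "success":
        raise RuntimeError("analysis could not be started")
    # Poll periodically until the analysis reports "ready".
    for _ in range(max_polls):
        if send("/batch/analyzing/status") == "ready":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError("analysis did not finish in time")
    package = send("/batch/analyzing/retrieve")
    return session_id, rows, package
```

Passing the transport in as a parameter keeps the sequence testable without a running service and leaves the open protocol details to the caller.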
Measuring metadata quality. Engineering - Formal issue definition
How to transform human expert knowledge of metadata issues into machine-readable rules?
Measuring metadata quality. Engineering - Formal issue definition I. RDFUnit
pattern:
SELECT ?s WHERE {
  ?s %%P1%% ?v1 .
  ?s %%P2%% ?v2 .
  FILTER ( ?v1 %%OP%% ?v2 )
}
parameters:
P1 => dbo:birthDate
P2 => dbo:deathDate
OP => >
SPARQL:
SELECT ?s WHERE {
  ?s dbo:birthDate ?v1 .
  ?s dbo:deathDate ?v2 .
  FILTER ( ?v1 > ?v2 )
}
Kontokostas et al. (2014), Test-driven Evaluation of Linked Data Quality
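The step from pattern plus parameters to a concrete SPARQL query is plain placeholder substitution. The sketch below shows that idea with string replacement; it is an illustration of the mechanism and not RDFUnit's own code, and the name `instantiate` is made up for this example.

```python
PATTERN = """SELECT ?s WHERE {
  ?s %%P1%% ?v1 .
  ?s %%P2%% ?v2 .
  FILTER ( ?v1 %%OP%% ?v2 )
}"""


def instantiate(pattern, bindings):
    """Replace each %%NAME%% placeholder with its bound value."""
    query = pattern
    for name, value in bindings.items():
        query = query.replace(f"%%{name}%%", value)
    if "%%" in query:
        # A placeholder without a binding would produce an invalid query.
        raise ValueError("unbound placeholder left in query")
    return query


QUERY = instantiate(PATTERN, {"P1": "dbo:birthDate", "P2": "dbo:deathDate", "OP": ">"})
print(QUERY)
```

The resulting query selects every subject whose birth date is later than its death date, so one generic pattern can serve many property pairs.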
Measuring metadata quality. Engineering - Formal issue definition II. SHACL
shape:
<IssueShape> sh:property [
  sh:predicate ex:submittedBy ;
  sh:minLength 20
] .
RDF triples:
<issue1> ex:submittedBy <http://a.example/bob> .
<issue2> ex:submittedBy "Bob" .
result:
<IssueShape> <issue1> pass
<IssueShape> <issue2> fail ex:submittedBy expected to be >= 20 characters, 3 characters found.
SHACL Core Abstract Syntax and Semantics
W3C First Public Working Draft 25 August 2016
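The pass and fail results follow from comparing string lengths: the IRI `http://a.example/bob` is 20 characters long, the literal "Bob" only 3. The sketch below mimics that single constraint in plain Python, using the property the shape constrains. It is not a SHACL engine, and the function name and message format are assumptions modeled on the slide's output.

```python
def check_min_length(node, value, min_length, predicate="ex:submittedBy"):
    """Mimic sh:minLength: compare the length of the value's string form
    (the IRI string for an IRI, the lexical form for a literal)."""
    length = len(value)
    if length >= min_length:
        return f"{node} pass"
    return (f"{node} fail {predicate} expected to be >= {min_length} "
            f"characters, {length} characters found.")


print(check_min_length("<issue1>", "http://a.example/bob", 20))
print(check_min_length("<issue2>", "Bob", 20))
```

In practice the shape would be handed to a SHACL validator; the point here is only how one constraint maps to one measurable test.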
Measuring metadata quality. Cooperations and project proposals
★ Europeana Network’s Data Quality Committee
http://pro.europeana.eu/europeana-tech/data-quality-committee
★ Digital Library Federation Metadata Assessment Group
http://dlfmetadataassessment.github.io
★ Deutsche Digitale Bibliothek
https://www.deutsche-digitale-bibliothek.de
Measuring metadata quality. Community bibliography
zotero.org/groups/metadata_assessment
dlfmetadataassessment.github.io
Measuring metadata quality. Further steps
human analysis
★ Translate the results into documentation, recommendations
★ Communication with data providers
★ Human evaluation of metadata quality
★ Cooperation with other projects
technical
★ Incorporating into ingestion process
★ Shapes Constraint Language (SHACL) for defining patterns
★ Process usage statistics
★ Measuring changes of scores
★ Machine learning based classification & clustering
Measuring metadata quality. Links
★ Europeana Data Quality Committee // http://pro.europeana.eu/europeana-tech/data-quality-committee
★ site // http://144.76.218.178/europeana-qa/
★ source codes (GPL v3.0) // http://pkiraly.github.io/about/#source-codes
★ Europeana data (CC0) // http://hdl.handle.net/21.11101/0000-0001-781F-7
★ Library of Congress data (OA) // http://www.loc.gov/cds/products/marcDist.php
★ contact: peter.kiraly@gwdg.de, @kiru


Measuring metadata quality and the Europeana use case (4th ldq, 2017)

  • 1. Measuring Metadata Quality and the Europeana use case Péter Király peter.kiraly@gwdg.de 4th Linked Data Quality Workshop, Portorož, Slovenia 29th May, 2017
  • 2. Measuring metadata quality. Glossary 2 ★ Metadata here: cultural heritage metadata (descriptions of books etc.) ★ Europeana a metadata aggregator from 3500+ cultural heritage institutions http://europeana.eu ★ Big Data here: 10-100 million metadata records, 100 GB - 1.5 TB ★ EDM Europeana Data Model, Europeana’s metadata schema ★ MARC MAchine Readable Catalog, a library metadata standard
  • 3. Measuring metadata quality. Generic title and bad thumbnail; more examples in Report and Recommendations from the Task Force on Metadata Quality (2015)
  • 4. Measuring metadata quality. Multilinguality problem ★ Mona Lisa → 456 results ★ La Gioconda → 365 results ★ La Joconde → 71 results http://www.europeana.eu/portal/en/record/90402/RP_F_00_351.html
  • 5. Measuring metadata quality. Problems with title. Same title and description: title: "VOETBAL-EREDIVISIE-FEYENOORD - GO AHEAD 3-1", description: "VOETBAL-EREDIVISIE-FEYENOORD - GO AHEAD 3-1"; machine-readable ID in title: title: "NLD-820630-AMSTERDAM: Straatmuzikanten proberen geld te verdienen voor..."; leftover: title: "+++EMPTY+++"; more examples in Report and Recommendations from the Task Force on Metadata Quality (2015)
  • 6. Measuring metadata quality. Non-informative values. Bad (non-informative dc:title): “photograph, framed”, “group photograph”, “photograph”. Good (informative dc:title): “Photograph of Sir Dugald Clerk”, “Photograph of "Puffing Billy"”
  • 7. Measuring metadata quality. Copy & paste cataloging: from a template? More examples in Report and Recommendations from the Task Force on Metadata Quality (2015)
  • 8. Measuring metadata quality. The problem: there are “good” and “bad” metadata records, but we don’t have clear metrics like this: functional requirements: good / acceptable / bad
  • 9. Measuring metadata quality. Why is data quality important? “Fitness for purpose” (QA principle); purpose: to access content; no metadata → no access to data → no data usage. More explanation: Data on the Web Best Practices, W3C Working Draft, https://www.w3.org/TR/dwbp/
  • 10. Measuring metadata quality. Hypothesis: by measuring structural elements we can approximate metadata record quality ≃ metadata smell
  • 11. Measuring metadata quality. Purposes ★ improve the metadata ★ services: good data → reliable functions ★ better metadata schema & documentation ★ propagate “good practice”
  • 12. Measuring metadata quality. Proposal I. Europeana Data Quality Committee ★ Analysing/revising metadata schema ★ Functional requirement analysis ★ Problem catalog ★ Multilinguality
  • 13. Measuring metadata quality. Proposal II. “Metadata Quality Assurance Framework”, a generic tool for measuring metadata quality ★ adaptable to different metadata schemes ★ scalable (to Big Data) ★ understandable reports for data curators ★ open source
  • 14. Measuring metadata quality. Data processing workflow. Stages and outputs: ingest (json) → measure (csv) → statistical analysis (json, png) → web interface (html, svg). Tools: ★ OAI-PMH ★ Europeana API ★ Hadoop ★ NoSQL ★ Spark ★ Hadoop ★ Java ★ Apache Solr ★ Spark ★ R ★ PHP ★ D3.js ★ highchart.js ★ NoSQL
  • 15. Measuring metadata quality. What to measure? ★ Structural and semantic features: completeness, cardinality, uniqueness, length, dictionary entry, data type conformance, multilinguality (generic metrics) ★ Functional requirement analysis / Discovery scenarios: requirements of the most important functions ★ Problem catalog: known metadata problems
  • 16. Measuring metadata quality. Metadata requirements / User scenario: “As a user I want to be able to filter by whether a person is the subject of a book, or its author, engraver, printer etc.” Metadata analysis: description of relevant metadata elements and their rules. Measurement rules ★ the relevant field values should be resolvable URI ★ each URI should be associated with labels in multiple languages
  • 17. Measuring metadata quality. Metadata requirements / Supported functions. #1 Resource Discovery ★ Search: search for a resource corresponding to stated criteria (i.e., to search either a single entity or a set of entities using an attribute or relationship of the entity as the search criteria) ★ Identify: confirm that the entity described or located corresponds to the entity sought ★ Select: choose an entity that meets the user’s requirements ★ Obtain: access a resource either physically or electronically. #2 Resource Use ★ Restrict ★ Manage ★ Operate ★ Interpret. #3 Data Management ★ Identify ★ Process ★ Sort ★ Display. Source: Functional Analysis of the MARC 21 Bibliographic and Holdings Formats, http://www.loc.gov/marc/marc-functional-analysis/source/analysis.pdf
  • 18. Measuring metadata quality. Metadata requirements / element-function map: Europeana sub-dimensions; MARC Summary of Mapping to User Tasks
  • 19. Measuring metadata quality. The data aggregation workflow (in Europeana): Dublin Core, LIDO, EAD, MARC, EDM custom, ... → data transformations → Europeana Data Model (EDM)
  • 20. Measuring metadata quality. Measurement: overall view, collection view, record view; Completeness, Field cardinality, Uniqueness, Multilinguality, Language specification, Problem catalog etc.; links, measurements, aggregated statistics, metrics
  • 21. Measuring metadata quality. Measurement - Field frequency per collections: no record has alternative title; every record has alternative title; filters
  • 22. Measuring metadata quality. Measurement - Details of field cardinality: 128 subjects in one record; median is 0, mean is close to 1; link to interesting records
  • 23. Measuring metadata quality. Measurement - Multilinguality: @resource is a URI; @ = language notation in RDF; no language specification
  • 24. Measuring metadata quality. Measurement - Language frequency: has language specification; has no language specification
  • 25. Measuring metadata quality. Measurement - Encoding problems: same language, different encodings
  • 26. Measuring metadata quality. Measurement - Multilinguality metrics ★ Number of (distinct) languages in the metadata ★ Number of tagged literals ★ Tagged literals per language. Requirement: language annotations / tags!
  • 27. Measuring metadata quality. Measurement - Distinct Languages ★ Text without language annotation (dc.subject: Germany): 0 ★ Text with language annotation (dc.subject: Germany@en): 1 ★ Text with several language annotations (dc.subject: Germany@en, Deutschland@de): 2 ★ Link to (multilingual) vocabulary (http://www.geonames.org/2921044/federal-republic-of-germany): n
  • 28. Measuring metadata quality. Measurement - Record level. <#record> a ore:Proxy ; dc:subject "Ballet", "Opera" . (distinct languages: 0, tagged literals: 0) <#record> a ore:Proxy ; edm:europeanaProxy true ; dc:subject <http://data.europeana.eu/concept/base/264> , <http://data.europeana.eu/concept/base/247> . Dereferencing → <http://data.europeana.eu/concept/base/264> a skos:Concept ; skos:prefLabel "Ballett"@no, "बैले"@hi, "Ballett"@de, "Балет"@be, "Балет"@ru, "Balé"@pt, "Балет"@bg, "Baletas"@lt, "Balet"@hr, "Balets"@lv . <http://data.europeana.eu/concept/base/247> skos:prefLabel "Opera"@no, "ओपेरा (गीतिनाटक)"@hi, "Oper"@de, "Ooppera"@fi, "Опера"@be, "Опера"@ru, "Ópera"@pt, "Опера"@bg, "Opera"@lt . (distinct languages: 11, tagged literals: 19, literals per language: 1.7)
  • 29. Measuring metadata quality. Measurement - Contributing to Multilinguality. Data dc:creator dc:type dc:subject <http://dbpedia.org/aSubjectID> dc:subject Data from Provider dc:creator dereferenced Quantifiable Data added by Europeana “subject”@en <http://vocab.getty.edu/aPersonNumber> dc:subject “Subject” <http://udcdata.info/rdf/065280>
  • 30. Measuring metadata quality. Measurement - Multilingual saturation heatmap
  • 31. Measuring metadata quality. Measurement - Good example. Descriptive fields: dc:title "Die Mauer muß weg!"@de, "Die Mauer muß weg! (The Wall must go!)"@en; dc:description "Kommentiertes Fotorama mit Bildern von 1989-1990 in Berlin"@de, "Annotated images from 1989-1990 in Berlin"@en. Subject headings: Place/skos:prefLabel "Brandenburger Tor"@de, "Brandenburg Gate"@en; "Grenzübergang Potsdamer Platz"@de, "Postdamer Platz border crossing"@en; "Reichstag"@de, "Reichstag building"@en
  • 32. Measuring metadata quality. Problems - Linked data: depth of iteration
  • 33. Measuring metadata quality. Problems - Linked data: lost links
  • 34. Measuring metadata quality. Problems - Outliers: bulk of records are close to zero although 25% are between 0.05 and 1.25
  • 35. Measuring metadata quality. Problems - Outliers
  • 36. Measuring metadata quality. Problems - Outliers: zeros / lower outliers; high outliers; “normal” values
  • 37. Measuring metadata quality. Problems - Layers. :provider dc:subject "special relativity"@en ; dc:creator <http://vocab.getty.edu/ulan/500240971> ; dc:type <http://udcdata.info/001684> . :enhancement dc:subject <http://dbpedia.org/resource/Physics> . Dereferenceable vocabulary: <http://vocab.getty.edu/ulan/500240971> skos:prefLabel "Einstein, Albert"@de . Dereferenceable vocabulary: <http://dbpedia.org/resource/Physics> skos:prefLabel "Physics"@en . Non-dereferenceable vocabulary: <http://udcdata.info/001684> skos:prefLabel "Books in general"@en .
  • 38. Measuring metadata quality. Problems - Layers. Columns: source | field | link | value | layers. Rows: :provider | dc:subject | literal | "special relativity"@en | ① ② ③ ④; :provider | dc:creator | standard | "Einstein, Albert"@de | ① ② ③ ④; :provider | dc:type | non-std | "Books in general"@en | ② ④; :enhancement | dc:subject | standard | "Physics"@en | ③ ④. Layers: ① provider's data and dereferenceable enrichments ② provider's data and all enrichments ③ all data and dereferenceable enrichments ④ all data and all enrichments. Credit: Antoine Isaac
  • 39. Measuring metadata quality. Engineering - Measurement processing workflow: http://pkiraly.github.io/cheatsheet/
  • 40. Measuring metadata quality. Engineering - Modules: metadata-qa-api, europeana-qa-api, europeana-qa-spark, europeana-qa-rest, marc-qa-api*, ddb-qa-api* ★ Metadata schema abstraction ★ Metrics definition ★ Iteration ★ Result data structure ★ ... <dependencies> <dependency> <groupId>de.gwdg.metadataqa</groupId> <artifactId>metadata-qa-api</artifactId> <version>0.4</version> </dependency> <dependency> <groupId>de.gwdg.metadataqa</groupId> <artifactId>europeana-qa-api</artifactId> <version>0.4</version> </dependency> ... </dependencies>
  • 41. Measuring metadata quality. Engineering - Batch API (client → Metadata QA). Measurement: /batch/measuring/start → sessionID; /batch/[recordId] → csv (for each record); /batch/measuring/stop → “success” | “failure”. Analysis: /batch/analyzing/start → “success” | “failure”; /batch/analyzing/status → “in progress” | “ready” (periodically); /batch/analyzing/retrieve → compressed package
  • 42. Measuring metadata quality. Engineering - Formal issue definition: how to transform human expert knowledge of metadata issues into machine-readable rules?
  • 43. Measuring metadata quality. Engineering - Formal issue definition I. RDFUnit. Pattern: SELECT ?s WHERE { ?s %%P1%% ?v1 . ?s %%P2%% ?v2 . FILTER ( ?v1 %%OP%% ?v2 ) } Parameters: P1 => dbo:birthDate, P2 => dbo:deathDate, OP => > SPARQL: SELECT ?s WHERE { ?s dbo:birthDate ?v1 . ?s dbo:deathDate ?v2 . FILTER ( ?v1 > ?v2 ) } Source: Kontokostas et al. (2014), Test-driven Evaluation of Linked Data Quality
  • 44. Measuring metadata quality. Engineering - Formal issue definition II. SHACL. Shape: <IssueShape> sh:property [ sh:predicate ex:submittedBy ; sh:minLength 20 ] . RDF triples: <issue1> ex:submittedBy <http://a.example/bob> . <issue2> ex:submittedBy "Bob" . Result: <IssueShape> <issue1> pass; <IssueShape> <issue2> fail: ex:submittedOn expected to be >= 20 characters, 3 characters found. Source: SHACL Core Abstract Syntax and Semantics, W3C First Public Working Draft, 25 August 2016
  • 45. Measuring metadata quality. Cooperations and project proposals ★ Europeana Network’s Data Quality Committee http://pro.europeana.eu/europeana-tech/data-quality-committee ★ Digital Library Federation Metadata Assessment Group http://dlfmetadataassessment.github.io ★ Deutsche Digitale Bibliothek https://www.deutsche-digitale-bibliothek.de
  • 46. Measuring metadata quality. Community bibliography: zotero.org/groups/metadata_assessment, dlfmetadataassessment.github.io
  • 47. Measuring metadata quality. Further steps ★ Translate the results into documentation, recommendations ★ Communication with data providers ★ Human evaluation of metadata quality ★ Cooperation with other projects ★ Incorporating into ingestion process ★ Shape Constraint Language (SHACL) for defining patterns ★ Process usage statistics ★ Measuring changes of scores ★ Machine learning based classification & clustering human analysis technical
  • 48. Measuring metadata quality. Links ★ Europeana Data Quality Committee // http://pro.europeana.eu/europeana-tech/data-quality-committee ★ site // http://144.76.218.178/europeana-qa/ ★ source codes (GPL v3.0) // http://pkiraly.github.io/about/#source-codes ★ Europeana data (CC0) // http://hdl.handle.net/21.11101/0000-0001-781F-7 ★ Library of Congress data (OA) // http://www.loc.gov/cds/products/marcDist.php ★ contact: peter.kiraly@gwdg.de, @kiru
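
The multilinguality metrics named on slides 26-28 (distinct languages, tagged literals, literals per language) can be illustrated with a minimal Python sketch. The function name `multilinguality`, the regular expression, and the approach of counting language tags directly in Turtle-style label text are assumptions made for illustration; the slides do not show how the framework itself computes these values. The labels are the ones shown on slide 28.

```python
import re
from collections import Counter

# Matches a quoted literal followed by a language tag, e.g. "Ballett"@de.
# Hypothetical helper pattern; a real implementation would use an RDF parser.
TAGGED_LITERAL = re.compile(r'"[^"]*"@([A-Za-z]+(?:-[A-Za-z0-9]+)*)')


def multilinguality(labels_text: str) -> dict:
    """Count language-tagged literals and distinct language tags in a text."""
    tags = [tag.lower() for tag in TAGGED_LITERAL.findall(labels_text)]
    per_language = Counter(tags)
    tagged = len(tags)
    distinct = len(per_language)
    return {
        "distinct_languages": distinct,
        "tagged_literals": tagged,
        # Untagged input yields 0.0 instead of a division by zero.
        "literals_per_language": tagged / distinct if distinct else 0.0,
    }


# Literals without language tags (first record on slide 28).
PROVIDER_LABELS = '"Ballet", "Opera"'

# Labels obtained by dereferencing the two concept URIs (slide 28).
DEREFERENCED_LABELS = '''
"Ballett"@no, "बैले"@hi, "Ballett"@de, "Балет"@be, "Балет"@ru,
"Balé"@pt, "Балет"@bg, "Baletas"@lt, "Balet"@hr, "Balets"@lv,
"Opera"@no, "ओपेरा (गीतिनाटक)"@hi, "Oper"@de, "Ooppera"@fi,
"Опера"@be, "Опера"@ru, "Ópera"@pt, "Опера"@bg, "Opera"@lt
'''

if __name__ == "__main__":
    print(multilinguality(PROVIDER_LABELS))
    print(multilinguality(DEREFERENCED_LABELS))
```

On the slide 28 labels this gives 0 distinct languages and 0 tagged literals for the untagged record, and 11 distinct languages, 19 tagged literals, and about 1.7 literals per language after dereferencing, matching the figures on the slide.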