In cultural heritage institutions, metadata creation has a long history. The metadata standards themselves have evolved over time, several of them have regional and language variations, and most of them leave room for local, institution-specific practices and customs. Big digital libraries, such as Europeana, the German Digital Library, or the Digital Public Library of America, collect metadata records from different domains (libraries, museums, archives), originally created in different metadata standards, locations, times and languages. To build good and reliable data services upon those records, we should know more about the characteristics of the collection. Metadata quality assurance helps us find the weak points, and through collaboration with the creators of the data these institutions will be able to improve the collections, and thus the services.
The talk will walk through this process using the example of Europeana. We will discuss the state of the art of (meta)data quality assessment research, the findings of the functional requirement analyses of Europeana records, the data quality analysis framework we built, the general and specific metrics, the visualisation, and the scalability issues.
Measuring metadata quality and the Europeana use case (4th LDQ, 2017)
1. Measuring Metadata Quality
and the Europeana use case
Péter Király
peter.kiraly@gwdg.de
4th Linked Data Quality Workshop,
Portorož, Slovenia
29th May, 2017
2. Measuring metadata quality. Glossary
★ Metadata here: cultural heritage metadata (descriptions of books etc.)
★ Europeana: a metadata aggregator collecting from 3500+ cultural heritage
institutions, http://europeana.eu
★ Big Data here: 10-100 million metadata records, 100 GB - 1.5 TB
★ EDM: Europeana Data Model, Europeana's metadata schema
★ MARC: MAchine-Readable Cataloging, a library metadata standard
3. Measuring metadata quality. Generic title and bad thumbnail
more examples in Report and Recommendations from the Task Force on Metadata Quality (2015)
4. Measuring metadata quality. Multilinguality problem
★ Mona Lisa → 456 results
★ La Gioconda → 365 results
★ La Joconde → 71 results
http://www.europeana.eu/portal/en/record/90402/RP_F_00_351.html
5. Measuring metadata quality. Problems with title
Same title and description:
  title: "VOETBAL-EREDIVISIE-FEYENOORD - GO AHEAD 3-1"
  description: "VOETBAL-EREDIVISIE-FEYENOORD - GO AHEAD 3-1"
Machine-readable ID in title:
  title: "NLD-820630-AMSTERDAM: Straatmuzikanten proberen geld te verdienen voor..." (street musicians trying to earn money for...)
Leftover:
  title: "+++EMPTY+++"
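The three problems above lend themselves to simple machine-readable checks. A minimal sketch in Python, where the record layout, rule names and regular expressions are all illustrative, not the framework's actual rules:

```python
# Illustrative "problem catalog" checks; records are assumed to be plain
# dicts, and the ID pattern is a guess modelled on the slide's example.
import re

def check_record(record):
    """Return the list of known problems detected in a record."""
    problems = []
    title = record.get("title", "")
    description = record.get("description", "")
    # same title and description
    if title and title == description:
        problems.append("same title and description")
    # machine-readable identifier leaked into the title (hypothetical pattern)
    if re.match(r"^[A-Z]{3}-\d{6}-", title):
        problems.append("machine-readable ID in title")
    # leftover placeholder from the production workflow
    if re.fullmatch(r"\W*EMPTY\W*", title):
        problems.append("leftover placeholder title")
    return problems

print(check_record({"title": "+++EMPTY+++"}))
```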
6. Measuring metadata quality. Non-informative values
bad (non-informative) dc:title:
  "photograph, framed"
  "group photograph"
  "photograph"
good (informative) dc:title:
  "Photograph of Sir Dugald Clerk"
  "Photograph of "Puffing Billy""
7. Measuring metadata quality. Copy & paste cataloging
from a template?
8. Measuring metadata quality. The problem
there are "good" and "bad" metadata records,
but we don't have a clear metric like this:
[figure: a scale from bad through acceptable to good, plotted against functional requirements]
9. Measuring metadata quality. Why is data quality important?
"Fitness for purpose" (QA principle)
purpose: to access content
no metadata → no access to data → no data usage
more explanation:
Data on the Web Best Practices
W3C Working Draft, https://www.w3.org/TR/dwbp/
10. Measuring metadata quality. Hypothesis
by measuring structural elements we can approximate metadata record quality
≃ "metadata smell"
11. Measuring metadata quality. Purposes
★improve the metadata
★services: good data → reliable functions
★better metadata schema & documentation
★propagate “good practice”
12. Measuring metadata quality. Proposal I.
Europeana Data Quality Committee
★ Analysing/revising metadata schema
★ Functional requirement analysis
★ Problem catalog
★ Multilinguality
13. Measuring metadata quality. Proposal II.
“Metadata Quality Assurance Framework”
a generic tool for measuring metadata quality
★ adaptable to different metadata schemes
★ scalable (to Big Data)
★ understandable reports for data curators
★ open source
15. Measuring metadata quality. What to measure?
★ Structural and semantic features:
completeness, cardinality, uniqueness, length, dictionary entry, data type conformance, multilinguality (generic metrics)
★Functional requirement analysis / Discovery scenarios
Requirements of the most important functions
★Problem catalog
Known metadata problems
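The generic structural metrics listed above can be sketched in a few lines. This is an illustration only; the field list and record layout are assumptions, not the framework's actual schema handling:

```python
# Sketch of two generic metrics (completeness, field cardinality) on a
# single record, modelled as a dict mapping field names to value lists.
SCHEMA_FIELDS = ["dc:title", "dc:description", "dc:subject", "dc:creator"]

def completeness(record):
    """Share of schema fields that have at least one value."""
    filled = sum(1 for field in SCHEMA_FIELDS if record.get(field))
    return filled / len(SCHEMA_FIELDS)

def cardinality(record, field):
    """Number of instances of a field in the record."""
    return len(record.get(field, []))

record = {"dc:title": ["Photograph of Sir Dugald Clerk"],
          "dc:subject": ["portrait", "engineer"]}
print(completeness(record))               # 2 of 4 schema fields present
print(cardinality(record, "dc:subject"))
```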
16. Measuring metadata quality. Metadata requirements / User scenario
“As a user I want to be able to filter by whether a person is the
subject of a book, or its author, engraver, printer etc.”
Metadata analysis
Description of relevant metadata elements and their rules
Measurement rules
★ the relevant field values should be resolvable URIs
★ each URI should be associated with labels in multiple languages
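The two measurement rules can be approximated as follows. This sketch only checks the URI syntactically and looks labels up in a local dict; real measurement would dereference the URI over HTTP, which is omitted here, and the entity data is made up:

```python
# Sketch of the two measurement rules: the value should be a resolvable
# URI, and the linked entity should carry labels in multiple languages.
# `entities` stands in for dereferenced contextual entities.
from urllib.parse import urlparse

def is_uri(value):
    """Syntactic check only; actual resolvability needs an HTTP lookup."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

def label_languages(uri, entities):
    """Languages in which the linked entity has a label."""
    return sorted(entities.get(uri, {}).keys())

entities = {"http://www.geonames.org/2921044/":
            {"de": "Deutschland", "en": "Germany"}}
value = "http://www.geonames.org/2921044/"
print(is_uri(value), label_languages(value, entities))
```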
17. Measuring metadata quality. Metadata requirements / Supported functions
#1 Resource Discovery
★ Search: search for a resource corresponding to stated criteria (i.e., to search either a single entity or a set of entities using an attribute or relationship of the entity as the search criteria)
★ Identify: confirm that the entity described or located corresponds to the entity sought
★ Select: choose an entity that meets the user's requirements
★ Obtain: access a resource either physically or electronically
#2 Resource Use
★ Restrict
★ Manage
★ Operate
★ Interpret
#3 Data Management
★ Identify
★ Process
★ Sort
★ Display
Functional Analysis of the MARC 21 Bibliographic and Holdings Formats
http://www.loc.gov/marc/marc-functional-analysis/source/analysis.pdf
18. Measuring metadata quality. Metadata requirements / element—function map
[table: Europeana sub-dimensions mapped to the MARC "Summary of Mapping to User Tasks"]
19. Measuring metadata quality. The data aggregation workflow (in Europeana)
source formats (Dublin Core, LIDO, EAD, MARC, EDM, custom, ...) → data transformations → Europeana Data Model (EDM)
20. Measuring metadata quality. Measurement
metrics (Completeness, Field cardinality, Uniqueness, Multilinguality, Language specification, Problem catalog, etc.) → measurements → aggregated statistics, linked and presented as overall view, collection view and record view
21. Measuring metadata quality. Measurement - Field frequency per collections
[screenshot: field frequency per collection, with filters; collections range from "no record has alternative title" to "every record has alternative title"]
22. Measuring metadata quality. Measurement - Details of field cardinality
[screenshot: details of field cardinality; one record has 128 subjects, the median is 0 while the mean is close to 1; links lead to the interesting records]
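The skew behind "median is 0, mean is close to 1" is easy to reproduce: most records have no dc:subject at all, while a few outliers have very many. The counts below are made up for illustration:

```python
# A made-up distribution of dc:subject counts per record showing how a
# single outlier (128 subjects) pulls the mean up while the median stays 0.
from statistics import mean, median

subject_counts = [0] * 97 + [1, 2, 128]   # dc:subject instances per record
print(median(subject_counts))  # 0
print(mean(subject_counts))    # 1.31
```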
23. Measuring metadata quality. Measurement - Multilinguality
[screenshot: multilinguality view; @resource marks a URI value, @ is the language notation in RDF, and some values have no language specification]
24. Measuring metadata quality. Measurement - Language frequency
[chart: language frequency; values with language specification vs. values without language specification]
26. Measuring metadata quality. Measurement - Multilinguality metrics
★ Number of (distinct) languages in the metadata
★ Number of tagged literals
★ Tagged literals per language
Requirement: language annotations / tags!
27. Measuring metadata quality. Measurement - Distinct Languages
Score per field value:
0: text without language annotation (dc:subject: Germany)
1: text with language annotation (dc:subject: Germany@en)
2: text with several language annotations (dc:subject: Germany@en, Deutschland@de)
n: link to a (multilingual) vocabulary (http://www.geonames.org/2921044/federal-republic-of-germany)
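The 0/1/2/n scoring can be sketched as a small function. The signature is illustrative: untagged text is modelled as a None tag, and the vocabulary case is represented by the number of languages the vocabulary provides:

```python
# Sketch of the distinct-languages score for one field instance.
# `tags` holds language tags (None for untagged text); a link to a
# multilingual vocabulary contributes as many languages as it provides.
def language_score(tags=None, vocabulary_languages=0):
    """0: no tag, 1: one tag, 2+: several tags, n: vocabulary languages."""
    if vocabulary_languages:            # link to a multilingual vocabulary
        return vocabulary_languages     # n
    tags = [t for t in (tags or []) if t is not None]
    return len(tags)                    # 0, 1 or 2+

print(language_score(tags=[None]))              # 0
print(language_score(tags=["en"]))              # 1
print(language_score(tags=["en", "de"]))        # 2
print(language_score(vocabulary_languages=40))  # n
```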
31. Measuring metadata quality. Measurement - Good example
Descriptive fields (dc:title, dc:description):
"Kommentiertes Fotorama mit Bildern von 1989-1990 in Berlin"@de
"Annotated images from 1989-1990 in Berlin"@en
"Die Mauer muß weg!"@de
"Die Mauer muß weg! (The Wall must go!)"@en
Subject headings (Place/skos:prefLabel):
"Brandenburger Tor"@de / "Brandenburg Gate"@en
"Grenzübergang Potsdamer Platz"@de / "Postdamer Platz border crossing"@en
"Reichstag"@de / "Reichstag building"@en
38. Measuring metadata quality. Problems - Layers
source        field       link      value                     layers
:provider     dc:subject  literal   "special relativity"@en   ① ② ③ ④
              dc:creator  standard  "Einstein, Albert"@de     ① ② ③ ④
              dc:type     non-std   "Books in general"@en     ② ④
:enhancement  dc:subject  standard  "Physics"@en              ③ ④
① provider's data and dereferencable enrichments
② provider's data and all enrichments
③ all data and dereferencable enrichments
④ all data and all enrichments
credit: Antoine Isaac
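One reading of the table is a membership function per statement. This sketch is an interpretation, not Europeana's code: it treats literal values and standard-vocabulary links as countable in the "dereferencable" layers, as the table's first two rows suggest, and non-standard links as excluded from them:

```python
# Illustrative layer membership for one statement, following the table:
# layer 1 = provider's data and dereferencable enrichments,
# layer 2 = provider's data and all enrichments,
# layer 3 = all data and dereferencable enrichments,
# layer 4 = all data and all enrichments.
def layers(statement):
    """Return the set of layers (1-4) in which a statement is counted."""
    provider = statement["source"] == "provider"
    dereferencable = statement["link"] in ("literal", "standard")
    result = {4}                       # layer 4 counts everything
    if provider and dereferencable:
        result.add(1)
    if provider:
        result.add(2)
    if dereferencable:
        result.add(3)
    return result

print(layers({"source": "provider", "link": "non-std"}))
```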
41. Measuring metadata quality. Engineering - Batch API
measurement phase (client → Metadata QA):
/batch/measuring/start → sessionID
/batch/[recordId] → csv (for each record)
/batch/measuring/stop → "success" | "failure"
analysis phase:
/batch/analyzing/start → "success" | "failure"
/batch/analyzing/status (polled periodically) → "in progress" | "ready"
/batch/analyzing/retrieve → compressed package
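A client driving this sequence could look like the sketch below. The endpoint paths follow the slide; everything else (the injected transport callable, the return-value handling) is illustrative so the call order can be shown without real HTTP:

```python
# Sketch of the client side of the Batch API; `call(path)` stands in for
# one HTTP request and is injected so the sequence can be run offline.
def run_batch(record_ids, call):
    """Drive the measurement phase, then start the analysis."""
    session_id = call("/batch/measuring/start")      # returns a sessionID
    for record_id in record_ids:                     # one CSV line each
        call(f"/batch/{record_id}")
    status = call("/batch/measuring/stop")           # "success" | "failure"
    if status == "success":
        call("/batch/analyzing/start")
    return status

log = []
def fake_call(path):
    log.append(path)
    return "success" if path.endswith(("start", "stop")) else "csv"

run_batch(["rec1", "rec2"], fake_call)
print(log)
```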
42. Measuring metadata quality. Engineering - Formal issue definition
How do we transform human expert knowledge of metadata issues into machine-readable rules?
44. Measuring metadata quality. Engineering - Formal issue definition II. SHACL
shape:
<IssueShape> sh:property [
    sh:predicate ex:submittedBy ;
    sh:minLength 20
] .
RDF triples:
<issue1> ex:submittedBy <http://a.example/bob> .
<issue2> ex:submittedBy "Bob" .
result:
<issue1> → pass
<issue2> → fail: ex:submittedBy expected to be >= 20 characters, 3 characters found.
SHACL Core Abstract Syntax and Semantics
W3C First Public Working Draft 25 August 2016
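The same sh:minLength check can be emulated in a few lines of plain Python. This is an illustrative re-implementation of one constraint component, not a SHACL engine; triples are modelled as simple tuples:

```python
# Illustrative sh:minLength check over (subject, predicate, value) tuples;
# like SHACL, it compares the string form of both literals and IRIs.
def check_min_length(triples, predicate, min_length):
    """Validate every value of `predicate` against a minimum length."""
    report = {}
    for subject, pred, value in triples:
        if pred == predicate:
            if len(value) >= min_length:
                report[subject] = "pass"
            else:
                report[subject] = (f"fail: expected >= {min_length} "
                                   f"characters, {len(value)} found")
    return report

triples = [("issue1", "ex:submittedBy", "http://a.example/bob"),
           ("issue2", "ex:submittedBy", "Bob")]
print(check_min_length(triples, "ex:submittedBy", 20))
```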
45. Measuring metadata quality. Cooperations and project proposals
★Europeana Network’s Data Quality Committee
http://pro.europeana.eu/europeana-tech/data-quality-committee
★Digital Library Federation Metadata Assessment Group
http://dlfmetadataassessment.github.io
★Deutsche Digitale Bibliothek https://www.deutsche-digitale-bibliothek.de
46. Measuring metadata quality. Community bibliography
zotero.org/groups/metadata_assessment
dlfmetadataassessment.github.io
47. Measuring metadata quality. Further steps
human analysis:
★ translate the results into documentation and recommendations
★ communication with data providers
★ human evaluation of metadata quality
★ cooperation with other projects
technical:
★ incorporating into the ingestion process
★ Shapes Constraint Language (SHACL) for defining patterns
★ process usage statistics
★ measuring changes of scores
★ machine learning based classification & clustering
48. Measuring metadata quality. Links
★Europeana Data Quality Committee // http://pro.europeana.eu/europeana-tech/data-quality-committee
★site // http://144.76.218.178/europeana-qa/
★source code (GPL v3.0) // http://pkiraly.github.io/about/#source-codes
★Europeana data (CC0) // http://hdl.handle.net/21.11101/0000-0001-781F-7
★Library of Congress data (OA) // http://www.loc.gov/cds/products/marcDist.php
★contact: peter.kiraly@gwdg.de, @kiru