Metadata Quality Assurance Framework at QQML2016 - short
1. Metadata Quality Assurance Framework
Péter Király <peter.kiraly@gwdg.de>
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen, Germany
QQML2016
8th International Conference on Qualitative and Quantitative Methods in Libraries
2016-05-24, London
3. Metadata Quality Assurance Framework
Typical issues – non-informative field
Title is not informative
non-informative:
„photograph, framed”,
„group photograph”,
„photograph”
vs
informative:
„Photograph of Sir Dugald Clerk”,
„Photograph of "Puffing Billy"”
4. Metadata Quality Assurance Framework
Typical issues – Field overuse
What is the meaning of the field? (overuse)
TextGrid OAI-PMH response
5. Metadata Quality Assurance Framework
Why is data quality important?
„Fitness for purpose” (QA principle)
no metadata → no access to data → no data usage
more explanation:
Data on the Web Best Practices
W3C Working Draft 19 May 2016
https://www.w3.org/TR/dwbp/
6. Metadata Quality Assurance Framework
Europeana Data Quality Committee
Online collaboration
Use case documents
Problem catalog
Tickets
Discussion forum
#EuropeanaDataQuality
Bi-weekly teleconference
Bi-yearly face-to-face meeting
Topics
Usage scenarios
Metadata profiles
Schema modification
Measuring
Event model
Proposals for data providers
7. Metadata Quality Assurance Framework
What is it good for?
improve the metadata
improve services: good data → functions
improve metadata schema & documentation
propagate „good practice”
Domains:
cultural heritage sector
research data management and archiving
8. Metadata Quality Assurance Framework
Research hypothesis
hypothesis
by measuring structural elements we can predict metadata record quality
9. Metadata Quality Assurance Framework
Research hypothesis
proposed solution
an open source measuring and reporting tool
Metadata Quality Assurance Framework
11. Metadata Quality Assurance Framework
Measurements
Schema-independent structural features
existence, cardinality, uniqueness, length,
dictionary entry, data type conformance
Use case scenarios („fit for purpose”)
Requirements of the most important functions
Problem catalog
Known metadata problems
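A minimal sketch of how a few of the schema-independent features (existence, cardinality, length) could be measured for one field. The record layout (a dict mapping field names to lists of string values) and the function name `structural_features` are assumptions made for illustration, not the framework's actual data model.

```python
def structural_features(record, field):
    """Measure schema-independent features of one field in a record.

    `record` maps field names to lists of string values (assumed layout).
    """
    values = record.get(field, [])
    return {
        "existence": len(values) > 0,        # is the field present at all?
        "cardinality": len(values),          # how many instances does it have?
        "total_length": sum(len(v) for v in values),  # characters over all instances
    }


# Hypothetical record used only for illustration.
record = {
    "dc:title": ["Photograph of Sir Dugald Clerk"],
    "dc:subject": ["engineering", "portraits"],
}

title = structural_features(record, "dc:title")
missing = structural_features(record, "dc:description")
print(title)
print(missing)
```

Because the function only looks at whether values exist and how long they are, the same code applies to any schema; only the field names change.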
12. Metadata Quality Assurance Framework
Discovery scenarios and their metadata requirements
Europeana’s most important functions
1. Basic retrieval with high precision and recall
2. Cross-language recall
3. Entity-based facets
4. Date-based facets
5. Improved language facets
6. Browse by subjects and resource types
7. Browse by agents
8. Browse/Search by Event
9. Entity-based knowledge cards and pages
10. Categorised similar items
11. Spatial search, browse, and map display
12. Entity-based autocompletion
13. Diversification of results
14. Hierarchical search and facets
Credit: the document was initiated by Timothy Hill, Europeana’s search engineer
13. Metadata Quality Assurance Framework
Discovery scenarios and their metadata requirements – Entity-based facets
Scenario
As a user I want to be able to filter by whether a person is the
subject of a book, or its author, engraver, printer etc.
Metadata analysis
In each case the underlying requirement is that the relevant EDM
fields for objects be populated by identifying URIs rather than free
text. These URIs need to be related, at a minimum, to a label for
each of the supported languages.
Measurement rules
The relevant field values should be resolvable URIs
Each URI should have labels in multiple languages
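The two measurement rules could be sketched roughly as follows. This is an illustration under stated assumptions: the URI check is syntactic only (it does not dereference anything over the network), and `labels_by_uri` is a hypothetical in-memory lookup standing in for whatever source would provide the multilingual labels.

```python
from urllib.parse import urlparse


def looks_like_uri(value):
    """True if the value parses as an absolute http(s) URI (syntax check only)."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def check_entity_field(values, labels_by_uri, min_languages=2):
    """Apply the two measurement rules to the values of one field.

    `labels_by_uri` is a hypothetical lookup: URI -> {language code: label}.
    """
    results = []
    for value in values:
        is_uri = looks_like_uri(value)
        languages = labels_by_uri.get(value, {}) if is_uri else {}
        results.append({
            "value": value,
            "is_uri": is_uri,
            "multilingual": len(languages) >= min_languages,
        })
    return results


# Hypothetical label store and field values.
labels = {
    "http://example.org/agent/1": {"en": "Dugald Clerk", "de": "Dugald Clerk"},
}
report = check_entity_field(
    ["http://example.org/agent/1", "Clerk, Dugald"], labels
)
for row in report:
    print(row)
```

A free-text value such as "Clerk, Dugald" fails both rules, which is exactly the case the scenario wants to surface.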
14. Metadata Quality Assurance Framework
Problem catalog
Catalog of known metadata problems in Europeana
Title contents same as description contents
Systematic use of the same title
Bad string: "empty" (and variants)
Shelfmarks and other identifiers in fields
Creator not an agent name
Absurd geographical location
Subject field used as description field
Unicode U+FFFD (�)
Very short description field
...
Credit: the document was initiated by Timothy Hill, Europeana’s search engineer
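Several catalog entries lend themselves to simple automated checks. The sketch below covers four of them; the field names `title` and `description`, the length threshold of 20 characters, and the regular expression for "empty" variants are all assumptions chosen for illustration.

```python
import re

# Matches "empty", "Empty", "(empty)" and similar variants (assumed pattern).
EMPTY_PATTERN = re.compile(r"^\s*\(?\s*empty\s*\)?\s*$", re.IGNORECASE)


def detect_problems(record, min_description_length=20):
    """Return the names of catalog problems found in one record."""
    title = record.get("title", "")
    description = record.get("description", "")
    problems = []
    if title and title == description:
        problems.append("title same as description")
    if any(EMPTY_PATTERN.match(v) for v in (title, description)):
        problems.append('bad string "empty"')
    if "\ufffd" in title or "\ufffd" in description:
        problems.append("contains U+FFFD")
    if description and len(description) < min_description_length:
        problems.append("very short description")
    return problems


# Hypothetical records.
duplicate = {"title": "photograph", "description": "photograph"}
placeholder = {
    "title": "Empty",
    "description": "A framed portrait photograph of an engineer.",
}
print(detect_problems(duplicate))
print(detect_problems(placeholder))
```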
16. Metadata Quality Assurance Framework
Problem catalog – proposed basis of implementation
Shapes Constraint Language (SHACL)
https://www.w3.org/TR/shacl/
A language for describing and constraining the contents of RDF
graphs. It provides a high-level vocabulary to identify predicates and
their associated cardinalities, datatypes and other constraints.
sh:equals, sh:notEquals
sh:hasValue
sh:in
sh:lessThan, sh:lessThanOrEquals
sh:minCount, sh:maxCount
sh:minLength, sh:maxLength
sh:pattern
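To make the meaning of some of these constraint components concrete, here is a plain-Python analogue of `sh:minCount`, `sh:maxCount`, `sh:minLength`, `sh:maxLength` and `sh:pattern`. It is not a SHACL engine and does not process RDF; the dict-based "shape" and the `validate` function are hypothetical helpers that only mirror the constraint names.

```python
import re


def validate(values, shape):
    """Check a list of string values against a dict of constraints.

    The keys mirror SHACL names (minCount, maxCount, minLength,
    maxLength, pattern), but this is only an illustrative analogue.
    """
    violations = []
    if len(values) < shape.get("minCount", 0):
        violations.append("minCount")
    if "maxCount" in shape and len(values) > shape["maxCount"]:
        violations.append("maxCount")
    for value in values:
        if len(value) < shape.get("minLength", 0):
            violations.append("minLength")
        if "maxLength" in shape and len(value) > shape["maxLength"]:
            violations.append("maxLength")
        # The pattern may match anywhere in the value, hence re.search.
        if "pattern" in shape and not re.search(shape["pattern"], value):
            violations.append("pattern")
    return violations


# Hypothetical shape: exactly one title, at least 5 characters, not blank.
title_shape = {"minCount": 1, "maxCount": 1, "minLength": 5, "pattern": r"\S"}

print(validate([], title_shape))
print(validate(["abc"], title_shape))
print(validate(["Photograph of Sir Dugald Clerk"], title_shape))
```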
25. Metadata Quality Assurance Framework
Language frequency / barchart
same language, different encodings
26. Metadata Quality Assurance Framework
Language frequency / Treemap with resources
has no language specification
has language specification
is a URI
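The three categories shown in the treemap, and the per-language counts behind the bar chart, could be tallied along these lines. The value layout (dicts with optional `lang` and `resource` keys) is an assumption made for the example, as are the sample values.

```python
from collections import Counter


def classify(value):
    """Classify one field instance, given as a dict (assumed layout)."""
    if "resource" in value:
        return "is a URI"
    if value.get("lang"):
        return "has language specification"
    return "has no language specification"


def language_frequency(values):
    """Count instances per category and per language code."""
    categories = Counter(classify(v) for v in values)
    languages = Counter(
        v["lang"] for v in values if v.get("lang") and "resource" not in v
    )
    return categories, languages


# Hypothetical field instances; "en" and "eng" encode the same language.
values = [
    {"text": "Photograph", "lang": "en"},
    {"text": "Fotografie", "lang": "de"},
    {"text": "photo", "lang": "eng"},
    {"text": "untitled"},
    {"resource": "http://example.org/concept/1"},
]
categories, languages = language_frequency(values)
print(categories)
print(languages)
```

Counting the raw codes without normalising them is deliberate here: it is what makes cases such as "en" versus "eng" visible as separate bars.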
28. Metadata Quality Assurance Framework
Entropy – term uniqueness / main
1 means a unique term
0.0000x means a very frequent term
These are cumulative numbers
entropy_cumulative = term_1 + ... + term_n
29. Metadata Quality Assurance Framework
Entropy – term uniqueness / collection
max is exceptional (=1425 * mean)
unique records
less unique or non-unique records
30. Metadata Quality Assurance Framework
Entropy – term uniqueness / refining the picture
the bulk of the records is close to zero, although 25% are between 0.05 and 1.25
31. Metadata Quality Assurance Framework
Entropy – term uniqueness / terms
explanation of uniqueness score
TF-IDF values come from Apache Solr
term frequency: 1
document freq.: 2
uniqueness score: 0.5
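The numbers on this slide are consistent with dividing term frequency by document frequency (1 / 2 = 0.5). Treating that ratio as the per-term score is an assumption drawn from this single example, not a confirmed description of how the tool or Apache Solr computes it. Under that assumption, a field's cumulative score is the sum of its term scores:

```python
def term_uniqueness(term_frequency, document_frequency):
    """Per-term score, assumed here to be tf / df."""
    return term_frequency / document_frequency


def cumulative_score(terms):
    """Sum the per-term scores; `terms` is a list of (tf, df) pairs."""
    return sum(term_uniqueness(tf, df) for tf, df in terms)


# The slide's example: the term occurs once here and in 2 documents overall.
print(term_uniqueness(1, 2))

# Hypothetical field with three terms of differing rarity.
print(cumulative_score([(1, 2), (1, 1), (1, 4)]))
```

With this ratio, a term that appears in only one document scores 1, and a term that appears in very many documents scores close to 0.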
33. Metadata Quality Assurance Framework
Problem catalog – same title and description
there is one title and one description, and they are the same
... and we have 9 such records
35. Metadata Quality Assurance Framework
Completeness sub-dimensions
Are the sub-dimensions (field groups supporting specific functionalities) complete?
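One way to read "completeness of a sub-dimension" is the share of a field group's fields that are populated in a record. The sketch below follows that reading; the group names and the fields assigned to them are hypothetical and are not taken from the framework.

```python
# Hypothetical field groups, each supporting one functionality.
FIELD_GROUPS = {
    "searchability": ["dc:title", "dc:description", "dc:subject"],
    "multilinguality": ["dc:title", "dc:language"],
}


def sub_dimension_completeness(record, field_groups=FIELD_GROUPS):
    """Return, per field group, the fraction of its fields that have a value."""
    scores = {}
    for name, fields in field_groups.items():
        # An absent field and a field with an empty list both count as missing.
        present = sum(1 for f in fields if record.get(f))
        scores[name] = present / len(fields)
    return scores


# Hypothetical record: no description, empty language field.
record = {
    "dc:title": ["Photograph"],
    "dc:subject": ["portraits"],
    "dc:language": [],
}
scores = sub_dimension_completeness(record)
print(scores)
```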
38. Metadata Quality Assurance Framework
Further steps
Incorporating into Europeana’s ingestion tool
Process usage statistics (logs, Google Analytics)
Human evaluation of metadata quality
Measuring timeliness (changes of scores over time)
Machine learning based classification & clustering
Incorporating into research data management tool
Cooperation with other projects
39. Metadata Quality Assurance Framework
Architectural overview
[Diagram] Processing components: OAI-PMH client (PHP); Apache Spark (Java); analysis with Spark (Scala); analysis with R; web interface (PHP, d3.js)
Storage and exchange: Hadoop File System, Apache Solr, Apache Cassandra, JSON files, CSV files, image files
Legend: recent workflow, planned workflow
40. Metadata Quality Assurance Framework
Follow me
Europeana Data Quality Committee
http://pro.europeana.eu/europeana-tech/data-quality-committee
research plan and blog http://pkiraly.github.io
site http://144.76.218.178/europeana-qa/
source code
https://github.com/pkiraly/europeana-qa-spark
https://github.com/pkiraly/europeana-qa-r
@kiru, https://www.linkedin.com/in/peterkiraly