Measuring Metadata Quality (doctoral defense 2019)


  1. Measuring Metadata Quality. Péter Király, Comparative Studies (General and Comparative Literature and Cultural Studies), Georg-August-Universität Göttingen, 2019-06-24. Slides: http://bit.ly/qa-defense
  2. metadata: metadata ★ describes ★ explains ★ locates ★ represents something else (here: cultural heritage objects)
  3. the problem: https://twitter.com/fxru/status/1052838758066868224
  4. dates (MoMA collection): Harald Klinke (LMU München), https://twitter.com/HxxxKxxx/status/1066805548866289664
  5. title – thumbnail
  6. multilinguality ★ Mona Lisa → 456 results ★ La Gioconda → 365 results ★ La Joconde → 71 results
  7. consequence: no metadata or bad metadata → no access to data → no data usage (diagram labels: information, metadata, data). Image: Anonyme: Le Liseur © 2008 Hulton-Deutsch Collection / Corbis
  8. laws of library science (S. R. Ranganathan, 1931): 1. Books are for use. 2. Every person his or her book. 3. Every book its reader. 4. Save the time of the reader. 5. Library is a growing organism. Image: Anonyme: Le Liseur © 2008 Hulton-Deutsch Collection / Corbis
  9. objective: there are “good” and “bad” metadata records ★ functional requirements (metrics) ★ scale: good, acceptable, bad
  10. metadata quality metrics in literature: completeness, accuracy, consistency, ..., correctness, objectiveness, appropriateness. Sources: Bruce and Hillmann (2004); Ochoa and Duval (2009); Palavitsinis (2014); Zaveri et al. (2015); https://www.zotero.org/groups/488224/metadata_assessment
  11. complication ★ lack of details ★ no shared implementation ★ not flexible (collection specificity) ★ not scalable
  12. questions ★ Q1: What are the relevant quality dimensions in two different cultural heritage data sources? ★ Q2: How could it be implemented in a flexible way? ★ Q3: How could it be implemented in a scalable way? ★ Q4: How could Big Data analysis be conducted with limited computational resources?
  13. structure is algorithmically measurable, content is not. Structure: ★ number of fields ★ uniqueness of values ★ language annotation. Content: ★ is it really the Mona Lisa? ★ is it about Lower Saxony? ★ is it in Polish?
  14. hypothesis: by measuring structural elements we can approximate metadata quality (≃ metadata smell)
  15. workflow (LAM institution ↔ quality assessment tool): 1. ingest 2. measure records 3. aggregate 4. report 5. evaluate with experts → improve records
  16. Measuring Europeana
  17. organisational approach: Europeana Data Quality Committee ★ analysing/revising metadata schema ★ functional requirement analysis ★ problem catalog ★ multilinguality
  18. technical approach: “Metadata Quality Assessment Framework” ★ adaptable to different metadata schemas ★ scalable (to Big Data) ★ generates understandable reports for data curators ★ open source
  19. what to measure? ★ structural and semantic features (generic metrics): completeness, cardinality, uniqueness, length, dictionary entry, data type conformance, multilinguality ★ functional requirement analysis ★ problem catalog
  20. [slide with no extracted text]
  21. multilinguality. Before dereferencing (plain literals):
          <#record> a ore:Proxy ;
              dc:subject "Ballet", "Opera" .
      After dereferencing (Europeana proxy):
          <#record> a ore:Proxy ;
              edm:europeanaProxy true ;
              dc:subject <http://data.europeana.eu/concept/base/264> ,
                         <http://data.europeana.eu/concept/base/247> .
          <http://data.europeana.eu/concept/base/264> a skos:Concept ;
              skos:prefLabel "Ballett"@no, "बैले"@hi, "Ballett"@de, "Балет"@be, "Балет"@ru ,
                             "Balé"@pt, "Балет"@bg, "Baletas"@lt, "Balet"@hr, "Balets"@lv .
          <http://data.europeana.eu/concept/base/247>
              skos:prefLabel "Opera"@no, "ओपेरा (गीतिनाटक)"@hi, "Oper"@de, "Ooppera"@fi ,
                             "Опера"@be, "Опера"@ru, "Ópera"@pt, "Опера"@bg, "Opera"@lt .
      Metrics (before → after dereferencing): distinct languages 0 → 11; tagged literals 0 → 19; literals per language 1.7 after dereferencing
  22. [chart; only the values 19%, 58%, 63%, 13.3%, 23.7% were extracted]
  23. Measuring library catalogues. Image: Card catalogue at Gent University Library, photo: Pieter Morlion, 2010, CC-BY 4.0, https://commons.wikimedia.org/wiki/File:Boekentoren_2010PM_1179_21H9015.JPG
  24. a (pretty printed) example
          LDR 01136cnm a2200253ui 4500
          001 002032820
          005 20150224114135.0
          008 031117s2003 gw 000 0 ger d
          020    $a3805909810
          100 1  $avon Staudinger, Julius,$d1836-1902$0(viaf)14846766
          245 10 $aJ. von Staudingers Kommentar zum ... /$cJ. von Staudinger.
          250    $aNeubearb. 2003$bvon Jörn Eckert
          260    $aBerlin :$bSellier-de Gruyter,$c2003.
          300    $a534 p. ;.
          500    $aCiteertitel: BGB.
          500    $aBandtitel: Staudinger BGB.
          700 1  $aEckert, Jörn
          852 4  $xRE$bRE55$cRBIB$jRBIB.BUR 011 DE 021$p000000800147
  25. semantic elements
          element             MARC 21   versions   total
          control fields            7                  7
          control subfields       211                211
          data fields             215         68     283
          indicators              175          8     183
          subfields              2259        344    2603
          total                                     3287
      Diagram labels: Java classes, qa-metadata-marc.jar, Avram JSON, data model, export, machine readable standard
  26. proportion of records with issues (core = issues in the documented elements)
          library     all    core
          bay       100.0    18.8
          bzb       100.0    76.1
          cer         2.8     2.8
          col        90.4    66.0
          dnb        13.9     0.2
          gen        40.8    27.3
          har       100.0    97.3
          loc        30.5    29.3
          mic        80.8    67.5
          nfi        62.1    58.1
          ris        99.7    57.1
          sfp        82.7    60.4
          sta        92.7    92.5
          szt        30.8    30.6
          tib       100.0   100.0
          tor       100.0    74.2
      Callouts on the slide: strange, almost error-less, surprising
  27. issue types (callout on the slide: most frequent issues)
      record level ★ ambiguous linkage ★ invalid linkage ★ type error
      control field ★ invalid code ★ invalid value
      data field ★ missing reference subfield (880$6) ★ non-repeatable field ★ undefined field
      indicator ★ invalid value ★ non-empty value ★ obsolete value
      subfield ★ classification ★ invalid ISBN ★ invalid ISSN ★ invalid length ★ invalid value ★ repetition ★ undefined subfield ★ non well-formatted value
  28. completeness by field groups
  29. conclusions ★ Q1 quality dimensions: completeness, multilinguality, issue detection ★ Q2 flexibility: schema abstraction, decoupling measurements ★ Q3 scalability: using frameworks such as Apache Spark ★ Q4 limited resources: measuring performance, optimizing parameters
  30. publications
      ★ J. Stiller, P. Király (2017) Multilinguality of Metadata: Measuring the Multilingual Degree of Europeana’s Metadata. In Proc. of the 15th Intl. Symp. of Information Sci. 164–176.
      ★ P. Király (2017) Towards an extensible measurement of metadata quality. In Second International Conference on Digital Access to Textual Cultural Heritage. 111–115. 10.1145/3078081.3078109
      ★ P. Király (2017) Measuring completeness as metadata quality metric in Europeana. In Digital Humanities 2017 Conference Abstracts. 291–293.
      ★ V. Charles, J. Stiller, P. Király, W. Bailer, N. Freire (2017) Evaluating Data Quality in Europeana: Metrics for Multilinguality. In Joint Proceedings of TDDL 2017, MDQual 2017 and Futurity 2017.
      ★ P. Király (2018) Adat a könyvtárban. In Hagyomány és újítás a 21. századi könyvtárban. 49–74.
      ★ P. Király, M. Büchler (2018) Measuring completeness as metadata quality metric in Europeana. In 2018 IEEE International Conference on Big Data. 2711–2720. 10.1109/BigData.2018.8622487
      ★ P. Király, J. Stiller, V. Charles, W. Bailer, N. Freire (2019) Evaluating Data Quality in Europeana: Metrics for Multilinguality. In Metadata and Semantic Research 2018. 199–211. 10.1007/978-3-030-14401-2_19
  31. conferences
      #1 International Symposium on Information Science 2017, Berlin (with Juliane Stiller)
      #2 #dariahTeach 2017, Lausanne (poster)
      #3 SI & IT Workshop 2017, Göttingen (with Juliane Stiller)
      #4 Linked Data Quality workshop (@ ESWC 2017), Portorož (invited keynote speech)
      #5 DATeCH 2017, Göttingen
      #6 ELAG 2017, Athens (with Valentine Charles)
      #7 Digital Humanities 2017, Montréal
      #8 Linked Data Quality Workshop (@ Semantics 2017), Amsterdam (organizer and presenter)
      #9 (Meta)-Data Quality Workshop (@ TPDL 2017), Thessaloniki (presented by Juliane Stiller)
      #10 ADOCHS meeting 2017, Brussels (invited speech)
      #11 LDCX 2018, Stanford University
      #12 ELAG 2018, Prague (workshop together with Anette Strauch, Patrick Hochstenbach, Mark Phillips)
      #13 12th International Conference on Metadata and Semantics Research 2018, Limassol (with Juliane Stiller)
      #14 Open Research Knowledge Graph workshop 2018, Hannover
      #15 Research Infrastructure on Religious Studies Workshop on FAIR Research Data Management 2018, Mainz (invited speech)
      #16 Computational Archival Science workshop (@ IEEE Big Data 2018), Seattle
      #17 DATeCH 2019, Brussels
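
The multilinguality figures on slide 21 (tagged literals, distinct languages, literals per language) can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `multilinguality` and the (text, language) pair representation are assumptions made for illustration, not the API of the Metadata Quality Assessment Framework, and the labels are transcribed by hand from the slide.

```python
from collections import Counter

# skos:prefLabel literals transcribed from slide 21 as (text, language tag) pairs.
BALLET = [("Ballett", "no"), ("बैले", "hi"), ("Ballett", "de"), ("Балет", "be"),
          ("Балет", "ru"), ("Balé", "pt"), ("Балет", "bg"), ("Baletas", "lt"),
          ("Balet", "hr"), ("Balets", "lv")]
OPERA = [("Opera", "no"), ("ओपेरा (गीतिनाटक)", "hi"), ("Oper", "de"), ("Ooppera", "fi"),
         ("Опера", "be"), ("Опера", "ru"), ("Ópera", "pt"), ("Опера", "bg"),
         ("Opera", "lt")]


def multilinguality(literals):
    """Return (tagged literals, distinct languages, literals per language).

    `literals` is an iterable of (text, language) pairs; language is None
    for a plain literal without a language tag. (Hypothetical helper.)
    """
    per_language = Counter(lang for _, lang in literals if lang)
    tagged = sum(per_language.values())
    distinct = len(per_language)
    ratio = tagged / distinct if distinct else 0.0
    return tagged, distinct, ratio


# Before dereferencing: dc:subject holds two untagged strings.
before = multilinguality([("Ballet", None), ("Opera", None)])
# After dereferencing: the labels of both concepts count for the record.
after = multilinguality(BALLET + OPERA)

print(before)                                  # (0, 0, 0.0)
print(after[0], after[1], round(after[2], 1))  # 19 11 1.7
```

The result matches the numbers shown on the slide: 0 and 0 before dereferencing, 19 tagged literals in 11 distinct languages afterwards, or about 1.7 literals per language.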

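One of the subfield-level issue types on slide 27, "invalid ISBN", can be illustrated with the 020 field of the example record on slide 24. The sketch below assumes a simplified input (a string of `$`-prefixed subfields, ISBN-10 only, no qualifiers such as "(pbk.)") and uses hypothetical function names; it is not the implementation of the tool presented in the slides.

```python
import re


def is_valid_isbn10(value: str) -> bool:
    """Check an ISBN-10: weights 10..1, weighted sum must be divisible by 11."""
    digits = value.replace("-", "").replace(" ", "").upper()
    if not re.fullmatch(r"[0-9]{9}[0-9X]", digits):
        return False
    total = 0
    for position, char in enumerate(digits):
        number = 10 if char == "X" else int(char)
        total += (10 - position) * number
    return total % 11 == 0


def find_isbn_issues(field_020: str) -> list:
    """Return an 'invalid ISBN' message for each $a subfield that fails the check.

    Assumes the simplified '$a...$b...' notation used on slide 24.
    """
    issues = []
    for subfield in field_020.split("$")[1:]:
        if not subfield:
            continue  # skip empty fragments such as '$$'
        code, content = subfield[0], subfield[1:].strip()
        if code == "a" and not is_valid_isbn10(content):
            issues.append(f"invalid ISBN: {content}")
    return issues


print(find_isbn_issues("$a3805909810"))  # [] (the ISBN from slide 24 passes)
print(find_isbn_issues("$a3805909811"))  # ['invalid ISBN: 3805909811']
```

A production validator would also need to handle ISBN-13 and the qualifying text that often follows the number in 020$a; both are left out here to keep the check digit logic visible.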