Empirical evaluation of library catalogues (SWIB 2019)

  1. Empirical evaluation of library catalogues. SWIB 2019, Hamburg, 2019-11-27. Péter Király, Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG). Image: card catalog at Gent University Library, photo: Pieter Morlion, 2010, CC-BY 4.0, https://commons.wikimedia.org/wiki/File:Boekentoren_2010PM_1179_21H9015.JPG http://bit.ly/qa-swib2019
  2. about MARC
     ❏ MAchine Readable Cataloging
     ❏ a format and semantic specification
     ❏ invented in the early 60s by H. Avram
     ❏ “MARC must die”*, “Stockholm syndrome of MARC”** vs. MARC is still with us
     ❏ know your data
     ❏ fix a) MARC before moving to X, or b) X after moving from MARC
     ❏ audience: those who would like to work with or understand data for different purposes
     * Roy Tennant http://lj.libraryjournal.com/2002/10/ljarchives/marc-must-die/
     ** Niklas Lindström at ELAG 2019 https://twitter.com/cm_harlow/status/1126068414928293888
     Image: Henriette D. Avram (smithsonianmag.com)
  3. workflow
     1. ingest
     2. measure records
     3. aggregate
     4. report
     5. evaluate with experts
     Diagram labels: catalogue, quality assessment tool, improve records
  4. binding semantics
     MarcRecord: control field 1 (position 1 to position N), data field 1 (subfield 1 to subfield N); each element holds a value
     definitions: control field 1 (position 1 to position N), data field 1 (subfield 1 to subfield N); each element has a definition
     A definition covers:
     ● name
     ● URL
     ● acceptable codes and their meaning
     ● value constraints
     ● indexing rules
     ● FRBR functions
     ● historical codes
     ● dictionaries
     ● BIBFRAME name
     ● versions
     ● other rules
     Avram JSON: data model export, machine readable format of MARC standard
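The binding of definitions to record elements on slide 4 can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the tool's implementation: the `SCHEMA` dictionary is only loosely modelled on an Avram-style structure, and the names `SCHEMA`, `find_issues` and the sample record are hypothetical.

```python
# Hypothetical, heavily simplified schema: tag -> definition.
# The real MARC definitions carry far more (codes, constraints, versions).
SCHEMA = {
    "015": {
        "label": "National Bibliography Number",
        "repeatable": True,
        "subfields": {"a": "National bibliography number", "2": "Source"},
    },
    "245": {
        "label": "Title Statement",
        "repeatable": False,
        "subfields": {
            "a": "Title",
            "b": "Remainder of title",
            "c": "Statement of responsibility",
        },
    },
}


def find_issues(record):
    """Return (issue type, location) pairs for one record.

    `record` is a list of (tag, {subfield code: value}) tuples.
    """
    issues = []
    seen = set()
    for tag, subfields in record:
        definition = SCHEMA.get(tag)
        if definition is None:
            issues.append(("undefined field", tag))
            continue
        if tag in seen and not definition["repeatable"]:
            issues.append(("non-repeatable field", tag))
        seen.add(tag)
        for code in subfields:
            if code not in definition["subfields"]:
                issues.append(("undefined subfield", f"{tag}${code}"))
    return issues


record = [
    ("015", {"a": "GB1234", "9": "local"}),
    ("245", {"a": "A title"}),
    ("245", {"a": "Another title"}),
    ("999", {"a": "local"}),
]
print(find_issues(record))
```

Keeping the definitions as plain data, separate from the checking code, is what makes an export to a machine readable schema possible in the first place.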
  5. measure records
     $ ./validate              # validation
     $ ./completeness          # completeness analysis
     $ ./classifications       # classification analysis
     $ ./authorities           # authorities analysis
     $ ./tt-completeness       # Thompson-Traill completeness
     $ ./serial-score          # serial scores
     $ ./functional-analysis   # functional analysis
     $ ./prepare-solr          # prepare Solr index
     $ ./index                 # indexing with Solr
     $ ./all-analyses
     Outputs: CSV files, lightweight web UI
  6. aggregating results – records with issues (% of records)
     catalogue   all     filtered
     bay         100.0   18.8
     bzb         100.0   76.1
     cer         2.8     2.8
     col         90.4    66.0
     dnb         13.9    0.2
     gen         40.8    27.3
     har         100.0   97.3
     loc         30.5    29.3
     mic         80.8    67.5
     nfi         62.1    58.1
     ris         99.7    57.1
     sfp         82.7    60.4
     sta         92.7    92.5
     szt         30.8    30.6
     tib         100.0   100.0
     tor         100.0   74.2
     Filtered = issues excluding the undocumented tags and subfields
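The difference between the "all" and "filtered" columns on slide 6 can be illustrated with a small sketch. It assumes that "filtered" means ignoring the "undefined field" and "undefined subfield" issue types; the function name `share_with_issues` and the sample data are hypothetical, and the real tool aggregates from its own output files.

```python
# Issue types dropped by the "filtered" view (assumption based on the slide's note).
UNDOCUMENTED = {"undefined field", "undefined subfield"}


def share_with_issues(records):
    """Return (all, filtered) percentages of records with at least one issue.

    `records` is a list in which each item is the list of issue types
    found in one record.
    """
    total = len(records)
    if total == 0:
        return 0.0, 0.0
    with_any = sum(1 for issues in records if issues)
    with_documented = sum(
        1 for issues in records if any(i not in UNDOCUMENTED for i in issues)
    )
    return (
        round(100 * with_any / total, 1),
        round(100 * with_documented / total, 1),
    )


sample = [
    ["undefined subfield"],               # counted only in "all"
    ["undefined field", "invalid ISBN"],  # counted in both
    [],                                   # clean record
    ["invalid value"],                    # counted in both
]
print(share_with_issues(sample))
```

A large gap between the two numbers, as for bay or dnb in the table, would then point to locally defined fields rather than to errors against the documented standard.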
  8. open to see individual issues
  9. link to records with this issue; link to the field definition in MARC standard; ordered by frequency; clicked
  11. launch a search for 015$a:*; shows a term list of 015$a; 015$9 is not defined; bar chart size proportional to record count with subfield
  14. link to term list; check details; the element which tells us the source of terms (dictionary)
  15. strange values; probably good, but not listed in MARC
  16. link to completeness; link to term list
  17. search links; the missing step: resolve to human readable labels (cooperation with coli-conc and BARTOC)
  21. outliers
  22. plan: deep dive into subjects
      Diagram: a KOS with terms #1 to #11; metadata record #1 and metadata record #2, each with field #1 and field #2; references to generic concepts and references to specific concepts
  23. references, credits
      https://github.com/pkiraly/{metadata-qa-marc, metadata-qa-marc-web}
      Validating 126 million MARC records. doi:10.1145/3322905.3322929
      Avram (w/ Jakob Voß): http://format.gbv.de/schema/avram/specification
      @kiru, pkiraly@gwdg.de, pkiraly.github.io
      Thanks to: J. Rolschewski, Phú, J. Voß, C. Klee, P. Hochstenbach, O. Suominen, T. Virolainen, Kokas K., Bernátsky L., S. Auer, B. Genat, Sh. Doljack, D. L. Rueda, Ph. E. Schreur, M. Lefferts, A. Jahnke, M. Kittelmann, J. Christoffersen, R. Heuvelmann, Gyuricza A., Balázs L., Ungváry R., G. Coen, A. Ledl, A. Kasprzik, U. Balakrishnan, Y. Y. Nicolas, M. Franke-Maier, G. Lauer
  24. back material
  25. data sources
      Bavarian union cat. (bay) – 27.3 million records
      Baden-Würt. union cat. (bzb) – 23.1 m
      Columbia (col) – 6.0 m
      Heritage of the Printed Book DB, CERL (cer) – 6.7 m
      Germ. National Bibl. (dnb) – 16.7 m
      Gent (gen) – 1.8 m
      Harvard (har) – 13.7 m
      Library of Congress (loc) – 10.1 m
      Michigan (mic) – 1.3 m
      Finnish National Bibl. (nfi) – 1.0 m
      Repertoire International des Sources Musicales (ris) – 1.3 m
      San Francisco Public Lib. (sfp) – 0.9 m
      Stanford (sta) – 9.4 m
      Szeged (szt) – 1.2 m
      TIB Hannover (tib) – 3.5 m
      Toronto Public Lib. (tor) – 2.5 m
      Polish National Lib. – 6.5 m
      North Germ. union cat. – 69 m
      Hung. Acad. of Science Lib. – 1 m
      Hung. union cat. – 9 m
      Library types: union catalogues, national libraries, university libraries, public libraries
  26. issue types
      issues on record level
      ❏ ambiguous linkage
      ❏ invalid linkage
      ❏ type error
      control field issues
      ❏ invalid code
      ❏ invalid value
      field issues
      ❏ missing reference subfield (880$6)
      ❏ non-repeatable field
      ❏ undefined field
      indicator issues
      ❏ invalid value
      ❏ non-empty value
      ❏ obsolete value
      subfield issues
      ❏ classification
      ❏ invalid ISBN
      ❏ invalid ISSN
      ❏ invalid length
      ❏ invalid value
      ❏ repetition
      ❏ undefined subfield
      ❏ non well-formatted value
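Two of the subfield issue types on slide 26, invalid ISBN and invalid ISSN, rest on standard check-digit rules, which can be sketched as below. This illustrates the general technique and is not the tool's code; the function names are hypothetical, and the sketch assumes the value holds only the identifier (no qualifiers such as "(pbk.)").

```python
DIGITS = "0123456789"


def _normalize(value):
    """Strip hyphens and spaces and upper-case the remaining characters."""
    return value.replace("-", "").replace(" ", "").upper()


def _mod11_ok(chars):
    """Weighted mod-11 check used by ISBN-10 and ISSN.

    Weights run from len(chars) down to 1; 'X' means 10 and is only
    allowed in the last position.
    """
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == len(chars) - 1:
            digit = 10
        elif char in DIGITS:
            digit = int(char)
        else:
            return False
        total += digit * (len(chars) - position)
    return total % 11 == 0


def is_valid_isbn(value):
    """Check an ISBN-10 (mod 11) or ISBN-13 (alternating weights 1 and 3, mod 10)."""
    chars = _normalize(value)
    if len(chars) == 10:
        return _mod11_ok(chars)
    if len(chars) == 13 and all(c in DIGITS for c in chars):
        total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(chars))
        return total % 10 == 0
    return False


def is_valid_issn(value):
    """Check an 8-character ISSN with the mod-11 rule."""
    chars = _normalize(value)
    return len(chars) == 8 and _mod11_ok(chars)


print(is_valid_isbn("0-306-40615-2"), is_valid_issn("0378-5955"))
```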
  27. number of subfields in catalogues
      catalogue   total   1%    10%
      bay         854     144   51
      bzb         522     144   65
      crl         169     65    39
      col         1862    196   59
      dnb         575     186   97
      gnt         955     122   47
      har         2024    154   49
      loc         1156    128   40
      mic         1233    138   37
      nfi         811     145   54
      ris         138     88    52
      sfp         1046    125   37
      sta         2997    225   64
      szt         1210    74    42
      tib         46      41    35
      tor         1733    163   46
      The tool has 2600+ subfield definitions.
      total: total number of fields; 1%: fields available in at least 1% of the records; 10%: fields available in at least 10% of the records.
      Top fields (not in the table): 50%: 13-25 fields, 80%: 4-18 fields, 90%: 0-16 fields
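The "total", "1%" and "10%" columns on slide 27 can be reproduced in principle by counting in how many records each subfield occurs. The sketch below is a simplified illustration; `subfield_coverage` and the synthetic records are hypothetical, and the real tool works from full MARC records rather than from sets of identifiers.

```python
from collections import Counter


def subfield_coverage(records, thresholds=(1, 10)):
    """Count distinct subfields, overall and per minimum record share.

    `records` is a list of collections of subfield identifiers such as
    "245$a"; `thresholds` are percentages of records.
    """
    counts = Counter()
    for subfields in records:
        counts.update(set(subfields))  # count each subfield once per record
    n = len(records)
    result = {"total": len(counts)}
    for pct in thresholds:
        # Integer comparison avoids floating point rounding at the boundary.
        result[f"{pct}%"] = sum(1 for c in counts.values() if c * 100 >= pct * n)
    return result


# Synthetic catalogue of 200 records.
records = []
for i in range(200):
    subfields = {"245$a"}          # present in every record
    if i < 20:
        subfields.add("015$a")     # 10% of records
    if i < 2:
        subfields.add("020$a")     # 1% of records
    if i < 1:
        subfields.add("015$9")     # 0.5% of records
    records.append(subfields)

print(subfield_coverage(records))
```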
  28. outliers discriminated negative values
  29. Tom Delsey (2002) Functional analysis of the MARC 21 bibliographic and holdings formats. Tech. report, Library of Congress, 2002. Prepared for the Network Development and MARC Standards Office, Library of Congress. Second Revision: September 17, 2003. https://www.loc.gov/marc/marc-functional-analysis/original_source/analysis.pdf
  30. reuse
      element             MARC 21   versions   total
      control fields      7                    7
      control subfields   211                  211
      data fields         215       68         283
      indicators          175       8          183
      subfields           2259      344        2603
      3287 Java classes; qa-metadata-marc.jar
      Avram JSON: data model export, machine readable standard

Editor's notes

  • What is “report”?