1. Validating 126 million MARC records
DATeCH 2019, Brussels, 2019-05-10.
Péter Király
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
Card catalog at Gent University Library, photo: Pieter Morlion, 2010 CC-BY 4.0
https://commons.wikimedia.org/wiki/File:Boekentoren_2010PM_1179_21H9015.JPG
http://bit.ly/qa-datech2019
2. Part I. short introduction to MARC
❏ MAchine Readable Cataloging
❏ format and semantic specification
❏ comes from the age of punch cards – information compression required
❏ invented in the early 1960s
❏ people love to hate and criticise it: “MARC must die”*, “Stockholm syndrome of MARC”**
❏ “There are only two kinds of people who believe themselves able to read a
MARC record without referring to a stack of manuals: a handful of our top
catalogers and those on serious drugs.”
* Roy Tennant http://lj.libraryjournal.com/2002/10/ljarchives/marc-must-die/
** Niklas Lindström at ELAG 2019 https://twitter.com/cm_harlow/status/1126068414928293888
3. a (pretty printed) example
LDR 01136cnm a2200253ui 4500
001 002032820
005 20150224114135.0
008 031117s2003 gw 000 0 ger d
020 $a3805909810
100 1 $avon Staudinger, Julius, $d1836-1902$0(viaf)14846766
245 10$aJ. von Staudingers Kommentar zum ... / $cJ. von Staudinger.
250 $aNeubearb. 2003 $bvon Jörn Eckert
260 $aBerlin :$bSellier-de Gruyter, $c2003.
300 $a534 p. ;.
500 $aCiteertitel: BGB.
500 $aBandtitel: Staudinger BGB.
700 1 $aEckert, Jörn
852 4 $xRE$bRE55$cRBIB$jRBIB.BUR 011 DE 021 $p000000800147
4. looks like rocket science...
Apollo 11 (moon landing) source code
https://github.com/chrislgarry/Apollo-11
5. positional fields - 008
‘801003s1958    ja            000 0 jpn  ’
 0         1         2         3
 0123456789012345678901234567890123456789
 aaaaaabccccddddeee                 fffgh  All materials
                   IIIIjkLLLLmnop_qr       Books
                   ij_klmnOOOpq___rs       Continuing Resources
                   iijklmNNNNNNOO_p_       Music
                   IIIIjj_k__lm_n_OO       Maps
                   iii_j_____kl___mn       Visual Materials
                   ____ij__k_l______       Computer Files
                   _____i___________       Mixed Materials
lower case = distinct units
upper case = repeatable units
_ = undefined position
layout of positions 18-34 depends on record type (calculated from Leader values)
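A minimal sketch of reading the positions that are shared by all material types (00-17 and 35-39). The sample value is the 008 of the example record on slide 3 with its blanks padded back to 40 characters, which is an assumption about the original record; the dictionary keys are illustrative names, not identifiers from the tool.

```python
# Positions of field 008 shared by all material types
# (start inclusive, end exclusive).
COMMON_008 = {
    "date_entered": (0, 6),
    "type_of_date": (6, 7),
    "date1": (7, 11),
    "date2": (11, 15),
    "place": (15, 18),
    "language": (35, 38),
    "modified_record": (38, 39),
    "cataloging_source": (39, 40),
}


def parse_008_common(value):
    """Slice the type-independent positions out of a 40-character 008."""
    if len(value) != 40:
        raise ValueError("008 must be 40 characters, got %d" % len(value))
    return {name: value[start:end] for name, (start, end) in COMMON_008.items()}


# Assumed padding: 00-17, then the book-specific block 18-34, then 35-39.
sample = "031117s2003    gw " + " " * 11 + "000 0 " + "ger d"
parsed = parse_008_common(sample)
print(parsed["date1"], parsed["place"], parsed["language"])
```

Positions 18-34 are left out on purpose: which units they contain can only be decided after the record type has been derived from the Leader.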
6. datafields
repeatable/non-repeatable
Indicator1, Indicator2: always 1 char long dictionary term
Subfield1, ..., Subfieldn
❏ code
❏ value
❏ free text
❏ dictionary term
❏ fixed format (e.g. yymmdd)
❏ fixed format + dictionary terms (d7i2)
❏ fixed positions + dictionary terms
❏ repeatable/non-repeatable
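The structure above can be sketched as a small data class. This is an illustration only, not the data model of the tool: the class name, the function name and the assumed line layout (3-character tag, one space, two indicator characters, then `$`-prefixed subfields, as in the pretty printed example on slide 3) are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DataField:
    tag: str
    ind1: str
    ind2: str
    subfields: List[Tuple[str, str]]  # (code, value) pairs, order preserved


def parse_datafield(line: str) -> DataField:
    """Parse one pretty printed data field line such as '245 10$a...$c...'."""
    tag, rest = line[:3], line[4:]
    ind1, ind2 = rest[0], rest[1]
    subfields = []
    # Everything before the first '$' is empty; each chunk is code + value.
    for chunk in rest[2:].split("$")[1:]:
        if chunk:
            subfields.append((chunk[0], chunk[1:].strip()))
    return DataField(tag, ind1, ind2, subfields)


field = parse_datafield(
    "245 10$aJ. von Staudingers Kommentar zum ... / $cJ. von Staudinger."
)
print(field.tag, field.ind1, field.ind2, [code for code, _ in field.subfields])
```

The sketch splits on every `$`, so a value that itself contains `$` would be cut in two; a real parser works on the binary or XML serialization instead.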
7. versions
❏ changes of the standard
❏ no versioning
❏ new, deleted and changed elements every year
❏ localized versions
❏ introducing new fields
❏ overwriting existing fields
❏ mixing localized versions
❏ records carry no information about which localization they follow
❏ 50+ localizations (international, national, consortial)
8. size – number of data elements implemented
                    MARC 21   versions   total
control fields            7                  7
control subfields       211                211
data fields             215         68     283
indicators              175          8     183
subfields              2259        344    2603
total                                     3287
Java classes (data model, qa-metadata-marc.jar) → export → Avram JSON (machine readable standard)
10. Part II. record validation and quality assessment
Boekentoren UGent - de belvedère, photo: Michiel Hendryckx, 2013, CC-BY-SA 3.0
https://commons.wikimedia.org/wiki/File:Boekentoren_ugent_belvedere_675.jpg
11. quality assessment workflow
1. ingest
2. measure records
3. aggregate
4. report
5. evaluate with experts (feedback loop)
Improve records
12. 1. ingest data
Bavarian union catalogue (bay) – 27.3 million records; Baden-Württemberg
union catalogue (bzb) – 23.1 m; Columbia (col) – 6.0 m; Heritage of the
Printed Book Database, CERL (cer) – 6.7 m; German National Bibliography
(dnb) – 16.7 m; Gent (gen) – 1.8 m; Harvard (har) – 13.7 m; Library of
Congress (loc) – 10.1 m; Michigan (mic) – 1.3 m; Finnish National
Bibliography (nfi) – 1.0 m; Répertoire International des Sources Musicales
(ris) – 1.3 m; San Francisco Public Library (sfp) – 0.9 m; Stanford (sta) – 9.4
m; Szeged (szt) – 1.2 m; TIB Hannover (tib) – 3.5 m; Toronto Public Library
(tor) – 2.5 m
union catalogues – national libraries – university libraries – public libraries
13. 2. measure records
$ ./validator [options] [file]
001999999 852 undefined subfield L https://www.loc.gov/...
002000005 035 undefined subfield 9 https://www.loc.gov/...
002000005 852 undefined subfield L https://www.loc.gov/...
002000005 852 undefined subfield L https://www.loc.gov/...
002000008 035 undefined subfield 9 https://www.loc.gov/…
002000011 Leader/17 (leader17) invalid value M https://www.loc.gov/…
002000012 Leader/17 (leader17) invalid value I https://www.loc.gov/...
14. 3. aggregating results – records with issues
       all (%)   filtered (%)
bay     100.0       18.8
bzb     100.0       76.1
cer       2.8        2.8
col      90.4       66.0
dnb      13.9        0.2
gen      40.8       27.3
har     100.0       97.3
loc      30.5       29.3
mic      80.8       67.5
nfi      62.1       58.1
ris      99.7       57.1
sfp      82.7       60.4
sta      92.7       92.5
szt      30.8       30.6
tib     100.0      100.0
tor     100.0       74.2
Filtered = issues excluding the undocumented tags and subfields
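A minimal sketch of how the two columns could be derived from per-record issue lists. The issue labels mirror the messages on the "measure records" slide; the function name, the input shape and the total of 10 records are hypothetical and not taken from the tool.

```python
# Issue types counted as "undocumented" and left out of the filtered figure
# (assumed labels, modelled on the validator output shown earlier).
UNDOCUMENTED = {"undefined field", "undefined subfield"}


def share_with_issues(issues, total_records):
    """Return (all %, filtered %) of records that have at least one issue.

    issues: iterable of (record_id, issue_type) pairs, one per reported issue.
    """
    all_ids = set()
    filtered_ids = set()
    for record_id, issue_type in issues:
        all_ids.add(record_id)
        if issue_type not in UNDOCUMENTED:
            filtered_ids.add(record_id)
    return (
        100.0 * len(all_ids) / total_records,
        100.0 * len(filtered_ids) / total_records,
    )


issues = [
    ("001999999", "undefined subfield"),
    ("002000005", "undefined subfield"),
    ("002000005", "undefined subfield"),
    ("002000008", "undefined subfield"),
    ("002000011", "invalid value"),
    ("002000012", "invalid value"),
]
all_pct, filtered_pct = share_with_issues(issues, total_records=10)
print(all_pct, filtered_pct)
```

A record is counted once no matter how many issues it has, which is why the unit is records with issues and not issues.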
15. issue types
issues on record level
❏ R1 ambiguous linkage
❏ R2 invalid linkage
❏ R3 type error
control field issues
❏ C1 invalid code
❏ C2 invalid value
field issues
❏ F1 missing reference subfield (880$6)
❏ F2 non-repeatable field
❏ F3 undefined field
indicator issues
❏ I1 invalid value
❏ I2 non-empty value
❏ I3 obsolete value
subfield issues
❏ S1 classification
❏ S2 invalid ISBN
❏ S3 invalid ISSN
❏ S4 invalid length
❏ S5 invalid value
❏ S6 repetition
❏ S7 undefined subfield
❏ S8 non well-formatted value
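As an illustration of one subfield check (S2 invalid ISBN), here is a sketch of the standard ISBN-10 checksum: the digits are weighted 10 down to 1 and the sum must be divisible by 11, with `X` standing for 10 in the last position. It is not the tool's implementation, and it ignores ISBN-13 and any qualifier text that may follow the number in 020$a.

```python
def is_valid_isbn10(isbn: str) -> bool:
    """Check the ISBN-10 checksum; hyphens and spaces are ignored."""
    chars = isbn.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            digit = 10  # 'X' is only allowed as the check character
        elif char in "0123456789":
            digit = int(char)
        else:
            return False
        total += (10 - position) * digit
    return total % 11 == 0


# 020$a of the example record on slide 3
print(is_valid_isbn10("3805909810"))
```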
16. number of subfields in catalogues
      total    1%   10%
bay     854   144    51
bzb     522   144    65
cer     169    65    39
col    1862   196    59
dnb     575   186    97
gen     955   122    47
har    2024   154    49
loc    1156   128    40
mic    1233   138    37
nfi     811   145    54
ris     138    88    52
sfp    1046   125    37
sta    2997   225    64
szt    1210    74    42
tib      46    41    35
tor    1733   163    46
The tool has 2600+ subfield definitions
total: total number of fields; 1%: fields available in at least 1% of the records; 10%: fields available in at least 10% of the records.
Top fields (not in the table) – 50%: 13-25 fields, 80%: 4-18 fields, 90%: 0-16 fields
19. K-means clustering
Spark (Scala)
increasing the number of clusters decreases the distance from the centroids
after a point this gain is not so big (“elbow effect”), at least in theory
big number of low quality records
small clusters with ‘in between’ quality records
the acceptable average
clusters with good quality records
Thompson and Traill (2017) http://journal.code4lib.org/articles/12828
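The elbow effect can be sketched on toy data. The analysis in the talk ran on Spark (Scala); the block below is a plain Python stand-in with Lloyd's algorithm on hypothetical one-dimensional quality scores and a deterministic farthest-point initialisation, so none of its names or numbers come from the project.

```python
from typing import List


def kmeans_wcss(points: List[float], k: int, iterations: int = 20) -> float:
    """Run Lloyd's algorithm on 1-D points; return the within-cluster sum of squares."""
    # Deterministic initialisation: start from the first point, then keep
    # adding the point that is farthest from all centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(abs(p - c) for c in centroids))
        )
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            sum(c) / len(c) if c else old for c, old in zip(clusters, centroids)
        ]
    return sum(min((p - c) ** 2 for c in centroids) for p in points)


# Hypothetical scores forming three visible groups.
scores = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
wcss = [kmeans_wcss(scores, k) for k in range(1, 5)]
for k, value in enumerate(wcss, start=1):
    print(k, round(value, 2))
```

On this data the distance drops sharply up to three clusters and barely changes with a fourth, which is the elbow; on real catalogue data the bend is usually less clear, hence "in theory".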
24. Finding problems with facets
Vandenhoeck und Ruprecht
Vandenhoeck & Ruprecht
Vandenhoeck u. Ruprecht
Vandenhoeck
Vandenhoek & Ruprecht
Vandenhoek und Ruprecht
Bandenhoed und Ruprecht
Vandenhoeck et Ruprecht
Vandenhoeck & Reprecht
Vandenhoed und Ruprecht
V&R unipress
V&R Unipress
V & R Unipress
V & R unipress
est. 1735
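A minimal sketch of how facet values like the ones above could be grouped to spot variants. It only normalises case and the spelling of the conjunction ("und", "u.", "et", "&"); the function names and the normalisation rule are assumptions for illustration, not part of the tool.

```python
import re
from collections import defaultdict

# Spellings of "and" seen in the publisher facet, with surrounding blanks.
CONJUNCTION = re.compile(r"\s*(?:\bund\b|\bu\.|\bet\b|&)\s*")


def facet_key(name):
    """Lowercase the name and collapse every conjunction spelling to '&'."""
    return CONJUNCTION.sub("&", name.strip().lower())


def group_variants(names):
    groups = defaultdict(list)
    for name in names:
        groups[facet_key(name)].append(name)
    return dict(groups)


publishers = [
    "Vandenhoeck und Ruprecht", "Vandenhoeck & Ruprecht",
    "Vandenhoeck u. Ruprecht", "Vandenhoeck et Ruprecht",
    "Vandenhoek & Ruprecht", "Vandenhoek und Ruprecht",
    "V&R unipress", "V&R Unipress", "V & R Unipress", "V & R unipress",
]
groups = group_variants(publishers)
for key, variants in groups.items():
    print(key, len(variants))
```

Misspellings such as "Vandenhoek" still end up in their own group; catching those would need fuzzy matching on top of the normalisation.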
25. cataloging frontline
intensive backward cataloging - maybe importing?
backward cataloging is still intensive, the tendency continues
peak is > 13K
2000-07-10, the “golden day”: 95K new records
forward cataloging
26. everything else
… at least regarding this project
code & docs: https://github.com/pkiraly/metadata-qa-marc
Web UI source code: https://github.com/pkiraly/metadata-qa-marc-web
Avram Specification (Jakob Voß):
http://format.gbv.de/schema/avram/specification
https://twitter.com/kiru
peter.kiraly@gwdg.de