
Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)


  1. Metadata quality in cultural heritage institutions. Péter Király {pkiraly@gwdg.de, @kiru, pkiraly.github.io}, Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG). ReIReS (Research Infrastructure on Religious Studies) Workshop on FAIR Principle for Digital Research Data Management, Leibniz-Institute of European History, Mainz, 2018-11-28. These slides: http://bit.ly/qa-relres-fair
  2. the problem https://twitter.com/fxru/status/1052838758066868224
  3. top 20 patterns, ‘date’ field, MoMA collection. Harald Klinke (LMU München) https://twitter.com/HxxxKxxx/status/1066805548866289664
  4. Generic title and bad thumbnail. More examples in Report and Recommendations from the Task Force on Metadata Quality (2015)
  5. Multilinguality problem ★ Mona Lisa → 456 results ★ La Gioconda → 365 results ★ La Joconde → 71 results http://www.europeana.eu/portal/en/record/90402/RP_F_00_351.html
  6. Problems with title ★ Same title and description: title: "VOETBAL-EREDIVISIE- FEYENOORD - GO AHEAD 3-1", description: "VOETBAL-EREDIVISIE- FEYENOORD - GO AHEAD 3-1" ★ Machine-readable ID in title: title: "NLD-820630-AMSTERDAM: Straatmuzikanten proberen geld te verdienen voor..." ★ Leftover: title: "+++EMPTY+++" ★ More examples in Report and Recommendations from the Task Force on Metadata Quality (2015)
  7. Measuring metadata quality. Non-informative values ★ bad, non-informative dc:title: “photograph, framed”, “group photograph”, “photograph” ★ good, informative dc:title: “Photograph of Sir Dugald Clerk”, “Photograph of "Puffing Billy"”
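The title problems shown on the last two slides can be approximated with a few string rules. The following is a minimal sketch, not the tooling used in the talk: the record is assumed to be a plain dict, `title_problems` is a hypothetical helper, and the ID pattern is an assumption derived from the single "NLD-820630-AMSTERDAM" example.

```python
import re

# Hypothetical rule set: the placeholder marker and the generic titles come
# from the examples on the slides; the ID pattern is an assumption.
GENERIC_TITLES = {"photograph", "photograph, framed", "group photograph"}
PLACEHOLDER = re.compile(r"^\++\s*EMPTY\s*\++$", re.IGNORECASE)
ID_PREFIX = re.compile(r"^[A-Z]{2,4}-\d{6}-")


def title_problems(record):
    """Return a list of problem labels for one record (a plain dict)."""
    title = (record.get("title") or "").strip()
    description = (record.get("description") or "").strip()
    if not title:
        return ["missing title"]
    problems = []
    if PLACEHOLDER.match(title):
        problems.append("leftover placeholder")
    if title.lower() in GENERIC_TITLES:
        problems.append("non-informative title")
    if description and title == description:
        problems.append("same title and description")
    if ID_PREFIX.match(title):
        problems.append("machine-readable ID in title")
    return problems


print(title_problems({"title": "+++EMPTY+++"}))
print(title_problems({"title": "Photograph of Sir Dugald Clerk"}))
```

A fixed list of generic titles will not scale to a real collection; it only illustrates the kind of rule a problem catalog might contain.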
  8. Copy & paste cataloging: from a template? More examples in Report and Recommendations from the Task Force on Metadata Quality (2015)
  9. metadata: structured information that describes, explains, locates, or otherwise represents something else. NISO (2004)
  10. quality and ‘fitness for purpose’ ★ fulfilment of a specification or stated outcomes ★ measured against what is seen to be the goal of the unit ★ achieving institutional mission and objectives ★ ‘We know it when we see it, but conveying the full bundle of assumptions and experience that allow us to identify it is a different matter.’
  11. metadata quality ★ purpose: to access content ★ no metadata (or bad metadata) → no access to data → no data usage ★ more explanation: Data on the Web Best Practices, W3C Working Draft, https://www.w3.org/TR/dwbp/
  12. the problem statement – improved ★ there are “good” and “bad” metadata records ★ we would like to achieve metrics like this: functional requirements → good / acceptable / bad
  13. general metrics ★ completeness: number of metadata elements filled out ★ accuracy: data correspond to the resource that is being described ★ consistency: values are compliant with what is defined by the metadata scheme ★ objectiveness: values describe the resource in an unbiased way ★ appropriateness: values facilitate the deployment of search ★ correctness: syntactically and grammatically correct language. Bruce and Hillman (2004); Ochoa and Duval (2009); Palavitsinis (2014)
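Of these metrics, completeness is the most direct to compute. A minimal sketch, assuming a record is a dict from field name to a value or list of values and that the list of schema fields is supplied by the caller (both are illustrative choices, not the framework's data model):

```python
def completeness(record, schema_fields):
    """Share of schema fields that carry at least one non-empty value."""
    if not schema_fields:
        raise ValueError("schema_fields must not be empty")
    filled = 0
    for field in schema_fields:
        value = record.get(field)
        # Treat a single value and a list of values uniformly.
        values = value if isinstance(value, (list, tuple)) else [value]
        if any(v is not None and str(v).strip() for v in values):
            filled += 1
    return filled / len(schema_fields)


schema = ["dc:title", "dc:description", "dc:creator", "dc:date"]
record = {"dc:title": "Photograph of Sir Dugald Clerk", "dc:description": "  "}
print(completeness(record, schema))  # prints 0.25
```

Whitespace-only values count as empty here; whether that is the right call depends on the collection being measured.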
  14. linked data dimensions and metrics ★ accessibility: Availability, Licensing, Interlinking, Security, Performance ★ intrinsic: Syntactic validity, Semantic accuracy, Consistency, Conciseness, Completeness ★ contextual: Relevancy, Trustworthiness, Understandability, Timeliness ★ representational: Representational conciseness, Interoperability, Interpretability, Versatility. Stvilia et al. (2007); Zaveri et al. (2015)
  15. The good metrics are ★ clear ★ realistic ★ discriminating ★ measurable ★ universal. FAIR metrics: http://fairmetrics.org, https://github.com/FAIRMetrics/Metrics/blob/master/ALL.pdf
  17. F1 – Identifier Uniqueness. What is being measured? Whether there is a scheme to uniquely identify the digital resource. How do we measure it? An identifier scheme is valid if and only if it is described in a repository that can register and present such identifier schemes (e.g. fairsharing.org).
  18. F1 – Identifier persistence. What is being measured? Whether there is a policy that describes what the provider will do in the event an identifier scheme becomes deprecated. How do we measure it? Use an HTTP GET on the URL provided.
  19. F2 – Machine-readability of metadata. What is being measured? The availability of machine-readable metadata that describes a digital resource. How do we measure it? HTTP GET on the metadata URL. A response of [a 200, 202, 203 or 206 HTTP response after resolving all and any prior redirects, e.g. 301→302→200 OK] indicates that there is indeed a document. The second URL should resolve to the record of a registered file format (e.g. DCAT, DICOM, schema.org etc.) in a registry like FAIRsharing. Future enhancements to FAIRsharing may include tags that indicate whether or not a given file format is generally agreed to be machine-readable.
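The status-code rule in F2 can be written down without touching the network. This is a sketch under stated assumptions: the input is a list of status codes already observed while following one URL (how they are collected is left out), and the set of redirect codes is the usual HTTP set rather than anything the FAIR metrics document spells out.

```python
# Status codes named on the slide as evidence that a document exists.
DOCUMENT_STATUSES = {200, 202, 203, 206}
# Assumption: the standard HTTP redirect codes; the slide only shows 301 and 302.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def final_status(status_chain):
    """Return the first non-redirect status in the chain, or None."""
    for status in status_chain:
        if status not in REDIRECT_STATUSES:
            return status
    return None  # the chain ended while still redirecting


def has_document(status_chain):
    """True if the redirect chain ends in one of the accepted statuses."""
    return final_status(status_chain) in DOCUMENT_STATUSES


print(has_document([301, 302, 200]))  # prints True
print(has_document([301, 404]))       # prints False
```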
  20. F3 – Resource Identifier in Metadata. What is being measured? Whether the metadata document contains the globally unique and persistent identifier for the digital resource. How do we measure it? Parsing the metadata for the given digital resource GUID.
  21. F4 – Indexed in a searchable resource. What is being measured? The degree to which the digital resource can be found using web-based search engines. How do we measure it? We perform an HTTP GET on the URLs provided and attempt to find the persistent identifier in the page that is returned. A second step might include following each of the top XX hits and examining the resulting documents for the presence of the identifier.
  22. A2 - Metadata Longevity. What is being measured? The existence of metadata even in the absence/removal of data. How do we measure it? Resolve the URL.
  23. RDFUnit, SHACL and ShEx ★ Linked Data is based on the Open World assumption ★ No “record”, no clear boundaries ★ RDF Data Shapes: reinventing the schema ★ ShEx (Shape Expressions, https://shex.io) and SHACL (Shapes Constraint Language, https://www.w3.org/TR/shacl/) ★ Finding individual data issues
  24. Core constraints ★ Cardinality: minCount, maxCount ★ Types of values: class, datatype, nodeKind ★ Shapes: node, property, in, hasValue ★ Range of values: minInclusive, maxInclusive, minExclusive, maxExclusive ★ String based: minLength, maxLength, pattern, stem, uniqueLang ★ Logical constraints: not, and, or, xone ★ Closed shapes: closed, ignoredProperties ★ Property pair constraints: equals, disjoint, lessThan, lessThanOrEquals ★ Non-validating constraints: name, value, defaultValue ★ Qualified shapes: qualifiedValueShape, qualifiedMinCount, qualifiedMaxCount
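To give a feel for what the cardinality and string-based constraints check, here is a plain-Python imitation of a few of them. It is not a SHACL or ShEx implementation: `check_property` is a hypothetical helper that works on a list of string values, and the keyword names merely echo the constraint names above.

```python
import re


def check_property(values, min_count=None, max_count=None,
                   pattern=None, min_length=None, max_length=None):
    """Check one property's values; return a list of violation messages."""
    violations = []
    if min_count is not None and len(values) < min_count:
        violations.append(f"minCount {min_count} violated: {len(values)} value(s)")
    if max_count is not None and len(values) > max_count:
        violations.append(f"maxCount {max_count} violated: {len(values)} value(s)")
    for value in values:
        # Unanchored search, so anchors must be written into the pattern.
        if pattern is not None and not re.search(pattern, value):
            violations.append(f"pattern violated: {value!r}")
        if min_length is not None and len(value) < min_length:
            violations.append(f"minLength {min_length} violated: {value!r}")
        if max_length is not None and len(value) > max_length:
            violations.append(f"maxLength {max_length} violated: {value!r}")
    return violations


print(check_property([], min_count=1))
print(check_property(["1977."], pattern=r"^\d{4}$"))
```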
  25. SHACL with BibFRAME. Capturing Cataloger Expectations in an RDF Editor. Presentation at SWIB 2018 by S. Folsom, H. Khan, L. Rayle, J. Kovari, R. Younes, S. Warner https://twitter.com/sf433/status/1067370567303614464
  26. The Quartz guide to bad data (2015) ★ by Christopher Groskopf ★ guide for data journalists about how to recognize data issues ★ practical guide, not an academic paper ★ take-away messages: ○ be sceptical about the data ○ check it with exploratory data analysis ○ check it early, check it often ★ https://github.com/Quartz/bad-data-guide, https://qz.com/572338/the-quartz-guide-to-bad-data/
  27. Issues that your source should solve ★ Values are missing ★ Zeros replace missing values ★ Data are missing you know should be there ★ Rows or values are duplicated ★ Spelling is inconsistent ★ Name order is inconsistent ★ Date formats are inconsistent ★ Units are not specified ★ Categories are badly chosen ★ Field names are ambiguous ★ Provenance is not documented ★ Suspicious numbers are present ★ Data are too coarse ★ Totals differ from published aggregates ★ Spreadsheet has 65536 rows ★ Spreadsheet has dates in 1900 or 1904 ★ Text has been converted to numbers
  28. Issues that you should solve ★ Text is garbled ★ Data are in a PDF ★ Data are too granular ★ Data was entered by humans ★ Aggregations were computed on missing values ★ Sample is not random ★ Margin-of-error is too large ★ Margin-of-error is unknown ★ Sample is biased ★ Data has been manually edited ★ Inflation skews the data ★ Natural/seasonal variation skews the data ★ Timeframe has been manipulated ★ Frame of reference has been manipulated
  29. Issues a third-party expert should help you solve ★ Author is untrustworthy ★ Collection process is opaque ★ Data asserts unrealistic precision ★ There are inexplicable outliers ★ An index masks underlying variation ★ Results have been p-hacked ★ Benford’s Law fails ★ It’s too good to be true
  30. Issues a programmer should help you solve ★ Data are aggregated to the wrong categories or geographies ★ Data are in scanned documents
  31. https://www.zotero.org/groups/488224/metadata_assessment
  32. part II: in practice
  33. hypothesis ★ by measuring structural elements we can approximate metadata record quality ★ ≃ metadata smell
  34. purposes ★ improve the metadata ★ services: good data → reliable functions ★ better metadata schema & documentation ★ propagate “good practice”
  35. Measuring Europeana
  36. data aggregation workflow (organizational): LAM inst. 1, LAM inst. 2, LAM inst. ... → aggregator 1, aggregator ... → Europeana
  37. data aggregation workflow (technical): Dublin Core, LIDO, EAD, MARC, EDM custom, ... → data transformations → Europeana Data Model (EDM)
  38. organisational proposal: Europeana Data Quality Committee ★ Analysing/revising metadata schema ★ Functional requirement analysis ★ Problem catalog ★ Multilinguality
  39. technical proposal: “Metadata Quality Assurance Framework”, a generic tool for measuring metadata quality ★ adaptable to different metadata schemes ★ scalable (to Big Data) ★ understandable reports for data curators ★ open source
  40. measuring workflow ★ ingest (OAI-PMH, Europeana API, Hadoop, NoSQL) → json ★ measure (Spark, Hadoop, Java, Apache Solr) → csv ★ statistical analysis (Spark, R) → json, png ★ web interface (PHP, D3.js, highchart.js, NoSQL) → html, svg
  41. What to measure? ★ Structural and semantic features: completeness, cardinality, uniqueness, length, dictionary entry, data type conformance, multilinguality (generic metrics) ★ Functional requirement analysis / discovery scenarios: requirements of the most important functions ★ Problem catalog: known metadata problems
  42. metadata requirements / user scenarios ★ “As a user I want to be able to filter by whether a person is the subject of a book, or its author, engraver, printer etc.” ★ Metadata analysis: description of relevant metadata elements and their rules ★ Measurement rules: the relevant field values should be resolvable URIs; each URI should be associated with labels in multiple languages
  43. measurement ★ views: overall view, collection view, record view ★ metrics: Completeness, Field cardinality, Uniqueness, Multilinguality, Language specification, Problem catalog etc. ★ links, measurements, aggregated statistics
  44. multilinguality ★ 0: text w/o language annotation (dc.subject: Germany) ★ 1: text with language annotation (dc.subject: Germany@en) ★ 2: text with several language annotations (dc.subject: Germany@en, Deutschland@de) ★ n: link to (multilingual) vocabulary (http://www.geonames.org/2921044/federal-republic-of-germany)
  45. multilinguality – details
      Literal values without language tags (distinct languages: 0, tagged literals: 0):
        <#record> a ore:Proxy ;
          dc:subject "Ballet", "Opera" .
      Europeana proxy with links, after dereferencing (distinct languages: 11, tagged literals: 19, literals per language: 1.7):
        <#record> a ore:Proxy ;
          edm:europeanaProxy true ;
          dc:subject <http://data.europeana.eu/concept/base/264> , <http://data.europeana.eu/concept/base/247> .
        <http://data.europeana.eu/concept/base/264> a skos:Concept ;
          skos:prefLabel "Ballett"@no, "बैले"@hi, "Ballett"@de, "Балет"@be, "Балет"@ru, "Balé"@pt, "Балет"@bg, "Baletas"@lt, "Balet"@hr, "Balets"@lv .
        <http://data.europeana.eu/concept/base/247>
          skos:prefLabel "Opera"@no, "ओपेरा (गीतिनाटक)"@hi, "Oper"@de, "Ooppera"@fi, "Опера"@be, "Опера"@ru, "Ópera"@pt, "Опера"@bg, "Opera"@lt .
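The three numbers on this slide follow from the language tags alone. A minimal sketch that recomputes them, assuming the tags of the two dereferenced concepts have already been extracted into lists (the extraction itself is not shown, and `multilinguality` is a hypothetical helper name):

```python
from collections import Counter

# Language tags of the skos:prefLabel literals shown on the slide.
BALLET_LABELS = ["no", "hi", "de", "be", "ru", "pt", "bg", "lt", "hr", "lv"]
OPERA_LABELS = ["no", "hi", "de", "fi", "be", "ru", "pt", "bg", "lt"]


def multilinguality(language_tags):
    """Return (tagged literals, distinct languages, literals per language)."""
    counts = Counter(language_tags)
    tagged = sum(counts.values())
    distinct = len(counts)
    per_language = tagged / distinct if distinct else 0.0
    return tagged, distinct, per_language


tagged, distinct, per_language = multilinguality(BALLET_LABELS + OPERA_LABELS)
print(tagged, distinct, round(per_language, 1))  # prints 19 11 1.7
```

For the untagged literals "Ballet" and "Opera" the same function returns zeros, which matches the first pair of values on the slide.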
  46. a good multilingual example ★ fields: dc:description, dc:title, Place/skos:prefLabel ★ groups: Descriptive fields, Subject headings ★ "Brandenburger Tor"@de / "Brandenburg Gate"@en ★ "Grenzübergang Potsdamer Platz"@de / "Postdamer Platz border crossing"@en ★ "Reichstag"@de / "Reichstag building"@en ★ "Die Mauer muß weg!"@de / "Die Mauer muß weg! (The Wall must go!)"@en ★ "Kommentiertes Fotorama mit Bildern von 1989-1990 in Berlin"@de / "Annotated images from 1989-1990 in Berlin"@en
  47. canned demo
  48. Measuring library catalogs. Card catalog at Gent University Library, photo: Pieter Morlion, 2010, CC-BY 4.0 https://commons.wikimedia.org/wiki/File:Boekentoren_2010PM_1179_21H9015.JPG
  49. Part I. Introduction to MARC ❏ MAchine-Readable Cataloging ❏ format and semantic specification ❏ comes from the age of punchcards: information compression ❏ invented in the early 60’s ❏ even the lapidary “MARC must die” article* celebrated its 16th anniversary last month, but MARC is still alive ❏ “There are only two kinds of people who believe themselves able to read a MARC record without referring to a stack of manuals: a handful of our top catalogers and those on serious drugs.” * by Roy Tennant http://lj.libraryjournal.com/2002/10/ljarchives/marc-must-die/
  50. an example
      LEADER 01136cnm a2200253ui 4500
      001 002032820
      005 20150224114135.0
      008 031117s2003 gw 000 0 ger d
      020 $a3805909810
      100 1 $avon Staudinger, Julius,$d1836-1902$0(viaf)14846766
      245 10$aJ. von Staudingers Kommentar zum ... /$cJ. von Staudinger.
      250 $aNeubearb. 2003$bvon Jörn Eckert
      260 $aBerlin :$bSellier-de Gruyter,$c2003.
      300 $a534 p. ;.
      500 $aCiteertitel: BGB.
      500 $aBandtitel: Staudinger BGB.
      700 1 $aEckert, Jörn
      852 4 $xRE$bRE55$cRBIB$jRBIB.BUR 011 DE 021$p000000800147
  51. Positional fields - Leader: 00928nam a2200265 c 4500 ❏ LDR/0-4 Record length: ‘00928’, a number padded with 0s (max. value: 99999) ❏ LDR/5 Record status: ‘n’, a dictionary term meaning “new” ❏ LDR/6 Type of record: ‘a’, a dictionary term meaning “Language material” ❏ LDR/7 Bibliographic level: ‘m’, meaning “Monograph/Item” ❏ ...
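Because the leader is purely positional, reading it comes down to string slicing plus dictionary lookups. The sketch below covers only the four positions listed on the slide, and its code lists contain only the values the slide mentions; `parse_leader` and the key names are illustrative, not the validator's API.

```python
# Code lists restricted to the values mentioned on the slide; the full
# MARC 21 lists are longer.
RECORD_STATUS = {"n": "new"}
TYPE_OF_RECORD = {"a": "Language material"}
BIBLIOGRAPHIC_LEVEL = {"m": "Monograph/Item"}


def parse_leader(leader):
    """Split the first leader positions into named values."""
    if len(leader) != 24:
        raise ValueError(f"a MARC leader has 24 characters, got {len(leader)}")
    length = leader[0:5]
    if not length.isdigit():
        raise ValueError(f"record length is not numeric: {length!r}")
    return {
        "recordLength": int(length),
        "recordStatus": RECORD_STATUS.get(leader[5], "unknown code"),
        "typeOfRecord": TYPE_OF_RECORD.get(leader[6], "unknown code"),
        "bibliographicLevel": BIBLIOGRAPHIC_LEVEL.get(leader[7], "unknown code"),
    }


print(parse_leader("00928nam a2200265 c 4500"))
```

A value outside the code list is exactly the kind of issue a validator would report for a positional field.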
  52. Datafields (repeatable/non-repeatable) ★ Indicator1, Indicator2: always 1 char long, dictionary term ★ Subfield1, ..., Subfieldn: ❏ code ❏ value: free text, dictionary term, fixed format (e.g. yymmdd), fixed format + dictionary terms (d7i2), fixed positions + dictionary terms ❏ repeatable/non-repeatable
  53. Versions ❏ Changes of the standard ❏ No versioning ❏ New, deleted and changed elements every year ❏ Localized versions ❏ Introducing new fields ❏ Overwriting existing fields ❏ Mixing localized versions ❏ No notion about the localization ❏ 50+ localizations (international, national, consortial)
  54. Handling versions (020, ISBN)
      setSubfieldsWithCardinality(
        "a", "International Standard Book Number", "NR",
        "c", "Terms of availability", "NR",
        "q", "Qualifying information", "R",
        ...
      );
      setHistoricalSubfields(
        "b", "Binding information (BK, MP, MU) [OBSOLETE]"
      );
      putVersionSpecificSubfields(MarcVersion.DNB, Arrays.asList(
        new SubfieldDefinition("9", "ISBN mit Bindestrichen", "R")
      ));
  55. Addressing elements - MARCspec ★ XML: XPath, a W3C standard ★ JSON: JSONPath, by Stefan Gössner (http://goessner.net/articles/JsonPath/) ★ MARC: MARCspec, by Carsten Klee (Zeitschriftendatenbank, Berlin) ❏ 260: field ❏ 245^2: the second indicator of a field ❏ 700[0]: the first instance of a field ❏ 245$c: a subfield ❏ 245$b{007/0=a|007/0=t}: subfield ‘b’ of field ‘245’, if the character at position ‘0’ of field 007 equals ‘a’ OR ‘t’ ❏ 020$c{$q=paperback}: subfield ‘c’ if subfield ‘q’ equals ‘paperback’ ★ http://marcspec.github.io/MARCspec/marc-spec.html
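The simple address forms in this list are regular enough to take apart with one regular expression. The sketch below is deliberately partial: it handles only a tag with an optional index and an optional indicator or subfield, rejects the conditional forms in braces, and makes no claim to follow the full MARCspec grammar. `parse_simple_spec` is a hypothetical helper.

```python
import re

# Covers only the simple forms listed above; conditions in {...} and the
# rest of the MARCspec grammar are deliberately left out.
SIMPLE_SPEC = re.compile(
    r"^(?P<tag>[0-9A-Za-z]{3})"
    r"(?:\[(?P<index>\d+)\])?"
    r"(?:\^(?P<indicator>[12])|\$(?P<subfield>[0-9a-z]))?$"
)


def parse_simple_spec(spec):
    """Parse specs such as '260', '245^2', '700[0]' or '245$c' into a dict."""
    match = SIMPLE_SPEC.match(spec)
    if match is None:
        raise ValueError(f"not a simple MARCspec: {spec!r}")
    parts = match.groupdict()
    return {
        "tag": parts["tag"],
        "index": int(parts["index"]) if parts["index"] is not None else None,
        "indicator": int(parts["indicator"]) if parts["indicator"] else None,
        "subfield": parts["subfield"],
    }


for spec in ["260", "245^2", "700[0]", "245$c"]:
    print(spec, parse_simple_spec(spec))
```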
  56. record validation and quality assurance. Boekentoren UGent - de belvedère, photo: Michiel Hendryckx, 2013, CC-BY-SA 3.0 https://commons.wikimedia.org/wiki/File:Boekentoren_ugent_belvedere_675.jpg
  57. validating individual records
      ./validator [file]
      001999999 852 undefined subfield L https://www.loc.gov/...
      002000005 035 undefined subfield 9 https://www.loc.gov/...
      002000005 852 undefined subfield L https://www.loc.gov/...
      002000005 852 undefined subfield L https://www.loc.gov/...
      002000008 035 undefined subfield 9 https://www.loc.gov/…
  58. summary of errors
      ./validator --summary [file]
      006/01-02 (tag006music01): invalid value ' ' (https...) (1 times)
      006/01-04 (tag006book01): contains invalid code ''0' in '060 '' (https...) (2 times)
      006/01-04 (tag006book01): contains invalid code ''6' in '060 '' (https...) (1 times)
      006/01-04 (tag006book01): contains invalid code ''n' in 'nnn '' (https...) (3 times)
      006/01-04 (tag006book01): contains invalid code ''n' in 'uunn'' (https...) (2 times)
      006/01-04 (tag006book01): contains invalid code ''u' in 'uunn'' (https...) (2 times)
  59. other options
      ./validator --marcVersion "GENT" [file]
      ./validator --format "tsv" [file]
      ./validator --defaultRecordType "BOOKS" [file]
      SEVERE: Error with record '002066968'. Leader/06 (typeOfRecord): 'n', Leader/07 (bibliographicLevel): 'm'
      ./validator --fileName "my-report" [file]
      ./validator ... [file] | catmandu … | RScript … | python … | grep ...
  60. viewing/filtering/selecting records
      Displaying the record with a given ID:
      ./formatter --id "002032820" [file]
      Displaying records matching a query:
      ./formatter --search '245$c=Shakespeare' [file]
      Retrieving given elements:
      ./formatter --selector '245$c' [file]
  61. variation to weighted completeness. Thompson and Traill (2017)
  62. calculating Thompson-Traill completeness
      ./tt-completeness [options] [file]
      output:
      id,ISBN,Authors,Alternative Titles,Edition,Contributors,Series,TOC,Date 008,Date 26X,LC/NLM,LoC,Mesh,Fast,GND,Other,Online,Language of Resource,Country of Publication,noLanguageOrEnglish,RDA,total
      "010002197",0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,4
      "01000288X",0,0,1,0,0,1,0,1,2,0,0,0,0,0,0,0,0,0,0,0,5
      "010004483",0,0,1,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,5
      "010018883",0,0,0,0,1,0,0,1,2,0,0,0,0,0,0,0,1,1,0,0,6
      "010023623",0,0,3,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,7
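In the sample rows the last column equals the sum of the component scores, so a downstream script can sanity-check the file before analysing it. A minimal sketch, assuming that "total" is always the plain sum (this holds for the rows shown; the slides do not state it as a rule):

```python
import csv
import io

# Header and two rows copied from the output above.
SAMPLE = '''id,ISBN,Authors,Alternative Titles,Edition,Contributors,Series,TOC,Date 008,Date 26X,LC/NLM,LoC,Mesh,Fast,GND,Other,Online,Language of Resource,Country of Publication,noLanguageOrEnglish,RDA,total
"010002197",0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,4
"010023623",0,0,3,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,7
'''


def check_totals(text):
    """Yield (id, stated total, recomputed total) for every row."""
    for row in csv.DictReader(io.StringIO(text)):
        record_id = row.pop("id")
        stated = int(row.pop("total"))
        recomputed = sum(int(score) for score in row.values())
        yield record_id, stated, recomputed


for record_id, stated, recomputed in check_totals(SAMPLE):
    print(record_id, stated, recomputed, stated == recomputed)
```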
  63. K-means clustering, Spark (Scala) ★ increasing the number of clusters decreases the distance from the centroids ★ after a point this gain is not so big (“elbow effect”), in theory ★ cluster annotations: big number of low quality records; small clusters with ‘in between’ quality records; the acceptable average; clusters with good quality records
  64. Indexing with Solr: how to name the fields?
      "marc-tags" format: "100a_ss": "Jung-Baek, Myong Ja", "100ind1_ss": "Surname", "245c_ss": "Vorgelegt von Myong Ja Jung-Baek."
      "human-readable" format: "MainPersonalName_personalName_ss": "Jung-Baek, Myong Ja", "MainPersonalName_type_ss": "Surname", "Title_responsibilityStatement_ss": "Vorgelegt von Myong Ja Jung-Baek."
      "mixed" format: "100a_MainPersonalName_personalName_ss": "Jung-Baek, Myong Ja", "100ind1_MainPersonalName_type_ss": "Surname", "245c_Title_responsibilityStatement_ss": "Vorgelegt von Myong Ja Jung-Baek."
  65. Facetted search interface
  66. accessing every record element
  67. Finding problems with facets ★ Vandenhoeck und Ruprecht ★ Vandenhoeck & Ruprecht ★ Vandenhoeck u. Ruprecht ★ Vandenhoeck ★ Vandenhoek & Ruprecht ★ Vandenhoek und Ruprecht ★ Bandenhoed und Ruprecht ★ Vandenhoeck et Ruprecht ★ Vandenhoeck & Reprecht ★ Vandenhoed und Ruprecht ★ V&R unipress ★ V&R Unipress ★ V & R Unipress ★ V & R unipress
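Some of these variants differ only in the connective ("und", "&", "u.", "et"), in spacing or in letter case, and those can be merged with a normalised grouping key. A minimal sketch of that idea; the connective list is taken from the facet values above, and the helper names are hypothetical:

```python
import re
from collections import defaultdict

# Connective words seen in the facet values above, all mapped to "&".
CONNECTIVES = re.compile(r"\s*(?:&|\bund\b|\bet\b|\bu\.)\s*", re.IGNORECASE)


def normalise(name):
    """Build a grouping key: unified connective, single spaces, lower case."""
    key = CONNECTIVES.sub("&", name.strip())
    return re.sub(r"\s+", " ", key).lower()


def group_variants(names):
    """Map each normalised key to the list of spellings that produce it."""
    groups = defaultdict(list)
    for name in names:
        groups[normalise(name)].append(name)
    return dict(groups)


names = ["Vandenhoeck und Ruprecht", "Vandenhoeck & Ruprecht",
         "Vandenhoeck u. Ruprecht", "Vandenhoek & Ruprecht",
         "V&R unipress", "V & R Unipress"]
for key, variants in group_variants(names).items():
    print(key, variants)
```

Misspellings such as "Vandenhoek" or "Bandenhoed" still end up in their own groups; this sketch only unifies the connective, the spacing and the case.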
  68. Usage in DH. Benjamin Schmidt (2017) A brief visual history of MARC cataloging at the Library of Congress. http://sappingattention.blogspot.de/2017/05/a-brief-visual-history-of-marc.html ★ 1. extract fields from MARC ★ 2. data cleaning ★ 3. visualize with R
  69. Extract data
      ./formatter --selector "260c;008~0-5" [file] > dates.tsv
      or put it into a cleaning pipeline:
      ./formatter --selector "260c;008~0-5" [file] | sed ... | grep ... | awk ... > dates.tsv
      raw values (260c, 008~0-5) → cleaned values (publication, record):
      1977.    780804 → 1977  1978-08-04
      1977.    781121 → 1977  1978-11-21
      [1973].  740215 → 1973  1974-02-15
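The cleaning step between the raw and the cleaned columns can be sketched in a few lines. The slides use sed, grep and awk for this; the Python below is a stand-in, and the century pivot for two-digit years is an assumption chosen for illustration, not a rule taken from the talk.

```python
import re
from datetime import date

# Assumption: two-digit years in 008/00-05 belong to the 1900s from this
# pivot upwards and to the 2000s below it. The pivot value is illustrative.
PIVOT = 60


def publication_year(value):
    """Extract the first four-digit year from a 260$c value, or None."""
    match = re.search(r"\d{4}", value)
    return int(match.group()) if match else None


def record_date(yymmdd):
    """Turn a yymmdd string into an ISO date string."""
    if not re.fullmatch(r"\d{6}", yymmdd):
        raise ValueError(f"expected six digits, got {yymmdd!r}")
    year = int(yymmdd[0:2])
    year += 1900 if year >= PIVOT else 2000
    return date(year, int(yymmdd[2:4]), int(yymmdd[4:6])).isoformat()


for raw_year, raw_date in [("1977.", "780804"), ("[1973].", "740215")]:
    print(publication_year(raw_year), record_date(raw_date))
```

Taking the first four-digit run is a blunt rule: it handles "1977." and "[1973].", but date ranges or approximate dates in 260$c would need more care.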
  70. Filtering out extreme values
      data %>% filter(publication > 2018) %>% arrange(desc(publication))
        publication record
              <int>  <int>
      1        5732   1990
      2        4185   2013
      3        2201   2012
      4        2030   2015
      5        2022   2016
      6        2020   2011
      7        2019   2015
  71. cataloging frontline ★ intensive backward cataloging, maybe importing? ★ backward cataloging is still intensive, the tendency continues ★ peak is > 13K ★ 2000-07-10, the “golden day”: 95K new records ★ forward cataloging
  73. reproducibility of science ❏ accessing users (first one: Gent) ❏ making it easy to use (downloadable binaries, helper scripts, documentation) ❏ distribution via Maven Central ❏ continuous integration (Travis CI) ❏ code coverage report ❏ list of freely reusable library catalogs ❏ licencing (GPL-3.0)
  74. available catalogs to measure ❏ Library of Congress ❏ Harvard University Library ❏ Columbia University Library ❏ Deutsche Nationalbibliothek ❏ Universiteitsbibliotheek Gent ❏ Bibliotheksservice-Zentrum Baden-Württemberg ❏ Bibliotheksverbund Bayern ❏ University of Michigan Library ❏ Toronto Public Library ❏ Leibniz-Informationszentrum Technik und Naturwiss. Universitätsbibliothek (TIB) ❏ Répertoire International des Sources Musicales ❏ ETH-Bibliothek (Swiss Federal Institute of Technology in Zurich) ❏ British Library ❏ Talis ★ https://github.com/pkiraly/metadata-qa-marc#datasources
  75. future work ❏ implementing more validation rules ❏ visual dashboard ❏ communication with catalogers ❏ writing articles/dissertation
  76. authority entries ★ Responsibility statement: Herr Seele (tekeningen); Toon Coussement (foto's); Peter Claes, Kris Coremans en Hera Van Sande, vakgroep architectuur en stedenbouw Universiteit Gent (vormgeving). ★ Authority entries: ❏ Herr Seele ❏ Coussement, Toon ❏ Claes, Peter ❏ Van Sande, Hera ★ Kris Coremans is missing!
  77. everything else … at least regarding this project ★ https://github.com/pkiraly/metadata-qa-marc ★ http://pkiraly.github.io ★ https://twitter.com/kiru ★ peter.kiraly@gwdg.de
