University Library of KU Leuven - Sam Alloing and Demmy Verbeke

University Library of KU Leuven
Sam Alloing and Demmy Verbeke

University Library of KU Leuven
Divisions involved:
Arts Faculty Library
•Collections and services focused on ongoing research and teaching in the Faculty of Arts
•Some special collections (e.g. Gulden Librije)
LIBIS
•Provides services for libraries, museums and archives (inside and outside the university)
Digitisation Unit
•Among others, the Digital Lab: a high-tech digital photography centre

Why did we get involved?
Digitization infrastructure and experience were already in place, but focused on visualization => now: digitization of textual material with a view to creating digital text corpora for research
http://www.arts.kuleuven.be/ono/meso/projects/digitalisatie
http://www.illuminare.be/rich_project
http://www.europeana-photography.eu

Corpus
13 books from the pretiosa collection of the Gulden Librije:
-translations from Latin
-books that had not been digitized yet
Augustinus, Stad Gods (1876-8)
Augustinus, Belydenis (1741)
Boëthius, Vertroostinge der wysgeerte (1703)
Horatius, Over de dichtkunst (1866)
Horatius, Hekeldichten en brieven (1728)
Nepos, Leevens van doorlugtige mannen (1796)
Nepos, Leeven der doorluchtige veld-ooversten (1726)
Ovidius, Treur-digten (1814-5)
Ovidius, Treur-gesangen (1692)
Seneca, Christelycke Seneca (1705)
Tacitus, Vande ghedenkwaerdige geschiedenissen der Romeinen (1645)
Vergilius, Wercken (1737)
Vergilius, Aeneis (1662)

Assumptions
•As automated as possible
•Try as soon as possible, to fail early
•Use ALTO format throughout the workflow
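Keeping ALTO as the single exchange format means every step can read the same XML. As a minimal sketch of what that looks like in practice, the snippet below pulls the text lines and the mean word confidence out of an ALTO document using only the Python standard library. The sample fragment, its confidence values, and the function names are assumptions made for illustration; they are not taken from the project's files.

```python
import xml.etree.ElementTree as ET

# Hypothetical ALTO fragment for illustration; real files carry more metadata.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine>
            <String CONTENT="Stad" WC="0.93"/>
            <SP/>
            <String CONTENT="Gods" WC="0.71"/>
          </TextLine>
          <TextLine>
            <String CONTENT="Eerste" WC="0.88"/>
            <SP/>
            <String CONTENT="boek" WC="0.64"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return one string per TextLine, joining the CONTENT of its String children."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [
                child.get("CONTENT", "")
                for child in element
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


def mean_word_confidence(xml_text):
    """Average the WC (word confidence) attribute over all String elements that have one."""
    root = ET.fromstring(xml_text)
    scores = [
        float(element.get("WC"))
        for element in root.iter()
        if local_name(element.tag) == "String" and element.get("WC") is not None
    ]
    return sum(scores) / len(scores) if scores else None


if __name__ == "__main__":
    print(alto_lines(SAMPLE_ALTO))
    print(mean_word_confidence(SAMPLE_ALTO))
```

Matching on the local element name rather than a fixed namespace is a deliberate choice here, since different OCR tools may emit different ALTO schema versions.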

Workflow OCR
Attestation
Improving
•User pattern training
•Use dictionary
•Improve images
Executing OCR
Digitisation
Evaluation set
ocrevalUAtion
Lesson learnt: a high error rate is not necessarily bad
Aletheia
•Create ground truth
•User friendly
Lessons learnt:
•B&W images
•Remove border
•Biggest problem: bleed-through (letters from other pages showing through)
ABBYY FineReader engine
•Useful sample applications
•Windows
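
Evaluating OCR output against ground truth usually comes down to an edit-distance comparison. The sketch below computes a character error rate as Levenshtein distance divided by the length of the ground truth. It is a generic illustration, not the exact computation performed by ocrevalUAtion, and the sample strings are invented.

```python
def levenshtein(reference, hypothesis):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalised by the length of the ground-truth text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Invented example: a long s misread as f in an early modern Dutch word.
    ground_truth = "wysgeerte"
    ocr_output = "wyfgeerte"
    print(character_error_rate(ground_truth, ocr_output))
```

Because the rate is normalised by the ground-truth length, it can exceed 1.0 when the OCR output contains many spurious characters.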

Workflow NER
Attestation
Training set
Test set
Execute NER
Model
Input
Europeana Newspaper NER
•ALTO input from OCR
•Lesson learnt: a lot of resources (RAM) needed
INL Attestation tool
Lesson learnt: a lot more ground truth needed than for OCR
NERT of INL
80/20 split training/test
NERT of INL
•Different split training and test set
•Create variants from old spelling
Improving
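
The 80/20 split of the attested data into training and test sets can be sketched as follows. This is a generic, reproducible shuffle-and-cut in plain Python; the function name, the seed, and the placeholder sentences are assumptions for illustration, and NERT may provide its own way of preparing the two sets.

```python
import random


def split_train_test(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of the items and cut it into a training and a test part."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(items)
    # A fixed seed keeps the split identical between runs, so results stay comparable.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # Placeholder identifiers standing in for annotated sentences.
    sentences = [f"sentence-{n}" for n in range(10)]
    training, test = split_train_test(sentences)
    print(len(training), len(test))
```

Changing `train_fraction` gives a different split between training and test set, which is one of the improvement options listed above.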

Results NER
                 Precision   Recall    F1
Overall          0.6257      0.5130    0.5638
Location         0.675       0.2903    0.40601
Organization     1.0         0.1666    0.2857
Person           0.6207      0.5571    0.5871
Segmentation     0.6634      0.5438    0.5977

Classification accuracy: 0.9433
> 60% recognised correctly
≈ 50% of the entities found
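
For readers less familiar with these measures: precision is the share of predicted entities that are correct, recall is the share of true entities that were found, and F1 is their harmonic mean. The sketch below shows the standard formulas; the entity counts in the usage example are invented, while the precision and recall passed to `f1_score` are the overall figures reported above.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision and recall from raw entity counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Invented counts: 6 entities found correctly, 2 wrongly predicted, 4 missed.
    p, r = precision_recall(6, 2, 4)
    print(p, r, f1_score(p, r))

    # Overall precision and recall as reported on the slide.
    print(round(f1_score(0.6257, 0.5130), 4))
```

The Organization row illustrates why F1 is useful: precision is perfect, but the low recall pulls the combined score down sharply.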

Results NER, an experiment
Input
Corrected file
Training file
Test file
Split
Combine
                 Precision   Recall    F1
Overall          0.8398      0.7954    0.8170
Location         0.8741      0.6720    0.7599
Organization     1.0         0.5       0.6666
Person           0.8320      0.8320    0.8320
Segmentation     0.8920      0.8448    0.8677

Classification accuracy: 0.9415
80% recognised correctly
≈ 80% entities found

Next steps
•Create an OCR and NER platform for the university and as part of the LIBIS services
•New project about OCR and (early modern) Latin texts
•Looking into other tools:
•Lexicon building
•Border detection
•Automatically remove ‘noise’ from a page
•NER:
•Learning to use Latin (and Greek)

Thanks!
Questions?
•Sam Alloing (Sam.Alloing@libis.kuleuven.be)
•Demmy Verbeke (Demmy.Verbeke@arts.kuleuven.be; @viroviacum)
•http://bib.kuleuven.be/english/ub
