University library of KU Leuven presentation at "Succeed in Digitisation. Spreading Excellence" Conference. Validation and take-up of text digitisation tools.
2. University Library of KU Leuven
Divisions involved:
Arts Faculty Library
•Collections and services focused on ongoing research and teaching in the Faculty of Arts
•Some special collections (e.g. Gulden Librije)
LIBIS
•Provides services for libraries, museums and archives (inside and outside the university)
Digitisation Unit
•Includes the Digital Lab: high-tech digital photography centre
3. Why did we get involved?
Digitisation infrastructure and experience already in place, but focused on visualisation => now: digitisation of textual material, with a view to creating digital text corpora for research
http://www.arts.kuleuven.be/ono/meso/projects/digitalisatie
http://www.illuminare.be/rich_project
http://www.europeana-photography.eu
4. Corpus
13 books from the pretiosa collection of the Gulden Librije:
-translations from Latin
-books that had not been digitised yet
Titles:
•Augustinus, Stad Gods (1876-8)
•Augustinus, Belydenis (1741)
•Boëthius, Vertroostinge der wysgeerte (1703)
•Horatius, Over de dichtkunst (1866)
•Horatius, Hekeldichten en brieven (1728)
•Nepos, Leevens van doorlugtige mannen (1796)
•Nepos, Leeven der doorluchtige veld-ooversten (1726)
•Ovidius, Treur-digten (1814-5)
•Ovidius, Treur-gesangen (1692)
•Seneca, Christelycke Seneca (1705)
•Tacitus, Vande ghedenkwaerdige geschiedenissen der Romeinen (1645)
•Vergilius, Wercken (1737)
•Vergilius, Aeneis (1662)
5. Assumptions
•As automated as possible
•Try things out as soon as possible, in order to fail early
•Use ALTO format throughout the workflow
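To illustrate what carrying ALTO through the workflow means in practice, here is a minimal sketch that reads the recognised text back out of an ALTO file. In ALTO, each recognised word is a String element with a CONTENT attribute inside a TextLine. The sample document and the helper names (local_name, alto_lines) are made up for illustration; the tag names are matched without their namespace because the namespace URI differs between ALTO versions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced ALTO document (illustration only).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="p1">
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine ID="l1">
            <String CONTENT="Stad"/>
            <SP/>
            <String CONTENT="Gods"/>
          </TextLine>
          <TextLine ID="l2">
            <String CONTENT="Eerste"/>
            <SP/>
            <String CONTENT="boek"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return the text of every TextLine, words joined by single spaces."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) == "TextLine":
            words = [
                child.get("CONTENT", "")
                for child in elem
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


for line in alto_lines(SAMPLE_ALTO):
    print(line)
```

The same traversal can feed the plain text to an evaluation or NER step while the coordinates stay in the ALTO file.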
6. Workflow OCR
Attestation
Improving
•User pattern training
•Use dictionary
•Improve images
Executing OCR
Digitisation
Evaluation set
ocrevalUAtion
Lesson learnt: a high error rate is not necessarily bad
Aletheia
•Create ground truth
•User friendly
Lessons learnt:
•B&W images
•Remove border
•Biggest problem: letters from other pages showing through
ABBYY FineReader engine
•Useful sample applications
•Windows
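The evaluation step compares OCR output against the ground truth. A common measure for this is the character error rate: the edit distance between the two texts divided by the length of the ground truth. The sketch below is a plain Levenshtein implementation for illustration only; it is not the code of ocrevalUAtion, and the function names and sample strings are assumptions.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ground_truth, ocr_output):
    """Edit distance normalised by the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)


# Hypothetical example: one misread character in a nine-letter word.
print(character_error_rate("wysgeerte", "wvsgeerte"))
```

Because the rate is normalised by the ground-truth length, a few systematic misreadings of old typefaces can produce a high figure on a short sample.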
7. Workflow NER
Attestation
Training set
Test set
Execute NER
Model
Input
Europeana Newspaper NER
•ALTO input from OCR
•Lesson learnt: a lot of resources (RAM) needed
INL Attestation tool
Lesson learnt: a lot more ground truth is needed than for OCR
NERT of INL
80/20 split training/test
NERT of INL
•Different split training and test set
•Create variants from old spelling
Improving
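The 80/20 split of the attested material into a training set and a test set can be sketched as follows. This is an illustration only: the unit being split (here, sentences), the function name and the fixed seed are assumptions, and the input format expected by NERT is not shown.

```python
import random


def split_train_test(items, train_fraction=0.8, seed=42):
    """Shuffle the items reproducibly and cut them into training and test sets."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed: same split on every run
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical placeholder data standing in for annotated sentences.
sentences = [f"sentence {n}" for n in range(10)]
train, test = split_train_test(sentences)
print(len(train), "training items,", len(test), "test items")
```

Trying a different split, as mentioned under the improvement step, only requires changing train_fraction or the seed.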
8. Results NER
                          Precision   Recall   F1
Overall                   0.6257      0.5130   0.5638
Location                  0.675       0.2903   0.40601
Organization              1.0         0.1666   0.2857
Person                    0.6207      0.5571   0.5871
Segmentation              0.6634      0.5438   0.5977
Classification accuracy   0.9433
•Precision: > 60% recognised correctly
•Recall: ≈ 50% of the entities found
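The F1 values reported on this slide are the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes them from the reported precision and recall figures as a consistency check; the function name and the dictionary layout are illustrative choices.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as given on the slide.
reported = {
    "Overall": (0.6257, 0.5130, 0.5638),
    "Location": (0.675, 0.2903, 0.40601),
    "Organization": (1.0, 0.1666, 0.2857),
    "Person": (0.6207, 0.5571, 0.5871),
    "Segmentation": (0.6634, 0.5438, 0.5977),
}

for label, (p, r, f1) in reported.items():
    print(f"{label:13s} computed F1 = {f1_score(p, r):.4f} (reported {f1})")
```

The Organization row shows why F1 is useful: a precision of 1.0 looks perfect, but with a recall of 0.1666 most organisations are missed, and the F1 value reflects that.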
9. Results NER, an experiment
Input
Corrected file
Training file
Test file
Split
Combine
                          Precision   Recall   F1
Overall                   0.8398      0.7954   0.8170
Location                  0.8741      0.6720   0.7599
Organization              1.0         0.5      0.6666
Person                    0.8320      0.8320   0.8320
Segmentation              0.8920      0.8448   0.8677
Classification accuracy   0.9415
•Precision: 80% recognised correctly
•Recall: ≈ 80% of the entities found
10. Next steps
•Create an OCR and NER platform for the university and as part of the LIBIS services
•New project about OCR and (early modern) Latin texts
•Looking into other tools:
•Lexicon building
•Border detection
•Automatically remove ‘noise’ from a page
•NER: learning to use Latin (and Greek)