University Library of KU Leuven 
Sam Alloing and Demmy Verbeke
University Library of KU Leuven 
Divisions involved: 
Arts Faculty Library 
•Collections and services focused on ongoing research and teaching in the Faculty of Arts 
•Some special collections (e.g. Gulden Librije) 
LIBIS 
•Provides services for libraries, museums and archives (inside and outside the university) 
Digitisation Unit 
•Among others, the Digital Lab: a high-tech digital photography centre
Why did we get involved? 
We already had digitization infrastructure and experience, but it was focused on visualization. The new goal: digitizing textual material with a view to creating digital text corpora for research.
http://www.arts.kuleuven.be/ono/meso/projects/digitalisatie 
http://www.illuminare.be/rich_project 
http://www.europeana-photography.eu
Corpus 
13 books from the pretiosa collection of the Gulden Librije: 
-translations from Latin 
-books that had not been digitized yet
Titles:
•Augustinus, Stad Gods (1876-8)
•Augustinus, Belydenis (1741)
•Boëthius, Vertroostinge der wysgeerte (1703)
•Horatius, Over de dichtkunst (1866)
•Horatius, Hekeldichten en brieven (1728)
•Nepos, Leevens van doorlugtige mannen (1796)
•Nepos, Leeven der doorluchtige veld-ooversten (1726)
•Ovidius, Treur-digten (1814-5)
•Ovidius, Treur-gesangen (1692)
•Seneca, Christelycke Seneca (1705)
•Tacitus, Vande ghedenkwaerdige geschiedenissen der Romeinen (1645)
•Vergilius, Wercken (1737)
•Vergilius, Aeneis (1662)
Assumptions 
•As automated as possible 
•Try as soon as possible, to fail early 
•Use ALTO format throughout the workflow
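Because ALTO is the exchange format at every step, it helps to see how little is needed to read the recognised text back out of an ALTO file. The following is a minimal sketch, not the project's own code: the sample document is made up for illustration, and the function name `alto_lines` is hypothetical. It relies only on the standard ALTO structure in which each `TextLine` contains `String` elements whose `CONTENT` attribute holds a word.

```python
import xml.etree.ElementTree as ET

# Made-up ALTO fragment for illustration only (not taken from the corpus).
SAMPLE_ALTO = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine>
            <String CONTENT="Leevens"/>
            <SP/>
            <String CONTENT="van"/>
          </TextLine>
          <TextLine>
            <String CONTENT="doorlugtige"/>
            <SP/>
            <String CONTENT="mannen"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def alto_lines(xml_text):
    """Return the text of every TextLine as one space-joined string."""
    root = ET.fromstring(xml_text)
    lines = []
    # "{*}" matches any namespace (Python 3.8+), so the ALTO version
    # declared in the file does not matter.
    for text_line in root.findall(".//{*}TextLine"):
        words = [s.get("CONTENT", "") for s in text_line.findall("{*}String")]
        lines.append(" ".join(words))
    return lines


if __name__ == "__main__":
    for line in alto_lines(SAMPLE_ALTO):
        print(line)
```

The namespace wildcard is a deliberate choice: OCR engines and downstream tools may emit different ALTO schema versions, and matching on the local element name keeps one reader usable across them.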
Workflow OCR 
Attestation 
Improving 
•User pattern training 
•Use dictionary 
•Improve images 
Executing OCR 
Digitisation 
Evaluation set 
ocrevalUAtion 
•Lesson learnt: a high error rate is not necessarily bad
Aletheia 
•Create ground truth 
•User friendly 
Lessons learnt: 
•B&W images 
•Remove border 
•Biggest problem: bleed-through (letters from other pages showing through)
ABBYY FineReader engine 
•Useful sample applications 
•Windows
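The evaluation step compares OCR output against ground truth. A common measure for this is the character error rate (CER): the edit distance between the two texts divided by the length of the ground truth. The sketch below shows that idea in plain Python. It is an illustration of the metric only, not the implementation used by ocrevalUAtion, and the function names and sample strings are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def character_error_rate(ground_truth, ocr_text):
    """Edit distance normalised by the ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_text) / len(ground_truth)


if __name__ == "__main__":
    # Made-up example: one substituted character in an 11-character word.
    print(character_error_rate("doorlugtige", "doorlngtige"))
```

Note that CER can exceed 1.0 when the OCR output contains many spurious characters, which is one reason a raw error figure needs interpreting before it is judged good or bad.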
Workflow NER 
Attestation 
Training set 
Test set 
Execute NER 
Model 
Input 
Europeana Newspaper NER 
•ALTO input from OCR 
•Lesson learnt: a lot of resources (RAM) needed 
INL Attestation tool 
•Lesson learnt: much more ground truth is needed than for OCR 
NERT of INL 
80/20 split training/test 
NERT of INL 
•Different split training and test set 
•Create variants from old spelling 
Improving
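The 80/20 split between training and test data mentioned above can be sketched as follows. This is a generic illustration, not how NERT performs the split: the function name `split_train_test`, the fixed seed, and the choice to shuffle at sentence level are all assumptions made for the example.

```python
import random


def split_train_test(sentences, train_fraction=0.8, seed=42):
    """Shuffle annotated sentences and cut them into training and test sets.

    The seed is fixed so the same split can be reproduced between runs.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(sentences)           # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # Placeholder data standing in for annotated sentences.
    data = [f"sentence {i}" for i in range(10)]
    train, test = split_train_test(data)
    print(len(train), len(test))
```

Trying a different split, as listed under "Improving", then amounts to changing `train_fraction` or the seed and retraining.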
Results NER 
               Precision   Recall   F1
Overall        0.6257      0.5130   0.5638
Location       0.675       0.2903   0.40601
Organization   1.0         0.1666   0.2857
Person         0.6207      0.5571   0.5871
Segmentation   0.6634      0.5438   0.5977
Classification accuracy: 0.9433 
> 60% recognised correctly 
≈ 50% of the entities found
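The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). A short sketch makes it easy to check the reported figures; the function name `f1_score` is chosen for this example, and the inputs are the overall and person scores from the results above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision and recall as reported for "Overall" and "Person".
    print(round(f1_score(0.6257, 0.5130), 4))
    print(round(f1_score(0.6207, 0.5571), 4))
```

Because the harmonic mean is pulled towards the lower of the two values, a category with perfect precision but low recall (such as Organization) still ends up with a low F1.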
Results NER, an experiment 
Diagram labels: Input, Corrected file, Training file, Test file, Split, Combine 
               Precision   Recall   F1
Overall        0.8398      0.7954   0.8170
Location       0.8741      0.6720   0.7599
Organization   1.0         0.5      0.6666
Person         0.8320      0.8320   0.8320
Segmentation   0.8920      0.8448   0.8677
Classification accuracy: 0.9415 
80% recognised correctly 
≈ 80% entities found
Next steps 
•Create an OCR and NER platform for the university and as part of the LIBIS services 
•New project about OCR and (early modern) Latin texts 
•Looking into other tools: 
•Lexicon building 
•Border detection 
•Automatically remove ‘noise’ from a page 
•NER: 
•Learning to use Latin (and Greek)
Thanks! 
Questions? 
•Sam Alloing (Sam.Alloing@libis.kuleuven.be) 
•Demmy Verbeke (Demmy.Verbeke@arts.kuleuven.be; @viroviacum) 
•http://bib.kuleuven.be/english/ub
