IMPACT Final Event 26-06-2012 - Library experiences in IMPACT: National and University Library of Slovenia by Alenka Kavčič-Čolić (NUK)
1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
Library experiences in IMPACT:
THE NATIONAL AND UNIVERSITY LIBRARY OF SLOVENIA
Alenka KAVČIČ-ČOLIĆ, Ines VODOPIVEC
Library Research Centre, National and University Library
Ljubljana, Slovenia
in cooperation with
Tomaž ERJAVEC
Dept. of Knowledge Technologies, Jožef Stefan Institute
Ljubljana, Slovenia
OUTLINE
• Introduction
• Cooperation in the IMPACT project:
• OCR improvement
• Lexicon building
• Improvement of information retrieval on historical document collections
• Benefits overview
IMPACT Outcomes, 26 June 2012, KB, The Hague
The National and University Library of Slovenia
(Narodna in univerzitetna knjižnica - NUK)
Digital Library of Slovenia (2005 - )
NUK's entire online digital collection comprises more than 4 million scans and digital objects, including:
– 19,000 pages of scientific journals,
– 402,714 pages of newspapers,
– 9,540 photographs,
– 100 music records,
– 15 3D objects,
– 3 virtual exhibitions etc.
NUK statistics
More than 200,000 visitors per year
More than 1 million remote users
Approx. 2.8 million visits to the digital library portal dLib.si (2011)
[Pie chart: library members by category (school children, students, university employees, general public, foreign citizens); shares shown: 2.30%, 1.80%, 21.50%, 3.10%, 71.30%]
Slovene historical documents
PDF file and HTML preview
Documents published before 1850
Example of poor OCR in an HTML preview
Example of a historical text by Linhart from the 18th century
Cooperation in IMPACT
NUK & JSI joined the IMPACT project in its 2nd extension
(1 Apr. 2010 – 31 Dec. 2011)
Goals to achieve in the project:
1. Lexicon building (JSI)
2. Improving OCR on historical documents by using special
lexica for historical language (NUK & JSI)
3. Improving information retrieval on historical document
collections (NUK)
The 3 goals were interdependent
1. OCR improvement (NUK)
2. Lexicon building (JSI)
3. Improved IR in old texts (NUK & JSI)
Corpus selection (JSI & NUK):
Selection of typical documents from the second half of the 18th century and from the 19th century
Materials from several sources:
– dLib (www.dlib.si)
– AHLib (http://nl.ijs.si/ahlib/): books from 1848-1918, translated from German originals
Dataset: 41,313 digitized pages of historical newspapers & books from the
18th-19th century
Subset of approx. 5,000 scans for Groundtruth (GT)* production
(* a dataset of high-quality transcriptions of historical texts which also serves as a basis for the
production of the Lexicon for historical Slovene)
Characteristics of the dataset:
Errors originating from language characteristics
– Two basic historical alphabets (Bohoričica / Gajica)
– Historical language / vocabulary → poor vocabulary recognition
– Special characters, digraphs and ligatures
Errors originating from print properties
– Latin and Gothic typefaces
– Complex page structure or segmentation
– Irregular spacing between letters, words and columns
– Irregular / changing font sizes
– Poor paper quality
– Inconsistent inking
Errors originating from digitisation procedures
– Specific characteristics of the originals: staining, foxing, paper warping caused by humidity
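The difference between the two historical alphabets listed above can be illustrated with a small transliteration sketch. It is not part of the IMPACT tooling; it only applies the textbook Bohoričica-to-Gajica letter correspondences (ſh → š, sh → ž, zh → č, ſ → s, s → z, z → c) to lowercase words, and it assumes consistent spelling, which real historical prints often do not have. The function name is hypothetical.

```python
import re

# Simplified Bohoričica -> Gajica correspondences.
# Assumption: lowercase input and consistent historical spelling.
BOHORIC_TO_GAJ = {
    "ſh": "š",
    "sh": "ž",
    "zh": "č",
    "ſ": "s",
    "s": "z",
    "z": "c",
}

# Longer keys come first so that digraphs win over single letters.
_PATTERN = re.compile(
    "|".join(sorted((re.escape(k) for k in BOHORIC_TO_GAJ), key=len, reverse=True))
)


def bohoricica_to_gajica(word: str) -> str:
    """Transliterate one word in a single pass, so that replacements do not chain."""
    return _PATTERN.sub(lambda m: BOHORIC_TO_GAJ[m.group(0)], word)


if __name__ == "__main__":
    for w in ["zhlovek", "ſhola", "shena", "ſonze", "sima"]:
        print(w, "->", bohoricica_to_gajica(w))
```

The single-pass regular expression matters here: replacing s with z and then z with c in two separate steps would turn every s into c.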
Groundtruth production (NUK)
Digitisation and pre-processing of scans
OCR procedure
Post-processing of the text
GT production
– Error correction (manual)
– Page segmentation and reading order (Aletheia)
– OCR outcomes encoded in PAGE XML
Evaluation
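One common way to evaluate OCR output against ground truth is character accuracy based on edit distance. The slides do not state which metric was used in the project, so the sketch below is only an assumption about how such an evaluation could be done; the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(
                min(
                    prev[j] + 1,               # delete ca
                    cur[j - 1] + 1,            # insert cb
                    prev[j - 1] + (ca != cb),  # substitute, or match at no cost
                )
            )
        prev = cur
    return prev[-1]


def character_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Return 1 - (edit distance / ground-truth length), floored at 0."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = levenshtein(ocr_text, ground_truth)
    return max(0.0, 1.0 - errors / len(ground_truth))


if __name__ == "__main__":
    # One wrong character out of seven.
    print(round(character_accuracy("zhlovck", "zhlovek"), 3))
```

Word-level accuracy can be computed the same way by passing lists of tokens instead of strings to a list-based variant of the distance function.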
Lexicon building for historical Slovene
Developed by the Jožef Stefan Institute (JSI)
Built to be incorporated in OCR & IR
But also as a human-readable reference and as a training &
testing set for Human Language Technologies
Two-stage development:
– Reference corpus of historical Slovene
– Lexicon of historical Slovene
Reference corpus goo300k
Period      Units   Pages   Tokens
1584            1       8     6000
1695            1      27    10000
1751-1800       8     155    27000
1801-1850      12     206    74000
1851-1875      36     380   126000
1876-1900      23     224    51000
∑              81    1000   296000
Page sampled
Each word annotated with:
– Contemporary equivalent
– Modern lemma
– Part-of-speech tag
– Gloss for archaic words (lemmas)
First automatically, then manually corrected:
– Institute for Dutch Lexicology (INL) CoBaLT Lexicon Tool
– A team of annotators
– Also correcting errors in transcription
Available via a concordancer + download (CC-BY licence)
Development supported by Google humanities research award (JSI + ZRC SAZU)
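The per-word annotation described above can be pictured as a small record, and a lexicon can be derived by grouping those records. The sketch below is only an illustration under stated assumptions: the class, field names, tag values and sample tokens are hypothetical and do not reflect the actual goo300k encoding (which is TEI) or the CoBaLT data model.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class AnnotatedToken:
    form: str                    # word as printed in the historical text
    modern: str                  # contemporary equivalent of the word-form
    lemma: str                   # modern lemma
    pos: str                     # part-of-speech tag (illustrative tag set)
    gloss: Optional[str] = None  # only filled in for archaic lemmas


def dump_lexicon(
    tokens: Iterable[AnnotatedToken],
) -> Dict[Tuple[str, str], Dict[str, List[str]]]:
    """Group tokens into entries keyed by (lemma, pos).

    Each entry maps a modernised word-form to the sorted list of
    historical word-forms attested for it.
    """
    grouped = defaultdict(lambda: defaultdict(set))
    for t in tokens:
        grouped[(t.lemma, t.pos)][t.modern].add(t.form)
    return {
        key: {modern: sorted(forms) for modern, forms in by_modern.items()}
        for key, by_modern in grouped.items()
    }


if __name__ == "__main__":
    sample = [
        AnnotatedToken("zhlovek", "človek", "človek", "NOUN"),
        AnnotatedToken("zhloveka", "človeka", "človek", "NOUN"),
        AnnotatedToken("zhlovek", "človek", "človek", "NOUN"),  # repeated token
    ]
    print(dump_lexicon(sample))
```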
Historical lexicon
Lexicon          n
Lex. entries     70,000
Word-forms       68,000
Modernised       50,000
Lemmas           24,000
Glosses           1,900
Lexicon dump from goo300k
+ additional lexicon from full-text collection
First automatically, then manually corrected:
– INL CoBaLT Lexicon Tool
– A team of annotators
Dual role:
– As a human-readable lexicon of historical Slovene
– For HLT applications (e.g. IR)
Available via a web browser + download (CC-BY)
JSI also developed ToTrTaLe – a tool for processing historical (Slovene) text, which
annotates words in a TEI encoded corpus with their modern-day equivalents, PoS tags and
lemmas
– Uses the Vaam finite-state library (Centrum für Informations- und Sprachverarbeitung, University of Munich)
Example lexical entry
(converted from TEI to HTML)
JSI resources available from http://nl.ijs.si/imp/
Improved IR in old texts (NUK & JSI)
Integration of the historical lexicon developed by JSI into the full-text search engine of dLib.si
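A lexicon of this kind is typically used for query expansion: the user types a modern word and the search also matches the historical word-forms recorded for it. The sketch below shows the idea only. It is not the dLib.si implementation, and the lexicon rows and function names are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical lexicon rows: (historical word-form, modern equivalent).
LEXICON_ROWS = [
    ("zhlovek", "človek"),
    ("zhloveka", "človeka"),
    ("ſhola", "šola"),
    ("shena", "žena"),
]


def build_expansion_index(rows: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each modern word to the set of historical forms recorded for it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for historical, modern in rows:
        index[modern.lower()].add(historical.lower())
    return index


def expand_query(query: str, index: Dict[str, Set[str]]) -> List[List[str]]:
    """Return, for each query term, the term followed by its historical variants.

    Each inner list is meant to be OR-ed together by the search engine.
    """
    expanded = []
    for term in query.lower().split():
        variants = sorted(index.get(term, ()))  # .get() avoids creating empty keys
        expanded.append([term] + variants)
    return expanded


if __name__ == "__main__":
    idx = build_expansion_index(LEXICON_ROWS)
    print(expand_query("človek šola", idx))
```

Expanding at query time keeps the OCR text untouched. The alternative, modernising the indexed text itself, would need the index to be rebuilt whenever the lexicon changes.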
IMPACT project benefits
- OCR quality improvement:
- For texts in Bohoričica and Gajica, the accuracy of OCR done with ABBYY FineReader increased from
58% to 70% for Bohoričica and 85% for Gajica
- Full-text search with modern words in more than 200,000 digitised historical
documents
- Historical Slovene Lexicon that can be integrated in other Digital Library tools
- Processing tools for large scale digitisation (e.g. Aletheia, FEP …)
- High-quality datasets for R&D of language technologies
- Other „invisible“ benefits:
- Cooperation and integration in national and international networks
- Additional experience in large scale digitisation
- New knowledge (on OCR, language processing, and other tools used in large
scale digitisation)
Literature:
Erjavec, Tomaž. 2012. The goo300k corpus of historical Slovene. Proceedings of the Eighth International Conference on
Language Resources and Evaluation (LREC'12), Istanbul.
Erjavec, Tomaž. Automatic linguistic annotation of historical language: ToTrTaLe and XIX century Slovene. Proceedings of the
5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, 2011, Portland.
Erjavec, Tomaž, Christoph Ringlstetter, Maja Žorga, Annette Gotscharek. A lexicon for processing archaic language: the case of
XIXth century Slovene. Proceedings of WoLeR: ESSLLI Workshop on Lexical Resources, 2011, Ljubljana.
Erjavec, Tomaž, Christoph Ringlstetter, Maja Žorga, Annette Gotscharek. Towards a Lexicon of XIXth Century Slovene.
Proceedings of the Seventh Language Technologies Conference Ljubljana, 2010.
Erjavec, Tomaž, Ines Jerele, Maša Kodrič. 2011. Izdelava korpusa starejših slovenskih besedil v okviru projekta IMPACT. In:
Kranjc, Simona (ed.). Meddisciplinarnost v slovenistiki. Ljubljana: Znanstvena založba Filozofske fakultete, 2011, pp. 41-47.
Erjavec, Tomaž. Slovenska prevodna književnost 1848-1918 : digitalna knjižnica in korpus AHLib. In: Kranjc, Simona (ed.).
Meddisciplinarnost v slovenistiki. Ljubljana: Znanstvena založba Filozofske fakultete, 2011, pp. 33-40.
Gotscharek, A., Neumann, A., Reffle, U., Ringlstetter, C., Schulz, K.U. (2009). Enabling Information Retrieval on Historical
Document Collections - the Role of Matching Procedures and Special Lexica. AND2009 Workshop (23-24 July 2009, Barcelona,
Spain). Also available on http://sites.google.com/site/and2009workshop/.
Jerele, Ines, Erjavec, Tomaž, Pokorn, Daša, Kavčič-Čolić, Alenka. 2012. Optical Character Recognition of Historical Texts: End-
User Focused Research for Slovenian Books and Newspapers from the 18th and 19th Century. Review of the National Center
for Digitization 21/2012, Faculty of Mathematics, Belgrade.
Jerele, I., Erjavec, T., Pokorn, D., Kavčič-Čolić, A. (2011). Optical character recognition of historical texts: end-user focused
research for Slovenian books and newspapers from the 18th and 19th century. In: 6. SEEDI conference : proceedings, (16-20
May 2011, Zagreb, Croatia), p. 11, (not yet published). Also available on http://www.nsk.hr/seedi/seedi-hrv/index.html.
Kenter, Tom, Tomaž Erjavec, Maja Žorga and Darja Fišer. 2012. Lexicon construction and corpus annotation of historical
language with the CoBaLT editor. In Proceedings of the EACL Workshop on Language Technology for Cultural Heritage, Social
Sciences, and Humanities, Avignon, France, April. Association for Computational Linguistics.
Thank you for your attention!
Alenka Kavčič-Čolić
Alenka.kavcic@nuk.uni-lj.si