IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.




           Library experiences in IMPACT:
THE NATIONAL AND UNIVERSITY LIBRARY OF SLOVENIA
                   Alenka KAVČIČ-ČOLIĆ, Ines VODOPIVEC
             Library Research Centre, National and University Library
                               Ljubljana, Slovenia
                                                                in cooperation with
                               Tomaž ERJAVEC
              Dept. of Knowledge Technologies, Jožef Stefan Institute
                               Ljubljana, Slovenia




OUTLINE
• Introduction
• Cooperation in IMPACT project:
   • OCR improvement
   • Lexicon building
   • Improvement of information
     retrieval on historical document
     collections
• Benefits overview


     IMPACT Outcomes, 26 June 2012, KB, The Hague




The National and University Library of Slovenia
(Narodna in univerzitetna knjižnica - NUK)
• Digital Library of Slovenia (2005 - )
• NUK's entire online digital collection comprises more than 4 million
  scans and digital objects, including:
      –      19.000 pages of scientific journals,
      –      402.714 pages of newspapers,
      –      9.540 photographs,
      –      100 music records,
      –      15 3D objects,
      –      3 virtual exhibitions etc.





NUK statistics
• More than 200.000 visitors per year
• More than 1 million remote users
• Approx. 2.8 million visits to the digital library portal dLib.si (2011)
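[Pie chart: library members by category (school children, students, university employees, general public, foreign citizens). Shares shown: 71,30%, 21,50%, 3,10%, 2,30%, 1,80%; the extraction does not preserve which share belongs to which category.]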




Slovene historical documents




                                                     PDF file and HTML preview




Documents published before 1850




                                  Example of poor OCR in an HTML preview




Example of a
historical text by
Linhart from the
18th century








Cooperation in IMPACT
• NUK & JSI joined the IMPACT project in its 2nd extension
   (1st Apr. 2010 – 31st Dec. 2011)
• Goals to achieve in the project:
      1. Lexicon building (JSI)
      2. Improving OCR on historical documents by using special
         lexica for historical language (NUK & JSI)
      3. Improving information retrieval on historical document
         collections (NUK)







The 3 goals were interdependent
   1. OCR improvement (NUK)
   2. Lexicon building (JSI)
   3. Improved IR in old texts (NUK & JSI)








Corpus selection (JSI & NUK):
• Selection of typical documents from the second half of the 18th century and the 19th
  century
• Materials from several sources:
   – dLib (www.dlib.si)
   – AHLib (http://nl.ijs.si/ahlib/) –
      books from 1848-1918, translated from German originals
• Dataset: 41,313 digitised pages of historical newspapers & books from the
  18th-19th century
• Subset of approx. 5,000 scans for ground truth (GT)* production

(* a dataset of high-quality transcriptions of historical texts which also serves as a basis for the
production of the Lexicon for historical Slovene)




Characteristics of the dataset:
• Errors originating from language characteristics
   – Two basic historical alphabets (Bohoričica / Gajica)
   – Historical language / vocabulary → Poor vocabulary recognition
   – Special characters, digraphs and ligatures
• Errors originating from print properties
   – Latin and gothic types
   – Complex page structure or segmentation
   – Irregular spacing between letters, words and columns
   – Irregular / changing font sizes
   – Poor paper quality
   – Inconsistent inking
• Errors originating from digitisation procedures
   – Specific characteristics of the originals: staining, foxing, paper warping caused by humidity
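
As an illustration of why the two alphabets listed above trouble a recogniser and a dictionary built for modern text, the following minimal sketch transliterates lower-case Bohoričica into Gajica. The mapping uses the standard letter correspondences (ſ→s, s→z, z→c, and the digraphs ſh→š, sh→ž, zh→č). The function name and the sample words are illustrative assumptions and not part of the IMPACT tools; capital letters and ambiguous letter sequences are deliberately ignored.

```python
import re

# Standard lower-case Bohoričica -> Gajica correspondences.
# Digraphs are listed first so the regex prefers them over single letters.
BOHORICICA_TO_GAJICA = {
    "ſh": "š",
    "sh": "ž",
    "zh": "č",
    "ſ": "s",
    "s": "z",
    "z": "c",
}

_PATTERN = re.compile("|".join(re.escape(key) for key in BOHORICICA_TO_GAJICA))


def to_gajica(word: str) -> str:
    """Transliterate a lower-case Bohoričica word in a single pass."""
    return _PATTERN.sub(lambda match: BOHORICICA_TO_GAJICA[match.group(0)], word)
```

The single regex pass matters: replacing letter by letter in sequence would turn s into z and then that z into c. A rule table like this cannot resolve every case on its own, which is one reason a lexicon of attested word-forms is useful.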




Ground truth production (NUK)
• Digitisation and pre-processing of scans
• OCR procedure
• Post-processing of the text → GT production
   – Manual error correction
   – Page segmentation and reading order (Aletheia)
   – OCR outcomes encoded in PAGE XML
• Evaluation
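
The evaluation step typically compares OCR output with the ground truth transcription. The following is a minimal sketch, assuming character accuracy is defined as 1 minus the edit distance divided by the ground-truth length. The slides do not state which metric or tool NUK used, so the function names and the definition itself are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Share of ground-truth characters recognised correctly, floored at 0."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = levenshtein(ocr_text, ground_truth)
    return max(0.0, 1.0 - errors / len(ground_truth))
```

For example, an OCR result that confuses a long s with a round s in a four-letter word has one error in four characters, giving an accuracy of 0.75 under this definition.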








Lexicon building for historical Slovene
• Developed by the Jožef Stefan Institute (JSI)
• Built to be incorporated in OCR & IR
• But also as a human-readable reference and as a training &
  testing set for Human Language Technologies
• Two-stage development:
   – Reference corpus of historical Slovene
   – Lexicon of historical Slovene



Reference corpus goo300k

Period       Units   Pages   Tokens
1584             1       8     6000
1695             1      27    10000
1751-1800        8     155    27000
1801-1850       12     206    74000
1851-1875       36     380   126000
1876-1900       23     224    51000
∑               81    1000   296000

• Page sampled
• Each word annotated with:
    – Contemporary equivalent
    – Modern lemma
    – Part-of-speech tag
    – Gloss for archaic words (lemmas)
• First automatically, then manually corrected:
    – Institute for Dutch Lexicology (INL) CoBaLT Lexicon Tool
    – A team of annotators
    – Also correcting errors in transcription
• Available via a concordancer + download (CC-BY licence)
• Development supported by Google humanities research award (JSI + ZRC SAZU)




Historical lexicon

Lexicon        n
Lex. entries   70,000
Word-forms     68,000
Modernised     50,000
Lemmas         24,000
Glosses         1,900

• Lexicon dump from goo300k
• + additional lexicon from full-text collection
    – First automatically, then manually corrected:
    – INL CoBaLT Lexicon Tool
    – A team of annotators
• Dual role:
    – As a human readable lexicon of historical Slovene
    – For HLT applications (e.g. IR)
• Available via a web browser + download (CC-BY)
       JSI also developed ToTrTaLe – a tool for processing historical (Slovene) text, which
        annotates words in a TEI-encoded corpus with their modern-day equivalents, PoS tags and
        lemmas
          – Uses the Vaam finite-state library (Centrum für Informations- und Sprachverarbeitung, University of
            Munich)
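
To make the lexicon's role in HLT applications more concrete, the sketch below models an entry with the annotation layers named on the slides (word-form, modernised form, lemma, part-of-speech tag, optional gloss) and looks tokens up in it. The class, the field names, the PoS labels and the two sample entries are hypothetical. They do not reproduce the TEI schema of the JSI lexicon or the behaviour of ToTrTaLe.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LexiconEntry:
    word_form: str   # spelling as attested in the historical text
    modernised: str  # contemporary equivalent
    lemma: str       # modern lemma
    pos: str         # part-of-speech tag (label set is an assumption)
    gloss: Optional[str] = None  # only for archaic lemmas


# Hypothetical sample entries, for illustration only.
ENTRIES = [
    LexiconEntry("zhaſ", "čas", "čas", "noun"),
    LexiconEntry("ſhola", "šola", "šola", "noun"),
]
LEXICON = {entry.word_form: entry for entry in ENTRIES}


def annotate(tokens):
    """Pair each token with its lexicon entry, or None if it is unknown."""
    return [(token, LEXICON.get(token.lower())) for token in tokens]
```

Unknown tokens are returned with None rather than guessed at, so a caller can decide whether to fall back to rule-based transcription or leave the word unannotated.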



Example lexical entry
(converted from TEI to HTML)








JSI resources
available from
http://nl.ijs.si/imp/








Improved IR in old texts (NUK & JSI)
• Integration of the historical lexicon developed by JSI into the full-text search engine of dLib.si
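
One way such an integration can work is query expansion: before the full-text index is queried, a modern search term is expanded to the historical word-forms that the lexicon links to it. The sketch below assumes this approach. The slides do not describe how dLib.si implements it, and the mapping data, the function names and the in-memory document store are hypothetical.

```python
from collections import defaultdict

# Hypothetical (historical word-form, modernised form) pairs from a lexicon.
LEXICON_PAIRS = [
    ("zhaſ", "čas"),
    ("zhas", "čas"),
    ("ſhola", "šola"),
]


def build_expansion_table(pairs):
    """Map each modern form to the set of historical spellings linked to it."""
    table = defaultdict(set)
    for historical, modern in pairs:
        table[modern].add(historical)
    return table


def expand_query(term, table):
    """Return the term itself plus its known historical variants, sorted."""
    return sorted({term} | table.get(term, set()))


def search(term, documents, table):
    """Return ids of documents containing the term or any variant."""
    variants = expand_query(term, table)
    return [doc_id for doc_id, text in documents.items()
            if any(variant in text.split() for variant in variants)]
```

With this table, a user who types the modern word "čas" also retrieves pages where the word is printed as "zhaſ", without needing to know the historical spelling.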








    IMPACT project benefits
-   OCR quality improvement:
     - For texts in Bohoričica and Gajica, the accuracy of OCR done with ABBYY FineReader increased from
       58% to 70% for Bohoričica and 85% for Gajica
-   Full-text search with modern words in more than 200.000 digitised historical
    documents
-   Historical Slovene Lexicon that can be integrated in other Digital Library tools
-   Processing tools for large-scale digitisation (e.g. Aletheia, FEP …)
-   High-quality datasets for R&D of language technologies

-   Other „invisible“ benefits:
     - Cooperation and integration in national and international networks
     - Additional experience in large-scale digitisation
     - New knowledge (on OCR, language processing, and other tools used in large-scale
       digitisation)








         Thank you for your attention!

         Alenka Kavčič-Čolić
         Alenka.kavcic@nuk.uni-lj.si

Más contenido relacionado

Destacado

University library of KU Leuven - Sam Alloing et Demmy Verbecke
University library of KU Leuven - Sam Alloing et Demmy VerbeckeUniversity library of KU Leuven - Sam Alloing et Demmy Verbecke
University library of KU Leuven - Sam Alloing et Demmy VerbeckeIMPACT Centre of Competence
 
2. Interoperability framework and Taverna. Enrique Molla, Succeed Project.
2. Interoperability framework and Taverna. Enrique Molla, Succeed Project. 2. Interoperability framework and Taverna. Enrique Molla, Succeed Project.
2. Interoperability framework and Taverna. Enrique Molla, Succeed Project. IMPACT Centre of Competence
 
Biblioteca Virtual Miguel de Cervantes - Oskarbi Zubiarrain
Biblioteca Virtual Miguel de Cervantes - Oskarbi ZubiarrainBiblioteca Virtual Miguel de Cervantes - Oskarbi Zubiarrain
Biblioteca Virtual Miguel de Cervantes - Oskarbi ZubiarrainIMPACT Centre of Competence
 
IMPACT Final Event 26-06-2012 - The IMPACT Centre of Competence by Rafael Car...
IMPACT Final Event 26-06-2012 - The IMPACT Centre of Competence by Rafael Car...IMPACT Final Event 26-06-2012 - The IMPACT Centre of Competence by Rafael Car...
IMPACT Final Event 26-06-2012 - The IMPACT Centre of Competence by Rafael Car...IMPACT Centre of Competence
 
7. Technical development at the Meertens Institute. Marc Kemps Snijders.
7. Technical development at the Meertens Institute. Marc Kemps Snijders.7. Technical development at the Meertens Institute. Marc Kemps Snijders.
7. Technical development at the Meertens Institute. Marc Kemps Snijders.IMPACT Centre of Competence
 
Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)
Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)
Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)IMPACT Centre of Competence
 
IMPACT Final Event 26-06-2012 - Use of IMPACT tools in the Europeana Newspap...
IMPACT Final Event 26-06-2012  - Use of IMPACT tools in the Europeana Newspap...IMPACT Final Event 26-06-2012  - Use of IMPACT tools in the Europeana Newspap...
IMPACT Final Event 26-06-2012 - Use of IMPACT tools in the Europeana Newspap...IMPACT Centre of Competence
 
Datech2014-Session1-Document Representation Refinement for Precise Region Des...
Datech2014-Session1-Document Representation Refinement for Precise Region Des...Datech2014-Session1-Document Representation Refinement for Precise Region Des...
Datech2014-Session1-Document Representation Refinement for Precise Region Des...IMPACT Centre of Competence
 
Datech2014 - Session 4 - Construction of Text Digitization System for Nôm His...
Datech2014 - Session 4 - Construction of Text Digitization System for Nôm His...Datech2014 - Session 4 - Construction of Text Digitization System for Nôm His...
Datech2014 - Session 4 - Construction of Text Digitization System for Nôm His...IMPACT Centre of Competence
 
Impact centre of_competence_for_workshop_ocr_rouen_march_2011[1]
Impact centre of_competence_for_workshop_ocr_rouen_march_2011[1]Impact centre of_competence_for_workshop_ocr_rouen_march_2011[1]
Impact centre of_competence_for_workshop_ocr_rouen_march_2011[1]IMPACT Centre of Competence
 
Datech2014 - Session 5 - Wittgenstein’s Nachlass: WiTTFind and Wittgenstein A...
Datech2014 - Session 5 - Wittgenstein’s Nachlass: WiTTFind and Wittgenstein A...Datech2014 - Session 5 - Wittgenstein’s Nachlass: WiTTFind and Wittgenstein A...
Datech2014 - Session 5 - Wittgenstein’s Nachlass: WiTTFind and Wittgenstein A...IMPACT Centre of Competence
 
IMPACT Final Event 26-06-2012 - Automated metadata extraction from title page...
IMPACT Final Event 26-06-2012 - Automated metadata extraction from title page...IMPACT Final Event 26-06-2012 - Automated metadata extraction from title page...
IMPACT Final Event 26-06-2012 - Automated metadata extraction from title page...IMPACT Centre of Competence
 
Succeed final conference - The interoperability of digitisation platforms
Succeed final conference - The interoperability of digitisation platformsSucceed final conference - The interoperability of digitisation platforms
Succeed final conference - The interoperability of digitisation platformsIMPACT Centre of Competence
 
IMPACT Interoperability Framework - Clemens Neudecker
IMPACT Interoperability Framework - Clemens NeudeckerIMPACT Interoperability Framework - Clemens Neudecker
IMPACT Interoperability Framework - Clemens NeudeckerIMPACT Centre of Competence
 
6. Digital Humanities Innovation Lab (LINHD). Clara Martínez Cantón
6.  Digital Humanities Innovation Lab (LINHD). Clara Martínez Cantón6.  Digital Humanities Innovation Lab (LINHD). Clara Martínez Cantón
6. Digital Humanities Innovation Lab (LINHD). Clara Martínez CantónIMPACT Centre of Competence
 

Destacado (20)

University library of KU Leuven - Sam Alloing et Demmy Verbecke
University library of KU Leuven - Sam Alloing et Demmy VerbeckeUniversity library of KU Leuven - Sam Alloing et Demmy Verbecke
University library of KU Leuven - Sam Alloing et Demmy Verbecke
 
2. Interoperability framework and Taverna. Enrique Molla, Succeed Project.
2. Interoperability framework and Taverna. Enrique Molla, Succeed Project. 2. Interoperability framework and Taverna. Enrique Molla, Succeed Project.
2. Interoperability framework and Taverna. Enrique Molla, Succeed Project.
 
Biblioteca Virtual Miguel de Cervantes - Oskarbi Zubiarrain
Biblioteca Virtual Miguel de Cervantes - Oskarbi ZubiarrainBiblioteca Virtual Miguel de Cervantes - Oskarbi Zubiarrain
Biblioteca Virtual Miguel de Cervantes - Oskarbi Zubiarrain
 
IMPACT Final Event 26-06-2012 - The IMPACT Centre of Competence by Rafael Car...
IMPACT Final Event 26-06-2012 - The IMPACT Centre of Competence by Rafael Car...IMPACT Final Event 26-06-2012 - The IMPACT Centre of Competence by Rafael Car...
IMPACT Final Event 26-06-2012 - The IMPACT Centre of Competence by Rafael Car...
 
7. Technical development at the Meertens Institute. Marc Kemps Snijders.
7. Technical development at the Meertens Institute. Marc Kemps Snijders.7. Technical development at the Meertens Institute. Marc Kemps Snijders.
7. Technical development at the Meertens Institute. Marc Kemps Snijders.
 
Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)
Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)
Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)
 
Kennisbank IMPACT by Lotte Wilms
Kennisbank IMPACT by Lotte WilmsKennisbank IMPACT by Lotte Wilms
Kennisbank IMPACT by Lotte Wilms
 
CONCERT IMPACT by Lotte Wilms
CONCERT IMPACT by Lotte WilmsCONCERT IMPACT by Lotte Wilms
CONCERT IMPACT by Lotte Wilms
 
IMPACT Final Event 26-06-2012 - Use of IMPACT tools in the Europeana Newspap...
IMPACT Final Event 26-06-2012  - Use of IMPACT tools in the Europeana Newspap...IMPACT Final Event 26-06-2012  - Use of IMPACT tools in the Europeana Newspap...
IMPACT Final Event 26-06-2012 - Use of IMPACT tools in the Europeana Newspap...
 
Datech2014-Session1-Document Representation Refinement for Precise Region Des...
Datech2014-Session1-Document Representation Refinement for Precise Region Des...Datech2014-Session1-Document Representation Refinement for Precise Region Des...
Datech2014-Session1-Document Representation Refinement for Precise Region Des...
 
Datech2014 - Session 4 - Construction of Text Digitization System for Nôm His...
Datech2014 - Session 4 - Construction of Text Digitization System for Nôm His...Datech2014 - Session 4 - Construction of Text Digitization System for Nôm His...
Datech2014 - Session 4 - Construction of Text Digitization System for Nôm His...
 
Image Enhancement tools by Lotte Wilms
Image Enhancement tools by Lotte WilmsImage Enhancement tools by Lotte Wilms
Image Enhancement tools by Lotte Wilms
 
Impact centre of_competence_for_workshop_ocr_rouen_march_2011[1]
Impact centre of_competence_for_workshop_ocr_rouen_march_2011[1]Impact centre of_competence_for_workshop_ocr_rouen_march_2011[1]
Impact centre of_competence_for_workshop_ocr_rouen_march_2011[1]
 
Datech2014 - Session 5 - Wittgenstein’s Nachlass: WiTTFind and Wittgenstein A...
Datech2014 - Session 5 - Wittgenstein’s Nachlass: WiTTFind and Wittgenstein A...Datech2014 - Session 5 - Wittgenstein’s Nachlass: WiTTFind and Wittgenstein A...
Datech2014 - Session 5 - Wittgenstein’s Nachlass: WiTTFind and Wittgenstein A...
 
IMPACT Final Event 26-06-2012 - Automated metadata extraction from title page...
IMPACT Final Event 26-06-2012 - Automated metadata extraction from title page...IMPACT Final Event 26-06-2012 - Automated metadata extraction from title page...
IMPACT Final Event 26-06-2012 - Automated metadata extraction from title page...
 
National library of the netherlands judith rog
National library of the netherlands   judith rogNational library of the netherlands   judith rog
National library of the netherlands judith rog
 
Succeed final conference - The interoperability of digitisation platforms
Succeed final conference - The interoperability of digitisation platformsSucceed final conference - The interoperability of digitisation platforms
Succeed final conference - The interoperability of digitisation platforms
 
IMPACT Interoperability Framework - Clemens Neudecker
IMPACT Interoperability Framework - Clemens NeudeckerIMPACT Interoperability Framework - Clemens Neudecker
IMPACT Interoperability Framework - Clemens Neudecker
 
6. Digital Humanities Innovation Lab (LINHD). Clara Martínez Cantón
6.  Digital Humanities Innovation Lab (LINHD). Clara Martínez Cantón6.  Digital Humanities Innovation Lab (LINHD). Clara Martínez Cantón
6. Digital Humanities Innovation Lab (LINHD). Clara Martínez Cantón
 
UA - GT Aligner - ICoC
UA - GT Aligner - ICoCUA - GT Aligner - ICoC
UA - GT Aligner - ICoC
 

Similar a IMPACT Final Event 26-06-2012 - Library experiences in IMPACT: National and University Library of Slovenia by Alenka Kavčič-Čolić (NUK)

IMPACT: Building a Centre of Competence for Digitisation
IMPACT: Building a Centre of Competence for DigitisationIMPACT: Building a Centre of Competence for Digitisation
IMPACT: Building a Centre of Competence for DigitisationIMPACT Centre of Competence
 
An Experimental Workflow Development Platform for Historical Document Digitis...
An Experimental Workflow Development Platform for Historical Document Digitis...An Experimental Workflow Development Platform for Historical Document Digitis...
An Experimental Workflow Development Platform for Historical Document Digitis...cneudecker
 
The Improving Access to Text (IMPACT) project and other European initiatives
The Improving Access to Text (IMPACT) project and other European initiativesThe Improving Access to Text (IMPACT) project and other European initiatives
The Improving Access to Text (IMPACT) project and other European initiativesMichael Day
 
Centre of Competence in digitisation. Clemens Neudecker
Centre of Competence in digitisation. Clemens NeudeckerCentre of Competence in digitisation. Clemens Neudecker
Centre of Competence in digitisation. Clemens NeudeckerBiblioteca Nacional de España
 
OCR challenges in historic documents and the contribution of IMPACT
OCR challenges in historic documents and the contribution of IMPACTOCR challenges in historic documents and the contribution of IMPACT
OCR challenges in historic documents and the contribution of IMPACTcneudecker
 
IMPACT Final Event 26-06-2012 - Library experiences in IMPACT: National and University Library of Slovenia by Alenka Kavčič-Čolić (NUK)

  • 1. Library experiences in IMPACT: The National and University Library of Slovenia
      Alenka KAVČIČ-ČOLIĆ, Ines VODOPIVEC, Library Research Centre, National and University Library, Ljubljana, Slovenia
      in cooperation with Tomaž ERJAVEC, Dept. of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
      IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
  • 2. Outline
      • Introduction
      • Cooperation in the IMPACT project:
          – OCR improvement
          – Lexicon building
          – Improvement of information retrieval on historical document collections
      • Benefits overview
      IMPACT Outcomes, 26 June 2012, KB, The Hague
  • 3. The National and University Library of Slovenia (Narodna in univerzitetna knjižnica, NUK)
      • Digital Library of Slovenia (2005– )
      • NUK's entire online digital collection comprises more than 4 million scans and digital objects, including:
          – 19,000 pages of scientific journals
          – 402,714 pages of newspapers
          – 9,540 photographs
          – 100 music records
          – 15 3D objects
          – 3 virtual exhibitions, etc.
  • 4. NUK statistics
      • More than 200,000 visitors per year
      • More than 1 million remote users
      • Approx. 2.8 million visits to the digital library portal dLib.si (2011)
      • [Pie chart: library members by category (school children, students, university employees, general public, foreign citizens). Shares shown: 71.30%, 21.50%, 3.10%, 2.30%, 1.80%; the assignment of shares to categories is not recoverable from the extracted text.]
  • 5. Slovene historical documents: PDF file and HTML preview
  • 6. Documents published before 1850: example of bad OCR in an HTML preview
  • 7. Example of a historical text by Linhart from the 18th century
  • 8. Cooperation in IMPACT
      • NUK and JSI joined the IMPACT project in its 2nd extension (1 Apr. 2010 – 31 Dec. 2011)
      • Goals to achieve in the project:
          1. Lexicon building (JSI)
          2. Improving OCR on historical documents by using special lexica for historical language (NUK & JSI)
          3. Improving information retrieval on historical document collections (NUK)
  • 9. The 3 goals were interdependent
          1. OCR improvement (NUK)
          2. Lexicon building (JSI)
          3. Improved IR in old texts (NUK & JSI)
  • 10. Corpus selection (JSI & NUK)
      • Selection of typical documents from the 19th century and the second half of the 18th century
      • Materials from several sources:
          – dLib (www.dlib.si)
          – AHLib (http://nl.ijs.si/ahlib/)
          – books from 1848-1918, translated from German originals
      • Dataset: 41,313 digitised pages of historical newspapers and books from the 18th-19th century
      • Subset of approx. 5,000 scans for Groundtruth (GT) production. GT is a dataset of high-quality transcriptions of historical texts, which also serves as a basis for the production of the lexicon of historical Slovene.
  • 11. Characteristics of the dataset
      • Errors originating from language characteristics
          – Two basic historical alphabets (Bohoričica / Gajica)
          – Historical language / vocabulary → poor vocabulary recognition
          – Special characters, digraphs and ligatures
      • Errors originating from print properties
          – Latin and Gothic typefaces
          – Complex page structure or segmentation
          – Irregular spacing between letters, words and columns
          – Irregular / changing font sizes
          – Poor paper quality
          – Inconsistent inking
      • Errors originating from digitisation procedures
          – Specific characteristics of the originals: staining, foxing, paper warping caused by humidity
  • 12. Groundtruth production (NUK)
      • Digitisation and pre-processing of scans
      • OCR procedure
      • Post-processing of the text
      • GT production
          – Error correction: manual
          – Page segmentation and reading order: Aletheia
          – OCR outcomes encoded in PAGE XML
      • Evaluation
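The slide names PAGE XML as the encoding for OCR outcomes. As a minimal sketch of how region-level text could be read back from such a file, the Python below parses a small hand-written sample. The sample document, its region ids and its text are hypothetical and simplified (no `Coords` elements, so it is not schema-valid), and the namespace URI differs between PAGE releases, so the code reads it from the root element instead of hard-coding it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified PAGE XML sample (real files also carry Coords, TextLine, etc.).
SAMPLE_PAGE_XML = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19">
  <Page imageFilename="sample.tif" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1" type="heading">
      <TextEquiv><Unicode>Example heading</Unicode></TextEquiv>
    </TextRegion>
    <TextRegion id="r2" type="paragraph">
      <TextEquiv><Unicode>Example paragraph text.</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>
""".strip()


def region_texts(page_xml):
    """Return (region id, text) pairs for every TextRegion, in document order."""
    root = ET.fromstring(page_xml)
    # Take the namespace from the root tag, e.g. "{http://...}PcGts" -> "{http://...}".
    ns = root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""
    results = []
    for region in root.iter(f"{ns}TextRegion"):
        # Region-level transcription is the TextEquiv that is a direct child of the region.
        unicode_el = region.find(f"{ns}TextEquiv/{ns}Unicode")
        text = unicode_el.text if unicode_el is not None and unicode_el.text else ""
        results.append((region.get("id"), text))
    return results


if __name__ == "__main__":
    for region_id, text in region_texts(SAMPLE_PAGE_XML):
        print(region_id, text)
```

Document order is used here only for simplicity; a file produced with a segmentation tool may record an explicit reading order that should take precedence.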
  • 13. Lexicon building for historical Slovene
      • Developed by the Jožef Stefan Institute (JSI)
      • Built to be incorporated in OCR and IR
      • Also serves as a human-readable reference and as a training and testing set for Human Language Technologies
      • Two-stage development:
          – Reference corpus of historical Slovene
          – Lexicon of historical Slovene
  • 14. Reference corpus goo300k
      • Page sampled
      • Each word annotated with:
          – Contemporary equivalent
          – Modern lemma
          – Part-of-speech tag
          – Gloss for archaic words (lemmas)
      • First annotated automatically, then manually corrected:
          – Institute for Dutch Lexicology (INL) CoBaLT Lexicon Tool
          – A team of annotators
          – Also correcting errors in transcription
      • Available via a concordancer + download (CC-BY licence)
      • Development supported by a Google humanities research award (JSI + ZRC SAZU)
      • Corpus composition by period:
          Period       Units   Pages    Tokens
          1584             1       8     6,000
          1695             1      27    10,000
          1751-1800        8     155    27,000
          1801-1850       12     206    74,000
          1851-1875       36     380   126,000
          1876-1900       23     224    51,000
          ∑               81   1,000   296,000
  • 15. Historical lexicon
      • Lexicon dump from goo300k + additional lexicon from the full-text collection
      • First annotated automatically, then manually corrected:
          – INL CoBaLT Lexicon Tool
          – A team of annotators
      • Dual role:
          – As a human-readable lexicon of historical Slovene
          – For HLT applications (e.g. IR)
      • Available via a web browser + download (CC-BY)
      • JSI also developed ToTrTaLe, a tool for processing historical (Slovene) text, which annotates words in a TEI-encoded corpus with their modern-day equivalents, PoS tags and lemmas
          – Uses the Vaam finite-state library (Centrum für Informations- und Sprachverarbeitung, University of Munich)
      • Lexicon size:
          Lexicon            n
          Lex. entries  70,000
          Word-forms    68,000
          Modernised    50,000
          Lemmas        24,000
          Glosses        1,900
  • 16. Example lexical entry (converted from TEI to HTML)
  • 17. JSI resources available from http://nl.ijs.si/imp/
  • 18. Improved IR in old texts (NUK & JSI)
      • Integration of the historical lexicon developed by JSI into the full-text search engine of dLib.si
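The slides do not describe how the lexicon was wired into the dLib.si search engine. One common approach, sketched here purely as an assumption, is query expansion: each modern query term is expanded with the historical word-forms that the lexicon maps to it. The lexicon entries, `build_reverse_index` and `expand_query` below are hypothetical illustrations using Bohoričica-style spellings; they are not taken from the JSI lexicon or the dLib.si code.

```python
from collections import defaultdict

# Toy entries, historical word-form -> modernised form (illustrative only).
TOY_LEXICON = {
    "zhlovek": "človek",
    "zhloveka": "človeka",
    "shena": "žena",
    "ſhola": "šola",
}


def build_reverse_index(lexicon):
    """Map each modernised form to the set of historical word-forms recorded for it."""
    index = defaultdict(set)
    for historical, modern in lexicon.items():
        index[modern.lower()].add(historical.lower())
    return index


def expand_query(query, index):
    """For each query term, return the term followed by its known historical variants."""
    expanded = []
    for term in query.lower().split():
        variants = sorted(index.get(term, set()))
        expanded.append([term] + variants)
    return expanded


if __name__ == "__main__":
    reverse_index = build_reverse_index(TOY_LEXICON)
    # Each inner list can be OR-ed together in the search engine's own query syntax.
    print(expand_query("človek šola", reverse_index))
```

Expanding at query time leaves the OCR text and its index untouched; the alternative of normalising the indexed text to modern forms would require re-indexing whenever the lexicon changes.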
  • 22. IMPACT project benefits
      • OCR quality improvement:
          – OCR accuracy with ABBYY FineReader increased from 58% to 70% for texts in Bohoričica and to 85% for texts in Gajica
      • Full-text search with modern words in more than 200,000 digitised historical documents
      • Historical Slovene lexicon that can be integrated into other digital library tools
      • Processing tools for large-scale digitisation (e.g. Aletheia, FEP, ...)
      • High-quality datasets for R&D of language technologies
      • Other "invisible" benefits:
          – Cooperation and integration in national and international networks
          – Additional experience in large-scale digitisation
          – New knowledge (on OCR, language processing, and other tools used in large-scale digitisation)
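The slide reports OCR accuracy figures without stating how they were measured (for instance at character or word level). As one common definition, offered here as an assumption and not as the project's actual evaluation procedure, character accuracy can be computed as one minus the edit distance between the OCR output and the ground truth, divided by the ground-truth length. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(ocr_text, ground_truth):
    """Share of ground-truth characters recovered correctly, clamped at 0."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(ocr_text, ground_truth)
    return max(0.0, 1.0 - errors / len(ground_truth))


if __name__ == "__main__":
    # One substituted character out of four ground-truth characters.
    print(character_accuracy("abxd", "abcd"))
```

A word-level variant follows the same pattern by passing lists of tokens to `edit_distance` in place of strings.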
  • 24. Thank you for your attention!
      Alenka Kavčič-Čolić, Alenka.kavcic@nuk.uni-lj.si