IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.




Targeted Language Resources for the
Digitization of Historical Collections

 Annette Gotscharek, Ulrich Reffle, Christoph Ringlstetter,
 Klaus U. Schulz
 CIS, University of Munich
JISC Workshop 2009, Bath, UK




Questions and Methods

• For historical documents of a specific period: what
  kind of linguistic resources are needed?
• What kind of improvements can be expected?
• What are the consequences for engineering and processes?
  ----------------------------------------------------------------------
 (1) Corpus analysis
 (2) Quantitative Experiments on OCR and IR







Survey

      1. Special Challenges to Digitize Historical Materials
      2. Composing and Analyzing a Historical Corpus
      3. Types of Linguistic Resources
      4. Evaluation of Benefits: OCR, IR
      5. Consequences for Engineering








  0. CIS within IMPACT

• (1) Text Recognition: Adaptation of Optical Character
  Recognition to historical documents
• (2) Resources Building: Enrichment of texts to Improve
  Information Retrieval (IR) on historical documents
• (3) Research beyond IMPACT: steps to a next generation
  interface to access collections of historical documents




Digitization Projects for Historical Materials


SELECT                       Create a Historical Collection
SCAN                         Create Images: Greyscale, Color, QA
PROCESS                      Improve Images
OCR/Type                     Create a Symbolic Representation, QA
INDEX                        Process a Term-Document Representation
PRESENT                      Provide a User Interface for Access





1. Special Challenges for the Digitization of
Historical Materials

          1500 1600 1700 1800 1900

•   Imaging                                                                      Damages on Originals
•   Optical Character Recognition                                                Rate of Recognition Errors
•   Information Retrieval                                                        Historical Variants
•   Human Reading                                                                Unknown Words







Optical Character Recognition: Gothic, good quality

Städte den römischen mumcizmg gleich zu stellen. Allem wenn sich je in
  einem Rechtstheile die altrechtlichen teutschen Gewohnheiten, und Gesetze
erhalten haben, so ist es gewiß in dieser Lehre, man mag entweder auf die
Befugniß, die Stadtgerechtigkeit zu ertheilen , oder auf die innere
Regimentsverfftssung so-



Optical Character Recognition: medium quality

 Fürsten zu Gstternwerden/wer wollte vermainen / daßwt
 IhroKhurftrstl Durchl gnädiglsterHcttVatterinderpictcr
 rndFrombkcltallmFürstenvorzusetzen!scyn/vnd das halst>
 in^cclcQ^ vci pluz^uäzn 5accr6o5 daß tl iN KilchkN GottW
 wehr als ein Priester                          .




Examples of Noise introduced by imperfect OCR

(1) processed word images may lead to False Friends
    Fischerei - Tischlerei: F -> T, h -> hl (Eng. fishery - carpentry)
(2) processed word images may relate to no word at all

   ^.uglltt.         schreibet/
(3) severe word segmentation errors

   vndExcmpelFürstl-vnd HeroischerTuzenF

OCR on Gothic materials:
good (WER < 10%); medium (10-30%); bad (> 30%)
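
The quality bands can be made concrete with a small sketch. It assumes the usual definition of word error rate (word-level edit distance divided by the number of reference words) and reads the last band as a WER above 30%. The function names `word_error_rate` and `quality_band` and the sample strings are illustrative, not taken from the project.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word dropped
                            curr[j - 1] + 1,      # extra hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def quality_band(wer: float) -> str:
    """Bands from the slide: good < 10%, medium 10-30%, bad above 30%."""
    if wer < 0.10:
        return "good"
    if wer <= 0.30:
        return "medium"
    return "bad"


# Made-up example: one misread word out of four.
wer = word_error_rate("die Stadtgerechtigkeit zu ertheilen",
                      "die Stadtgerechtigkeit zn ertheilen")
print(wer, quality_band(wer))
```

A segmentation error is penalized heavily under this measure: two reference words fused into one OCR token count as one substitution plus one deletion.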




Why is it so bad?

(1) The image quality is challenging; further
 processing is needed

(2) The classifiers of the OCR disregard certain
 typefaces used in historical print

(3) The language resources of the OCR are
inappropriate for historical language



IR: Search Problem
even for keyed collections

[Figure: search for Kräuter (Eng. herbs); spellings shown around the query: Kreüter, krauter, kreuter, creuther; "0 Results for Kreuter"]

– kräuter (= Eng. herbs) as kraͤuter, kreuter, kreüter, kreuter, creuther




Special challenge for IR and OCR on
historical Texts: Spelling variation
• Missing normalization of orthography leads to plenty
  of spelling variants in historical documents, e.g. in
  German texts (1500-1850):
   – teil (= Eng. part) as theil, teyl, theyl
   – kräuter (= Eng. herbs) as kraͤuter, kreuter, kreüter,
     kreuter, creuther
   – fragte (= Eng. asked) as frug, fruk
• The user is not aware of the variants and misses many
  documents; sometimes the results even include false friends
• Solution: Mapping from variants to modern lemma
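
A minimal sketch of the mapping idea, assuming the variant lists from this slide are stored as a plain dictionary. The names `VARIANT_TO_MODERN`, `normalize` and `expand_query` are hypothetical, and a real normalization lexicon would be far larger.

```python
from collections import defaultdict

# Historical variant -> modern form, using only the examples from the slide.
VARIANT_TO_MODERN = {
    "theil": "teil", "teyl": "teil", "theyl": "teil",
    "kraͤuter": "kräuter", "kreuter": "kräuter",
    "kreüter": "kräuter", "creuther": "kräuter",
    "frug": "fragte", "fruk": "fragte",
}


def build_expansion_index(variant_to_modern):
    """Invert the mapping: modern form -> set of historical variants."""
    index = defaultdict(set)
    for variant, modern in variant_to_modern.items():
        index[modern].add(variant)
    return index


def normalize(token, variant_to_modern=VARIANT_TO_MODERN):
    """Index-time use: replace a historical variant by its modern form."""
    key = token.lower()
    return variant_to_modern.get(key, key)


def expand_query(term, index):
    """Query-time use: search for the modern form and all known variants."""
    key = term.lower()
    return sorted({key} | index.get(key, set()))


index = build_expansion_index(VARIANT_TO_MODERN)
print(expand_query("Kräuter", index))
print(normalize("Kreuter"))
```

Either direction addresses the "0 results" problem: normalizing tokens at index time, or expanding the user's modern query into its variants at search time.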




Language Resources to tackle challenges
encountered in OCR and IR

• Lexica for OCR
• Language Models for OCR
• Statistical Information about transformation patterns
• Historical Stopwords for IR
• Normalization Lexica with a mapping between modern and
  historical word forms for IR
• Syntactical Information for paradigmatic expansion and
  disambiguation at POS level







Mantra: All Language Resources are Corpus
Based

Possible sources:

• Keyed Materials on the Web
• Non Public Electronic Corpora
• Keying/corrected OCR of Image Corpora
• Noisy OCR Corpora




Status of German historical Corpora

1. Main development corpus
• Proofread texts from 1400 to 1900
• Medium size: 2.7 million tokens
• For lexicon construction
• For diachronic analysis/classification of vocabulary of distinct periods

2. OCR corpus for lexicon testing
• OCRed images + aligned ground truth
• Texts from 16th, 18th, 19th century (5034, 2659, 18052 tokens)

3. IR test corpus for lexicon testing
• Special linguistically annotated ground truth
• Texts from 16th, 17th, 18th, 19th century: 31080 tokens




Language in a historical corpus for German
  Modern lexicon (CISLEX): coverage on Main Corpus on 10 periods




Language in a historical corpus for German
Compounds (modern components); coverage on Main Corpus on 10 periods




Two Variants of Lexica for IR and OCR
• Hypothetical Lexicon:
              Trying to map input strings to modern lexicon entries
              in a dynamic way via a special approximate matching
              procedure using historical transformation patterns.


• Witnessed Lexicon:
              Corpus-checked lexicon entries:
              Historical spelling variant + modern lemma for IR
              Historical word list for OCR





Hypothetical Lexicon: approximate
matching procedure
• Many of the spelling variations can be traced back to
  a modern word by applying characteristic patterns /
  rewrite rules
   – e.g. the historical string theyle can be traced to
     its modern equivalent teile by applying th → t
     and ey → ei.
• Required resources:
   – Contemporary lexicon with inflected word forms
   – Set of typical language-specific spelling variation
     patterns
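
The tracing step can be sketched as a breadth-first search over single rewrites. This is an illustration under stated assumptions, not the project's matcher: `PATTERNS` holds only the two rules named above (the real set has about 140), `MODERN_LEXICON` is a three-entry toy lexicon, and `rewrites`, `match` and `max_steps` are hypothetical names.

```python
from collections import deque

# Historical -> modern rewrite patterns named in the text.
PATTERNS = [("th", "t"), ("ey", "ei")]

# Toy modern lexicon: inflected form -> lemma.
MODERN_LEXICON = {"teile": "teil", "taille": "taille", "fragte": "fragen"}


def rewrites(word, patterns):
    """Yield each string obtained by applying one pattern at one position."""
    for old, new in patterns:
        start = word.find(old)
        while start != -1:
            yield word[:start] + new + word[start + len(old):], f"{old} → {new}"
            start = word.find(old, start + 1)


def match(historical, lexicon, patterns, max_steps=3):
    """Return {modern form: list of patterns applied to reach it}."""
    hits = {}
    seen = {historical}
    queue = deque([(historical, [])])
    while queue:
        word, trace = queue.popleft()
        if word in lexicon and word not in hits:
            hits[word] = trace
        if len(trace) == max_steps:
            continue  # bound the search; some patterns lengthen the word
        for candidate, label in rewrites(word, patterns):
            if candidate not in seen:
                seen.add(candidate)
                queue.append((candidate, trace + [label]))
    return hits


for form, trace in match("theyle", MODERN_LEXICON, PATTERNS).items():
    print(form, MODERN_LEXICON[form], trace)
```

The lemma comes from the modern lexicon once a modern inflected form has been reached, which is why the contemporary lexicon needs inflected forms and not only lemmas.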








Approximate matching procedure

Spelling variation:  theile

~ 140 patterns:      …, th → t, ei → ai, ey → ei, l → ll, …

Modern lexicon:
    Inflected forms    Lemmatizing information
    …                  …
    teile              teil (= part)
    ...                teilen (= to share)
    taille             taille (= waist)
    fragte             fragen (= to ask)
    …                  …










Approximate matching procedure

Spelling variation:  frug  →  ?

~ 140 patterns:      …, th → t, ei → ai, ey → ei, l → ll, …

Modern lexicon:
    Inflected forms    Lemmatizing information
    …                  …
    teile              teil (= part)
    ...                teilen (= to share)
    taille             taille (= waist)
    fragte             fragen (= to ask)
    …                  …






Approximate matching procedure
• Advantages:
   – No manual work needed
   – Dynamic approach

• Limitations:
   – Mismatches may link a historical spelling
     variation to a wrong modern word.
   – A part of the historical vocabulary cannot be
     reduced to a modern word by simple matching





Language in a historical corpus for German
  Hypothetical lexicon; coverage on Main Corpus on 10 periods








Manually collected special lexica

Spelling variation:  theile, frug

Manual mapping:      links each spelling variation to a modern lexicon entry

Modern lexicon:
    Inflected forms    Lemmatizing information
    …                  …
    teile              teil (= part)
    ...                teilen (= to share)
    taille             taille (= waist)
    fragte             fragen (= to ask)
    …                  …






Manually collected special lexica
• Advantages:
   – Associations between historical variant and
     modern lemma are safe
   – Associations that are not covered by the matching
     approach can be stored explicitly
• Limitations:
   – Time-consuming and labor-intensive; situations
     occur where specialists (historical linguists) are
     needed.
   – Hardly ever complete because of the immense
     number of spelling variants



Evaluation of the approximate matcher:
is the lexicon redundant?

• Few empirical studies on crucial decisions for
  IR and OCR on historical texts:
   – Is a matching approach enough?
   – Do we need a lexicon, and if so, in which
     scenarios?








Lexica built
  Both assign modern lemmas to historical full forms

• Validated lexicon constructed from the Main Corpus:
  currently ca. 15,000 entries (“kernel lexicon for hist. German”); poor
  coverage
• Witnessed lexicon for OCR: from the Main Corpus, 200,000 tokens without
  modern correspondence; coverage still limited by corpus size
• “Hypothetical lexicon for IR”: matching procedure mapping a historical full
  form to its modern counterpart, plus a lemmatizer for the modern language
  (historical full form <---> modern lemma); based on 140 patterns,
  theoretically 100 million entries. High coverage, but assignments can be
  erroneous, and only regular (pattern-based) correspondences are captured
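
Why a pattern-based lexicon becomes so large in theory can be illustrated by running patterns in the opposite direction, from a modern form to every spelling the patterns could explain. The sketch assumes the patterns may be inverted this way and uses only two of them; `INVERSE_PATTERNS` and `hypothetical_variants` are hypothetical names, and the count it produces is for illustration only.

```python
from itertools import product

# Modern -> historical direction of two patterns (th → t and ey → ei, inverted).
INVERSE_PATTERNS = [("t", "th"), ("ei", "ey")]


def hypothetical_variants(modern, inverse_patterns):
    """All spellings obtained by optionally rewriting each pattern occurrence."""
    segments = []
    i = 0
    while i < len(modern):
        for old, new in inverse_patterns:
            if modern.startswith(old, i):
                segments.append((old, new))  # keep the modern spelling or rewrite it
                i += len(old)
                break
        else:
            segments.append((modern[i],))    # character not touched by any pattern
            i += 1
    return {"".join(choice) for choice in product(*segments)}


print(sorted(hypothetical_variants("teile", INVERSE_PATTERNS)))
print(len(hypothetical_variants("teilte", INVERSE_PATTERNS)))
```

Each additional rewritable position doubles the number of spellings for a word, so a large modern full-form lexicon combined with many patterns grows very quickly.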




Evaluation of OCR Results

• Test Corpus for 16th, 18th, and 19th century
• Development version of a professional OCR engine with an external
  dictionary interface
• Experiments with different lexicon settings
   – No additional lexicon, character model only
   – German modern lexicon
   – Corpus-based witnessed lexicon
   – Hypothetical lexicon




OCR results: number of word errors and reduction of error rate per lexicon setting

                         16th century         18th century         19th century
Dictionary               Errors  Reduction    Errors  Reduction    Errors  Reduction
No Lexicon               1306    -            827     -            2074    -
Optimal Lexicon          756     42%          395     52%          612     70%
Modern Lexicon           1096    16%          501     39%          888     57%
W. Historical Lexicon    938     28%          481     42%          856     59%
Modern + Virtual H.L.    1011    25%          480     42%          849     59%

Annotations below the table: WER > 50% (left); WER ~ 10% (right)
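
The reduction figures can be recomputed from the error counts, assuming the reduction is measured relative to the "No Lexicon" row of the same century. The dictionary `ERRORS` copies the counts from the table; the function names are illustrative.

```python
# Word error counts from the table, in the order 16th, 18th, 19th century.
ERRORS = {
    "No Lexicon": (1306, 827, 2074),
    "Optimal Lexicon": (756, 395, 612),
    "Modern Lexicon": (1096, 501, 888),
    "W. Historical Lexicon": (938, 481, 856),
    "Modern + Virtual H.L.": (1011, 480, 849),
}


def reduction(baseline, errors):
    """Relative error reduction in percent."""
    return 100.0 * (baseline - errors) / baseline


def reductions(table, baseline_row="No Lexicon"):
    """Rounded reduction per century for every row except the baseline."""
    base = table[baseline_row]
    return {
        name: tuple(round(reduction(b, e)) for b, e in zip(base, row))
        for name, row in table.items()
        if name != baseline_row
    }


for name, values in reductions(ERRORS).items():
    print(f"{name:24s}", values)
```

Under this assumption the recomputed values agree with the table in every cell but one: for "Modern + Virtual H.L." in the 16th century, 1011 errors against a baseline of 1306 give about 23% rather than 25%. The slide does not show which of the two numbers is the intended one.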



Evaluation of the approximate matcher in an IR scenario

• Proofread documents from the 16th, 17th, 18th and 19th centuries and tagged each token manually.
• Collected a list of (historical) stopwords.
• Defined “precision” and “recall” for our scenario.
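The slide does not spell out how precision and recall were defined, so the following is only one plausible reading, stated as an assumption: each manually tagged token has one gold modern lemma, the matcher proposes a set of candidate lemmas per token, and stopwords are skipped. Precision is then the share of proposed lemmas that are correct, and recall the share of tokens whose gold lemma was proposed. The function name and the toy data are hypothetical.

```python
def precision_recall(gold, proposed, stopwords=frozenset()):
    """Score a matcher against manually tagged tokens.

    gold:      list of (historical_token, gold_modern_lemma) pairs
    proposed:  dict mapping a historical token to a set of proposed lemmas
    stopwords: tokens excluded from the evaluation
    """
    correct = 0      # tokens whose gold lemma is among the proposals
    n_proposed = 0   # total number of proposed lemmas
    n_gold = 0       # tokens that take part in the evaluation
    for token, lemma in gold:
        if token in stopwords:
            continue
        n_gold += 1
        candidates = proposed.get(token, set())
        n_proposed += len(candidates)
        if lemma in candidates:
            correct += 1
    precision = correct / n_proposed if n_proposed else 0.0
    recall = correct / n_gold if n_gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy data: "theile" gets one right and one wrong proposal,
    # "frug" gets none, and "vnd" is treated as a stopword.
    gold = [("theile", "teil"), ("frug", "fragen"), ("vnd", "und")]
    proposed = {"theile": {"teil", "taille"}, "frug": set()}
    print(precision_recall(gold, proposed, stopwords={"vnd"}))
```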








Main insights IR experiment
• 18th and 19th century:
   – The pure matching approach leads to good precision values.
   – Recall values are acceptable.








Main insights IR experiment
• 16th and 17th century:
   – Precision of the matching approach is poor; a lexicon will help to avoid wrong matches.
   – Recall values show that a large number of words can only be explained by a special lexicon.








Answer to our refined question
   Is the lexicon redundant for OCR/IR on historical texts?
• It depends on the material, especially on the date of origin of the collection:
   – The matching approach leads to acceptable results for 18th and 19th century collections.
   – It has serious limitations for 16th and 17th century collections; special lexica will lead to important improvements.
• A combination of the matching approach and manually collected lexica may lead to optimal results.
• For postprocessing, validated lexica are needed.
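One way to picture the combination of a manually collected lexicon with the matching approach is a two-stage lookup. This is a minimal sketch under stated assumptions, not the project's implementation: the witnessed lexicon is consulted first, the pattern-based matcher is used only as a fallback, and each pattern may be applied at any position. The lexicon entries and the two patterns (th → t, ey → ei) are the toy examples from the earlier slides; all names are hypothetical.

```python
# Toy resources for illustration only.
MODERN_LEXICON = {"teile": "teil", "taille": "taille", "fragte": "fragen"}
PATTERNS = [("th", "t"), ("ey", "ei")]   # historical -> modern rewrite rules
WITNESSED_LEXICON = {"frug": "fragen"}   # manually collected mapping


def rewrite_candidates(word, patterns, max_steps=3):
    """All strings reachable by applying up to max_steps single rewrites."""
    seen = {word}
    frontier = {word}
    for _ in range(max_steps):
        next_frontier = set()
        for current in frontier:
            for old, new in patterns:
                start = current.find(old)
                while start != -1:
                    candidate = current[:start] + new + current[start + len(old):]
                    if candidate not in seen:
                        seen.add(candidate)
                        next_frontier.add(candidate)
                    start = current.find(old, start + 1)
        frontier = next_frontier
    return seen


def normalize(word):
    """Return (modern lemmas, source of the mapping) for a historical word."""
    word = word.lower()
    if word in WITNESSED_LEXICON:
        return [WITNESSED_LEXICON[word]], "witnessed"
    lemmas = sorted({
        MODERN_LEXICON[candidate]
        for candidate in rewrite_candidates(word, PATTERNS)
        if candidate in MODERN_LEXICON
    })
    if lemmas:
        return lemmas, "hypothetical"
    return [], "unknown"


if __name__ == "__main__":
    for historical in ("theyle", "frug", "xyz"):
        print(historical, normalize(historical))
```

Checking the witnessed lexicon first reflects the point that manual associations are safe while pattern matches can be wrong; the opposite order, or merging both result sets, would be equally easy to express.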





Engineering Consequences
        A Focus Collection of the Bavarian State Library
        VD16: Collection of Early High German Books

• Collaborative project with the Bavarian State Library (BSB) on OCR/IR for this collection: Clemens Neudecker/Fedor Bochow
• Special lexicon building is needed.
• No 16th century electronic corpora are available for lexicon development.
• For a real-world test we defined one topic area as the main interest: theology.




Engineering Consequences
        Iterative Process with Bavarian State Library to Create Resources for VD16 Collection

(1) A random selection of 200 pages from 100 sources
(2) OCR and corpus experiments
(3) Selection of usable sources
(4) Specification of keying by BSB/CIS for 70 complete books, usable for both presentation and linguistic resource building
(5) Contract with service providers




Engineering Consequences
        A Focus Collection of the Bavarian State Library
        VD16: Collection of Early High German Books with 30 million pages

Diagram labels:
• Linguistic Database for VD16
• Integrate OCR supplier: special typeface models; character models
• Improve OCR with a specialized historical lexicon
• Improve IR access with a normalization lexicon





Wrap Up
• For challenging historical materials, specialized lexica are needed.
• Special lexica integrated directly into the basic OCR engine lift OCR quality significantly. For bigger projects, seek direct collaboration with OCR partners.
• For IR: use approximate matching or normalization lexica to process user queries.
• Integrate research institutions and collection holders.
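For the IR recommendation, a normalization lexicon can be inverted so that a modern query term is expanded to its historical spellings before searching. The sketch below is an illustration under assumptions: the lexicon holds only a few of the variant examples from the earlier slides, and the function names are hypothetical.

```python
from collections import defaultdict

# Toy normalization lexicon: historical variant -> modern word form.
NORMALIZATION_LEXICON = {
    "theil": "teil",
    "teyl": "teil",
    "theyl": "teil",
    "kreuter": "kräuter",
    "kreüter": "kräuter",
    "creuther": "kräuter",
}


def build_expansion_index(lexicon):
    """Invert the lexicon: modern word form -> set of historical variants."""
    index = defaultdict(set)
    for variant, modern in lexicon.items():
        index[modern].add(variant)
    return index


def expand_query(terms, index):
    """Replace each query term by the sorted list of its known spellings."""
    expanded = []
    for term in terms:
        key = term.lower()
        expanded.append(sorted({key} | index.get(key, set())))
    return expanded


if __name__ == "__main__":
    index = build_expansion_index(NORMALIZATION_LEXICON)
    for spellings in expand_query(["Teil", "Kräuter", "Buch"], index):
        print(" OR ".join(spellings))
```

A term with no lexicon entry is passed through unchanged, so the expansion never narrows a query.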

Targeted Language Resources for the Digitisation of Historical Collections

  • 1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Targeted Language Resources for the Digitization of Historical Collections Annette Gotscharek, Ulrich Reffle, Christoph Ringlstetter, Klaus U. Schulz CIS, University of Munich JISC Workshop 2009, Bath, UK
  • 2. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Questions and Methods  For historical documents of a specific period: what kind of linguistic resources?  What kind of improvements can be expected?  Consequences for engeneering and processes? ----------------------------------------------------------------------  (1) Corpus analysis  (2) Quantitative Experiments on OCR and IR 2
  • 3. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Survey  1. Special Challenges to Digitize Historical Materials  2. Composing and Analyzing a Historical Corpus  3. Types of Linguistic Resources  4. Evaluation of Benefits: OCR, IR  5. Consequences for Engineering 3
  • 4. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 0. CIS within IMPACT • (1) Text Recognition: Adaption of Optical Character Recognition to historical documents • (2) Resources Building: Enrichment of texts to Improve Information Retrieval (IR) on historical documents • (3) Research beyond IMPACT: steps to a next generation interface to access collections of historical documents
  • 5. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Digitization Projects for Historical Materials SELECT Create a Historical Collection SCAN Create Images: Greyscale, Color, QA PROCESS Improve Images OCR/Type Create a Symbolical Representation, QA INDEX Process a Term-Document Representation PRESENT Provide a User Interface for Access 5
  • 6. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 1. Special Challenges for the Digitization of Historical Materials 1500 1600 1700 1800 1900 • Imaging Damages on Originals • Optical Character Recognition Rate of Recognition Errors • Information Retrieval Historical Variants • Human Reading Unknown Words date footertext
  • 7. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Optical Character Recognition: Gothic, good quality Städte den römischen mumcizmg gleich zu stellen. Allem wenn sich je in einem Rechtstheile die altrechtlichen teutschen Gewohnheiten, und Gesetze erhalten haben, so ist es gewiß in dieser Lehre, man mag entweder auf die Befugniß, die Stadtgerechtigkeit zu ertheilen , oder auf die innere Regimentsverfftssung so-
  • 8. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Optical Character Recognition: medium quality Fürsten zu Gstternwerden/wer wollte vermainen / daßwt IhroKhurftrstl Durchl gnädiglsterHcttVatterinderpictcr rndFrombkcltallmFürstenvorzusetzen!scyn/vnd das halst> in^cclcQ^ vci pluz^uäzn 5accr6o5 daß tl iN KilchkN GottW wehr als ein Priester .
  • 9. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Examples of Noise introduced by imperfect OCR (1) processed word images may lead to False Friends Fischerei - Tischlerei: F->T, h -> hl (Engfishery - carpenterry) (2) processed word images may relate to no word at all ^.uglltt. schreibet/ (3) severe word segmentation errors vndExcmpelFürstl-vnd HeroischerTuzenF OCR on Gothic materials: good (WER < 10%); medium (10-30%); bad (< 30%)
  • 10. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Why is it so bad (1) The image quality is challenging, further processing needed (2) The classifiers of the OCR disregard certain type faces used in historic print (3) The language resources of the OCR are inappropriate: historical language
  • 11. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. IR: Search Problem even for keyed collections Kreüter ? krauter Kräuter (Eng herbs) kreuter 0 Results for Kreuter creuther – kräuter (= Engherbs) as kraͤuter, kreuter, kreüter, kreuter, creuther
  • 12. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Special challenge for IR and OCR on historical Texts: Spelling variation  Missing normalization of orthography leads to plenty of spelling variants in historical documents, e.g. in German texts (1500-1850): – teil (= Engpart) as theil, teyl, theyl – kräuter (= Engherbs) as kraͤuter, kreuter, kreüter, kreuter, creuther – fragte (= Engasked) as frug, fruk  User is not aware of the variants and misses many documents: sometimes even false friends  Solution: Mapping from variants to modern lemma 12
  • 13. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Language Resources to tackle challenges encountered in OCR and IR • Lexica for OCR • Language Models for OCR • Statistical Information about transformation patterns • Historical Stopwords for IR • Normalization Lexica with a mapping between modern and historical wordform for IR • Syntactical Information for paradigmatic expansion and disambiguation at POS level date footertext
  • 14. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Mantra: All Language Resources are Corpus Based Possible sources: • Keyed Materials on the Web • Non Public Electronic Corpora • Keying/corrected OCR of Image Corpora • Noisy OCR Corpora
  • 15. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Status of German historical Corpora 1. Main development corpus •Proofread texts from 1400 to 1900, •Medium size: 2.7 Mill. tokens •For lexicon construction •For diachronic analysis/classification of vocabulary of distinct periods 2. OCR corpus for lexicon testing •OCRed Images + groundtruth aligned •Texts from 16th, 18th, 19th century (5034, 2659, 18052) tokens 3. IR test corpus for lexicon testing •Special linguistically annotated groundtruth •Texts from 16th, 17th, 18th, 19th century 31080 tokens
  • 16. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Language in a historical corpus for German Modern lexicon (CISLEX): coverage on Main Corpus on 10 periods
  • 17. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Language in a historical corpus for German Compounds (modern components); coverage on Main Corpus on 10 periods
  • 18. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Two Variants of Lexica for IR and OCR  Hypotetical Lexicon: Trying to map input strings to modern lexicon entries in a dynamic way via a special approximate matching procedure using historical transformation patterns.  Witnessed Lexicon: Corpus checked lexicon entries: Historical spelling variant + modern Lemma for IR Historical word list for OCR 18
  • 19. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Hypothetical Lexicon: approximate matching procedure  Many of the spelling variations can be traced back to a modern word by applying characteristic patterns / rewrite rules – e.g. the historical string theyle can be traced to its modern equivalent teile by applying th → t and ey → ei.  Required resources: – Contemporary lexicon with inflected word forms – Set of typical language-specific spelling variation patterns 19
  • 20. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Approximate matching procedure Modern lexicon Inflected Lemmatizing forms information … … teile teil (= part) ... teilen (= to share) taille taille (= waist) fragte fragen (= to ask) … … 20
  • 21. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Approximate matching procedure ~ 140 Modern lexicon Patterns Inflected Lemmatizing … forms information th → t … … ei → ai teile teil (= part) ey → ei ... teilen (= to share) l → ll taille taille (= waist) … fragte fragen (= to ask) … … 21
  • 22. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Approximate matching procedure Spelling ~ 140 Modern lexicon variation Patterns Inflected Lemmatizing … forms information th → t … … theile ei → ai teile teil (= part) ey → ei ... teilen (= to share) l → ll taille taille (= waist) … fragte fragen (= to ask) … … 22
  • 23. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Approximate matching procedure Spelling ~ 140 Modern lexicon variation Patterns Inflected Lemmatizing … forms information th → t … … theile ei → ai teile teil (= part) ey → ei ... teilen (= to share) l → ll taille taille (= waist) … fragte fragen (= to ask) … … 23
  • 24. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Approximate matching procedure Spelling ~ 140 Modern lexicon variation Patterns Inflected Lemmatizing … forms information th → t … … theile ei → ai teile teil (= part) ey → ei ... teilen (= to share) l → ll taille taille (= waist) … fragte fragen (= to ask) … … 24
  • 25. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Approximate matching procedure Spelling ~ 140 Modern lexicon variation Patterns Inflected Lemmatizing … forms information th → t … … theile ei → ai teile teil (= part) ey → ei ... teilen (= to share) l → ll taille taille (= waist) … fragte fragen (= to ask) … … 25
  • 26. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Approximate matching procedure Spelling ~ 140 Modern lexicon variation Patterns Inflected Lemmatizing … forms information th → t … … frug ei → ai teile teil (= part) ey → ei ... teilen (= to share) l → ll taille taille (= waist) … fragte fragen (= to ask) … … 26
  • 27. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Approximate matching procedure Spelling ~ 140 Modern lexicon variation Patterns Inflected Lemmatizing … forms information ? th → t … … frug ei → ai teile teil (= part) ey → ei ... teilen (= to share) l → ll taille taille (= waist) … fragte fragen (= to ask) … … 27
  • 28. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Approximate matching procedure  Advantages: – No manual work needed – Dynamic approach  Limitations: – Mismatches may link a historical spelling variation to a wrong modern word. – A part of the historical vocabulary cannot be reduced to a modern word by simple matching 28
  • 29. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Language in a historical corpus for German Hypothetical lexicon; coverage on Main Corpus on 10 periods
  • 30. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Manually collected special lexica Spelling Modern lexicon variation Inflected Lemmatizing forms information theile … … teile teil (= part) ... teilen (= to share) frug taille taille (= waist) fragte fragen (= to ask) … … 30
  • 31. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Manually collected special lexica Spelling Manual Modern lexicon variation mapping Inflected Lemmatizing forms information theile … … teile teil (= part) ... teilen (= to share) frug taille taille (= waist) fragte fragen (= to ask) … … 31
  • 32. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Manually collected special lexica  Advantages: – Associations between historical variant and modern lemma are safe – Associations that are not covered by the matching approach can be stored explicitly  Limitations: – Time consuming, labor-intensive, situations occur where specialists (historical linguists) are needed. – Hardly ever complete because of immense number of spelling variants 32
  • 33. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Evaluation of the approximative matcher – is the lexicon redundant?  Few empirical studies on crucial decisions for IR and OCR on historical texts: – Is a matching approach enough? – Do we need a lexicon, and if so, in which scenarios? 33
  • 34. Lexica built (both assign modern lemmas to historical full forms)
      - Validated lexicon: constructed from the Main Corpus, currently ca. 15,000 entries (“kernel lexicon for hist. German”); poor coverage.
      - Witnessed lexicon for OCR: from the Main Corpus, 200,000 tokens without modern correspondence; coverage still limited by corpus size.
      - Hypothetical lexicon for IR: a matching procedure that maps a historical full form to its modern counterpart, plus a lemmatizer for the modern language (historical full form <---> modern lemma). Based on 140 patterns; theoretically 100 million entries. High coverage, but assignments can be erroneous, and only regular (pattern-based) correspondences are captured.
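The hypothetical lexicon described above can be illustrated with a minimal sketch. It assumes that a pattern is a plain substring rewrite (historical -> modern) and that the lemmatizer is a simple lookup table. The slides mention 140 patterns but do not list them, so `PATTERNS`, `MODERN_LEMMAS`, and both function names are hypothetical stand-ins, not the project's actual resources.

```python
# Hypothetical rewrite patterns (historical substring -> modern substring).
# Only th -> t is suggested by the slide example theile/teile; the rest are
# illustrative assumptions.
PATTERNS = [("th", "t"), ("ey", "ei"), ("ck", "k")]

# Toy modern full-form lexicon with lemmas; stands in for a real lemmatizer.
MODERN_LEMMAS = {
    "teile": ["teil", "teilen"],
    "taille": ["taille"],
    "fragte": ["fragen"],
}


def rewrite_variants(word, patterns):
    """Return every string reachable by applying patterns at any positions."""
    seen = {word}
    frontier = [word]
    while frontier:
        current = frontier.pop()
        for old, new in patterns:
            start = current.find(old)
            while start != -1:
                candidate = current[:start] + new + current[start + len(old):]
                if candidate not in seen:
                    seen.add(candidate)
                    frontier.append(candidate)
                start = current.find(old, start + 1)
    return seen


def hypothetical_lemmas(historical, patterns=PATTERNS, lexicon=MODERN_LEMMAS):
    """Map a historical full form to {modern full form: lemmas} via patterns."""
    result = {}
    for form in rewrite_variants(historical.lower(), patterns):
        if form in lexicon:
            result[form] = lexicon[form]
    return result


print(hypothetical_lemmas("theile"))  # pattern th -> t reaches "teile"
print(hypothetical_lemmas("frug"))    # no pattern reaches "fragte"
```

In this toy setup "frug" stays unresolved, which is the kind of irregular correspondence a purely pattern-based lexicon cannot capture.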
  • 35. Evaluation of OCR Results
      - Test corpus for the 16th, 18th, and 19th centuries
      - Development version of a professional OCR engine with an external dictionary interface
      - Experiments with different lexicon settings:
          - No additional lexicon, character model only
          - Modern German lexicon
          - Corpus-based witnessed lexicon
          - Hypothetical lexicon
  • 36.
      Dictionary             | 16th century       | 18th century       | 19th century
                             | Errors | Reduction | Errors | Reduction | Errors | Reduction
      No Lexicon             | 1306   | -         | 827    | -         | 2074   | -
      Optimal Lexicon        | 756    | 42%       | 395    | 52%       | 612    | 70%
      Modern Lexicon         | 1096   | 16%       | 501    | 39%       | 888    | 57%
      W.Historical Lexicon   | 938    | 28%       | 481    | 42%       | 856    | 59%
      Modern + Virtual H.L.  | 1011   | 25%       | 480    | 42%       | 849    | 59%
      (Errors = No. of word errors; Reduction = reduction of error rate)
      Slide annotations: WER > 50%; WER ~ 10%
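The slides do not state how the reduction of error rate is computed. A plausible reading, assumed here, is the relative drop in word errors against the "No Lexicon" run. The sketch below checks that reading against the Optimal Lexicon and Modern Lexicon rows; the function name `error_reduction` is illustrative.

```python
def error_reduction(baseline_errors, errors):
    """Relative reduction in word errors versus the baseline, in whole percent."""
    return round(100 * (baseline_errors - errors) / baseline_errors)


# Word error counts taken from the table above.
NO_LEXICON = {"16th": 1306, "18th": 827, "19th": 2074}
OPTIMAL = {"16th": 756, "18th": 395, "19th": 612}
MODERN = {"16th": 1096, "18th": 501, "19th": 888}

for century, baseline in NO_LEXICON.items():
    print(
        century,
        "optimal:", error_reduction(baseline, OPTIMAL[century]),
        "modern:", error_reduction(baseline, MODERN[century]),
    )
```

Under this assumption the computed values agree with the percentages reported for those two rows.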
  • 37. Evaluation of the approximate matcher in an IR scenario
      - Proofread documents from the 16th, 17th, 18th, and 19th centuries and tagged each token manually.
      - Collected a list of (historical) stopwords.
      - Defined “precision” and “recall” for our scenario.
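The slides do not give the definitions of precision and recall used in the experiment. As a sketch under stated assumptions, one option is: precision is the share of proposed modern forms that agree with the manual tag, and recall is the share of non-stopword tokens whose manually tagged form is among the proposals. All names and the sample data below are illustrative.

```python
def precision_recall(gold, proposed, stopwords=frozenset()):
    """Score matcher proposals against manually tagged tokens.

    gold:     list of (historical token, manually tagged modern form)
    proposed: dict mapping a historical token to a set of proposed modern forms
    """
    tokens = 0      # evaluated tokens (stopwords are skipped)
    proposals = 0   # all modern forms the matcher proposed
    hits = 0        # tokens whose gold form is among the proposals
    for token, gold_form in gold:
        if token in stopwords:
            continue
        tokens += 1
        candidates = proposed.get(token, set())
        proposals += len(candidates)
        if gold_form in candidates:
            hits += 1
    precision = hits / proposals if proposals else 0.0
    recall = hits / tokens if tokens else 0.0
    return precision, recall


gold = [("theile", "teile"), ("frug", "fragte"), ("vnd", "und")]
proposed = {"theile": {"teile", "taille"}, "vnd": {"und"}}
print(precision_recall(gold, proposed, stopwords={"vnd"}))
```

Here the wrong candidate "taille" lowers precision, and the unmatched "frug" lowers recall.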
  • 38. Main insights from the IR experiment
      18th and 19th century:
        - A pure matching approach leads to good precision values.
        - Recall values are acceptable.
  • 39. Main insights from the IR experiment
      16th and 17th century:
        - Precision of the matching approach is poor; a lexicon will help to avoid wrong matches.
        - Recall values show that a large number of words can only be explained by a special lexicon.
  • 40. Answer to our refined question: is the lexicon redundant for OCR/IR on historical texts?
      - It depends on the material, especially on the date of origin of the collection:
          - The matching approach leads to acceptable results for 18th and 19th century collections.
          - It has serious limitations for 16th and 17th century collections, where special lexica will lead to important improvements.
      - A combination of the matching approach and manually collected lexica may lead to optimal results.
      - For postprocessing, validated lexica are needed.
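One way to picture the combination of manually collected lexica and the matching approach is a lookup that trusts the manual lexicon first and falls back to pattern matching only when it has no entry. This is a minimal sketch with toy data; the names, the single th -> t pattern, and the lookup order are assumptions rather than the project's documented design.

```python
# Manually validated entries: historical full form -> modern lemmas.
MANUAL_LEXICON = {"frug": ["fragen"]}

# Toy modern full-form lexicon with lemmas (stand-in for a lemmatizer).
MODERN_LEMMAS = {"teile": ["teil", "teilen"], "fragte": ["fragen"]}

# Hypothetical rewrite patterns (historical substring -> modern substring).
PATTERNS = [("th", "t")]


def match_fallback(word):
    """Apply each pattern to the whole word and look the result up."""
    lemmas = []
    for old, new in PATTERNS:
        candidate = word.replace(old, new)
        lemmas.extend(MODERN_LEMMAS.get(candidate, []))
    return lemmas


def lookup(word):
    """Return (lemmas, source), preferring the manual lexicon."""
    word = word.lower()
    if word in MANUAL_LEXICON:
        return MANUAL_LEXICON[word], "manual lexicon"
    lemmas = match_fallback(word)
    if lemmas:
        return lemmas, "matching"
    return [], "unknown"


for form in ("frug", "Theile", "xyz"):
    print(form, lookup(form))
```

Checking the manual lexicon first keeps the reliable associations authoritative, while the matcher supplies coverage for forms nobody has entered yet.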
  • 41. Engineering Consequences
      A focus collection of the Bavarian State Library: VD16, a collection of Early High German books
      - Collaborative project with BSB on OCR/IR for this collection: Clemens Neudecker/Fedor Bochow
      - Special lexicon building needed
      - No 16th century electronic corpora available for lexicon development
      - For a real-world test, we defined one topic area as the main interest: Theology
  • 42. Engineering Consequences
      Iterative process with the Bavarian State Library to create resources for the VD16 collection:
      (1) A random selection of 200 pages from 100 sources
      (2) OCR and corpus experiments
      (3) Selection of usable sources
      (4) Specification of keying by BSB/CIS for 70 complete books, usable for both presentation and linguistic resources building
      (5) Contract with service providers
  • 43. Engineering Consequences
      A focus collection of the Bavarian State Library: VD16, a collection of Early High German books with 30 million pages
      - Integrate OCR supplier: special typeface models; character models
      - Linguistic Database
      - Improve OCR with a specialized historical lexicon for VD16
      - Improve IR access with a normalization lexicon
  • 44. Wrap Up
      • For challenging historical materials, specialized lexica are needed.
      • Special lexica integrated directly into the basic OCR engine improve OCR quality significantly. For bigger projects, seek direct collaboration with OCR partners.
      • For IR: use approximate matching or normalization lexica to process user queries.
      • Integrate research institutions and collection holders.