Targeted Language Resources for the Digitisation of Historical Collections
1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
Annette Gotscharek, Ulrich Reffle, Christoph Ringlstetter,
Klaus U. Schulz
CIS, University of Munich
JISC Workshop 2009, Bath, UK
Questions and Methods
For historical documents of a specific period: what
kind of linguistic resources are needed?
What kind of improvements can be expected?
What are the consequences for engineering and processes?
----------------------------------------------------------------------
(1) Corpus analysis
(2) Quantitative Experiments on OCR and IR
Survey
1. Special Challenges to Digitize Historical Materials
2. Composing and Analyzing a Historical Corpus
3. Types of Linguistic Resources
4. Evaluation of Benefits: OCR, IR
5. Consequences for Engineering
0. CIS within IMPACT
• (1) Text Recognition: Adaptation of Optical Character
Recognition to historical documents
• (2) Resources Building: Enrichment of texts to improve
Information Retrieval (IR) on historical documents
• (3) Research beyond IMPACT: steps to a next generation
interface to access collections of historical documents
Digitization Projects for Historical Materials
SELECT Create a Historical Collection
SCAN Create Images: Greyscale, Color, QA
PROCESS Improve Images
OCR/Type Create a Symbolical Representation, QA
INDEX Process a Term-Document Representation
PRESENT Provide a User Interface for Access
1. Special Challenges for the Digitization of
Historical Materials
[Timeline: 1500, 1600, 1700, 1800, 1900]
• Imaging: damage on originals
• Optical Character Recognition: rate of recognition errors
• Information Retrieval: historical variants
• Human Reading: unknown words
Optical Character Recognition: Gothic, good quality
Städte den römischen mumcizmg gleich zu stellen. Allem wenn sich je in
einem Rechtstheile die altrechtlichen teutschen Gewohnheiten, und Gesetze
erhalten haben, so ist es gewiß in dieser Lehre, man mag entweder auf die
Befugniß, die Stadtgerechtigkeit zu ertheilen , oder auf die innere
Regimentsverfftssung so-
Optical Character Recognition: medium quality
Fürsten zu Gstternwerden/wer wollte vermainen / daßwt
IhroKhurftrstl Durchl gnädiglsterHcttVatterinderpictcr
rndFrombkcltallmFürstenvorzusetzen!scyn/vnd das halst>
in^cclcQ^ vci pluz^uäzn 5accr6o5 daß tl iN KilchkN GottW
wehr als ein Priester .
Examples of Noise introduced by imperfect OCR
(1) Processed word images may lead to false friends
Fischerei - Tischlerei: F -> T, h -> hl (Eng. fishery - carpentry)
(2) Processed word images may relate to no word at all
^.uglltt. schreibet/
(3) Severe word segmentation errors
vndExcmpelFürstl-vnd HeroischerTuzenF
OCR on Gothic materials:
good (WER < 10%); medium (10-30%); bad (> 30%)
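The quality bands can be made concrete with a small sketch. It assumes that WER is the word-level edit distance divided by the number of reference words (the slides do not define it) and that the last band means a WER above 30%. The function names are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: distance between the reference words seen so far and the first j OCR words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def quality_band(wer: float) -> str:
    """Assumed bands: good < 10%, medium 10-30%, bad > 30%."""
    if wer < 0.10:
        return "good"
    if wer <= 0.30:
        return "medium"
    return "bad"


reference = "die Stadtgerechtigkeit zu ertheilen"
ocr_output = "die Stadtgerechtigkeit zn ertheilen"
wer = word_error_rate(reference, ocr_output)
print(f"{wer:.2f}", quality_band(wer))
```

One wrong word out of four gives a WER of 0.25, which falls into the medium band. A segmentation error that glues two words together counts as one substitution plus one deletion under this definition.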
Why is it so bad?
(1) The image quality is challenging; further
processing is needed
(2) The classifiers of the OCR disregard certain
typefaces used in historical print
(3) The language resources of the OCR are
inappropriate for historical language
IR: Search Problem
Even for keyed collections:
Kräuter (Eng. herbs)?
Spellings: Kreüter, krauter, kreuter, creuther
0 Results for Kreuter
– kräuter (= Eng. herbs) as kraͤuter, kreuter, kreüter, kreuter, creuther
Special Challenge for IR and OCR on
Historical Texts: Spelling Variation
Because orthography was not yet standardized, historical
documents contain many spelling variants, e.g. in
German texts (1500-1850):
– teil (= Eng. part) as theil, teyl, theyl
– kräuter (= Eng. herbs) as kraͤuter, kreuter, kreüter,
kreuter, creuther
– fragte (= Eng. asked) as frug, fruk
Users are not aware of the variants and miss many
documents; sometimes they even retrieve false friends
Solution: Mapping from variants to modern lemma
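A minimal sketch of how such a mapping could serve retrieval: the user types the modern form and the query is expanded with every historical variant mapped to it. The table holds only the examples from the slide; the names `VARIANT_TO_MODERN`, `build_expansion_index` and `expand_query` are hypothetical and not part of any IMPACT tool.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Toy normalization lexicon: historical variant -> modern form (slide examples only).
VARIANT_TO_MODERN: Dict[str, str] = {
    "theil": "teil", "teyl": "teil", "theyl": "teil",
    "kreuter": "kräuter", "kreüter": "kräuter", "creuther": "kräuter",
    "frug": "fragte", "fruk": "fragte",
}


def build_expansion_index(variant_to_modern: Dict[str, str]) -> Dict[str, Set[str]]:
    """Invert the mapping: modern form -> set of historical variants."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for variant, modern in variant_to_modern.items():
        index[modern].add(variant)
    return index


def expand_query(term: str, index: Dict[str, Set[str]]) -> List[str]:
    """Return the lowercased query term plus all known historical variants."""
    term = term.lower()
    return sorted({term} | index.get(term, set()))


INDEX = build_expansion_index(VARIANT_TO_MODERN)
print(expand_query("Kräuter", INDEX))
```

A search engine would then OR the expanded terms together, so that a query for Kräuter also finds documents containing kreuter or creuther.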
Language Resources to tackle challenges
encountered in OCR and IR
• Lexica for OCR
• Language Models for OCR
• Statistical Information about transformation patterns
• Historical Stopwords for IR
• Normalization Lexica with a mapping between modern and
historical word forms for IR
• Syntactical Information for paradigmatic expansion and
disambiguation at POS level
Mantra: All Language Resources are Corpus
Based
Possible sources:
• Keyed Materials on the Web
• Non Public Electronic Corpora
• Keying/corrected OCR of Image Corpora
• Noisy OCR Corpora
Status of German Historical Corpora
1. Main development corpus
• Proofread texts from 1400 to 1900
• Medium size: 2.7 million tokens
• For lexicon construction
• For diachronic analysis/classification of vocabulary of distinct periods
2. OCR corpus for lexicon testing
• OCRed images aligned with ground truth
• Texts from 16th, 18th, 19th century (5034, 2659, 18052 tokens)
3. IR test corpus for lexicon testing
• Special linguistically annotated ground truth
• Texts from 16th, 17th, 18th, 19th century, 31080 tokens
Language in a historical corpus for German
Modern lexicon (CISLEX): coverage on Main Corpus on 10 periods
Language in a historical corpus for German
Compounds (modern components); coverage on Main Corpus on 10 periods
Two Variants of Lexica for IR and OCR
Hypothetical Lexicon:
Maps input strings to modern lexicon entries
dynamically via a special approximate matching
procedure using historical transformation patterns.
Witnessed Lexicon:
Corpus-checked lexicon entries:
Historical spelling variant + modern lemma for IR
Historical word list for OCR
Hypothetical Lexicon: approximate
matching procedure
Many of the spelling variations can be traced back to
a modern word by applying characteristic patterns /
rewrite rules
– e.g. the historical string theyle can be traced to
its modern equivalent teile by applying th → t
and ey → ei.
Required resources:
– Contemporary lexicon with inflected word forms
– Set of typical language-specific spelling variation
patterns
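The idea can be sketched as follows. This is not the project's matcher: it is a toy version that rewrites any subset of non-overlapping pattern matches in the historical string, without cascading one pattern on the output of another, and keeps the candidates found in a modern lexicon. The four patterns and three lexicon entries are the ones shown on the slides; all function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Patterns read as: historical substring -> modern substring.
PATTERNS: List[Tuple[str, str]] = [("th", "t"), ("ei", "ai"), ("ey", "ei"), ("l", "ll")]

# Modern inflected form -> lemmas.
MODERN_LEXICON: Dict[str, List[str]] = {
    "teile": ["teil", "teilen"],
    "taille": ["taille"],
    "fragte": ["fragen"],
}


def rewrite_candidates(word: str, patterns: List[Tuple[str, str]]) -> Set[str]:
    """All strings obtained by rewriting any subset of non-overlapping matches."""
    # suffixes[i] holds every rewrite of word[i:]; filled from right to left.
    suffixes: List[Set[str]] = [set() for _ in range(len(word) + 1)]
    suffixes[len(word)] = {""}
    for i in range(len(word) - 1, -1, -1):
        # Option 1: keep the character at position i unchanged.
        suffixes[i] = {word[i] + rest for rest in suffixes[i + 1]}
        # Option 2: apply a pattern whose left side starts at position i.
        for old, new in patterns:
            if word.startswith(old, i):
                suffixes[i] |= {new + rest for rest in suffixes[i + len(old)]}
    return suffixes[0]


def match(word: str, patterns: List[Tuple[str, str]],
          lexicon: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Modern lexicon entries reachable from the historical word."""
    return {cand: lexicon[cand]
            for cand in rewrite_candidates(word, patterns) if cand in lexicon}


for historical in ("theyle", "theile", "frug"):
    print(historical, "->", match(historical, PATTERNS, MODERN_LEXICON))
```

With these toy resources, theyle reaches only teile, theile reaches both teile and the unrelated taille, and frug reaches nothing. The candidate set grows exponentially with the number of matching positions, so a realistic implementation over about 140 patterns would need a more efficient search.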
Approximate matching procedure
Modern lexicon
Inflected forms    Lemmatizing information
…                  …
teile              teil (= part)
...                teilen (= to share)
taille             taille (= waist)
fragte             fragen (= to ask)
…                  …
Approximate matching procedure
Spelling variation: theile
~ 140 Patterns: …, th → t, ei → ai, ey → ei, l → ll, …
Modern lexicon
Inflected forms    Lemmatizing information
…                  …
teile              teil (= part)
...                teilen (= to share)
taille             taille (= waist)
fragte             fragen (= to ask)
…                  …
Approximate matching procedure
Spelling variation: frug → ?
~ 140 Patterns: …, th → t, ei → ai, ey → ei, l → ll, …
Modern lexicon
Inflected forms    Lemmatizing information
…                  …
teile              teil (= part)
...                teilen (= to share)
taille             taille (= waist)
fragte             fragen (= to ask)
…                  …
Approximate matching procedure
Advantages:
– No manual work needed
– Dynamic approach
Limitations:
– Mismatches may link a historical spelling
variation to a wrong modern word.
– A part of the historical vocabulary cannot be
reduced to a modern word by simple matching
Language in a historical corpus for German
Hypothetical lexicon; coverage on Main Corpus on 10 periods
Manually collected special lexica
Spelling variations (manual mapping to the modern lexicon): theile, frug
Modern lexicon
Inflected forms    Lemmatizing information
…                  …
teile              teil (= part)
...                teilen (= to share)
taille             taille (= waist)
fragte             fragen (= to ask)
…                  …
Manually collected special lexica
Advantages:
– Associations between historical variant and
modern lemma are safe
– Associations that are not covered by the matching
approach can be stored explicitly
Limitations:
– Time-consuming and labor-intensive; in some cases
specialists (historical linguists) are
needed.
– Hardly ever complete because of the immense
number of spelling variants
Evaluation of the approximate matcher –
is the lexicon redundant?
Few empirical studies on crucial decisions for
IR and OCR on historical texts:
– Is a matching approach enough?
– Do we need a lexicon, and if so, in which
scenarios?
Lexica built
Both assign modern lemmas to historical full forms
Validated lexicon: constructed from the Main Corpus, currently ca. 15,000
entries (“kernel lexicon for hist. German”); poor coverage
Witnessed lexicon for OCR: from the Main Corpus, 200,000 tokens without
modern correspondence; coverage still limited by corpus size
Hypothetical lexicon for IR: matching procedure mapping a historical full
form to its modern counterpart, plus a lemmatizer for the modern language
(historical full form <---> modern lemma); based on 140 patterns,
theoretically 100 million entries. High coverage, but assignments can be
erroneous and only regular (pattern-based) correspondences are captured
Evaluation of OCR Results
Test corpus for the 16th, 18th, and 19th century
Development version of a professional OCR engine with an external
dictionary interface
Experiments with different lexicon settings:
– No additional lexicon, character model only
– Modern German lexicon
– Corpus-based witnessed lexicon
– Hypothetical lexicon
Errors = no. of word errors; Reduction = reduction of error rate
Dictionary              16th century         18th century         19th century
                        Errors  Reduction    Errors  Reduction    Errors  Reduction
No Lexicon              1306    -            827     -            2074    -
Optimal Lexicon         756     42%          395     52%          612     70%
Modern Lexicon          1096    16%          501     39%          888     57%
W.Historical Lexicon    938     28%          481     42%          856     59%
Modern + Virtual H.L.   1011    25%          480     42%          849     59%
WER > 50% WER ~ 10%
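The reduction figures can be recomputed from the error counts. The sketch assumes that the reduction is (errors without lexicon - errors with lexicon) / errors without lexicon; the slides do not state the formula. Under that assumption the transcribed counts reproduce every percentage in the table except the 16th century value for Modern + Virtual H.L., where 1011 errors against 1306 give about 23% rather than 25%.

```python
from typing import Dict

# Word error counts as transcribed from the table, per century.
BASELINE: Dict[str, int] = {"16th": 1306, "18th": 827, "19th": 2074}
ERRORS: Dict[str, Dict[str, int]] = {
    "Optimal Lexicon":       {"16th": 756,  "18th": 395, "19th": 612},
    "Modern Lexicon":        {"16th": 1096, "18th": 501, "19th": 888},
    "W.Historical Lexicon":  {"16th": 938,  "18th": 481, "19th": 856},
    "Modern + Virtual H.L.": {"16th": 1011, "18th": 480, "19th": 849},
}


def error_reduction(baseline: int, errors: int) -> float:
    """Relative reduction of word errors against the no-lexicon run (assumed formula)."""
    return (baseline - errors) / baseline


for name, row in ERRORS.items():
    cells = [f"{century}: {error_reduction(BASELINE[century], count):.0%}"
             for century, count in row.items()]
    print(f"{name:22s}", "  ".join(cells))
```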
Evaluation of the approximate matcher in an
IR scenario
Proofread documents from the 16th, 17th, 18th and 19th
century and tagged each token manually.
Collected a list of (historical) stopwords.
Defined “precision” and “recall” for our scenario.
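The slides do not give the definitions used. One possible reading, shown here purely as an assumption, treats the matcher as proposing a set of modern lemmas per historical token: precision is the share of proposed lemmas that are correct, recall is the share of tokens whose manually tagged lemma is among the proposals. The data and names below are illustrative only.

```python
from typing import Dict, List, Set, Tuple


def precision_recall(gold: List[Tuple[str, str]],
                     proposed: Dict[str, Set[str]]) -> Tuple[float, float]:
    """gold: (token, manually tagged lemma); proposed: token -> lemmas from the matcher."""
    total_proposed = 0
    correct = 0
    for token, lemma in gold:
        candidates = proposed.get(token, set())
        total_proposed += len(candidates)
        if lemma in candidates:
            correct += 1
    precision = correct / total_proposed if total_proposed else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data: theile gets one right lemma and two wrong ones, frug gets no proposal.
gold = [("theile", "teil"), ("frug", "fragen"), ("theyl", "teil")]
proposed = {"theile": {"teil", "teilen", "taille"}, "theyl": {"teil"}}
precision, recall = precision_recall(gold, proposed)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case wrong matches such as taille lower precision, while words like frug that no pattern can explain lower recall.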
Main insights IR experiment
18th and 19th century:
– Pure matching approach leads to good
precision values.
– Recall values are acceptable
Main insights IR experiment
16th and 17th century:
– Precision of the matching approach is poor; a
lexicon will help to avoid wrong matches.
– Recall values show that a large number of words
can only be explained by a special lexicon
Answer to our refined Question
Is the lexicon redundant for OCR/IR on historical
texts?
Depends on the material, especially on the date of
origin of the collection:
– Matching approach leads to acceptable results for 19th
and 18th century collections.
– Serious limitations for 16th and 17th century collections.
Special lexica will lead to important improvements
Combination of matching approach and manually
collected lexica may lead to optimal results.
For postprocessing validated lexica are needed
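The combination of both resources could look like the sketch below: a manually collected lexicon is consulted first, and the pattern-based matcher is only a fallback for words it does not cover. The lexicon contents, the deliberately simplified `toy_matcher` and the function `lookup` are hypothetical stand-ins, not the project's components.

```python
from typing import Callable, Dict, List

# Manually verified entries: historical form -> modern lemmas.
WITNESSED: Dict[str, List[str]] = {"frug": ["fragen"]}

# Modern inflected form -> lemmas, used by the fallback matcher.
MODERN: Dict[str, List[str]] = {"teile": ["teil", "teilen"]}


def toy_matcher(word: str) -> List[str]:
    """Very rough stand-in for approximate matching: apply two patterns everywhere."""
    normalized = word.replace("th", "t").replace("ey", "ei")
    return MODERN.get(normalized, [])


def lookup(word: str, witnessed: Dict[str, List[str]],
           fallback: Callable[[str], List[str]]) -> List[str]:
    """Prefer the safe manual association; fall back to pattern matching."""
    key = word.lower()
    if key in witnessed:
        return witnessed[key]
    return fallback(key)


for historical in ("frug", "Theyle", "creuther"):
    print(historical, "->", lookup(historical, WITNESSED, toy_matcher))
```

The ordering reflects the trade-off described on the slides: manual associations are safe but incomplete, pattern matching has high coverage but can be wrong.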
Engineering Consequences
A Focus Collection of the Bavarian State Library
VD16: Collection of Early High German Books
Collaborative project with BSB on OCR/IR for this
collection: Clemens Neudecker/Fedor Bochow
Special lexicon building needed
No 16th century electronic corpora available for
lexicon development
For a real-world test we defined a topic area as main
interest: Theology
Engineering Consequences
Iterative Process with Bavarian State Library to
Create Resources for VD16 Collection
(1) A random selection of 200 pages from 100 sources
(2) OCR and corpus experiments
(3) Selection of usable sources
(4) Specification of keying by BSB/CIS for 70 complete
books usable for both presentation and linguistic
resources building
(5) Contract with service providers
Engineering Consequences
A Focus Collection of the Bavarian State Library
VD16: Collection of Early High German Books with
30 million pages
Linguistic Database for VD16
Integrate OCR Supplier: Special Typeface
Models; Character Models
Improve OCR with a specialized historical lexicon
Improve IR access with a normalization lexicon
Wrap Up
• For challenging historical materials specialized
lexica are needed
• Special lexica integrated directly into basic OCR
lift OCR quality significantly. For bigger projects
seek direct collaboration with OCR partners
• For IR: use approximate matching or normalization
lexica to process user queries
• Integrate research institutions and collection
holders