IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.




Special resources to access 16th century German
Ludwig-Maximilians-Universität München

Annette Gotscharek




15. 10. 2011, IMPACT Conference



Special resources to access 16th century German
                   “access”?
         OCR:
          Role of the lexicon: defines the set of valid words.

          ...      Geist
                   Geister
                   Teile
                   gemütlich …


         Information Retrieval (IR):
          Role of the lexicon: meaningful expansion of the user query to increase recall.

          ...      Geist → Geister, Geiste, Geistern
                   Teil → Teile, Teils, Teilen
                   gemütlich → gemütlicher, gemütlichste ...
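
A minimal sketch of the query expansion described above, in Python. LEXICON and expand_query are hypothetical names for illustration; the entries are the examples from this slide, not the project's actual lexicon format.

```python
# Hypothetical toy lexicon: base form -> further inflected forms (examples from the slide).
LEXICON = {
    "Geist": ["Geister", "Geiste", "Geistern"],
    "Teil": ["Teile", "Teils", "Teilen"],
    "gemütlich": ["gemütlicher", "gemütlichste"],
}


def expand_query(term, lexicon=LEXICON):
    """Return the query term followed by every form the lexicon lists for it."""
    # Unknown terms are passed through unchanged, so recall never gets worse.
    return [term] + lexicon.get(term, [])


if __name__ == "__main__":
    print(expand_query("Teil"))
```

A search engine would then match a document if it contains any of the returned forms.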




Special resources to access 16th century German
         In IMPACT, we worked on documents from 1500-1950, but the 16th century is special:
         –          Language period: Early New High German (1350-1650)
         –          Oldest and therefore most challenging period of printed books
         –          Large library holdings from the 16th century at our partner library BSB
         Linguistic features of historical language at the word level:

                                                 Historic     Modern       English
          –      Historical spelling variation:  geyſte       Geiste       spirit
          –      Historical morphology:          er frug      er fragte    he asked
          –      Obsolete vocabulary:            mirackel     Wunder (?)   miracle
          –      Obsolete character set:         aͤ            ä …


  → Adapted linguistic resources are needed








                Adapted linguistic resources: structure
         OCR:

          ...      Geist                                         Geyst
                   Geister                                       Geyster
                   Teile                                         Theile
                   gemütlich                                     gemüthlich …


         Information Retrieval (IR):

          ...      Geist → Geister, Geiste, Geistern             Geyster, Geyste, Geystern
                   Teil → Teile, Teils, Teilen                   Theile, Theils, Theilen
                   gemütlich → gemütlicher, gemütlichste         gemüthlicher, gemüthlichste ...








             Linguistic Resources for Historical Texts

 Diachronic Groundtruth Corpus (1500-1950)
 Hypothetical lexicon for rule based variants
 Manually verified lexicon








       Diachronic Groundtruth Corpus (1500-1950)

 Collection of groundtruth material from different sources on the web and from
  non-public electronic corpora (Institut für Deutsche Sprache Mannheim)

 Large gap, especially in the 16th and 17th centuries.
   Together with BSB, an additional corpus was prepared from BSB documents:
   – Random selection of 100 works from digitized images of the 16th and 17th centuries
   – Mostly related to theology
   – Latin texts excluded, no poems etc.
   – Keyed by a service provider
   – 1766 pages with ~ 858,000 tokens of groundtruth material





       Diachronic Groundtruth Corpus (1500-1950)
         Gains of tokens by the extension of the corpus: [chart not preserved]

         The complete corpus contains ~ 3,380,000 tokens in 500 texts from 4 centuries
           → basis for different analyses and lexicon building





             Coverage on Diachronic Corpus: modern
Types (%)               1500-  1550-  1600-  1650-  1700-  1750-  1800-  1850-  1900-
                        1549   1599   1649   1699   1749   1799   1849   1899   1949
Modern simple words     15.3   28.8   29.2   31.5   38.1   52.0   54.7   48.0   60.1
Modern compounds        5.1    6.1    6.9    8.6    7.13   15.5   20.6   28.1   27.8


   Less than 45% of the vocabulary is covered by modern resources before 1750.
   16th century: only 15% - 29% modern simple words, modern closed compounds
    are hardly relevant.
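
One way a type-coverage figure of this kind could be computed, as a sketch only. The slides do not state the exact procedure (for example case folding or compound handling), so the function type_coverage and its inputs are assumptions for illustration.

```python
def type_coverage(tokens, lexicon):
    """Percentage of distinct word forms (types) in `tokens` that occur in `lexicon`."""
    types = set(tokens)  # repeated tokens count once
    if not types:
        return 0.0
    covered = types & set(lexicon)
    return 100.0 * len(covered) / len(types)


if __name__ == "__main__":
    # Hypothetical mini corpus; "teil" occurs twice but is a single type.
    corpus = ["geyst", "teil", "teil", "theile", "und"]
    modern = {"teil", "und", "geist"}
    print(type_coverage(corpus, modern))
```

Counting types rather than tokens keeps very frequent function words from dominating the figure.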








        Hypothetical lexicon for rule based variants

  Systematic substitution rules (patterns) describe the difference
   between modern and historical spelling:
                                  t → th, ei → ey
 (modern)          teil   ────────────────────────→   theyl          (historic)

  Based on the modern lexicon and the 140 manually collected
   patterns, the set of all potential rule based historical variants can be
   computed automatically (“hypothetical lexicon”).
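
The generation step can be sketched as follows. This is an illustrative assumption, not the project's implementation: RULES holds only the five patterns shown in these slides (the real set has 140), words are lowercased, and each pattern is applied optionally at each matching position in a single left-to-right pass.

```python
# Patterns as (modern, historic) pairs; only the examples visible in the slides.
RULES = [("t", "th"), ("ei", "ey"), ("e", "eh"), ("s", "ß"), ("l", "ll")]


def variants(word, rules=RULES):
    """Map each potential spelling of `word` to the patterns used to derive it."""
    results = {}

    def walk(pos, out, used):
        if pos == len(word):
            results.setdefault(out, used)  # keep the first derivation found
            return
        # Option 1: copy the current character unchanged.
        walk(pos + 1, out + word[pos], used)
        # Option 2: apply any pattern whose modern side matches here.
        for modern, historic in rules:
            if word.startswith(modern, pos):
                walk(pos + len(modern), out + historic, used + ((modern, historic),))

    walk(0, "", ())
    return results


if __name__ == "__main__":
    for form, patterns in sorted(variants("teil").items()):
        print(form, patterns)
```

Because the scan never re-reads its own output, a pattern such as l → ll cannot be applied to its own result, so the variant set stays finite. The unchanged modern word is included with an empty pattern tuple.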






           Hypothetical lexicon for rule based variants
    modern lexicon:        …, Esel, …, Teil, …
    pattern set:           e → eh, ei → ey, s → ß, l → ll, t → th, …
    hypothetical lexicon:  Esel, Esell, Esehl, Esehll, Eßel, Eßell, Eßehll, …
                           Teil, Teill, Teyl, Teyll, Tehill, Theil, …





        Hypothetical lexicon for rule based variants

  Rule based historical variants can be mapped automatically to their
   equivalent in the modern vocabulary:
                       historic          modern
                       Geyst      =      Geist + (ei → ey)
                       Theile     =      Teile + (t → th)


 By no means all historical variants can be described by simple replacement rules:
                       historic          modern
                       frug       =      fragte + ?
                       Mirackel   =      ? + ?
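
A sketch of this mapping in Python, under stated assumptions: MODERN is a tiny hypothetical stand-in for the modern lexicon, RULES contains just the two patterns from the examples above, and words are lowercased. The function undoes patterns from left to right and keeps only results found in the modern lexicon.

```python
MODERN = {"geist", "teile", "fragte"}   # hypothetical stand-in for the modern lexicon
RULES = [("ei", "ey"), ("t", "th")]     # (modern, historic) pairs from the examples


def modernize(historic, modern=MODERN, rules=RULES):
    """Return (modern form, patterns) pairs that explain `historic`; [] if none do."""
    hits = []

    def walk(pos, out, used):
        if pos == len(historic):
            if out in modern:
                hits.append((out, used))
            return
        # Keep the current character as it is.
        walk(pos + 1, out + historic[pos], used)
        # Or undo a pattern whose historic side matches at this position.
        for mod, hist in rules:
            if historic.startswith(hist, pos):
                walk(pos + len(hist), out + mod, used + ((mod, hist),))

    walk(0, "", ())
    return hits


if __name__ == "__main__":
    for form in ["geyst", "theile", "frug", "mirackel"]:
        print(form, modernize(form))
```

Forms such as frug and mirackel come back empty: no combination of replacement patterns leads to a modern word, which is the case that calls for manual treatment.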






        Coverage on Diachronic Corpus: hypothetic
Types (%)               1500-  1550-  1600-  1650-  1700-  1750-  1800-  1850-  1900-
                        1549   1599   1649   1699   1749   1799   1849   1899   1949
Modern simple words     15.3   28.8   29.2   31.5   38.1   52.0   54.7   48.0   60.1
Modern compounds        5.1    6.1    6.9    8.6    7.13   15.5   20.6   28.1   27.8
Hypothetic              29.5   29.8   27.9   26.0   21.9   14.3   8.1    7.7    2.0




   16th century: about 30% of the vocabulary is covered by the lexicon of rule based
    variants
   Applied as an OCR lexicon via the IMPACT Abbyy External Dictionary Interface:
    improvement of the recognition rate (published 2009)




             Coverage on Diachronic Corpus: missing
Types (%)               1500-  1550-  1600-  1650-  1700-  1750-  1800-  1850-  1900-
                        1549   1599   1649   1699   1749   1799   1849   1899   1949
Modern simple words     15.3   28.8   29.2   31.5   38.1   52.0   54.7   48.0   60.1
Modern compounds        5.1    6.1    6.9    8.6    7.13   15.5   20.6   28.1   27.8
Hypothetic              29.5   29.8   27.9   26.0   21.9   14.3   8.1    7.7    2.0
Missing                 45.9   28.7   29.7   26.0   23.5   15.1   13.9   13.5   8.1




   Especially in the 16th century: up to 46% “difficult” vocabulary.
     → A manually verified lexicon is necessary!




                  Manually verified IR-lexicon: Structure

One entry contains:
          –      Historical word form from the corpus
          –      Corresponding modern word form
          –      Patterns if applicable
          –      Corresponding modern lemma
          –      At least one occurrence in the corpus as an attestation for the reading


 Manual assignment of modern word form and lemma
 Explicit handling of variants that are not rule based
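
The entry structure listed above could be modelled roughly like this. The class and field names are hypothetical and the attestation is a placeholder string; the slides do not show the actual storage format.

```python
from dataclasses import dataclass, field


@dataclass
class LexiconEntry:
    """One entry of a manually verified lexicon (illustrative field names)."""
    historical_form: str                               # word form as found in the corpus
    modern_form: str                                   # corresponding modern word form
    lemma: str                                         # corresponding modern lemma
    patterns: list = field(default_factory=list)       # empty if not rule based
    attestations: list = field(default_factory=list)   # corpus occurrences

    def has_attestation(self):
        """An entry needs at least one occurrence in the corpus."""
        return len(self.attestations) >= 1


if __name__ == "__main__":
    # Rule based variant, with its pattern and a placeholder attestation.
    geyst = LexiconEntry("Geyst", "Geist", "Geist", [("ei", "ey")], ["<corpus occurrence>"])
    # Variant that no pattern explains: the patterns list stays empty.
    frug = LexiconEntry("frug", "fragte", "fragen")
    print(geyst.has_attestation(), frug.has_attestation())
```

Keeping the patterns optional lets the same record type hold both rule based variants and the manually assigned ones.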





            Manually verified IR-lexicon: Compilation

 Web-based, collaborative user interface
 User support:
          – For rule based variants: Suggestion of the corresponding modern word
            form by the hypothetical lexicon
          – Suggestion of all possible lemmas for the modern word form by a large
            modern lexicon (CISLEX)
          – Concordance list of the historical variant








                       Manually verified IR-lexicon: Status

 41,600 entries have been created for 24,800 historical word forms
  from the diachronic corpus, 72,100 attestations were annotated.

 IMPACT partners in Slovenia and Bulgaria are creating corresponding
  lexica with an adapted version of the tool.








                                                                 Thank you.




15. 10. 2011, IMPACT Conference                                                                                                                          21

Más contenido relacionado

La actualidad más candente

The Improving Access to Text (IMPACT) project and other European initiatives
The Improving Access to Text (IMPACT) project and other European initiativesThe Improving Access to Text (IMPACT) project and other European initiatives
The Improving Access to Text (IMPACT) project and other European initiativesMichael Day
 
Europeana. A Digital Library for the Humanities?
Europeana. A Digital Library for the Humanities?Europeana. A Digital Library for the Humanities?
Europeana. A Digital Library for the Humanities?AubreyMcFato
 
Jpl cv en JUN10
Jpl cv en JUN10Jpl cv en JUN10
Jpl cv en JUN10JoaoPL
 
DATA VISUALIZATION USING MATPLOTLIB (PYTHON)
DATA VISUALIZATION USING MATPLOTLIB (PYTHON)DATA VISUALIZATION USING MATPLOTLIB (PYTHON)
DATA VISUALIZATION USING MATPLOTLIB (PYTHON)Mohammed Anzil
 
Tel concertation meeting project presentations - 7-2-2014
Tel concertation meeting   project presentations - 7-2-2014Tel concertation meeting   project presentations - 7-2-2014
Tel concertation meeting project presentations - 7-2-2014munarmu
 
Jpl cv en_mar11
Jpl cv en_mar11Jpl cv en_mar11
Jpl cv en_mar11JoaoPL
 
Javier Diaz Presentacion Korea V4
Javier Diaz Presentacion Korea V4Javier Diaz Presentacion Korea V4
Javier Diaz Presentacion Korea V4Javier Diaz
 
iDiscover: Towards the next generation of contextualised mobile museum guides
iDiscover: Towards the next generation of contextualised mobile museum guidesiDiscover: Towards the next generation of contextualised mobile museum guides
iDiscover: Towards the next generation of contextualised mobile museum guidesiDiscover Interactief Erfgoed
 
Jpl Cv En v2
Jpl Cv En v2Jpl Cv En v2
Jpl Cv En v2JoaoPL
 

La actualidad más candente (10)

The Improving Access to Text (IMPACT) project and other European initiatives
The Improving Access to Text (IMPACT) project and other European initiativesThe Improving Access to Text (IMPACT) project and other European initiatives
The Improving Access to Text (IMPACT) project and other European initiatives
 
IMPACT Final Conference - Ulrich Reffle
IMPACT Final Conference - Ulrich ReffleIMPACT Final Conference - Ulrich Reffle
IMPACT Final Conference - Ulrich Reffle
 
Europeana. A Digital Library for the Humanities?
Europeana. A Digital Library for the Humanities?Europeana. A Digital Library for the Humanities?
Europeana. A Digital Library for the Humanities?
 
Jpl cv en JUN10
Jpl cv en JUN10Jpl cv en JUN10
Jpl cv en JUN10
 
DATA VISUALIZATION USING MATPLOTLIB (PYTHON)
DATA VISUALIZATION USING MATPLOTLIB (PYTHON)DATA VISUALIZATION USING MATPLOTLIB (PYTHON)
DATA VISUALIZATION USING MATPLOTLIB (PYTHON)
 
Tel concertation meeting project presentations - 7-2-2014
Tel concertation meeting   project presentations - 7-2-2014Tel concertation meeting   project presentations - 7-2-2014
Tel concertation meeting project presentations - 7-2-2014
 
Jpl cv en_mar11
Jpl cv en_mar11Jpl cv en_mar11
Jpl cv en_mar11
 
Javier Diaz Presentacion Korea V4
Javier Diaz Presentacion Korea V4Javier Diaz Presentacion Korea V4
Javier Diaz Presentacion Korea V4
 
iDiscover: Towards the next generation of contextualised mobile museum guides
iDiscover: Towards the next generation of contextualised mobile museum guidesiDiscover: Towards the next generation of contextualised mobile museum guides
iDiscover: Towards the next generation of contextualised mobile museum guides
 
Jpl Cv En v2
Jpl Cv En v2Jpl Cv En v2
Jpl Cv En v2
 

Destacado

IMACT Final Conference - Language Parallel Sessions - Erjavec
IMACT Final Conference - Language Parallel Sessions - ErjavecIMACT Final Conference - Language Parallel Sessions - Erjavec
IMACT Final Conference - Language Parallel Sessions - ErjavecIMPACT Centre of Competence
 
IMPACT Final Conference - Language Parallel Sessions - Landsbergen
IMPACT Final Conference - Language Parallel Sessions -  LandsbergenIMPACT Final Conference - Language Parallel Sessions -  Landsbergen
IMPACT Final Conference - Language Parallel Sessions - LandsbergenIMPACT Centre of Competence
 
IMPACT Final Conference - Research Parallel Sessions - 01 impact conference_r...
IMPACT Final Conference - Research Parallel Sessions - 01 impact conference_r...IMPACT Final Conference - Research Parallel Sessions - 01 impact conference_r...
IMPACT Final Conference - Research Parallel Sessions - 01 impact conference_r...IMPACT Centre of Competence
 
IMPACT Final Conference - Research Parallel Sessions02 research session_ncsr_...
IMPACT Final Conference - Research Parallel Sessions02 research session_ncsr_...IMPACT Final Conference - Research Parallel Sessions02 research session_ncsr_...
IMPACT Final Conference - Research Parallel Sessions02 research session_ncsr_...IMPACT Centre of Competence
 
IMPACT/myGrid Hackathon - Introduction to Taverna
IMPACT/myGrid Hackathon - Introduction to TavernaIMPACT/myGrid Hackathon - Introduction to Taverna
IMPACT/myGrid Hackathon - Introduction to TavernaIMPACT Centre of Competence
 
IMPACT/myGrid Hackathon - Introduction to IMPACT
IMPACT/myGrid Hackathon - Introduction to IMPACTIMPACT/myGrid Hackathon - Introduction to IMPACT
IMPACT/myGrid Hackathon - Introduction to IMPACTIMPACT Centre of Competence
 
IMPACT/myGrid Hackathon - Taverna Server as a Portal
IMPACT/myGrid Hackathon - Taverna Server as a PortalIMPACT/myGrid Hackathon - Taverna Server as a Portal
IMPACT/myGrid Hackathon - Taverna Server as a PortalIMPACT Centre of Competence
 
IMPACT Final Conference - Apostolos Antonacopoulos
IMPACT Final Conference - Apostolos AntonacopoulosIMPACT Final Conference - Apostolos Antonacopoulos
IMPACT Final Conference - Apostolos AntonacopoulosIMPACT Centre of Competence
 
IMPACT Final Conference - Majlis Bremer Laamanen
IMPACT Final Conference - Majlis Bremer LaamanenIMPACT Final Conference - Majlis Bremer Laamanen
IMPACT Final Conference - Majlis Bremer LaamanenIMPACT Centre of Competence
 

Destacado (20)

IMACT Final Conference - Language Parallel Sessions - Erjavec
IMACT Final Conference - Language Parallel Sessions - ErjavecIMACT Final Conference - Language Parallel Sessions - Erjavec
IMACT Final Conference - Language Parallel Sessions - Erjavec
 
IMPACT Final Conference - Language Parallel Sessions - Landsbergen
IMPACT Final Conference - Language Parallel Sessions -  LandsbergenIMPACT Final Conference - Language Parallel Sessions -  Landsbergen
IMPACT Final Conference - Language Parallel Sessions - Landsbergen
 
IMPACT Final Conference - Steven Krauwer
IMPACT Final Conference - Steven KrauwerIMPACT Final Conference - Steven Krauwer
IMPACT Final Conference - Steven Krauwer
 
IMPACT Final Conference - Research Parallel Sessions - 01 impact conference_r...
IMPACT Final Conference - Research Parallel Sessions - 01 impact conference_r...IMPACT Final Conference - Research Parallel Sessions - 01 impact conference_r...
IMPACT Final Conference - Research Parallel Sessions - 01 impact conference_r...
 
IMPACT Final Conference - Khalil Rouhana
IMPACT Final Conference - Khalil  RouhanaIMPACT Final Conference - Khalil  Rouhana
IMPACT Final Conference - Khalil Rouhana
 
IMPACT Final Conference - Research Parallel Sessions02 research session_ncsr_...
IMPACT Final Conference - Research Parallel Sessions02 research session_ncsr_...IMPACT Final Conference - Research Parallel Sessions02 research session_ncsr_...
IMPACT Final Conference - Research Parallel Sessions02 research session_ncsr_...
 
IMPACT Final Conference - Aly Conteh
IMPACT Final Conference - Aly ContehIMPACT Final Conference - Aly Conteh
IMPACT Final Conference - Aly Conteh
 
IMPACT/myGrid Hackathon - Introduction to Taverna
IMPACT/myGrid Hackathon - Introduction to TavernaIMPACT/myGrid Hackathon - Introduction to Taverna
IMPACT/myGrid Hackathon - Introduction to Taverna
 
IMPACT/myGrid Hackathon - Introduction to IMPACT
IMPACT/myGrid Hackathon - Introduction to IMPACTIMPACT/myGrid Hackathon - Introduction to IMPACT
IMPACT/myGrid Hackathon - Introduction to IMPACT
 
IMPACT/myGrid Hackathon - Taverna Roadmap
IMPACT/myGrid Hackathon - Taverna RoadmapIMPACT/myGrid Hackathon - Taverna Roadmap
IMPACT/myGrid Hackathon - Taverna Roadmap
 
IMPACT/myGrid Hackathon - Taverna Server as a Portal
IMPACT/myGrid Hackathon - Taverna Server as a PortalIMPACT/myGrid Hackathon - Taverna Server as a Portal
IMPACT/myGrid Hackathon - Taverna Server as a Portal
 
IMPACT Final Conference - Apostolos Antonacopoulos
IMPACT Final Conference - Apostolos AntonacopoulosIMPACT Final Conference - Apostolos Antonacopoulos
IMPACT Final Conference - Apostolos Antonacopoulos
 
IMPACT Final Conference - Majlis Bremer Laamanen
IMPACT Final Conference - Majlis Bremer LaamanenIMPACT Final Conference - Majlis Bremer Laamanen
IMPACT Final Conference - Majlis Bremer Laamanen
 
IMPACT Final Conference - Clemens Neudecker
IMPACT Final Conference - Clemens NeudeckerIMPACT Final Conference - Clemens Neudecker
IMPACT Final Conference - Clemens Neudecker
 
IMPACT Final Conference - Paul Fogel
IMPACT Final Conference - Paul FogelIMPACT Final Conference - Paul Fogel
IMPACT Final Conference - Paul Fogel
 
IMPACT Final Conference - Michael Fuchs
IMPACT Final Conference - Michael FuchsIMPACT Final Conference - Michael Fuchs
IMPACT Final Conference - Michael Fuchs
 
IMPACT Final Conference - Gregory Crane
IMPACT Final Conference - Gregory CraneIMPACT Final Conference - Gregory Crane
IMPACT Final Conference - Gregory Crane
 
IMPACT Final Conference - Asaf Tzadok
IMPACT Final Conference - Asaf TzadokIMPACT Final Conference - Asaf Tzadok
IMPACT Final Conference - Asaf Tzadok
 
IMPACT Final Conference - Claus Gravenhorst
IMPACT Final Conference - Claus GravenhorstIMPACT Final Conference - Claus Gravenhorst
IMPACT Final Conference - Claus Gravenhorst
 
IMPACT Final Conference - Stefan Pletschacher
IMPACT Final Conference - Stefan PletschacherIMPACT Final Conference - Stefan Pletschacher
IMPACT Final Conference - Stefan Pletschacher
 

Similar a IMPACT Final Conference - Language Parallel Sessions - Gotscharek

Targeted Language Resources for the Digitisation of Historical Collections
Targeted Language Resources for the Digitisation of Historical CollectionsTargeted Language Resources for the Digitisation of Historical Collections
Targeted Language Resources for the Digitisation of Historical CollectionsEmma Huber
 
20091209 Poznan European Tower Of Babel
20091209 Poznan European Tower Of Babel20091209 Poznan European Tower Of Babel
20091209 Poznan European Tower Of BabelStefan Gradmann
 
OCR challenges in historic documents and the contribution of IMPACT
OCR challenges in historic documents and the contribution of IMPACTOCR challenges in historic documents and the contribution of IMPACT
OCR challenges in historic documents and the contribution of IMPACTcneudecker
 
IMPACT HPC Cloud Day
IMPACT Final Conference - Language Parallel Sessions - Gotscharek

  • 1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Special resources to access 16th century German Ludwig-Maximilians-Universität München Annette Gotscharek 15. 10. 2011, IMPACT Conference
  • 2. Special resources to access 16th century German: “access”?
       - OCR. Role of the lexicon: defines the set of valid words.
           ... Geist, Geister, Teile, gemütlich ...
       - Information Retrieval (IR). Role of the lexicon: meaningful expansion of the user query to increase recall.
           Geist → Geister, Geiste, Geistern
           Teil → Teile, Teils, Teilen
           gemütlich → gemütlicher, gemütlichste ...
  • 3. Special resources to access 16th century German
       - In IMPACT, we worked on documents from 1500-1950, but the 16th century is special:
           – Language period: Early New High German (1350-1650)
           – Oldest and therefore most challenging period of printed books
           – Large library holdings from the 16th century at our partner library BSB
       - Linguistic features of historical language at word level:
           Feature                           Historic    Modern       English
           Historical spelling variation     geyſte      Geiste       spirit
           Historical morphology             er frug     er fragte    he asked
           Obsolete vocabulary               mirackel    Wunder (?)   miracle
           Obsolete character set            aͤ           ä            ...
       → Need adapted linguistic resources
  • 4. Adapted linguistic resources: structure
       - OCR:
           ... Geist, Geister, Teile, gemütlich ...
       - Information Retrieval (IR):
           Geist → Geister, Geiste, Geistern
           Teil → Teile, Teils, Teilen
           gemütlich → gemütlicher, gemütlichste ...
  • 5. Adapted linguistic resources: structure
       - OCR:
           ... Geist, Geyst, Geister, Geyster, Teile, Theile, gemütlich, gemüthlich ...
       - Information Retrieval (IR):
           Geist → Geister, Geiste, Geistern, Geyster, Geyste, Geystern
           Teil → Teile, Teils, Teilen, Theile, Theils, Theilen
           gemütlich → gemütlicher, gemütlichste, gemüthlicher, gemüthlichste ...
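The query expansion shown on slide 5 can be illustrated with a minimal Python sketch. The two dictionaries hold only word forms taken from the slide; the names `MODERN_FORMS`, `HISTORICAL_VARIANTS` and `expand_query` are hypothetical and are not part of any IMPACT tool.

```python
# Toy IR lexicon built only from the word forms shown on the slide.
# All names here are illustrative assumptions, not an IMPACT API.
MODERN_FORMS = {
    "Geist": ["Geister", "Geiste", "Geistern"],
    "Teil": ["Teile", "Teils", "Teilen"],
}

# Historical spellings attached to individual modern word forms.
HISTORICAL_VARIANTS = {
    "Geist": ["Geyst"],
    "Geister": ["Geyster"],
    "Geiste": ["Geyste"],
    "Geistern": ["Geystern"],
    "Teile": ["Theile"],
    "Teils": ["Theils"],
    "Teilen": ["Theilen"],
}


def expand_query(term):
    """Expand a modern query term to its inflected forms and their
    historical spellings, so that a search also hits old spellings."""
    expanded = []
    for form in [term] + MODERN_FORMS.get(term, []):
        expanded.append(form)
        expanded.extend(HISTORICAL_VARIANTS.get(form, []))
    return expanded


print(expand_query("Teil"))
```

A user who types the modern word thus also retrieves documents containing only the historical spellings, which is the recall gain the slide refers to.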
  • 6. Linguistic Resources for Historical Texts: Diachronic Groundtruth Corpus (1500-1950); Hypothetical lexicon for rule based variants; Manually verified lexicon
  • 7. Linguistic Resources for Historical Texts: Diachronic Groundtruth Corpus (1500-1950); Hypothetical lexicon for rule based variants; Manually verified lexicon
  • 8. Diachronic Groundtruth Corpus (1500-1950)
       - Collection of groundtruth material from different sources on the web and from non-public electronic corpora (Institut für Deutsche Sprache Mannheim)
       - Large gap, especially in the 16th / 17th century
       → With BSB: preparation of an additional corpus from BSB documents:
           – Random selection of 100 works from digitized images of the 16th and 17th century
           – Mostly related to theology
           – Latin texts excluded, no poems etc.
           – Keyed by a service provider
           – 1766 pages with ~ 858,000 tokens of groundtruth material
  • 9. Diachronic Groundtruth Corpus (1500-1950)
       - Gains of tokens by the extension of the corpus:
       - Complete corpus contains ~ 3,380,000 tokens in 500 texts from 4 centuries
       → Basis for different analyses and lexicon building
  • 10. Coverage on Diachronic Corpus: modern
        Types (%)             1500-1549  1550-1599  1600-1649  1650-1699  1700-1749  1750-1799  1800-1849  1850-1899  1900-1949
        Modern simple words   15.3       28.8       29.2       31.5       38.1       52.0       54.7       48.0       60.1
        Modern compounds      5.1        6.1        6.9        8.6        7.13       15.5       20.6       28.1       27.8
        - Less than 45% of the vocabulary is covered by modern resources before 1750.
        - 16th century: only 15% - 29% modern simple words, modern closed compounds are hardly relevant.
  • 11. Linguistic Resources for Historical Texts: Diachronic Groundtruth Corpus (1500-1950); Hypothetical lexicon for rule based variants; Manually verified lexicon
  • 12. Hypothetical lexicon for rule based variants
        - Systematic substitution rules (patterns) describe the difference between modern and historical spelling:
            t → th, ei → ey: teil (modern) → theyl (historic)
        - Based on the modern lexicon and the 140 manually collected patterns, the set of all potential rule based historical variants can be computed automatically (“hypothetical lexicon”).
  • 13. Hypothetical lexicon for rule based variants
        [Diagram: modern lexicon + pattern set → hypothetical lexicon]
        - Modern lexicon: ... Esel, Teil ...
        - Pattern set: e → eh, ei → ey, s → ß, l → ll, t → th, ...
        - Hypothetical lexicon: ... Esel, Esell, Esehl, Esehll, Eßel, Eßell, Eßehll, ... Teil, Teill, Teyl, Teyll, Tehill, Theil, ...
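The generation step described on slides 12 and 13 can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's implementation: it assumes each pattern may be applied or skipped independently at every position of the modern word, and it uses only the five patterns visible on the slide rather than the full set of 140. The function names are hypothetical.

```python
# Patterns as (modern substring, historic substring); only the five
# examples shown on the slide, not the full manually collected set.
PATTERNS = [("e", "eh"), ("ei", "ey"), ("s", "ß"), ("l", "ll"), ("t", "th")]


def variants(word, patterns=PATTERNS):
    """Return every spelling obtained by optionally applying a pattern
    at each position of the modern word (assumed rule semantics)."""
    results = set()

    def walk(i, built):
        if i == len(word):
            results.add(built)
            return
        # Option 1: keep the modern character unchanged.
        walk(i + 1, built + word[i])
        # Option 2: apply any pattern whose modern side matches here.
        for modern, historic in patterns:
            segment = word[i:i + len(modern)]
            if segment.lower() == modern:
                # Preserve an initial capital (Teil -> Theil).
                repl = historic[:1].upper() + historic[1:] if segment[0].isupper() else historic
                walk(i + len(modern), built + repl)

    walk(0, "")
    return results


def build_hypothetical_lexicon(modern_lexicon, patterns=PATTERNS):
    """Collect all rule based variants, leaving out the unchanged modern forms."""
    lexicon = set()
    for word in modern_lexicon:
        lexicon |= variants(word, patterns) - {word}
    return lexicon


hypothetical = build_hypothetical_lexicon(["Esel", "Teil"])
print(sorted(hypothetical))
```

Because every position is decided independently, the number of candidates grows quickly with word length and pattern count, which is why the result is called hypothetical: many generated forms may never occur in real texts.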
  • 14. Hypothetical lexicon for rule based variants
        - Automatic mapping from rule based historical variants to their equivalent in the modern vocabulary is possible:
            historic = modern + pattern
            Geyst    = Geist  + (ei → ey)
            Theile   = Teile  + (t → th)
        - Far from all historical variants can be described by simple replacement rules:
            historic = modern + pattern
            frug     = fragte + ?
            Mirackel = ?      + ?
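The mapping on slide 14 can be sketched by running the patterns in the opposite direction: replace historic substrings by their modern counterparts and keep only results found in the modern lexicon. This is a minimal sketch with hypothetical names and a toy lexicon, and it is not claimed to be how the IMPACT tools perform the mapping.

```python
# Patterns as (modern substring, historic substring), taken from the slides.
PATTERNS = [("e", "eh"), ("ei", "ey"), ("s", "ß"), ("l", "ll"), ("t", "th")]


def modernise(historic_word, modern_lexicon, patterns=PATTERNS):
    """Return (modern word, applied patterns) pairs that explain the
    historic spelling; an empty list means no rule based reading exists."""
    word = historic_word.lower()
    lexicon = {w.lower(): w for w in modern_lexicon}
    found = []

    def walk(i, built, used):
        if i == len(word):
            if built in lexicon:
                found.append((lexicon[built], tuple(used)))
            return
        # Keep the character as it is.
        walk(i + 1, built + word[i], used)
        # Or undo a pattern whose historic side occurs at this position.
        for modern, historic in patterns:
            if word.startswith(historic, i):
                walk(i + len(historic), built + modern, used + [(modern, historic)])

    walk(0, "", [])
    return found


MODERN = ["Geist", "Teile", "fragte", "Wunder"]
print(modernise("Geyst", MODERN))
print(modernise("Theile", MODERN))
print(modernise("frug", MODERN))
```

For `frug` the function returns an empty list, matching the slide's point that such forms need manual treatment.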
  • 15. Coverage on Diachronic Corpus: hypothetic
        Types (%)             1500-1549  1550-1599  1600-1649  1650-1699  1700-1749  1750-1799  1800-1849  1850-1899  1900-1949
        Modern simple words   15.3       28.8       29.2       31.5       38.1       52.0       54.7       48.0       60.1
        Modern compounds      5.1        6.1        6.9        8.6        7.13       15.5       20.6       28.1       27.8
        Hypothetic            29.5       29.8       27.9       26.0       21.9       14.3       8.1        7.7        2.0
        - 16th century: 30% of the vocabulary is covered by the lexicon of rule based variants.
        - Applied as OCR lexicon via the IMPACT ABBYY External Dictionary Interface: improvement of recognition rate (published 2009).
  • 16. Coverage on Diachronic Corpus: missing
        Types (%)             1500-1549  1550-1599  1600-1649  1650-1699  1700-1749  1750-1799  1800-1849  1850-1899  1900-1949
        Modern simple words   15.3       28.8       29.2       31.5       38.1       52.0       54.7       48.0       60.1
        Modern compounds      5.1        6.1        6.9        8.6        7.13       15.5       20.6       28.1       27.8
        Hypothetic            29.5       29.8       27.9       26.0       21.9       14.3       8.1        7.7        2.0
        Missing               45.9       28.7       29.7       26.0       23.5       15.1       13.9       13.5       8.1
        - Especially in the 16th century: up to 46% “difficult” vocabulary.
        → Manually verified lexicon necessary!
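How type coverage figures of this kind can be computed is sketched below. The slides do not describe the exact procedure, so the sketch makes an explicit assumption: each distinct word form (type) of a period is assigned to the first lexicon that contains it, and everything else counts as missing. The toy data and the function name `coverage` are hypothetical, and the sketch does not separate simple words from compounds.

```python
from collections import Counter


def coverage(types, modern_lexicon, hypothetical_lexicon):
    """Percentage of types covered by each resource (assumed precedence:
    modern lexicon first, then hypothetical lexicon, else missing)."""
    counts = Counter()
    for word in types:
        if word in modern_lexicon:
            counts["modern"] += 1
        elif word in hypothetical_lexicon:
            counts["hypothetic"] += 1
        else:
            counts["missing"] += 1
    total = len(types)
    return {key: round(100 * counts[key] / total, 1)
            for key in ("modern", "hypothetic", "missing")}


# Toy example, not corpus data.
period_types = {"Geist", "Geyst", "Theile", "frug"}
print(coverage(period_types, {"Geist", "Teile"}, {"Geyst", "Theile"}))
```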
  • 17. Linguistic Resources for Historical Texts: Diachronic Groundtruth Corpus (1500-1950); Hypothetical lexicon for rule based variants; Manually verified lexicon
  • 18. Manually verified IR-lexicon: Structure
        One entry contains:
            – Historical word form from the corpus
            – Corresponding modern word form
            – Patterns, if applicable
            – Corresponding modern lemma
            – At least one occurrence in the corpus as an attestation for the reading
        - Manual assignment of modern word form and lemma
        - Explicit handling of variants that are not rule based
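The entry structure listed on slide 18 could be modelled roughly as follows. The field names, the class name `LexiconEntry` and the attestation identifiers are hypothetical placeholders chosen for illustration; the slides do not specify a storage format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexiconEntry:
    """One entry of a manually verified IR lexicon (illustrative model)."""
    historical_form: str   # word form as found in the corpus
    modern_form: str       # corresponding modern word form
    lemma: str             # corresponding modern lemma
    attestations: tuple    # corpus occurrences supporting the reading
    patterns: tuple = ()   # (modern, historic) rules; empty if not rule based

    def __post_init__(self):
        # The slide requires at least one attestation per entry.
        if not self.attestations:
            raise ValueError("an entry needs at least one corpus attestation")

    @property
    def rule_based(self):
        return bool(self.patterns)


# Attestation identifiers below are made-up placeholders.
rule_entry = LexiconEntry("Theile", "Teile", "Teil",
                          ("placeholder-doc-1:42",), (("t", "th"),))
manual_entry = LexiconEntry("frug", "fragte", "fragen",
                            ("placeholder-doc-2:7",))
print(rule_entry.rule_based, manual_entry.rule_based)
```

Keeping `patterns` optional lets the same record type hold both rule based variants and forms such as `frug` that only a human annotator can link to their modern equivalent.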
  • 19. Manually verified IR-lexicon: Compilation
        - Web-based, collaborative user interface
        - User support:
            – For rule based variants: suggestion of the corresponding modern word form by the hypothetic lexicon
            – Suggestion of all possible lemmas for the modern word form by a large modern lexicon (CISLEX)
            – Concordance list of the historical variant
  • 20. Manually verified IR-lexicon: Status
        - 41,600 entries have been created for 24,800 historical word forms from the diachronic corpus; 72,100 attestations were annotated.
        - IMPACT partners in Slovenia and Bulgaria are creating corresponding lexica with an adapted version of the tool.
  • 21. Thank you.