OCR-D: An end-to-end open source OCR framework for historical printed documents
1. OCR-D: An end-to-end open source OCR framework for historical printed documents
Clemens Neudecker, Konstantin Baierer, Maria
Federbusch, Matthias Boenig, Kay-Michael
Würzner, Volker Hartmann, Elisa Herrmann
DATeCH2019
8-10 May 2019, Brussels, Belgium
2. Introduction
● Many OCR projects in the past: METAe, IMPACT, eMOP, etc.
● Yet there is still a lack of open and comprehensive tools & methods
for OCR/OLR specifically targeting historical printed documents
● Breakthroughs in artificial intelligence/machine learning
enable competitive OCR quality given sufficient training data
● With the advent of the Digital Humanities, the requirement for large
scale text corpora with high quality OCR is growing rapidly
→ OCR-D project set up in 2015
Main goals of OCR-D:
● Development of OCR solutions suitable for historical prints
● Standardisation of metadata and Ground Truth
● Creation of training and evaluation data
3. Architecture
OCR-D is composed of a “Coordination Project” consisting of
● Herzog-August-Library Wolfenbüttel
● Berlin-Brandenburg Academy of Sciences and Humanities
● Bavarian State Library (until 08/2016)
● Berlin State Library (from 12/2016)
● Karlsruhe Institute of Technology (from 08/2017)
...as well as a total of 8 separate “Module Projects” that
● Develop technical solutions to the identified challenges
● Implement the specifications of the “Coordination Project”
OCR-D is funded by the DFG from 2015 to 2020
4. OCR-D Specifications
Specifications and conventions for interfaces and exchange formats:
● Command Line Interface (CLI)
● Metadata and structural data (METS)
● Full Text (PAGE-XML)
● Software (ocrd-tool.json, Dockerfile)
● Long-term preservation (ocrd-zip, BagIt)
→ https://ocr-d.github.io/
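As a rough illustration of the METS convention listed above, the following Python sketch reads the file groups from a METS file section using only the standard library. The sample document, its file group names, the MIME types and the helper list_file_groups are made up for this example; they are assumptions and are not taken from the OCR-D specifications or from OCR-D Core.

```python
import xml.etree.ElementTree as ET

# Standard METS and XLink namespaces
NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Hypothetical minimal METS document, for illustration only
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="IMG_0001" MIMETYPE="image/tiff">
        <mets:FLocat LOCTYPE="URL" xlink:href="OCR-D-IMG/0001.tif"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="OCR-D-OCR">
      <mets:file ID="OCR_0001" MIMETYPE="application/vnd.prima.page+xml">
        <mets:FLocat LOCTYPE="URL" xlink:href="OCR-D-OCR/0001.xml"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


def list_file_groups(mets_xml: str) -> dict:
    """Map each fileGrp USE attribute to the list of file locations it contains."""
    root = ET.fromstring(mets_xml)
    groups = {}
    for grp in root.iterfind("mets:fileSec/mets:fileGrp", NS):
        hrefs = [
            flocat.get("{http://www.w3.org/1999/xlink}href")
            for flocat in grp.iterfind("mets:file/mets:FLocat", NS)
        ]
        groups[grp.get("USE")] = hrefs
    return groups


if __name__ == "__main__":
    for use, hrefs in list_file_groups(SAMPLE_METS).items():
        print(use, hrefs)
```

In this sketch, one file group holds the page images and another the corresponding full-text files, which is one plausible way a METS container could tie both together.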
5. OCR-D Specifications
● For sustainability and reuse, documentation beats implementation
● Open and transparent development on GitHub
6. OCR-D Core
Reference implementation of the specifications:
● Utility functions for common tasks (ocrd_utils)
● Programmatic access to data formats (ocrd_models,
ocrd_modelfactory)
● Validation of interfaces and data formats (ocrd_validators)
● Toolkit to create compatible command line tools (ocrd)
→ https://github.com/OCR-D/core
8. OCR-D Workflow
OCR-D provides Scientific Workflow Components for OCR using the
Apache Taverna Engine
● Via the workflow description (in the SCUFL2 language), workflows
become easily reproducible and can be shared and reused by others
● Simplifies transparent benchmarking and evaluation of
modules/components
● This approach also allows the capture of workflow provenance data
→ https://github.com/OCR-D/taverna_workflow
9. OCR-D Ground Truth
In order to support the development, training and evaluation of tools
and methods, the OCR-D Coordination Project provides:
● Comprehensive transcription guidelines (only in German)
● Approx. 60 complete volumes from the 16th to the 19th century
● Special corpora with
○ Low quality OCR
○ Challenging images
● “Structure” Ground Truth corpus
● Additional Ground Truth is currently in production
All Ground Truth data is made freely available under open licenses:
http://www.ocr-d.de/daten
11. Module 2: Layout Analysis
Partner(s):
● DFKI Kaiserslautern
Task(s):
● Page segmentation
● Line segmentation based on RAST
● Region classification with CNN
● Document analysis (e.g.
reconstruction of table of contents)
Code:
● Not yet available
12. Module 3: Layout Analysis & Region Extraction and Classification
Partner(s):
● University of Würzburg
Task(s):
● Based on LAREX
● Development of a CNN-based
pixel classifier
● Integration of document-specific
rules and heuristics
● Highly automated processing
Code:
● https://github.com/ocr-d-modul-2-segmentierung/segmentation-runner
13. Module 4: Unsupervised OCR Postcorrection
Partner(s):
● University of Leipzig
Task(s):
● Combine finite-state transducer and
neural-network-based postcorrection
in a noisy channel model
● Provision of (tools to create)
language- and domain-specific models
Code:
● https://git.informatik.uni-leipzig.de/ocr-d/cor-asv-fst
● https://git.informatik.uni-leipzig.de/ocr-d/cor-asv-ann
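To make the noisy channel idea concrete, here is a toy Python sketch of the decision rule: among candidate words w, pick the one that maximises P(w) * P(observed | w). The lookup tables stand in for a language model and an error model; the probabilities, the example words and the function noisy_channel_best are invented for illustration and do not reflect how cor-asv-fst or cor-asv-ann are actually implemented.

```python
import math


def noisy_channel_best(observed, candidates, lm_prob, channel_prob):
    """Return the candidate w that maximises log P(w) + log P(observed | w)."""

    def score(w):
        # Log-space sum avoids underflow when many probabilities are multiplied
        return math.log(lm_prob[w]) + math.log(channel_prob[(observed, w)])

    return max(candidates, key=score)


# Made-up prior probabilities P(w) of each candidate word
lm_prob = {"Haus": 0.70, "Lauf": 0.29, "Hauf": 0.01}

# Made-up channel probabilities P(observed | w), keyed by (observed, w)
channel_prob = {
    ("Hauf", "Haus"): 0.10,  # e.g. a long s misread as f
    ("Hauf", "Lauf"): 0.05,
    ("Hauf", "Hauf"): 0.90,
}

best = noisy_channel_best("Hauf", list(lm_prob), lm_prob, channel_prob)

if __name__ == "__main__":
    print(best)
```

With these toy numbers the products are 0.07 for "Haus", 0.0145 for "Lauf" and 0.009 for "Hauf", so the rare OCR reading is corrected to the more probable word.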
14. Module 5: Tesseract
Partner(s):
● University Library Mannheim
Task(s):
● Adaptation of Tesseract to the
OCR-D specifications and interfaces
● Improvement of the code quality
● Optimisation for high throughput
Code:
● https://github.com/tesseract-ocr
● https://github.com/OCR-D/ocrd_tesserocr
15. Module 6: Automated Postcorrection with optional interactive Postcorrection
Partner(s):
● University of Munich
Task(s):
● Alignment of multiple OCR outputs
● Profiling of OCR output
● Automated postcorrection
and correction protocol
● (Optional) interactive postcorrection
Code:
● https://github.com/cisocrgroup/ocrd-postcorrection
● https://github.com/cisocrgroup/cis-ocrd-py
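The alignment task can be pictured with a small Python sketch that aligns two OCR readings of the same text line at character level and reports where they disagree. It uses difflib from the standard library as a stand-in; the functions align_ocr and disagreements and the sample strings are hypothetical, and the module's own alignment method may differ.

```python
from difflib import SequenceMatcher


def align_ocr(a: str, b: str):
    """Align two OCR outputs of the same line.

    Returns a list of (tag, segment_a, segment_b), where tag is one of
    'equal', 'replace', 'delete' or 'insert'.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


def disagreements(a: str, b: str):
    """Keep only the segments where the two OCR outputs differ."""
    return [(x, y) for tag, x, y in align_ocr(a, b) if tag != "equal"]


if __name__ == "__main__":
    # Hypothetical outputs of two OCR engines for the same line
    print(disagreements("Das Haus", "Das Hauf"))
```

Positions where the engines disagree are natural candidates for automated or interactive postcorrection.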
16. Module 7: Automated Font Recognition & Model Training Infrastructure
Partner(s):
● University of Mainz
● University of Erlangen
● University of Leipzig
Task(s):
● Automated identification of
fonts from images
● Development of an OCR training
infrastructure and model repository
Code:
● https://github.com/seuretm/ocrd_typegroups_classifier
● https://github.com/Doreenruirui/okralact
17. Module 8: Long-term preservation
Partner(s):
● Göttingen State and University Library
● GWDG Göttingen
Task(s):
● Analysis of requirements for
long-term preservation of OCR
● Concept and prototype implementation for
○ Persistent storage and identification
of OCR data in the archive
○ Citation of OCR data in the archive
○ Search functionality within the archive
Code:
● https://github.com/subugoe/OLA-HD-IMPL