SlideShare una empresa de Scribd logo
1 de 19
OCR-D: An end-to-end open
source OCR framework for
historical printed documents
Clemens Neudecker, Konstantin Baierer, Maria
Federbusch, Matthias Boenig, Kay-Michael
Würzner, Volker Hartmann, Elisa Herrmann
DATeCH2019
8-10 May 2019, Brussels, Belgium
Introduction
2
● Many OCR-projects in the past: METAe, IMPACT, eMOP, asf.
● Yet there is still a lack of open and comprehensive tools & methods
for OCR/OLR specifically targeting historical printed documents
● New breakthroughs through artificial intelligence/machine learning
enable competitive OCR quality given sufficient training
● With the advent of the Digital Humanities, the requirement for large
scale text corpora with high quality OCR is growing rapidly
→ Setting up of the OCR-D project in 2015
Main goals of OCR-D:
● Development of OCR solutions suitable for historical prints
● Standardisation of metadata and Ground Truth
● Creation of training and evaluation data
Architecture
OCR-D is composed of a “Coordination Project” consisting of
● Herzog-August-Library Wolfenbüttel
● Berlin-Brandenburg Academy of Sciences and Humanities
● Bavarian State Library (until 08/2016)
● Berlin State Library (from 12/2016)
● Karlsruhe Institute of Technology (from 08/2017)
...as well as a total of 8 separate “Module Projects” that
● Develop technical solutions to the identified challenges
● Implement the specifications of the “Coordination Project”
OCR-D receives funding from the DFG for 2015 - 2020
3
OCR-D Specifications
Specifications and conventions for interfaces and exchange formats:
● Command Line Interface (CLI)
● Metadata and structural data (METS)
● Full Text (PAGE-XML)
● Software (ocrd-tool.json, Dockerfile)
● Long-term preservation (ocrd-zip, BagIt)
→ https://ocr-d.github.io/
4
OCR-D Specifications
● For sustainability and reuse, documentation beats implementation
● Open and transparent development on GitHub
OCR-D Core
Reference implementation of the specifications:
● Utility functions for common tasks (ocrd_utils)
● Programmatic access to data formats (ocrd_models,
ocrd_modelfactory)
● Validation of interfaces and data formats (ocrd_validators)
● Toolkit to create compatible command line tools (ocrd)
→ https://github.com/OCR-D/core
6
OCR-D Core
● Easy to install via PyPi:
pip install ocrd
OCR-D provides Scientific Workflow Components for OCR using the
Apache Taverna Engine
● Via the Workflow Description
(in SCUFL2 language), the
workflows do become
easily reproducible and can
be shared and reused by others
● Simplifies the transparent
benchmarking and evaluation
of modules/components
● This approach also allows
the capture of workflow
provenance data
→ https://github.com/OCR-D/taverna_workflow
OCR-D Workflow
8
OCR-D Ground Truth
In order to support the development, training and evaluation of tools
and methods, the OCR-D Coordination Project provides:
● Comprehensive transcription guidelines (only in German)
● Ca. 60 complete volumes from 16th - 19th century
● Special corpora with
○ Low quality OCR
○ Challenging images
● “Structure” Ground Truth corpus
● Additional Ground Truth is currently in production
All Ground Truth data is made freely available under open licenses:
http://www.ocr-d.de/daten
9
Module 1: Image Optimisation
Partner(s):
● DFKI Kaiserslautern
Task(s):
● Deskewing
● Dewarping
● Despeckling
● Cropping (print space, border removal)
● Binarization
Code:
● https://github.com/syedsaqibbukhari/docanalysis
10
Module 2: Layout Analysis
Partner(s):
● DFKI Kaiserslautern
Task(s):
● Page segmentation
● Line segmentation based on RAST
● Region classification with CNN
● Document analysis (e.g.
reconstruction of table of contents)
Code:
● Not yet available
11
Module 3: Layout Analysis &
Region Extraction and Classification
Partner(s):
● University Würzburg
Task(s):
● Based on LAREX
● Development of a CNN-based
pixel classifier
● Integration of document-specific
rules and heuristics
● Highly automated processing
Code:
● https://github.com/ocr-d-modul-2-segmentierung/segmentation-runner
12
Module 4: Unsupervised OCR Postcorrection
Partner(s):
● University Leipzig
Task(s):
● Combine finite-state-transducer and
neural network based postcorrection
in a noisy channel model
● Provision of (tools to create)
language- and domain specific models
Code:
● https://git.informatik.uni-leipzig.de/ocr-d/cor-asv-fst
● https://git.informatik.uni-leipzig.de/ocr-d/cor-asv-ann
13
Module 5: Tesseract
Partner(s):
● University Library Mannheim
Task(s):
● Adaptation of Tesseract to the
OCR-D specifications and interfaces
● Improvement of the code quality
● Optimisation for high throughput
Code:
● https://github.com/tesseract-ocr
● https://github.com/OCR-D/ocrd_tesserocr
14
Module 6: Automated Postcorrection with
optional interactive Postcorrection
Partner(s):
● University of Munich
Task(s):
● Alignment of multiple OCR
● Profiling of OCR
● Automated postcorrection
and correction protocol
● (Optional) interactive postcorrection
Code:
● https://github.com/cisocrgroup/ocrd-postcorrection
● https://github.com/cisocrgroup/cis-ocrd-py
15
Module 7: Automated Font Recognition &
Model Training Infrastructure
Partner(s):
● University of Mainz
● University of Erlangen
● University of Leipzig
Task(s):
● Automated identification of
fonts from images
● Development of an OCR training
infrastructure and model repository
Code:
● https://github.com/seuretm/ocrd_typegroups_classifier
● https://github.com/Doreenruirui/okralact
16
Module 8: Long-term preservation
Partner(s):
● Göttingen State and University Library
● GWDG Göttingen
Task(s):
● Analysis of requirements for
long-term preservation of OCR
● Concept and prototype implementation for
○ Persistent storage and identification
of OCR data in the archive
○ Citation of OCR data in the archive
○ Search functionality within the archive
Code:
● https://github.com/subugoe/OLA-HD-IMPL
17
OCR-D contact/access points
● OCR-D Website:
http://ocr-d.de/eng
● OCR-D GitHub:
https://github.com/OCR-D
● OCR-D Specification and Documentation:
https://ocr-d.github.io/
● OCR-D Ground Truth:
https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit
● OCR-D Gitter:
https://gitter.im/OCR-D/Lobby
● OCR-D Docker:
https://hub.docker.com/u/ocrd
18
Thank you for your attention!
Questions, please?
DATeCH2019
8-10 May 2019, Brussels, Belgium

Más contenido relacionado

Similar a OCR-D: An end-to-end open source OCR framework for historical printed documents

Gjergj Sheldija: Albania
Gjergj Sheldija: AlbaniaGjergj Sheldija: Albania
Gjergj Sheldija: AlbaniaAccessITplus
 
Franco Niccolucci: Example of an EOSCpilot Science Demonstrator - TextCrowd
Franco Niccolucci: Example of an EOSCpilot Science Demonstrator - TextCrowdFranco Niccolucci: Example of an EOSCpilot Science Demonstrator - TextCrowd
Franco Niccolucci: Example of an EOSCpilot Science Demonstrator - TextCrowdEOSC-hub project
 
EuropeanaTech x AI: Qurator.ai @ Berlin State Library
EuropeanaTech x AI: Qurator.ai @ Berlin State LibraryEuropeanaTech x AI: Qurator.ai @ Berlin State Library
EuropeanaTech x AI: Qurator.ai @ Berlin State Librarycneudecker
 
BigDataEurope @BDVA Summit2016 2: Societal Pilots
BigDataEurope @BDVA Summit2016 2: Societal PilotsBigDataEurope @BDVA Summit2016 2: Societal Pilots
BigDataEurope @BDVA Summit2016 2: Societal PilotsBigData_Europe
 
The Open Chemistry Project
The Open Chemistry ProjectThe Open Chemistry Project
The Open Chemistry ProjectMarcus Hanwell
 
Red hat infrastructure for analytics
Red hat infrastructure for analyticsRed hat infrastructure for analytics
Red hat infrastructure for analyticsKyle Bader
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Wes McKinney
 
UGent Research Projects on Linked Data in Architecture and Construction
UGent Research Projects on Linked Data in Architecture and ConstructionUGent Research Projects on Linked Data in Architecture and Construction
UGent Research Projects on Linked Data in Architecture and ConstructionPieter Pauwels
 
OCCIware @ Cloud Computing World 2016 - year 1 milestone & Linked Data demo
OCCIware @ Cloud Computing World 2016 - year 1 milestone & Linked Data demoOCCIware @ Cloud Computing World 2016 - year 1 milestone & Linked Data demo
OCCIware @ Cloud Computing World 2016 - year 1 milestone & Linked Data demoMarc Dutoo
 
Introduction to Data Models & Cisco's NextGen Device Level APIs: an overview
Introduction to Data Models & Cisco's NextGen Device Level APIs: an overviewIntroduction to Data Models & Cisco's NextGen Device Level APIs: an overview
Introduction to Data Models & Cisco's NextGen Device Level APIs: an overviewCisco DevNet
 
Avogadro, Open Chemistry and Semantics
Avogadro, Open Chemistry and SemanticsAvogadro, Open Chemistry and Semantics
Avogadro, Open Chemistry and SemanticsMarcus Hanwell
 
Building Applications using Apache Hadoop
Building Applications using Apache HadoopBuilding Applications using Apache Hadoop
Building Applications using Apache HadoopC4Media
 
Novo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNovo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNeo4j
 
BigDataEurope @BDVA Summit2016 1: The BDE Platform
BigDataEurope @BDVA Summit2016 1: The BDE PlatformBigDataEurope @BDVA Summit2016 1: The BDE Platform
BigDataEurope @BDVA Summit2016 1: The BDE PlatformBigData_Europe
 
Cloud Native Applications on Kubernetes: a DevOps Approach
Cloud Native Applications on Kubernetes: a DevOps ApproachCloud Native Applications on Kubernetes: a DevOps Approach
Cloud Native Applications on Kubernetes: a DevOps ApproachNicola Ferraro
 
Introduction to LoCloud
Introduction to LoCloud Introduction to LoCloud
Introduction to LoCloud locloud
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
 
BDE SC3.3 Workshop - BDE Platform: Technical overview
 BDE SC3.3 Workshop -  BDE Platform: Technical overview BDE SC3.3 Workshop -  BDE Platform: Technical overview
BDE SC3.3 Workshop - BDE Platform: Technical overviewBigData_Europe
 
Drupal and the semantic web - SemTechBiz 2012
Drupal and the semantic web - SemTechBiz 2012Drupal and the semantic web - SemTechBiz 2012
Drupal and the semantic web - SemTechBiz 2012scorlosquet
 

Similar a OCR-D: An end-to-end open source OCR framework for historical printed documents (20)

Gjergj Sheldija: Albania
Gjergj Sheldija: AlbaniaGjergj Sheldija: Albania
Gjergj Sheldija: Albania
 
Franco Niccolucci: Example of an EOSCpilot Science Demonstrator - TextCrowd
Franco Niccolucci: Example of an EOSCpilot Science Demonstrator - TextCrowdFranco Niccolucci: Example of an EOSCpilot Science Demonstrator - TextCrowd
Franco Niccolucci: Example of an EOSCpilot Science Demonstrator - TextCrowd
 
EuropeanaTech x AI: Qurator.ai @ Berlin State Library
EuropeanaTech x AI: Qurator.ai @ Berlin State LibraryEuropeanaTech x AI: Qurator.ai @ Berlin State Library
EuropeanaTech x AI: Qurator.ai @ Berlin State Library
 
BigDataEurope @BDVA Summit2016 2: Societal Pilots
BigDataEurope @BDVA Summit2016 2: Societal PilotsBigDataEurope @BDVA Summit2016 2: Societal Pilots
BigDataEurope @BDVA Summit2016 2: Societal Pilots
 
The Open Chemistry Project
The Open Chemistry ProjectThe Open Chemistry Project
The Open Chemistry Project
 
AntoineLambertResume
AntoineLambertResumeAntoineLambertResume
AntoineLambertResume
 
Red hat infrastructure for analytics
Red hat infrastructure for analyticsRed hat infrastructure for analytics
Red hat infrastructure for analytics
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
 
UGent Research Projects on Linked Data in Architecture and Construction
UGent Research Projects on Linked Data in Architecture and ConstructionUGent Research Projects on Linked Data in Architecture and Construction
UGent Research Projects on Linked Data in Architecture and Construction
 
OCCIware @ Cloud Computing World 2016 - year 1 milestone & Linked Data demo
OCCIware @ Cloud Computing World 2016 - year 1 milestone & Linked Data demoOCCIware @ Cloud Computing World 2016 - year 1 milestone & Linked Data demo
OCCIware @ Cloud Computing World 2016 - year 1 milestone & Linked Data demo
 
Introduction to Data Models & Cisco's NextGen Device Level APIs: an overview
Introduction to Data Models & Cisco's NextGen Device Level APIs: an overviewIntroduction to Data Models & Cisco's NextGen Device Level APIs: an overview
Introduction to Data Models & Cisco's NextGen Device Level APIs: an overview
 
Avogadro, Open Chemistry and Semantics
Avogadro, Open Chemistry and SemanticsAvogadro, Open Chemistry and Semantics
Avogadro, Open Chemistry and Semantics
 
Building Applications using Apache Hadoop
Building Applications using Apache HadoopBuilding Applications using Apache Hadoop
Building Applications using Apache Hadoop
 
Novo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNovo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4j
 
BigDataEurope @BDVA Summit2016 1: The BDE Platform
BigDataEurope @BDVA Summit2016 1: The BDE PlatformBigDataEurope @BDVA Summit2016 1: The BDE Platform
BigDataEurope @BDVA Summit2016 1: The BDE Platform
 
Cloud Native Applications on Kubernetes: a DevOps Approach
Cloud Native Applications on Kubernetes: a DevOps ApproachCloud Native Applications on Kubernetes: a DevOps Approach
Cloud Native Applications on Kubernetes: a DevOps Approach
 
Introduction to LoCloud
Introduction to LoCloud Introduction to LoCloud
Introduction to LoCloud
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
BDE SC3.3 Workshop - BDE Platform: Technical overview
 BDE SC3.3 Workshop -  BDE Platform: Technical overview BDE SC3.3 Workshop -  BDE Platform: Technical overview
BDE SC3.3 Workshop - BDE Platform: Technical overview
 
Drupal and the semantic web - SemTechBiz 2012
Drupal and the semantic web - SemTechBiz 2012Drupal and the semantic web - SemTechBiz 2012
Drupal and the semantic web - SemTechBiz 2012
 

Más de cneudecker

ALTO, PAGE & Co. Formate für Volltexte
ALTO, PAGE & Co. Formate für VolltexteALTO, PAGE & Co. Formate für Volltexte
ALTO, PAGE & Co. Formate für Volltextecneudecker
 
OCR und Strukturerkennung für Zeitungen
OCR und Strukturerkennung für ZeitungenOCR und Strukturerkennung für Zeitungen
OCR und Strukturerkennung für Zeitungencneudecker
 
Digitisation and Digital Humanities - what is the role of Libraries?
Digitisation and Digital Humanities - what is the role of Libraries?Digitisation and Digital Humanities - what is the role of Libraries?
Digitisation and Digital Humanities - what is the role of Libraries?cneudecker
 
Multimodal Perspectives for Digitised Historical Newspapers
Multimodal Perspectives for Digitised Historical NewspapersMultimodal Perspectives for Digitised Historical Newspapers
Multimodal Perspectives for Digitised Historical Newspaperscneudecker
 
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...cneudecker
 
AI for digitized cultural heritage
AI for digitized cultural heritageAI for digitized cultural heritage
AI for digitized cultural heritagecneudecker
 
Kuratieren mit künstlicher Intelligenz
Kuratieren mit künstlicher IntelligenzKuratieren mit künstlicher Intelligenz
Kuratieren mit künstlicher Intelligenzcneudecker
 
Überblick zum DFG-Projekt OCR-D
Überblick zum DFG-Projekt OCR-DÜberblick zum DFG-Projekt OCR-D
Überblick zum DFG-Projekt OCR-Dcneudecker
 
The many uses of digitized newspapers
The many uses of digitized newspapersThe many uses of digitized newspapers
The many uses of digitized newspaperscneudecker
 
Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...
Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...
Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...cneudecker
 
Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...
Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...
Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...cneudecker
 
Text and Data Mining
Text and Data MiningText and Data Mining
Text and Data Miningcneudecker
 
Formate für Volltexte
Formate für VolltexteFormate für Volltexte
Formate für Volltextecneudecker
 
Extrablatt: The Latest News on Newspaper Digitisation in Europe
Extrablatt: The Latest News on Newspaper Digitisation in EuropeExtrablatt: The Latest News on Newspaper Digitisation in Europe
Extrablatt: The Latest News on Newspaper Digitisation in Europecneudecker
 
Reise durch Europeana Collections in 11 Minuten
Reise durch Europeana Collections in 11 MinutenReise durch Europeana Collections in 11 Minuten
Reise durch Europeana Collections in 11 Minutencneudecker
 
Europeana Newspapers in a Nutshell
Europeana Newspapers in a NutshellEuropeana Newspapers in a Nutshell
Europeana Newspapers in a Nutshellcneudecker
 
lab.sbb.berlin
lab.sbb.berlinlab.sbb.berlin
lab.sbb.berlincneudecker
 
Named Entity Recognition for Europeana Newspapers
Named Entity Recognition for Europeana NewspapersNamed Entity Recognition for Europeana Newspapers
Named Entity Recognition for Europeana Newspaperscneudecker
 
What's up, Europeana Newspapers?
What's up, Europeana Newspapers?What's up, Europeana Newspapers?
What's up, Europeana Newspapers?cneudecker
 
Active archives @SBB
Active archives @SBBActive archives @SBB
Active archives @SBBcneudecker
 

Más de cneudecker (20)

ALTO, PAGE & Co. Formate für Volltexte
ALTO, PAGE & Co. Formate für VolltexteALTO, PAGE & Co. Formate für Volltexte
ALTO, PAGE & Co. Formate für Volltexte
 
OCR und Strukturerkennung für Zeitungen
OCR und Strukturerkennung für ZeitungenOCR und Strukturerkennung für Zeitungen
OCR und Strukturerkennung für Zeitungen
 
Digitisation and Digital Humanities - what is the role of Libraries?
Digitisation and Digital Humanities - what is the role of Libraries?Digitisation and Digital Humanities - what is the role of Libraries?
Digitisation and Digital Humanities - what is the role of Libraries?
 
Multimodal Perspectives for Digitised Historical Newspapers
Multimodal Perspectives for Digitised Historical NewspapersMultimodal Perspectives for Digitised Historical Newspapers
Multimodal Perspectives for Digitised Historical Newspapers
 
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...
 
AI for digitized cultural heritage
AI for digitized cultural heritageAI for digitized cultural heritage
AI for digitized cultural heritage
 
Kuratieren mit künstlicher Intelligenz
Kuratieren mit künstlicher IntelligenzKuratieren mit künstlicher Intelligenz
Kuratieren mit künstlicher Intelligenz
 
Überblick zum DFG-Projekt OCR-D
Überblick zum DFG-Projekt OCR-DÜberblick zum DFG-Projekt OCR-D
Überblick zum DFG-Projekt OCR-D
 
The many uses of digitized newspapers
The many uses of digitized newspapersThe many uses of digitized newspapers
The many uses of digitized newspapers
 
Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...
Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...
Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...
 
Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...
Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...
Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...
 
Text and Data Mining
Text and Data MiningText and Data Mining
Text and Data Mining
 
Formate für Volltexte
Formate für VolltexteFormate für Volltexte
Formate für Volltexte
 
Extrablatt: The Latest News on Newspaper Digitisation in Europe
Extrablatt: The Latest News on Newspaper Digitisation in EuropeExtrablatt: The Latest News on Newspaper Digitisation in Europe
Extrablatt: The Latest News on Newspaper Digitisation in Europe
 
Reise durch Europeana Collections in 11 Minuten
Reise durch Europeana Collections in 11 MinutenReise durch Europeana Collections in 11 Minuten
Reise durch Europeana Collections in 11 Minuten
 
Europeana Newspapers in a Nutshell
Europeana Newspapers in a NutshellEuropeana Newspapers in a Nutshell
Europeana Newspapers in a Nutshell
 
lab.sbb.berlin
lab.sbb.berlinlab.sbb.berlin
lab.sbb.berlin
 
Named Entity Recognition for Europeana Newspapers
Named Entity Recognition for Europeana NewspapersNamed Entity Recognition for Europeana Newspapers
Named Entity Recognition for Europeana Newspapers
 
What's up, Europeana Newspapers?
What's up, Europeana Newspapers?What's up, Europeana Newspapers?
What's up, Europeana Newspapers?
 
Active archives @SBB
Active archives @SBBActive archives @SBB
Active archives @SBB
 

Último

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 

Último (20)

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 

OCR-D: An end-to-end open source OCR framework for historical printed documents

  • 1. OCR-D: An end-to-end open source OCR framework for historical printed documents Clemens Neudecker, Konstantin Baierer, Maria Federbusch, Matthias Boenig, Kay-Michael Würzner, Volker Hartmann, Elisa Herrmann DATeCH2019 8-10 May 2019, Brussels, Belgium
  • 2. Introduction 2 ● Many OCR-projects in the past: METAe, IMPACT, eMOP, asf. ● Yet there is still a lack of open and comprehensive tools & methods for OCR/OLR specifically targeting historical printed documents ● New breakthroughs through artificial intelligence/machine learning enable competitive OCR quality given sufficient training ● With the advent of the Digital Humanities, the requirement for large scale text corpora with high quality OCR is growing rapidly → Setting up of the OCR-D project in 2015 Main goals of OCR-D: ● Development of OCR solutions suitable for historical prints ● Standardisation of metadata and Ground Truth ● Creation of training and evaluation data
  • 3. Architecture OCR-D is composed of a “Coordination Project” consisting of ● Herzog-August-Library Wolfenbüttel ● Berlin-Brandenburg Academy of Sciences and Humanities ● Bavarian State Library (until 08/2016) ● Berlin State Library (from 12/2016) ● Karlsruhe Institute of Technology (from 08/2017) ...as well as a total of 8 separate “Module Projects” that ● Develop technical solutions to the identified challenges ● Implement the specifications of the “Coordination Project” OCR-D receives funding from the DFG for 2015 - 2020 3
  • 4. OCR-D Specifications Specifications and conventions for interfaces and exchange formats: ● Command Line Interface (CLI) ● Metadata and structural data (METS) ● Full Text (PAGE-XML) ● Software (ocrd-tool.json, Dockerfile) ● Long-term preservation (ocrd-zip, BagIt) → https://ocr-d.github.io/ 4
  • 5. OCR-D Specifications ● For sustainability and reuse, documentation beats implementation ● Open and transparent development on GitHub
  • 6. OCR-D Core Reference implementation of the specifications: ● Utility functions for common tasks (ocrd_utils) ● Programmatic access to data formats (ocrd_models, ocrd_modelfactory) ● Validation of interfaces and data formats (ocrd_validators) ● Toolkit to create compatible command line tools (ocrd) → https://github.com/OCR-D/core 6
  • 7. OCR-D Core ● Easy to install via PyPi: pip install ocrd
  • 8. OCR-D provides Scientific Workflow Components for OCR using the Apache Taverna Engine ● Via the Workflow Description (in SCUFL2 language), the workflows do become easily reproducible and can be shared and reused by others ● Simplifies the transparent benchmarking and evaluation of modules/components ● This approach also allows the capture of workflow provenance data → https://github.com/OCR-D/taverna_workflow OCR-D Workflow 8
  • 9. OCR-D Ground Truth In order to support the development, training and evaluation of tools and methods, the OCR-D Coordination Project provides: ● Comprehensive transcription guidelines (only in German) ● Ca. 60 complete volumes from 16th - 19th century ● Special corpora with ○ Low quality OCR ○ Challenging images ● “Structure” Ground Truth corpus ● Additional Ground Truth is currently in production All Ground Truth data is made freely available under open licenses: http://www.ocr-d.de/daten 9
  • 10. Module 1: Image Optimisation Partner(s): ● DFKI Kaiserslautern Task(s): ● Deskewing ● Dewarping ● Despeckling ● Cropping (print space, border removal) ● Binarization Code: ● https://github.com/syedsaqibbukhari/docanalysis 10
  • 11. Module 2: Layout Analysis Partner(s): ● DFKI Kaiserslautern Task(s): ● Page segmentation ● Line segmentation based on RAST ● Region classification with CNN ● Document analysis (e.g. reconstruction of table of contents) Code: ● Not yet available 11
  • 12. Module 3: Layout Analysis & Region Extraction and Classification Partner(s): ● University Würzburg Task(s): ● Based on LAREX ● Development of a CNN-based pixel classifier ● Integration of document-specific rules and heuristics ● Highly automated processing Code: ● https://github.com/ocr-d-modul-2-segmentierung/segmentation-runner 12
  • 13. Module 4: Unsupervised OCR Postcorrection Partner(s): ● University Leipzig Task(s): ● Combine finite-state-transducer and neural network based postcorrection in a noisy channel model ● Provision of (tools to create) language- and domain specific models Code: ● https://git.informatik.uni-leipzig.de/ocr-d/cor-asv-fst ● https://git.informatik.uni-leipzig.de/ocr-d/cor-asv-ann 13
  • 14. Module 5: Tesseract Partner(s): ● University Library Mannheim Task(s): ● Adaptation of Tesseract to the OCR-D specifications and interfaces ● Improvement of the code quality ● Optimisation for high throughput Code: ● https://github.com/tesseract-ocr ● https://github.com/OCR-D/ocrd_tesserocr 14
  • 15. Module 6: Automated Postcorrection with optional interactive Postcorrection Partner(s): ● University of Munich Task(s): ● Alignment of multiple OCR ● Profiling of OCR ● Automated postcorrection and correction protocol ● (Optional) interactive postcorrection Code: ● https://github.com/cisocrgroup/ocrd-postcorrection ● https://github.com/cisocrgroup/cis-ocrd-py 15
  • 16. Module 7: Automated Font Recognition & Model Training Infrastructure Partner(s): ● University of Mainz ● University of Erlangen ● University of Leipzig Task(s): ● Automated identification of fonts from images ● Development of an OCR training infrastructure and model repository Code: ● https://github.com/seuretm/ocrd_typegroups_classifier ● https://github.com/Doreenruirui/okralact 16
  • 17. Module 8: Long-term preservation Partner(s): ● Göttingen State and University Library ● GWDG Göttingen Task(s): ● Analysis of requirements for long-term preservation of OCR ● Concept and prototype implementation for ○ Persistent storage and identification of OCR data in the archive ○ Citation of OCR data in the archive ○ Search functionality within the archive Code: ● https://github.com/subugoe/OLA-HD-IMPL 17
  • 18. OCR-D contact/access points ● OCR-D Website: http://ocr-d.de/eng ● OCR-D GitHub: https://github.com/OCR-D ● OCR-D Specification and Documentation: https://ocr-d.github.io/ ● OCR-D Ground Truth: https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit ● OCR-D Gitter: https://gitter.im/OCR-D/Lobby ● OCR-D Docker: https://hub.docker.com/u/ocrd 18
  • 19. Thank you for your attention! Questions, please? DATeCH2019 8-10 May 2019, Brussels, Belgium