IMPACT Final Event 26-06-2012 - Overview of IMPACT tools by: ABBYY, NCSR Demokritos, University of Salford, IBM, University of Innsbruck, LMU University of Munich, INL Institute for Dutch Lexicology and KB
3. Baseline: FineReader XIX
First Omnifont OCR for Fraktur and Old European Scripts
Special dictionaries for Old European languages
Available as a result of METAe project
Baseline for IMPACT project
4. IMPACT project: Improvements at every step
Image pre-processing
– Better binarisation
Layout analysis
– General quality improvements
– Better historical newspapers segmentation
Recognition
– New classifiers
– Better gothic recognition, including new graphemes
– Old Slavonic script supported
Export
– ALTO XML
– PDF MRC: smaller file size with the same image quality
Extensibility
– Extended language support
– Training API
Licensing and pricing
– Special XIX pricing for IMPACT project
8. Text Recognition
Improved recognition of pre-trained gothic fonts.
New graphemes.
DoW: “Significantly improved performance and substitution rate especially for historical texts.”
Test results: 20% to 30% recognition quality improvement
New language: Old Church Slavonic
Test results: ~5% recognition errors
9. More: Native ALTO Support
Available in
FineReader Engine > 10 R2
Cloud OCR SDK
10. More: PDF MRC
The same image quality with much lower file size
Original: jp2, 251 KB; PDF MRC: 53 KB
12. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
Tools developed by the National Center for Scientific Research (NCSR) "Demokritos"
Border Removal Tool
Image enhancement tools
Download demo: http://www.iit.demokritos.gr/~bgat/H-DocPro/
It detects and removes noisy black borders as well as noisy text regions. Moreover, it detects the optimal page frames of double page document images.
IMPACT event: Project outcomes | 26 June 2012: NCSR “Demokritos” Tools, B.Gatos
13. Tools developed by the National Center for Scientific Research (NCSR) "Demokritos"
Border Removal Tool
Image enhancement tools
Download demo: http://www.iit.demokritos.gr/~bgat/H-DocPro/
(SET-A) 38718 randomly selected historical images with and without noisy black borders.
(SET-B) 22383 images with noisy black borders, subset of SET-A.
(SET-C) 3467 double page historical images
14. Tools developed by the National Center for Scientific Research (NCSR) "Demokritos"
Page Curl Correction Tool
Image enhancement tools
Download demo: http://www.iit.demokritos.gr/~bgat/H-DocPro/
It rectifies document images which suffer from warping and perspective distortions that deteriorate the performance of OCR.
15. Tools developed by the National Center for Scientific Research (NCSR) "Demokritos"
Page Curl Correction Tool
Image enhancement tools
Download demo: http://www.iit.demokritos.gr/~bgat/H-DocPro/
420 randomly selected historical document images
16. Tools developed by the National Center for Scientific Research (NCSR) "Demokritos"
OCR evaluation toolkit
Download: http://www.iit.demokritos.gr/~bgat/OCREval/
Based on the isri-ocr-evaluation-tools (Information Science Research Institute, University of Nevada. Source code from http://code.google.com/p/isri-ocr-evaluation-tools/, code license: Apache License 2.0)
Measures: Character Accuracy, Word Accuracy, Rejection Rate, Characters Marked, Accuracy after Correction, Accuracy by Character Class, Figure of Merit
OCREval.exe [a,u8,u16] stop_words.txt (or “-”) gt.txt (or batchgt.txt) ocr.txt (or batch.txt) character_report.txt word_report.txt overall_report.xml
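The two headline measures, character accuracy and word accuracy, can be illustrated with a minimal sketch. It assumes accuracy is computed as (n - errors) / n, where n is the length of the ground truth and errors is the edit distance to the OCR output. The function names are hypothetical and this is not the toolkit's own code; in particular, the word measure here is a simplification that ignores stop words.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or token lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # item missing in OCR
                               current[j - 1] + 1,       # extra item in OCR
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def accuracy(reference, hypothesis):
    """Assumed definition: (n - errors) / n, with n = len(reference)."""
    if not reference:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(reference, hypothesis)
    return (len(reference) - errors) / len(reference)


def character_accuracy(ground_truth, ocr_text):
    """Accuracy over characters."""
    return accuracy(ground_truth, ocr_text)


def word_accuracy(ground_truth, ocr_text):
    """Accuracy over whitespace-separated tokens (simplified)."""
    return accuracy(ground_truth.split(), ocr_text.split())
```

For example, `character_accuracy("historical", "hiftorical")` counts one substitution against ten ground-truth characters.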
17. Tools developed by the National Center for Scientific Research (NCSR) "Demokritos"
Word Spotting Application
Download demo: http://www.iit.demokritos.gr/~bgat/WordSpot/
It provides an integrated GUI for indexing historical documents without an OCR engine. It allows searching the database for instances of a query keyword using several different methods (list, query by example, free text).
18. Tools developed by the National Center for Scientific Research (NCSR) "Demokritos"
Word Spotting Application
Download demo: http://www.iit.demokritos.gr/~bgat/WordSpot/
Benchmarking set: French book, BnF – year of publication: 1838 (153 pages); German book, BSB – year of publication: 1788 (126 pages)
19. Tools developed by the National Center for Scientific Research (NCSR) "Demokritos"
Scientific impact
Part of the PhD thesis of N. Stamatopoulos, who was awarded the "Best PhD Thesis of 2011 Award in the area of Informatics and Telematics" in Greece.
Publications in a large number of widely recognized journals and international conferences.
IMPACT related publications have 127 citations from non-IMPACT researchers (information calculated from Scopus).
Journal impact factors shown on the slide: 2.918, 1.091, 1.474
21. IMPACT Project Outcomes, 26 June 2012, The Hague
IMPACT Image and Ground Truth Repository
• Content Management System
• Datasets
– 667,437 images
– 52,880 PAGE ground truth files
• Invaluable resource for planning, experimentation and evaluation relating to digitisation projects
• Access to existing material via CoC
• Further development and usage in other projects and domains (Europeana Newspapers)
22. Aletheia
• Semi-automated tool for ground truth and showcase production
• Full PAGE support
• Page border, print space
• Layout regions (incl. metadata)
• Text lines, words and glyphs
• Unicode text at all levels
• Reading order, layers etc.
• Quality control
23. Layout Evaluation Tool
• Performance evaluation of page segmentation and region classification
• Configuration of scenarios based on different types of errors
Error types shown: Miss / Part. Miss, Split, Misclass., Merge, False Detection
24. Text Line and Word Segmentation
• Modules for integration in workflows and/or other tools
• Independent of OCR
• Used in
– Aletheia
– Dewarping
– T-OCR
– Word spotting
25. Dewarping Tool
• Correction of arbitrary warping artefacts
• Two stage process:
– Model detection
– Image transformation
• Improved legibility
• Increased OCR accuracy
26. More Information
PRImA
http://www.primaresearch.org
IMPACT CoC
http://www.digitisation.eu
28. IBM Labs in Haifa
IBM's Highlights
CONCERT (COoperative eNgine for Correction of ExtRacted Text)
IBM's collaborative correction platform announced in August 2010
http://www-03.ibm.com/press/us/en/pressrelease/32380.wss
IBM's Adaptive OCR integrated
EE2/EE3 Dictionaries integrated
Dutch, German, Spanish
Sophisticated monitoring mechanism integrated 2011
Designed for both malicious users and game users
CONCERT Games introduction
IMPACT Conference 2011
Beyond the productivity tools
Adaptive OCR System
Font and language independent
29. CONCERT
Adaptive collaborative correction platform
Uses the feedback from the users to improve productivity
Fully connected to the Adaptive OCR Engine
Strong emphasis on productivity tools
Reduce the time for verification/correction
Patented smart-key approach
Motivate volunteers
Separating data entry process into several complementary tasks
Optimized application dedicated to each task
Break down the tasks into subtasks
Make it suitable for parallel processing
Online compilation
User Monitoring
Robust against malicious users and designed for gamification
30. Adaptive OCR Engine
Consistent and reliable confidence level
Important for quality assurance
No use of prior knowledge on the font
Crazy fonts can be handled
Good use of the feedback from the users
Character and Word level
Robust to distortion
Page level distortion and printing variations
Easy to migrate between books from the same publisher
Continuous update
Not too slow
Around 2-3 times slower than OMNI Engines
31. What is next?
Search Over OCR
Beyond transcription
Improve User Feedback
Online advisor
Best performers list
Community building around content
Integrate community tools within the platform
Complete CONCERT Games
iPhone/iPad/Android/Desktop
E-Book creation
Fully digital transcription
Using original font as option
Page distortion correction
Fully integrate the word-based page distortion correction
33. The Functional Extension Parser (FEP)
A Document Understanding Platform
Günter Mühlberger
University of Innsbruck
Department for German Language and Literature Studies
34. Functional Extension Parser
A book is more than just pure text – it contains a lot of structural metadata
– These metadata are (often) encoded in the layout of a document
– Size of characters, position on page, distance to other lines, etc. are used to express structural meaning such as headlines, footnotes, caption lines, etc.
FEP is a platform to process digitised or born-digital documents and to “understand” the meaning of the layout by using a rules engine
– Modular approach
– Images and OCR as input
– Web-GUI for visualisation, editing and correction
– Evaluation module
– Export module (METS/ALTO, PDF, ePUB,...)
Several pilots
– EOD Network: Enhancement of PDF eBook of historical books
– DNB: Extraction of metadata from title pages
– Internal project: Card index
FEP is available for free for research applications
– Commercial deployment is foreseen via the technology transfer platform of the University of Innsbruck
36. Postcorrection Tool
Ludwig-Maximilians-Universität München
Centrum für Informations- und Sprachverarbeitung
The Hague, June 26th 2012
Annette Gotscharek, Ulrich Reffle, Christoph Ringlstetter, Klaus U. Schulz, Thorsten Vobl
37. Postcorrection Tool as Carrier of IMPACT Technology
Advanced language technology used for improvement of interactive postcorrection
Lexica, matching tool, profiler integrated as background technology
Document centric knowledge from unsupervised analysis of OCRed document used for detection of error classes and suggested corrections
Batch mode for corrections of many errors in "one shot"
Rich graphical user interface to let users fully benefit from "knowledge" on document derived error classes
38. Improved Lexicon Technology
Valid historical words not marked as errors
Historical variants proposed as correction candidates
39. Improved Workflow - Batch Processing
Strings with identical error patterns corrected as batch
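Batch correction by error pattern can be sketched as follows. This is a minimal illustration, not the Postcorrection Tool's implementation: it assumes each OCR token already has one suggested correction, and it defines the "error pattern" simply as the differing middle part left after stripping the common prefix and suffix. All function names are hypothetical.

```python
from collections import defaultdict


def error_pattern(ocr_token, correction):
    """Return (wrong, right): the parts that differ once the common
    prefix and common suffix of the two strings are stripped."""
    limit = min(len(ocr_token), len(correction))
    start = 0
    while start < limit and ocr_token[start] == correction[start]:
        start += 1
    end = 0
    while end < limit - start and ocr_token[-1 - end] == correction[-1 - end]:
        end += 1
    return (ocr_token[start:len(ocr_token) - end],
            correction[start:len(correction) - end])


def group_by_pattern(suggestions):
    """Group OCR tokens by the error pattern of their suggested correction."""
    groups = defaultdict(list)
    for ocr_token, correction in suggestions:
        groups[error_pattern(ocr_token, correction)].append(ocr_token)
    return dict(groups)


def apply_batch(tokens, suggestions, accepted_pattern):
    """Apply, in one step, every suggestion that shows the accepted pattern."""
    fixes = {ocr: corr for ocr, corr in suggestions
             if error_pattern(ocr, corr) == accepted_pattern}
    return [fixes.get(token, token) for token in tokens]
```

With suggestions such as `("nnd", "und")` and `("hans", "haus")`, both tokens fall under the pattern `("n", "u")`, so a user who accepts that pattern once corrects both occurrences together.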
40. Improved Selection of Correction Candidates
Ranking of candidates through document specific language and error profile
42. Lexica and Strategies and Tools for
Lexicon Building & Lexicon Deployment
in OCR and Retrieval
Historical Language
Available through the IMPACT Centre of Competence:
www.digitisation.eu
43. Lexica for 9 languages
Example:
· Dutch IR lexicon (will be available as web service):
WNT (Dictionary of the Dutch language), 475,498 distinct word forms, 215,180 lemmata, and 558,438 distinct lemma/word form combinations, with 1,636,709 attestations.
· Dutch OCR lexicon (1)
Based on a large historical corpus from the DBNL (Digitale Bibliotheek voor de Nederlandse Letteren, Digital Library for Dutch Literature).
· Dutch OCR lexicon (2)
Based on the IR lexicon (WNT) [best results in CCS experiment]
· Dutch Named Entity lexicon (persons, locations, organizations)
· Set of historical spelling variation rules
Using a reference set of 10,000 pairs of historical words taken from the WNT-based lexicon, to which parallel modern words have been added
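How spelling variation rules might map a historical form to a modern one can be sketched as below. The rewrite rules and the tiny modern lexicon are illustrative assumptions chosen for this example; they are not taken from the IMPACT rule set, and the function names are hypothetical.

```python
# Illustrative rewrite rules (historical substring -> modern substring).
# Assumption: no rule makes a word longer, so the search always terminates.
ILLUSTRATIVE_RULES = [("ae", "aa"), ("ck", "k"), ("gh", "g"), ("sch", "s")]


def variants(word, rules):
    """Return every form reachable by applying the rules zero or more times."""
    seen = {word}
    pending = [word]
    while pending:
        current = pending.pop()
        for historical, modern in rules:
            position = current.find(historical)
            while position != -1:
                rewritten = (current[:position] + modern
                             + current[position + len(historical):])
                if rewritten not in seen:
                    seen.add(rewritten)
                    pending.append(rewritten)
                position = current.find(historical, position + 1)
    return seen


def modern_candidates(word, rules, modern_lexicon):
    """Keep only the rewritten forms that occur in the modern lexicon."""
    return sorted(variants(word, rules) & set(modern_lexicon))
```

For instance, `modern_candidates("jaer", ILLUSTRATIVE_RULES, {"jaar", "mens"})` keeps only the rewritten form that the lexicon confirms.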
45. Lexica in OCR:
External Dictionary interface
· Function of the tool: enable addition of (IMPACT) special lexica to FineReader engine
· Availability: free for non‐commercial use
46. Lexica in OCR:
OCR Evaluation Tool
· Function of the tool: evaluate influence of
lexica on OCR, measure OCR (word)
accuracy
· Availability: free for non‐commercial use,
Impact Interoperability Framework
47. Lexica in Retrieval:
BlackLab
· Function of the tool: enable use of lexica and linguistic annotation in a Lucene-based search engine
· Deployment of IMPACT IR lexica in query expansion
· Keyword-in-context display of search results
· Grouping and sorting of search results
· Support for linguistic annotation (lemma, part of speech, etc)
· Exploitation of named entity tagging
· Incremental indexing
· Supports CLARIN standards (CQL)
· Availability: free for non‐commercial use
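Query expansion with an IR lexicon can be sketched in a few lines. This is not BlackLab's API: the lexicon is assumed to be a plain mapping from a modern lemma to its attested historical word forms, the example entry is illustrative, and `expand_query` is a hypothetical name.

```python
def expand_query(term, ir_lexicon):
    """Return the term plus every word form that shares a lemma with it."""
    expanded = {term}
    for lemma, forms in ir_lexicon.items():
        if term == lemma or term in forms:
            expanded.add(lemma)
            expanded.update(forms)
    return sorted(expanded)


# Illustrative entry: a modern Dutch lemma with two historical spellings.
EXAMPLE_LEXICON = {"wereld": {"werelt", "weerelt"}}
```

A search for `wereld` would then also match the historical spellings, and a search for a historical spelling would find the modern form as well.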
48. Toolbox for
lexicon building and deployment
· Function of the tools: support lexicon
building for historical language
· Corpus-based lexicon building
· Lexicon-building from dictionary quotations
· Spelling variation: pattern extraction and matching
· Lemmatization
· Availability: free for non‐commercial use
49. CoBaLT corpus-based
lexicon building tool
· Function of the tool: productivity tool for
IR lexicon building from corpora
· Both for pre-lemmatized and unannotated corpora
· Web-based
· Availability: free for non‐commercial use
51. Dictionary attestation tool
· Function of the tool: productivity tool for IR
lexicon building from dictionary quotations
· Manual evaluation and correction of large quantities of automatically matched occurrences of a headword (scripts available) in the quotations of the particular article in a comprehensive dictionary
· High productivity: revision speed (OED) 400-600 entries/hour
· Web-based
· Availability: free for non‐commercial use
53. Spelling variation
and lemmatization tools
· Function of the tools:
· match (inflected) historical spellings of words to
modern dictionary headword form
· Derive spelling variation patterns from example
material
· Availability: free for non‐commercial use
54. NERT
named entity recognition tool
· Function of the tool: mark up named
entities in text
· Extension of Stanford NE recognizer with
extensions for spelling variation and OCR
errors
· Availability: free for non‐commercial use
55. Named entity
matching tool
· Function of the tool: match named entities
with historical spelling to standard form
· Availability: free for non‐commercial use
56. Named entity
attestation tool
· Function of the tool: manual evaluation
and correction of automatically matched
occurrences of Named Entities in text
material
· Web‐based
· Availability: free for non‐commercial use
59. Background / Use case
A diverse set of tools relevant to OCR & digitisation being
developed by IMPACT and also others
But:
Mix of prototypes, robust commercial solutions, 3rd party tools
Variety of programming languages, environments, backgrounds
Solution:
Wrapping of tools as web services allows:
- quick & easy integration → uniform access
- flexible, loose coupling of separate modules
- platform independent execution via the web
IMPACT Interoperability Framework (IIF)
60. Overview: Service Oriented Architecture
IMPACT Interoperability Framework Technologies:
Java
REST/SOAP
Apache Maven
Apache Tomcat
Apache Axis2
Apache Synapse
Taverna Engine
Interoperability through open technology standards (OASIS/W3C)
61. Workflow example
• OCR task = processing pipeline
• Building blocks = individual tools
• Workflow = interaction between tools (mashups)
• Collaboration with
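The idea of a workflow as a chain of interchangeable building blocks can be sketched as below. This does not use Taverna or the IIF web services: each tool is assumed to be a plain function taking and returning a document record, and the step names are hypothetical placeholders.

```python
from functools import reduce


def make_step(name):
    """Create a placeholder tool that only records that it was run."""
    def step(document):
        return {**document, "history": document["history"] + [name]}
    return step


def run_workflow(steps, document):
    """Pass the document through each building block in order."""
    return reduce(lambda data, step: step(data), steps, document)


# Hypothetical OCR pipeline assembled from three building blocks.
OCR_PIPELINE = [make_step("border_removal"),
                make_step("text_recognition"),
                make_step("postcorrection")]
```

Calling `run_workflow(OCR_PIPELINE, {"image": "page.tif", "history": []})` returns a record whose `history` lists the three steps in order. Because every block shares the same interface, one tool can be swapped for another without touching the rest of the pipeline.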
62. Availability & Takeup
Availability
IIF components released as open source (Apache 2.0 License):
https://github.com/impactcentre/interoperability-framework
Community workflow resources (CC License):
http://www.myexperiment.org/groups/235
Takeup:
SCAPE – Scalable Preservation Environments
DDMAL – Digital Music Archives & Libraries Laboratory Montreal