IMPACT Final Conference - Stefan Pletschacher
1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
Evaluation Tools
Stefan Pletschacher
2.
Overview
Digitisation Workflows
Performance Evaluation
Ground Truth
Evaluation Tools
Segmentation and Layout
OCR Text
Interpretation of Results
Stefan Pletschacher - Evaluation Tools, IMPACT Conference, London, 24.10.2011 2
3.
Digitisation Workflow
Evaluation
• Individual Processing Steps
• Complex Workflows
① Scanning
② Image enhancement
Page splitting
Border removal
Dewarping (page curl, arbitrary warping)
Noise removal
Binarisation
③ Layout analysis
Segmentation of regions, lines, words and characters
Region classification
Logical layout analysis
④ OCR
⑤ Post-processing
4.
Performance Evaluation Overview
[Diagram: Image Repository, Evaluation Tools, Evaluation Results]
Compatibility through one common format (PAGE)
5.
IMPACT Image Repository
Central management of Metadata, Images and Ground Truth
6.
Datasets
Total number of images: 667,120
Institutional Datasets (10 libraries): 602,313 images
Demonstrator Sets: 56,141 images
Ground Truth in PAGE format: 36,498 approved instances, still growing
Working Sets (Showcases, Typewritten Set, Dewarping Set, Challenge Sets etc.)
Usage statistics (Since 6/10/2010)
– 5,153,347 thumbnails browsed
– 810,001 images accessed (724,946 full quality images, 22,676 direct access calls)
7.
Tools for Ground Truth Production
Aletheia
Page border, print space
Layout regions (incl. metadata)
Text lines, words and glyphs
Unicode text at all levels
Reading order, layers etc.
FineReader Engine Exporter (Preproduction)
GT Validator
8.
Ground Truthing Historical Documents
Full Unicode Support (incl. special characters for historical documents)
Complex Reading Order (Groups of ordered and unordered elements)
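The idea of a reading order built from groups of ordered and unordered elements can be sketched as a small tree structure. This is only an illustration: the class names `OrderedGroup` and `UnorderedGroup`, the helper functions and the region IDs are assumptions made for this sketch, not taken from the slides or from any tool.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import List, Set, Tuple, Union


@dataclass
class OrderedGroup:
    """Children must be read in the listed sequence."""
    children: List["Node"] = field(default_factory=list)


@dataclass
class UnorderedGroup:
    """Children belong together, but no sequence is prescribed."""
    children: List["Node"] = field(default_factory=list)


# A node is either a region ID (str) or a nested group.
Node = Union[str, OrderedGroup, UnorderedGroup]


def region_ids(node: Node) -> List[str]:
    """Collect all region IDs below a node, depth first."""
    if isinstance(node, str):
        return [node]
    ids: List[str] = []
    for child in node.children:
        ids.extend(region_ids(child))
    return ids


def ordered_pairs(node: Node) -> Set[Tuple[str, str]]:
    """Return every (a, b) pair where region a must be read before region b."""
    if isinstance(node, str):
        return set()
    pairs: Set[Tuple[str, str]] = set()
    for child in node.children:
        pairs |= ordered_pairs(child)
    if isinstance(node, OrderedGroup):
        # Only ordered groups impose a sequence between their children.
        for i, earlier in enumerate(node.children):
            for later in node.children[i + 1:]:
                pairs |= set(product(region_ids(earlier), region_ids(later)))
    return pairs


# Hypothetical page: a header, two adverts in no particular order, two paragraphs.
page = OrderedGroup([
    "header",
    UnorderedGroup(["advert_1", "advert_2"]),
    OrderedGroup(["paragraph_1", "paragraph_2"]),
])
pairs = ordered_pairs(page)
print(("paragraph_1", "paragraph_2") in pairs)  # True
print(("advert_1", "advert_2") in pairs)        # False
```

Under this model, the two adverts are constrained relative to the header and the paragraphs, but not relative to each other.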
9.
Ground Truth – Image Enhancement
Deskew, Dewarping, Border Removal, Binarisation
10.
The PAGE Format Framework
Page Analysis and Ground-Truth Elements
Two-level architecture:
– root structure
– task specific sub-formats
Separate XML Schema definitions
Format identification via Namespaces
Mapping of
– dependencies
– processing chains
Representation of
– alternative processing steps
Linking via IDs
[Diagram: PAGE root (XML) linked to several PAGE gts (XML) instances; Processing Results / Ground Truth]
http://schema.primaresearch.org/PAGE/
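Format identification via namespaces and linking via IDs can be illustrated with Python's standard `xml.etree.ElementTree`. The sample document below is an assumption made for this sketch: its namespace URI, element names and attributes are illustrative and should be checked against the published schema before relying on them.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only: namespace URI, element names and attributes are
# assumptions for this sketch, not quoted from the slides.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19">
  <Page imageFilename="example.tif">
    <TextRegion id="r1"/>
    <TextRegion id="r2"/>
  </Page>
</PcGts>"""


def split_tag(tag):
    """Split an ElementTree tag of the form '{namespace}local' into its parts."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return "", tag


root = ET.fromstring(SAMPLE)
namespace, local_name = split_tag(root.tag)

# The namespace tells a tool which sub-format (and version) it is looking at.
print(namespace)
print(local_name)

# Elements can then be looked up by ID, which is what linking between files relies on.
regions_by_id = {
    region.get("id"): region
    for region in root.iter("{%s}TextRegion" % namespace)
}
print(sorted(regions_by_id))
```

A tool could dispatch on the namespace string to select the matching schema or parser before reading any content.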
11.
Evaluation Tools
Segmentation and Layout
OCR Text
Deskewing
Dewarping
Border Removal
Binarisation
Double Page Splitting
12.
Segmentation and Layout
[Figure labels: Ground Truth, Result, Overlap]
Error types
– Miss / Partial Miss
– Split
– Misclassification
– Merge
– False Detection
Differentiation of errors based on reading order
– allowable
– non-allowable
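A minimal sketch of how such error types could be derived from the overlap between ground-truth and result regions, assuming axis-aligned rectangles and result regions that do not overlap each other. The function names, region IDs and coordinates are hypothetical. Misclassification and the allowable / non-allowable distinction are left out, since they would need region types and reading order.

```python
from typing import Dict, List, Tuple

# (left, top, right, bottom); right and bottom are exclusive.
Rect = Tuple[int, int, int, int]


def area(r: Rect) -> int:
    """Area of a rectangle; degenerate rectangles count as 0."""
    return max(0, r[2] - r[0]) * max(0, r[3] - r[1])


def overlap(a: Rect, b: Rect) -> int:
    """Area of the intersection of two rectangles (0 if disjoint)."""
    return area((max(a[0], b[0]), max(a[1], b[1]),
                 min(a[2], b[2]), min(a[3], b[3])))


def classify(ground_truth: Dict[str, Rect],
             result: Dict[str, Rect]) -> List[Tuple[str, str]]:
    """Return (error type, region ID) pairs found by overlap analysis."""
    errors: List[Tuple[str, str]] = []
    for gt_id, gt in ground_truth.items():
        hits = [r for r in result.values() if overlap(gt, r) > 0]
        if not hits:
            errors.append(("miss", gt_id))
            continue
        # Assumes result regions are mutually disjoint, so areas can be summed.
        covered = sum(overlap(gt, r) for r in hits)
        if covered < area(gt):
            errors.append(("partial miss", gt_id))
        if len(hits) > 1:
            errors.append(("split", gt_id))
    for res_id, res in result.items():
        hits = [g for g in ground_truth.values() if overlap(res, g) > 0]
        if not hits:
            errors.append(("false detection", res_id))
        elif len(hits) > 1:
            errors.append(("merge", res_id))
    return errors


# Hypothetical page: the result merges two paragraphs, misses the caption
# and reports a region where the ground truth has none.
ground_truth = {
    "header": (0, 0, 100, 10),
    "para_1": (0, 20, 100, 50),
    "para_2": (0, 60, 100, 90),
    "caption": (0, 100, 100, 110),
}
result = {
    "a": (0, 0, 100, 10),
    "b": (0, 20, 100, 90),
    "c": (0, 120, 100, 130),
}
errors = classify(ground_truth, result)
print(errors)
```

Real layouts use arbitrary polygons, so a production tool would need polygon intersection rather than the rectangle arithmetic used here.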
13.
Example – Ground Truth
Page Header
Paragraph
Paragraph
Caption
Image
14.
Example – Layout Analysis Result
Header
Paragraph
Paragraph
Image
Image
Image
15.
Evaluation
[Figure: Ground Truth compared with the Layout Analysis Result; regions labelled Caption and Paragraph; annotated errors: Miss, Partial Miss, Misclassification, Merge, Split]
16.
OCR Text
Comparison of Ground Truth and OCR output based on encoded text (ASCII, Unicode)
Character accuracy
– Distance measure: minimum number of edit operations (insertions, deletions, substitutions)
– Per character class (lower case, upper case, whitespace characters, numbers, symbols, ...)
Word accuracy
– Correctly recognised words vs. total word count
– Stop words and non-stop words
Rejected and suspicious characters
Substitution errors (higher penalty)
Correction effort
Example: "OCR is cool" (ground truth) vs. "OOR is cod" (OCR result)
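The character and word accuracy measures listed above can be sketched as follows. The edit distance is the standard Levenshtein distance with unit costs. The word accuracy uses a simple bag-of-words match, which is an assumption of this sketch; the slides do not say how words are aligned, and the higher penalty for substitution errors is not modelled here.

```python
from collections import Counter


def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_accuracy(ground_truth: str, ocr_text: str) -> float:
    """1 - (edit distance / ground-truth length), clamped at 0."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ground_truth, ocr_text)
    return max(0.0, 1.0 - errors / len(ground_truth))


def word_accuracy(ground_truth: str, ocr_text: str) -> float:
    """Share of ground-truth words that also occur in the OCR text (bag of words)."""
    gt_words = Counter(ground_truth.split())
    ocr_words = Counter(ocr_text.split())
    total = sum(gt_words.values())
    if total == 0:
        return 1.0
    matched = sum((gt_words & ocr_words).values())
    return matched / total


gt, ocr = "OCR is cool", "OOR is cod"
print(edit_distance(gt, ocr))                 # 3
print(round(character_accuracy(gt, ocr), 2))  # 0.73
print(round(word_accuracy(gt, ocr), 2))       # 0.33
```

For the slide's example pair, three edit operations against 11 ground-truth characters give a character accuracy of about 73%, while only one of the three words ("is") is recognised correctly.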
17.
Interpretation of Results
Metrics
– Measurements of conditions
– Types and number of errors
Scenarios
– Application context
– Error weights
Overall success/error rates are based on
– weighted individual results
– type and size of affected regions
– allowable vs. non-allowable errors
[Diagram: error types (Miss, Misclass., Merge, Split, False detect.); individual measures M1, M2, M3, ... and S1, S2 feed the Merge Rate and Split Rate, which combine into the Error Rate]
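One way to picture scenario-dependent weighting is the sketch below. The formula is an assumption made for illustration, not the method used by the evaluation tools: each error contributes its region size multiplied by a per-type weight, allowable errors are discounted by a hypothetical factor, and the sum is normalised by the total region area.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RegionError:
    kind: str        # error type, e.g. "merge", "split", "miss"
    area: float      # size of the affected region, e.g. in pixels
    allowable: bool  # True if the error does not violate the reading order


def weighted_error_rate(errors: List[RegionError],
                        weights: Dict[str, float],
                        total_area: float,
                        allowable_factor: float = 0.5) -> float:
    """Weighted error rate in [0, 1]; all weights and factors are hypothetical."""
    if total_area <= 0:
        raise ValueError("total_area must be positive")
    penalty = 0.0
    for error in errors:
        factor = allowable_factor if error.allowable else 1.0
        # Error types without an explicit weight default to 1.0.
        penalty += weights.get(error.kind, 1.0) * factor * error.area
    return min(1.0, penalty / total_area)


errors = [
    RegionError("merge", 2000, allowable=True),
    RegionError("split", 1000, allowable=False),
    RegionError("miss", 500, allowable=False),
]

# Two made-up scenarios that weight the same errors differently.
uniform = {"merge": 1.0, "split": 1.0, "miss": 1.0}
text_search = {"merge": 0.5, "split": 0.5, "miss": 2.0}

print(weighted_error_rate(errors, uniform, total_area=10000))      # 0.25
print(weighted_error_rate(errors, text_search, total_area=10000))  # 0.2
```

The same set of errors yields different rates per scenario, which is the point of separating the measurements from the error weights.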
18.
For more information visit:
PRImA
http://www.primaresearch.org
IMPACT
http://www.impact-project.eu
Editor's notes
11 characters, 2 substitutions, 1 insertion: error rate of 27% = 73% accuracy