Reinhold Huber-Mörk, Austrian Institute of Technology, presented a method for quality assurance of scanned content based on computer vision at iPres 2012, Toronto.
In: iPRES 2012 – Proceedings of the 9th International Conference on Preservation of Digital Objects. Toronto 2012, 136-143.
ISBN 978-0-9917997-0-1
Duplicate detection for quality assurance of document image collections
1. Duplicate detection for quality assurance of document image collections
Reinhold Huber-Mörk1 & Alexander Schindler1,2 & Sven Schlarb3
1 Research Area Intelligent Vision Systems, Department Safety & Security
AIT Austrian Institute of Technology
2 Department of Software Technology and Interactive Systems
Vienna University of Technology
3 Department for Research and Development
Austrian National Library
2. Overview
Digital preservation & quality assurance
Digital image preservation workflows
Image duplicate detection
Keypoints and feature descriptors in Computer Vision
Bag of visual words
Results on a real-world data set
22.11.2012 2
3. SCAPE project and quality assurance
SCAlable Preservation Environments, EU FP7
Preservation Components:
- improve and extend existing tools,
- develop new ones where necessary,
- apply proven approaches like image and pattern analysis to the problem of ensuring quality in digital preservation
4. Quality assurance in image preservation
Comparison of image content
- automatic image processing workflows (e.g. format conversion)
- reacquisition of images
Duplicate detection
- within a single collection (filtering)
- between collections (merging, comparison)
Solutions:
- page segmentation + OCR
- feature based approaches
11. Bag of visual words (1)
Bag of words model in text information retrieval:
Document 1: “Peter likes to read books. Paul likes too”.
Document 2: “Peter also likes to read poems”
Bag: [ Peter, likes, to, read, books, Paul, too, also, poems ]
Histogram 1: [ 1, 2, 1, 1, 1, 1, 1, 0, 0 ]
Histogram 2: [ 1, 1, 1, 1, 0, 0, 0, 1, 1 ]
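The text example above can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' code: the helper names (`tokenize`, `build_vocabulary`, `histogram`) are hypothetical, and it assumes that the bag is ordered by first occurrence and that matching is case-sensitive, as the slide suggests.

```python
import re


def tokenize(text):
    # Assumption: a word is a run of letters; punctuation is dropped.
    return re.findall(r"[A-Za-z]+", text)


def build_vocabulary(documents):
    # Collect each distinct word once, in order of first occurrence.
    vocab = []
    for doc in documents:
        for word in tokenize(doc):
            if word not in vocab:
                vocab.append(word)
    return vocab


def histogram(document, vocab):
    # Count how often each vocabulary word occurs in the document.
    words = tokenize(document)
    return [words.count(term) for term in vocab]


doc1 = "Peter likes to read books. Paul likes too"
doc2 = "Peter also likes to read poems"

vocab = build_vocabulary([doc1, doc2])
hist1 = histogram(doc1, vocab)
hist2 = histogram(doc2, vocab)

print(vocab)
print(hist1)
print(hist2)
```

Note that "to" and "too" are distinct words, which is why both appear in the bag.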
Visual analogy: bag of visual words or bag of features
Document | Image
Document made of words | Image made of descriptors
Bag of words | Bag of clustered descriptors = visual words
Word occurrence histogram | Visual word histogram / "fingerprint"
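To illustrate the visual side of the analogy, the sketch below assigns each descriptor to its nearest cluster centre (its "visual word") and counts the assignments. It is an illustration under assumptions: the codebook is taken as already given (in practice it would come from clustering many descriptors), the 2-D toy descriptors stand in for real high-dimensional ones, and all function names are hypothetical.

```python
def squared_distance(a, b):
    # Squared Euclidean distance between two descriptors.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, codebook):
    # Index of the closest cluster centre, i.e. the visual word.
    return min(range(len(codebook)),
               key=lambda i: squared_distance(descriptor, codebook[i]))


def visual_word_histogram(descriptors, codebook):
    # Count how many descriptors fall onto each visual word.
    hist = [0] * len(codebook)
    for d in descriptors:
        hist[nearest_word(d, codebook)] += 1
    return hist


# Toy data (assumption): three visual words, four descriptors from one image.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 0.1), (1.1, -0.1), (0.0, 0.8)]

fingerprint = visual_word_histogram(descriptors, codebook)
print(fingerprint)
```

The resulting list plays the same role for an image as the word occurrence histogram does for a text document.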
16. Spatial verification (1)
Bag of visual words maintains no (or limited) spatial information
Spatial verification:
1. Ranking of most similar images in a shortlist
2. Direct matching of descriptors for pairs of images
3. Overlaying of images
4. Estimation of similarity
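Step 1 can be sketched as ranking all other images by the similarity of their visual word histograms and keeping the top k as the shortlist. The slides do not state which histogram similarity is used; cosine similarity is an assumption here, as are the function names and the toy histograms.

```python
import math


def cosine_similarity(h1, h2):
    # Cosine of the angle between two histograms; 0.0 if either is empty.
    dot = sum(a * b for a, b in zip(h1, h2))
    n1 = math.sqrt(sum(a * a for a in h1))
    n2 = math.sqrt(sum(b * b for b in h2))
    if n1 == 0 or n2 == 0:
        return 0.0
    return dot / (n1 * n2)


def shortlist(query_index, histograms, k):
    # Score every other image against the query, best matches first.
    scores = [(cosine_similarity(histograms[query_index], h), i)
              for i, h in enumerate(histograms) if i != query_index]
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scores[:k]]


# Toy histograms (assumption): image 1 resembles image 0, image 2 does not.
histograms = [[4, 0, 1], [4, 0, 2], [0, 5, 0], [1, 1, 1]]
candidates = shortlist(0, histograms, k=2)
print(candidates)
```

Only the shortlisted pairs would then go through the more expensive steps 2 to 4.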
17. Spatial verification (2)
[Figure: processing chain for a pair of possible duplicates: descriptor matching, estimation of affine transformation, image overlay, similarity estimation (similarity measure: MSSIM)]
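MSSIM usually denotes the mean of the structural similarity index (SSIM) over local image windows. The sketch below is a simplified illustration, not the authors' implementation: it assumes the two grayscale images are already overlaid (aligned), uses non-overlapping square windows instead of the sliding Gaussian-weighted windows of the standard formulation, and uses the customary constants 0.01 and 0.03. Function names and test images are hypothetical.

```python
def ssim_window(x, y, dynamic_range=255.0):
    # SSIM for two equally sized lists of pixel values.
    n = len(x)
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mx = sum(x) / n
    my = sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx * mx + my * my + c1) * (vx + vy + c2))


def mssim(img_a, img_b, window=4):
    # Mean SSIM over non-overlapping window x window blocks (simplification).
    rows, cols = len(img_a), len(img_a[0])
    scores = []
    for r in range(0, rows - window + 1, window):
        for c in range(0, cols - window + 1, window):
            xa = [img_a[i][j] for i in range(r, r + window)
                  for j in range(c, c + window)]
            xb = [img_b[i][j] for i in range(r, r + window)
                  for j in range(c, c + window)]
            scores.append(ssim_window(xa, xb))
    return sum(scores) / len(scores)


# Toy 8x8 images (assumption): a gradient pattern and its inverted copy.
image = [[(i * 8 + j * 3) % 256 for j in range(8)] for i in range(8)]
inverted = [[255 - v for v in row] for row in image]

same_score = mssim(image, image)
diff_score = mssim(image, inverted)
print(same_score, diff_score)
```

An identical pair scores 1, while structurally different content scores lower, which is what makes the measure usable for confirming or rejecting a candidate duplicate.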
18. Duplicate detection (1)
Pairwise comparison for a collection of N pages
[Plot: max(Da) on the vertical axis (0 to 1) against image index a on the horizontal axis (0 to 500)]
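The pairwise comparison can be sketched as follows. This is an illustration under assumptions: Da is read here as the set of similarities between page a and every other page, so max(Da) is the score of the best match for page a; the fixed threshold of 0.8 is purely hypothetical and is not the thresholding scheme used by the authors. Function names and the toy similarity matrix are also invented for the example.

```python
def max_similarity_per_image(similarity):
    # For each image a, the best similarity to any other image b != a.
    n = len(similarity)
    return [max(similarity[a][b] for b in range(n) if b != a)
            for a in range(n)]


def flag_duplicates(max_scores, threshold):
    # Indices whose best match reaches the threshold.
    return [a for a, s in enumerate(max_scores) if s >= threshold]


# Toy symmetric similarity matrix (assumption): pages 0 and 2 are duplicates.
similarity = [
    [1.0, 0.2, 0.9, 0.1],
    [0.2, 1.0, 0.3, 0.25],
    [0.9, 0.3, 1.0, 0.15],
    [0.1, 0.25, 0.15, 1.0],
]

max_scores = max_similarity_per_image(similarity)
duplicates = flag_duplicates(max_scores, threshold=0.8)
print(max_scores)
print(duplicates)
```

For N pages this needs on the order of N squared comparisons, which is why a fast histogram comparison is attractive as a first stage.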
20. Comparison of duplicate detection schemes
a) Visual word histogram comparison (term frequency, tf)
b) tf with inverse document frequency (tf/idf)
c) tf with spatial verification (tf/sv)
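The difference between schemes a) and b) can be illustrated with a standard tf/idf weighting of visual word histograms. This is a sketch of the textbook formulation, assumed here because the slides do not give the exact variant: tf is the count normalised by the histogram total, and idf is the natural logarithm of the number of images divided by the number of images containing the word. The function name and toy data are hypothetical.

```python
import math


def tf_idf(histograms):
    # Re-weight raw visual word counts by how rare each word is overall.
    n_docs = len(histograms)
    n_words = len(histograms[0])
    # Document frequency: in how many images does each word occur?
    df = [sum(1 for h in histograms if h[w] > 0) for w in range(n_words)]
    idf = [math.log(n_docs / d) if d > 0 else 0.0 for d in df]
    weighted = []
    for h in histograms:
        total = sum(h)
        weighted.append([(c / total) * idf[w] if total else 0.0
                         for w, c in enumerate(h)])
    return weighted


# Toy histograms (assumption): word 0 occurs in every image.
histograms = [[2, 1, 0], [1, 0, 3], [1, 0, 0]]
weighted = tf_idf(histograms)
print(weighted)
```

A visual word that occurs in every image receives zero weight, so it no longer contributes to the comparison, while rare words are emphasised.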
22. Conclusion and outlook
Workflows for duplicate detection for complex documents
Keypoint detection and description = purely image based
Bag of visual words provides fast matching
Spatial verification applied to shortlist
Robust thresholding scheme for duplicate identification
Evaluation at Austrian National Library
Integration on SCAPE platform for scalable preservation
23. AIT Austrian Institute of Technology
your ingenious partner
reinhold.huber-moerk@ait.ac.at