Duplicate detection for quality assurance of document image collections

Reinhold Huber-Mörk (1), Alexander Schindler (1, 2), Sven Schlarb (3)

(1) Research Area Intelligent Vision Systems, Department Safety & Security, AIT Austrian Institute of Technology

(2) Department of Software Technology and Interactive Systems, Vienna University of Technology

(3) Department for Research and Development, Austrian National Library

Overview

 Digital preservation & quality assurance

 Digital image preservation workflows

 Image duplicate detection

 Keypoints and feature descriptors in Computer Vision

 Bag of visual words

 Results on a real-world data set

22.11.2012 2

SCAPE project and quality assurance

 SCAlable Preservation Environments, EU FP7

 Preservation Components:

- improve and extend existing tools,

- develop new ones where necessary,

- apply proven approaches like image and pattern analysis to the problem of ensuring quality in digital preservation


Quality assurance in image preservation

 Comparison of image content

- automatic image processing workflows (e.g. format conversion)

- reacquisition of images

 Duplicate detection

- within a single collection (filtering)

- between collections (merging, comparison)

Solutions:

- page segmentation + OCR

- feature based approaches


Book scan sequence with duplicates


Duplicate detection workflow


Keypoint detection and description (1)

 Keypoints are detected at salient image regions

 A keypoint is described by a descriptor (= vector of features)

 Scale-Invariant Feature Transform - SIFT (Lowe, 2004)



 Invariance w.r.t. color/tone transformation

 Invariance w.r.t. rotation, scaling or translation


 All detections (ordered by scale)


Duplicate detection workflow


Bag of visual words (1)

 Bag of words model in text information retrieval:

Document 1: “Peter likes to read books. Paul likes too.”
Document 2: “Peter also likes to read poems.”
Bag: [ Peter, likes, to, read, books, Paul, too, also, poems ]
Histogram 1: [ 1, 2, 1, 1, 1, 1, 1, 0, 0 ]
Histogram 2: [ 1, 1, 1, 1, 0, 0, 0, 1, 1 ]
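
The text example above can be reproduced with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the tokenizer (letters only, case preserved) is an assumption chosen so that "to" and "too" remain separate words, as on the slide.

```python
import re
from collections import Counter


def tokenize(text):
    # Keep runs of letters only; punctuation is dropped, case is preserved.
    return re.findall(r"[A-Za-z]+", text)


def build_vocabulary(documents):
    # The "bag": every distinct word, in order of first appearance.
    vocab = []
    for doc in documents:
        for word in tokenize(doc):
            if word not in vocab:
                vocab.append(word)
    return vocab


def histogram(text, vocab):
    # Count how often each vocabulary word occurs in one document.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]


docs = [
    "Peter likes to read books. Paul likes too.",
    "Peter also likes to read poems.",
]
vocab = build_vocabulary(docs)
hists = [histogram(doc, vocab) for doc in docs]
print(vocab)
print(hists[0])
print(hists[1])
```

Both histograms share the same length and word order, which is what makes them directly comparable.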

 Visual analogy: bag of visual words or bag of features

Document                     Image
Document made of words       Image made of descriptors
Bag of words                 Bag of clustered descriptors = visual words
Word occurrence histogram    Visual word histogram / “fingerprint”
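
The analogy can be made concrete with a small sketch. It assumes a codebook of cluster centres has already been computed (for example by k-means, which the slides do not specify) and uses toy 2-dimensional descriptors instead of real keypoint descriptors. All names are illustrative.

```python
def squared_distance(a, b):
    # Squared Euclidean distance between two descriptor vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, codebook):
    # Index of the closest cluster centre, i.e. the visual word.
    return min(range(len(codebook)),
               key=lambda i: squared_distance(descriptor, codebook[i]))


def visual_word_histogram(descriptors, codebook):
    # Count visual word occurrences and normalise to relative frequencies.
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


# Toy data (assumed): three visual words, four descriptors from one image.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 1.1), (0.2, 0.1), (0.1, 0.9)]
fingerprint = visual_word_histogram(descriptors, codebook)
print(fingerprint)
```

The normalisation makes fingerprints of images with different numbers of keypoints comparable.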

Bag of visual words (2)


Bag of visual words (3)

[Figure: examples of visual words #104, #15, #221, #312, #424 and #250]


Duplicate detection workflow


Image comparison / duplicate detection schemes

 Comparison of visual histograms – tf (“term frequency”) score

 Inverse document frequency - idf

 Spatial verification - sv: detailed image comparison
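
A possible reading of the tf and idf scores is sketched below. The slides do not state which similarity measure is used, so cosine similarity and the logarithmic idf formula are assumptions borrowed from text retrieval. The toy histograms are invented for illustration.

```python
import math


def tf(hist):
    # Term frequency: visual word counts normalised to sum to 1.
    total = sum(hist)
    return [c / total for c in hist] if total else list(hist)


def idf_weights(histograms):
    # idf = log(N / number of images containing the word); 0 for unused words.
    n_images = len(histograms)
    weights = []
    for w in range(len(histograms[0])):
        df = sum(1 for h in histograms if h[w] > 0)
        weights.append(math.log(n_images / df) if df else 0.0)
    return weights


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def tf_similarity(h1, h2):
    return cosine(tf(h1), tf(h2))


def tfidf_similarity(h1, h2, weights):
    w1 = [t * w for t, w in zip(tf(h1), weights)]
    w2 = [t * w for t, w in zip(tf(h2), weights)]
    return cosine(w1, w2)


# Toy visual word counts (assumed): pages 0 and 1 are duplicates.
hists = [[4, 0, 1, 3], [4, 0, 1, 3], [0, 5, 1, 2]]
weights = idf_weights(hists)
print(tf_similarity(hists[0], hists[2]))
print(tfidf_similarity(hists[0], hists[2], weights))
```

Words that occur in every image receive an idf weight of zero, so only distinctive visual words contribute to the tf/idf score.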


Spatial verification (1)

 Bag of visual words maintains no (or limited) spatial information

 Spatial verification:

1. Ranking of most similar images in a shortlist
2. Direct matching of descriptors for pairs of images
3. Overlaying of images
4. Estimation of similarity
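
The two-stage idea (cheap ranking first, expensive verification only on a shortlist) might look roughly like this. Histogram intersection as the ranking score and the `verify` callback are assumptions; `verify` stands in for steps 2 to 4 and is hypothetical.

```python
def histogram_intersection(h1, h2):
    # Overlap of two normalised histograms; 1.0 means identical.
    return sum(min(a, b) for a, b in zip(h1, h2))


def shortlist(query_index, histograms, similarity, size):
    # Step 1: rank all other images by histogram similarity, keep the best.
    scored = [(similarity(histograms[query_index], h), i)
              for i, h in enumerate(histograms) if i != query_index]
    scored.sort(reverse=True)
    return [i for _, i in scored[:size]]


def verify_shortlist(query_index, histograms, similarity, verify, size=3):
    # Steps 2-4 are delegated to verify(a, b), a hypothetical callback
    # that returns a detailed similarity for one pair of images.
    candidates = shortlist(query_index, histograms, similarity, size)
    return [(i, verify(query_index, i)) for i in candidates]


# Toy fingerprints (assumed) for four pages.
histograms = [
    [0.5, 0.5, 0.0],
    [0.5, 0.4, 0.1],
    [0.0, 0.1, 0.9],
    [0.3, 0.7, 0.0],
]
candidates = shortlist(0, histograms, histogram_intersection, 2)
print(candidates)
```

Only the shortlisted pairs reach the costly descriptor matching, which keeps the overall workload manageable for large collections.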

Spatial verification (2)

1. Pair of possible duplicates
2. Descriptor matching
3. Estimation of affine transformation
4. Image overlay
5. Similarity estimation (similarity measure: MSSIM)
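
For the final similarity estimation, a simplified mean SSIM can be sketched as follows. This is not the exact variant used by the authors: it assumes the two grayscale images are already overlaid (same size, aligned), uses non-overlapping 8x8 blocks instead of a sliding Gaussian window, and takes the usual constants K1 = 0.01 and K2 = 0.03.

```python
def ssim_window(x, y, dynamic_range=255.0):
    # SSIM of two equally sized windows given as flat lists of pixel values.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) * (a - mx) for a in x) / n
    vy = sum((b - my) * (b - my) for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx * mx + my * my + c1) * (vx + vy + c2))


def mssim(img1, img2, block=8, dynamic_range=255.0):
    # Mean SSIM over non-overlapping blocks (simplifying assumption).
    height, width = len(img1), len(img1[0])
    scores = []
    for top in range(0, height - block + 1, block):
        for left in range(0, width - block + 1, block):
            w1 = [img1[r][c] for r in range(top, top + block)
                  for c in range(left, left + block)]
            w2 = [img2[r][c] for r in range(top, top + block)
                  for c in range(left, left + block)]
            scores.append(ssim_window(w1, w2, dynamic_range))
    if not scores:
        raise ValueError("images are smaller than one block")
    return sum(scores) / len(scores)


# Toy 16x16 gradient image (assumed) and its inverse.
image = [[(r * 16 + c) % 256 for c in range(16)] for r in range(16)]
inverted = [[255 - v for v in row] for row in image]
print(mssim(image, image))
print(mssim(image, inverted))
```

Identical images score 1, while structurally different content pushes the score towards 0 or below.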


Duplicate detection (1)

 Pairwise comparison for a collection of N pages

[Figure: max(Da) plotted against image index a]
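
The pairwise comparison can be sketched as below. The slides do not define Da. Here it is assumed to be the set of similarity scores between page a and every other page, so that max(Da) is the score of the best match for page a. Histogram intersection is again only a stand-in for the actual comparison scheme.

```python
def histogram_intersection(h1, h2):
    # Stand-in similarity (assumption); any symmetric score would do.
    return sum(min(a, b) for a, b in zip(h1, h2))


def max_similarity_per_image(features, similarity):
    # Compare every unordered pair once: N * (N - 1) / 2 comparisons.
    # Starting from 0.0 assumes similarity scores are non-negative.
    n = len(features)
    best = [0.0] * n
    for a in range(n):
        for b in range(a + 1, n):
            score = similarity(features[a], features[b])
            best[a] = max(best[a], score)
            best[b] = max(best[b], score)
    return best


# Toy fingerprints (assumed): page 2 duplicates page 0.
pages = [
    [0.5, 0.5, 0.0],
    [0.0, 0.1, 0.9],
    [0.5, 0.5, 0.0],
    [0.2, 0.0, 0.8],
]
max_d = max_similarity_per_image(pages, histogram_intersection)
print(max_d)
```

The quadratic number of comparisons is the reason for using fast histogram matching before any detailed verification.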

Duplicate detection (2)

 Robust outlier detection

[Figure: max(Da) plotted against image index a; labelled points at a=12..15, a=22..25, a=106,107, a=108,109, a=188..197 and a=198..207]
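
The slides do not say which robust estimator is used. One common choice, shown here purely as an assumption, is to flag values that exceed the median by more than k scaled median absolute deviations (MAD). The toy scores and the default k = 3 are invented for illustration.

```python
import statistics


def robust_outliers(values, k=3.0):
    # Median and MAD are barely affected by a few extreme values,
    # unlike mean and standard deviation.
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    # 1.4826 scales the MAD to match the standard deviation
    # for normally distributed data.
    threshold = med + k * 1.4826 * mad
    flagged = [i for i, v in enumerate(values) if v > threshold]
    return flagged, threshold


# Toy max(Da) values (assumed): pages 4 and 5 match each other strongly.
scores = [0.10, 0.12, 0.11, 0.09, 0.90, 0.88, 0.10, 0.13, 0.11, 0.12]
flagged, threshold = robust_outliers(scores)
print(flagged, round(threshold, 3))
```

Because the threshold adapts to each book's score distribution, no fixed global cut-off has to be tuned by hand.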


Comparison of duplicate detection schemes

 a) Visual histogram comparison - tf

 b) tf and inverse document frequency - tf/idf

 c) tf and spatial verification – tf/sv


Results

 Manual vs. automatic detection

 59 books, 34805 pages

 53 books correctly processed

53/59 ≈ 90% correct

 69 of 75 duplicate runs detected

69/75 ≈ 92% correct

 Missing detections due to heavily mixed content


Conclusion and outlook

 Workflows for duplicate detection for complex documents

 Keypoint detection and description = purely image based

 Bag of visual words provides fast matching

 Spatial verification applied to shortlist

 Robust thresholding scheme for duplicate identification

 Evaluation at Austrian National Library

 Integration on SCAPE platform for scalable preservation


AIT Austrian Institute of Technology
your ingenious partner

reinhold.huber-moerk@ait.ac.at
