Duplicate detection for quality assurance
of document image collections
Reinhold Huber-Mörk (1) & Alexander Schindler (1,2) & Sven Schlarb (3)

(1) Research Area Intelligent Vision Systems, Department Safety & Security
    AIT Austrian Institute of Technology

(2) Department of Software Technology and Interactive Systems
    Vienna University of Technology

(3) Department for Research and Development
    Austrian National Library
Overview

    Digital preservation & quality assurance

    Digital image preservation workflows

    Image duplicate detection

    Keypoints and feature descriptors in Computer Vision

    Bag of visual words

    Results on a real-world data set




22.11.2012                                                  2
SCAPE project and quality assurance

    SCAlable Preservation Environments, EU FP7

    Preservation Components:

      improve and extend existing tools,

      develop new ones where necessary,

      apply proven approaches such as image and pattern analysis to the
      problem of ensuring quality in digital preservation




Quality assurance in image preservation

    Comparison of image content

     - automatic image processing workflows (e.g. format conversion)

     - reacquisition of images

    Duplicate detection

     - within a single collection (filtering)

     - between collections (merging, comparison)

     Solutions:

             - page segmentation + OCR

             - feature based approaches

Book scan sequence with duplicates




Duplicate detection workflow
Keypoint detection and description (1)

    Keypoints are detected at salient image regions

    A keypoint is described by a descriptor (= vector of features)

    Scale-Invariant Feature Transform - SIFT (Lowe, 2004)


    [Figure: two example keypoints with plots of their descriptor vectors]
Keypoint detection and description (2)

    Invariance w.r.t. color/tone transformation

    Invariance w.r.t. rotation, scaling or translation

    [Figure: four descriptor plots illustrating the invariances listed above]
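
The tone invariance listed on this slide can be illustrated with a toy descriptor. The sketch below is not SIFT: it builds one gradient-orientation histogram over a small patch and L2-normalises it, whereas SIFT combines a grid of such histograms into a 128-element vector. The function name `toy_descriptor` and the patch values are made up for illustration. Gradients cancel an additive brightness offset and the normalisation cancels a contrast gain, so the descriptor should stay the same under a tone change of the form gain * I + offset.

```python
import math

def toy_descriptor(patch, bins=8):
    """Orientation histogram of image gradients, L2-normalised (toy example, not SIFT)."""
    hist = [0.0] * bins
    rows, cols = len(patch), len(patch[0])
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            # Central differences: any constant brightness offset cancels here.
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            angle = math.atan2(gy, gx) % (2 * math.pi)
            b = min(int(angle / (2 * math.pi) * bins), bins - 1)
            hist[b] += magnitude
    # L2 normalisation: any positive contrast gain cancels here.
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else hist

# Hypothetical 5x5 grey-value patch and a tone-transformed copy of it.
patch = [
    [5, 9, 4, 7, 3],
    [8, 2, 9, 1, 6],
    [3, 7, 6, 8, 2],
    [9, 1, 6, 4, 7],
    [2, 8, 2, 3, 5],
]
brighter = [[2.0 * v + 25.0 for v in row] for row in patch]

d1 = toy_descriptor(patch)
d2 = toy_descriptor(brighter)
max_diff = max(abs(a - b) for a, b in zip(d1, d2))
print(max_diff)  # expected to be zero up to floating-point rounding
```

Rotation and scale invariance need more machinery (a dominant orientation and a scale-space search) and are not covered by this toy.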
Keypoint detection and description (3)

    All detections (ordered by scale)




Duplicate detection workflow
Bag of visual words (1)

    Bag of words model in text information retrieval:

     Document 1:  “Peter likes to read books. Paul likes too.”
     Document 2:  “Peter also likes to read poems.”
     Bag:         [ Peter, likes, to, read, books, Paul, too, also, poems ]
     Histogram 1: [ 1,     2,     1,  1,    1,     1,    1,   0,    0     ]
     Histogram 2: [ 1,     1,     1,  1,    0,     0,    0,   1,    1     ]

    Visual analogy: bag of visual words or bag of features

Document                               Image
Document made of words                 Image made of descriptors
Bag of words                           Bag of clustered descriptors = visual words
Word occurrence histogram              Visual word histogram / “fingerprint”
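
The text example on this slide can be reproduced in a few lines. This is a minimal sketch: the function name `bag_of_words` and the simple regular-expression tokeniser are illustrative choices, not something the slides specify.

```python
import re

def bag_of_words(documents):
    """Return the vocabulary (in order of first appearance) and one histogram per document."""
    tokenised = [re.findall(r"\w+", doc) for doc in documents]
    vocabulary = []
    for tokens in tokenised:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)
    # One count per vocabulary word, for every document.
    histograms = [[tokens.count(word) for word in vocabulary] for tokens in tokenised]
    return vocabulary, histograms

documents = [
    "Peter likes to read books. Paul likes too.",
    "Peter also likes to read poems.",
]
vocabulary, histograms = bag_of_words(documents)
print(vocabulary)     # ['Peter', 'likes', 'to', 'read', 'books', 'Paul', 'too', 'also', 'poems']
print(histograms[0])  # [1, 2, 1, 1, 1, 1, 1, 0, 0]
print(histograms[1])  # [1, 1, 1, 1, 0, 0, 0, 1, 1]
```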
Bag of visual words (2)




Bag of visual words (3)

    [Figure: examples of visual words #104, #15, #221, #312, #424 and #250]
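
One way to picture how descriptors become visual words: each descriptor is assigned to its nearest cluster centre, and the assignments are counted. The sketch below assumes a codebook that has already been learned by clustering (the slides say "clustered descriptors" but do not name the algorithm) and uses hypothetical 2-D descriptors in place of real 128-D ones. The names `nearest_word` and `visual_word_histogram` are made up for illustration.

```python
import math

def nearest_word(descriptor, codebook):
    """Index of the codebook centre closest to the descriptor (Euclidean distance)."""
    distances = [math.dist(descriptor, centre) for centre in codebook]
    return distances.index(min(distances))

def visual_word_histogram(descriptors, codebook):
    """Count how often each visual word occurs among an image's descriptors."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, codebook)] += 1
    return counts

# Hypothetical codebook with three visual words and four descriptors from one image.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 0.1), (0.8, -0.1), (0.1, 0.9)]

fingerprint = visual_word_histogram(descriptors, codebook)
print(fingerprint)  # [1, 2, 1]
```

The resulting count vector plays the role of the word occurrence histogram from the text example.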
Duplicate detection workflow
Image comparison / duplicate detection schemes

    Comparison of visual histograms – tf (“term frequency”) score

    [Figure: three visual word histograms over visual word index (0 to 500), bin values on the order of 10^-3]

    Inverse document frequency – idf

    Spatial verification – sv, detailed image comparison
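
The tf and idf schemes can be sketched as follows. Two points are assumptions here: the slides do not say which similarity measure compares the histograms (cosine similarity is used below), and the idf weight is taken in its common form log(N / n_w), where N is the number of images and n_w the number of images containing visual word w. The count vectors are invented.

```python
import math

def term_frequencies(counts):
    """Normalise visual word counts so they sum to 1."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)

def inverse_document_frequencies(all_counts):
    """idf weight per visual word: log(N / number of images containing the word)."""
    n_images = len(all_counts)
    idf = []
    for w in range(len(all_counts[0])):
        containing = sum(1 for counts in all_counts if counts[w] > 0)
        idf.append(math.log(n_images / containing) if containing else 0.0)
    return idf

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical visual word counts for three pages (four visual words).
counts = [
    [4, 0, 3, 1],  # page A
    [4, 0, 3, 1],  # page B, a rescan of page A
    [0, 5, 1, 2],  # page C
]
tf = [term_frequencies(c) for c in counts]
idf = inverse_document_frequencies(counts)
tfidf = [[t * w for t, w in zip(row, idf)] for row in tf]

sim_ab = cosine_similarity(tf[0], tf[1])            # duplicates score close to 1
sim_ac = cosine_similarity(tf[0], tf[2])            # different pages score lower
sim_ac_idf = cosine_similarity(tfidf[0], tfidf[2])  # shared common words no longer count
```

Visual words that occur on every page get an idf weight of zero, so under tf/idf pages A and C share nothing that distinguishes them.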
Spatial verification (1)

    Bag of visual words maintains no (or limited) spatial information




    Spatial verification:

                     1. Ranking of most similar images in a shortlist
                     2. Direct matching of descriptors for pairs of images
                     3. Overlaying of images
                     4. Estimation of similarity
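
Step 1 above can be sketched as a top-k selection over histogram similarities, so that the more expensive steps 2 to 4 only run on a few candidate pairs. The shortlist length `k` and the similarity matrix are hypothetical; the slides do not state the shortlist size.

```python
def shortlist(query_index, similarity, k=3):
    """Indices of the k images most similar to the query image, best first."""
    candidates = [
        (similarity[query_index][j], j)
        for j in range(len(similarity))
        if j != query_index
    ]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [j for _, j in candidates[:k]]

# Hypothetical symmetric matrix of histogram similarities for five pages.
similarity = [
    [1.0, 0.2, 0.1, 0.9, 0.3],
    [0.2, 1.0, 0.8, 0.1, 0.4],
    [0.1, 0.8, 1.0, 0.2, 0.5],
    [0.9, 0.1, 0.2, 1.0, 0.3],
    [0.3, 0.4, 0.5, 0.3, 1.0],
]
print(shortlist(0, similarity, k=2))  # [3, 4]
print(shortlist(1, similarity, k=2))  # [2, 4]
```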
Spatial verification (2)
    1. Pair of possible duplicates
    2. Descriptor matching
    3. Estimation of affine transformation
    4. Image overlay
    5. Similarity estimation (similarity measure: MSSIM)
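
The last step, similarity estimation with MSSIM (mean structural similarity), can be sketched as below. This is a simplified version under stated assumptions: it uses the standard SSIM formula with the usual constants, but averages over non-overlapping square blocks with uniform weights rather than a sliding weighted window, and it assumes the two grey-value images have already been overlaid (aligned) and have the same size. The function names and test images are made up.

```python
def ssim_window(x, y, dynamic_range=255.0):
    """SSIM of two equally sized flat lists of pixel values."""
    n = len(x)
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) * (a - mean_x) for a in x) / n
    var_y = sum((b - mean_y) * (b - mean_y) for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov + c2)
    denominator = (mean_x * mean_x + mean_y * mean_y + c1) * (var_x + var_y + c2)
    return numerator / denominator

def mssim(img_a, img_b, window=4):
    """Mean SSIM over non-overlapping window x window blocks (simplified MSSIM)."""
    rows, cols = len(img_a), len(img_a[0])
    if rows < window or cols < window:
        raise ValueError("images must be at least one window in size")
    scores = []
    for top in range(0, rows - window + 1, window):
        for left in range(0, cols - window + 1, window):
            cells = [(r, c) for r in range(top, top + window)
                            for c in range(left, left + window)]
            block_a = [img_a[r][c] for r, c in cells]
            block_b = [img_b[r][c] for r, c in cells]
            scores.append(ssim_window(block_a, block_b))
    return sum(scores) / len(scores)

# Hypothetical 8x8 test image and its tonal inverse.
image = [[(r * 8 + c) * 3 for c in range(8)] for r in range(8)]
inverted = [[255 - v for v in row] for row in image]

same = mssim(image, image)          # identical images give a score of 1
different = mssim(image, inverted)  # structurally opposite images score much lower
```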
Duplicate detection (1)

    Pairwise comparison for a collection of N pages



    [Figure: max(Da), ranging from 0 to 1, plotted against image index a (0 to 500)]
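
A pairwise comparison of N pages involves N(N-1)/2 pairs. The sketch below keeps, for every page a, the best score over all its partners. Reading max(Da) as "the highest similarity between page a and any other page" is an assumption; the slide does not define Da. The pair scores are invented.

```python
from itertools import combinations

def max_similarity_per_page(n_pages, similarity):
    """For each page, the highest similarity to any other page."""
    best = [0.0] * n_pages
    for a, b in combinations(range(n_pages), 2):
        s = similarity(a, b)
        best[a] = max(best[a], s)
        best[b] = max(best[b], s)
    return best

# Hypothetical similarity scores for the 6 pairs among 4 pages.
pair_scores = {
    (0, 1): 0.20, (0, 2): 0.90, (0, 3): 0.10,
    (1, 2): 0.30, (1, 3): 0.25, (2, 3): 0.15,
}
best = max_similarity_per_page(4, lambda a, b: pair_scores[(a, b)])
print(best)  # [0.9, 0.3, 0.9, 0.25]
```

Pages 0 and 2 stand out because they match each other far better than any other pair.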
Duplicate detection (2)

    Robust outlier detection



    [Figure: max(Da) plotted against image index a (0 to 500), with the marked page ranges a=12..15, a=22..25, a=106,107, a=108,109, a=188..197 and a=198..207]
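
The slides do not say which robust estimator is used for outlier detection. As one common possibility, the sketch below flags pages whose best-match score lies more than k scaled median absolute deviations (MAD) above the median. The estimator, the factor k = 3 and the scores are all assumptions for illustration.

```python
from statistics import median

def robust_outliers(values, k=3.0):
    """Indices of values above median + k * 1.4826 * MAD, plus the threshold itself."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    # 1.4826 makes the MAD comparable to a standard deviation for normal data.
    threshold = med + k * 1.4826 * mad
    return [i for i, v in enumerate(values) if v > threshold], threshold

# Hypothetical max(Da) scores for ten pages; pages 3 and 6 are duplicates of each other.
scores = [0.21, 0.25, 0.19, 0.92, 0.23, 0.20, 0.88, 0.24, 0.22, 0.26]
duplicates, threshold = robust_outliers(scores)
print(duplicates)  # [3, 6]
```

Median and MAD are barely affected by the few high scores, so the threshold adapts to each book without being pulled up by the duplicates themselves.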
Comparison of duplicate detection schemes

    a) Visual histogram comparison - tf

    b) tf and inverse document frequency – tf/idf

    c) tf and spatial verification – tf/sv




Results

    Manual vs. automatic detection

    59 books, 34805 pages

    53 books correctly processed

     53/59 ≈ 90% correct

    69 of 75 duplicate runs detected

     69/75 ≈ 92% correct

    Missing detections due to heavily mixed content




Conclusion and outlook

    Workflows for duplicate detection for complex documents

    Keypoint detection and description = purely image based

    Bag of visual words provides fast matching

    Spatial verification applied to shortlist

    Robust thresholding scheme for duplicate identification

    Evaluation at Austrian National Library

    Integration on SCAPE platform for scalable preservation




AIT Austrian Institute of Technology
your ingenious partner



reinhold.huber-moerk@ait.ac.at

More Related Content

Similar to Duplicate detection for quality assurance of document image collections

P02 sparse coding cvpr2012 deep learning methods for vision
P02 sparse coding cvpr2012 deep learning methods for visionP02 sparse coding cvpr2012 deep learning methods for vision
P02 sparse coding cvpr2012 deep learning methods for visionzukun
 
PowerPoint’s Impact on Conference Ratings and Social Media Likes
PowerPoint’s Impact on Conference Ratings and Social Media Likes PowerPoint’s Impact on Conference Ratings and Social Media Likes
PowerPoint’s Impact on Conference Ratings and Social Media Likes Nathan Garrett
 

Similar to Duplicate detection for quality assurance of document image collections (6)

Monovision vs Pinhole
Monovision vs PinholeMonovision vs Pinhole
Monovision vs Pinhole
 
DCT_TR802
DCT_TR802DCT_TR802
DCT_TR802
 
DCT_TR802
DCT_TR802DCT_TR802
DCT_TR802
 
DCT_TR802
DCT_TR802DCT_TR802
DCT_TR802
 
P02 sparse coding cvpr2012 deep learning methods for vision
P02 sparse coding cvpr2012 deep learning methods for visionP02 sparse coding cvpr2012 deep learning methods for vision
P02 sparse coding cvpr2012 deep learning methods for vision
 
PowerPoint’s Impact on Conference Ratings and Social Media Likes
PowerPoint’s Impact on Conference Ratings and Social Media Likes PowerPoint’s Impact on Conference Ratings and Social Media Likes
PowerPoint’s Impact on Conference Ratings and Social Media Likes
 

More from SCAPE Project

SCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with NaniteSCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with NaniteSCAPE Project
 
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...SCAPE Project
 
SCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs AvailableSCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs AvailableSCAPE Project
 
SCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with HadoopSCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with HadoopSCAPE Project
 
SCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Information day at BL - Flint, a Format and File Validation ToolSCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Information day at BL - Flint, a Format and File Validation ToolSCAPE Project
 
SCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositoriesSCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositoriesSCAPE Project
 
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...SCAPE Project
 
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...SCAPE Project
 
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014SCAPE Project
 
Hadoop and its applications at the State and University Library, SCAPE Inform...
Hadoop and its applications at the State and University Library, SCAPE Inform...Hadoop and its applications at the State and University Library, SCAPE Inform...
Hadoop and its applications at the State and University Library, SCAPE Inform...SCAPE Project
 
Scape project presentation - Scalable Preservation Environments
Scape project presentation - Scalable Preservation EnvironmentsScape project presentation - Scalable Preservation Environments
Scape project presentation - Scalable Preservation EnvironmentsSCAPE Project
 
LIBER Satellite Event, SCAPE by Sven Schlarb
LIBER Satellite Event, SCAPE by Sven SchlarbLIBER Satellite Event, SCAPE by Sven Schlarb
LIBER Satellite Event, SCAPE by Sven SchlarbSCAPE Project
 
Content profiling and C3PO
Content profiling and C3POContent profiling and C3PO
Content profiling and C3POSCAPE Project
 
Control policy formulation
Control policy formulationControl policy formulation
Control policy formulationSCAPE Project
 
Preservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, AarhusPreservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, AarhusSCAPE Project
 
An image based approach for content analysis in document collections
An image based approach for content analysis in document collectionsAn image based approach for content analysis in document collections
An image based approach for content analysis in document collectionsSCAPE Project
 
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...SCAPE Project
 
TAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionalityTAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionalitySCAPE Project
 
Automatic Preservation Watch
Automatic Preservation WatchAutomatic Preservation Watch
Automatic Preservation WatchSCAPE Project
 

More from SCAPE Project (20)

C sz z6
C sz z6C sz z6
C sz z6
 
SCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with NaniteSCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with Nanite
 
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
 
SCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs AvailableSCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs Available
 
SCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with HadoopSCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with Hadoop
 
SCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Information day at BL - Flint, a Format and File Validation ToolSCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Information day at BL - Flint, a Format and File Validation Tool
 
SCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositoriesSCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositories
 
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
 
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
 
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
 
Hadoop and its applications at the State and University Library, SCAPE Inform...
Hadoop and its applications at the State and University Library, SCAPE Inform...Hadoop and its applications at the State and University Library, SCAPE Inform...
Hadoop and its applications at the State and University Library, SCAPE Inform...
 
Scape project presentation - Scalable Preservation Environments
Scape project presentation - Scalable Preservation EnvironmentsScape project presentation - Scalable Preservation Environments
Scape project presentation - Scalable Preservation Environments
 
LIBER Satellite Event, SCAPE by Sven Schlarb
LIBER Satellite Event, SCAPE by Sven SchlarbLIBER Satellite Event, SCAPE by Sven Schlarb
LIBER Satellite Event, SCAPE by Sven Schlarb
 
Content profiling and C3PO
Content profiling and C3POContent profiling and C3PO
Content profiling and C3PO
 
Control policy formulation
Control policy formulationControl policy formulation
Control policy formulation
 
Preservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, AarhusPreservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, Aarhus
 
An image based approach for content analysis in document collections
An image based approach for content analysis in document collectionsAn image based approach for content analysis in document collections
An image based approach for content analysis in document collections
 
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
 
TAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionalityTAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionality
 
Automatic Preservation Watch
Automatic Preservation WatchAutomatic Preservation Watch
Automatic Preservation Watch
 

Recently uploaded

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 

Recently uploaded (20)

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 

Duplicate detection for quality assurance of document image collections

  • 1. Duplicate detection for quality assurance of document image collections Reinhold Huber-Mörk1 & Alexander Schindler1,2 & Sven Schlarb3 1 Research Area Intelligent Vision Systems, Department Safety & Security AIT Austrian Institute of Technology 2 Department of Software Technology and Interactive Systems Vienna University of Technology 3 Department for Research and Development Austrian National Library
  • 2. Overview  Digital preservation & quality assurance  Digital image preservation workflows  Image duplicate detection  Keypoints and feature descriptors in Computer Vision  Bag of visual words  Results on a real-world data set 22.11.2012 2
  • 3. SCAPE project and quality assurance: SCAlable Preservation Environments (SCAPE), EU FP7. Preservation Components: improve and extend existing tools, develop new ones where necessary, and apply proven approaches like image and pattern analysis to the problem of ensuring quality in digital preservation.
  • 4. Quality assurance in image preservation. Comparison of image content: automatic image processing workflows (e.g. format conversion); reacquisition of images. Duplicate detection: within a single collection (filtering); between collections (merging, comparison). Solutions: page segmentation + OCR; feature-based approaches.
  • 5. Book scan sequence with duplicates
  • 7. Keypoint detection and description (1): Keypoints are detected at salient image regions. A keypoint is described by a descriptor (= vector of features). Scale-Invariant Feature Transform, SIFT (Lowe, 2004). [Figure: two plots; axis ticks 0 to 120 and 0 to 0.2]
  • 8. Keypoint detection and description (2): Invariance w.r.t. color/tone transformation. Invariance w.r.t. rotation, scaling or translation. [Figure: four plots; axis ticks 0 to 120 and 0 to 0.2]
  • 9. Keypoint detection and description (3): All detections (ordered by scale)
  • 11. Bag of visual words (1)
      Bag of words model in text information retrieval:
        Document 1: “Peter likes to read books. Paul likes too.”
        Document 2: “Peter also likes to read poems.”
        Bag: [ Peter, likes, to, read, books, Paul, too, also, poems ]
        Histogram 1: [ 1, 2, 1, 1, 1, 1, 1, 0, 0 ]
        Histogram 2: [ 1, 1, 1, 1, 0, 0, 0, 1, 1 ]
      Visual analogy: bag of visual words or bag of features
        Document                   | Image
        Document made of words     | Image made of descriptors
        Bag of words               | Bag of clustered descriptors = visual words
        Word occurrence histogram  | Visual word histogram / “fingerprint”
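The text example on slide 11 can be reproduced in a few lines. This is a minimal sketch, not code from the presentation: the function names `tokenize`, `build_vocabulary` and `histogram` are hypothetical, and the vocabulary is assumed to be ordered by first appearance, which matches the bag shown on the slide.

```python
import re
from collections import Counter


def tokenize(text):
    # Keep alphabetic tokens only; case is preserved, as on the slide.
    return re.findall(r"[A-Za-z]+", text)


def build_vocabulary(documents):
    # Collect each distinct word in order of first appearance.
    vocabulary = []
    for doc in documents:
        for word in tokenize(doc):
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary


def histogram(document, vocabulary):
    # Count how often each vocabulary word occurs in the document.
    counts = Counter(tokenize(document))
    return [counts[word] for word in vocabulary]


doc1 = "Peter likes to read books. Paul likes too"
doc2 = "Peter also likes to read poems"

bag = build_vocabulary([doc1, doc2])
hist1 = histogram(doc1, bag)
hist2 = histogram(doc2, bag)

print(bag)
print(hist1)
print(hist2)
```

The two printed histograms correspond to Histogram 1 and Histogram 2 on the slide.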
  • 12. Bag of visual words (2)
  • 13. Bag of visual words (3): [Figure panels: Visual word #104, Visual word #15, Visual word #221, Visual word #312, Visual word #424, Visual word #250]
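For images, the word count becomes a count of descriptors per cluster. The sketch below assumes the cluster centres (the visual words) are already known, for example from k-means, and uses toy 2-D descriptors instead of real SIFT vectors. The names `nearest_word` and `visual_word_histogram` and all data values are hypothetical.

```python
import math


def nearest_word(descriptor, centers):
    # Index of the cluster centre (visual word) closest in Euclidean distance.
    distances = [math.dist(descriptor, center) for center in centers]
    return distances.index(min(distances))


def visual_word_histogram(descriptors, centers):
    # Count descriptors per visual word, then normalise to relative frequencies.
    hist = [0] * len(centers)
    for descriptor in descriptors:
        hist[nearest_word(descriptor, centers)] += 1
    total = sum(hist)
    return [count / total for count in hist] if total else hist


# Toy data (assumed): three visual words and four descriptors of one image.
centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
descriptors = [(0.1, 0.1), (0.9, 0.1), (1.1, -0.1), (0.0, 0.8)]

fingerprint = visual_word_histogram(descriptors, centers)
print(fingerprint)
```

The resulting list plays the role of the visual word histogram, or "fingerprint", of one image.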
  • 15. Image comparison / duplicate detection schemes: Comparison of visual histograms, tf (“term frequency”) score. Inverse document frequency, idf. Spatial verification, sv: detailed image comparison. [Figure: three visual word histograms; horizontal axis 0 to 500, vertical scale x 10^-3]
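The slide names the tf and idf scores without giving formulas. The sketch below assumes the common textbook definitions: tf as relative word frequency, idf as log(N / n_i) with n_i the number of images containing word i, and cosine similarity for comparing histograms. The presentation may use a different weighting or distance, and all function names and data values are hypothetical.

```python
import math


def tf(hist):
    # Term frequency: relative frequency of each visual word in one image.
    total = sum(hist)
    return [h / total for h in hist] if total else list(hist)


def idf(histograms):
    # Inverse document frequency: log(N / n_i); 0 for words that never occur.
    n_images = len(histograms)
    weights = []
    for i in range(len(histograms[0])):
        df = sum(1 for h in histograms if h[i] > 0)
        weights.append(math.log(n_images / df) if df else 0.0)
    return weights


def tfidf(hist, weights):
    return [t * w for t, w in zip(tf(hist), weights)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy collection (assumed): images 0 and 1 are duplicates, image 2 differs.
histograms = [[4, 0, 1, 1], [4, 0, 1, 1], [0, 3, 1, 2]]
weights = idf(histograms)

sim_tf_01 = cosine_similarity(tf(histograms[0]), tf(histograms[1]))
sim_tf_02 = cosine_similarity(tf(histograms[0]), tf(histograms[2]))
sim_tfidf_01 = cosine_similarity(tfidf(histograms[0], weights), tfidf(histograms[1], weights))
sim_tfidf_02 = cosine_similarity(tfidf(histograms[0], weights), tfidf(histograms[2], weights))
print(sim_tf_01, sim_tf_02, sim_tfidf_01, sim_tfidf_02)
```

In this toy collection the last two visual words occur in every image, so idf gives them zero weight and the non-duplicate pair no longer looks similar.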
  • 16. Spatial verification (1): Bag of visual words maintains no (or limited) spatial information. Spatial verification: 1. Ranking of most similar images in a shortlist; 2. Direct matching of descriptors for pairs of images; 3. Overlaying of images; 4. Estimation of similarity.
  • 17. Spatial verification (2): Pair of possible duplicates -> Descriptor matching -> Estimation of affine transformation -> Image overlay -> Similarity estimation. Similarity measure: MSSIM.
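The last step, similarity estimation with MSSIM (mean structural similarity), can be sketched as follows. This is a simplified illustration under stated assumptions: the two images are already aligned (the overlay step is done), they are grayscale lists of rows with values 0 to 255, and SSIM is averaged over non-overlapping uniform windows. The original MSSIM formulation uses sliding Gaussian-weighted windows, so the values will differ from a reference implementation. The function names are hypothetical.

```python
def ssim_window(x, y, dynamic_range=255.0):
    # SSIM of two equally sized pixel lists, with the usual stabilising constants.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) * (a - mean_x) for a in x) / n
    var_y = sum((b - mean_y) * (b - mean_y) for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov + c2)
    denominator = (mean_x * mean_x + mean_y * mean_y + c1) * (var_x + var_y + c2)
    return numerator / denominator


def mssim(img_a, img_b, window=4):
    # Mean SSIM over non-overlapping window x window blocks (a simplification).
    rows, cols = len(img_a), len(img_a[0])
    scores = []
    for r in range(0, rows - window + 1, window):
        for c in range(0, cols - window + 1, window):
            block_a = [img_a[i][j] for i in range(r, r + window) for j in range(c, c + window)]
            block_b = [img_b[i][j] for i in range(r, r + window) for j in range(c, c + window)]
            scores.append(ssim_window(block_a, block_b))
    return sum(scores) / len(scores)


# Toy 8x8 images (assumed): a gradient and its inverted copy.
image = [[(i * 8 + j) * 4 for j in range(8)] for i in range(8)]
inverted = [[255 - value for value in row] for row in image]

score_same = mssim(image, image)
score_inverted = mssim(image, inverted)
print(score_same, score_inverted)
```

An identical pair scores close to 1, while the inverted copy scores far lower, which is the behaviour a duplicate check relies on.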
  • 18. Duplicate detection (1): Pairwise comparison for a collection of N pages. [Figure: max(Da), range 0 to 1, plotted against image index a, range 0 to 500]
  • 19. Duplicate detection (2): Robust outlier detection. [Figure: max(Da), range 0 to 1, plotted against image index a, range 0 to 500; labelled outliers: a=12..15, a=22..25, a=106,107, a=108,109, a=188..197, a=198..207]
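The slides do not say which robust estimator is used for outlier detection. The sketch below assumes that Da holds the similarity scores between image a and every other image, and flags images whose max(Da) lies more than k scaled median absolute deviations (MAD) above the median. The estimator, the factor `k` and all names and values are assumptions for illustration only.

```python
import statistics


def max_similarity_per_image(similarity):
    # similarity[a][b] is the score between images a and b; skip self-comparison.
    n = len(similarity)
    return [max(similarity[a][b] for b in range(n) if b != a) for a in range(n)]


def robust_outliers(scores, k=3.0):
    # Threshold = median + k * 1.4826 * MAD; 1.4826 scales MAD to a standard
    # deviation under a normal distribution. If MAD is 0, the threshold equals
    # the median and every score above it is flagged.
    median = statistics.median(scores)
    mad = statistics.median([abs(s - median) for s in scores])
    threshold = median + k * 1.4826 * mad
    return [index for index, s in enumerate(scores) if s > threshold]


# Toy pairwise similarity matrix (assumed) for three images.
similarity = [
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.1],
    [0.9, 0.1, 1.0],
]
max_scores = max_similarity_per_image(similarity)

# Toy max(Da) values (assumed): images 3 and 4 form a duplicate pair.
scores = [0.10, 0.12, 0.11, 0.85, 0.87, 0.09, 0.13, 0.10]
duplicates = robust_outliers(scores)
print(max_scores, duplicates)
```

A median-based threshold is used here because a few very high scores, the duplicates themselves, would inflate a mean and standard deviation.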
  • 20. Comparison of duplicate detection schemes: a) Visual histogram comparison, tf; b) tf and inverse document frequency, tf/idf; c) tf and spatial verification, tf/sv.
  • 21. Results: Manual vs. automatic detection. 59 books, 34805 pages. 53 books correctly processed: 53/59 ≈ 90% correct. 69 of 75 duplicate runs detected: 69/75 ≈ 92% correct. Missing detections due to heavily mixed content.
  • 22. Conclusion and outlook: Workflows for duplicate detection for complex documents. Keypoint detection and description = purely image based. Bag of visual words provides fast matching. Spatial verification applied to shortlist. Robust thresholding scheme for duplicate identification. Evaluation at Austrian National Library. Integration on SCAPE platform for scalable preservation.
  • 23. AIT Austrian Institute of Technology your ingenious partner reinhold.huber-moerk@ait.ac.at