Quality assurance for document
image collections in digital
preservation
Reinhold Huber-Mörk (1) & Alexander Schindler (1, 2)
(1) Research Area Intelligent Vision Systems
Department Safety & Security
AIT Austrian Institute of Technology
(2) Department of Software Technology and Interactive Systems
Vienna University of Technology
Overview
 Digital preservation
 Quality assurance in digital image preservation workflows
 Keypoint based approach for image content comparison
 Spatially distinctive keypoints
 Document image preprocessing
 Structural similarity assessment
 Results on real-world data sets
12.09.2013
Digital preservation
 "Set of processes, activities and management of digital information over time
to ensure its long term accessibility" (Source: Wikipedia)
 Physical damage to digital/digitized content, e.g. "bit rot" affecting some
storage media
 Digital obsolescence of hardware/software, e.g. vanishing file formats
 Content modification during preservation, e.g. errors injected during file format
conversion, digital manipulation, reacquisition, ...
Images provided by historical
newspaper collection / The
British Library
Quality assurance in digital image preservation workflows
 Automated preservation workflows are common in large digitization projects
(e.g. museum collections, Google books public-private partnerships (PPPs), ...).
 Automated quality assurance ensures file format consistency, detects
duplicates and verifies that quality and content are preserved.
 SCAPE (Scalable Preservation Environments), an EU FP7 project
Keypoint based approach for document comparison
 Local features are detected & described by the standard LoG/SIFT approach
 Scaling, rotation, cropping and additional/missing content are handled
 An affine transformation is sufficient (usually no perspective distortion, bending, etc.)
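To illustrate why an affine model covers scaling, rotation and translation between two scans, here is a minimal sketch that estimates the six affine parameters from three point correspondences using Cramer's rule. The function names solve_affine and apply_affine are hypothetical and not taken from the slides; a real pipeline would fit the model to many matched keypoints rather than exactly three.

```python
def solve_affine(src, dst):
    """Estimate the 2x3 affine matrix mapping three src points onto three dst points."""
    (x1, y1), (x2, y2), (x3, y3) = src
    # Determinant of [[x1, y1, 1], [x2, y2, 1], [x3, y3, 1]]
    det = x1 * (y2 - y3) - y1 * (x2 - x3) + (x2 * y3 - x3 * y2)
    if abs(det) < 1e-12:
        raise ValueError("source points are collinear")
    rows = []
    for k in (0, 1):  # k = 0 solves the x' row, k = 1 the y' row
        u1, u2, u3 = dst[0][k], dst[1][k], dst[2][k]
        a = (u1 * (y2 - y3) - y1 * (u2 - u3) + (u2 * y3 - u3 * y2)) / det
        b = (x1 * (u2 - u3) - u1 * (x2 - x3) + (x2 * u3 - x3 * u2)) / det
        c = (x1 * (y2 * u3 - y3 * u2) - y1 * (x2 * u3 - x3 * u2)
             + u1 * (x2 * y3 - x3 * y2)) / det
        rows.append((a, b, c))
    return rows


def apply_affine(matrix, point):
    """Map a point (x, y) through the 2x3 affine matrix."""
    x, y = point
    (a, b, c), (d, e, f) = matrix
    return (a * x + b * y + c, d * x + e * y + f)


# Example: scale by 2, rotate by 90 degrees, then translate by (5, -3).
src = [(0, 0), (1, 0), (0, 1)]
dst = [(5, -3), (5, -1), (3, -3)]
A = solve_affine(src, dst)
print(apply_affine(A, (2, 3)))
```

Three non-collinear correspondences determine the transform exactly; with more matches the same model is typically fitted robustly, which is what the matching step later in the deck relies on.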
Spatially distinctive keypoints (SDKs) (1)
 High-resolution document scans contain a large number of keypoints (e.g.
~50,000 keypoints on ~5000x3000 pixel images)
 Matching of descriptors results in high computational complexity
 Changing the detector edge/peak thresholds often results in a spatially uneven
distribution of keypoints
 One solution is dense/regular spatial sampling of keypoints
 Another solution is adaptive non-maximal suppression (Brown et al., 2005)
 Our solution is to enforce keypoint selection at positions locally adjacent to
spatially uniformly distributed interest regions
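A rough back-of-the-envelope sketch of the matching cost, assuming exhaustive (brute-force) descriptor matching in which every descriptor of one image is compared with every descriptor of the other. The exhaustive strategy is an assumption made for illustration; the slides do not state which matcher is used. The keypoint counts are the ones quoted in the deck.

```python
def brute_force_comparisons(n_a, n_b):
    """Number of descriptor distance computations for exhaustive matching."""
    return n_a * n_b


# ~50,000 keypoints per image, as quoted for ~5000x3000 pixel scans
full = brute_force_comparisons(50_000, 50_000)

# Reduced keypoint sets of the sizes evaluated in the deck
for n_sdk in (64, 256, 512, 1024, 2048):
    reduced = brute_force_comparisons(n_sdk, n_sdk)
    print(f"{n_sdk:5d} SDKs: {reduced:>10,d} comparisons "
          f"(about {full // reduced:,d}x fewer than {full:,d})")
```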
Spatially distinctive keypoints (SDKs) (2)
 Interest regions are distributed over the image using a regular grid
 The keypoints with the highest saliency are selected from each interest region
(Harris & Stephens corner strength is used as the saliency measure)
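The selection step can be sketched as follows, under the simplifying assumption that each interest region is one cell of a regular grid and that a saliency value (for example a corner strength) has already been computed for every keypoint. The function name select_sdks and the per_cell parameter are hypothetical; the authors' actual region shape and selection rule may differ.

```python
def select_sdks(keypoints, width, height, grid_cols, grid_rows, per_cell=1):
    """Keep the most salient keypoints in each cell of a regular grid.

    keypoints: iterable of (x, y, saliency) tuples.
    Returns the selected keypoints, ordered by grid cell (row, column).
    """
    cells = {}
    for x, y, saliency in keypoints:
        # Clamp so that points on the right/bottom border fall into the last cell
        col = min(int(x * grid_cols / width), grid_cols - 1)
        row = min(int(y * grid_rows / height), grid_rows - 1)
        cells.setdefault((row, col), []).append((x, y, saliency))

    selected = []
    for cell in sorted(cells):
        strongest = sorted(cells[cell], key=lambda kp: kp[2], reverse=True)
        selected.extend(strongest[:per_cell])
    return selected


keypoints = [(10, 10, 0.5), (20, 15, 0.9), (90, 90, 0.1), (60, 10, 0.3)]
print(select_sdks(keypoints, width=100, height=100, grid_cols=2, grid_rows=2))
```

Because at most per_cell keypoints survive per cell, the total number of keypoints is bounded by the grid size, and empty cells simply contribute nothing.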
Images provided by International Dunhuang Project / The British Library
Evaluation SDK (1)
 Mean SSIM vs. number of SDKs (evaluated on 1560 Dunhuang image pairs)
[Figure panels: #SDKs=64, #SDKs=256, #SDKs=512, #SDKs=1024, #SDKs=2048, all keypoints]
Evaluation SDK (2)
 Dependency of mean SSIM on the number of SDKs
(evaluated on 1560 Dunhuang image pairs)
Robust symmetric matching
 RANSAC constrained by an affine transformation
 Only significant matches are accepted, based on the distance ratio between the best and second-best match
 One-to-one matching of descriptors is enforced; ambiguous matches are ignored
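A minimal sketch of the ratio test and the one-to-one (symmetric) check, using plain Euclidean distance on small toy descriptors. The ratio value 0.8 and all function names are assumptions for illustration, not values from the slides, and the RANSAC stage with the affine constraint is deliberately left out.

```python
import math


def nearest_two(query, candidates):
    """Return (index of best match, best distance, second-best distance)."""
    dists = sorted((math.dist(query, c), i) for i, c in enumerate(candidates))
    best_d, best_i = dists[0]
    second_d = dists[1][0] if len(dists) > 1 else math.inf
    return best_i, best_d, second_d


def ratio_matches(desc_a, desc_b, ratio=0.8):
    """Match each descriptor in desc_a to desc_b, keeping only distinctive matches."""
    matches = {}
    for i, d in enumerate(desc_a):
        j, best, second = nearest_two(d, desc_b)
        if best < ratio * second:  # best match must be clearly better than the runner-up
            matches[i] = j
    return matches


def symmetric_matches(desc_a, desc_b, ratio=0.8):
    """Keep only pairs (i, j) that are matched in both directions."""
    a_to_b = ratio_matches(desc_a, desc_b, ratio)
    b_to_a = ratio_matches(desc_b, desc_a, ratio)
    return sorted((i, j) for i, j in a_to_b.items() if b_to_a.get(j) == i)


desc_a = [(0, 0), (10, 10), (5, 5)]
desc_b = [(0.1, 0), (10, 10.2), (5, 4.9), (5, 5.1)]
# The third descriptor of desc_a has two equally close candidates and is dropped.
print(symmetric_matches(desc_a, desc_b))
```

The surviving pairs would then be passed to a robust estimator such as RANSAC, which discards matches that do not agree with a single affine transformation.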
Image preprocessing (1)
 Content in (historical) book collections is characterized by a mixture of text,
graphical art, empty pages & other artefacts
 E.g. consider a sample from the Dunhuang manuscripts
Image preprocessing (2)
 Locally adaptive histogram equalization to enhance paper structure while
preserving text structure
 Contrast limited adaptive histogram equalization (CLAHE, Pizer et al., 1987),
where grid/tile spacing ~ character size (e.g. 40x50 pixels)
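The "contrast limited" part of CLAHE can be sketched for a single tile: the tile histogram is clipped at a limit, the clipped excess is spread evenly over all gray levels, and the cumulative histogram becomes the gray-level mapping. This is a simplified illustration only. Full CLAHE also interpolates between the mappings of neighbouring tiles, which is omitted here, and the function name and clip limit are hypothetical.

```python
def clipped_equalization_lut(tile, clip_limit, levels=256):
    """Build a gray-level lookup table for one tile using a clipped histogram.

    tile: flat list of integer gray values in [0, levels - 1].
    clip_limit: maximum count allowed per histogram bin before redistribution.
    """
    hist = [0] * levels
    for value in tile:
        hist[value] += 1

    # Clip each bin and collect the excess counts
    excess = 0
    for i in range(levels):
        if hist[i] > clip_limit:
            excess += hist[i] - clip_limit
            hist[i] = clip_limit

    # Redistribute the excess uniformly over all bins
    bonus = excess / levels
    hist = [h + bonus for h in hist]

    # The normalized cumulative histogram is the mapping
    total = sum(hist)
    lut, cumulative = [], 0.0
    for h in hist:
        cumulative += h
        lut.append(round((levels - 1) * cumulative / total))
    return lut


# A nearly uniform tile (e.g. paper background with slight texture)
tile = [100] * 90 + [101] * 10
unclipped = clipped_equalization_lut(tile, clip_limit=len(tile))
clipped = clipped_equalization_lut(tile, clip_limit=20)
print("jump at level 100 without clipping:", unclipped[100] - unclipped[99])
print("jump at level 100 with clipping:   ", clipped[100] - clipped[99])
```

Clipping limits how strongly a dominant gray level is stretched, which is why the method can enhance faint paper structure without blowing up noise in flat regions.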
Image preprocessing (3)
[Figure panels: tile centers, original, global histogram equalization, CLAHE]
Structural similarity (1)
 MSE, PSNR, etc. are not well suited for content comparison -> perceptual
image quality assessment
 Non-blind/full-reference image quality assessment
 The mean structural similarity index (SSIM, Wang et al., 2004) compares two
images based on luminance, contrast and structure terms.
 Mean SSIM is evaluated for the overlapping region of image pairs -> requires registration
 To lower the influence of misregistration, the local minimum of the mean
SSIM between the images in the pair is evaluated
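A minimal sketch of the SSIM computation, assuming already registered grayscale images given as nested lists of equal size. The constants k1 = 0.01 and k2 = 0.03 are the commonly used defaults. As a simplification, the mean is taken over non-overlapping square windows with uniform weights, whereas the published index uses a sliding Gaussian-weighted window; the function names are hypothetical.

```python
def ssim_window(x, y, dynamic_range=255, k1=0.01, k2=0.03):
    """SSIM of two equally sized pixel lists (one local window each)."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    c1 = (k1 * dynamic_range) ** 2  # stabilizes the luminance term
    c2 = (k2 * dynamic_range) ** 2  # stabilizes the contrast/structure term
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))


def mean_ssim(img_a, img_b, win=8):
    """Average SSIM over non-overlapping win x win windows of two images."""
    height, width = len(img_a), len(img_a[0])
    scores = []
    for r in range(0, height - win + 1, win):
        for c in range(0, width - win + 1, win):
            patch_a = [img_a[r + i][c + j] for i in range(win) for j in range(win)]
            patch_b = [img_b[r + i][c + j] for i in range(win) for j in range(win)]
            scores.append(ssim_window(patch_a, patch_b))
    return sum(scores) / len(scores)


# Toy example: a gradient image compared with itself and with its inverse
img = [[(r * 16 + c) % 256 for c in range(16)] for r in range(16)]
inverse = [[255 - v for v in row] for row in img]
print(mean_ssim(img, img), mean_ssim(img, inverse))
```

Identical content yields a mean SSIM of 1, while structurally different content gives a much lower value, which is the property the pair classification in the results relies on.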
Structural similarity (2)
 Registered and overlaid images (low SSIM: black, high SSIM: white)
Results - International Dunhuang Project data (1)
Category                                Criterion                           Number of pairs
Pairs not matching                      Mean SSIM = 0                       8
Pairs with low structural similarity    Mean SSIM < 0.67                    78
Pairs with high structural similarity   Mean SSIM > 0.67 (p=5 quantile)     1482
Total number                                                                1560
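The categories of this result slide can be expressed as a simple triage rule on the mean SSIM of an image pair. This is a hedged sketch: the threshold 0.67 is the value quoted on the slide, the function name is hypothetical, and assigning a value of exactly 0.67 to the "high" category is an assumption, since the slide does not cover that case.

```python
def triage(mean_ssim_value, threshold=0.67):
    """Assign an image pair to one of the three result categories."""
    if mean_ssim_value == 0:
        return "not matching"
    if mean_ssim_value < threshold:
        return "low structural similarity"
    return "high structural similarity"  # includes the boundary value (assumption)


for score in (0.0, 0.41, 0.93):
    print(score, "->", triage(score))
```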
Results - International Dunhuang Project data (2)
 Pairs of high mean SSIM are not subject to human verification
Results - International Dunhuang Project data (3)
 Pairs of medium mean SSIM are possibly subject to human verification
Results - International Dunhuang Project data (4)
 Pairs of low mean SSIM are subject to human verification
Results - Google books redownload workflow (1)
Rate = 1: content is identical
Nr.   Book/barcode name                 #Pairs   Rate of matches
1     +Z13641740X_31525197396364410     546      0.9982
2     +Z13722110X_31525197396362478     18       1.0000
3     +Z136400800_31525197396361993     269      0.9888
4     +Z136408008_31525197396361942     291      0.9897
5     +Z136408409_31525197396362038     1        1.0000
6     +Z136409104_31525197396361681     182      0.9670
7     +Z136411408_31525197396362266     219      0.9954
8     +Z136415001_31525197396363522     360      0.9861
9     +Z136419900_31525197396360634     219      0.9954
10    +Z136428500_31525197396360351     249      0.9799
11    +Z136436004_31525197396361129     273      0.9853
12    +Z137116108_31525197396265632     589      0.9949
13    +Z137117708_31525197396287838     651      0.9969
14    +Z137118403_31525197396265776     505      0.9822
15    +Z137120100_31525197396265914     1231     0.9992
16    +Z137150402_31525197396389590     2        1.0000
17    +Z137219001_31525197396361518     664      0.9774
18    +Z150800609_31525197396361025     212      0.9858
19    +Z152471307_31525197396311214     443      0.9910
20    +Z152472403_31525197396313828     460      0.9913
21    +Z152472701_31525197396315698     859      0.9953
Results - Google books redownload workflow (2)
[Figure panels: pairs not matching; pairs with low similarity (or low overlap); pairs with high similarity]
Images provided by Google books collection / Austrian National Library
Conclusion and outlook
 Keypoint based approach for quality assurance in digital book preservation
 Combination of keypoints approach with perceptual similarity evaluation
 Recently: combination with bag of keypoints approach for duplicate
detection and collection comparison
 Currently: Evaluation at Austrian National Library (Google books collection)
and British Library (historical newspaper collection)
 Future: Integration on SCAPE platform for scalable distributed computing
AIT Austrian Institute of Technology
your ingenious partner
reinhold.huber-moerk@ait.ac.at

SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
 
TAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionalityTAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionality
 

Último

Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 

Último (20)

Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 

Quality assurance for document image collections in digital preservation

  • 1. Quality assurance for document image collections in digital preservation
      Reinhold Huber-Mörk (1) & Alexander Schindler (1, 2)
      (1) Research Area Intelligent Vision Systems, Department Safety & Security, AIT Austrian Institute of Technology
      (2) Department of Software Technology and Interactive Systems, Vienna University of Technology
  • 2. Overview
      • Digital preservation
      • Quality assurance in digital image preservation workflows
      • Keypoint based approach for image content comparison
      • Spatially distinctive keypoints
      • Document image preprocessing
      • Structural similarity assessment
      • Results on real-world data sets
      Date on slide footer: 12.09.2013
  • 3. Digital preservation
      • "Set of processes, activities and management of digital information over time to ensure its long term accessibility" (Source: Wikipedia)
      • Physical damage of digital/digitized content, e.g. "bit rot" related to some storage media
      • Digital obsolescence of hardware/software, e.g. vanishing file formats
      • Content modification in preservation, e.g. error injection during file format conversion, digital manipulation, reacquisition, ...
      Images provided by historical newspaper collection / The British Library
  • 4. Quality assurance in digital image preservation workflows
      • Automated preservation workflows are common in large digitization projects (e.g. museum collections, Google Books PPPs, ...).
      • Automated quality assurance is needed to ensure file format consistency, detection of duplicates, and preservation of quality and content.
      • SCAPE FP7
  • 5. Keypoint based approach for document comparison
      • Local features are detected & described by standard LoG/SIFT approach
      • Scaling, rotation, cropping and additional/missing content is handled
      • Affine transformation is sufficient (usually no perspective, bending etc.)
      Images provided by historical newspaper collection / The British Library
  • 6. Spatially distinctive keypoints (SDKs) (1)
      • High-resolution document scans contain a large number of keypoints (e.g. ~50,000 keypoints on ~5000x3000 pixel images)
      • Matching of descriptors results in high computational complexity
      • Changing the detector edge/peak thresholds often results in a spatially uneven distribution of keypoints
      • One solution is dense/regular spatial sampling of keypoints
      • Another solution is adaptive non-maximal suppression (Brown et al., 2005)
      • Our solution is to enforce keypoint selection at positions locally adjacent to spatially uniformly distributed interest regions
  • 7. Spatially distinctive keypoints (SDKs) (2)
      • Interest regions are distributed over the image using a regular grid
      • Keypoints with highest saliency are selected from each interest region (Harris & Stephens corner strength is used as saliency measure)
      Images provided by International Dunhuang Project / The British Library
  • 8. Evaluation SDK (1)
      • Mean SSIM vs. number of SDKs (evaluated on 1560 Dunhuang image pairs)
      [Figure panels: #SDKs=64, #SDKs=256, #SDKs=512, #SDKs=1024, #SDKs=2048, all keypoints]
  • 9. Evaluation SDK (2)
      • Dependency of mean SSIM on the number of SDKs (evaluated on 1560 Dunhuang image pairs)
  • 10. Robust symmetric matching
      • RANSAC constrained by affine transformation
      • Only accept significant matches - distance ratio of best and second best match
      • Enforcing one-to-one matching of descriptors - ignoring ambiguous matches
      Images provided by International Dunhuang Project / The British Library
  • 11. Image preprocessing (1)
      • Content in (historical) book collections is characterized by a mixture of text, graphical art, empty pages & other artefacts
      • E.g. consider a sample from the Dunhuang manuscripts
      Images provided by International Dunhuang Project / The British Library
  • 12. Image preprocessing (2)
      • Locally adaptive histogram equalization to enhance paper structure while preserving text structure
      • Contrast limited adaptive histogram equalization (CLAHE, Pizer et al., 1987), where grid/tile spacing ~ character size (e.g. 40x50 pixels)
      Images provided by International Dunhuang Project / The British Library
  • 13. Image preprocessing (3)
      [Figure panels: Tile centers, Original, Global hist. eq., CLAHE]
      Images provided by International Dunhuang Project / The British Library
  • 14. Structural similarity (1)
      • MSE, PSNR, etc. are not well suited for content comparison -> perceptual image quality assessment
      • Non-blind/full-reference image quality assessment
      • The mean structural similarity index (SSIM, Wang et al., 2004) compares two images based on luminance, contrast and structure terms.
      • Mean SSIM is evaluated for the overlapping region of image pairs -> registration
      • To lower the influence of misregistration, the local minimum of the mean SSIM between the images in the pair is evaluated
  • 15. Structural similarity (2)
      • Registered and overlaid images (SSIM low: black, SSIM high: white)
      Images provided by International Dunhuang Project / The British Library
  • 16. Results - International Dunhuang Project data (1)
      • Pairs not matching (mean SSIM = 0): 8 pairs
      • Pairs with low structural similarity (mean SSIM < 0.67): 78 pairs
      • Pairs with high structural similarity (mean SSIM > 0.67, p=5 quantile): 1482 pairs
      • Total number: 1560 pairs
  • 17. Results - International Dunhuang Project data (2)
      • Pairs of high mean SSIM are not subject to human verification
      Images provided by International Dunhuang Project / The British Library
  • 18. Results - International Dunhuang Project data (3)
      • Pairs of medium mean SSIM are possibly subject to human verification
      Images provided by International Dunhuang Project / The British Library
  • 19. Results - International Dunhuang Project data (4)
      • Pairs of low mean SSIM are subject to human verification
      Images provided by International Dunhuang Project / The British Library
  • 20. Results - Google books redownload workflow (1)
      Rate = 1: content is identical

      Book/barcode nr.  Book/Barcode name               #Pairs  Rate of matches
      1                 +Z13641740X_31525197396364410   546     0.9982
      2                 +Z13722110X_31525197396362478   18      1.0000
      3                 +Z136400800_31525197396361993   269     0.9888
      4                 +Z136408008_31525197396361942   291     0.9897
      5                 +Z136408409_31525197396362038   1       1.0000
      6                 +Z136409104_31525197396361681   182     0.9670
      7                 +Z136411408_31525197396362266   219     0.9954
      8                 +Z136415001_31525197396363522   360     0.9861
      9                 +Z136419900_31525197396360634   219     0.9954
      10                +Z136428500_31525197396360351   249     0.9799
      11                +Z136436004_31525197396361129   273     0.9853
      12                +Z137116108_31525197396265632   589     0.9949
      13                +Z137117708_31525197396287838   651     0.9969
      14                +Z137118403_31525197396265776   505     0.9822
      15                +Z137120100_31525197396265914   1231    0.9992
      16                +Z137150402_31525197396389590   2       1.0000
      17                +Z137219001_31525197396361518   664     0.9774
      18                +Z150800609_31525197396361025   212     0.9858
      19                +Z152471307_31525197396311214   443     0.9910
      20                +Z152472403_31525197396313828   460     0.9913
      21                +Z152472701_31525197396315698   859     0.9953
  • 21. Results - Google books redownload workflow (2)
      [Figure panels: Pairs not matching, Pairs with low similarity (or low overlap), Pairs with high similarity]
      Images provided by Google books collection / Austrian National Library
  • 22. Conclusion and outlook
      • Keypoint based approach for quality assurance in digital book preservation
      • Combination of keypoints approach with perceptual similarity evaluation
      • Recently: combination with bag of keypoints approach for duplicate detection and collection comparison
      • Currently: Evaluation at Austrian National Library (Google books collection) and British Library (historical newspaper collection)
      • Future: Integration on SCAPE platform for scalable distributed computing
  • 23. AIT Austrian Institute of Technology your ingenious partner reinhold.huber-moerk@ait.ac.at
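
The selection of spatially distinctive keypoints described on slides 6 and 7 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes that each cell of the regular grid acts as one interest region, that a saliency value (for example a corner strength) has already been computed for every keypoint, and that the names `select_sdks`, `grid_cols`, `grid_rows` and `per_cell` are hypothetical.

```python
from collections import defaultdict


def select_sdks(keypoints, width, height, grid_cols, grid_rows, per_cell=1):
    """Keep the `per_cell` most salient keypoints in each cell of a regular grid.

    keypoints: iterable of (x, y, saliency) tuples. The saliency is assumed
    to be precomputed (e.g. a corner strength); this sketch does not compute it.
    """
    cell_w = width / grid_cols
    cell_h = height / grid_rows

    # Assign every keypoint to the grid cell (interest region) it falls into.
    cells = defaultdict(list)
    for x, y, saliency in keypoints:
        col = min(int(x / cell_w), grid_cols - 1)  # clamp points on the right border
        row = min(int(y / cell_h), grid_rows - 1)  # clamp points on the bottom border
        cells[(row, col)].append((x, y, saliency))

    # From each non-empty cell keep only the strongest keypoints.
    selected = []
    for key in sorted(cells):
        ranked = sorted(cells[key], key=lambda kp: kp[2], reverse=True)
        selected.extend(ranked[:per_cell])
    return selected


if __name__ == "__main__":
    kps = [(10, 10, 0.5), (20, 20, 0.9), (150, 10, 0.3), (160, 180, 0.7)]
    print(select_sdks(kps, width=200, height=200, grid_cols=2, grid_rows=2))
```

Because at most `per_cell` keypoints survive per cell, the number of descriptors to match is bounded by the grid size rather than by the image content, which is the motivation given on slide 6.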
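
Two of the matching rules on slide 10, the distance-ratio test and one-to-one (symmetric) matching, could look roughly like the sketch below. It is an illustration under assumptions: descriptors are plain numeric tuples compared by Euclidean distance, the ratio threshold of 0.8 is a placeholder that the slides do not specify, the function names are hypothetical, and the RANSAC step with the affine constraint is left out.

```python
import math


def ratio_test_matches(desc_a, desc_b, max_ratio=0.8):
    """Map index in desc_a -> index in desc_b for matches passing the ratio test.

    A match is kept only if the best distance is clearly smaller than the
    second-best distance (max_ratio is an assumed value, not from the slides).
    """
    matches = {}
    if len(desc_b) < 2:
        return matches  # the ratio test needs a second-best candidate
    for i, da in enumerate(desc_a):
        ranked = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        (best_d, best_j), (second_d, _) = ranked[0], ranked[1]
        if best_d < max_ratio * second_d:
            matches[i] = best_j
    return matches


def symmetric_matches(desc_a, desc_b, max_ratio=0.8):
    """Keep only pairs that are each other's match in both directions."""
    forward = ratio_test_matches(desc_a, desc_b, max_ratio)
    backward = ratio_test_matches(desc_b, desc_a, max_ratio)
    return sorted((i, j) for i, j in forward.items() if backward.get(j) == i)


if __name__ == "__main__":
    a = [(0, 0), (10, 10), (5, 5)]
    b = [(10.1, 10), (0, 0.1), (20, 20)]
    print(symmetric_matches(a, b))
```

In this toy input the descriptor `(5, 5)` is almost equally close to two candidates, so the ratio test drops it as ambiguous, and `(20, 20)` is dropped because its match is not reciprocated.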
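
The mean SSIM from slide 14 combines luminance, contrast and structure terms per local window and averages the result over the image. The sketch below is a simplified illustration only: it assumes the two images are already registered and cropped to the same size, represents them as lists of rows of grey values, and uses non-overlapping square blocks instead of the Gaussian-weighted sliding window of Wang et al. (2004). The constants follow the commonly used defaults K1 = 0.01 and K2 = 0.03; the function names and the block size are hypothetical.

```python
def _block_ssim(xs, ys, dynamic_range=255.0):
    """SSIM of two equally long lists of grey values (one local block)."""
    n = len(xs)
    c1 = (0.01 * dynamic_range) ** 2  # stabilises the luminance term
    c2 = (0.03 * dynamic_range) ** 2  # stabilises the contrast/structure term
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    var_x = sum((x - mu_x) ** 2 for x in xs) / n
    var_y = sum((y - mu_y) ** 2 for y in ys) / n
    cov = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def mean_ssim(img_a, img_b, block=8, dynamic_range=255.0):
    """Average block-wise SSIM over two registered images of equal size."""
    height, width = len(img_a), len(img_a[0])
    scores = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            xs = [v for row in img_a[top:top + block] for v in row[left:left + block]]
            ys = [v for row in img_b[top:top + block] for v in row[left:left + block]]
            scores.append(_block_ssim(xs, ys, dynamic_range))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    a = [[(r * 16 + c) % 256 for c in range(16)] for r in range(16)]
    inverted = [[255 - v for v in row] for row in a]
    print(mean_ssim(a, a), mean_ssim(a, inverted))
```

An image compared with itself scores 1 (up to rounding), while the inverted copy scores below zero because its local structure is anti-correlated. A score from a routine like this could then be compared against a threshold such as the 0.67 reported on slide 16 to decide which pairs need human verification.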