1. June 14, 2013
Page 1
Content Conversion Specialists
WS Refinement and Quality Assessment
Claus Gravenhorst
Director Strategic Initiatives
CCS
Content Conversion Specialists
europeana newspapers
Workshop Refinement and Quality Assessment, Belgrade 14.6.2013
OLR at CCS
From unstructured to structured newspaper data and the role
of content providers in the overall process
2. June 14, 2013
Agenda
About CCS
General workflow for mass digitization of newspapers
OLR – Layout and structure analysis
ENP OLR workflow (involvement of content providers, CPs)
Quality assurance
Output - METS/ALTO package
Demo of first results
3. June 14, 2013
About CCS
CCS Content Conversion Specialists GmbH (Hamburg), as technical project
partner, will provide its expertise and docWorks technology to set up and
operate a mass digitisation workflow to create high quality structured content
from 2 million scanned newspaper pages provided by 5 library partners
Page volume:
BNF = 1,000 k, NLE = 500 k, SUB HH = 480 k, NLF = 90 k, SBB = 10 k
The distributed OLR workflow enables the contribution of project partners
(content providers) to the integrated quality assurance process
CCS will also contribute to the specification of the metadata model
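As a quick arithmetic check, the per-partner page volumes on this slide can be summed to confirm that they match the "2 million pages" figure. The dictionary name below is illustrative only; the numbers are taken from the slide.

```python
# Page volumes per library partner, in thousands of pages (figures from the slide).
page_volume_k = {"BNF": 1000, "NLE": 500, "SUB HH": 480, "NLF": 90, "SBB": 10}

total_k = sum(page_volume_k.values())
print(f"Total: {total_k} k pages")  # prints "Total: 2080 k pages"
```

The total of 2,080 k pages is consistent with the rounded figure of 2 million.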
4. June 14, 2013
General workflow for mass digitization
[Workflow diagram. Labels recoverable from the slide:]
Process steps: Scanning, Imaging, Conversion, Layout Analysis, OCR, ISR, QA + Correction, Final Output, Delivery
Feedback paths: Re-Scan, Reject
Scanner types: robot, book, document, microfilm
Quality assurance: Automated QA, random QA, Manual QA (in-house, near-shore, off-shore, multiple locations); a second Manual QA box lists in-house and near-shore only
Item tracking: Check in, Check out, Document UID, Barcode
Data: Image, Metadata (Z39.50), Database / Repository
Other labels: Condition
5. June 14, 2013
Layout and structure analysis
Layout analysis based on a "bottom-up" approach
A general rule system enables recognition of words, text
lines, text blocks and columns, and classification of text
blocks, illustrations, advertisements, tables and the
following page types:
- title page (the title page of an issue)
- content page (a page that consists of content/text only)
- illustration page (a page that has at least one illustration)
- advertisement page (a page that contains adverts only)
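The four page types listed above can be read as simple rules over the zone types found on a page. The following is a minimal sketch under stated assumptions, not the docWorks rule system: the names `ZoneType` and `classify_page`, the `is_first_page_of_issue` flag, the order in which the rules are tried, and the fallback to "content page" for pages matching no other rule are all hypothetical.

```python
from enum import Enum


class ZoneType(Enum):
    TEXT = "text"
    ILLUSTRATION = "illustration"
    ADVERTISEMENT = "advertisement"
    TABLE = "table"


def classify_page(zone_types, is_first_page_of_issue=False):
    """Return one of the four page types named on the slide.

    Rule order is an assumption of this sketch:
    title page > advertisement page > illustration page > content page.
    """
    zones = set(zone_types)
    if is_first_page_of_issue:
        return "title page"
    # A page that contains adverts only.
    if zones and zones <= {ZoneType.ADVERTISEMENT}:
        return "advertisement page"
    # A page that has at least one illustration.
    if ZoneType.ILLUSTRATION in zones:
        return "illustration page"
    # Fallback (assumption): everything else counts as a content page.
    return "content page"


print(classify_page([ZoneType.TEXT, ZoneType.ILLUSTRATION]))  # prints "illustration page"
```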
Structure analysis through classification of headlines
and grouping of zones into articles
(incl. article continuation)
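The grouping of zones into articles described above can be illustrated with a deliberately naive heuristic: every headline opens a new article, and each following zone in reading order is attached to the most recent headline. This is only a sketch; the slide does not specify the actual grouping rules, and the names `Zone`, `Article` and `group_into_articles` are hypothetical. Because the reading order may span several pages, a zone on a later page is attached to the open article, which is one simple way to model article continuation.

```python
from dataclasses import dataclass, field


@dataclass
class Zone:
    zone_id: str
    kind: str   # e.g. "headline", "text", "illustration", "caption"
    page: int


@dataclass
class Article:
    headline: Zone
    zones: list = field(default_factory=list)


def group_into_articles(zones_in_reading_order):
    """Attach each non-headline zone to the most recent headline."""
    articles = []
    for zone in zones_in_reading_order:
        if zone.kind == "headline":
            articles.append(Article(headline=zone))
        elif articles:
            articles[-1].zones.append(zone)
        # Zones that appear before the first headline stay ungrouped here.
    return articles


zones = [
    Zone("z1", "headline", 1),
    Zone("z2", "text", 1),
    Zone("z3", "headline", 1),
    Zone("z4", "text", 1),
    Zone("z5", "text", 2),  # continuation on the next page
]
for article in group_into_articles(zones):
    print(article.headline.zone_id, [z.zone_id for z in article.zones])
```

A production system would need layout and typographic evidence to decide where an article really ends; manual correction during QA covers the cases such heuristics get wrong.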
6. June 14, 2013
ENP OLR workflow | Conversion without scanning
[Workflow diagram. Labels recoverable from the slide:]
Locations: Material location, Conversion facility
Steps: Digital Image Metadata Delivery, Inspection / Automatic QA, Conversion, MD Recording, Doc Delivery, Digital Object Return
Feedback path: Reject
7. June 14, 2013
Possible conversion scenarios
A) Conversion at library (on-site)
B) Conversion off-shore at CCS data center,
final QA at the library via internet transfer (remote QA solution)
C) Conversion off-shore at CCS,
final QA at the library by backup shipment
8. June 14, 2013
Scenario B | Remote QA at library
[Architecture diagram. Labels recoverable from the slide:]
Offshore Processing @ CCS: Storage, POOL, dW Share Master
QA on-site @ Library: Storage, POOL, dW Share RQA
Connection: Internet
Other labels: IN, OUT, INPUT, OUTPUT (METS ALTO)
9. June 14, 2013
Quality assurance
@ CCS | Automated markup and basic manual correction:
- headlines, illustrations, tables, captions, advertisements, etc.
- article segmentation and grouping of zones into articles (incl. continuation)
@ Content Provider (Library)
Recommended:
- Zoning: correct classification of blocks as "text" or "illustration"
- Article segmentation: correct identification of headlines/text blocks/captions
- Grouping: correct grouping of blocks (text, illustration) into articles
- Metadata: correct title, issue date and issue number
Optional:
- Page types: correct page types
- Page numbers: correct page sequence
- OCR: perform text correction of specific zones (e.g. headlines, captions)
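Two of the checks listed above (metadata and page sequence) lend themselves to a small automated pre-check before manual QA. The following is a sketch under stated assumptions: the function name `check_issue`, the metadata keys, and the rule that pages are numbered 1..n without gaps are all hypothetical and not taken from docWorks.

```python
from datetime import date


def check_issue(metadata, page_numbers):
    """Return a list of human-readable QA findings (empty if none)."""
    findings = []
    # Metadata check: title, issue date and issue number must be present.
    for key in ("title", "issue_date", "issue_number"):
        if not metadata.get(key):
            findings.append(f"missing metadata: {key}")
    issue_date = metadata.get("issue_date")
    if issue_date and not isinstance(issue_date, date):
        findings.append("issue_date is not a date")
    # Page sequence check (assumption: pages are numbered 1..n in order).
    expected = list(range(1, len(page_numbers) + 1))
    if list(page_numbers) != expected:
        findings.append(f"unexpected page sequence: {list(page_numbers)}")
    return findings


issue = {"title": "Example Gazette", "issue_date": date(1913, 6, 14)}
for finding in check_issue(issue, [1, 3, 2]):
    print(finding)
```

Real newspapers can have supplements or unnumbered pages, so a finding from such a check is a prompt for a human reviewer rather than an error in itself.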
10. June 14, 2013
Output | METS/ALTO package
METS/ALTO metadata schemas describe the structured digital output object
A newspaper issue processed in docWorks is converted into one METS XML
file. It reflects the whole physical and logical structure, manages all links to the
image files and the related ALTO XML files. ALTO is based on a standardized
page description schema and contains all information of a page (print space,
margins, coordinates, OCR results).
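To make the page description concrete, the sketch below reads a small hand-written ALTO-like sample and extracts the OCR text and coordinates of each text block, using only the Python standard library. The sample content and the helper name `text_blocks` are illustrative; the ALTO v2 namespace is assumed here, and files produced by docWorks may use a different ALTO version and contain many more elements.

```python
import xml.etree.ElementTree as ET

# Hand-written sample for illustration; the namespace depends on the
# ALTO version in use (v2 is an assumption of this sketch).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace HPOS="100" VPOS="100" WIDTH="1800" HEIGHT="2800">
        <TextBlock ID="TB1" HPOS="100" VPOS="100" WIDTH="800" HEIGHT="60">
          <TextLine HPOS="100" VPOS="100" WIDTH="800" HEIGHT="60">
            <String CONTENT="Morning" HPOS="100" VPOS="100" WIDTH="380" HEIGHT="60"/>
            <SP/>
            <String CONTENT="News" HPOS="500" VPOS="100" WIDTH="250" HEIGHT="60"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""

NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}


def text_blocks(alto_xml):
    """Yield (block_id, text, (hpos, vpos, width, height)) per TextBlock."""
    root = ET.fromstring(alto_xml)
    for block in root.iterfind(".//alto:TextBlock", NS):
        # Each String element carries one recognized word in CONTENT.
        words = [s.get("CONTENT", "") for s in block.iterfind(".//alto:String", NS)]
        box = tuple(int(float(block.get(k, "0")))
                    for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
        yield block.get("ID"), " ".join(words), box


for block_id, text, box in text_blocks(SAMPLE_ALTO):
    print(block_id, text, box)  # prints "TB1 Morning News (100, 100, 800, 60)"
```

Word-level coordinates of this kind are what allow a presentation system to highlight search hits directly on the page image.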
Benefits of structural markup:
- better browsing and more precise text search
- better access and display on tablet and mobile devices
- automated article classification and clustering through data/text mining and
linguistic technologies
- user engagement for manual online text correction, article classification,
annotation, building personal collections, etc.
- sharing articles via social media platforms like Facebook, Twitter, etc.
_______________
METS = Metadata Encoding and Transmission Standard
ALTO = Analyzed Layout and Text Object
11. June 14, 2013
Access and Presentation
Access through Europeana as well as content provider portals
Existing newspaper presentation systems at National Library of Australia
(Trove), Library of Congress/NDNP (Chronicling America), Dutch National
Library (DDD), National Library of Luxembourg (eLuxemburgensia), ...
Veridian demo:
Example of a newspaper presentation system to demonstrate access to
already processed ENP newspaper issues
12. June 14, 2013
Questions + answers
13. June 14, 2013
Contact
Claus Gravenhorst
Director Strategic Initiatives
CCS Content Conversion Specialists GmbH
Weidestr. 134
22083 Hamburg
Germany
c.gravenhorst@content-conversion.com
www.content-conversion.com