From the printed page to discoverable content library camp perth 2010

From the Printed Page to Discoverable Content
the open source way
Steven Miles
@stevermiles stevenmiles.com.au
Tuesday, 18 January 2011

About Me
Web Application Developer @ State Library of Western Australia
S.L.U.R.P. - Digital Content Ingestion & Integration with LMS
PC Reservation - PC Reservations and Booking System
PLO - Public Libraries Online
Venues Bookings - Venues Booking & Reservation System
P.URL - Permanent URL

WARNING !!!!
Lots of technical stuff!

How can I make scanned content more discoverable?
Presentation
Digitisation
Indexing
Capture DIY Scanner
Existing Documents
Dual Camera Setup
Single Camera Setup
Commercial Scanners
Image Processing
OCR
Document Scanners
MFDs
Rotation
Cropping
Normalisation Levels Correction
Multi page
Tagging
Open source
Commercial
Cuneiform
Tesseract
Ocropus
GOCR
Page Layout Analysis
ABBYY FineReader
Acrobat
Leptonica
Metadata
Manual
Automatic
Persons
Dates
Organisations
Locations
Formats
hOCR
Text
XML
Manual
Import
Z39.50
SRU/SRW
Engine
Zebra
XML
Z39.50
RDBMS
Postgres
MySQL
Search
Pull from
LMS
Search
Multiple Databases Results
Expose Web APIs
Other Library Systems
Z39.50
SRU/SRW
Facets
Page Previews
Ranked
Sortable
Filters
Web Accessible
Simple Keyword Searching
Encourage Exploration
Tagging
Advanced Search
Saved Searches
Social Sharing, Integration
Web Browser Accessible
Auto Updating Downloadable PDFs
User Correctable Text
In Document Searching
Highlight Search Results
Potential Conversion to Other Formats

Most common process of digitisation for public consumption
Scan / Capture
Generate PDF OCR
Indexed by Content Management System
Link to Downloadable PDF (Uncorrected OCR) (Links only to Document)
How can we do this better?

Inspirational Resources
National Library of Australia - Australian Newspapers
http://newspapers.nla.gov.au/
Google Docs
http://docs.google.com
Informit - Text Searchable Content

Proposed Digitisation Process
Scan / Capture
Semi Auto Cropping and Rotation Correction
Optimise Each Page for OCR
OCR Pages - Retain Positional Information (hOCR)
Post OCR Processing - Spell checking & correction of common OCR errors
Natural Language Processing - Auto Extract Names, Organisations, Locations & Dates from Text and Use for tagging
Store as XML
Generate Page Level XML Index Files
Add/Update XML Indexing Server
Generate Searchable PDF
Generate Web Friendly Versions of each page
Fully Automated Process
Full Text Search - Web Services & Z39.50
Downloadable PDF
Google Docs Style Interface - Individual Line Highlighting to Show search results
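
One of the common OCR errors the post-processing step has to deal with is a word split across two lines by a hyphen (the sample later in the deck has "Chris-" and "tians"). The following is a minimal sketch of that single clean-up, not the actual implementation used in the prototype. The function name is hypothetical, and it assumes every trailing hyphen marks a split word, which will wrongly merge genuine hyphenated compounds.

```python
def join_hyphenated_lines(lines):
    """Join OCR text lines into one string, merging words split by a trailing hyphen."""
    pieces = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip empty OCR lines
        if pieces and pieces[-1].endswith("-"):
            # Previous line ended mid-word: drop the hyphen and glue the halves.
            pieces[-1] = pieces[-1][:-1] + text
        else:
            pieces.append(text)
    return " ".join(pieces)


sample_lines = [
    "IN a display of unity, Muslims and Chris-",
    "tians gathered at Dianella Uniting Church",
]
joined = join_hyphenated_lines(sample_lines)
```

A dictionary check on the merged word would be a sensible next step before accepting the join.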

Available Open Source Projects
OCRopus - Page Layout Analysis
http://code.google.com/p/ocropus/
Tesseract OCR - OCR
ImageMagick - Image Processing
http://www.imagemagick.org/
Index Data Zebra - XML Indexing
http://www.indexdata.com/zebra
Index Data Pazpar2 - Federated Search
http://www.indexdata.com/pazpar2
Existing Web Technologies - PHP, HTML, CSS etc

DIY Book Scanner Project
www.diybookscanner.org

Basic Architecture
Discovery Layer (PHP, HTML, CSS)
Federated Search - Using PazPar2 - Z39.50, SRU, SRW
Full Text Search - Zebra XML Indexer via Z39.50
LMS & External Databases - Existing via Z39.50
External Resources
XML Data Files - MARC, Dublin Core, OAI-PMH
Document Viewer / Editor (PHP, HTML, CSS)
Ingest / Digitisation (PHP, HTML, CSS)
OCR & NLP (Document Processing, OCR & Natural Language Processing)
Downloadable Version - Automatic Generation of Searchable PDF, Text Files etc (Updated from User Alterations)
Crowdsourcing OCR Corrections & Possible translation of handwritten documents
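
The architecture exposes the full text index over SRU as well as Z39.50. As a rough illustration of what a discovery layer request could look like, here is a sketch that builds an SRU searchRetrieve URL. The parameter names (version, operation, query, startRecord, maximumRecords) come from the SRU standard; the host, port and database name are made up for the example and are not from the original system.

```python
from urllib.parse import urlencode


def build_sru_url(base_url, cql_query, maximum_records=10, start_record=1):
    """Build an SRU searchRetrieve request URL for a CQL query."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint: replace with the address of a real SRU server.
example_url = build_sru_url("http://localhost:9999/digindex", "dianella church")
```

The discovery layer would fetch this URL and parse the XML response; that part is left out here.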

Converting Images for OCR
Convert to Grayscale
Generate Text Image Mask
Clean Up Background Noise
OCR Version
OCRopus Page Layout Analysis
ImageMagick Image Manipulation
Combined
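
To give a feel for the ImageMagick side of this step, the sketch below builds a `convert` command line from the rotate and crop values stored for each page (the same values that appear in the page XML sample later in the deck). It covers only rotation, cropping and grayscale conversion; the text mask and noise clean-up are not shown. The function name and output file name are hypothetical, and applying rotation before cropping is an assumption.

```python
def build_convert_command(src, dst, rotate, crop):
    """Return an ImageMagick argument list that rotates, crops and grayscales a page."""
    return [
        "convert", src,
        "-rotate", str(rotate),   # degrees, as stored in the page record
        "-crop", crop,            # geometry string: WIDTHxHEIGHT+X+Y
        "+repage",                # reset the canvas after cropping
        "-colorspace", "Gray",
        "-normalize",             # stretch contrast before OCR
        dst,
    ]


command = build_convert_command(
    "odd/IMG_0946.JPG", "1-gray.png", rotate=91, crop="2161x3247+374+274"
)
```

With ImageMagick installed, the list can be passed to `subprocess.run(command, check=True)`.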

Images to Text
Image for OCR Processing
Tesseract OCR to hOCR File
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd"><html><head><title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" ><meta
name='ocr-system' content='tesseract'></head>
<body><div class='ocr_page' id='page_1' title='image "/var/digindex/repository/
eastern_reporter/2010/10/5/ocr/1-masked.png"; bbox 0 0 2161 3247'>
<div class='ocr_carea' id='block_1_1' title="bbox 200 46 1858 233">
R r 
</div><div class='ocr_carea' id='block_1_2' title="bbox 47 1855 241 1883">
By LIAM CROY</div><div class='ocr_carea' id='block_1_3'
title="bbox 43 1909 533 2404">
IN a
<metadata><title>Eastern Reporter Tuesday, October 5,
2010</title><id>eastern_reporter/2010/10/5</id></metadata>
<pages><page id="0" origWidth="3648" origHeight="2736"
rotate="-90.5" crop="2199x3321+147+147"/><page id="1"
origWidth="3648" origHeight="2736" rotate="91" path="odd/
IMG_0946.JPG" crop="2161x3247+374+274" width="2161"
height="3247"><paragraph><line id="line_1_1" top="50" left="201"
width="1657" height="180">R r</line></
paragraph><paragraph><line id="line_1_2" top="1855" left="47"
width="194" height="27">By LIAM CROY</line></
width="485" height="24">IN a display of unity, Muslims and Chris-</
line><line id="line_1_4" top="1937" left="45" width="486"
height="26">tians gathered at Dianella Uniting Church</line><line
id="line_1_5" top="1965" left="45" width="485" height="26">last
Thursday to share thei.r experiences</line><line id="line_1_6"
top="1993" left="45" width="212" height="24">and pray for peace.</
line></paragraph><paragraph><line id="line_1_7" top="2020"
left="79" width="451" height="25">Sheikh Muhammad Agherdien of
the</line></paragraph><paragraph><line id="line_1_8" top="2048"
left="46" width="484" height="25">Mirrabooka mosque opened the
service</line><line id="line_1_9" top="2076" left="46" width="484"
height="26">with a verse of the Islamic religious text,</line><line
id="line_1_10" top="2103" left="45" width="117" height="20">the
Koran:</line></paragraph><paragraph><line id="line_1_11"
top="2131" left="79" width="451" height="27">“Oh People!
Behold, we have created you</line></paragraph><paragraph><line
id="line_1_12" top="2158" left="46" width="331" height="22">all out
ofa male and a female.</line></paragraph><paragraph><line
id="line_1_13" top="2187" left="79" width="451"
height="25">“And we have made you into nations</line></
Convert hOCR to XML for Storage
Sample Auto Generated Tags
IN a display of unity , [MISC Muslims ] and [MISC Chris- ] , tians
gathered at [ORG Dianella Uniting Church ] , last Thursday to share
thei.r experiences , and pray for peace.
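
The tagger output above marks each entity as `[TYPE text ]`. A minimal sketch of turning that bracketed text into (type, text) pairs for tagging is shown below. The regular expression is written against the sample output only and assumes entities are never nested; the function name is hypothetical.

```python
import re

# Matches "[TYPE some words ]" as produced in the sample tagger output.
TAG_PATTERN = re.compile(r"\[(\w+) ([^\]]*?) \]")


def extract_entities(tagged_text):
    """Return a list of (entity_type, entity_text) pairs found in tagged text."""
    return [(m.group(1), m.group(2)) for m in TAG_PATTERN.finditer(tagged_text)]


sample = (
    "IN a display of unity , [MISC Muslims ] and [MISC Chris- ] , tians "
    "gathered at [ORG Dianella Uniting Church ] , last Thursday"
)
entities = extract_entities(sample)
```

Note that the split word "Chris-" is tagged on its own, which is one reason to rejoin hyphenated words before running the tagger.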

Prototype Interface for Ingesting Pages from Book Scanner

Perform Basic Image Rotation and Cropping
Rotation and cropping can be replicated to other pages

Prototype Search Pages
The facets on the left of the results are auto-generated from the natural language processing tags
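
As an illustration of how facets like these could be derived from the tags, the sketch below counts, for each entity type, how many pages mention each value. In the prototype the indexing server would more likely produce these counts; this stand-alone version is only meant to show the idea. The page identifiers and the function name are hypothetical.

```python
from collections import Counter


def build_facets(page_tags):
    """Count pages per tagged value, grouped by entity type.

    page_tags maps a page id to a list of (entity_type, entity_text) pairs.
    """
    facets = {}
    for tags in page_tags.values():
        # set() so a value repeated on one page is counted once for that page
        for entity_type, text in set(tags):
            facets.setdefault(entity_type, Counter())[text] += 1
    return facets


pages = {
    "page_1": [("ORG", "Dianella Uniting Church"), ("MISC", "Muslims")],
    "page_2": [("ORG", "Dianella Uniting Church"), ("ORG", "Dianella Uniting Church")],
}
facets = build_facets(pages)
```

Each Counter can then be listed with `most_common()` to show the most frequent values first.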

Viewing Document Pages

Viewing Document Pages with Highlighted Results
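
Line highlighting depends on the positional information kept from the hOCR output. The sketch below shows one way it could work: find the lines in a page's XML that contain the search term and scale their top, left, width and height to the width of the web-friendly image. The XML follows the shape of the sample shown earlier, but the function name and the scaling approach are assumptions, not the prototype's actual code.

```python
import xml.etree.ElementTree as ET

SAMPLE_PAGE = (
    '<page id="1" width="2161" height="3247"><paragraph>'
    '<line id="line_1_4" top="1937" left="45" width="486" height="26">'
    'tians gathered at Dianella Uniting Church</line>'
    '<line id="line_1_6" top="1993" left="45" width="212" height="24">'
    'and pray for peace.</line>'
    '</paragraph></page>'
)


def highlight_boxes(page_xml, term, display_width):
    """Return scaled rectangles for every line whose text contains the term."""
    page = ET.fromstring(page_xml)
    scale = display_width / float(page.get("width"))
    boxes = []
    for line in page.iter("line"):
        if term.lower() in (line.text or "").lower():
            box = {"id": line.get("id")}
            for key in ("left", "top", "width", "height"):
                box[key] = round(int(line.get(key)) * scale)
            boxes.append(box)
    return boxes


boxes = highlight_boxes(SAMPLE_PAGE, "dianella", display_width=2161)
```

The returned rectangles could be drawn as absolutely positioned elements over the page image in the viewer.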

Editing Document with Auto Updating of Indexer
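
A user correction comes down to replacing the text of one line in the stored page XML and then re-submitting the page to the indexer. The sketch below covers only the first half, using the "thei.r" error from the sample page. The function name is hypothetical, and the re-indexing call is left out because it depends on how the indexing server is set up.

```python
import xml.etree.ElementTree as ET

PAGE_XML = (
    '<page id="1"><paragraph>'
    '<line id="line_1_5" top="1965" left="45" width="485" height="26">'
    'last Thursday to share thei.r experiences</line>'
    '</paragraph></page>'
)


def apply_correction(page_xml, line_id, corrected_text):
    """Return the page XML with the text of one line replaced."""
    page = ET.fromstring(page_xml)
    for line in page.iter("line"):
        if line.get("id") == line_id:
            line.text = corrected_text
            return ET.tostring(page, encoding="unicode")
    raise KeyError(line_id)  # no such line on this page


corrected = apply_correction(
    PAGE_XML, "line_1_5", "last Thursday to share their experiences"
)
```

Because positions are stored as attributes on the line, the correction leaves the highlighting coordinates untouched.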

PazPar2 can be used to build alternative interfaces for searching multiple existing catalogs
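
Pazpar2 is driven through a small HTTP web service, so an alternative interface mostly consists of issuing a few commands in sequence: init to open a session, search to start the query, and show to fetch merged results. The sketch below only builds the request URLs. The base URL and the session value are placeholders (a real session id is returned by the init command), and the helper name is hypothetical.

```python
from urllib.parse import urlencode


def pazpar2_url(base_url, command, **params):
    """Build a Pazpar2 web service URL for the given command and parameters."""
    query = {"command": command}
    query.update(params)
    return base_url + "?" + urlencode(query)


BASE = "http://localhost/pazpar2/search.pz2"  # placeholder address

init_url = pazpar2_url(BASE, "init")
search_url = pazpar2_url(BASE, "search", session="1", query="dianella")
show_url = pazpar2_url(BASE, "show", session="1", start=0, num=20)
```

Each response is XML, so the same parsing approach used for the page files would apply to the results.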

Questions?

More Info & Credits
Tesseract-OCR
http://code.google.com/p/tesseract-ocr/
OCRopus
http://code.google.com/p/ocropus/
Do-It-Yourself Book Scanning
http://www.diybookscanner.org/
CHDK - Canon Hack Development Kit
http://chdk.wikia.com/wiki/CHDK
Zebra - XML Indexing
http://www.indexdata.com/zebra
PazPar2 - Federated Search
http://www.indexdata.com/pazpar2
Cuneiform
EyeFi Python Server
http://returnbooleantrue.blogspot.com/2009/01/eye-fi-standalone-server.html/
hOCR - HTML OCR
http://en.wikipedia.org/wiki/HOCR
OpenNLP
Illinois Named Entity Tagger
http://cogcomp.cs.illinois.edu/page/software_view/4
