4. About Me
Web Application Developer @ State Library of Western Australia
S.L.U.R.P.: Digital Content Ingestion & Integration with LMS
PC Reservation: PC Reservations and Booking System
PLO: Public Libraries Online
Venues Bookings: Venues Booking & Reservation System
P.URL: Permanent URL
Tuesday, 18 January 2011
6. How can I make scanned content more discoverable?
Presentation
Digitisation
Indexing
Capture
DIY Scanner
Existing Documents
Dual Camera Setup
Single Camera Setup
Commercial Scanners
Image Processing
OCR
Document Scanners
MFDs
Rotation
Cropping
Normalisation
Levels Correction
Multi page
Tagging
Open source
Commercial
Cuneiform
Tesseract
OCRopus
GOCR
Page Layout Analysis
ABBYY FineReader
Acrobat
Leptonica
Metadata
Manual
Automatic
Persons
Locations
Dates
Organisations
Formats
hOCR
Text
XML
Manual
Import
Z39.50
SRU/SRW
Engine
Zebra
XML
Z39.50
RDBMS
Postgres
MySQL
Search
Pull from LMS
Search
Multiple Databases
Results
Expose Web APIs
Other Library Systems
Z39.50
SRU/SRW
Facets
Page Previews
Ranked
Sortable
Filters
Web Accessible
Simple Keyword Searching
Encourage Exploration
Tagging
Advanced Search
Saved Searches
Social Sharing, Integration
Web Browser Accessible
Auto Updating
Downloadable PDFs
User Correctable Text
In Document Searching
Highlight Search Results
Potential Conversion to Other Formats
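The SRU/SRW access listed above for other library systems comes down to plain HTTP requests carrying a CQL query. A minimal sketch of building an SRU 1.1 searchRetrieve URL follows. The endpoint `http://sru.example.org/scans` and the `dc.title` index are assumptions for illustration only; a real server publishes the indexes it supports.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_sru_url(base_url, cql_query, start=1, maximum=10):
    """Return a searchRetrieve URL for an SRU 1.1 endpoint."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": cql_query,          # CQL query string
        "startRecord": start,        # 1-based offset into the result set
        "maximumRecords": maximum,   # page size
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and index name, for illustration only.
url = build_sru_url("http://sru.example.org/scans", 'dc.title = "swan river"')
print(url)

# The parameters can be read back to check the encoding round-trips.
print(parse_qs(urlsplit(url).query)["query"])
```

Keeping the query in CQL lets the same search run unchanged against any SRU-capable catalogue.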
7. Most common process of digitisation for public consumption
Scan / Capture
Generate PDF
OCR
Indexed by Content Management System
Link to Downloadable PDF (Uncorrected OCR) (Links only to Document)
How can we do this better?
8. Inspirational Resources
National Library of Australia - Australian Newspapers
http://newspapers.nla.gov.au/
Google Docs
http://docs.google.com
Informit - Text Searchable Content
9. Proposed Digitisation Process
Scan / Capture
Semi Auto Cropping and Rotation Correction
Optimise Each Page for OCR
OCR Pages: Retain Positional Information (hOCR)
Post OCR Processing: Spell checking & correction of common OCR errors
Natural Language Processing: Auto Extract Names, Organisations, Locations & Dates from Text and Use for tagging
Store as XML: Generate Page Level XML Index Files
Add/Update XML Indexing Server
Fully Automated Process
Generate Searchable PDF
Generate Web Friendly Versions of each page
Full Text Search
Web Services & Z39.50
Downloadable PDF
Google Docs Style Interface
Individual Line Highlighting to Show search results
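Retaining positional information is what makes individual line highlighting possible: hOCR stores each recognised line with a bounding box in its `title` attribute. The sketch below uses only the Python standard library to read those boxes and find the lines that contain a search term. The sample markup, the class name `HocrLineParser` and the helper names are illustrative assumptions, not part of the proposed system.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrLineParser(HTMLParser):
    """Collect (bbox, text) for every element with class 'ocr_line'."""

    def __init__(self):
        super().__init__()
        self.lines = []     # finished lines as (bbox, text)
        self._bbox = None   # bbox of the line currently open, if any
        self._tag = None    # tag name of the open line element
        self._depth = 0     # nesting depth of same-named tags in the line
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._bbox is not None:
            if tag == self._tag:
                self._depth += 1
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        match = BBOX_RE.search(attr.get("title") or "")
        if "ocr_line" in classes and match:
            self._bbox = tuple(int(v) for v in match.groups())
            self._tag = tag
            self._depth = 1
            self._text = []

    def handle_endtag(self, tag):
        if self._bbox is None or tag != self._tag:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse whitespace between word spans into single spaces.
            text = " ".join("".join(self._text).split())
            self.lines.append((self._bbox, text))
            self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)


def parse_hocr_lines(hocr):
    parser = HocrLineParser()
    parser.feed(hocr)
    parser.close()
    return parser.lines


def find_highlights(lines, term):
    """Return the bounding boxes of lines containing term (case-insensitive)."""
    needle = term.lower()
    return [bbox for bbox, text in lines if needle in text.lower()]


# Hand-written sample in hOCR style, for illustration only.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 800 1200'>
  <span class='ocr_line' title='bbox 50 100 700 130'>Swan River Colony, 1829</span>
  <span class='ocr_line' title='bbox 50 140 700 170'><span class='ocrx_word' title='bbox 50 140 200 170'>Perth</span> <span class='ocrx_word' title='bbox 210 140 400 170'>Gazette</span></span>
</div>
"""

for bbox in find_highlights(parse_hocr_lines(SAMPLE), "gazette"):
    print(bbox)
```

A web viewer could draw each returned box, scaled to the displayed image size, over the page image to show where the hit occurs.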
10. Available Open Source Projects
OCRopus - Page Layout Analysis
http://code.google.com/p/ocropus/
Tesseract OCR - OCR
http://code.google.com/p/tesseract-ocr/
ImageMagick - Image Processing
http://www.imagemagick.org/
Index Data Zebra - XML Indexing
http://www.indexdata.com/zebra
Index Data Pazpar2 - Federated Search
http://www.indexdata.com/pazpar2
Existing Web Technologies - PHP, HTML, CSS etc
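Since Zebra indexes XML records, each scanned page could be written out as a small XML file before it is added to the indexing server. The sketch below is in Python rather than PHP and uses only the standard library. The element and attribute names (`page`, `text`, `tags`, `tag`) are a hypothetical schema chosen for illustration; a real Zebra installation would need its configuration to match whatever schema is actually used.

```python
import xml.etree.ElementTree as ET


def build_page_record(doc_id, page_no, text, tags):
    """Serialise one OCRed page as a page-level XML index record.

    tags is a list of (kind, value) pairs, e.g. ("location", "Perth").
    """
    page = ET.Element("page", {"document": doc_id, "number": str(page_no)})
    ET.SubElement(page, "text").text = text
    tags_el = ET.SubElement(page, "tags")
    for kind, value in tags:
        ET.SubElement(tags_el, "tag", {"type": kind}).text = value
    # ElementTree escapes characters such as '&' on output.
    return ET.tostring(page, encoding="unicode")


record = build_page_record(
    "doc-001",                      # hypothetical document identifier
    3,
    "Perth Gazette & Journal",
    [("location", "Perth"), ("date", "1833")],
)
print(record)
```

One record per page keeps search hits addressable at page level, which is what lets a result link to the matching page instead of only to the whole document.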