From the Printed Page to Discoverable Content
the open source way
Steven Miles
@stevermiles stevenmiles.com.au
Tuesday, 18 January 2011
About Me
Web Application Developer @ State Library of Western Australia
S.L.U.R.P. - Digital Content Ingestion & Integration with LMS
PC Reservation - PC Reservations and Booking System
PLO - Public Libraries Online
Venues Bookings - Venues Booking & Reservation System
P.URL - Permanent URL
WARNING !!!!
Lots of technical stuff!
How can I make scanned content more discoverable?
presentation
Digitisation
Indexing
Capture
DIY Scanner
Existing Documents
Dual Camera Setup
Single Camera Setup
Commercial Scanners
Image Processing
OCR
Document Scanners
MFDs
Rotation
Cropping
Normalisation
Levels Correction
Multi page
Tagging
Open source
Commercial
Cuneiform
Tesseract
Ocropus
GOCR
Page Layout Analysis
ABBYY FineReader
Acrobat
Leptonica
Metadata
Manual
Automatic
Persons
Dates
Organisations
Locations
Formats
hOCR
Text
XML
Manual
Import
Z39.50
SRU/SRW
Engine
Zebra
XML
Z39.50
RDBMS
Postgres
MySQL
Search
Pull from LMS
Search Multiple Databases
Results
Expose Web APIs
Other Library Systems
Z39.50
SRU/SRW
Facets
Page Previews
Ranked
Sortable
Filters
Web Accessible
Simple Keyword Searching
Encourage Exploration
Tagging
Advanced Search
Saved Searches
Social Sharing, Integration
Web Browser Accessible
Auto Updating Downloadable PDFs
User Correctable Text
In Document Searching
Highlight Search Results
Potential Conversion to Other Formats
Most common process of digitisation for public consumption
Scan / Capture
Generate PDF
OCR
Indexed by Content Management System
Link to Downloadable PDF (Uncorrected OCR) (Links only to Document)
How can we do this better?
Inspirational Resources
National Library of Australia - Australian Newspapers
http://newspapers.nla.gov.au/
Google Docs
http://docs.google.com
Informit - Text Searchable Content
Proposed Digitisation Process
Scan / Capture
Semi Auto Cropping and Rotation Correction
Optimise Each Page for OCR
OCR Pages - Retain Positional Information (hOCR)
Post OCR Processing - Spell checking & correction of common OCR errors
Natural Language Processing - Auto Extract Names, Organisations, Locations & Dates from Text and Use for tagging
Store as XML - Generate Page Level XML Index Files
Add/Update XML Indexing Server
Fully Automated Process
Generate Searchable PDF
Generate Web Friendly Versions of each page
Full Text Search - Web Services & Z39.50
Downloadable PDF
Google Docs Style Interface - Individual Line Highlighting to Show search results
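The post-OCR processing step (correction of common OCR errors) could look something like the sketch below. It is only an illustration, not the code used in the prototype: the function names and the correction table are hypothetical, and the sample lines are taken from the Eastern Reporter page shown later in this deck.

```python
import re


def join_hyphenated_lines(lines):
    """Join OCR lines into one string, re-attaching words that were split
    by a line-end hyphen (e.g. "Chris-" + "tians" -> "Christians")."""
    text = ""
    for line in lines:
        line = line.strip()
        if text.endswith("-") and line[:1].islower():
            # Previous line ended mid-word: drop the hyphen and continue the word.
            text = text[:-1] + line
        elif text:
            text += " " + line
        else:
            text = line
    return text


# Hypothetical correction table; a real one would be built from errors
# actually observed in the collection being digitised.
COMMON_FIXES = {
    r"\bthei\.r\b": "their",
    r"\bofa\b": "of a",
}


def fix_common_errors(text):
    """Apply each regular-expression correction in turn."""
    for pattern, replacement in COMMON_FIXES.items():
        text = re.sub(pattern, replacement, text)
    return text


ocr_lines = [
    "IN a display of unity, Muslims and Chris-",
    "tians gathered at Dianella Uniting Church",
    "last Thursday to share thei.r experiences",
    "and pray for peace.",
]
cleaned = fix_common_errors(join_hyphenated_lines(ocr_lines))
print(cleaned)
```

One limitation of this simple rule: a genuinely hyphenated compound that happens to break at the end of a line would also lose its hyphen, so a dictionary check before joining may be worth adding.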
Available Open Source Projects
OCRopus - Page Layout Analysis
http://code.google.com/p/ocropus/
Tesseract OCR - OCR
http://code.google.com/p/tesseract-ocr/
ImageMagick - Image Processing
http://www.imagemagick.org/
Index Data Zebra - XML Indexing
http://www.indexdata.com/zebra
Index Data Pazpar2 - Federated Search
http://www.indexdata.com/pazpar2
Existing Web Technologies - PHP, HTML, CSS etc
DIY Book Scanner Project
www.diybookscanner.org
Basic Architecture
Discovery Layer (PHP, HTML, CSS)
Federated Search - Using Pazpar2 - Z39.50, SRU, SRW
Full Text Search - Zebra XML Indexer via Z39.50
LMS & External Databases - Existing via Z39.50
XML Data Files - MARC, Dublin Core, OAI-PMH
Document Viewer / Editor (PHP, HTML, CSS)
Ingest / Digitisation (PHP, HTML, CSS)
OCR & NLP (Document Processing, OCR & Natural Language Processing)
Downloadable Version - Automatic Generation of Searchable PDF, Text Files etc (Updated from User Alterations)
External Resources - Crowdsourcing OCR Corrections & Possible translation of handwritten documents
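To make the search side of the architecture a little more concrete, here is a minimal sketch of building an SRU searchRetrieve request for the full text index. The parameter names (version, operation, query, startRecord, maximumRecords) come from the SRU 1.1 standard; the host, port, database name and the CQL index name "text" are assumptions made purely for illustration and would depend on how the indexing server is configured.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, start=1, maximum=10):
    """Build an SRU 1.1 searchRetrieve URL for a CQL query."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": cql_query,
        "startRecord": start,
        "maximumRecords": maximum,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and index name, for illustration only.
url = sru_search_url("http://localhost:9999/digindex", 'text all "Dianella peace"')
print(url)
```

The discovery layer could fetch this URL and turn the XML response into a results page; the same query could equally be sent through Pazpar2 when several catalogues are searched at once.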
Converting Images for OCR
Convert to Grayscale
Generate Text Image Mask
Clean Up Background Noise
OCR Version
OCRopus Page Layout Analysis
ImageMagick Image Manipulation
Combined
Images to Text
Image for OCR Processing
Tesseract OCR to hOCR File
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><head><title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" >
<meta name='ocr-system' content='tesseract'></head>
<body>
<div class='ocr_page' id='page_1' title='image "/var/digindex/repository/eastern_reporter/2010/10/5/ocr/1-masked.png"; bbox 0 0 2161 3247'>
<div class='ocr_carea' id='block_1_1' title="bbox 200 46 1858 233">
<p class='ocr_par'><span class='ocr_line' id='line_1_1' title="bbox 201 50 1858 230">
  <span class='ocr_word' id='word_1_1' title="bbox 1058 50 1211 196"><span class='xocr_word' id='xword_1_1' title="x_wconf -6">R</span></span>
  <span class='ocr_word' id='word_1_2' title="bbox 1319 88 1858 230"><span class='xocr_word' id='xword_1_2' title="x_wconf -4"> r </span></span>
</span></p>
</div>
<div class='ocr_carea' id='block_1_2' title="bbox 47 1855 241 1883">
<p class='ocr_par'><span class='ocr_line' id='line_1_2' title="bbox 47 1855 241 1882">
  <span class='ocr_word' id='word_1_3' title="bbox 47 1855 77 1882"><span class='xocr_word' id='xword_1_3' title="x_wconf -2">By</span></span>
  <span class='ocr_word' id='word_1_4' title="bbox 87 1855 153 1877"><span class='xocr_word' id='xword_1_4' title="x_wconf -3">LIAM</span></span>
  <span class='ocr_word' id='word_1_5' title="bbox 163 1856 241 1878"><span class='xocr_word' id='xword_1_5' title="x_wconf -2">CROY</span></span>
</span></p>
</div>
<div class='ocr_carea' id='block_1_3' title="bbox 43 1909 533 2404">
<p class='ocr_par'><span class='ocr_line' id='line_1_3' title="bbox 46 1910 531 1934">
  <span class='ocr_word' id='word_1_6' title="bbox 46 1910 72 1928"><span class='xocr_word' id='xword_1_6' title="x_wconf -3">IN</span></span>
  <span class='ocr_word' id='word_1_7' title="bbox 83 1914 94 1928"><span class='xocr_word' id='xword_1_7' title="x_wconf -2">a</span></span>
  <span class='ocr_word' id='word_1_8' title="bbox 105 1910 185 1933"><span
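As a rough illustration of how the positional information in an hOCR file can be read back out, the sketch below uses Python's standard html.parser to collect each ocr_line span, its bounding box and its words, and then writes the line in the top/left/width/height form used by the storage XML on this slide. The class and function names are hypothetical; this is not the prototype's converter, only one way the conversion could be approached.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title such as 'bbox 47 1855 241 1882'."""
    for prop in title.split(";"):
        parts = prop.split()
        if parts and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:5])
    raise ValueError("no bbox in title: %r" % title)


class HocrLineParser(HTMLParser):
    """Collects one record per <span class='ocr_line'>."""

    def __init__(self):
        super().__init__()
        self.lines = []       # finished line records
        self._open_spans = []  # True for each open span that is an ocr_line
        self._current = None   # line record being filled in

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        is_line = attrs.get("class") == "ocr_line"
        self._open_spans.append(is_line)
        if is_line:
            x0, y0, x1, y1 = parse_bbox(attrs.get("title", ""))
            self._current = {"id": attrs.get("id"), "top": y0, "left": x0,
                             "width": x1 - x0, "height": y1 - y0, "words": []}

    def handle_data(self, data):
        if self._current is not None and data.strip():
            self._current["words"].append(data.strip())

    def handle_endtag(self, tag):
        if tag != "span" or not self._open_spans:
            return
        if self._open_spans.pop() and self._current is not None:
            self._current["text"] = " ".join(self._current.pop("words"))
            self.lines.append(self._current)
            self._current = None


def to_xml_line(rec):
    """Serialise a line record as a <line> element with its geometry."""
    el = ET.Element("line", id=rec["id"], top=str(rec["top"]), left=str(rec["left"]),
                    width=str(rec["width"]), height=str(rec["height"]))
    el.text = rec["text"]
    return ET.tostring(el, encoding="unicode")


SAMPLE = """<span class='ocr_line' id='line_1_2' title="bbox 47 1855 241 1882">
<span class='ocr_word' id='word_1_3' title="bbox 47 1855 77 1882"><span class='xocr_word' id='xword_1_3' title="x_wconf -2">By</span></span>
<span class='ocr_word' id='word_1_4' title="bbox 87 1855 153 1877"><span class='xocr_word' id='xword_1_4' title="x_wconf -3">LIAM</span></span>
<span class='ocr_word' id='word_1_5' title="bbox 163 1856 241 1878"><span class='xocr_word' id='xword_1_5' title="x_wconf -2">CROY</span></span>
</span>"""

parser = HocrLineParser()
parser.feed(SAMPLE)
for record in parser.lines:
    print(to_xml_line(record))
```

The width and height are simply the differences between the bounding box corners, which agrees with the sample data: the hOCR box 47 1855 241 1882 corresponds to the stored line with left 47, top 1855, width 194 and height 27.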
Convert hOCR to XML for Storage
<document>
<metadata><title>Eastern Reporter Tuesday, October 5, 2010</title><id>eastern_reporter/2010/10/5</id></metadata>
<pages>
<page id="0" origWidth="3648" origHeight="2736" rotate="-90.5" crop="2199x3321+147+147"/>
<page id="1" origWidth="3648" origHeight="2736" rotate="91" path="odd/IMG_0946.JPG" crop="2161x3247+374+274" width="2161" height="3247">
  <paragraph>
    <line id="line_1_1" top="50" left="201" width="1657" height="180">R r</line>
  </paragraph>
  <paragraph>
    <line id="line_1_2" top="1855" left="47" width="194" height="27">By LIAM CROY</line>
  </paragraph>
  <paragraph>
    <line id="line_1_3" top="1910" left="46" width="485" height="24">IN a display of unity, Muslims and Chris-</line>
    <line id="line_1_4" top="1937" left="45" width="486" height="26">tians gathered at Dianella Uniting Church</line>
    <line id="line_1_5" top="1965" left="45" width="485" height="26">last Thursday to share thei.r experiences</line>
    <line id="line_1_6" top="1993" left="45" width="212" height="24">and pray for peace.</line>
  </paragraph>
  <paragraph>
    <line id="line_1_7" top="2020" left="79" width="451" height="25">Sheikh Muhammad Agherdien of the</line>
  </paragraph>
  <paragraph>
    <line id="line_1_8" top="2048" left="46" width="484" height="25">Mirrabooka mosque opened the service</line>
    <line id="line_1_9" top="2076" left="46" width="484" height="26">with a verse of the Islamic religious text,</line>
    <line id="line_1_10" top="2103" left="45" width="117" height="20">the Koran:</line>
  </paragraph>
  <paragraph>
    <line id="line_1_11" top="2131" left="79" width="451" height="27">&#x201C;Oh People! Behold, we have created you</line>
  </paragraph>
  <paragraph>
    <line id="line_1_12" top="2158" left="46" width="331" height="22">all out ofa male and a female.</line>
  </paragraph>
  <paragraph>
    <line id="line_1_13" top="2187" left="79" width="451" height="25">&#x201C;And we have made you into nations</line>
  </paragraph>
  <paragraph>
    <line id="line_1_14" top="2214" left="46"
Sample Auto Generated Tags
IN a display of unity , [MISC Muslims ] and [MISC Chris- ] , tians
gathered at [ORG Dianella Uniting Church ] , last Thursday to share
thei.r experiences , and pray for peace.
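The bracketed output above can be turned into tag data with a small regular expression. The sketch below is an assumption about how that might be done, not the prototype's code: it treats any "[TYPE some words ]" group as one tag, and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Matches groups such as "[ORG Dianella Uniting Church ]":
# an upper-case tag type, then the tagged words, then the closing bracket.
TAG_PATTERN = re.compile(r"\[([A-Z]+)\s+([^\]]+?)\s*\]")


def extract_tags(tagged_text):
    """Return a list of (tag_type, text) pairs in order of appearance."""
    return TAG_PATTERN.findall(tagged_text)


def group_by_type(tags):
    """Group tagged values under their tag type, e.g. {"ORG": [...]}."""
    grouped = defaultdict(list)
    for tag_type, value in tags:
        grouped[tag_type].append(value)
    return dict(grouped)


tagged = ("IN a display of unity , [MISC Muslims ] and [MISC Chris- ] , tians "
          "gathered at [ORG Dianella Uniting Church ] , last Thursday to share "
          "thei.r experiences , and pray for peace.")

tags = extract_tags(tagged)
print(group_by_type(tags))
```

Note that "Chris-" comes through as a tag of its own because the word was split across two printed lines, which suggests rejoining hyphenated words before the text is sent to the tagger.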
Demo
Prototype Interface for Ingesting Pages from Book Scanner
Perform Basic Image Rotation and Cropping
Rotation and cropping can be replicated to other pages
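One way to replicate a page's rotation and crop settings to other pages is to build the same ImageMagick command for each of them. The sketch below only constructs the command lines and does not run them. The -rotate, -crop and +repage options are standard ImageMagick options and the rotate and crop values are taken from the sample page XML, but the function names and the output file names are hypothetical. On ImageMagick 7 the executable is usually called "magick" rather than "convert".

```python
def convert_command(src, dst, rotate, crop):
    """Build an ImageMagick command that rotates, then crops, one image."""
    return ["convert", src,
            "-rotate", str(rotate),   # degrees, e.g. 91 or -90.5
            "-crop", crop,            # geometry, e.g. "2161x3247+374+274"
            "+repage",                # reset the canvas after cropping
            dst]


def replicate_settings(settings, pages):
    """Apply one page's rotate/crop settings to a list of (src, dst) pairs."""
    return [convert_command(src, dst, settings["rotate"], settings["crop"])
            for src, dst in pages]


# Settings as stored for page 1 in the sample XML.
odd_page_settings = {"rotate": 91, "crop": "2161x3247+374+274"}

# Hypothetical file names for further pages from the same camera.
other_odd_pages = [
    ("odd/IMG_0947.JPG", "ocr/3.png"),
    ("odd/IMG_0948.JPG", "ocr/5.png"),
]

for command in replicate_settings(odd_page_settings, other_odd_pages):
    print(" ".join(command))
```

Each command list could be passed to subprocess.run to perform the actual conversion. With a dual camera setup it probably makes sense to keep one set of settings for the odd pages and another for the even pages, since each camera sees the book from a fixed position.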
Prototype Search Pages
The facets on the left of the results are auto-generated from the natural language processing tags
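Facet counts of this kind could be produced by counting, for each tag type, how many documents carry each tagged value. The following is a minimal sketch under that assumption. The first document's tags come from the sample page in this deck; the second document and its tags are made up for illustration, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def build_facets(documents):
    """Count, per tag type, the number of documents containing each value.

    `documents` maps a document id to a list of (tag_type, value) pairs.
    """
    facets = defaultdict(Counter)
    for doc_id, tags in documents.items():
        # Use a set so a value repeated within one document counts once.
        for tag_type, value in set(tags):
            facets[tag_type][value] += 1
    return facets


documents = {
    "eastern_reporter/2010/10/5": [
        ("MISC", "Muslims"),
        ("ORG", "Dianella Uniting Church"),
        ("ORG", "Dianella Uniting Church"),
    ],
    # Hypothetical second document, for illustration only.
    "example/2010/10/12": [
        ("ORG", "Dianella Uniting Church"),
        ("LOC", "Mirrabooka"),
    ],
}

facets = build_facets(documents)
for tag_type in sorted(facets):
    print(tag_type, facets[tag_type].most_common(5))
```

In a running system the counts would more likely come from the indexing server for the current result set, so this only shows the shape of the data behind the facet list.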
Viewing Document Pages
Viewing Document Pages with Highlighted Results
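Because each stored line keeps its top, left, width and height, highlighting a search hit can be reduced to finding the matching lines and scaling their boxes to the size of the web-friendly page image. The sketch below illustrates that idea with lines copied from the sample page XML. The function name and the display width are assumptions, and the prototype may well do this differently.

```python
import xml.etree.ElementTree as ET

PAGE_XML = """<page id="1" width="2161" height="3247"><paragraph>
<line id="line_1_3" top="1910" left="46" width="485" height="24">IN a display of unity, Muslims and Chris-</line>
<line id="line_1_4" top="1937" left="45" width="486" height="26">tians gathered at Dianella Uniting Church</line>
<line id="line_1_5" top="1965" left="45" width="485" height="26">last Thursday to share thei.r experiences</line>
<line id="line_1_6" top="1993" left="45" width="212" height="24">and pray for peace.</line>
</paragraph></page>"""


def highlight_boxes(page_xml, term, display_width):
    """Return scaled boxes for every line whose text contains `term`."""
    page = ET.fromstring(page_xml)
    # Ratio between the displayed image and the image that was OCRed.
    scale = display_width / int(page.get("width"))
    boxes = []
    for line in page.iter("line"):
        if term.lower() in (line.text or "").lower():
            boxes.append({
                "id": line.get("id"),
                "left": round(int(line.get("left")) * scale),
                "top": round(int(line.get("top")) * scale),
                "width": round(int(line.get("width")) * scale),
                "height": round(int(line.get("height")) * scale),
            })
    return boxes


# Hypothetical display width of 800 pixels for the web-friendly page image.
for box in highlight_boxes(PAGE_XML, "peace", 800):
    print(box)
```

Each box could then be drawn as an absolutely positioned element over the page image. Matching is done per line here, which fits the individual line highlighting described earlier; word level highlighting would need the word boxes from the hOCR file as well.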
Editing Document with Auto Updating of Indexer
Pazpar2 can be used to build alternative interfaces for searching multiple existing catalogues
Questions?
More Info & Credits
Tesseract-OCR
http://code.google.com/p/tesseract-ocr/
OCRopus
http://code.google.com/p/ocropus/
Do-It-Yourself Book Scanning
http://www.diybookscanner.org/
CHDK - Canon Hack Development Kit
http://chdk.wikia.com/wiki/CHDK
Zebra - XML Indexing
http://www.indexdata.com/zebra
Pazpar2 - Federated Search
http://www.indexdata.com/pazpar2
Cuneiform
http://en.wikipedia.org/wiki/CuneiForm_(software)
EyeFi Python Server
http://returnbooleantrue.blogspot.com/2009/01/eye-fi-standalone-server.html/
hOCR - HTML OCR
http://en.wikipedia.org/wiki/HOCR
OpenNLP
http://opennlp.apache.org/
Illinois Named Entity Tagger
http://cogcomp.cs.illinois.edu/page/software_view/4
From the printed page to discoverable content library camp perth 2010

  • 1. From the Printed Page to Discoverable Content the open source way Steven Miles @stevermiles stevenmiles.com.au Tuesday, 18 January 2011
  • 2. About Me Tuesday, 18 January 2011
  • 3. About Me Web Application Developer State Library of Western Australia @ Tuesday, 18 January 2011
  • 4. About Me Web Application Developer State Library of Western Australia @ S.L.U.R.P. Digital Content Ingestion & Integration with LMS PC Reservation PC Reservations and Booking System PLO Public Libraries Online Venues Bookings Venues Booking & Reservation System P.URL Permanent URL Tuesday, 18 January 2011
  • 5. WARNING !!!! Lots of technical stuff! Tuesday, 18 January 2011
  • 6. How can I make scanned content more discoverable? presentation Digitisation Indexing Capture DIY Scanner Existing Documents Dual Camera Setup Single Camera Setup Commercial Scanners Image Processing OCR Document Scanners MFD’s Rotation Cropping Normalisation Levels Correction Multi page Tagging Open source Commercial Cuneiform Tesseract Ocropus GOCR Page Layout Analysis Abby Fine Reader Acrobat leptonica Metadata ManualAutomatic PersonsLocations Dates Organisations Locations Formats hOCR Text XML Manual Import Z39.50 SRU/SRW Engine Zebra XML Z39.50 RBMS Postgres MySQL Search Pull from LMS Search Multiple Databases Results Expose Web API’s Other Library Systems Z39.50 SRU/SRW Facets Page Previews Ranked Sortable Filters Web Accessible Simple Keyword Searching Encourage Exploration Tagging Advanced Search Saved Searches Social Sharing, Intergration Web Browser Accessible Auto Updating Downloadable PDF’s User Correctable Text In Document Searching Highlight Search Results Potential Conversion to Other Formats Tuesday, 18 January 2011
  • 7. Most common process of digitisation for public consumption Scan / Capture Generate PDF OCR Indexed by Content Management System Link to Downloadable PDF(Uncorrected OCR) (Links only to Document) How can we do this better? Tuesday, 18 January 2011
  • 8. Inspirational Resources National Libraries Australia - Australian Newspapers http://newspapers.nla.gov.au/ Google Docs http://docs.google.com Informit -Text Searchable Content Tuesday, 18 January 2011
  • 9. Scan / Capture Semi Auto Cropping and Rotation Correction Optimise Each Page for OCR OCR Pages Retain Positional Information (hocr) Post OCR Processing Spell checking & correction of common OCR errors Natural Language Processing Auto Extract Names, Organisations, Locations & Dates from Text and Use for tagging Store as XML Generate Page Level XML Index Files Add/Update XML Indexing Server Fully Automated Process Generate Searchable PDF Generate Web FriendlyVersions of each page Full Text Search Web Services & Z39.50 Downloadable PDF Google Docs Style Interface Individual Line Highlighting to Show search results Proposed Digitisation Process Tuesday, 18 January 2011
  • 10. Available Open Source Projects Ocropus - Page Layout Analysis http://code.google.com/p/ocropus/ Tesseract OCR - OCR http://code.google.com/p/ocropus/ Image Magick - Image Processing http://www.imagemagick.org/ Index Data Zebra -XML Indexing http://www.indexdata.com/zebra Index Data Pazpar2 -Federated Search http://www.indexdata.com/pazpar2 Existing Web Technologies - PHP, HTML, CSS etc Tuesday, 18 January 2011
  • 12. Discovery Layer (PHP, HTML,CSS) Federated Search Using PazPar2 - Z39.50, SRU, SRW Full Text Search Zebra - XML Indexer via Z39.50 LMS & External Databases Existing via Z39.50 XML Data Files MARC, Dublin Core, OAI-PM DocumentViewer / Editor (PHP, HTML,CSS) Ingest / Digitisation (PHP,HTML,CSS) OCR & NLP (Document Processing, OCR & Natural Language Processing) DownloadableVersion Automatic Generation of Searchable PDF,Text Files etc (Updated from User Alterations) External Resources Basic Architecture Crowdsourcing OCR Corrections & Possible translation on handwritten documents Tuesday, 18 January 2011
  • 13. Converting Images for OCR Convert to Grayscale Generate Text Image Mask Clean Up Background Noise OCRVersion OCRopus Page Layout Analysis Image Magick Image Manipulation Combined Tuesday, 18 January 2011
  • 14. Images to Text: Image for OCR Processing
    Tesseract OCR to hOCR File (sample is cut off on the slide):
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
    <html><head><title></title>
    <meta http-equiv="Content-Type" content="text/html;charset=utf-8" ><meta name='ocr-system' content='tesseract'></head>
    <body><div class='ocr_page' id='page_1' title='image "/var/digindex/repository/eastern_reporter/2010/10/5/ocr/1-masked.png"; bbox 0 0 2161 3247'>
    <div class='ocr_carea' id='block_1_1' title="bbox 200 46 1858 233">
    <p class='ocr_par'><span class='ocr_line' id='line_1_1' title="bbox 201 50 1858 230"><span class='ocr_word' id='word_1_1' title="bbox 1058 50 1211 196"><span class='xocr_word' id='xword_1_1' title="x_wconf -6">R</span></span> <span class='ocr_word' id='word_1_2' title="bbox 1319 88 1858 230"><span class='xocr_word' id='xword_1_2' title="x_wconf -4"> r </span></span></span></p>
    </div><div class='ocr_carea' id='block_1_2' title="bbox 47 1855 241 1883">
    <p class='ocr_par'><span class='ocr_line' id='line_1_2' title="bbox 47 1855 241 1882"><span class='ocr_word' id='word_1_3' title="bbox 47 1855 77 1882"><span class='xocr_word' id='xword_1_3' title="x_wconf -2">By</span></span> <span class='ocr_word' id='word_1_4' title="bbox 87 1855 153 1877"><span class='xocr_word' id='xword_1_4' title="x_wconf -3">LIAM</span></span> <span class='ocr_word' id='word_1_5' title="bbox 163 1856 241 1878"><span class='xocr_word' id='xword_1_5' title="x_wconf -2">CROY</span></span></span></p></div>
    <div class='ocr_carea' id='block_1_3' title="bbox 43 1909 533 2404"><p class='ocr_par'>
    <span class='ocr_line' id='line_1_3' title="bbox 46 1910 531 1934"><span class='ocr_word' id='word_1_6' title="bbox 46 1910 72 1928"><span class='xocr_word' id='xword_1_6' title="x_wconf -3">IN</span></span> <span class='ocr_word' id='word_1_7' title="bbox 83 1914 94 1928"><span class='xocr_word' id='xword_1_7' title="x_wconf -2">a</span></span> <span class='ocr_word' id='word_1_8' title="bbox 105 1910 185 1933"><span
    Convert hOCR to XML for Storage (sample is cut off on the slide):
    <document><metadata><title>Eastern Reporter Tuesday, October 5, 2010</title><id>eastern_reporter/2010/10/5</id></metadata>
    <pages><page id="0" origWidth="3648" origHeight="2736" rotate="-90.5" crop="2199x3321+147+147"/>
    <page id="1" origWidth="3648" origHeight="2736" rotate="91" path="odd/IMG_0946.JPG" crop="2161x3247+374+274" width="2161" height="3247">
    <paragraph><line id="line_1_1" top="50" left="201" width="1657" height="180">R r</line></paragraph>
    <paragraph><line id="line_1_2" top="1855" left="47" width="194" height="27">By LIAM CROY</line></paragraph>
    <paragraph><line id="line_1_3" top="1910" left="46" width="485" height="24">IN a display of unity, Muslims and Chris-</line><line id="line_1_4" top="1937" left="45" width="486" height="26">tians gathered at Dianella Uniting Church</line><line id="line_1_5" top="1965" left="45" width="485" height="26">last Thursday to share thei.r experiences</line><line id="line_1_6" top="1993" left="45" width="212" height="24">and pray for peace.</line></paragraph>
    <paragraph><line id="line_1_7" top="2020" left="79" width="451" height="25">Sheikh Muhammad Agherdien of the</line></paragraph>
    <paragraph><line id="line_1_8" top="2048" left="46" width="484" height="25">Mirrabooka mosque opened the service</line><line id="line_1_9" top="2076" left="46" width="484" height="26">with a verse of the Islamic religious text,</line><line id="line_1_10" top="2103" left="45" width="117" height="20">the Koran:</line></paragraph>
    <paragraph><line id="line_1_11" top="2131" left="79" width="451" height="27">&#x201C;Oh People! Behold, we have created you</line></paragraph>
    <paragraph><line id="line_1_12" top="2158" left="46" width="331" height="22">all out ofa male and a female.</line></paragraph>
    <paragraph><line id="line_1_13" top="2187" left="79" width="451" height="25">&#x201C;And we have made you into nations</line></paragraph>
    <paragraph><line id="line_1_14" top="2214" left="46"
    Sample Auto-Generated Tags:
    IN a display of unity , [MISC Muslims ] and [MISC Chris- ] , tians gathered at [ORG Dianella Uniting Church ] , last Thursday to share thei.r experiences , and pray for peace.
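The hOCR-to-XML step on slide 14 can be approximated with the standard library alone. The sketch below is a guess at the conversion, not the code behind the talk: it assumes the hOCR uses the `ocr_par` and `ocr_line` classes seen in the sample, it mirrors the `paragraph`/`line` element names and the `top`/`left`/`width`/`height` attributes from the slide's XML, and the names `HocrLines`, `parse_bbox` and `hocr_to_xml` are hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if parts and parts[0] == "bbox" and len(parts) >= 5:
            return tuple(int(v) for v in parts[1:5])
    return None


class HocrLines(HTMLParser):
    """Collect ocr_line spans, grouped by their enclosing ocr_par."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # list of paragraphs, each a list of line dicts
        self._spans = []      # stack of class names for open <span> tags
        self._line = None     # line currently being read, if any

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        cls = attr.get("class", "")
        if tag == "p" and cls == "ocr_par":
            self.paragraphs.append([])
        elif tag == "span":
            self._spans.append(cls)
            if cls == "ocr_line":
                self._line = {
                    "id": attr.get("id", ""),
                    "bbox": parse_bbox(attr.get("title", "")),
                    "text": [],
                }

    def handle_data(self, data):
        if self._line is not None:
            self._line["text"].append(data)

    def handle_endtag(self, tag):
        if tag != "span" or not self._spans:
            return
        if self._spans.pop() == "ocr_line" and self._line is not None:
            if not self.paragraphs:
                self.paragraphs.append([])
            self.paragraphs[-1].append(self._line)
            self._line = None


def hocr_to_xml(hocr):
    """Convert hOCR markup to a <page> element of paragraphs and lines."""
    parser = HocrLines()
    parser.feed(hocr)
    parser.close()
    page = ET.Element("page")
    for lines in parser.paragraphs:
        lines = [ln for ln in lines if ln["bbox"] is not None]
        if not lines:
            continue
        para = ET.SubElement(page, "paragraph")
        for ln in lines:
            x0, y0, x1, y1 = ln["bbox"]
            el = ET.SubElement(
                para, "line", id=ln["id"],
                top=str(y0), left=str(x0),
                width=str(x1 - x0), height=str(y1 - y0),
            )
            # Collapse the whitespace left between word spans.
            el.text = " ".join("".join(ln["text"]).split())
    return page
```

Storing width and height rather than the second corner matches the slide's sample, where the line box `47 1855 241 1882` becomes `left="47" top="1855" width="194" height="27"`.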
  • 16. Prototype Interface for Ingesting Pages from Book Scanner
  • 17. Perform Basic Image Rotation and Cropping. Rotation and cropping can be replicated to other pages.
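One way the "replicate to other pages" idea might work is to treat the rotation angle and crop geometry as a small record and stamp it onto every remaining page. The sketch below assumes ImageMagick-style `WxH+X+Y` crop strings (the same shape as the `crop` attribute in the stored XML) and an ImageMagick 6 `convert` command; `PageEdit`, `parse_crop`, `replicate_edit` and the `-cropped.png` output naming are all hypothetical.

```python
import os
import re
from dataclasses import dataclass

# WxH+X+Y, e.g. "2161x3247+374+274"
GEOMETRY = re.compile(r"^(\d+)x(\d+)\+(\d+)\+(\d+)$")


@dataclass(frozen=True)
class PageEdit:
    rotate: float  # degrees, applied before cropping
    crop: str      # crop geometry in WxH+X+Y form


def parse_crop(geometry):
    """Return (width, height, x, y) or raise ValueError if malformed."""
    match = GEOMETRY.match(geometry)
    if match is None:
        raise ValueError("bad crop geometry: %r" % geometry)
    return tuple(int(v) for v in match.groups())


def replicate_edit(edit, pages):
    """Build one convert command per page, all using the same edit."""
    parse_crop(edit.crop)  # validate once before building any command
    commands = []
    for path in pages:
        out = os.path.splitext(path)[0] + "-cropped.png"
        commands.append([
            "convert", path,
            "-rotate", format(edit.rotate, "g"), "+repage",
            "-crop", edit.crop, "+repage",
            out,
        ])
    return commands
```

In a dual-camera rig the odd and even pages are rotated in opposite directions, so in practice one edit record per camera would probably be replicated rather than one per book.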
  • 18. Prototype Search Pages. The results on the left are the auto-generated facets, based on the natural language processing tags.
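Facets like those on slide 18 could be derived by counting the bracketed entities in the tagger output. The sketch below assumes the `[TYPE text ]` notation shown in the sample tags on slide 14; the function name `facet_counts` is hypothetical, and no attempt is made to rejoin words hyphenated across lines such as `Chris-`.

```python
import re
from collections import Counter, defaultdict

# Matches "[ORG Dianella Uniting Church ]" style annotations.
TAGGED = re.compile(r"\[([A-Z]+)\s+(.*?)\s*\]")


def facet_counts(tagged_text):
    """Group tagged entities by type and count each distinct value."""
    facets = defaultdict(Counter)
    for entity_type, value in TAGGED.findall(tagged_text):
        facets[entity_type][value] += 1
    return {entity_type: dict(counts) for entity_type, counts in facets.items()}
```

Run over every page of a document, the per-type counts give the values and hit numbers a facet sidebar needs.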
  • 20. Viewing Document Pages with Highlighted Results
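The stored line coordinates make result highlighting fairly direct. The sketch below is speculative about how the prototype does it: it assumes a `page` element carrying a `width` attribute and `line` children with `left`, `top`, `width` and `height`, as in the XML sample on slide 14, and it highlights whole lines because that sample stores no per-word boxes. The name `highlight_boxes` is hypothetical.

```python
import xml.etree.ElementTree as ET


def highlight_boxes(page_xml, term, display_width):
    """Return (left, top, width, height) boxes for lines containing term.

    Coordinates are scaled from the stored page width to the width at
    which the page image is displayed in the browser.
    """
    page = ET.fromstring(page_xml)
    scale = display_width / float(page.get("width"))
    needle = term.lower()
    boxes = []
    for line in page.iter("line"):
        if needle in (line.text or "").lower():
            boxes.append(tuple(
                round(int(line.get(key)) * scale)
                for key in ("left", "top", "width", "height")
            ))
    return boxes
```

Each returned box can be drawn as an absolutely positioned overlay on top of the page image.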
  • 21. Editing Document with Auto Updating of Indexer
  • 22. PazPar2 can be used to build alternative interfaces for searching multiple existing catalogs
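PazPar2 is driven over HTTP with a `command` parameter (`init`, `search`, `show` and others), so an alternative interface mostly comes down to building those requests. The sketch below only constructs the URLs; the base address `http://localhost:9004/search.pz2` and the session value are placeholders, and the helper name `pz2_url` is hypothetical.

```python
from urllib.parse import urlencode


def pz2_url(base, command, **params):
    """Build a PazPar2 webservice URL for the given command."""
    query = {"command": command}
    query.update(params)
    return base + "?" + urlencode(query)


# Typical sequence (the session id comes from the XML returned by init):
BASE = "http://localhost:9004/search.pz2"  # placeholder address
init_url = pz2_url(BASE, "init")
search_url = pz2_url(BASE, "search", session="1", query="peace")
show_url = pz2_url(BASE, "show", session="1", start=0, num=20)
```

Because PazPar2 searches its targets in the background, a client would normally poll `show` until the merged result list stops changing.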
  • 24. More Info & Credits
    Tesseract-OCR: http://code.google.com/p/tesseract-ocr/
    OCRopus: http://code.google.com/p/ocropus/
    Do-It-Yourself Book Scanning: http://www.diybookscanner.org/
    CHDK - Canon Hack Development Kit: http://chdk.wikia.com/wiki/CHDK
    Zebra - XML Indexing: http://www.indexdata.com/zebra
    PazPar2 - Federated Search: http://www.indexdata.com/pazpar2
    Cuneiform: http://en.wikipedia.org/wiki/HOCR
    EyeFi Python Server: http://returnbooleantrue.blogspot.com/2009/01/eye-fi-standalone-server.html/
    hOCR - HTML OCR: http://en.wikipedia.org/wiki/HOCR
    OpenNLP: http://www.indexdata.com/pazpar2
    Illinois Named Entity Tagger: http://cogcomp.cs.illinois.edu/page/software_view/4