Entering the Fourth Dimension of OCR with Tesseract

Hanno Embregts @hannotify

https://commons.wikimedia.org/wiki/File:Animation_of_a_tesseract.webm

What is OCR?
/ first dimension
https://gifer.com/en/7U5F

Atechnology that can take written
words and convert then aes mo
aulerseadans Yom, proved meyren ne
fgntrom sng tne cortec cars ometimes,
ate fant gait sae and ptch, dar engugn
onthe pase, ana Your Bropared to
spend Several cantinas covtecing al me
one PrP areee e SP es ts anine Oe tat
came autas zersee

A technology that can take written words and convert them back into
computer-readable form, provided they're in the right font, using the
correct colors sometimes, at the right point size and pitch, dark enough on
the paper, and you're prepared to spend several centuries correcting all the
ones that came out as l's, all the O's that came out as zeroes, and all the
colons that come out like semicolons.

A Proper Definition
Optical character recognition (...) is the mechanical or electronic
conversion of images of typed, handwritten or printed text into
machine-encoded text, whether from a scanned document, a photo of a
document, a scene-photo (...) or from subtitle text superimposed on an image.
https://en.wikipedia.org/wiki/Optical_character_recognition

Pattern recognition
https://www.explainthatstuff.com/how-ocr-works.html

Feature detection
https://www.explainthatstuff.com/how-ocr-works.html

1929
Gustav Tauschek patents a basic OCR 'reading machine'.
https://www.pexels.com/photo/abstract-art-circle-clockwork-414579/

1960s
Postal services start using OCR for mail sorting.
https://pxhere.com/en/photo/139100

1993
The Apple Newton becomes the first handheld computer to feature
handwriting recognition.
https://commons.wikimedia.org/wiki/File:Newton_MessagePad_120_stylus.JPG

Applications

Financial transfers
Catch me if you can!
http://www.spielberg-ocr.com/stationery/check-OCR-A-font.jpg

Book digitization
Also supports Ctrl+F.
https://upload.wikimedia.org/wikipedia/commons/6/65/Internet_Archive_book_scanner_1.jpg

Passport scanning
Gets you to your gate in time.
https://supertravelme.com/wp-content/uploads/2017/08/Changi-Airport-Terminal-4-Automated-Departures-Scan-Passport-and-Take-Photo-1G7A1491.jpg

Number plate recognition
Get your speeding ticket even faster!
https://commons.wikimedia.org/wiki/File:Nopeuskamera_20170619.jpg

Getting Started
/ second dimension

Tesseract
Development started at Hewlett-Packard in 1985
Ported to Windows in 1996
Released as open source in 2005
Development sponsored by Google since 2006

"The core feature, text recognition, is drastically better than anything
else I've tried from the Open Source community."
Anthony Kay in "Linux Journal", July 2007
https://www.linuxjournal.com/article/9676

Features
character recognition
support for Unicode
input: JPEG, GIF, PNG, TIFF or BMP
output: searchable PDF, TSV, plain text or hOCR

hOCR example
<p class="ocr_par" lang="deu" title="bbox930">
<span class="ocr_line" title="bbox 348 797 1482 838; baseline -
<span class="ocrx_word" title="bbox 348 805 402 832; x_wconf
<span class="ocrx_word" title="bbox 1199 797 1343 832; x_wcon
<span class="ocrx_word" title="bbox 1362 805 1399 823; x_wcon
<span class="ocrx_word" title="bbox 1417 x_wconf 96">ver-</sp
</span>
</p>
https://en.wikipedia.org/wiki/HOCR
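
Each hOCR element keeps its geometry and confidence in the title attribute, as semicolon-separated properties such as bbox and x_wconf. The sketch below shows one way to read those two properties with plain regular expressions. It assumes a well-formed title with four integer bbox coordinates and an integer x_wconf; the class name HocrTitle and the sample title string are made up for illustration, and a real consumer would more likely run an HTML parser over the whole document.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Tesseract or Tess4J): reads the bbox and
// x_wconf properties from the title attribute of an hOCR element.
class HocrTitle {
    private static final Pattern BBOX =
            Pattern.compile("bbox\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)");
    private static final Pattern WCONF = Pattern.compile("x_wconf\\s+(\\d+)");

    /** Returns {x0, y0, x1, y1}, or null when the title has no complete bbox. */
    static int[] boundingBox(String title) {
        Matcher m = BBOX.matcher(title);
        if (!m.find()) {
            return null;
        }
        int[] box = new int[4];
        for (int i = 0; i < 4; i++) {
            box[i] = Integer.parseInt(m.group(i + 1));
        }
        return box;
    }

    /** Returns the word confidence, or -1 when x_wconf is absent. */
    static int wordConfidence(String title) {
        Matcher m = WCONF.matcher(title);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        // Illustrative title string, assembled for this sketch.
        String title = "bbox 348 805 402 832; x_wconf 96";
        int[] box = boundingBox(title);
        System.out.println("width=" + (box[2] - box[0])
                + " height=" + (box[3] - box[1])
                + " confidence=" + wordConfidence(title));
    }
}
```

Returning null or -1 for missing properties keeps the caller in charge of how to treat truncated or incomplete elements.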

Used by Google
For text detection on mobile devices
In video
In Gmail image spam detection
https://opensource.google.com/projects/tesseract

New features
in v3.0
  support for over 100 languages
  page layout analysis
in v4.0
  LSTM recognition engine

Tess4J
A Java JNA wrapper for Tesseract
https://www.pexels.com/photo/beans-beverage-caffeine-cappuccino-585754/

Tess4J features
PDF input
Multi-page TIFF input
Image optimization
https://github.com/nguyenq/tess4j

Demo
Install Tesseract (for multilanguage support)
Add Tess4J dependency
Convert image to plain text (English)
Convert image to plain text (Greek)
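
The demo itself goes through Tess4J. As a dependency-free sketch of the same two conversions, the code below shells out to the tesseract command-line tool instead, which is a swapped-in technique and not what the demo shows. It assumes tesseract is on the PATH and that the eng and ell (Greek) language data are installed; the class name TesseractCli and the file names are placeholders.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical wrapper around the tesseract command-line tool.
class TesseractCli {
    /** Builds the command line: tesseract <image> stdout -l <language>. */
    static List<String> command(String imagePath, String language) {
        return List.of("tesseract", imagePath, "stdout", "-l", language);
    }

    /** Runs Tesseract on one image and returns the recognized text. */
    static String recognize(String imagePath, String language)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(imagePath, language))
                .redirectError(ProcessBuilder.Redirect.DISCARD) // drop diagnostics
                .start();
        String text = new String(
                process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            // Without arguments, only show what would be executed.
            System.out.println("usage: TesseractCli <image> <language>");
            System.out.println("runs:  " + String.join(" ", command("greek.png", "ell")));
            return;
        }
        System.out.println(recognize(args[0], args[1]));
    }
}
```

Passing eng or ell as the second argument corresponds to the English and Greek steps of the demo.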

Choosing the Right Library
/ third dimension

Competitors

ABBYY FineReader
Development started at ABBYY in 1993
Supports 192 languages
20 million users worldwide
Outputs to MS Office, RTF, HTML, (searchable) PDF and plain text
https://www.abbyy.com/en-eu/finereader

Google Cloud Vision API
Launched in 2016 by Google
Supports 56 languages
Outputs to JSON
Integrates nicely with Google Images and Google SafeSearch
https://cloud.google.com/vision

                         ABBYY               GCV                               Tesseract
costs                    $200 per computer   $1.50 per 1000 images, per month  $0
languages                192                 56                                102
Java integration         through SDK         through REST API                  through JNA wrapper
handwriting recognition  'handprinted' text  supported                         not supported
custom training          supported           not supported                     supported
accuracy                 9/10                8/10                              7.5/10
https://www.digitisation.eu/download/website-files/IMPACT_D-EXT2_Pilot_report_PSNC.pdf

Case study
Paper archives going digital.

https://img.bhs4.com/db/6/db618954fa2c95aa8510cf415ec3acd89702eb81_large.jpg

Advanced Features
/ fourth dimension

What Advanced Features?
Reporting confidence
Multiple languages in a single document
Image optimization
Speed/accuracy tradeoffs
Training

Improving accuracy
To better recognize the expected input documents.

What is confidence?

Reporting confidence
Tess4J supports two return types:
String (containing the OCR'ed text)
List<OCRResult> (text is written to a file)
  int confidence
  List<Word> words
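
Once a confidence value is available per word, it can be averaged or used as a filter for manual review. The sketch below illustrates that idea with a hypothetical RecognizedWord class standing in for the words Tess4J reports; the 0-100 scale, the threshold of 80 and the sample words are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-processing of per-word confidence values.
class ConfidenceReport {
    /** Stand-in for a recognized word with its confidence (assumed 0-100). */
    static final class RecognizedWord {
        final String text;
        final float confidence;

        RecognizedWord(String text, float confidence) {
            this.text = text;
            this.confidence = confidence;
        }
    }

    /** Mean confidence over all words, or 0.0 for an empty list. */
    static double meanConfidence(List<RecognizedWord> words) {
        if (words.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (RecognizedWord word : words) {
            sum += word.confidence;
        }
        return sum / words.size();
    }

    /** Texts of the words whose confidence is below the threshold. */
    static List<String> wordsBelow(List<RecognizedWord> words, float threshold) {
        List<String> suspicious = new ArrayList<>();
        for (RecognizedWord word : words) {
            if (word.confidence < threshold) {
                suspicious.add(word.text);
            }
        }
        return suspicious;
    }

    public static void main(String[] args) {
        List<RecognizedWord> words = new ArrayList<>();
        words.add(new RecognizedWord("Tesseract", 96f)); // made-up sample values
        words.add(new RecognizedWord("0CR", 40f));
        System.out.println("mean confidence: " + meanConfidence(words));
        System.out.println("needs review: " + wordsBelow(words, 80f));
    }
}
```

A per-word filter like this tends to be more useful than a single page-level number, because it points at the exact words to check.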

Multiple languages in a single document
Concatenate the language codes and separate them by a plus sign:
tesseract.setLanguage("eng+nld");
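
When the set of languages is only known at runtime, the combined code can be built from a list. This is a small sketch of that step; the class name LanguageCodes and its validation rules are assumptions, not part of Tess4J.

```java
import java.util.List;

// Hypothetical helper that builds the "eng+nld" style language string.
class LanguageCodes {
    /** Joins language codes with a plus sign, e.g. [eng, nld] gives "eng+nld". */
    static String combine(List<String> codes) {
        if (codes.isEmpty()) {
            throw new IllegalArgumentException("at least one language code is required");
        }
        for (String code : codes) {
            // A blank code or an embedded '+' would produce a malformed result.
            if (code.trim().isEmpty() || code.contains("+")) {
                throw new IllegalArgumentException("invalid language code: '" + code + "'");
            }
        }
        return String.join("+", codes);
    }

    public static void main(String[] args) {
        System.out.println(combine(List.of("eng", "nld")));
    }
}
```

The resulting string can be handed to tesseract.setLanguage(...) exactly like the literal on the slide.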

Demo
Reporting confidence
Multiple languages in a single document

Image optimization
Tess4J is bundled with the ImageHelper class, which contains a few image
optimization tricks.

convertImageToBinary(BufferedImage image)
convertImageToGrayscale(BufferedImage image)
invertImageColor(BufferedImage image)
rotateImage(BufferedImage image, double angle)
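
To make the first of these optimizations concrete, here is a standard-library sketch of what converting an image to binary (pure black and white) involves. It is not Tess4J's implementation: the fixed threshold of 128, the luminance weights and the class name SimpleBinarizer are assumptions chosen for illustration.

```java
import java.awt.image.BufferedImage;

// Illustrative fixed-threshold binarization; not the ImageHelper implementation.
class SimpleBinarizer {
    /** Maps every pixel to black or white depending on its luminance. */
    static BufferedImage toBinary(BufferedImage source, int threshold) {
        BufferedImage result = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < source.getHeight(); y++) {
            for (int x = 0; x < source.getWidth(); x++) {
                int rgb = source.getRGB(x, y);
                int red = (rgb >> 16) & 0xFF;
                int green = (rgb >> 8) & 0xFF;
                int blue = rgb & 0xFF;
                // Weighted sum approximating perceived brightness.
                long luminance = Math.round(0.299 * red + 0.587 * green + 0.114 * blue);
                result.setRGB(x, y, luminance >= threshold ? 0xFFFFFFFF : 0xFF000000);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        BufferedImage sample = new BufferedImage(2, 1, BufferedImage.TYPE_INT_RGB);
        sample.setRGB(0, 0, 0x202020); // dark grey pixel
        sample.setRGB(1, 0, 0xE0E0E0); // light grey pixel
        BufferedImage binary = toBinary(sample, 128);
        System.out.printf("%06X %06X%n",
                binary.getRGB(0, 0) & 0xFFFFFF, binary.getRGB(1, 0) & 0xFFFFFF);
    }
}
```

A fixed threshold works for evenly lit scans; unevenly lit photos usually need an adaptive threshold instead.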

Still having problems?
https://tesseract-ocr.github.io/tessdoc/ImproveQuality

Speed/accuracy tradeoffs
Two types of training data:
https://github.com/tesseract-ocr/tessdata_fast
https://github.com/tesseract-ocr/tessdata_best
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#updated-data-files-for-version-400-september-15-2017

Demo
Image optimization
Speed/accuracy tradeoffs

Training data
400,000 textlines
4500 fonts
(for Latin-based languages)

Custom training
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

Fine tune (e.g. for an unusual font)
Cut off the top layer (e.g. for a new language)
Retrain from scratch (e.g. don't do this!)

http://irevolution.net/2013/06/17/recaptcha-for-disaster-response/

https://media.wired.com/photos/59323151edfced5820d0edfe/master/w_616,c_limit/Recaptcha_anchor@2x.gif

Further reading
https://www.pexels.com/photo/white-teddy-bear-reading-book-33196/

Further reading
"An Overview of the Tesseract OCR Engine" by Ray Smith (https://research.google.com/pubs/archive/33418.pdf)
Useful resources
Tesseract on GitHub (https://github.com/tesseract-ocr/tesseract)
Try Tesseract online (newocr.com)

Any questions?
https://www.pexels.com/photo/monopoly-car-piece-1634213/

Thank you! ☺
bit.do/tesseract-luxoft-slides
https://hannotify.github.io
@hannotify
hanno.embregts@infosupport.com
