Optical Character Recognition (OCR) has evolved into a mature field of computer science, with applications in financial transfers, book digitisation and passport scanning. If you want to add OCR to your Java application, you have plenty of options, one of which is Tesseract.
Tesseract has become popular among software developers because of its accuracy, its open-source status and its active development, which Google sponsors. With the Tess4J JNA wrapper, it integrates easily into a Java project.
During this session, I will introduce Tesseract, discuss its pros and cons, and explain how and when to use it. I will compare it to its competitors and explain why and how we used it in a case study. To top it off, I will live-code a Java application that uses Tesseract and Tess4J to process some example images, so you'll be able to assess its accuracy for yourself.
Now in geometry, a ‘tesseract’ is the four-dimensional analog of a cube. So will Tesseract live up to its name and help your project to ‘enter the fourth dimension’? Attend this session and find out!
3. What is OCR?
/ first dimension
https://gifer.com/en/7U5F
4. Atechnology that can take written
words and convert then aes mo
aulerseadans Yom, proved meyren ne
fgntrom sng tne cortec cars ometimes,
ate fant gait sae and ptch, dar engugn
onthe pase, ana Your Bropared to
spend Several cantinas covtecing al me
one PrP areee e SP es ts anine Oe tat
came autas zersee
@hannotify
7. A technology that can take written words and convert them back into computer-readable form, provided they're in the right font, using the correct colors sometimes, at the right point size and pitch, dark enough on the paper, and you're prepared to spend several centuries correcting all the ones that came out as l's, all the O's that came out as zeroes, and all the colons that come out like semicolons.
8. A Proper Definition
Optical character recognition (...) is the mechanical or electronic conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene-photo (...) or from subtitle text superimposed on an image.
https://en.wikipedia.org/wiki/Optical_character_recognition
14. 1993
The Apple Newton becomes the first handheld computer to feature handwriting recognition.
https://commons.wikimedia.org/wiki/File:Newton_MessagePad_120_stylus.JPG
17. Book digitization
Also supports Ctrl+F.
https://upload.wikimedia.org/wikipedia/commons/6/65/Internet_Archive_book_scanner_1.jpg
18. Passport scanning
Gets you to your gate in time.
https://supertravelme.com/wp-content/uploads/2017/08/Changi-Airport-Terminal-4-Automated-Departures-Scan-Passport-and-Take-Photo-1G7A1491.jpg
19. Number plate recognition
Get your speeding ticket even faster!
https://commons.wikimedia.org/wiki/File:Nopeuskamera_20170619.jpg
22. Tesseract
Development started at Hewlett-Packard in 1985
Ported to Windows in 1996
Released as open-source in 2005
Development sponsored by Google since 2006
23. "The core feature, text recognition, is drastically better than anything else I've tried from the Open Source community."
- Anthony Kay in "Linux Journal", July 2007
https://www.linuxjournal.com/article/9676
26. Used by Google
For text detection on mobile devices
In video
In Gmail image spam detection
https://opensource.google.com/projects/tesseract
27. New features
in v3.0
support for over 100 languages
page layout analysis
in v4.0
LSTM recognition engine
29. Tess4J features
PDF input
Multi-page TIFF input
Image optimization
https://github.com/nguyenq/tess4j
30. Demo
Install Tesseract (for multilanguage support)
Add Tess4J dependency
Convert image to plain text (English)
Convert image to plain text (Greek)
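The slides do not show the Tess4J calls from the live demo. As a dependency-free stand-in, the sketch below shells out to the tesseract command-line tool installed in the first step, which is also a quick way to check that the installation and the language data work. It is not the demo code: the class name TesseractCli is hypothetical, and the image path is whatever you pass in. The language codes eng and ell are Tesseract's codes for English and Greek.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: runs the installed `tesseract` CLI instead of Tess4J.
class TesseractCli {

    // Builds `tesseract <image> stdout -l <language>`.
    // The output base "stdout" makes the CLI print the text instead of writing a file.
    static List<String> command(Path image, String language) {
        return List.of("tesseract", image.toString(), "stdout", "-l", language);
    }

    // Runs the command and returns the recognized text.
    static String recognize(Path image, String language)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(image, language))
                .redirectError(ProcessBuilder.Redirect.DISCARD) // ignore diagnostics
                .start();
        String text = new String(process.getInputStream().readAllBytes(),
                StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: TesseractCli <image> <language, e.g. eng or ell>");
            return;
        }
        System.out.print(recognize(Path.of(args[0]), args[1]));
    }
}
```

Running it once with eng and once with ell mirrors the last two demo steps. The Greek run only works if the Greek language data was installed along with Tesseract.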
31. Choosing the Right Library
/ third dimension
https://gifer.com/en/7U5F
33. ABBYY FineReader
Development started at ABBYY in 1993
Supports 192 languages
20 million users worldwide
Outputs to MS Office, RTF, HTML, (searchable) PDF and plain text
https://www.abbyy.com/en-eu/finereader
34. Google Cloud Vision API
Launched in 2016 by Google
Supports 56 languages
Outputs to JSON
Integrates nicely with Google Images and Google SafeSearch
https://cloud.google.com/vision
41.
|                         | ABBYY               | GCV                              | Tesseract           |
|-------------------------|---------------------|----------------------------------|---------------------|
| costs                   | $200 per computer   | $1.50 per 1000 images, per month | $0                  |
| languages               | 192                 | 56                               | 102                 |
| Java integration        | through SDK         | through REST API                 | through JNA wrapper |
| handwriting recognition | 'handprinted' text  | supported                        | not supported       |
| custom training         | supported           | not supported                    | supported           |
| accuracy                | 9/10                | 8/10                             | 7.5/10              |
https://www.digitisation.eu/download/website-files/IMPACT_D-EXT2_Pilot_report_PSNC.pdf
42. Case study
Paper archives going digital.
https://pxhere.com/en/photo/906364
53. What is confidence?
55. Reporting confidence
Tess4J supports two return types:
String (containing the OCR'ed text)
List<OCRResult> (text is written to a file)
    int confidence
    List<Word> words
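One way to use per-word confidence is to flag doubtful words for manual review. The sketch below is an illustration only and does not use the Tess4J types: ScoredWord is a hypothetical stand-in for a recognized word plus its confidence. It assumes a 0 to 100 scale, and the review threshold of 60 is an arbitrary choice.

```java
import java.util.List;
import java.util.stream.Collectors;

class ConfidenceReport {

    // Hypothetical stand-in for one recognized word and its confidence (0-100).
    static final class ScoredWord {
        final String text;
        final float confidence;

        ScoredWord(String text, float confidence) {
            this.text = text;
            this.confidence = confidence;
        }
    }

    // Average confidence over all words; 0.0 for an empty list.
    static double meanConfidence(List<ScoredWord> words) {
        return words.stream()
                .mapToDouble(word -> word.confidence)
                .average()
                .orElse(0.0);
    }

    // Texts of the words whose confidence falls below the threshold.
    static List<String> needsReview(List<ScoredWord> words, float threshold) {
        return words.stream()
                .filter(word -> word.confidence < threshold)
                .map(word -> word.text)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<ScoredWord> words = List.of(
                new ScoredWord("Invoice", 96f),
                new ScoredWord("t0tal", 41f),
                new ScoredWord("2019", 88f));
        System.out.println("mean confidence: " + meanConfidence(words));
        System.out.println("needs review: " + needsReview(words, 60f));
    }
}
```

A page-level mean hides individual misreads, so reporting the low-confidence words next to the mean is usually more helpful to whoever corrects the text.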
56. Multiple languages in a single document
Concatenate the language codes and separate them by a plus sign:
tesseract.setLanguage("eng+nld");
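When the set of languages is only known at runtime, the plus-separated string can be built from a list. This is a minimal sketch with a hypothetical helper class, LanguageCodes. The result is what you would pass to setLanguage as shown above.

```java
import java.util.List;

// Hypothetical helper that builds the plus-separated language string.
class LanguageCodes {

    // Joins language codes such as "eng" and "nld" into "eng+nld".
    static String combine(List<String> codes) {
        if (codes.isEmpty()) {
            throw new IllegalArgumentException("at least one language code is required");
        }
        return String.join("+", codes);
    }

    public static void main(String[] args) {
        System.out.println(combine(List.of("eng", "nld"))); // prints eng+nld
    }
}
```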
62. Image optimization
Tess4J is bundled with the ImageHelper class, which contains a few image optimization tricks.
convertImageToBinary(BufferedImage image)
convertImageToGrayscale(BufferedImage image)
invertImageColor(BufferedImage image)
rotateImage(BufferedImage image, double angle)
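To make clear what two of these operations do to the pixels, here is a plain java.awt sketch of a binary conversion and a colour inversion. It is not the ImageHelper implementation and may differ from it in detail: the class PreprocessSketch, the fixed luminance threshold and the luminance weights are assumptions made for this illustration.

```java
import java.awt.image.BufferedImage;

// Hypothetical stand-in showing what binarization and inversion do per pixel.
class PreprocessSketch {

    // Maps each pixel to pure black or white by comparing its luminance (0-255)
    // to the given threshold.
    static BufferedImage binarize(BufferedImage source, int threshold) {
        BufferedImage out = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < source.getHeight(); y++) {
            for (int x = 0; x < source.getWidth(); x++) {
                int rgb = source.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                int luminance = (299 * r + 587 * g + 114 * b) / 1000;
                out.setRGB(x, y, luminance >= threshold ? 0xFFFFFFFF : 0xFF000000);
            }
        }
        return out;
    }

    // Flips every colour channel, turning light-on-dark text into dark-on-light.
    static BufferedImage invert(BufferedImage source) {
        BufferedImage out = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < source.getHeight(); y++) {
            for (int x = 0; x < source.getWidth(); x++) {
                out.setRGB(x, y, source.getRGB(x, y) ^ 0x00FFFFFF);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        BufferedImage image = new BufferedImage(2, 1, BufferedImage.TYPE_INT_RGB);
        image.setRGB(0, 0, 0x202020); // dark grey pixel
        image.setRGB(1, 0, 0xE0E0E0); // light grey pixel
        BufferedImage binary = binarize(image, 128);
        System.out.printf("%08X %08X%n", binary.getRGB(0, 0), binary.getRGB(1, 0));
    }
}
```

A fixed threshold works for evenly lit scans. Photos with shadows generally need an adaptive threshold, which this sketch does not attempt.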
63. Still having problems?
https://tesseract-ocr.github.io/tessdoc/ImproveQuality
64. Speed/accuracy tradeoffs
Two types of training data:
https://github.com/tesseract-ocr/tessdata_fast
https://github.com/tesseract-ocr/tessdata_best
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#updated-data-files-for-version-400-september-15-2017
70. Custom training
Fine-tune (e.g. for an unusual font)
Cut off the top layer (e.g. for a new language)
Retrain from scratch (e.g. don't do this!)
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
75. Further reading
"An Overview of the Tesseract OCR Engine" by Ray Smith (https://research.google.com/pubs/archive/33418.pdf)
Useful resources
Tesseract on GitHub (https://github.com/tesseract-ocr/tesseract)
Try Tesseract online (newocr.com)