Tesseract OCR Engine
What it is, where it came from, where it is going.
Ray Smith, Google Inc
OSCON 2007
Contents
• Introduction & history of OCR
• Tesseract architecture & methods
• Announcing Tesseract 2.00
• Training Tesseract
• Future enhancements
A Brief History of OCR
• What is Optical Character Recognition?

[Figure: "OCR" shown with a sample of scanned text: "My invention relates to statistical machines of the type in which successive comparisons are made between a character and a charac-"]
A Brief History of OCR
• OCR predates electronic computers!

[Figure: US Patent 1915993, Filed Apr 27, 1931]
A Brief History of OCR
• 1929 – Digit recognition machine
• 1953 – Alphanumeric recognition machine
• 1965 – US Mail sorting
• 1965 – British banking system
• 1976 – Kurzweil reading machine
• 1985 – Hardware-assisted PC software
• 1988 – Software-only PC software
• 1994-2000 – Industry consolidation
Tesseract Background
• Developed on HP-UX at HP between 1985 and 1994 to run in a desktop scanner.
• Came neck and neck with Caere and XIS in the 1995 UNLV test.
  (See http://www.isri.unlv.edu/downloads/AT-1995.pdf)
• Never used in an HP product.
• Open sourced in 2005. Now on: http://code.google.com/p/tesseract-ocr
• Highly portable.
Tesseract OCR Architecture
Input: Gray or Color Image [+ Region Polygons]
  -> Adaptive Thresholding -> Binary Image
  -> Connected Component Analysis -> Character Outlines
  -> Find Text Lines and Words -> Character Outlines Organized Into Words
  -> Recognize Word Pass 1 -> Recognize Word Pass 2
Adaptive Thresholding is Essential
Some examples of how difficult it can be to make a binary image,
taken from the UNLV Magazine set.
(http://www.isri.unlv.edu/ISRI/OCRtk)
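
The slide does not say which thresholding method Tesseract uses, so the following is only a minimal sketch of one common adaptive approach: compare each pixel with the mean brightness of its neighbourhood instead of with one page-wide level. The function names, the `radius` and `offset` parameters, and the synthetic test page are all assumptions made for illustration.

```python
def adaptive_threshold(gray, radius=2, offset=25):
    """Local-mean thresholding (illustrative, not Tesseract's own method).

    gray is a list of rows of 0-255 brightness values.
    Returns rows of 0/1 where 1 marks a pixel darker than its neighbourhood.
    """
    height, width = len(gray), len(gray[0])
    # Integral image with a zero border, so any window sum costs four lookups.
    integ = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += gray[y][x]
            integ[y + 1][x + 1] = integ[y][x + 1] + row_sum

    binary = []
    for y in range(height):
        y0, y1 = max(0, y - radius), min(height, y + radius + 1)
        row = []
        for x in range(width):
            x0, x1 = max(0, x - radius), min(width, x + radius + 1)
            total = integ[y1][x1] - integ[y0][x1] - integ[y1][x0] + integ[y0][x0]
            mean = total / ((y1 - y0) * (x1 - x0))
            row.append(1 if gray[y][x] < mean - offset else 0)
        binary.append(row)
    return binary


def global_threshold(gray, level=128):
    """Single page-wide threshold, shown only for comparison."""
    return [[1 if value < level else 0 for value in row] for row in gray]


def make_gradient_page(width=8, height=5):
    """Hypothetical test page: background brightens left to right, two ink dots."""
    page = [[80 + 20 * x for x in range(width)] for _ in range(height)]
    for x in (1, 6):
        page[2][x] -= 70  # ink is darker than its surroundings, not absolutely dark
    return page
```

On this synthetic page the global threshold marks the dim left-hand background as ink and misses the dot on the bright side, while the local-mean version keeps only the two dots.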
Baselines are rarely perfectly straight

• Text Line Finding – skew independent – published at ICDAR '95 Montreal.
  (http://scholar.google.com/scholar?q=skew+detection+smith)
• Baselines are approximated by quadratic splines to account for skew and curl.
• Meanline, ascender and descender lines are a constant displacement from baseline.
• Critical value is the x-height.
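
As a rough illustration of the two middle bullets, the sketch below fits a single quadratic to baseline sample points by least squares and then derives another line as a constant displacement from it. This is a simplification: the slide speaks of quadratic splines (several quadratic pieces), whereas this fits one piece only. The function names, the sample points, and the image convention that y grows downward are assumptions.

```python
def solve_linear(a, b):
    """Solve a small dense system a.x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        rest = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - rest) / m[r][r]
    return x


def fit_baseline(points):
    """Least-squares quadratic through (x, y) points; needs 3+ distinct x values.

    x is rescaled to [-1, 1] first so the normal equations stay well conditioned.
    Returns a function mapping x to the fitted baseline y.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mid = (min(xs) + max(xs)) / 2.0
    half = (max(xs) - min(xs)) / 2.0 or 1.0
    us = [(x - mid) / half for x in xs]

    def s(k):
        return sum(u ** k for u in us)

    def sy(k):
        return sum((u ** k) * y for u, y in zip(us, ys))

    a, b, c = solve_linear(
        [[s(4), s(3), s(2)], [s(3), s(2), s(1)], [s(2), s(1), s(0)]],
        [sy(2), sy(1), sy(0)],
    )

    def baseline(x):
        u = (x - mid) / half
        return a * u * u + b * u + c

    return baseline


def displaced(line, dy):
    """A line at a constant vertical displacement dy from another line."""
    return lambda x: line(x) + dy
```

With y growing downward, a meanline would be `displaced(baseline, -x_height)`, which is why the x-height is the critical value: once it is known, the meanline follows directly from the fitted baseline.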
Spaces between words are tricky too

• Italics, digits, punctuation all create special-case font-dependent spacing.
• Fully justified text in narrow columns can have vastly varying spacing on
  different lines.
Tesseract: Recognize Word


[Diagram boxes: Character Chopper, Character Associator, Done? (No / Yes),
Adapt to Word, Static Character Classifier, Dictionary,
Adaptive Character Classifier, Number Parser]
Outline Approximation



[Figure panels: Original Image | Outlines of components | Polygonal Approximation]



Polygonal approximation is a double-edged sword.
Noise and some pertinent information are both lost.
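
The trade-off can be seen with any polygonal simplifier. The sketch below uses the Ramer-Douglas-Peucker algorithm, chosen here only because it is well known; the slide does not say which approximation method Tesseract uses. It works on an open polyline rather than a closed outline to stay short, and the `tolerance` parameter and sample points are illustrative assumptions.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def simplify(points, tolerance):
    """Ramer-Douglas-Peucker simplification of an open polyline.

    Points closer than `tolerance` to the current chord are dropped,
    whether they are noise or genuine small detail.
    """
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, worst = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > worst:
            index, worst = i, d
    if worst <= tolerance:
        return [first, last]
    left = simplify(points[: index + 1], tolerance)
    right = simplify(points[index:], tolerance)
    return left[:-1] + right  # the split point appears in both halves


# A noisy L-shaped stroke (hypothetical outline fragment).
stroke = [(0, 0), (1, 0.1), (2, -0.1), (3, 0), (4, 0.05), (5, 0),
          (5, 1), (5.1, 2), (5, 3)]
```

A generous tolerance reduces `stroke` to its three corner points, which removes the jitter but would equally remove a real nick or serif of the same size; a tight tolerance keeps both.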
Tesseract: Features and Matching




[Figure panels: Prototype | Character to classify | Extracted Features |
Match of Prototype To Features | Match of Features To Prototype]

• Static classifier uses outline fragments as features. Broken characters are
  easily recognizable by a small-to-large matching process in the classifier.
  (This is slow.)
• Adaptive classifier uses the same technique! (Apart from normalization method.)
Announcing tesseract-2.00
• Fully Unicode (UTF-8) capable
• Already trained for 6 Latin-based languages (Eng, Fra, Ita, Deu, Spa, Nld)
• Code and documented process to train at http://code.google.com/p/tesseract-ocr
• UNLV regression test framework
• Other minor fixes
Training Tesseract
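Inputs and tools -> Tesseract Data Files:

  Word List -> Wordlist2dawg -> Word-dawg, Freq-dawg (alongside User-words)
  Training page images -> Tesseract -> Character Features (*.tr files)
  Character Features (*.tr files) -> mfTraining -> inttemp, pffmtable
  Character Features (*.tr files) -> cnTraining -> normproto
  Training page images -> Tesseract + manual correction -> Box files
  Box files -> Unicharset_extractor -> unicharset
            -> Addition of character properties -> unicharset
  Manual Data Entry -> DangAmbigs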
Tesseract Dictionaries
Inputs and tools -> Tesseract Data Files:

  Word List (usually empty) -> User-words
  Infrequent Word List -> Wordlist2dawg -> Word-dawg
  Frequent Word List -> Wordlist2dawg -> Freq-dawg
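
A DAWG (directed acyclic word graph) stores a word list as a trie whose identical subtrees are shared. The slide does not describe the file format Wordlist2dawg writes or the algorithm it uses, so the sketch below only illustrates the data structure: it builds a plain trie and then merges equivalent nodes. All names here are hypothetical.

```python
END = "$"  # marker edge meaning "a word ends here"


def build_trie(words):
    """Plain trie as nested dicts; every word ends with an END edge."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root


def minimize(node, registry):
    """Merge equivalent subtrees bottom-up, turning the trie into a DAWG.

    Two nodes are equivalent when they have the same outgoing labels
    leading to the same (already merged) child nodes.
    """
    for ch in list(node):
        node[ch] = minimize(node[ch], registry)
    key = tuple(sorted((ch, id(child)) for ch, child in node.items()))
    return registry.setdefault(key, node)


def count_nodes(node, seen=None):
    """Count distinct nodes reachable from node (shared nodes count once)."""
    seen = set() if seen is None else seen
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + sum(count_nodes(child, seen) for child in node.values())


def contains(root, word):
    """True if word was in the original word list."""
    node = root
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return END in node
```

Sharing suffixes is what keeps a large dictionary compact: words such as "tap", "taps", "top" and "tops" end up using the same nodes for their common endings, while lookups behave exactly as they would on the plain trie.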
Tesseract Shape Data
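Inputs and tools -> Tesseract Data Files:

  Training page images -> Tesseract -> Character Features (*.tr files)
  Training page images -> Tesseract + manual correction -> Box files
  Character Features (*.tr files) -> mfTraining
      -> inttemp (Prototype Shape Features)
      -> pffmtable (Expected Feature Counts)
  Character Features (*.tr files) -> cnTraining
      -> normproto (Character Normalization Features)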
Tesseract Character Data
Inputs and tools -> Tesseract Data Files:

  Training page images -> Tesseract + manual correction -> Box files
  Box files -> Unicharset_extractor -> unicharset
            -> Addition of character properties -> unicharset
               (List of Characters + ctype information)
  Manual Data Entry -> DangAmbigs
               (Typical OCR errors, e.g. e<->c, rn<->m, etc.)
Accuracy Results
Comparison of current results against 1995 UNLV results

                     Character                      Non-stopword
Testid   Testset     Errors   Accuracy   Change     Errors   Accuracy   Change
1995     Bus.3B        5959   98.14%                  1293   95.73%
1995     Doe3.3B      36349   97.52%                  7042   94.87%
1995     Mag.3B       15043   97.74%                  3379   94.99%
1995     News.3B       6432   98.69%                  1502   96.94%
Gcc4.1   Bus.3B        6258   98.04%      5.02%       1312   95.67%      1.47%
Gcc4.1   Doe3.3B      28589   98.05%    -21.35%       6692   95.12%     -4.97%
Gcc4.1   Mag.3B       14800   97.78%     -1.62%       3123   95.37%     -7.58%
Gcc4.1   News.3B       7524   98.47%     16.98%       1220   97.51%    -18.77%
Gcc4.1   Total        57171             -10.37%      12347              -6.58%
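
The slide does not define the Change column, but its values are consistent with the relative change in error count from the 1995 run to the Gcc4.1 run, so a negative value means fewer errors. The short check below recomputes the column from the error counts in the table; the variable and function names are illustrative.

```python
# (character errors, non-stopword errors) per test set, copied from the table.
ERRORS_1995 = {
    "Bus.3B": (5959, 1293),
    "Doe3.3B": (36349, 7042),
    "Mag.3B": (15043, 3379),
    "News.3B": (6432, 1502),
}
ERRORS_GCC41 = {
    "Bus.3B": (6258, 1312),
    "Doe3.3B": (28589, 6692),
    "Mag.3B": (14800, 3123),
    "News.3B": (7524, 1220),
}


def percent_change(old, new):
    """Relative change in error count, in percent, rounded to two decimals."""
    return round(100.0 * (new - old) / old, 2)


def change_table():
    """Per-test-set changes plus a Total row summed over all four sets."""
    rows = {}
    for name, (old_char, old_word) in ERRORS_1995.items():
        new_char, new_word = ERRORS_GCC41[name]
        rows[name] = (percent_change(old_char, new_char),
                      percent_change(old_word, new_word))
    old_totals = [sum(v[i] for v in ERRORS_1995.values()) for i in (0, 1)]
    new_totals = [sum(v[i] for v in ERRORS_GCC41.values()) for i in (0, 1)]
    rows["Total"] = (percent_change(old_totals[0], new_totals[0]),
                     percent_change(old_totals[1], new_totals[1]))
    return rows


if __name__ == "__main__":
    for name, (char_change, word_change) in change_table().items():
        print(f"{name:8s} {char_change:8.2f}% {word_change:8.2f}%")
```

Read this way, the totals say that the 2007 build makes about 10% fewer character errors and about 7% fewer non-stopword errors than the 1995 run, even though Bus.3B and News.3B character errors went up.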
Commercial OCR v Tesseract
Commercial OCR                         Tesseract
• 100+ languages.                      • 6 languages + growing.
• Accuracy is good now.                • Accuracy was good in 1995.
• Sophisticated app with complex UI.   • No UI yet.
• Works on complex magazine pages.     • Page layout analysis coming soon.
• Windows mostly.                      • Runs on Linux, Mac, Windows, more...
• Costs $130-$500.                     • Open source – Free!
Tesseract Future

• Page layout analysis.
• More languages.
• Improve accuracy.
• Add a UI.
The End
• For more information see: http://code.google.com/p/tesseract-ocr
