 NAME : K.Balamurugan
 REG NO : 12370002
 COURSE : M.SC (Comp. Science)
 SEMESTER : IV Semester
Under Guidance
Dr.K.S. Kuppusamy
TAMIL
(TAMIL OCR USING MULTIDIMENSIONAL INTERACTIVE LEARNING MODEL)
Introduction to OCR
Abstract
Software Requirement
Analysis
Design
GUI Design
Conclusion
Outline:
What is OCR?
 Optical Character Recognition, usually abbreviated to OCR, is
the conversion of scanned photos or images of typewritten
or printed text into machine-encoded (computer-readable)
text.
Introduction to OCR
Types of OCR
OCR systems can be grouped by the kind of text they handle. Two types are
briefly discussed below:
 Handwritten Text OCR
Handwritten text is text that a person writes with a pen or pencil on paper
and that is then scanned into digital format using a scanner.
 Machine Printed Text OCR
Machine printed text is common in daily use. It is produced by printing
processes such as offset, laser and inkjet printing. This project comes
under this category.
Introduction - Type of OCR
 TAMIL (Tamil OCR using Multidimensional Interactive Learning
model) is Optical Character Recognition software that converts machine
printed text into editable text. It mainly targets the Tamil language and
uses the Tesseract OCR engine.
 The Tesseract OCR engine has been trained for the Tamil language, so this
software can convert an image into Tamil text; it uses the Tamil tessdata
files to map each character in the image to text. When recognizing only a
limited range of fonts (a single font, for instance), a single training page
might be enough. Tesseract was originally designed to recognize English text
only. Efforts have been made to modify the engine and its training system so
that they can deal with other languages and UTF-8 characters.
TAMIL OCR USING MULTIDIMENSIONAL INTERACTIVE
LEARNING MODEL
Abstract
Tesseract-OCR is one of the most widely used open source OCR engines in
the world. It supports English by default, along with a few more languages,
and it is a command line tool. The Tesseract OCR engine is flexible enough
to be trained for any language. In this project an application is developed
to train the OCR engine for the Tamil language. Producing the training data
takes many training images and considerable effort.
Because Tesseract-OCR is a command line tool, beginners rarely use it,
which limits its usage. Here a GUI was developed for Tesseract-OCR so that
people can use this OCR engine easily.
The Tamil OCR GUI is multidimensional and interactive. It uses the eSpeak
TTS engine to convert Tamil text into voice, so it is especially beneficial
for blind people.
Abstract
TAMIL OCR USING MULTIDIMENSIONAL INTERACTIVE
LEARNING MODEL
 Software Requirements
 Tesseract 3.02 OCR Engine
 JTessBoxEditor
 SerakTesseractTrainer
 Visual Studio 2010
 .NET Framework 4.0
 VB.NET
 Notepad++ (Text Creator)
 Text2Image Command line
 Ghostscript 9.0
 WebsitesScreenshot.dll
 Windows Image Acquisition (Wiaaut.dll)
 Hunspell Spell Checker
 NHunspellExtender
 eSpeak
 espeak Data Files
Requirements
Requirements
 MILE Lab Tamil language Text-To-Speech (TTS) Engine
 Google Transliterations API
 Google Speech Service
 Tamil and English Spell Check Dictionaries
 Microsoft Speech API (SAPI) 5.4
 Tessdata with Tamil Trained Data file
 Web browser Component
 VC++ Runtime
 Microsoft Office Interoperability Word Service
 Tamil Fonts and Tamil Keyboard software
 Windows operating system (from Windows XP to Windows 8)
Requirements
User Software Requirements
 VC++ Runtime 2010
 .NET Framework 4.0
Hardware Requirements:
 Minimum 260 MB RAM, preferably 600 MB
 700 MB free hard disk space
 Modem for internet connection
Software Requirements - Tesseract
OCR Engine
 The Tesseract engine was originally developed as proprietary
software at Hewlett Packard labs in Bristol, England and Greeley,
Colorado between 1985 and 1994, with some more changes made in
1996 to port to Windows, and some migration from C to C++ in
1998. A lot of the code was written in C, and then some more was
written in C++. Since then all the code has been converted to at
least compile with a C++ compiler. Very little work was done in the
following decade. It was then released as open source in 2005 by
Hewlett Packard and the University of Nevada, Las Vegas (UNLV).
Tesseract development has been sponsored by Google since 2006.
 Tesseract does not come with a GUI and is instead run from
the command-line interface.
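Because Tesseract is driven from the command line, a GUI typically wraps a command like the one below. This is a minimal sketch, not the project's actual VB.NET code: the helper name build_tesseract_command and the file names are hypothetical, and "tam" is the language code this deck uses for the Tamil trained data.

```python
def build_tesseract_command(image_path, output_base, lang="tam"):
    """Build the argument list for: tesseract <image> <outputbase> -l <lang>.

    Tesseract writes the recognized text to <outputbase>.txt.
    The function name and defaults are illustrative assumptions.
    """
    return ["tesseract", image_path, output_base, "-l", lang]


if __name__ == "__main__":
    cmd = build_tesseract_command("page.tif", "page")
    print(" ".join(cmd))  # prints: tesseract page.tif page -l tam
    # To actually run it, Tesseract and the Tamil data must be installed:
    # import subprocess
    # subprocess.run(cmd, check=True)
```

Passing the command as a list (rather than one string) avoids quoting problems when file names contain spaces.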
 Tesseract was in the top three OCR engines in terms of character accuracy in
1995. It is available for Linux, Windows and Mac OS X; however, due to
limited resources, only Windows and Ubuntu are rigorously tested by
developers.
 Tesseract up to and including version 2 could only accept TIFF images of
simple one column text as inputs. These early versions did not include layout
analysis and so inputting multi-columned text, images, or equations
produced a garbled output. Since version 3.00 Tesseract has supported
output text formatting, hOCR positional information and page layout
analysis. Support for a number of new image formats was added using the
Leptonica library. Tesseract can detect whether text is monospaced or
proportional.
Software Requirements - Tesseract
OCR Engine
 The initial versions of Tesseract could only recognize English language text.
Starting with version 2 Tesseract was able to process English, French, Italian,
German, Spanish, Brazilian Portuguese and Dutch. Starting with version 3 it
can recognize Arabic, English, Bulgarian, Catalan, Czech, Chinese (Simplified
and Traditional), Danish, German (standard and Fraktur script), Greek,
Finnish, French, Hebrew, Croatian, Hungarian, Indonesian, Italian, Japanese,
Korean, Latvian, Lithuanian, Dutch, Norwegian, Polish, Portuguese,
Romanian, Russian, Slovak (standard and Fraktur script), Slovenian, Spanish,
Serbian, Swedish, Tagalog, Thai, Turkish, Ukrainian and Vietnamese.
Tesseract can be trained to work in other languages too.
 Each box results in a component which represents a single
character. The box has an x- and y-coordinate, a width and height,
and the character which has been recognized in the region.
 We have to manually correct any mistakes that Tesseract has made.
All characters may even be wrong if the font in use is not very
common. To make this job a lot easier, several tools are available.
These tools are called “Tesseract box editors”; the tool I used is
jTessBoxEditor. jTessBoxEditor is written in Java and is thus
platform independent, and it has options to merge and split boxes,
which can be handy. After the mistakes are corrected, we can create
the training file.
Software Requirements –
Box Editor
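The box data described above can be read with a few lines of code. The sketch below assumes the Tesseract box file layout of one box per line, "<character> <left> <bottom> <right> <top> <page>" (the 2.0x format omits the page field), and converts it to the x, y, width and height view used on this slide. The Box type and parse_box_line name are illustrative, not part of any tool.

```python
from typing import NamedTuple


class Box(NamedTuple):
    char: str    # recognized character (may be several code points)
    x: int       # left edge
    y: int       # bottom edge (box files count from the bottom-left corner)
    width: int
    height: int
    page: int


def parse_box_line(line):
    """Parse one box-file line into a Box (hypothetical helper)."""
    parts = line.split()
    if len(parts) not in (5, 6):
        raise ValueError("expected 5 or 6 fields, got %d" % len(parts))
    char = parts[0]
    left, bottom, right, top = (int(p) for p in parts[1:5])
    page = int(parts[5]) if len(parts) == 6 else 0
    return Box(char, left, bottom, right - left, top - bottom, page)


if __name__ == "__main__":
    print(parse_box_line("க 10 20 40 60 0"))
```

A box editor such as jTessBoxEditor edits exactly these values when a box is moved, merged or split.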
 jTessBoxEditor is a box editor and trainer for Tesseract
OCR, providing editing of box data of both Tesseract 2.0x
and 3.0x formats and full automation of Tesseract
training. It can read images of common image formats,
including multi-page TIFF. The program requires Java
Runtime Environment 6.0 or later.
Software Requirements -
jTessBoxEditor
Text-To-Speech Engine
 Introduction to TTS
 Text-to-speech software, or a speech synthesis program, is used
to read text out loud to a user. eSpeak can be used for this
purpose. Some TTS engines use natural human voices to avoid
sounding like a robot while reading text to a user.
There are also a few different free TTS engines available online
where the user can choose the voice they prefer the text to be
read with. One I used is the MILE Lab Tamil language
Text-To-Speech Engine.
Software Requirements - TTS
 eSpeak
 eSpeak is a compact open source software speech
synthesizer for Linux, Windows, and other platforms. It uses
a formant synthesis method, providing many languages in a small
size, and it has also been used by Google Translate.
 eSpeak is derived from the "Speak" speech synthesizer for British
English for Acorn RISC OS computers, which was originally written
in 1995 by Jonathan Duddington. A rewritten version for Linux
appeared in February 2006 and a Windows SAPI 5 version in
January 2007. Subsequent development has added and improved
support for additional languages.
 In my project eSpeak is used as the offline Tamil text-to-speech engine.
It supports speech in multiple languages.
Software Requirements - ESpeak
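As a rough illustration of offline speech output, eSpeak can be called from the command line with a voice name and the text to speak. The sketch below only builds that command; the helper name is hypothetical, and the voice code "ta" for Tamil is an assumption that should be checked against the installed eSpeak voice list.

```python
def build_espeak_command(text, voice="ta", wav_path=None):
    """Build an eSpeak command line (illustrative helper).

    -v selects the voice; -w writes the speech to a WAV file
    instead of playing it.
    """
    cmd = ["espeak", "-v", voice]
    if wav_path is not None:
        cmd += ["-w", wav_path]
    cmd.append(text)
    return cmd


if __name__ == "__main__":
    print(build_espeak_command("வணக்கம்"))
    print(build_espeak_command("வணக்கம்", wav_path="hello.wav"))
```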
 Microsoft Speech API (SAPI)
 The Microsoft text-to-speech voices are speech
synthesizers provided for use with applications that
use the Microsoft Speech API (SAPI) or the Microsoft
Speech Server Platform. There are client and server
versions of Microsoft text-to-speech voices. SAPI is
used to provide the interactive speech feature in the
project.
Software Requirements -Microsoft
Speech API (SAPI)
 Hunspell
 Hunspell is a spell checker and morphological analyser designed
for languages with rich morphology and complex word
compounding and character encoding, originally designed for
the Hungarian language.
 Hunspell is based on MySpell and is backward-compatible with
MySpell dictionaries. While MySpell uses a single-byte character
encoding, Hunspell can use Unicode UTF-8-encoded dictionaries.
The Hunspell spell checker needs two files:
1. A dictionary file with the .DIC extension
2. An affix file with the .AFF extension
Software Requirements - Hunspell
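To make the two-file layout concrete: a Hunspell .dic file starts with an approximate word count, followed by one word per line, optionally with "/" and flags that refer to rules in the .aff file. The sketch below parses that simple form only; the sample words and flags are made up, and escaped slashes and morphological fields are deliberately ignored.

```python
def parse_dic(text):
    """Parse simple Hunspell .dic content (illustrative, not a full parser).

    Returns (declared_count, {word: flags}).
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    declared = int(lines[0])          # first line: approximate word count
    entries = {}
    for ln in lines[1:]:
        word, _, flags = ln.partition("/")
        entries[word] = flags         # flags is "" when the word has none
    return declared, entries


if __name__ == "__main__":
    sample = "3\nhello\ntry/B\nwork/AB\n"   # hypothetical entries
    count, words = parse_dic(sample)
    print(count, words)
```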
 Ghostscript is a suite of software based on an interpreter for
Adobe Systems' PostScript and Portable Document Format
(PDF) page description languages. Its main purposes are the
rasterization or rendering of such page description language
files, for the display or printing of document pages, and the
conversion between PostScript and PDF files.
 Ghostscript is open source. It is used in the project to
convert PDF files to images; the PDF OCR feature relies on it.
Software Requirements - GhostScript
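The PDF to image step can be sketched as a Ghostscript command that renders each page to a PNG file. This is an assumption about how such a step might look, not the project's code: the helper name, the output pattern and the 300 dpi default are illustrative, and on Windows the console binary is usually named gswin32c or gswin64c rather than gs.

```python
def build_gs_command(pdf_path, out_pattern="page-%03d.png", dpi=300, gs_binary="gs"):
    """Build a Ghostscript command that rasterizes a PDF (illustrative helper).

    %03d in the output pattern is replaced by Ghostscript with the page number.
    """
    return [
        gs_binary,
        "-dNOPAUSE",            # do not pause between pages
        "-dBATCH",              # exit when the file has been processed
        "-dSAFER",              # restrict file access from the document
        "-sDEVICE=png16m",      # 24-bit colour PNG output
        "-r%d" % dpi,           # rendering resolution
        "-sOutputFile=%s" % out_pattern,
        pdf_path,
    ]


if __name__ == "__main__":
    print(" ".join(build_gs_command("document.pdf")))
```

Each resulting page image can then be passed to the OCR engine like any other input image.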
 Existing System
The existing Tesseract-OCR supports English by default and also
supports languages such as Dutch, Spanish, Italian, French and
German. Tesseract has been trained for all of these languages, but
it has had little training for languages such as Tamil and
Malayalam. No GUI is available for Tesseract in Tamil, and training
Tesseract is a big task that even intermediate users find complex.
Because the Tesseract OCR engine is a command line tool, it is used
much less than it could be.
Analysis
Some OCR GUIs have been built on the Tesseract OCR engine, but they
do not have much support for the Tamil language.
 Some GUI tools are listed below.
 VietOCR
 Tesseract-OCR QT4 gui
 Lime OCR
 A few online services:
 CustomOCR
 Free OCR
 i2OCR (supports the Tamil language, but with very low accuracy)
Analysis-Existing System
The proposed system
The proposed system has to have a GUI, which is named “Tamil OCR GUI”.
It is trained for the Tamil language using the Tesseract training
procedures. It will be especially beneficial for blind people as well as
for other users.
The inputs to the OCR engine are:
 Sample Tamil Training Images
 Data Files
 Tamil Dictionary
 Final Tamil trained data
I use the VB.NET programming language because it satisfies all the needs:
extensibility, simplicity, interoperability, portability, powerful data
structures and Unicode support. The GUI was developed in VB.NET as a
multithreaded application.
Analysis- proposed system
In this project the GUI consists of a web browser and a dedicated OCR
GUI. The web browser captures web page images and extracts all their
Tamil and English text using the Tamil OCR component.
The proposed system supports the following languages:
 1. Tamil
 2. English
Users can switch between these languages and can easily
OCR Tamil and English images using the proposed system.
Analysis- proposed system
The Tesseract OCR engine works as follows:
 In the first stage, outlines of the text are gathered, by nesting, into blobs. These blobs are
organized into text lines, and the lines and regions are analysed for fixed pitch or proportional
text. Text lines are broken into words differently according to the kind of character spacing:
fixed pitch text is chopped immediately by character cells, whereas proportional text is broken
into words using definite spaces and fuzzy spaces. Recognition then proceeds as a two-pass
process.
 In the first pass, an attempt is made to recognize each word in turn. Each word that is
satisfactory is passed to an adaptive classifier as training data. The adaptive classifier then
gets a chance to recognize text lower down the page more accurately. A second pass is then run
over the page for the words that were not recognized well enough in the first pass.
 Finally, a last phase resolves fuzzy spaces and checks alternative hypotheses for the
x-height to locate small cap text. We follow this architecture of the Tesseract OCR engine to
recognize the Tamil characters.
 To train Tesseract for Tamil script recognition, I followed the complete training procedure
specified for Tesseract.
Process of Tesseract OCR Engine
Tamil OCR Component:
 Tesseract OCR Engine trained for tamil language
 Image Extractor
 Image Reader
 Output Manager
 Output formatter
 Text Speaker
Analysis-DFD
Analysis- Training Procedure:
 DATA FILES REQUIRED
 To train the Tamil language (lang = tam), I had to create 8 data files in the
tessdata subdirectory. The 8 files are:
1. tessdata/tam.freq-dawg
2. tessdata/tam.word-dawg
3. tessdata/tam.user-words
4. tessdata/tam.inttemp
5. tessdata/tam.normproto
6. tessdata/tam.pffmtable
7. tessdata/tam.unicharset
8. tessdata/tam.DangAmbigs
Analysis- Training Procedure:
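Before running OCR it is useful to confirm that the training output is complete. The sketch below checks a tessdata directory for the eight file names listed on this slide; the function name is hypothetical. Note that Tesseract 3.x normally packs its trained components into a single <lang>.traineddata file, so this check applies only to the file layout described here.

```python
from pathlib import Path

# Suffixes taken from the list of eight data files on this slide.
REQUIRED_SUFFIXES = [
    "freq-dawg", "word-dawg", "user-words", "inttemp",
    "normproto", "pffmtable", "unicharset", "DangAmbigs",
]


def missing_training_files(tessdata_dir, lang="tam"):
    """Return the expected file names that are not present (illustrative)."""
    folder = Path(tessdata_dir)
    return [
        "%s.%s" % (lang, suffix)
        for suffix in REQUIRED_SUFFIXES
        if not (folder / ("%s.%s" % (lang, suffix))).is_file()
    ]


if __name__ == "__main__":
    missing = missing_training_files("tessdata")
    if missing:
        print("Missing training files:", ", ".join(missing))
    else:
        print("All eight Tamil data files are present.")
```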
Analysis- Web OCR Component
[Flow diagram: Start, Webpage (Web Browser), Image Extractor, Image Reader,
Tesseract Tamil Trained OCR Engine with Tamil Trained Dataset, Output Manager,
Output Formatter, Web GUI Page, Stop]
Fig: Architecture of Web OCR Component
1. The Web OCR Component extracts all pictures of the web page and saves
them in a single folder called ‘OCR Image’ (Image Extractor).
2. The Image Reader takes the pictures one by one and sends them to the
Tesseract Engine.
3. The Tesseract Engine processes the images and extracts Tamil and
English text only.
4. The output of the Tesseract Engine is taken by the Output Manager,
which maintains the text and images.
5. The Output Formatter puts the images and the corresponding extracted
text in a Word document.
6. If the user moves the pointer over any image, steps 1 to 4 are carried
out and the Text Speaker component speaks the text to the user.
Analysis- Web OCR Component
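The first step above (the Image Extractor) can be sketched with the Python standard library: collect the src of every img tag on a page and resolve it against the page address. This is only an illustration of the idea, not the project's VB.NET implementation; the class and function names and the example URL are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collects absolute URLs of all <img> tags in a page (illustrative)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page address.
                self.image_urls.append(urljoin(self.base_url, src))


def extract_image_urls(html, base_url):
    collector = ImageCollector(base_url)
    collector.feed(html)
    return collector.image_urls


if __name__ == "__main__":
    page = '<p><img src="a.png"><img src="/img/b.jpg"/></p>'
    for url in extract_image_urls(page, "http://example.com/news/page.html"):
        print(url)
```

The resulting list is what an Image Reader would then download and hand to the OCR engine one image at a time.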
 The Web OCR feature needs a web browser. It extracts all images
from a particular page and runs OCR on them. Web OCR has the
following tools:
 Image Grabber
 Link Grabber
 Web Page to Image Converter
 Web Page to PDF Converter
 Screen Shotter
 All the above tools are essential for Web OCR. The Image Grabber
extracts all images and saves them to a folder at the location the
user specifies.
Analysis- Web OCR Component
Fig: Web browser
 One of the components of the project is Web OCR. Because a good
GUI is needed, a web browser is a sensible host: it is the software
in which most users spend their time and where most image files are
encountered. This browser module has low priority in this project,
and not all of the features mentioned may be implemented; a simple
browser with the required features is enough. Developing the web
browser requires the following technologies and concepts:
 VB.NET
 Application programming interfaces (APIs)
 Multithreading
 Voice library (voice recognition)
Analysis- Web OCR Component
Analysis- Tamil OCR GUI
[Flow diagram: Start, Input Image, Preprocessing, Tamil OCR (Tesseract Engine
with Tamil Trained Dataset), Output Text, Post Processing (Spell Checker with
Tamil Dictionary and Tamil Data files), Accurate Output Text, Stop]
Analysis- Tamil OCR GUI
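The spell checker stage of post processing can be illustrated with a simple dictionary lookup. The sketch below is not Hunspell and not the project's code: it replaces each OCR output word that is missing from a word list with its closest match from Python's difflib, and leaves the word alone when nothing is close enough. The function names, the 0.8 cutoff and the English sample words are assumptions made for illustration; Tamil strings would be compared code point by code point in the same way.

```python
import difflib


def correct_word(word, dictionary, cutoff=0.8):
    """Return the closest dictionary word, or the word itself if none is close."""
    if word in dictionary:
        return word
    matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text, dictionary):
    """Correct each whitespace-separated word of the OCR output."""
    return " ".join(correct_word(w, dictionary) for w in text.split())


if __name__ == "__main__":
    words = ["optical", "character", "recognition"]
    print(correct_text("optcal character recogntion", words))
```

A stricter cutoff changes fewer words but misses more OCR errors; a looser one risks replacing correct words that are simply absent from the dictionary.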
[Block diagram: Input Image, Pre-process, Tesseract OCR Engine (Tamil and
English Trained Data Set), Post Process, Output Text, all driven from the
Tamil OCR GUI]
Output Formats
1. PDF
2. MS Word
3. RTF
4. XML
5. WAV
6. MP3
8. HTML
9. Text
10. Single Web page
Fig: Block Diagram of Full Image OCR
Analysis- Tamil OCR GUI
[Block diagram: Selected Image Region, Pre-process, Tesseract OCR Engine
(Tamil and English Trained Data Set), Post Process, Output Text, all driven
from the Tamil OCR GUI]
Fig: Block Diagram of Region Image OCR
Analysis- Tamil OCR GUI
Fig: Block Diagram of Snapshotting OCR
Analysis- Tamil OCR GUI
Fig: Block Diagram of Batch OCR
Analysis- Pre-processing
[Flow diagram: Start, Input Image, Preprocess Manager, Preprocessed Image.
Preprocess Manager steps: Increase dpi (max 300); Convert to black and white;
Remove Background (max 300); Remove Inner images (max 300)]
Fig: Architecture of Pre-processor
Pre-processing is an optional step in Tamil OCR.
It improves the quality of the image so that the
OCR process is more accurate.
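One of the pre-processing steps, converting an image to black and white, can be shown with a fixed threshold on grayscale values. This is a minimal sketch under stated assumptions: the image is a plain list of rows of 0 to 255 values, the threshold of 128 is arbitrary, and the function name is hypothetical. The deck does not say which binarization method the project uses.

```python
def to_black_and_white(gray, threshold=128):
    """Map each grayscale pixel to 0 (black) or 255 (white)."""
    return [
        [255 if pixel >= threshold else 0 for pixel in row]
        for row in gray
    ]


if __name__ == "__main__":
    image = [
        [0, 127, 128],
        [200, 50, 255],
    ]
    print(to_black_and_white(image))  # prints: [[0, 0, 255], [255, 0, 255]]
```

In practice the threshold is often chosen per image rather than fixed, since scanned pages vary in brightness.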
The system is divided into functional modules
that are independent of one another.
 This project has the following modules:
1) Tamil Training to Tesseract
2) Web OCR Component
3) OCR GUI
Design - Modularity
Introduction
The system is designed in two phase:
 Preliminary System Design
 Detailed System Design
 Coupling is the measure of relative interdependence among modules. In the Tamil
OCR project coupling is very loose, because any module can be added without
affecting the other modules, or with only a small modification to the interface
calls. Coupling depends on the interface complexity between modules, the point at
which entry or reference is made to a module, and what data pass across the
interface.
 Cohesion is the measure of the relative functional strength of individual modules.
In the Tamil OCR project cohesion is very high, because each module performs a
single task within a software procedure, requiring little interaction with
procedures being performed in other parts of the program.
Design
1.Training Tesseract-OCR Task
 Tesseract 3.02 is fully trainable. Tesseract 2.0 accepts only TIFF
images, but later versions include the Leptonica library, which lets
Tesseract read other common image formats.
 This module covers the training process, provides some guidelines
on applicability to various languages, and describes what to expect
from the results. Tesseract was originally designed to recognize English
text only. Efforts have been made to modify the engine and its training
system so that they can deal with other languages and UTF-8
characters.
Design - Modularity
2.Web Tamil OCR-Component Development
1. The Web Tamil OCR Component extracts all pictures of the web page and saves
them in a single folder called ‘OCR Image’ (Image Extractor).
2. The Image Reader takes the pictures one by one and sends them to the Tesseract Engine.
3. The Tesseract Engine processes the images and extracts Tamil and English
text only.
4. The output of the Tesseract Engine is taken by the Output Manager, which
maintains the text and images.
5. The Output Formatter puts the images and the corresponding extracted text in a
Word document.
6. If the user moves the pointer over any image, steps 1 to 4 are carried out and
the Text Speaker component speaks the text to the user.
Design - Modularity
3 .Web browser
 The main component of the project is Tamil OCR. Because a good GUI is needed,
a web browser is a sensible host, since it is the software in which most users
spend their time. This browser module has low priority in this project.
Tesseract-OCR is a command line tool. The project GUI is developed using .NET so
that users can work easily through a graphical user interface. The project targets
the development of a web browser component that extracts Tamil characters
from images.
 The user interface for OCRgui contains an open button to select the image
(TIFF format) file, a recognition button that converts the image to
editable text, a font button that selects the font type for the output, and a
preferences window that holds all preferences. The web browser uses the Tamil
OCR Component to extract text from all images that contain Tamil or English.
Design - Modularity
Contd. (Web browser)
 The software interfaces used in the project are built in VB.NET,
which makes it easy to create programs with a graphical user
interface using Visual Basic .NET. The project does not make use of
any particular protocol, and therefore there may not be any need
for specific communication interfaces.
Design - Modularity
4 .OCR GUI
The OCR GUI allows the user to perform various OCR functions more easily
in Tamil. This module provides the following features to the user.
1. Whole Image OCR
2. Image Region OCR
3. Snapshot OCR
4. Snipping Screen OCR
5. Batch OCR
6. Various Image Processing functionality
7. Good Tamil Text Editor
8. Easy Tamil Transliteration Typing (like phonetic)
9. Good Tamil Text to Speech
Design - Modularity
 Contd. (OCR GUI)
Design - Modularity
10. Image Convertor
11. PDF OCR
12. Various Output Format
13. Various Input Acceptance
14. Easy Typing in Other languages
15. Email Sender
16. Spell checking
17. Snipping and Saving
18. Combined OCR document
19. OCR report
Detailed Design
Fig: Few Class Diagram of OCR GUI
GUI Design
Fig: Whole Page OCR (windows 7 and 8 version)
GUI Design
Fig: Whole Page OCR (windows XP version)
GUI Design
Fig: Region OCR
GUI Design
Fig: Snipping OCR
GUI Design
Fig: Snipping OCR Output
GUI Design
Fig: Batch OCR
Windows 8 modes
GUI Design
Fig: Batch OCR
Windows XP mode
GUI Design
Fig: Snapshots OCR in Tamil
GUI Design
Fig: Snapshots OCR in English
GUI Design
Fig: Converted to black and White
GUI Design
Fig: Tamil OCR Help
GUI Design
Fig: Tamil OCR Setting Window
GUI Design
Fig: Tamil OCR Other Setting Window
GUI Design
Fig: Tamil OCR Speak Setting Window
Fig: Tamil OCR UI Setting Window
GUI Design
Fig: Tamil Typing
GUI Design
Fig: Image Preview With image Properties
GUI Design
Fig: Tamil OCR Print Preview Window
GUI Design
Fig: Tamil OCR Batch OCR Theme Window
GUI Design
Fig: Tamil OCR with
Blackly Mode
GUI Design
Fig: Web OCR
GUI Design
Fig: Web Page Capture
GUI Design
Fig: Web Page Link Grabbing
GUI Design
Fig: Image Grabber
in progress
GUI Design
Fig: PDF OCR
GUI Design
Fig: Find and Highlight Feature
GUI Design
Fig: Snipping and Saving Tool
GUI Design
Fig: Batch OCR Context Help
GUI Design
Fig: Auto Cropped Image
GUI Design
Fig: Skewed Image
GUI Design
Fig: Deskewed Image
Tesseract-OCR is trained for the Tamil language
It increases the number of users of Tesseract
The GUI encourages more people to migrate towards open source
It makes it easy for people to work with Tesseract without
prior knowledge of it
The web browser can make use of the Tamil OCR component
It is very beneficial for blind people
Advantages of Project
The applications of Tesseract-OCR include:
 Newspaper industries
 Book publishers
 Industries of Digital Expertise
 Web browsers
 Web service creation
 Web applications
 The general public
Advantages of Project
 FUTURE ENHANCEMENT
 The accuracy of recognition can be improved by more
training
 AI applied to build an intelligent snapshot reader
 Image content search
 A more accurate and powerful spell checker and corrector
 Powerful web OCR
 Web service
 Online web application
 Handwriting recognition
 Browser extensions
 More powerful image preprocessing
Future Enhancement
 I hope the result of the project is that Tesseract-OCR is
trained for the Tamil language, and that further training can be used
to increase its accuracy for Tamil.
 This project can increase the number of Tesseract OCR users
among Tamil people and encourage most of the Tamil press to
migrate towards free and open source software.
 The GUI for Tesseract will make it easy for people who
have no knowledge of Tesseract to use it effectively.
Conclusion
 The Tamil OCR Project was a great opportunity for me to experience the workings of
software, learn about research and put my 2 years of academics into practice. The 4 month
project period was very useful for learning industry specific practices and standards. I am
very confident that this experience will be a great boost and help for my career ahead.
 The period under the guidance of Dr. K.S. Kuppusamy was a great learning experience with very
helpful project supervisors. Their evaluation and guidance helped me a lot to learn and understand
new techniques and methodologies. Overall, it was a great experience and will greatly benefit me
in future.
 The biggest advantage of Tesseract OCR is its availability as open code. Thus anybody with
the interest to study how it works, and the skill to improve it, can train it for a new
language. In this document, I presented the step by step procedure to train the Tesseract engine
for Tamil printed text documents. At first we trained Tesseract for a particular font type of the
English language that had not been supported earlier, by performing a series of tests. We then
trained Tesseract to recognize the Tamil character set and observed the results. As we found that
editing the box file manually is a cumbersome task (this language has a large character set), we
tried to generate the box file automatically. We were able to detect the vowels and consonants of
the Tamil character set; however, in future we still need to train Tesseract for dependent
modifiers and other characters that exist in Tamil documents.
Conclusion
1. The tesseract-OCR project, developed by Ray Smith, is available
at http://code.google.com/p/tesseract-OCR/
2. The procedure for training tesseract-OCR, including for Indic languages,
is available at http://code.google.com/p/tesseract-OCR/wiki/TrainingTesseract
3. The Tesseract Indic home page, developed and maintained by
Debayan Banerjee, is available
at http://code.google.com/p/tesseractindic/
REFERENCES
Question

 
Notacd07
Notacd07Notacd07
Notacd07
 
Nota programming
Nota programmingNota programming
Nota programming
 
Moses
MosesMoses
Moses
 
Programming Part 01
Programming Part 01Programming Part 01
Programming Part 01
 
A SMART LANGUAGE TRANSLATION TECHNIQUE USING OCR
A SMART LANGUAGE TRANSLATION TECHNIQUE USING OCRA SMART LANGUAGE TRANSLATION TECHNIQUE USING OCR
A SMART LANGUAGE TRANSLATION TECHNIQUE USING OCR
 
General Speereo Technology
General Speereo TechnologyGeneral Speereo Technology
General Speereo Technology
 

Más de balamurugan.k Kalibalamurugan

Más de balamurugan.k Kalibalamurugan (10)

Problem definition
Problem definitionProblem definition
Problem definition
 
Description logic
Description logicDescription logic
Description logic
 
Software testing
Software testingSoftware testing
Software testing
 
Window ce
Window ceWindow ce
Window ce
 
Radius1
Radius1Radius1
Radius1
 
A multi criteria evaluation of environmental databases using hasse
A multi criteria evaluation of environmental databases using hasseA multi criteria evaluation of environmental databases using hasse
A multi criteria evaluation of environmental databases using hasse
 
Simple object access protocol(soap )
Simple object access protocol(soap )Simple object access protocol(soap )
Simple object access protocol(soap )
 
Security monitoring and auditing
Security monitoring and auditingSecurity monitoring and auditing
Security monitoring and auditing
 
Object oriented framework
Object oriented frameworkObject oriented framework
Object oriented framework
 
Distributed datababase Transaction and concurrency control
Distributed datababase Transaction and concurrency controlDistributed datababase Transaction and concurrency control
Distributed datababase Transaction and concurrency control
 

Último

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Último (20)

AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 

Tamil OCR using Tesseract OCR Engine

  • 1. TAMIL (Tamil OCR using Multidimensional Interactive Learning model). Name: K. Balamurugan. Reg. No.: 12370002. Course: M.Sc. (Computer Science). Semester: IV. Under the guidance of Dr. K.S. Kuppusamy.
  • 2. Outline: Introduction to OCR; Abstract; Software Requirements; Analysis; Design; GUI Design; Conclusion.
  • 3. Introduction to OCR: What is OCR? Optical Character Recognition, usually abbreviated to OCR, is the conversion of scanned photos or images of typewritten or printed text into machine-encoded, computer-readable text.
  • 4. Introduction - Types of OCR: OCR is commonly divided into several types. Two of them are briefly discussed below. Handwritten Text OCR: handwritten text is text that a person writes with a pen or pencil on paper and that is then scanned into digital format. Machine Printed Text OCR: machine-printed text is common in daily use. It is produced by printing processes such as offset, laser, inkjet and many more. This project falls under this category.
  • 5. Abstract (Tamil OCR using Multidimensional Interactive Learning model): TAMIL is optical character recognition software that converts machine-printed text into editable text. It mainly targets the Tamil language and uses the Tesseract OCR engine. The Tesseract engine has been trained for Tamil, so the software can convert an image into Tamil text; it uses the Tamil tessdata files to map each character in the image to text. When only a limited range of fonts has to be recognized (a single font, for instance), a single training page might be enough. Tesseract was originally designed to recognize English text only. Efforts have since been made to modify the engine and its training system so that they can deal with other languages and UTF-8 characters.
  • 6. Abstract (continued): Tesseract-OCR is one of the most widely used open source OCR engines in the world. It currently supports English by default, along with a few more languages, and it is a command-line tool. The Tesseract OCR engine is flexible in that it can be trained for any language. In this project an application is developed to train the OCR engine for the Tamil language. Producing the training data requires many training images and considerable effort. Because Tesseract-OCR is a command-line tool, beginners rarely use it, which limits its adoption. For this reason a GUI was developed for Tesseract-OCR so that people can use the OCR engine easily. The Tamil OCR GUI is multidimensional and interactive. It uses the eSpeak TTS engine to convert Tamil text into voice, which makes it especially beneficial for blind people.
  • 7. Software Requirements: Tesseract 3.02 OCR Engine; jTessBoxEditor; SerakTesseractTrainer; Visual Studio 2010; .NET Framework 4.0; VB.NET; Notepad++ (text creator); Text2Image command line; Ghostscript 9.0; WebsitesScreenshot.dll; Windows Image Acquisition (Wiaaut.dll); Hunspell spell checker; NHunspellExtender; eSpeak; eSpeak data files.
  • 8. Requirements (continued): MILE Lab Tamil language Text-To-Speech (TTS) engine; Google Transliterations API; Google Speech Service; Tamil and English spell-check dictionaries; Microsoft Speech API (SAPI) 5.4; tessdata with Tamil trained data file; web browser component; VC++ Runtime; Microsoft Office Interoperability Word service; Tamil fonts and Tamil keyboard software; Windows operating system (from Windows XP to Windows 8).
  • 9. Requirements - User Software Requirements: VC++ Runtime 2010; .NET Framework 4.0. Hardware Requirements: minimum 260MB RAM, preferably 600MB RAM; 700MB free hard disk space; modem for internet connection.
  • 10. Software Requirements - Tesseract OCR Engine: The Tesseract engine was originally developed as proprietary software at Hewlett-Packard labs in Bristol, England and Greeley, Colorado between 1985 and 1994. Further changes were made in 1996 to port it to Windows, and some code was migrated from C to C++ in 1998. Much of the code was written in C, with more added later in C++; since then all of the code has been converted to at least compile with a C++ compiler. Very little work was done in the following decade. Tesseract was released as open source in 2005 by Hewlett-Packard and the University of Nevada, Las Vegas (UNLV), and its development has been sponsored by Google since 2006. Tesseract does not come with a GUI; it is run from the command-line interface.
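Since Tesseract is driven from the command line, a GUI such as the one in this project typically wraps a command like the one sketched below. This is only an illustrative Python sketch, not the project's VB.NET code; the helper name `build_tesseract_command` and the file names are hypothetical. The `-l` option selects the language, and `tam` is the language code used elsewhere in these slides.

```python
def build_tesseract_command(image_path, output_base, lang="tam"):
    """Return the argument list for a basic Tesseract command-line run.

    Tesseract writes the recognized text to <output_base>.txt.
    The helper name and its defaults are illustrative assumptions.
    """
    return ["tesseract", image_path, output_base, "-l", lang]


if __name__ == "__main__":
    # Hypothetical file names; pass the list to subprocess.run() to execute it.
    print(" ".join(build_tesseract_command("page1.tif", "page1")))
```

Building the command as a list rather than a single string avoids quoting problems when file paths contain spaces.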
  • 11. Software Requirements - Tesseract OCR Engine: Tesseract was among the top three OCR engines in terms of character accuracy in 1995. It is available for Linux, Windows and Mac OS X; however, due to limited resources, only Windows and Ubuntu are rigorously tested by the developers. Tesseract up to and including version 2 could only accept TIFF images of simple one-column text as input. These early versions did not include layout analysis, so multi-column text, images, or equations produced garbled output. Since version 3.00 Tesseract has supported output text formatting, hOCR positional information and page layout analysis. Support for a number of new image formats was added using the Leptonica library. Tesseract can detect whether text is monospaced or proportional.
  • 12. Software Requirements - Tesseract OCR Engine: The initial versions of Tesseract could only recognize English text. Starting with version 2, Tesseract was able to process English, French, Italian, German, Spanish, Brazilian Portuguese and Dutch. Starting with version 3, it can recognize Arabic, English, Bulgarian, Catalan, Czech, Chinese (Simplified and Traditional), Danish, German (standard and Fraktur script), Greek, Finnish, French, Hebrew, Croatian, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak (standard and Fraktur script), Slovenian, Spanish, Serbian, Swedish, Tagalog, Thai, Turkish, Ukrainian and Vietnamese. Tesseract can be trained to work in other languages too.
  • 13. Software Requirements - Box Editor: Each box represents a single character. A box has an x- and y-coordinate, a width and a height, and the character that was recognized in that region. Mistakes in the boxes detected by Tesseract have to be corrected manually; if the font is uncommon, every character may be wrong. Several tools, called “Tesseract box editors”, make this job much easier. The tool I used is jTessBoxEditor. It is written in Java and is therefore platform independent, and its merge and split options can be handy. Once the mistakes are corrected, the training file can be created.
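To make the box description concrete, here is a minimal Python sketch that parses one line of a box file. It assumes the Tesseract 3.0x box layout of `symbol left bottom right top page` (the older 2.0x layout has no page field), with the width and height derived from the corner coordinates. The `Box` type and `parse_box_line` name are hypothetical and only for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """One character box; coordinates are in pixels."""
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int

    @property
    def width(self):
        return self.right - self.left

    @property
    def height(self):
        return self.top - self.bottom


def parse_box_line(line):
    """Parse 'symbol left bottom right top [page]' into a Box."""
    parts = line.split()
    if len(parts) not in (5, 6):
        raise ValueError("unexpected box line: %r" % line)
    numbers = [int(p) for p in parts[1:]]
    # Box files without a page field are treated as page 0.
    page = numbers[4] if len(numbers) == 5 else 0
    return Box(parts[0], numbers[0], numbers[1], numbers[2], numbers[3], page)


if __name__ == "__main__":
    box = parse_box_line("க 10 20 42 60 0")
    print(box.char, box.width, box.height)
```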
  • 14. Software Requirements - jTessBoxEditor: jTessBoxEditor is a box editor and trainer for Tesseract OCR. It edits box data in both the Tesseract 2.0x and 3.0x formats and fully automates Tesseract training. It can read images in common formats, including multi-page TIFF. The program requires Java Runtime Environment 6.0 or later.
  • 15. Software Requirements - TTS: Introduction to Text-To-Speech (TTS) engines. Text-to-speech software, or a speech synthesis program, reads text out loud to the user. eSpeak can be used for this purpose. Some TTS engines use natural human voices to avoid sounding like a robot when reading text. A few free TTS engines are also available online, where the user can choose the voice in which the text is read. The one I used is the MILE Lab Tamil language Text-To-Speech engine.
  • 16. Software Requirements - eSpeak: eSpeak is a compact open source software speech synthesizer for Linux, Windows, and other platforms. It uses a formant synthesis method, which provides many languages in a small size, and it has also been used by Google Translate. eSpeak is derived from the "Speak" speech synthesizer for British English on Acorn RISC OS computers, originally written in 1995 by Jonathan Duddington. A rewritten version for Linux appeared in February 2006 and a Windows SAPI 5 version in January 2007. Subsequent development has added and improved support for additional languages. In this project eSpeak is used as the offline Tamil text-to-speech engine. It supports speech in multiple languages.
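As an illustration of offline speech output, the following Python sketch builds an eSpeak command line. It assumes the `espeak` executable is on the PATH and that the Tamil voice is selected with `-v ta`; `-w` writes a WAV file instead of playing the audio. The helper name `build_espeak_command` is hypothetical, and the project itself may call eSpeak differently.

```python
def build_espeak_command(text, voice="ta", wav_path=None):
    """Return an eSpeak argument list that speaks `text` with the given voice.

    When wav_path is set, the speech is written to that WAV file.
    """
    command = ["espeak", "-v", voice]
    if wav_path is not None:
        command += ["-w", wav_path]
    command.append(text)
    return command


if __name__ == "__main__":
    # Pass the list to subprocess.run() to actually speak the text.
    print(build_espeak_command("வணக்கம்", wav_path="hello.wav"))
```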
  • 17. Software Requirements - Microsoft Speech API (SAPI): The Microsoft text-to-speech voices are speech synthesizers for applications that use the Microsoft Speech API (SAPI) or the Microsoft Speech Server Platform. There are client and server versions of the voices. SAPI is used to provide the interactive speech feature in this project.
  • 18. Software Requirements - Hunspell: Hunspell is a spell checker and morphological analyser designed for languages with rich morphology and complex word compounding and character encoding; it was originally designed for the Hungarian language. Hunspell is based on MySpell and is backward-compatible with MySpell dictionaries. While MySpell uses a single-byte character encoding, Hunspell can use Unicode UTF-8-encoded dictionaries. Hunspell needs two files: 1. a dictionary file with the .DIC extension; 2. an affix file with the .AFF extension.
  • 19. Software Requirements - Ghostscript: Ghostscript is a suite of software based on an interpreter for Adobe Systems' PostScript and Portable Document Format (PDF) page description languages. Its main purposes are the rasterization or rendering of such page description language files, for the display or printing of document pages, and the conversion between PostScript and PDF files. Ghostscript is open source. In this project it is used to convert PDF files to images, and the PDF OCR feature relies on it.
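A PDF-to-image step of this kind can be sketched as a Ghostscript command line. The Python helper below is illustrative only: the function name, the output file pattern and the choice of the `png16m` device are assumptions, and on Windows the console executable usually has a different name (for example `gswin32c`) than the `gs` default used here.

```python
def build_ghostscript_command(pdf_path, output_pattern="page-%03d.png",
                              dpi=300, executable="gs"):
    """Return a Ghostscript argument list that renders each PDF page to a PNG.

    The %03d in output_pattern is expanded by Ghostscript to the page number.
    """
    return [
        executable,
        "-dNOPAUSE",            # do not pause between pages
        "-dBATCH",              # exit when the input is processed
        "-dSAFER",              # restrict file access from the document
        "-sDEVICE=png16m",      # 24-bit colour PNG output
        "-r%d" % dpi,           # rendering resolution in dpi
        "-sOutputFile=%s" % output_pattern,
        pdf_path,
    ]


if __name__ == "__main__":
    print(" ".join(build_ghostscript_command("scan.pdf")))
```

The rendered page images can then be passed to the OCR engine one by one.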
  • 20. Analysis - Existing System: The existing Tesseract-OCR supports English by default and also supports languages such as Dutch, Spanish, Italian, French and German. Tesseract has been trained for all of these, but languages such as Tamil and Malayalam have received little training. No Tamil GUI is available for Tesseract, and training Tesseract is a big task that even intermediate users find complex. Because the Tesseract OCR engine is a command-line tool, it is used much less than it could be.
  • 21. Analysis - Existing System: Some OCR GUIs have been built on the Tesseract OCR engine, but they offer little support for the Tamil language. Some GUI tools: VietOCR; Tesseract-OCR QT4 gui; Lime OCR. A few online services: CustomOCR; Free OCR; i2OCR (supports Tamil, but with very low accuracy).
  • 22. Analysis - Proposed System: The proposed system is GUI-based, and its GUI is named “Tamil OCR GUI”. It is trained for the Tamil language using the Tesseract training procedures. It will be especially beneficial for blind people as well as for ordinary users. The inputs to the OCR engine are: sample Tamil training images; data files; Tamil dictionary; final Tamil trained data. The programming language is VB.NET, which satisfies the project's needs for extensibility, simplicity, interoperability, portability, powerful data structures and Unicode support. The GUI was developed in VB.NET as a threaded application.
  • 23. Analysis - Proposed System: The GUI in this project consists of a web browser and a dedicated OCR GUI. It captures images from web pages and extracts all of their Tamil and English text using the Tamil OCR component. The proposed system supports the following languages: 1. Tamil; 2. English. Users can switch between these languages and easily run OCR on Tamil and English images.
  • 24. Process of the Tesseract OCR Engine: In the first stage, outlines of the text are gathered, by nesting, into blobs. Blobs are organized into text lines, and the lines and regions are analysed for fixed-pitch or proportional text. Text lines are broken into words differently according to the kind of character spacing: fixed-pitch text is chopped immediately by character cells, whereas proportional text is broken into words using definite spaces and fuzzy spaces. Recognition then proceeds as a two-pass process. In the first pass, an attempt is made to recognize each word in turn, and each word that is recognized satisfactorily is passed to an adaptive classifier as training data. The adaptive classifier then gets a chance to recognize text more accurately. A second pass is then run over the page for the words that were not recognized well enough in the first pass. A final phase resolves fuzzy spaces and checks alternative hypotheses for the x-height to locate small-cap text. We follow this architecture of the Tesseract OCR engine to recognize Tamil characters. To train Tesseract for Tamil script recognition, I followed the complete training procedure as specified.
  • 25. Analysis - DFD: Tamil OCR Component: Tesseract OCR engine trained for the Tamil language; Image Extractor; Image Reader; Output Manager; Output Formatter; Text Speaker.
  • 27. Analysis - Training Procedure: Data files required. To train the Tamil language (lang = tam), I had to create 8 data files in the tessdata subdirectory: 1. tessdata/tam.freq-dawg 2. tessdata/tam.word-dawg 3. tessdata/tam.user-words 4. tessdata/tam.inttemp 5. tessdata/tam.normproto 6. tessdata/tam.pffmtable 7. tessdata/tam.unicharset 8. tessdata/tam.DangAmbigs
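A quick way to confirm that a training run produced all eight files listed above is to check the tessdata directory for them. The Python sketch below does only that; the function name `missing_data_files` and the directory path in the usage line are hypothetical.

```python
import os

# File name suffixes taken from the list of 8 data files above.
REQUIRED_SUFFIXES = [
    "freq-dawg", "word-dawg", "user-words", "inttemp",
    "normproto", "pffmtable", "unicharset", "DangAmbigs",
]


def missing_data_files(tessdata_dir, lang="tam"):
    """Return the expected data file names that are absent from tessdata_dir."""
    expected = ["%s.%s" % (lang, suffix) for suffix in REQUIRED_SUFFIXES]
    return [name for name in expected
            if not os.path.isfile(os.path.join(tessdata_dir, name))]


if __name__ == "__main__":
    # "tessdata" is a placeholder path for illustration.
    print(missing_data_files("tessdata"))
```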
  • 28. Analysis - Web OCR Component. Fig: Architecture of Web OCR Component. Diagram elements: Start; Webpage (Web Browser); Image Extractor; Image Reader; Tesseract Tamil Trained OCR Engine; Tamil Trained Dataset; Output Manager; Output Formatter; Web GUI Page; Stop.
  • 29. Analysis - Web OCR Component: 1. The Image Extractor extracts all pictures from the web page and saves them in a single folder called ‘OCR Image’. 2. The Image Reader takes the pictures one by one and sends them to the Tesseract engine. 3. The Tesseract engine processes the images and extracts text, Tamil and English only. 4. The output of the Tesseract engine is taken by the Output Manager, which maintains the text and images. 5. The Output Formatter puts the images and the corresponding extracted text in a Word document. 6. If the user moves the pointer over any image, steps 1 to 4 are carried out for it and the Text Speaker component reads the text aloud to the user.
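The first steps of this pipeline can be sketched in a few lines of Python. This is a simplified stand-in, not the project's VB.NET component: the image extractor only collects `<img>` URLs from the page HTML, and `recognize` is a hypothetical callable that stands in for the Image Reader plus the Tesseract engine.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageExtractor(HTMLParser):
    """Collects the absolute URL of every <img> tag in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(urljoin(self.base_url, src))


def extract_image_urls(html, base_url):
    """Step 1: find all pictures referenced by the web page."""
    parser = ImageExtractor(base_url)
    parser.feed(html)
    return parser.image_urls


def ocr_web_page(html, base_url, recognize):
    """Steps 2 to 4: run OCR on each image and keep (image, text) pairs."""
    return [(url, recognize(url)) for url in extract_image_urls(html, base_url)]


if __name__ == "__main__":
    page = '<p><img src="a.png"><img src="/b/c.jpg"/></p>'
    # The lambda is a dummy recognizer used only for demonstration.
    print(ocr_web_page(page, "http://example.com/x/", lambda url: "text of " + url))
```

Keeping the recognizer as a parameter mirrors the loose coupling between the Image Reader and the OCR engine described in the steps above.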
  • 30. Analysis - Web OCR Component: The Web OCR feature needs a web browser. It extracts all images from a particular page and runs OCR on them. Web OCR has the following tools: Image Grabber; Link Grabber; Web Page to Image Convertor; Web Page to PDF Convertor; Screen Shotter. All of these tools are essential for Web OCR. The Image Grabber extracts all images and saves them to a folder at the location the user specifies.
  • 31. Analysis- Web OCR Component Fig: Web browser
  • 32. Analysis - Web OCR Component: One of the components of the project is Web OCR. Because a better GUI is needed, a web browser is a good choice: it is the software where most users and image files are found. This browser module has low priority in the project, and not all of the features mentioned below may be implemented; a simple browser with the required features is enough. Developing the web browser requires the following technologies and concepts: VB.NET; application programming interfaces (APIs); multithreading; voice library (voice recognition).
  • 33. Analysis - Tamil OCR GUI. Flowchart: Start → Input Image → Preprocessing → Tesseract Engine (using the Tamil Trained Dataset) → Tamil OCR Output Text → Text Post Processing with Spell Checker (using the Tamil Dictionary and Tamil data files) → Accurate Output Text → Stop.
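The post-processing stage of this flow can be illustrated with a small dictionary-based corrector. Note the substitution: the project uses Hunspell for spell checking, whereas this Python sketch uses the standard library's `difflib` closest-match lookup as a simple stand-in. The function name, the cutoff value and the sample words are all assumptions made for illustration.

```python
import difflib


def correct_words(text, dictionary, cutoff=0.8):
    """Replace each OCR word that is not in the dictionary by its closest entry.

    Words with no sufficiently similar dictionary entry are left unchanged.
    """
    known = sorted(dictionary)
    corrected = []
    for word in text.split():
        if word in dictionary:
            corrected.append(word)
            continue
        matches = difflib.get_close_matches(word, known, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)


if __name__ == "__main__":
    words = {"tesseract", "engine"}
    print(correct_words("tesseract enginee xyz", words))
```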
  • 34. Analysis - Tamil OCR GUI. Fig: Block Diagram of Full Image OCR. Within the Tamil OCR GUI: Input Image → Pre-process → Tesseract OCR Engine (with the Tamil and English Trained Data Set) → Post Process → Output Text. Output Formats: 1. PDF 2. MS Word 3. RTF 4. XML 5. WAV 6. MP3 8. HTML 9. Text 10. Single Web page
  • 35. Analysis - Tamil OCR GUI. Fig: Block Diagram of Region Image OCR. Within the Tamil OCR GUI: Selected Image Region → Pre-process → Tesseract OCR Engine (with the Tamil and English Trained Data Set) → Post Process → Output Text.
  • 36. Analysis- Tamil OCR GUI Fig: Block Diagram of Snapshotting OCR
  • 37. Analysis- Tamil OCR GUI Fig: Block Diagram of Batch OCR
  • 38. Analysis - Pre-processing. Fig: Architecture of Pre-processor. Start → Input Image → Preprocess Manager (Increase dpi (max 300); Convert to black and white; Remove Background (max 300); Remove Inner images (max 300)) → Preprocessed Image. Pre-processing is an optional step in Tamil OCR. It improves the quality of the image so that OCR is more accurate.
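Two of the pre-processing steps can be sketched without any imaging library. The Python code below assumes a grayscale image given as rows of 0 to 255 pixel values; the fixed threshold of 128 and both function names are illustrative assumptions, not the project's actual algorithm. The dpi helper only computes the scale factor needed to reach a target resolution capped at 300 dpi.

```python
def to_black_and_white(gray_rows, threshold=128):
    """Convert grayscale pixels (0-255) to pure black (0) or white (255)."""
    return [[255 if pixel >= threshold else 0 for pixel in row]
            for row in gray_rows]


def dpi_scale_factor(current_dpi, target_dpi=300, max_dpi=300):
    """Return how much to enlarge an image to reach the target dpi.

    Images already at or above the (capped) target are left at scale 1.0.
    """
    target = min(target_dpi, max_dpi)
    if current_dpi >= target:
        return 1.0
    return target / current_dpi


if __name__ == "__main__":
    print(to_black_and_white([[0, 100, 128], [200, 255, 127]]))
    print(dpi_scale_factor(150))
```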
  • 39. Design - Modularity: The system is divided into functional modules that are independent of one another. This project has the following modules: 1) Tamil Training to Tesseract 2) Web OCR Component 3) OCR GUI
  • 40. Design - Introduction: The system is designed in two phases: Preliminary System Design and Detailed System Design. Coupling is the measure of relative interdependence among modules. In the Tamil OCR project coupling is very loose, because any module can be added without affecting the other modules, or with only a small change to the calling interface. Coupling depends on the interface complexity between modules, the point at which entry or reference is made to a module, and what data pass across the interface. Cohesion is the measure of the relative functional strength of individual modules. In the Tamil OCR project cohesion is very high, because each module performs a single task within a software procedure and requires little interaction with procedures performed in other parts of the program.
  • 41. Design - Modularity: 1. Training Tesseract-OCR Task. Tesseract 3.02 is fully trainable. Tesseract 2.0 accepts only TIFF images, but the latest version includes Leptonica, which automatically handles images in other formats. This module covers the training process, gives some guidelines on applicability to various languages, and sets out what to expect from the results. Tesseract was originally designed to recognize English text only. Efforts have been made to modify the engine and its training system so that they can deal with other languages and UTF-8 characters.
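For orientation, the sequence of command-line tools in a Tesseract 3.0x training run can be sketched as below. This Python helper only builds the commands; it reflects the general 3.0x training procedure as commonly documented rather than the exact steps used in this project, and it should be checked against the Tesseract training documentation for the installed version. The font name `myfont`, the `font_properties` file and the helper name are hypothetical.

```python
def training_commands(lang="tam", font="myfont", exp=0):
    """Return the training commands for one training image, in order.

    Assumes a training image named <lang>.<font>.exp<N>.tif and a
    font_properties file in the working directory.
    """
    base = "%s.%s.exp%d" % (lang, font, exp)
    return [
        # 1. Generate a box file, which is then corrected in a box editor.
        ["tesseract", base + ".tif", base, "batch.nochop", "makebox"],
        # 2. Produce the feature file (<base>.tr) from the image and box file.
        ["tesseract", base + ".tif", base, "box.train"],
        # 3. Compute the character set.
        ["unicharset_extractor", base + ".box"],
        # 4. Cluster the character features.
        ["shapeclustering", "-F", "font_properties", "-U", "unicharset", base + ".tr"],
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", lang + ".unicharset", base + ".tr"],
        ["cntraining", base + ".tr"],
        # 5. After prefixing the output files with "<lang>.", combine them.
        ["combine_tessdata", lang + "."],
    ]


if __name__ == "__main__":
    for command in training_commands():
        print(" ".join(command))
```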
  • 42. Design - Modularity: 2. Web Tamil OCR Component Development. 1. The Image Extractor of the Web Tamil OCR component extracts all pictures from the web page and saves them in a single folder called ‘OCR Image’. 2. The Image Reader takes the pictures one by one and sends them to the Tesseract engine. 3. The Tesseract engine processes the images and extracts text, Tamil and English only. 4. The output of the Tesseract engine is taken by the Output Manager, which maintains the text and images. 5. The Output Formatter puts the images and the corresponding extracted text in a Word document. 6. If the user moves the pointer over any image, steps 1 to 4 are carried out for it and the Text Speaker component reads the text aloud to the user.
  • 43. Design - Modularity: 3. Web browser. The main component of the project is Tamil OCR. Because a better GUI is needed, a web browser is a good choice: it is the software most users already work in. This browser module has low priority in the project. Tesseract-OCR is a command-line tool; the project GUI, developed using .NET, lets the user work easily with a graphical user interface. The project targets a component for the web browser that extracts Tamil characters from images. The user interface for OCRgui contains an open button to select the image file (TIFF format), a recognition button that converts the image to editable text, a font button that selects the font type for the output, and a preferences window that holds all preferences. The web browser uses the Tamil OCR component to extract the text of all images if it is Tamil or English.
  • 44. Design - Modularity: Contd. (Web browser). The software interfaces used in the project are built in VB.NET, which makes it easy to create programs with a graphical user interface. The project does not use any particular protocol, so no specific communication interfaces should be needed.
  • 45. Design - Modularity: 4. OCR GUI. The OCR GUI lets the user perform various OCR functions in Tamil more easily. This module provides the following features: 1. Whole Image OCR 2. Image Region OCR 3. Snapshot OCR 4. Snipping Screen OCR 5. Batch OCR 6. Various image processing functions 7. Good Tamil text editor 8. Easy Tamil transliteration typing (phonetic style) 9. Good Tamil text-to-speech
  • 46. Design - Modularity: Contd. (OCR GUI) 10. Image Convertor 11. PDF OCR 12. Various output formats 13. Various input types 14. Easy typing in other languages 15. Email sender 16. Spell checking 17. Snipping and saving 18. Combined OCR document 19. OCR report
  • 47. Detailed Design Fig: Few Class Diagram of OCR GUI
  • 48. GUI Design Fig: Whole Page OCR (windows 7 and 8 version)
  • 49. GUI Design Fig: Whole Page OCR (windows XP version)
  • 53. GUI Design Fig: Batch OCR Windows 8 modes
  • 54. GUI Design Fig: Batch OCR Windows XP mode
  • 56. GUI Design Fig: Snapshots OCR in English
  • 57. GUI Design Fig: Converted to black and White
  • 59. GUI Design Fig: Tamil OCR Setting Window
  • 60. GUI Design Fig: Tamil OCR Other Setting Window
  • 61. GUI Design Fig: Tamil OCR Speak Setting Window
  • 62. GUI Design Fig: Tamil OCR UI Setting Window
  • 64. GUI Design Fig: Image Preview With image Properties
  • 65. GUI Design Fig: Tamil OCR Print Preview Window
  • 66. GUI Design Fig: Tamil OCR Batch OCR Theme Window
  • 67. GUI Design Fig: Tamil OCR with Blackly Mode
  • 69. GUI Design Fig: Web Page Capture
  • 70. GUI Design Fig: Web Page Link Grabbing
  • 71. GUI Design Fig: Image Grabber in progress
  • 73. GUI Design Fig: Find and Highlight Feature
  • 74. GUI Design Fig: Snipping and Saving Tool
  • 75. GUI Design Fig: Batch OCR Context Help
  • 76. GUI Design Fig: Auto Cropped Image
  • 78. GUI Design Fig: Deskewed Image
  • 79. Tesseract-OCR for Tamil language Increases the number of users of tesseract GUI makes more people to migrate towards open source It makes easy for the people to work tesseract, without the knowledge about it. Web browser can make use the OCR Tamil component in the web browser For blind people it is very good beneficial Advantages of Project
  • 80. The applications of Tesseract-OCR are  News paper industries  Books publishers.  Industries of Digital Expertise  Web browser  Web Service Creation  Web Application  Common People’s Advantages of Project
  • 81.  FUTURE ENHANCEMENT  The accuracy of recognition can be improved by more training  AI Applied to make intelligent snapshotting Reader  Image Content Search  More Accurate and powerful Spell Checker Corrector  Powerful web OCR  Web Service  Online Web Application  Handwritten Recognition  Browser Extensions  More Powerful Image Preprocessing Future Enhancement
  • 82. Conclusion: I hope the result of the project is that Tesseract-OCR is trained for the Tamil language, and that further training can be used to increase its accuracy. This project can increase the number of Tesseract OCR users among Tamil speakers and encourage most of the Tamil press to migrate towards free and open source software. The GUI for Tesseract will make it easy to use for people who have no knowledge of Tesseract.
  • 83. Conclusion: The Tamil OCR project was a great opportunity for me to experience the workings of software, learn about research, and bring my 2 years of academics into practice. The 4-month project period was very useful for learning industry-specific practices and standards. I am very confident that this experience will be a great boost and help for my career ahead. The period under Dr. K.S. Kuppusamy was a great learning experience with very helpful project supervisors. Their evaluation and guidance helped me a lot to learn and understand new techniques and methodologies. Overall, it was a great experience and will greatly benefit me in future. The biggest advantage of Tesseract OCR is its availability as open code. Thus anybody with the interest to study its working procedure, and the skill to improve it, is able to train it for a new language. In this document, I presented the step-by-step procedure to train the Tesseract engine for Tamil printed text documents. At first we trained Tesseract, by performing a series of tests, for a particular English font that had not been supported earlier. We then trained Tesseract to recognize the Tamil character set and observed the results. As editing the box file manually is a cumbersome task (this language has a large character set), we tried to generate the box file automatically. We were able to detect the vowels and consonants of the Tamil character set; however, Tesseract still needs to be trained in future for the dependent modifiers and other characters that occur in Tamil documents.
  • 84. References: 1. The tesseract-OCR project, developed by Ray Smith, can be accessed at http://code.google.com/p/tesseract-OCR/ 2. Instructions for training tesseract-OCR for Indic languages can be obtained from http://code.google.com/p/tesseract-OCR/wiki/TrainingTesseract 3. The tesseract Indic home page is developed and maintained by Debayan Banerjee and can be found at http://code.google.com/p/tesseractindic/ etc.
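The automatic box-file generation mentioned in the conclusion (slide 83) is not shown in the slides. As a minimal sketch, assuming the glyph bounding boxes are already known in image coordinates (origin at the top-left), the lines of a Tesseract 3.x box file could be produced as follows. The function name `to_box_lines` and the sample glyph positions are hypothetical and are not taken from the project; only the line layout `symbol left bottom right top page`, with the origin at the bottom-left of the image, is the box file format itself.

```python
from typing import Iterable, List, Tuple

# Hypothetical input type: (symbol, left, top, width, height),
# measured in pixels from the top-left corner of the page image.
Glyph = Tuple[str, int, int, int, int]


def to_box_lines(glyphs: Iterable[Glyph], image_height: int, page: int = 0) -> List[str]:
    """Format glyph boxes as box-file lines: 'symbol left bottom right top page'.

    Box files measure y from the bottom of the image, so the
    top-left-origin y values are flipped using the image height.
    """
    lines = []
    for symbol, left, top, width, height in glyphs:
        right = left + width
        box_top = image_height - top                # upper edge, counted from the bottom
        box_bottom = image_height - (top + height)  # lower edge, counted from the bottom
        lines.append(f"{symbol} {left} {box_bottom} {right} {box_top} {page}")
    return lines


# Illustrative positions only; real values would come from segmenting the training image.
glyphs = [("த", 10, 20, 30, 40), ("மி", 45, 20, 35, 40)]
box_lines = to_box_lines(glyphs, image_height=100)
print("\n".join(box_lines))
```

Each Tamil symbol, including a consonant together with its vowel sign, occupies one line, so a generated file still has to be checked by hand where the segmentation splits or merges characters.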