Optical character recognition for Ge'ez characters

Mekelle Institute of Technology
Project proposal for 2016
Project Members Dep’t
1. Awet Haileslassie CSE
2. Hadush Hailu CSE
3. Mulu Hailemariam CSE
4. Negash Desalegn CSE
Submission date: March 28, 2016
OCR Algorithm for Ge'ez Characters

Abstract
It is no exaggeration to say that Ethiopia's intellectual property is hardly
digitized. It is stored on paper, whether as centuries-old parchment in monasteries
or in file cabinets in various regional and federal offices. Digitizing this plethora
of documents by hand is not feasible. Building a system that identifies and digitizes
documents written in Ethiopic characters is therefore a logical move. Such a system
would play a pivotal role in preserving the intellectual property of the country and
making it easily accessible to posterity.

Contents
Introduction
Background
Problem Statement
Proposed System
Objective
General Objective
Specific Objectives
Limitation
Methodology
1. The Imaging Stage
2. The OCR Process
3. Distinguishing between text and images - Segmentation
4. Character recognition - Feature Extraction
5. Recognition of Words
6. Correction of Unrecognized Characters – Error Correction
7. Output Formatting
Project schedule plan
References

Introduction
Optical Character Recognition (OCR) is a technology used to translate scanned images
of text into computer-editable and searchable text. OCR software is an analytical
artificial intelligence system that considers only sequences of characters rather
than whole words or phrases, and does not cross-validate data during the recognition
process. Based on the analysis of sequential lines and curves, OCR makes 'best
guesses' at characters, using database look-up tables to closely associate or match
the strings of characters that form words. For such a system to recognize hand-printed
or machine-printed forms effectively, words must be separated into individual
characters. That is why most typical administrative forms require people either to
hand print into neatly spaced boxes or to use combs (tick marks) at the bottom of
input lines, which force spaces between the letters entered on a form. Without combs
or boxes, conventional technologies reject fields when people do not follow the
structure in filling out forms, resulting in significant administrative overhead and
costs to forms-processing organizations (OCRopus, 2009).
OCR is a process that allows printed text (typewritten, printed out, or handwritten)
to be recognized optically and converted into machine-readable code that a computer
can accept for further processing (Genovese, 1970). OCR systems provide a tremendous
opportunity to take over processes that are repetitive, boring, labor-intensive,
error-prone, and time-consuming for human beings.
Among others, the following are the major advantages of the OCR technology:
• It can be used to scan and preserve historical documents.
• It can be used for scanning data entry forms in a faster and less error prone manner.

At present, the recognition of Latin-based characters from well-conditioned
documents can be considered a feasible technology. On the other hand, the processing
of non-Latin scripts is still a subject of active research.
Ethiopic script-based OCR processing is currently among the least developed ICT
disciplines in the country. Developments in this area are mainly limited to
preliminary research activities undertaken at different institutions of higher
education, such as the former School of Information Science for Africa (SISA).
Such efforts are undertaken in an uncoordinated way.
Background
Ge'ez (ግዕዝ, also referred to by some as "Ethiopic") is an ancient South Semitic
language that originated in the northern region of Ethiopia and Eritrea in the Horn
of Africa. It later became the official language of the Kingdom of Aksum and of the
Ethiopian imperial court.
Today, Ge'ez remains in use only as the main liturgical language of the Ethiopian
Orthodox Tewahedo Church, the Eritrean Orthodox Tewahedo Church, the Ethiopian
Catholic Church, and the Beta Israel Jewish community. For sermons, however, Amharic
(the main lingua franca of modern Ethiopia) or other local languages may be used in
Ethiopia, and Tigrinya in Eritrea and in the Tigray Region of Ethiopia. Tigrinya and
Tigre are closely related to Ge'ez, with at least four different configurations
proposed.[1] Some linguists do not believe that Ge'ez constitutes the common ancestor
of modern Ethiopian languages, but that Ge'ez became a separate language early on
from some hypothetical, completely unattested language,[2] and can thus be seen as an
extinct sister language of Tigre and Tigrinya.[3] The foremost Ethiopian experts,
such as Amsalu Aklilu, point to the vast proportion of inherited nouns that are
unchanged, and even spelled identically, in both Ge'ez and Amharic (and, to a lesser
degree, Tigrinya).[4]
Amharic is a Semitic language spoken in Ethiopia. It is the second-most spoken
Semitic language in the world, after Arabic, and the official working language of the
Federal Democratic Republic of Ethiopia. Amharic is also the official or working
language of several of the states within the federal system.
It has been the working language of government, the military, and the Ethiopian
Orthodox Tewahedo Church throughout medieval and modern times. The 2007 census
counted nearly 22 million native speakers in Ethiopia.[5] Outside Ethiopia, Amharic
is the language of some 2.7 million emigrants. It is written left-to-right using the
Amharic fidel (ፊደል), which grew out of the Ge'ez abugida. In Ethiopian Semitic
languages this script is called ፊደል fidel ("writing system", "letter", or
"character") and አቡጊዳ abugida (from the first four Ethiopic letters, which gave rise
to the modern linguistic term abugida).[6]
Problem Statement
What should you do if you want to convert scanned papers, books, and documents into
electronic files such as Word documents, PDFs, or plain text? You need optical
character recognition, usually abbreviated to OCR, which translates scanned images of
handwritten, typewritten, or printed text into machine-encoded text.
The motivation for this project is the absence of any production-quality OCR
software for Ge'ez characters, whether developed locally or internationally. The
language is not supported by the ASCII standard for use on computers. The Ge'ez fonts
in use have some similar characters, while others are difficult to convert from one
font to another. This shows that there is no standardized font in the country. For
lack of a standard font, the country lacks various services and applications, such as
OCR.
Proposed System
The proposed system is an advanced OCR algorithm that uses both OpenCV (Open Source
Computer Vision) and an ANN (Artificial Neural Network) to convert scanned images of
handwritten, typewritten, or printed text into machine-encoded text.
The open-source Tesseract OCR engine, which is considered one of the most accurate
free OCR engines available, will be used as the basis of the project. It converts
scanned documents into text that can be understood and edited on computers.

Objective
In Ethiopia there is not a single quality OCR application for Ge'ez characters. In
general, OCR is a well-solved problem and a mature technology for Latin scripts, but
for non-Latin scripts, such as some Asian and African scripts, it is still an unsolved
problem and an active research area. Ge'ez is a non-Latin script. Compared with Latin
scripts, its character set is large: there are about 380 Ge'ez characters, of which
about 310 are in frequent use.
General Objective
The general objective is to build a system that identifies and digitizes documents
written in Ethiopic characters. This system would play a pivotal role in preserving
the intellectual property of the country and making it easily accessible to posterity.
Specific Objectives
The project would entail many tasks, including:
1. Machine learning: training the system to recognize Ethiopic fonts. (Doing this
with high fidelity, especially for century-old handwritten documents, would be an
interesting challenge.)
2. Parallelizing the developed OCR algorithm to minimize the time it takes to do the
task.
3. Error identification and correction: once a document is scanned, comparing the
words in the document against a list of all known Ge'ez, Amharic, Tigrigna, ... words.
Limitation
The Ethiopic script has its own symbols for numbers, but for most writing purposes
the Arabic numerals 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 are used instead. For this reason,
Ethiopic numerals are not included in this project.
Methodology
The process of converting Ge'ez character images into electronic form, usually
referred to as digitization, is undertaken in several steps.
Scanning a document and representing the scanned image for further processing is
called the preprocessing or imaging phase.
Manipulating the scanned image of a document to produce searchable text is called
the OCR processing stage.
1. The Imaging Stage
The imaging process involves scanning the document and storing it as an image. The
most popular image format used for this purpose is the Tagged Image File Format
(TIFF).
The resolution of the scan (the number of dots per inch, dpi) strongly affects the
accuracy of the OCR process.
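To make the role of resolution concrete, the short sketch below computes how many pixels a scanned page and a single printed glyph occupy at a given dpi. The A4 page size, the 12-point glyph, and the dpi values are illustrative assumptions, not figures fixed by this proposal.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72  # typographic points


def scan_size_pixels(width_mm: float, height_mm: float, dpi: int) -> tuple[int, int]:
    """Return the (width, height) in pixels of a page scanned at `dpi`."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px, height_px


def glyph_height_pixels(point_size: float, dpi: int) -> int:
    """Return the approximate pixel height of a glyph set at `point_size`."""
    return round(point_size / POINTS_PER_INCH * dpi)


# Assumed example: an A4 page (210 mm x 297 mm) with 12-point text.
for dpi in (150, 300, 600):
    print(dpi, scan_size_pixels(210, 297, dpi), glyph_height_pixels(12, dpi))
```

Doubling the dpi doubles the pixel height of every glyph, which gives the later recognition steps more detail to work with at the cost of larger image files.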
2. The OCR Process
The major steps of the OCR processing stage are shown below.
3. Distinguishing between text and images - Segmentation
In this step, the text blocks and image blocks of the scanned image are identified.
The boundaries of each block are analyzed in order to recognize the text.
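The proposal does not fix a particular segmentation method. One common approach for clean pages is the projection profile, sketched below under the assumption that the page is already binarized (1 = ink, 0 = background), not skewed, and that neighbouring glyphs do not touch. The function names and the toy image are hypothetical.

```python
def find_runs(profile, threshold=0):
    """Return (start, end) index pairs of consecutive entries above `threshold`."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i                      # a run of ink begins
        elif value <= threshold and start is not None:
            runs.append((start, i))        # the run ended just before i
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_lines(image):
    """Split a binary image into text lines using the row-wise ink count."""
    row_profile = [sum(row) for row in image]
    return find_runs(row_profile)


def segment_characters(image, line):
    """Split one text line (top, bottom) into character column ranges."""
    top, bottom = line
    width = len(image[0])
    col_profile = [sum(image[r][c] for r in range(top, bottom)) for c in range(width)]
    return find_runs(col_profile)


# Toy page: two glyphs on the first line, one glyph on the second.
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
for line in segment_lines(page):
    print(line, segment_characters(page, line))
```

Real scans of old or handwritten documents would likely need noise removal and skew correction before a profile like this is reliable.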
4. Character recognition - Feature Extraction
This step involves recognizing a character using a method known as feature
extraction. OCR tools store rules about the characters of a given script through a
learning process. A character is then identified by analyzing its shape and comparing
its features against the set of rules stored in the OCR engine that distinguish each
character.
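As a minimal illustration of the idea, the sketch below uses zoning (ink density per grid cell) as the feature and a nearest-template comparison as the stored "rules". Both are stand-ins chosen for brevity; the proposal itself names an ANN, and the toy bitmaps and class labels here are hypothetical, not real Ge'ez glyphs.

```python
def zoning_features(glyph, zones=2):
    """Ink density in each cell of a zones x zones grid over a binary glyph.

    Assumes the glyph has at least `zones` rows and `zones` columns.
    """
    height, width = len(glyph), len(glyph[0])
    features = []
    for zr in range(zones):
        for zc in range(zones):
            r0, r1 = zr * height // zones, (zr + 1) * height // zones
            c0, c1 = zc * width // zones, (zc + 1) * width // zones
            ink = sum(glyph[r][c] for r in range(r0, r1) for c in range(c0, c1))
            features.append(ink / ((r1 - r0) * (c1 - c0)))
    return features


def classify(glyph, templates):
    """Return the label whose stored feature vector is closest to the glyph's."""
    feats = zoning_features(glyph)

    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(feats, templates[label]))

    return min(templates, key=distance)


# "Learning": store one feature vector per class from a labelled example.
left_bar = [[1, 0, 0, 0]] * 4
top_bar = [[1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
templates = {
    "class_a": zoning_features(left_bar),
    "class_b": zoning_features(top_bar),
}

# A slightly noisy left bar should still be matched to class_a.
noisy = [[1, 0, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0]]
print(classify(noisy, templates))
```

With several hundred visually similar Ge'ez characters, a real system would need far richer features and many training samples per character.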
5. Recognition of Words
Following character recognition, words are identified by comparing each string of
characters against the words of an existing Ge'ez dictionary.
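The dictionary comparison can be sketched with approximate string matching from Python's standard `difflib` module. The three-word list and the 0.6 similarity cutoff are assumptions for illustration; a real system would load a full word list for Ge'ez, Amharic, and Tigrigna.

```python
import difflib

# Hypothetical miniature dictionary; the words are taken from this proposal.
WORDS = ["ግዕዝ", "ፊደል", "አቡጊዳ"]


def recognize_word(candidate, dictionary, cutoff=0.6):
    """Return the dictionary word closest to `candidate`, or None if none is close."""
    if candidate in dictionary:
        return candidate
    matches = difflib.get_close_matches(candidate, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# An OCR result with one wrongly recognized character in the middle.
print(recognize_word("ፊዳል", WORDS))
# A string that resembles no dictionary word is left for manual correction.
print(recognize_word("ዝዝዝ", WORDS))
```

Strings that return `None` are natural candidates for the manual correction step that follows.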
6. Correction of Unrecognized Characters – Error Correction
In this step, the user is allowed to provide corrections to unrecognized characters.
7. Output Formatting
The final step involves storing the output in one of the industry-standard formats,
such as PDF, Word, or plain Unicode text.
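For the plain-text case, the sketch below shows one way recognized text might be checked and saved as UTF-8. It relies on the Unicode "Ethiopic" block (U+1200 to U+137F); the function names, the sample string, and the file name are hypothetical.

```python
import os
import tempfile
import unicodedata

ETHIOPIC_BLOCK = range(0x1200, 0x1380)  # Unicode "Ethiopic" block, U+1200 to U+137F


def is_ethiopic(ch: str) -> bool:
    """True if the character lies in the basic Ethiopic block."""
    return ord(ch) in ETHIOPIC_BLOCK


def encode_output(text: str) -> bytes:
    """Encode recognized text as UTF-8 for storage in a plain-text file."""
    return text.encode("utf-8")


def describe(text: str) -> list[str]:
    """List each character with its code point and Unicode name, for checking output."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in text]


recognized = "ግዕዝ"  # stand-in for the recognizer's output
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "ocr_output.txt")
    with open(path, "wb") as handle:
        handle.write(encode_output(recognized))
    with open(path, encoding="utf-8") as handle:
        restored = handle.read()

print(restored == recognized)
for line in describe(recognized):
    print(line)
```

Each character in this block takes three bytes in UTF-8, so the file is larger than its character count suggests.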

Project schedule plan
Project Action                       Required Time
Collection of data                   One week
Collection of system requirements    One week
Methodology                          Two weeks
System design and implementation     1 1/2 months
Results and observation              One week
References
[1] Bulakh, Maria; Kogan, Leonid (2010). "The Genealogical Position of Tigre and
the Problem of North Ethio-Semitic Unity". Zeitschrift der Deutschen
Morgenländischen Gesellschaft 160 (2): 273–302.
[2] Connell, Dan; Killion, Tom (2010). Historical Dictionary of Eritrea (2nd,
illustrated ed.). Scarecrow Press. p. 508. ISBN 978-0-8108-7505-0.
[3] Haarmann, Harald (2002). Lexikon der untergegangenen Sprachen [Lexicon of
extinct languages] (in German) (2nd ed.). C. H. Beck. p. 76. ISBN 978-3-406-47596-2.
[4] Amsalu Aklilu. ጥሩ የአማርኛ ድርሰት እንዴት ያለ ነው! Kuraz Publishing Agency. p. 42.
[5] http://catalog.ihsn.org/index.php/catalog/3583/download/50086
[6] "Amharic". Wikipedia. https://en.wikipedia.org/wiki/Amharic
