Mekelle Institute of Technology
Project Proposal for
OCR Algorithm for Ge'ez Characters
2016
Project Members              Dep't
1. Awet Haileslassie         CSE
2. Hadush Hailu              CSE
3. Mulu Hailemariam          CSE
4. Negash Desalegn           CSE
Submission date: March 28, 2016
Abstract
It is no exaggeration to claim that Ethiopia's intellectual property is hardly
digitized. It is stored on paper, whether as centuries-old parchment in
monasteries or in file cabinets in various regional and federal offices.
Digitizing this plethora of documents by hand is not feasible. To this end,
building a system that identifies and digitizes documents written in Ethiopian
characters is a logical move. Such a system would play a pivotal role in preserving
the intellectual property of the country and making it easily accessible to posterity.
Contents
Introduction
Background
Problem Statement
Proposed System
Objective
  General Objective
  Specific Objectives
Limitation
Methodology
  1. The Imaging Stage
  2. The OCR Process
  3. Distinguishing between text and images - Segmentation
  4. Character recognition - Feature Extraction
  5. Recognition of Words
  6. Correction of Unrecognized Characters – Error Correction
  7. Output Formatting
Project Schedule Plan
References
Introduction
Optical Character Recognition (OCR) is a technology used to translate scanned
images of text into computer-editable and searchable text. OCR software is an
analytical artificial intelligence system that considers only sequences of characters
rather than whole words or phrases, and does not cross-validate data during the
recognition process. Based on the analysis of sequential lines and curves, OCR makes
'best guesses' at characters, using database look-up tables to closely associate or
match the strings of characters that form words. For such a system to recognize
hand-printed or machine-printed forms effectively, words must be separated into
individual characters. That is why most administrative forms require people
either to hand print into neatly spaced boxes or to use combs (tick marks) at the
bottom of input lines, which force spaces between the letters entered on a form.
Without combs or boxes, conventional technologies reject fields when people do not
follow the structure while filling out forms, resulting in significant administrative
overhead and costs for forms-processing organizations (OCRopus, 2009).
OCR is a process that allows printed text (typewritten, printed out, or handwritten)
to be recognized optically and converted into machine-readable code that
a computer can accept for further processing (Genovese, 1970). OCR
systems offer a tremendous opportunity to take over processes that are repetitive,
boring, labor-intensive, error-prone, and time-consuming for human beings.
Among others, the following are the major advantages of OCR technology:
• It can be used to scan and preserve historical documents.
• It can be used to scan data entry forms faster and with fewer errors.
At present, the recognition of Latin-based characters from well-conditioned
documents can be considered a relatively feasible technology. The processing of
non-Latin scripts, on the other hand, is still a subject of active research.
Ethiopic script-based OCR processing is currently among the least developed ICT
disciplines in the country. Developments in this area are mainly limited to
preliminary research activities undertaken at different institutions of higher
education, such as the former School of Information Science for Africa (SISA).
These efforts are undertaken in an uncoordinated way.
Background
Ge'ez (ግዕዝ, also referred to by some as "Ethiopic") is an ancient South Semitic
language that originated in the northern region of Ethiopia and in Eritrea, in the
Horn of Africa. It later became the official language of the Kingdom of Aksum and
of the Ethiopian imperial court.
Today, Ge'ez survives only as the main language of the liturgy of the Ethiopian
Orthodox Tewahedo Church, the Eritrean Orthodox Tewahedo Church, the
Ethiopian Catholic Church, and the Beta Israel Jewish community. Sermons,
however, may be given in Amharic (the main lingua franca of modern Ethiopia) or
other local languages in Ethiopia, and in Tigrigna in Eritrea and the Tigray Region
of Ethiopia. Tigrigna and Tigre are closely related to Ge'ez, with at least four
different configurations proposed.[1] Some linguists do not believe that Ge'ez is the
common ancestor of modern Ethiopian languages, but rather that it became a separate
language early on from some hypothetical, completely unattested language,[2] and
can thus be seen as an extinct sister language of Tigre and Tigrinya.[3] The foremost
Ethiopian experts, such as Amsalu Aklilu, point to the vast proportion of inherited
nouns that are unchanged, and even spelled identically, in both Ge'ez and Amharic
(and, to a lesser degree, Tigrinya).[4]
Amharic is a Semitic language spoken in Ethiopia. It is the second-most spoken
Semitic language in the world, after Arabic, and the official working language of the
Federal Democratic Republic of Ethiopia. Amharic is also the official or working
language of several of the states within the federal system.
It has been the working language of government, the military, and the Ethiopian
Orthodox Tewahedo Church throughout medieval and modern times. The 2007
census counted nearly 22 million native speakers in Ethiopia.[5] Outside Ethiopia,
Amharic is the language of some 2.7 million emigrants. It is written left to right
using the Amharic fidel (ፊደል), which grew out of the Ge'ez abugida. In Ethiopian
Semitic languages the script is called ፊደል fidel ("writing system", "letter", or
"character") and አቡጊዳ abugida (from the first four Ethiopic letters, which gave rise
to the modern linguistic term abugida).[6]
Problem Statement
What should you do if you want to convert scanned papers, books, and documents
into electronic files such as Word documents, PDFs, or plain text? You need optical
character recognition, usually abbreviated to OCR, which translates scanned images
of handwritten, typewritten, or printed text into machine-encoded text.
The motivation for this project is the absence of any production-quality optical
character recognition software for Ge'ez characters, whether developed locally or
internationally. The script is not covered by the ASCII standard, so it cannot be
used on a computer through ASCII alone. The fonts in use share some characters,
while others are difficult to convert from one font to another. This shows that no
standardized font exists in the country. Because of this lack of a standard font,
the country lacks various services and applications, such as OCR.
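The ASCII limitation mentioned above can be checked directly. The following is a minimal sketch, not part of the proposed system: it relies on the fact that Unicode assigns the basic Ethiopic block to the range U+1200 to U+137F, and the helper names `in_ethiopic_block` and `ascii_encodable` are introduced here only for illustration.

```python
# Sample text: the name of the script as written in this proposal.
SAMPLE = "ግዕዝ"

# The basic Unicode "Ethiopic" block covers U+1200..U+137F.
ETHIOPIC_START, ETHIOPIC_END = 0x1200, 0x137F


def in_ethiopic_block(ch: str) -> bool:
    """Return True if a single character lies in the basic Ethiopic block."""
    return ETHIOPIC_START <= ord(ch) <= ETHIOPIC_END


def ascii_encodable(text: str) -> bool:
    """Return True if every character of the text fits in 7-bit ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for ch in SAMPLE:
        print(f"{ch} U+{ord(ch):04X} ethiopic={in_ethiopic_block(ch)}")
    print("ASCII encodable:", ascii_encodable(SAMPLE))
    print("UTF-8 byte length:", len(SAMPLE.encode("utf-8")))
```

Each Ge'ez character falls outside ASCII but encodes cleanly as UTF-8, which is why the recognized output is best stored as Unicode text.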
Proposed System
The proposed system is an OCR algorithm that combines OpenCV (Open Source
Computer Vision) with an ANN (Artificial Neural Network) to convert scanned images
of handwritten, typewritten, or printed text into machine-encoded text.
The open-source Tesseract OCR engine, which is considered the most accurate free
OCR engine in existence, will be used as the basis for turning scanned documents
into text that can be understood and edited on a computer.
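As a rough illustration of how Tesseract could be driven from a script, the sketch below builds the argument list for the `tesseract` command-line tool (`tesseract <image> <output base> -l <language>`). The language code `gez` and the file names are hypothetical: they stand for a model this project would have to train, and nothing here is confirmed by the proposal.

```python
import shutil
import subprocess
from typing import List


def build_tesseract_command(image_path: str, output_base: str, lang: str) -> List[str]:
    """Build the argument list for the Tesseract command-line tool.

    Tesseract writes the recognized text to `<output_base>.txt`.
    """
    return ["tesseract", image_path, output_base, "-l", lang]


def run_tesseract(image_path: str, output_base: str, lang: str) -> bool:
    """Run Tesseract if it is installed; return True on success."""
    if shutil.which("tesseract") is None:
        return False  # the engine is not installed on this machine
    cmd = build_tesseract_command(image_path, output_base, lang)
    return subprocess.run(cmd, check=False).returncode == 0


if __name__ == "__main__":
    # "gez" is a hypothetical language code for a model trained in this project.
    print(build_tesseract_command("page.tif", "page", "gez"))
```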
Objective
Ethiopia has no quality OCR application for Ge'ez characters. In general, OCR
is a well-solved problem and a mature technology for Latin scripts, but for
non-Latin scripts, including some Asian and African scripts, it is still an unsolved
problem and an active research area. Ge'ez is a non-Latin script. Compared with
Latin scripts, its character set is large: about 380 characters, of which about 310
are in frequent use.
General Objective
The general objective is to build a system that identifies and digitizes documents
written in Ethiopian characters. This system would play a pivotal role in preserving
the intellectual property of the country and making it easily accessible to posterity.
Specific Objectives
The project would entail many tasks, including:
1. Machine learning: training the system to recognize Ethiopic fonts (doing
this with high fidelity, especially for centuries-old handwritten documents, would be
an interesting challenge).
2. Parallelizing the developed OCR algorithm to minimize the time it
takes to do the task.
3. Once the document is scanned, comparing the words in the document
against a list of all known Ge'ez, Amharic, Tigrigna, ... words for error
identification and correction.
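The parallelization task can be sketched at the page level, since pages of a document can be recognized independently. This is only an illustration under stated assumptions: `recognize_page_stub` is a hypothetical stand-in for the real per-page OCR routine, and pages are represented as strings for simplicity.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def recognize_page_stub(page: str) -> str:
    """Hypothetical stand-in for the real per-page OCR routine."""
    return page.upper()


def ocr_pages(pages: Sequence[str],
              recognize: Callable[[str], str] = recognize_page_stub,
              max_workers: int = 4) -> List[str]:
    """Recognize pages concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the input sequence.
        return list(pool.map(recognize, pages))


if __name__ == "__main__":
    print(ocr_pages(["page one", "page two", "page three"]))
```

A thread pool keeps the sketch simple. For CPU-bound recognition, `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface and may be the better fit.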
Limitation
The Ethiopic script has its own symbols for numbers, but in everyday writing the
language mostly uses the Arabic numerals 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. For this
reason, the Ethiopic numerals are not included in this project.
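One way to apply this limitation when building the set of characters to recognize is to filter by Unicode category. The sketch below assumes the label set is drawn from the basic Ethiopic block (U+1200 to U+137F), where letters have category `Lo` and the Ethiopic digits and numbers (U+1369 to U+137C) have category `No`. The function names are illustrative, and the resulting count depends on the Unicode version shipped with Python.

```python
import unicodedata
from typing import List


def ethiopic_letters(start: int = 0x1200, end: int = 0x137F) -> List[str]:
    """Return the letters of the basic Ethiopic block, skipping numerals.

    Letters have Unicode category 'Lo'. Ethiopic numerals ('No'),
    punctuation ('Po'), and unassigned code points are left out.
    """
    return [chr(cp) for cp in range(start, end + 1)
            if unicodedata.category(chr(cp)) == "Lo"]


def is_ethiopic_numeral(ch: str) -> bool:
    """Return True for Ethiopic digits and numbers (U+1369..U+137C)."""
    return 0x1369 <= ord(ch) <= 0x137C


if __name__ == "__main__":
    labels = ethiopic_letters()
    print("letters in label set:", len(labels))
    print("numeral excluded:", "፩" not in labels)
```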
Methodology
Converting images of Ge'ez characters into electronic form, usually referred to
as digitization, is carried out in several steps.
Scanning a document and representing the scanned image for further
processing is called the preprocessing or imaging phase.
Manipulating the scanned image of a document to produce
searchable text is called the OCR processing stage.
1. The Imaging Stage
The imaging process involves scanning the document and storing it as an image. The
most popular image format used for this purpose is the Tagged Image File Format
(TIFF).
The scan resolution (number of dots per inch, dpi) largely determines the accuracy
of the OCR process.
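A short worked calculation shows why resolution matters: it fixes how many pixels are available to describe each character. The page size (A4) and font size (12 pt) below are illustrative assumptions, not figures from this proposal.

```python
from typing import Tuple

MM_PER_INCH = 25.4
POINTS_PER_INCH = 72


def page_size_pixels(width_mm: float, height_mm: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions of a page scanned at the given resolution."""
    return (round(width_mm * dpi / MM_PER_INCH),
            round(height_mm * dpi / MM_PER_INCH))


def glyph_height_pixels(font_size_pt: float, dpi: int) -> float:
    """Approximate pixel height of a font's em box at the given resolution."""
    return font_size_pt * dpi / POINTS_PER_INCH


if __name__ == "__main__":
    for dpi in (100, 200, 300):
        w, h = page_size_pixels(210, 297, dpi)  # A4 page, an assumed size
        print(f"{dpi} dpi: {w} x {h} px, 12 pt glyph is about "
              f"{glyph_height_pixels(12, dpi):.1f} px tall")
```

At 100 dpi a 12 pt character is under 17 pixels tall, which leaves little detail to separate visually similar Ge'ez characters. At 300 dpi the same character is about 50 pixels tall.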
2. The OCR Process
The major steps of the OCR processing stage are shown below.
3. Distinguishing between text and images - Segmentation
In this step, the text blocks and image blocks of the scanned image are
identified. The boundaries of each block are analyzed in order to recognize the
text.
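The proposal does not name a segmentation technique. One common choice for separating text lines is a horizontal projection profile, sketched below on a binarized image stored as a list of rows (1 for ink, 0 for background). The representation and the function name are assumptions made for illustration, and the sketch covers line separation only, not text/image block classification.

```python
from typing import List, Tuple

Image = List[List[int]]  # 1 = ink (dark pixel), 0 = background


def find_text_lines(image: Image) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges, bottom exclusive, that contain ink."""
    lines = []
    start = None
    for y, row in enumerate(image):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                    # a text line begins on this row
        elif not has_ink and start is not None:
            lines.append((start, y))     # the line ended on the previous row
            start = None
    if start is not None:                # ink runs through the last row
        lines.append((start, len(image)))
    return lines


if __name__ == "__main__":
    page = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 1, 0],
    ]
    print(find_text_lines(page))
```

The same idea applied column by column within a line gives a first cut at separating individual characters.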
4. Character recognition - Feature Extraction
This step involves recognizing a character using a method known as feature
extraction. OCR tools store rules about the characters of a given script through a
learning process. A character is then identified by analyzing
its shape and comparing its features against the set of rules stored in the OCR
engine that distinguish each character.
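To make the idea concrete, the sketch below uses zoning, one simple feature-extraction scheme: the character image is divided into a grid and the ink density of each cell becomes a feature. A character is then assigned to the stored template with the nearest feature vector. Zoning and nearest-template matching are assumptions chosen for brevity, not the method fixed by this proposal, and the labels are placeholders rather than real Ge'ez characters.

```python
from typing import Dict, List

Glyph = List[List[int]]  # 1 = ink, 0 = background


def zoning_features(glyph: Glyph, zones: int = 2) -> List[float]:
    """Split the glyph into zones x zones cells; return ink density per cell."""
    h, w = len(glyph), len(glyph[0])
    feats = []
    for zy in range(zones):
        for zx in range(zones):
            y0, y1 = zy * h // zones, (zy + 1) * h // zones
            x0, x1 = zx * w // zones, (zx + 1) * w // zones
            cell = [glyph[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            feats.append(sum(cell) / len(cell) if cell else 0.0)
    return feats


def classify(glyph: Glyph, templates: Dict[str, List[float]]) -> str:
    """Return the label whose stored features are nearest (squared Euclidean)."""
    feats = zoning_features(glyph)
    return min(templates,
               key=lambda label: sum((a - b) ** 2
                                     for a, b in zip(feats, templates[label])))


if __name__ == "__main__":
    top_bar = [[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]
    left_bar = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0]]
    # Placeholder labels; a real system would store one entry per character.
    templates = {"class_1": zoning_features(top_bar),
                 "class_2": zoning_features(left_bar)}
    noisy = [[1, 1, 1, 0], [1, 1, 1, 1], [0, 0, 0, 0], [0, 1, 0, 0]]
    print(classify(noisy, templates))
```

In the proposed system the stored templates would be replaced by a trained neural network, but the input to that network could be a feature vector of this kind.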
5. Recognition of Words
Following character recognition, words are identified
by comparing each string of characters against the words of an existing Ge'ez
dictionary.
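The dictionary comparison can be sketched with Python's standard `difflib` module, which ranks candidates by string similarity. The three-word lexicon below is hypothetical and contains only words quoted in this proposal, and the 0.6 cutoff is an arbitrary illustrative value.

```python
import difflib
from typing import Optional, Sequence

# Hypothetical lexicon built from words quoted in this proposal.
LEXICON = ("ግዕዝ", "ፊደል", "አቡጊዳ")


def match_word(candidate: str,
               lexicon: Sequence[str] = LEXICON,
               cutoff: float = 0.6) -> Optional[str]:
    """Return the lexicon entry closest to the OCR output, or None."""
    if candidate in lexicon:
        return candidate  # exact match, nothing to correct
    close = difflib.get_close_matches(candidate, lexicon, n=1, cutoff=cutoff)
    return close[0] if close else None


if __name__ == "__main__":
    # The last character is misrecognized (ለ instead of ል).
    print(match_word("ፊደለ"))
```

Words with no close match can be flagged for the manual correction step that follows.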
6. Correction of Unrecognized Characters – Error Correction
In this step, the user is allowed to provide corrections to unrecognized characters.
7. Output Formatting
The final step involves storing the output in one of the industry-standard formats,
such as PDF, Word, or plain Unicode text.
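For the plain-text case, the output step amounts to writing the recognized string with an explicit Unicode encoding. A minimal sketch, assuming UTF-8 as the encoding and a hypothetical file name:

```python
from pathlib import Path


def save_text(text: str, path: str) -> int:
    """Write recognized text as UTF-8 and return the number of bytes written."""
    data = text.encode("utf-8")
    Path(path).write_bytes(data)
    return len(data)


if __name__ == "__main__":
    # "recognized.txt" is a hypothetical output file name.
    print(save_text("ግዕዝ", "recognized.txt"), "bytes written")
```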
Project schedule plan
Project Action                        Required Time
Collection of data                    One week
Collection of system requirements     One week
Methodology                           Two weeks
System design and implementation      1 1/2 months
Results and observation               One week
References
[1] Bulakh, Maria; Kogan, Leonid (2010). "The Genealogical Position of Tigre and
the Problem of North Ethio-Semitic Unity". Zeitschrift der Deutschen
Morgenländischen Gesellschaft 160 (2): 273–302.
[2] Connell, Dan; Killion, Tom (2010). Historical Dictionary of Eritrea (2nd,
illustrated ed.). Scarecrow Press. p. 508. ISBN 978-0-8108-7505-0.
[3] Haarmann, Harald (2002). Lexikon der untergegangenen Sprachen [Lexicon of
extinct languages] (in German) (2nd ed.). C. H. Beck. p. 76. ISBN 978-3-406-47596-2.
[4] Amsalu Aklilu. ጥሩ የአማርኛ ድርሰት እንዴት ያለ ነው! Kuraz Publishing Agency. p. 42.
[5] http://catalog.ihsn.org/index.php/catalog/3583/download/50086
[6] Wikipedia, "Amharic". https://en.wikipedia.org/wiki/Amharic

70 POWER PLANT IAE V2500 technical training70 POWER PLANT IAE V2500 technical training
70 POWER PLANT IAE V2500 technical training
 
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
 
input buffering in lexical analysis in CD
input buffering in lexical analysis in CDinput buffering in lexical analysis in CD
input buffering in lexical analysis in CD
 
ASME-B31.4-2019-estandar para diseño de ductos
ASME-B31.4-2019-estandar para diseño de ductosASME-B31.4-2019-estandar para diseño de ductos
ASME-B31.4-2019-estandar para diseño de ductos
 
Machine Learning 5G Federated Learning.pdf
Machine Learning 5G Federated Learning.pdfMachine Learning 5G Federated Learning.pdf
Machine Learning 5G Federated Learning.pdf
 
Guardians of E-Commerce: Harnessing NLP and Machine Learning Approaches for A...
Guardians of E-Commerce: Harnessing NLP and Machine Learning Approaches for A...Guardians of E-Commerce: Harnessing NLP and Machine Learning Approaches for A...
Guardians of E-Commerce: Harnessing NLP and Machine Learning Approaches for A...
 
Artificial Intelligence in Power System overview
Artificial Intelligence in Power System overviewArtificial Intelligence in Power System overview
Artificial Intelligence in Power System overview
 
TEST CASE GENERATION GENERATION BLOCK BOX APPROACH
TEST CASE GENERATION GENERATION BLOCK BOX APPROACHTEST CASE GENERATION GENERATION BLOCK BOX APPROACH
TEST CASE GENERATION GENERATION BLOCK BOX APPROACH
 
Indian Tradition, Culture & Societies.pdf
Indian Tradition, Culture & Societies.pdfIndian Tradition, Culture & Societies.pdf
Indian Tradition, Culture & Societies.pdf
 
Introduction to Machine Learning Part1.pptx
Introduction to Machine Learning Part1.pptxIntroduction to Machine Learning Part1.pptx
Introduction to Machine Learning Part1.pptx
 
Javier_Fernandez_CARS_workshop_presentation.pptx
Javier_Fernandez_CARS_workshop_presentation.pptxJavier_Fernandez_CARS_workshop_presentation.pptx
Javier_Fernandez_CARS_workshop_presentation.pptx
 
Structural Integrity Assessment Standards in Nigeria by Engr Nimot Muili
Structural Integrity Assessment Standards in Nigeria by Engr Nimot MuiliStructural Integrity Assessment Standards in Nigeria by Engr Nimot Muili
Structural Integrity Assessment Standards in Nigeria by Engr Nimot Muili
 
Submerged Combustion, Explosion Flame Combustion, Pulsating Combustion, and E...
Submerged Combustion, Explosion Flame Combustion, Pulsating Combustion, and E...Submerged Combustion, Explosion Flame Combustion, Pulsating Combustion, and E...
Submerged Combustion, Explosion Flame Combustion, Pulsating Combustion, and E...
 
The Satellite applications in telecommunication
The Satellite applications in telecommunicationThe Satellite applications in telecommunication
The Satellite applications in telecommunication
 
Curve setting (Basic Mine Surveying)_MI10412MI.pptx
Curve setting (Basic Mine Surveying)_MI10412MI.pptxCurve setting (Basic Mine Surveying)_MI10412MI.pptx
Curve setting (Basic Mine Surveying)_MI10412MI.pptx
 
Versatile Engineering Construction Firms
Versatile Engineering Construction FirmsVersatile Engineering Construction Firms
Versatile Engineering Construction Firms
 
22CYT12 & Chemistry for Computer Systems_Unit-II-Corrosion & its Control Meth...
22CYT12 & Chemistry for Computer Systems_Unit-II-Corrosion & its Control Meth...22CYT12 & Chemistry for Computer Systems_Unit-II-Corrosion & its Control Meth...
22CYT12 & Chemistry for Computer Systems_Unit-II-Corrosion & its Control Meth...
 
Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...
Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...
Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...
 
Substation Automation SCADA and Gateway Solutions by BRH
Substation Automation SCADA and Gateway Solutions by BRHSubstation Automation SCADA and Gateway Solutions by BRH
Substation Automation SCADA and Gateway Solutions by BRH
 
AntColonyOptimizationManetNetworkAODV.pptx
AntColonyOptimizationManetNetworkAODV.pptxAntColonyOptimizationManetNetworkAODV.pptx
AntColonyOptimizationManetNetworkAODV.pptx
 

Optical character recognition for Ge'ez characters

  • 1. Mekelle Institute of Technology
    Project proposal for 2016: OCR Algorithm for Ge'ez Characters
    Project members (department):
    1. Awet Haileslassie (CSE)
    2. Hadush Hailu (CSE)
    3. Mulu Hailemariam (CSE)
    4. Negash Desalegn (CSE)
    Submission date: March 28, 2016
  • 2. Abstract
    It is no exaggeration to say that Ethiopia's intellectual property is hardly digitized; it is stored on paper, whether as centuries-old parchment in monasteries or in file cabinets in various regional and federal offices. Digitizing this plethora of documents by hand is not feasible. To this end, building a system that identifies and digitizes documents written in Ethiopian characters is a logical move. Such a system would play a pivotal role in preserving the country's intellectual property and making it easily accessible to posterity.
  • 3. Contents
    Introduction
    Background
    Problem Statement
    Proposed System
    Objective
      General Objective
      Specific Objectives
    Limitation
    Methodology
      1. The Imaging Stage
      2. The OCR Process
      3. Distinguishing between text and images - Segmentation
      4. Character recognition - Feature Extraction
      5. Recognition of Words
      6. Correction of Unrecognized Characters - Error Correction
      7. Output Formatting
    Project schedule plan
    References
  • 4. Introduction
    Optical Character Recognition (OCR) is a technology used to translate scanned images of text into computer-editable and searchable text. OCR software is an analytical artificial intelligence system that considers only sequences of characters rather than whole words or phrases, and it does not cross-validate data during the recognition process. Based on the analysis of sequential lines and curves, OCR makes "best guesses" at characters, using database look-up tables to closely associate or match the strings of characters that form words. For such a system to recognize hand-printed or machine-printed forms effectively, words must be separated into individual characters. That is why most administrative forms require people either to hand print into neatly spaced boxes or to use combs (tick marks) at the bottom of input lines, which force spaces between the letters entered on a form. Without combs or boxes, conventional technologies reject fields when people do not follow the structure while filling out forms, resulting in significant administrative overhead and cost for forms-processing organizations (OCRopus, 2009).
    OCR is a process that allows printed text (typewritten, printed out, or handwritten) to be recognized optically and converted into machine-readable code that a computer can accept for further processing (Genovese, 1970).
    OCR systems provide a tremendous opportunity for handling processes that are repetitive, boring, labor-intensive, error-prone, and time-consuming for human beings. Among others, the following are the major advantages of OCR technology:
    • It can be used to scan and preserve historical documents.
    • It can be used to scan data entry forms faster and with fewer errors.
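The requirement above that words be separated into individual characters can be illustrated with a vertical projection profile. This is a minimal sketch, not the proposal's method: the function name `segment_columns`, the image format (nested lists of 0/1 pixels), and the assumption that neighbouring characters are separated by at least one blank column are all hypothetical choices made for illustration.

```python
def segment_columns(image):
    """Return (start, end) column spans of characters in a binary text-line image.

    `image` is a list of rows; each pixel is 1 (ink) or 0 (background).
    Assumption: characters are separated by at least one blank column.
    """
    width = len(image[0])
    # Vertical projection profile: number of ink pixels in each column.
    profile = [sum(row[x] for row in image) for x in range(width)]

    spans, start = [], None
    for x, ink in enumerate(profile):
        if ink and start is None:
            start = x                    # a character begins here
        elif not ink and start is not None:
            spans.append((start, x))     # a blank column closes the character
            start = None
    if start is not None:                # a character touches the right edge
        spans.append((start, width))
    return spans


# Toy text line with three "characters" separated by blank columns.
line = [
    [1, 1, 0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 1, 0, 1, 1],
]
print(segment_columns(line))
```

Touching or overlapping characters, which are likely in old handwritten documents, would defeat this simple rule and would need a more robust segmentation step.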
  • 5. At present, the recognition of Latin-based characters from well-conditioned documents can be considered a relatively feasible technology. The processing of non-Latin scripts, on the other hand, is still a subject of active research. Ethiopic script-based OCR processing is currently among the least developed ICT disciplines in the country. Developments in this area are mainly limited to preliminary research activities undertaken at different institutions of higher education, such as the former School of Information Science for Africa (SISA). Such efforts are undertaken in an uncoordinated way.
    Background
    Ge'ez (ግዕዝ, also referred to by some as "Ethiopic") is an ancient South Semitic language that originated in the northern region of Ethiopia and Eritrea in the Horn of Africa. It later became the official language of the Kingdom of Aksum and the Ethiopian imperial court. Today, Ge'ez remains only as the main language used in the liturgy of the Ethiopian Orthodox Tewahedo Church, the Eritrean Orthodox Tewahedo Church, the Ethiopian Catholic Church, and the Beta Israel Jewish community. For sermons, however, Amharic (the main lingua franca of modern Ethiopia) or other local languages may be used in Ethiopia, and Tigrigna may be used in Eritrea and in the Tigray Region of Ethiopia. Tigrigna and Tigre are closely related to Ge'ez, with at least four different configurations proposed.[1] Some linguists do not believe that Ge'ez is the common ancestor of modern Ethiopian languages, but rather that Ge'ez separated early on from some hypothetical, completely unattested language,[2] and can thus be seen as an extinct sister language of Tigre and Tigrinya.[3] The foremost Ethiopian experts, such as Amsalu Aklilu, point to the vast proportion of inherited nouns that are unchanged, and even spelled identically, in both Ge'ez and Amharic (and to a lesser degree, Tigrinya).[4]
  • 6. Amharic is a Semitic language spoken in Ethiopia. It is the second-most spoken Semitic language in the world, after Arabic, and the official working language of the Federal Democratic Republic of Ethiopia. Amharic is also the official or working language of several of the states within the federal system. It has been the working language of government, the military, and the Ethiopian Orthodox Tewahedo Church throughout medieval and modern times. The 2007 census counted nearly 22 million native speakers in Ethiopia.[5] Outside Ethiopia, Amharic is the language of some 2.7 million emigrants. It is written left to right using Amharic Fidel (ፊደል), which grew out of the Ge'ez abugida. In Ethiopian Semitic languages this is called ፊደል fidel ("writing system", "letter", or "character") and አቡጊዳ abugida (from the first four Ethiopic letters, which gave rise to the modern linguistic term abugida).[6]
    Problem Statement
    What should you do if you want to convert scanned papers, books, and documents into electronic files such as Word documents, PDFs, or plain text? You need optical character recognition, usually abbreviated OCR, which can translate scanned images of handwritten, typewritten, or printed text into machine-encoded text. The motivation for this project is the absence of any production-quality OCR software for Ge'ez characters, whether developed locally or internationally. The script is not supported by the ASCII standard and so cannot be used on a computer through ASCII.
  • 7. The existing fonts share some characters, while others are difficult to convert from one font to another. This shows that there is no standardized font in the country. For lack of such a standard font, the country lacks various services and applications, OCR among them.
    Proposed System
    The proposed system is an advanced OCR algorithm that uses both OpenCV (Open Source Computer Vision) and an ANN (Artificial Neural Network) to convert scanned images of handwritten, typewritten, or printed text into machine-encoded text. The open-source Tesseract engine, which is considered the most accurate free OCR engine in existence, will be used as the basis for converting recognized input into text or information that can be understood or edited on computers.
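The Problem Statement's point that ASCII cannot represent the script can be made concrete with a small check. This is only an illustrative sketch: it assumes the recognized text would be stored as Unicode, and it treats the basic Ethiopic block (U+1200 to U+137F) as the target range. The helper names `is_ethiopic` and `fits_ascii` are hypothetical, not part of the proposal.

```python
# Basic Ethiopic block in Unicode (assumed target range for this sketch).
ETHIOPIC_START, ETHIOPIC_END = 0x1200, 0x137F


def is_ethiopic(ch):
    """True if the character lies in the basic Ethiopic Unicode block."""
    return ETHIOPIC_START <= ord(ch) <= ETHIOPIC_END


def fits_ascii(text):
    """True if every character of the text can be encoded as ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


word = "ግዕዝ"  # "Ge'ez" written in Ethiopic script, as quoted in the Background
print([f"U+{ord(ch):04X}" for ch in word])   # code point of each character
print(all(is_ethiopic(ch) for ch in word))   # all inside the Ethiopic block
print(fits_ascii(word))                      # ASCII cannot hold the word
print(len(word.encode("utf-8")))             # bytes needed when stored as UTF-8
```

Storing OCR output as Unicode text would sidestep the font-to-font conversion problem described above, since the characters would be identified by code point rather than by a font-specific glyph position.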
  • 8. Objective
    In Ethiopia there is not a single quality OCR application for Ge'ez characters. In general, OCR is a well-solved problem and a mature technology for Latin scripts, but for non-Latin scripts, such as some Asian and African scripts, it is still an unsolved problem and an active research area. Ge'ez is a non-Latin script. Compared with Latin scripts it has far more characters: about 380, of which about 310 are in common, frequent use.
    General Objective
    To build a system that identifies and digitizes documents written in Ethiopian characters. This system would play a pivotal role in preserving the country's intellectual property and making it easily accessible to posterity.
    Specific Objectives
    The project would entail many tasks, including:
    1. Machine learning: training the system to recognize Ethiopic fonts (doing this with high fidelity, especially for centuries-old handwritten documents, would be an interesting challenge).
    2. Parallelizing the developed OCR algorithm to minimize the time the task takes.
  • 9. 3. Once a document is scanned, comparing the words in it against a list of all known Ge'ez, Amharic, Tigrigna, and other words for error identification and correction.
    Limitation
    The Ethiopic script has its own symbols for numbers, but in most writing it uses the Arabic numerals 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. For this reason, Ethiopic numerals are not included in this project.
    Methodology
    The process of converting Ge'ez character images into electronic form, usually referred to as digitization, is undertaken in several steps. Scanning a document and representing the scanned image for further processing is called the preprocessing or imaging phase. Manipulating the scanned image of a document to produce searchable text is called the OCR processing stage.
    1. The Imaging Stage
    The imaging process involves scanning the document and storing it as an image. The most popular image format used for this purpose is the Tagged Image File Format (TIFF).
  • 10. The resolution (the number of dots per inch, dpi) determines the accuracy rate of the OCR process.
    2. The OCR Process
    The major steps of the OCR processing stage are listed below.
    3. Distinguishing between text and images - Segmentation
    In this step, the text and image blocks of the scanned image are identified. The boundaries of each image are analyzed in order to recognize the text.
    4. Character recognition - Feature Extraction
    This step involves recognizing a character using a method known as feature extraction. OCR tools store rules about the characters of a given script through a method known as the learning process. A character is then identified by analyzing its shape and comparing its features against the set of rules, stored in the OCR engine, that distinguishes each character.
    5. Recognition of Words
    Following character recognition, words are identified by comparing strings of characters against the words of an existing Ge'ez dictionary.
    6. Correction of Unrecognized Characters - Error Correction
    In this step, the user is allowed to provide corrections for unrecognized characters.
    7. Output Formatting
    The final step involves storing the output in one of the industry-standard formats, such as PDF, Word, or plain Unicode text.
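Steps 4 and 5 above can be sketched together in a few lines. This is a toy illustration under loud assumptions, not the proposal's ANN or the Tesseract engine: the 3x3 bitmaps in `TEMPLATES` are invented stand-ins for learned character rules (they do not resemble the real glyphs), Hamming distance stands in for feature comparison, the word list is a hypothetical two-entry dictionary, and Python's standard `difflib.get_close_matches` stands in for dictionary look-up.

```python
from difflib import get_close_matches

# Hypothetical 3x3 binary templates standing in for learned character rules.
TEMPLATES = {
    "ሰ": ("101", "111", "101"),
    "ላ": ("010", "101", "100"),
    "ም": ("111", "010", "110"),
    "ሙ": ("111", "010", "111"),
}


def hamming(a, b):
    """Number of pixels that differ between two equally sized bitmaps."""
    return sum(x != y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def recognize_char(glyph):
    """Step 4: pick the template whose pixels differ least from the glyph."""
    return min(TEMPLATES, key=lambda ch: hamming(TEMPLATES[ch], glyph))


def recognize_word(glyphs, dictionary):
    """Step 5: join recognized characters, then look up the closest known word."""
    raw = "".join(recognize_char(g) for g in glyphs)
    matches = get_close_matches(raw, dictionary, n=1, cutoff=0.5)
    return raw, (matches[0] if matches else None)


# Toy word list (assumption); a real system would load a full dictionary.
WORDS = ["ሰላም", "ፊደል"]

# Three scanned glyphs: the first is slightly noisy, the third was smudged
# so that it matches the wrong template exactly.
scanned = [
    ("101", "111", "100"),
    ("010", "101", "100"),
    ("111", "010", "111"),
]
raw, corrected = recognize_word(scanned, WORDS)
print(raw, corrected)
```

When no dictionary entry is close enough, `recognize_word` returns `None` for the correction, which is the case step 6 would hand to the user.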
  • 11. Project schedule plan
    Project Action                        Required Time
    Collection of data                    One week
    Collection of system requirements     One week
    Methodology                           Two weeks
    System design and implementation      1 1/2 months
    Results and observation               One week
    References
    [1] Bulakh, Maria; Kogan, Leonid (2010). "The Genealogical Position of Tigre and the Problem of North Ethio-Semitic Unity". Zeitschrift der Deutschen Morgenländischen Gesellschaft 160 (2): 273–302.
    [2] Connell, Dan; Killion, Tom (2010). Historical Dictionary of Eritrea (2nd, illustrated ed.). Scarecrow Press. p. 508. ISBN 978-0-8108-7505-0.
    [3] Haarmann, Harald (2002). Lexikon der untergegangenen Sprachen [Lexicon of extinct languages] (in German) (2nd ed.). C. H. Beck. p. 76. ISBN 978-3-406-47596-2.
    [4] Amsalu Aklilu. ጥሩ የአማርኛ ድርሰት እንዴት ያለ ነው! Kuraz Publishing Agency. p. 42.
    [5] http://catalog.ihsn.org/index.php/catalog/3583/download/50086
    [6] Wikipedia. "Amharic". https://en.wikipedia.org/wiki/Amharic