1. Optical character recognition of handwritten Arabic using hidden Markov models

Mohannad M. Aulama (1), Asem M. Natsheh (1), Gheith A. Abandah (1), Mohammad M. Olama (2)

(1) Computer Engineering Department, University of Jordan
(2) Computational Sciences & Engineering Division, Oak Ridge National Laboratory
2. Outline
• Introduction
• Approach
• Optical Features of Arabic Characters
• Encoding Arabic Language Structure
• Constructing the HMM
• Recognition Algorithm
• Results
3. Why handwritten Arabic OCR?
• After the Latin alphabet, the Arabic alphabet is the second-most widely used alphabet in the world [1].
• The Arabic alphabet is also used to write other languages such as Persian (Farsi), Kurdish, Urdu, etc.
• Comparatively little research has addressed Arabic OCR.
• Handwritten OCR has a wide range of applications: invoice and shipping receipt processing, subscription collection, bank check processing, postal address recognition, and other mail applications.
[1] Encyclopedia Britannica
4. Characteristics of the Arabic language
– Arabic is cursive
العربية ا ل ع ر ب ي ة
– Arabic letter shapes are context dependent
ـه ه هـ ـهـ
– Variability of letter shapes (in handwriting)
5. Overview of Arabic handwritten text
[Figure: a handwritten text line decomposed into its structural levels.]
Sentence → Word → Sub-word → Letter
6. Character information
[Figure: an example sub-word annotated with the two kinds of information used per character.]
• Language structure: letter "ل" is stand-alone; letter "ا" ends the sub-word; letter "ش" is followed by letter "ع".
• Optical properties: number of dots, concavity, curves.
7. Approach
• Both the characters' optical properties and the language structure are considered in the recognition.
• An HMM can efficiently encode sequential information, and is therefore used in this work to encode the character information (optical properties + language structure).
• The recognition algorithm is based on the Viterbi algorithm. It outputs the most probable characters, representing the "recognized sub-word".
8. Optical features of Arabic characters
• A sample of 48 handwritten instances of each Arabic character, written by different individuals, was collected. An example for the Arabic letter dal is shown.
• Features extracted (see the sketch after this list):
– Width-to-length ratio: longer, wider, or equal.
– Density type: upper, lower, or equal.
– Vertical crosses: one, two, three, or four.
– Horizontal crosses: one, two, three, or four.
– Concavity type: up, down, to the left, or to the right.
– Number of dots: zero, one, two, or three.
– Location of dots: up, down, or middle.
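As a sketch, the seven features could be encoded in a small C++ struct like the following; all type and field names are illustrative assumptions, not taken from the paper:

    #include <cstdint>

    // Illustrative encoding of the seven optical features listed above.
    // The paper does not give its internal representation; DotLocation::None
    // is added here for characters with zero dots.
    enum class Ratio : std::uint8_t { Longer, Wider, Equal };
    enum class Density : std::uint8_t { Upper, Lower, Equal };
    enum class Concavity : std::uint8_t { Up, Down, Left, Right };
    enum class DotLocation : std::uint8_t { Up, Down, Middle, None };

    struct OpticalFeatures {
        Ratio widthToLengthRatio;        // longer, wider, or equal
        Density densityType;             // upper, lower, or equal
        std::uint8_t verticalCrosses;    // one to four
        std::uint8_t horizontalCrosses;  // one to four
        Concavity concavityType;         // up, down, to the left, or to the right
        std::uint8_t numDots;            // zero to three
        DotLocation dotLocation;         // up, down, or middle
    };

Multiplying the category counts (3 × 3 × 4 × 4 × 4 × 4 × 3) reproduces the 6912 feature combinations cited on the clustering slide.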
9. Optical features of Arabic characters
[Figure: the example character annotated with its extracted features.]
– Width-to-length ratio: wider
– Density type: lower
– Vertical crosses: three
– Horizontal crosses: one
– Concavity type: up
– Number of dots: three
– Location of dots: up
10. Clustering of character optical features
• The features defined above admit 6912 different possible combinations, so clustering is needed.
• Feature vectors are clustered into 26 homogeneous clusters using the K-means algorithm; a sketch of the resulting assignment step follows.
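The mapping step can be sketched as nearest-centroid assignment (the assignment half of K-means), assuming the features are encoded numerically and the 26 centroids were learned in a prior K-means training pass; this is illustrative only, not the paper's code:

    #include <array>
    #include <cstddef>
    #include <limits>

    constexpr std::size_t kNumFeatures = 7;   // the seven optical features
    constexpr std::size_t kNumClusters = 26;  // homogeneous clusters from K-means

    using FeatureVec = std::array<double, kNumFeatures>;

    // Map a numerically encoded feature vector to the index of the
    // nearest of the 26 cluster centroids.
    std::size_t assignCluster(const FeatureVec& x,
                              const std::array<FeatureVec, kNumClusters>& centroids) {
        std::size_t best = 0;
        double bestDist = std::numeric_limits<double>::infinity();
        for (std::size_t c = 0; c < kNumClusters; ++c) {
            double d = 0.0;
            for (std::size_t f = 0; f < kNumFeatures; ++f) {
                const double diff = x[f] - centroids[c][f];
                d += diff * diff;  // squared Euclidean distance
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }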
11. Encoding Arabic language structure
• A large corpus was used to extract the structure of the Arabic language.
• Results (see the counting sketch after this list):
– Initial state probability: probability that letter x appears as the first letter in a sub-word.
– Final state probability: probability that letter x appears as the last letter in a sub-word.
– Stand-alone state probability: probability that letter x appears as a single-letter sub-word.
– Transition probability: probability that letter x is followed by letter y in a sub-word.
– Frequency of letters: probability that letter x appears anywhere in the corpus.
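A minimal sketch of how these five statistics could be estimated by counting, assuming the corpus is available as sub-words of letter indices; identifiers are illustrative, not from the paper, and a non-empty corpus is assumed:

    #include <cstddef>
    #include <map>
    #include <utility>
    #include <vector>

    // Corpus statistics for the five language-structure probabilities.
    struct LanguageStats {
        std::map<int, double> initial, finalP, standAlone, frequency;
        std::map<std::pair<int, int>, double> transition;
    };

    LanguageStats estimateStats(const std::vector<std::vector<int>>& subWords) {
        LanguageStats s;
        double nSub = subWords.size(), nLetters = 0;
        std::map<int, double> fromCount;      // transitions leaving each letter
        for (const auto& w : subWords) {
            s.initial[w.front()] += 1;        // first letter of the sub-word
            s.finalP[w.back()] += 1;          // last letter of the sub-word
            if (w.size() == 1) s.standAlone[w.front()] += 1;
            for (std::size_t i = 0; i < w.size(); ++i) {
                s.frequency[w[i]] += 1;
                ++nLetters;
                if (i + 1 < w.size()) {       // letter x followed by letter y
                    s.transition[{w[i], w[i + 1]}] += 1;
                    fromCount[w[i]] += 1;
                }
            }
        }
        for (auto& [x, p] : s.initial) p /= nSub;
        for (auto& [x, p] : s.finalP) p /= nSub;
        for (auto& [x, p] : s.standAlone) p /= nSub;
        for (auto& [x, p] : s.frequency) p /= nLetters;
        for (auto& [xy, p] : s.transition) p /= fromCount[xy.first];
        return s;
    }

Transition counts are normalized per source letter, matching the estimator given later for the state transition matrix.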
12. Encoding Arabic language structure
Examples from the corpus:
• Frequency of letter "ل" in the Arabic corpus: 12.86%
• Initial state probability of letter "ش": 0.008975
• Stand-alone state probability of letter "ل": 0.020877
• Final state probability of letter "ا": 0.272702
• Transition probability of letter "ش" to letter "ع": 0.082786
13. Encoding Arabic language structure

Initial state probabilities:
Letter   Probability
ل        0.290859
م        0.096403
ي        0.076956
ف        0.0588
ب        0.057266
ن        0.053009

Final state probabilities:
Letter   Probability
ا        0.272702
ر        0.099381
و        0.094797
ة        0.077327
د        0.064174
ن        0.058851

Stand-alone state probabilities:
Letter   Probability
ا        0.364433
و        0.1559
أ        0.067906
ن        0.06148
ر        0.058728
د        0.043502

Frequency of Arabic letters in the corpus:
Letter   Occurrences   Percentage
ا        47790         15.06%
ل        40796         12.86%
ي        20893         6.58%
و        19669         6.20%
م        19497         6.14%
ن        18213         5.74%

Transition probabilities of Arabic letters in the corpus (excerpt; row letter followed by column letter):
        ء         آ         أ         ؤ         إ         ئ
ء       0         0         0         0         0         0
آ       0         0         0         0         0         0
أ       0         0         0         0         0         0
ؤ       0         0         0         0         0         0
إ       0         0         0         0         0         0
ئ       0         0         0         0         0         0
ا       0         0         0         0         0         0
ب       0.000134  0.000402  0.023361  0.001208  0.002953  0.001208
ة       0         0         0         0         0         0
ت       0         0.000223  0.013989  0.003021  0         0.001119
ث       0         0         0         0         0         0
ج       0         0.000327  0.00295   0         0         0.005901
14. What is an HMM?
• An HMM is a doubly stochastic process with an underlying Markov process that is not directly observable, but can only be observed through another set of stochastic processes that produce the sequence of observed symbols.
• Elements of an HMM:
– States: S = {S1, S2, S3, ..., SN}
– Observations: V = {V1, V2, V3, ..., VM}
– State transition probabilities: A = {a_ij}, where a_ij = Pr{q_(t+1) = S_j | q_t = S_i}, 1 ≤ i, j ≤ N
– Observation probabilities: B = {b_jk}, where b_jk = Pr{V_k at t | q_t = S_j}, 1 ≤ j ≤ N, 1 ≤ k ≤ M
– Initial state probabilities: π = {π_i}, where π_i = Pr{q_1 = S_i}, 1 ≤ i ≤ N
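As a minimal illustrative sketch (not the paper's code), these elements map naturally onto a small C++ container; sizes follow the slides (N = 48 Arabic letters, M = 26 feature clusters), and SA is the stand-alone vector introduced on the next slide:

    #include <vector>

    // Container for the HMM elements defined above. All names are
    // illustrative assumptions.
    struct HMM {
        int N = 48;                          // number of states (letters)
        int M = 26;                          // number of observation clusters
        std::vector<std::vector<double>> A;  // N x N state transition matrix
        std::vector<std::vector<double>> B;  // N x M observation (confusion) matrix
        std::vector<double> pi;              // length-N initial state probabilities
        std::vector<double> SA;              // length-N stand-alone probabilities (slide 15)
    };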
15. Constructing the HMM
• Stand-alone character probability (SA):
– This is a one-dimensional vector where each position corresponds to one of the states (letters), and each value is the probability of that state forming a single-letter sub-word. It is estimated as:
SA[state] = (number of single-letter sub-words composed of this state) / (total number of sub-words)
• Initial state probability vector (π):
– This is a one-dimensional vector where each position corresponds to one of the states (letters), and each value is the probability of that state starting the sub-word being recognized. It is estimated as:
π[state] = (number of sub-words starting with this state) / (total number of sub-words)
16. Constructing the HMM
• State transition matrix (A):
– Denoted A_(N×N) = {a_ij}, where 1 ≤ i, j ≤ N and a_ij is the probability of a transition from state S_i to state S_j in the sub-word being recognized. The entries a_ij are estimated as:
a_ij = (number of transitions from S_i to S_j) / (total number of transitions from S_i)
• Confusion (emission) matrix (B):
– Denoted B_(N×M) = {b_jk}, where 1 ≤ j ≤ N, 1 ≤ k ≤ M; N is the number of states (letters) and M is the number of possible feature vectors (extracted feature clusters) that can be emitted from any state. The entries b_jk are estimated as:
b_jk = (number of times observation V_k is output by state S_j) / (total number of repetitions of character S_j)
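A hedged sketch of how the B entries could be filled from the labeled handwriting samples; the paper reports only the formula, so the names and data layout here are assumptions:

    #include <utility>
    #include <vector>

    // Fill the N x M emission (confusion) matrix from labeled samples:
    // each sample pairs a letter index j with the feature cluster k its
    // extracted features fell into.
    std::vector<std::vector<double>>
    estimateB(const std::vector<std::pair<int, int>>& samples, int N, int M) {
        std::vector<std::vector<double>> B(N, std::vector<double>(M, 0.0));
        std::vector<double> total(N, 0.0);   // repetitions of each character
        for (auto [j, k] : samples) {
            B[j][k] += 1;
            total[j] += 1;
        }
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < M; ++k)
                if (total[j] > 0) B[j][k] /= total[j];
        return B;
    }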
17. Elements of the HMM

Stand-alone character probability vector (SA):
Index:  ... ع    ل    ش    ق ...
        [ ... 0.03 0.09 0.1  0.03 ... ]

Initial state probability vector (π):
Index:  ... ع    ل    ش    ق ...
        [ ... 0.02 0.06 0.15 0.05 ... ]

State transition matrix (A):
Index:  ... ع    ل    ش    ق ...
ع     [ ... 0.03 0.09 0.1  0.03 ... ]
ل     [ ... 0.04 0.07 0.03 0.02 ... ]
ش     [ ... 0.07 0.02 0.02 0.06 ... ]
ق     [ ... 0.09 0.03 0.05 0.04 ... ]

Confusion or emission matrix (B), with rows indexed by feature cluster (e.g., cluster 2: 3 dots, concavity up, density lower, ...):
Index:  ... ع    ل    ش    ق ...
1     [ ... 0.03 0.09 0.1  0.03 ... ]
2     [ ... 0.05 0.05 0.06 0.05 ... ]
3     [ ... 0.07 0.06 0.03 0.04 ... ]
4     [ ... 0.02 0.07 0.03 0.03 ... ]
18. Recognition algorithm
• A modified Viterbi algorithm, coded in C++, was implemented for the recognition.
• Inputs: HMM {A, B, π}, observations.
• Outputs: the recognized letters.
• Sample code:
    // Inputs:
    //   probAtT[letter]    : probability of each letter appearing at time t in
    //                        the sub-word to be recognized.
    //   A[letter][winnerT1]: probability that the letter at time t is followed
    //                        by the recognized winning letter at time t+1.
    // Outputs:
    //   winnerT     : the recognized letter at time t.
    //   probWinnerT : probability of the winning letter at time t.
    int winnerT = 0;
    double probWinnerT = 0.0;
    for (int letter = 0; letter < 48; ++letter) {  // all 48 letters of written Arabic
        double temp = probAtT[letter] * A[letter][winnerT1];
        if (temp > probWinnerT) {
            probWinnerT = temp;
            winnerT = letter;
        }
    }
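For context, here is a minimal self-contained sketch of the textbook Viterbi recursion the modified algorithm is based on; it is not the paper's implementation, and all identifiers are illustrative:

    #include <vector>

    // Textbook Viterbi decode over the HMM defined earlier: given pi, A, B,
    // and a sequence of observed feature clusters, return the most probable
    // letter sequence. delta[t][j] is the probability of the best path
    // ending in letter j at time t; psi[t][j] records the predecessor on
    // that path. Assumes a non-empty observation sequence.
    std::vector<int> viterbi(const std::vector<double>& pi,
                             const std::vector<std::vector<double>>& A,
                             const std::vector<std::vector<double>>& B,
                             const std::vector<int>& obs) {
        const int N = static_cast<int>(pi.size());
        const int T = static_cast<int>(obs.size());
        std::vector<std::vector<double>> delta(T, std::vector<double>(N, 0.0));
        std::vector<std::vector<int>> psi(T, std::vector<int>(N, 0));
        for (int j = 0; j < N; ++j)            // initialization
            delta[0][j] = pi[j] * B[j][obs[0]];
        for (int t = 1; t < T; ++t)            // recursion
            for (int j = 0; j < N; ++j) {
                for (int i = 0; i < N; ++i) {
                    double p = delta[t - 1][i] * A[i][j];
                    if (p > delta[t][j]) { delta[t][j] = p; psi[t][j] = i; }
                }
                delta[t][j] *= B[j][obs[t]];
            }
        int best = 0;                          // termination
        for (int j = 1; j < N; ++j)
            if (delta[T - 1][j] > delta[T - 1][best]) best = j;
        std::vector<int> path(T);              // backtracking
        path[T - 1] = best;
        for (int t = T - 2; t >= 0; --t) path[t] = psi[t + 1][path[t + 1]];
        return path;
    }

The "modified" aspect presumably involves the stand-alone and final-state probabilities constructed earlier, which this textbook form does not use.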
19. Recognition algorithm
[Figure: worked example of recognizing a three-letter sub-word; the recoverable annotations are summarized below.]
• Observed feature clusters: time 0 = cluster 2 (3 dots, concavity up, density lower, ...); time 1 = cluster 15 (1 dot, concavity down, density upper, ...); time 2 = cluster 9 (0 dots, concavity right, density equal, ...).
• Initialization: P{t(0)} is the elementwise product of the initial state probabilities π_i and the emission column B(2) for the first observed cluster.
• Recursion: each later vector is obtained from the previous one via the state transition matrix A(i,j) and the emission column of the next observed cluster, giving P{t(1)} and P{t(2)}.
• Termination and backtracking: winner-t(2) is the letter maximizing P{t(2)}; winner-t(1) and winner-t(0) are then selected recursively using the rule from the previous slide.
20. Recognition flow chart
Steps:
1. Sub-word segmentation.
2. Character segmentation.
3. Character feature extraction.
4. Mapping extracted features into a cluster.
5. Viterbi recognition algorithm.
6. Recognized letters.
Flow: text line sample → segment the line into sub-words → segment sub-words into separated characters → feature extraction → mapping to the predefined feature vector space → observations 1, 2, 3 (initial, middle, and end characters) → apply the Viterbi algorithm in the HMM structure → pick the most probable sequence to generate the observations → the recognized characters → the recognized sub-word.
21. Results
• High OCR recognition rates of Arabic letters (~90%)
were achieved using the developed HMM and Viterbi
algorithm.
• This is a large recognition improvement compared to ~70% in [2], in which only Arabic character features are considered, without performing recognition on the sub-word level.

Measure                                                            Count   Percentage
Characters in corpus                                               8456
Sub-words in corpus                                                3384
Characters correctly recognized                                    7695    91%
Sub-words recognized with zero error                               1388    41%
Sub-words recognized with one error or less                        2436    72%
Sub-words recognized with two errors or less                       3011    89%
Sub-words recognized with zero error after dictionary correction   2741    81%

[2] Abdel-Hafez, M. H., Abu-Dayeh, H. I., Al-Najjar, M. S., "Rule-Based Recognition for Arabic Handwritten OCR," Department of Computer Engineering, University of Jordan (2005).