1. Optical character recognition of handwritten Arabic using hidden Markov models

Mohannad M. Aulama (1), Asem M. Natsheh (1), Gheith A. Abandah (1), Mohammad M. Olama (2)

(1) Computer Engineering Department, University of Jordan
(2) Computational Sciences & Engineering Division, Oak Ridge National Laboratory
2. Outline
• Introduction
• Approach
• Optical Features of Arabic Characters
• Encoding Arabic Language Structure
• Constructing the HMM
• Recognition Algorithm
• Results
3. Why handwritten Arabic OCR?
• After the Latin alphabet, the Arabic alphabet is the second-most widely used alphabet in the world [1].
• The Arabic alphabet is also used to write other languages such as Persian (Farsi), Kurdish, Urdu, etc.
• Comparatively little research has addressed Arabic OCR.
• Handwritten OCR has a wide range of applications: invoice and shipping receipt processing, subscription collection, bank check processing, postal address recognition, and other mail applications.
[1] Encyclopedia Britannica
4. Characteristics of the Arabic language
– Arabic is cursive
العربية ا ل ع ر ب ي ة
– Arabic letter shapes are context dependent
ـه ه هـ ـهـ
– Variability of letter shapes (in handwriting)
5. Overview of Arabic handwritten text
[Figure: a handwritten text line decomposed into its structural levels.]
Sentence → Word → Sub-word → Letter
6. Character information
[Figure: an example sub-word annotated with the two kinds of information used per character.]
• Language structure: letter "ل" is stand-alone; letter "ا" ends the sub-word; letter "ش" is followed by letter "ع".
• Optical properties: number of dots, concavity, curves.
7. Approach
• Both the characters' optical properties and the language structure are considered in the recognition.
• An HMM can efficiently encode sequential information, and is therefore used in this work to encode the character information (optical properties + language structure).
• The recognition algorithm is based on the Viterbi algorithm. It outputs the most probable characters, representing the "recognized sub-word".
8. Optical features of Arabic characters
• A sample of 48 handwritten instances of each Arabic character, written by different individuals, was collected. An example for the Arabic letter dal is shown.
• Features extracted (see the sketch after this list):
– Width-to-length ratio: longer, wider, or equal.
– Density type: upper, lower, or equal.
– Vertical crosses: one, two, three, or four.
– Horizontal crosses: one, two, three, or four.
– Concavity type: up, down, to the left, or to the right.
– Number of dots: zero, one, two, or three.
– Location of dots: up, down, or middle.
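As a sketch, the seven features could be encoded in a small C++ struct like the following; all type and field names are illustrative assumptions, not taken from the paper:

    #include <cstdint>

    // Illustrative encoding of the seven optical features listed above.
    // The paper does not give its internal representation; DotLocation::None
    // is added here for characters with zero dots.
    enum class Ratio : std::uint8_t { Longer, Wider, Equal };
    enum class Density : std::uint8_t { Upper, Lower, Equal };
    enum class Concavity : std::uint8_t { Up, Down, Left, Right };
    enum class DotLocation : std::uint8_t { Up, Down, Middle, None };

    struct OpticalFeatures {
        Ratio widthToLengthRatio;        // longer, wider, or equal
        Density densityType;             // upper, lower, or equal
        std::uint8_t verticalCrosses;    // one to four
        std::uint8_t horizontalCrosses;  // one to four
        Concavity concavityType;         // up, down, to the left, or to the right
        std::uint8_t numDots;            // zero to three
        DotLocation dotLocation;         // up, down, or middle
    };

Multiplying the category counts (3 × 3 × 4 × 4 × 4 × 4 × 3) reproduces the 6912 feature combinations cited on the clustering slide.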
9. Optical features of Arabic characters
[Figure: the example character annotated with its extracted features.]
– Width-to-length ratio: wider
– Density type: lower
– Vertical crosses: three
– Horizontal crosses: one
– Concavity type: up
– Number of dots: three
– Location of dots: up
10. Clustering of character optical features
• The features defined above admit 6912 different possible combinations, so clustering is needed.
• Feature vectors are clustered into 26 homogeneous clusters using the K-means algorithm; a sketch of the resulting assignment step follows.
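The mapping step can be sketched as nearest-centroid assignment (the assignment half of K-means), assuming the features are encoded numerically and the 26 centroids were learned in a prior K-means training pass; this is illustrative only, not the paper's code:

    #include <array>
    #include <cstddef>
    #include <limits>

    constexpr std::size_t kNumFeatures = 7;   // the seven optical features
    constexpr std::size_t kNumClusters = 26;  // homogeneous clusters from K-means

    using FeatureVec = std::array<double, kNumFeatures>;

    // Map a numerically encoded feature vector to the index of the
    // nearest of the 26 cluster centroids.
    std::size_t assignCluster(const FeatureVec& x,
                              const std::array<FeatureVec, kNumClusters>& centroids) {
        std::size_t best = 0;
        double bestDist = std::numeric_limits<double>::infinity();
        for (std::size_t c = 0; c < kNumClusters; ++c) {
            double d = 0.0;
            for (std::size_t f = 0; f < kNumFeatures; ++f) {
                const double diff = x[f] - centroids[c][f];
                d += diff * diff;  // squared Euclidean distance
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }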
11. Encoding Arabic language structure
• A large corpus was used to extract the structure of the Arabic language.
• Results (see the counting sketch after this list):
– Initial state probability: probability that letter x appears as the first letter in a sub-word.
– Final state probability: probability that letter x appears as the last letter in a sub-word.
– Stand-alone state probability: probability that letter x appears as a single-letter sub-word.
– Transition probability: probability that letter x is followed by letter y in a sub-word.
– Frequency of letters: probability that letter x appears anywhere in the corpus.
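A minimal sketch of how these five statistics could be estimated by counting, assuming the corpus is available as sub-words of letter indices; identifiers are illustrative, not from the paper, and a non-empty corpus is assumed:

    #include <cstddef>
    #include <map>
    #include <utility>
    #include <vector>

    // Corpus statistics for the five language-structure probabilities.
    struct LanguageStats {
        std::map<int, double> initial, finalP, standAlone, frequency;
        std::map<std::pair<int, int>, double> transition;
    };

    LanguageStats estimateStats(const std::vector<std::vector<int>>& subWords) {
        LanguageStats s;
        double nSub = subWords.size(), nLetters = 0;
        std::map<int, double> fromCount;      // transitions leaving each letter
        for (const auto& w : subWords) {
            s.initial[w.front()] += 1;        // first letter of the sub-word
            s.finalP[w.back()] += 1;          // last letter of the sub-word
            if (w.size() == 1) s.standAlone[w.front()] += 1;
            for (std::size_t i = 0; i < w.size(); ++i) {
                s.frequency[w[i]] += 1;
                ++nLetters;
                if (i + 1 < w.size()) {       // letter x followed by letter y
                    s.transition[{w[i], w[i + 1]}] += 1;
                    fromCount[w[i]] += 1;
                }
            }
        }
        for (auto& [x, p] : s.initial) p /= nSub;
        for (auto& [x, p] : s.finalP) p /= nSub;
        for (auto& [x, p] : s.standAlone) p /= nSub;
        for (auto& [x, p] : s.frequency) p /= nLetters;
        for (auto& [xy, p] : s.transition) p /= fromCount[xy.first];
        return s;
    }

Transition counts are normalized per source letter, matching the estimator given later for the state transition matrix.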
12. Encoding Arabic language structure
Examples from the corpus:
• Frequency of letter "ل" in the Arabic corpus: 12.86%
• Initial state probability of letter "ش": 0.008975
• Stand-alone state probability of letter "ل": 0.020877
• Final state probability of letter "ا": 0.272702
• Transition probability of letter "ش" to letter "ع": 0.082786
13. Encoding Arabic language structure

Initial state probabilities:
Letter   Probability
ل        0.290859
م        0.096403
ي        0.076956
ف        0.0588
ب        0.057266
ن        0.053009

Final state probabilities:
Letter   Probability
ا        0.272702
ر        0.099381
و        0.094797
ة        0.077327
د        0.064174
ن        0.058851

Stand-alone state probabilities:
Letter   Probability
ا        0.364433
و        0.1559
أ        0.067906
ن        0.06148
ر        0.058728
د        0.043502

Frequency of Arabic letters in the corpus:
Letter   Occurrences   Percentage
ا        47790         15.06%
ل        40796         12.86%
ي        20893         6.58%
و        19669         6.20%
م        19497         6.14%
ن        18213         5.74%

Transition probabilities of Arabic letters in the corpus (excerpt; row letter followed by column letter):
        ء         آ         أ         ؤ         إ         ئ
ء       0         0         0         0         0         0
آ       0         0         0         0         0         0
أ       0         0         0         0         0         0
ؤ       0         0         0         0         0         0
إ       0         0         0         0         0         0
ئ       0         0         0         0         0         0
ا       0         0         0         0         0         0
ب       0.000134  0.000402  0.023361  0.001208  0.002953  0.001208
ة       0         0         0         0         0         0
ت       0         0.000223  0.013989  0.003021  0         0.001119
ث       0         0         0         0         0         0
ج       0         0.000327  0.00295   0         0         0.005901
14. What is an HMM?
• An HMM is a doubly stochastic process with an underlying Markov process that is not directly observable, but can only be observed through another set of stochastic processes that produce the sequence of observed symbols.
• Elements of an HMM:
– States: S = {S1, S2, S3, ..., SN}
– Observations: V = {V1, V2, V3, ..., VM}
– State transition probabilities: A = {a_ij}, where a_ij = Pr{q_(t+1) = S_j | q_t = S_i}, 1 ≤ i, j ≤ N
– Observation probabilities: B = {b_jk}, where b_jk = Pr{V_k at t | q_t = S_j}, 1 ≤ j ≤ N, 1 ≤ k ≤ M
– Initial state probabilities: π = {π_i}, where π_i = Pr{q_1 = S_i}, 1 ≤ i ≤ N
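As a minimal illustrative sketch (not the paper's code), these elements map naturally onto a small C++ container; sizes follow the slides (N = 48 Arabic letters, M = 26 feature clusters), and SA is the stand-alone vector introduced on the next slide:

    #include <vector>

    // Container for the HMM elements defined above. All names are
    // illustrative assumptions.
    struct HMM {
        int N = 48;                          // number of states (letters)
        int M = 26;                          // number of observation clusters
        std::vector<std::vector<double>> A;  // N x N state transition matrix
        std::vector<std::vector<double>> B;  // N x M observation (confusion) matrix
        std::vector<double> pi;              // length-N initial state probabilities
        std::vector<double> SA;              // length-N stand-alone probabilities (slide 15)
    };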
15. Constructing the HMM
• Stand-alone character probability (SA):
– This is a one-dimensional vector where each position corresponds to one of the states (letters), and each value is the probability of that state forming a single-letter sub-word. It is estimated as:
SA[state] = (number of single-letter sub-words composed of this state) / (total number of sub-words)
• Initial state probability vector (π):
– This is a one-dimensional vector where each position corresponds to one of the states (letters), and each value is the probability of that state starting the sub-word being recognized. It is estimated as:
π[state] = (number of sub-words starting with this state) / (total number of sub-words)
16. Constructing the HMM
• State transition matrix (A):
– Denoted A_(N×N) = {a_ij}, where 1 ≤ i, j ≤ N and a_ij is the probability of a transition from state S_i to state S_j in the sub-word being recognized. The entries a_ij are estimated as:
a_ij = (number of transitions from S_i to S_j) / (total number of transitions from S_i)
• Confusion (emission) matrix (B):
– Denoted B_(N×M) = {b_jk}, where 1 ≤ j ≤ N, 1 ≤ k ≤ M; N is the number of states (letters) and M is the number of possible feature vectors (extracted feature clusters) that can be emitted from any state. The entries b_jk are estimated as:
b_jk = (number of times observation V_k is output by state S_j) / (total number of repetitions of character S_j)
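A hedged sketch of how the B entries could be filled from the labeled handwriting samples; the paper reports only the formula, so the names and data layout here are assumptions:

    #include <utility>
    #include <vector>

    // Fill the N x M emission (confusion) matrix from labeled samples:
    // each sample pairs a letter index j with the feature cluster k its
    // extracted features fell into.
    std::vector<std::vector<double>>
    estimateB(const std::vector<std::pair<int, int>>& samples, int N, int M) {
        std::vector<std::vector<double>> B(N, std::vector<double>(M, 0.0));
        std::vector<double> total(N, 0.0);   // repetitions of each character
        for (auto [j, k] : samples) {
            B[j][k] += 1;
            total[j] += 1;
        }
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < M; ++k)
                if (total[j] > 0) B[j][k] /= total[j];
        return B;
    }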
17. Elements of the HMM

Stand-alone character probability vector (SA):
Index:  ... ع    ل    ش    ق ...
        [ ... 0.03 0.09 0.1  0.03 ... ]

Initial state probability vector (π):
Index:  ... ع    ل    ش    ق ...
        [ ... 0.02 0.06 0.15 0.05 ... ]

State transition matrix (A):
Index:  ... ع    ل    ش    ق ...
ع     [ ... 0.03 0.09 0.1  0.03 ... ]
ل     [ ... 0.04 0.07 0.03 0.02 ... ]
ش     [ ... 0.07 0.02 0.02 0.06 ... ]
ق     [ ... 0.09 0.03 0.05 0.04 ... ]

Confusion or emission matrix (B), with rows indexed by feature cluster (e.g., cluster 2: 3 dots, concavity up, density lower, ...):
Index:  ... ع    ل    ش    ق ...
1     [ ... 0.03 0.09 0.1  0.03 ... ]
2     [ ... 0.05 0.05 0.06 0.05 ... ]
3     [ ... 0.07 0.06 0.03 0.04 ... ]
4     [ ... 0.02 0.07 0.03 0.03 ... ]
18. Recognition algorithm
• A modified Viterbi algorithm, coded in C++, was implemented for the recognition.
• Inputs: HMM {A, B, π}, observations.
• Outputs: the recognized letters.
• Sample code:
    // Inputs:
    //   probAtT[letter]    : probability of each letter appearing at time t in
    //                        the sub-word to be recognized.
    //   A[letter][winnerT1]: probability that the letter at time t is followed
    //                        by the recognized winning letter at time t+1.
    // Outputs:
    //   winnerT     : the recognized letter at time t.
    //   probWinnerT : probability of the winning letter at time t.
    int winnerT = 0;
    double probWinnerT = 0.0;
    for (int letter = 0; letter < 48; ++letter) {  // all 48 letters of written Arabic
        double temp = probAtT[letter] * A[letter][winnerT1];
        if (temp > probWinnerT) {
            probWinnerT = temp;
            winnerT = letter;
        }
    }
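For context, here is a minimal self-contained sketch of the textbook Viterbi recursion the modified algorithm is based on; it is not the paper's implementation, and all identifiers are illustrative:

    #include <vector>

    // Textbook Viterbi decode over the HMM defined earlier: given pi, A, B,
    // and a sequence of observed feature clusters, return the most probable
    // letter sequence. delta[t][j] is the probability of the best path
    // ending in letter j at time t; psi[t][j] records the predecessor on
    // that path. Assumes a non-empty observation sequence.
    std::vector<int> viterbi(const std::vector<double>& pi,
                             const std::vector<std::vector<double>>& A,
                             const std::vector<std::vector<double>>& B,
                             const std::vector<int>& obs) {
        const int N = static_cast<int>(pi.size());
        const int T = static_cast<int>(obs.size());
        std::vector<std::vector<double>> delta(T, std::vector<double>(N, 0.0));
        std::vector<std::vector<int>> psi(T, std::vector<int>(N, 0));
        for (int j = 0; j < N; ++j)            // initialization
            delta[0][j] = pi[j] * B[j][obs[0]];
        for (int t = 1; t < T; ++t)            // recursion
            for (int j = 0; j < N; ++j) {
                for (int i = 0; i < N; ++i) {
                    double p = delta[t - 1][i] * A[i][j];
                    if (p > delta[t][j]) { delta[t][j] = p; psi[t][j] = i; }
                }
                delta[t][j] *= B[j][obs[t]];
            }
        int best = 0;                          // termination
        for (int j = 1; j < N; ++j)
            if (delta[T - 1][j] > delta[T - 1][best]) best = j;
        std::vector<int> path(T);              // backtracking
        path[T - 1] = best;
        for (int t = T - 2; t >= 0; --t) path[t] = psi[t + 1][path[t + 1]];
        return path;
    }

The "modified" aspect presumably involves the stand-alone and final-state probabilities constructed earlier, which this textbook form does not use.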
19. Recognition algorithm
[Figure: worked example of recognizing a three-letter sub-word; the recoverable annotations are summarized below.]
• Observed feature clusters: time 0 = cluster 2 (3 dots, concavity up, density lower, ...); time 1 = cluster 15 (1 dot, concavity down, density upper, ...); time 2 = cluster 9 (0 dots, concavity right, density equal, ...).
• Initialization: P{t(0)} is the elementwise product of the initial state probabilities π_i and the emission column B(2) for the first observed cluster.
• Recursion: each later vector is obtained from the previous one via the state transition matrix A(i,j) and the emission column of the next observed cluster, giving P{t(1)} and P{t(2)}.
• Termination and backtracking: winner-t(2) is the letter maximizing P{t(2)}; winner-t(1) and winner-t(0) are then selected recursively using the rule from the previous slide.
20. Recognition flow chart
Steps:
1. Sub-word segmentation.
2. Character segmentation.
3. Character feature extraction.
4. Mapping extracted features into a cluster.
5. Viterbi recognition algorithm.
6. Recognized letters.
Flow: text line sample → segment the line into sub-words → segment sub-words into separated characters → feature extraction → mapping to the predefined feature vector space → observations 1, 2, 3 (initial, middle, and end characters) → apply the Viterbi algorithm in the HMM structure → pick the most probable sequence to generate the observations → the recognized characters → the recognized sub-word.
21. Results
• High OCR recognition rates of Arabic letters (~90%)
were achieved using the developed HMM and Viterbi
algorithm.
• This is a large recognition improvement compared to ~70% in [2], in which only Arabic character features are considered, without performing recognition on the sub-word level.

Measure                                                            Count   Percentage
Characters in corpus                                               8456
Sub-words in corpus                                                3384
Characters correctly recognized                                    7695    91%
Sub-words recognized with zero error                               1388    41%
Sub-words recognized with one error or less                        2436    72%
Sub-words recognized with two errors or less                       3011    89%
Sub-words recognized with zero error after dictionary correction   2741    81%

[2] Abdel-Hafez, M. H., Abu-Dayeh, H. I., Al-Najjar, M. S., "Rule-Based Recognition for Arabic Handwritten OCR," Department of Computer Engineering, University of Jordan (2005).