International Journal on Natural Language Computing (IJNLC) Vol. 2, No.4, August 2013
DOI : 10.5121/ijnlc.2013.2406
HANDLING UNKNOWN WORDS IN NAMED ENTITY
RECOGNITION USING TRANSLITERATION
Deepti Chopra, Sudha Morwal and Dr. G.N. Purohit
Department of Computer Engineering, Banasthali Vidyapith, Rajasthan, INDIA
deeptichopra11@yahoo.co.in
sudha_morwal@yahoo.co.in
ABSTRACT
Transliteration is the conversion of text from the script of one language into the script of another. In
this paper, we discuss how transliteration helps in handling unknown words in Named Entity Recognition
(NER), and we present some of our results on unknown-word handling in NER using transliteration.
KEYWORDS
Transliteration, NER, Unknown words, Performance Metrics
1. INTRODUCTION
Transliteration may be defined as converting text word by word or letter by letter from one language
to another: the letters or phonetic sounds of text written in one language are mapped to the letters
of another language. Applications of transliteration include Named Entity Recognition, Machine
Translation, Information Retrieval and Information Extraction. Named Entity Recognition (NER) is a
subtask of Information Extraction in which named entities, or proper nouns, are located in a large
body of text and classified into categories. Transliteration plays a crucial role in resolving the
problem of unknown words in NER. If a word has been learned from training examples in one language,
it can be transliterated into other languages as well.
e. g. नुसरत/PER सेब खाती है
/PER कानपुर/CITY रहती है
/PER नागपुर/CITY रहती है
Above, the Named Entities in Hindi are transliterated into English as: Nusrat, Deepika, Kanpur, Deepa
and Nagpur.
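The letter-by-letter mapping described above can be sketched as a lookup table. The following is a minimal illustration only: the table `DEVANAGARI_TO_LATIN` and the function `transliterate` are hypothetical names, the table covers just the characters needed for two of the example names, and the sketch ignores the inherent vowel of Devanagari consonants, which a real transliteration system would have to handle.

```python
# Hypothetical character table: only the Devanagari characters needed for
# the two example names below. A real system needs a full sound mapping.
DEVANAGARI_TO_LATIN = {
    "\u0915": "k",  # letter KA
    "\u0928": "n",  # letter NA
    "\u092a": "p",  # letter PA
    "\u092e": "m",  # letter MA
    "\u0930": "r",  # letter RA
    "\u093e": "a",  # vowel sign AA
    "\u0941": "u",  # vowel sign U
}


def transliterate(word: str) -> str:
    """Map each character through the table; unmapped characters pass through."""
    return "".join(DEVANAGARI_TO_LATIN.get(ch, ch) for ch in word)


print(transliterate("राम"))
print(transliterate("कानपुर"))
```

Because unmapped characters are passed through unchanged, gaps in the table stay visible in the output instead of being silently dropped.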
2. RELATED WORK
Named Entity Recognition (NER) is the process in which Named Entities, or proper nouns, are
identified and then categorized into Named Entity classes, e.g. name of Person, Sport, River,
Country, State, City, Organization etc. Unknown words in Named Entity Recognition can be
handled in several ways.
In one approach, each unknown word is allotted a specific tag, e.g. an unk tag, in order to
handle unknown words in NER [1][2].
Capitalization information cannot be used to identify the POS tag, since capitalization can also
occur in unknown words. Moreover, capitalization information is limited to identifying Named
Entities in English [3][4].
3. OUR APPROACH
In our approach, we treat the ‘OTHER’ tag as the ‘Not a Named Entity’ tag. Figure 1 depicts our
approach during training in NER. We process each token of the training data and check its tag. If
the token is not tagged ‘OTHER’, we transliterate it into different languages and save the
transliterated Named Entities, along with their tags, in a file. If the token is tagged ‘OTHER’, we
simply move on to the next token in the sentence. This procedure continues until all the training
sentences have been processed.
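The training step just described can be sketched as follows. This is an illustration under stated assumptions, not our actual code: the names `build_transliteration_table` and `fake_transliterate` are hypothetical, the transliterator is passed in as a function because no particular transliteration library is assumed, and an in-memory dictionary stands in for the transliteration file.

```python
from typing import Callable, Dict, Iterable, List

# A transliterator takes (word, target_language) and returns the word written
# in that language's script. It is a parameter because no specific
# transliteration library is assumed here.
Transliterator = Callable[[str, str], str]


def build_transliteration_table(
    tagged_sentences: Iterable[str],
    target_languages: List[str],
    transliterate: Transliterator,
) -> Dict[str, str]:
    """Return a mapping from each transliterated Named Entity to its tag."""
    table: Dict[str, str] = {}
    for sentence in tagged_sentences:
        for item in sentence.split():
            token, sep, tag = item.rpartition("/")
            if not sep or tag == "OTHER":
                continue  # untagged or not a Named Entity: go to the next token
            for language in target_languages:
                table[transliterate(token, language)] = tag
    return table


# Stand-in transliterator for demonstration only; it merely labels the word.
def fake_transliterate(word: str, language: str) -> str:
    return f"{word}@{language}"


table = build_transliteration_table(
    ["Ram/PER plays/OTHER cricket/OTHER"], ["hi", "ur"], fake_transliterate
)
print(table)
```

Only `Ram` reaches the table here, since `plays` and `cricket` carry the ‘OTHER’ tag in this training sentence.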
[Figure 1: flowchart. START; process each token of the training data; if the token's tag is not OTHER, transliterate the token and attach its tag; if the end of the sentence has been reached, exit, otherwise process the next token.]
Figure 1 Steps taken while performing training in Named Entity Recognition
If we have the sound mapping from one language into another, the transliteration task can be
performed easily. For example, if ‘Ram/PER plays/OTHER cricket/OTHER’ is a training sentence,
then Ram is transliterated into other languages, since Ram is a Named Entity tagged ‘PER’ and
not ‘OTHER’. ‘plays’ and ‘cricket’ are not Named Entities, since they are tagged ‘OTHER’, so they
are not transliterated. Our approach is therefore language independent: it can recognize Named
Entities in text written in any natural language, provided that the Named Entity Recognition
system has been trained in some language and that sound mappings exist from which the
transliterated Named Entities in the other languages can be obtained. Figure 2 displays the steps
taken while performing testing in a Named Entity Recognition based system. All the tokens in a
testing sentence are processed one at a time. If the current token is not an unknown entity, a tag
is assigned to it using the Viterbi algorithm (when a Hidden Markov Model is used to perform
Named Entity Recognition). If the current token is an unknown entity, the transliteration file
generated during training is searched. If the unknown entity is found in the transliteration file,
the corresponding Named Entity tag is attached to it; otherwise the ‘OTHER’ tag is allotted to it.
This procedure continues until all the testing sentences have been processed.
[Figure 2: flowchart. Process each token of the testing sentence; if the token is not an unknown entity, attach a tag to it; otherwise search the transliteration file; if the entity exists there, attach its Named Entity tag, otherwise attach the ‘OTHER’ tag; repeat until the end of the sentence, then exit.]
Figure 2 Steps taken while performing testing in Named Entity Recognition
Consider a testing sentence: राम खेलता है
If the training sentence is: Ram/PER plays/OTHER cricket/SPORT, then during training the Named
Entities, i.e. Ram and cricket, are transliterated and stored in a transliteration file along with
their tags. So, during testing, राम and the Hindi transliteration of cricket are given the PER and
SPORT tags respectively. The tokens खेलता and है are unknown entities, since they were neither
seen during training nor exist in the transliteration file; they are therefore allotted the
‘OTHER’ tag.
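The testing procedure and the example above can be sketched as follows. The function `tag_tokens` and the dictionary `known_tags` are hypothetical: `known_tags` is a simple stand-in for the tag that a trained model (for instance an HMM decoded with the Viterbi algorithm) would assign to a word seen during training.

```python
from typing import Dict, List, Tuple


def tag_tokens(
    tokens: List[str],
    known_tags: Dict[str, str],
    transliteration_table: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Tag each token: trained model first, then transliteration file, else OTHER."""
    tagged = []
    for token in tokens:
        if token in known_tags:
            # Stand-in for the tag chosen by the trained model.
            tag = known_tags[token]
        elif token in transliteration_table:
            # Unknown word found in the transliteration file.
            tag = transliteration_table[token]
        else:
            # Unknown word that is not a transliterated Named Entity.
            tag = "OTHER"
        tagged.append((token, tag))
    return tagged


known_tags = {"Ram": "PER", "plays": "OTHER", "cricket": "SPORT"}
transliteration_table = {"राम": "PER"}  # produced during training
tokens = "राम खेलता है".split()
result = tag_tokens(tokens, known_tags, transliteration_table)
print(result)
```

In this run the first token is resolved through the transliteration table, while the remaining two tokens fall through to the ‘OTHER’ tag.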
3. RESULT
Handling unknown words is a crucial issue in Named Entity Recognition. We have developed code
that generates a list of the Named Entities used in the training file. These Named Entities are
then transliterated into different languages and stored in a separate transliteration file for each
language. TABLE 1 displays the transliterations of Named Entities in different languages obtained
using our transliteration code.
TABLE 1 Transliteration of Named Entities in different languages
ENGLISH | HINDI | FRENCH | URDU | TELUGU | BENGALI
Ram | राम | Bélier | رام | | গল
Cricket | | Cricket | ﮐﺮﮐﭧ | | িে কট খলা
India | भारत | Inde | ﺑﮭﺎرت | | ভারত
Ganga | गंगा | Ganga | ﮔﻨﮕﺎ | | গা
For a given testing sentence, we first check whether all the tokens exist in the word list already
created. If training has been performed for all the words, the optimal state sequence can be
generated using the Viterbi algorithm. If a word is unknown, i.e. it does not exist in the training
file, we check the transliteration file; if the word exists there, the correct tag can be allotted
to it, and the problem of the unknown word is resolved. If training in NER has been performed in
English and a Hindi word is supplied during testing, that word is treated as an unknown word for
which no training has been performed. We can therefore generate the transliterated list of Named
Entities from English into other languages and so handle the problem of unknown words in NER.
TABLE 2 displays our results obtained while performing NER on a multilingual corpus. The training
and testing files include sentences from the following languages: English, Hindi, Marathi, Punjabi
and Urdu. We performed training on 142 tokens and testing on 212 tokens. The Recall, Precision and
F-Measure obtained are 95.80%, 96.3% and 96.04% respectively.
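The optimal state sequence mentioned above is what the Viterbi algorithm computes for a Hidden Markov Model. A compact sketch follows; the function name `viterbi` and every probability in the toy model are illustrative assumptions and are not values from our experiments.

```python
from typing import Dict, List


def viterbi(
    tokens: List[str],
    states: List[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the most probable tag sequence for tokens under an HMM."""
    # best[i][s]: probability of the best path that ends in state s at position i
    best = [{s: start_p[s] * emit_p[s].get(tokens[0], 0.0) for s in states}]
    back: List[Dict[str, str]] = [{}]
    for i in range(1, len(tokens)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor state that maximizes the path probability.
            prev = max(states, key=lambda p: best[i - 1][p] * trans_p[p][s])
            score = best[i - 1][prev] * trans_p[prev][s]
            best[i][s] = score * emit_p[s].get(tokens[i], 0.0)
            back[i][s] = prev
    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Toy model with made-up probabilities, for illustration only.
states = ["PER", "OTHER"]
start_p = {"PER": 0.5, "OTHER": 0.5}
trans_p = {
    "PER": {"PER": 0.1, "OTHER": 0.9},
    "OTHER": {"PER": 0.2, "OTHER": 0.8},
}
emit_p = {"PER": {"Ram": 1.0}, "OTHER": {"plays": 0.5, "cricket": 0.5}}

tags = viterbi(["Ram", "plays", "cricket"], states, start_p, trans_p, emit_p)
print(tags)
```

A word with no emission probability in any state drives every path score to zero, which is one way to see why unknown words need separate handling such as the transliteration lookup.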
TABLE 2 Unknown words handling in NER having 28 sentences in Training file
Tag set: {Person, City, Sport}
Number of training sentences: 28 (142 tokens)
Number of testing sentences: 41 (212 tokens)
Number of sentences tagged wrongly: 0
Tag | PER | City | Sport | Not-a-name
Frequency of tags according to human (correct answer) | 42 | 6 | 1 | 163
Observed (system-provided) frequency of tags (experimental results) | 41 | 3 | 1 | 167
Recall (R) = Correct / (Correct + Incorrect + Spurious) = 209 / (209 + 8 + 1) = 95.80%
Precision (P) = Correct / (Correct + Incorrect + Missing) = 209 / (209 + 8 + 0) = 96.3%
F-Measure = (2 * P * R) / (P + R) = (2 * 96.3 * 95.8) / (96.3 + 95.8) = 96.04%
TABLE 3 lists the Named Entity tags whose entities are transliterated into different
languages. The transliterated Named Entities are stored in a file along with their tags,
which can then be used to resolve the problem of unknown entities in Named Entity
Recognition.
TABLE 3 Named Entities transliterated into different languages
SNO TAGS
1 PER (Name of Person)
2 OZ (Name of Organization)
3 CN (Name of Country)
4 MAG (Name of Magazine)
5 WEEK (Name of Week)
6 LOC (Name of Location)
7 PC (Name of Personal Computer)
8 MONTH (Name of Month)
9 CITY (Name of City)
10 ST (Name of State)
11 SPORT (Name of Sport)
4. PERFORMANCE METRICS
Performance metrics are measures used to estimate the performance of a NER based system.
They can be calculated in terms of three parameters: Precision, Recall and F-Measure. The
output of a NER based system is termed the “response” and the human interpretation the
“answer key” [5]. Consider the following terms:
1. Correct: the response is the same as the answer key.
2. Incorrect: the response is not the same as the answer key.
3. Missing: the answer key is tagged but the response is not tagged.
4. Spurious: the response is tagged but the answer key is not tagged. [6]
Hence, we define Precision, Recall and F-Measure as follows: [7][8][9]
Precision (P): Correct / (Correct + Incorrect + Missing)
Recall (R): Correct / (Correct + Incorrect + Spurious)
F-Measure: (2 * P * R) / (P + R)
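These three formulas can be evaluated directly. The sketch below follows the definitions exactly as stated in this section and uses the counts that appear in TABLE 2 (209 correct, 8 incorrect, 0 missing, 1 spurious); the function name `ner_scores` is hypothetical. Note that some scoring conventions, such as MUC-style scoring, pair Spurious with Precision and Missing with Recall instead, and that small differences from quoted percentages can arise from rounding.

```python
from typing import Tuple


def ner_scores(
    correct: int, incorrect: int, missing: int, spurious: int
) -> Tuple[float, float, float]:
    """Precision, Recall and F-Measure as defined in this section."""
    precision = correct / (correct + incorrect + missing)
    recall = correct / (correct + incorrect + spurious)
    f_measure = (2 * precision * recall) / (precision + recall)
    return precision, recall, f_measure


# Counts taken from TABLE 2.
precision, recall, f_measure = ner_scores(
    correct=209, incorrect=8, missing=0, spurious=1
)
print(f"P = {precision:.2%}, R = {recall:.2%}, F = {f_measure:.2%}")
```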
5. CONCLUSION
In this paper, we have discussed transliteration. We have presented the results obtained while
performing Named Entity Recognition on a multilingual corpus and handling unknown words using
the transliteration approach. We have also discussed performance metrics, which are an
important means of judging the performance of a Named Entity Recognition based system.
ACKNOWLEDGEMENT
I would like to thank all those who helped me in accomplishing this task.
REFERENCES
[1] http://www.cse.iitb.ac.in/~cs626-460-2012/seminar_ppts/ner.pdf
[2] http://staff.um.edu.mt/cabe2/publications/nfHmms.pdf
[3] http://acl.ldc.upenn.edu/acl2002/MAIN/pdfs/Main036.pdf
[4] http://www.sigkdd.org/explorations/issues/7-1-2005-06/4-Fu.pdf
[5] Shilpi Srivastava, Mukund Sanglikar and D.C. Kothari, “Named Entity Recognition System for Hindi
Language: A Hybrid Approach”, International Journal of Computational Linguistics (IJCL), Volume 2,
Issue 1, 2011. Available at:
http://cscjournals.org/csc/manuscript/Journals/IJCL/volume2/Issue1/IJCL-19.pdf
[6] B. Sasidhar, P. M. Yohan, Dr. A. Vinaya Babu and Dr. A. Govardhan, “A Survey on Named Entity
Recognition in Indian Languages with particular reference to Telugu”, IJCSI International Journal of
Computer Science Issues, Vol. 8, Issue 2, March 2011.
[7] Asif Ekbal, Rejwanul Haque, Amitava Das, Venkateswarlu Poka and Sivaji Bandyopadhyay,
“Language Independent Named Entity Recognition in Indian Languages”, in Proceedings of the
IJCNLP-08 Workshop on NER for South and South East Asian Languages, pages 33–40, Hyderabad,
India, January 2008. Available at: http://www.mt-archive.info/IJCNLP-2008-Ekbal.pdf
[8] Darvinder Kaur and Vishal Gupta, “A survey of Named Entity Recognition in English and other Indian
Languages”, IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 6, November
2010.
[9] G.V.S. Raju, B. Srinivasu, Dr. S. Viswanadha Raju and K.S.M.V. Kumar, “Named Entity
Recognition for Telugu Using Maximum Entropy Model”.
About Authors
Deepti Chopra is working as Assistant Professor in the Department of Computer Science
at Banasthali University (Rajasthan), India. She received her B.Tech degree in Computer
Science and Engineering from Rajasthan College of Engineering for Women, Jaipur,
Rajasthan in 2011 and her M.Tech in Computer Science and Engineering from
Banasthali University, Rajasthan in 2013. Her research interests include Artificial
Intelligence, Natural Language Processing and Information Retrieval. She has published many papers in
international journals and conferences.
Sudha Morwal is an active researcher in the field of Natural Language Processing.
She is currently working as Associate Professor in the Department of Computer Science at
Banasthali University (Rajasthan), India. She has done M.Tech (Computer Science),
NET and M.Sc (Computer Science), and her PhD is in progress at Banasthali University
(Rajasthan), India. She has published many papers in international conferences and
journals.
Dr. G. N. Purohit is a Professor in the Department of Mathematics & Statistics at Banasthali
University (Rajasthan). Before joining Banasthali University, he was Professor and Head of
the Department of Mathematics, University of Rajasthan, Jaipur. He has been Chief Editor
of a research journal and a regular reviewer for many journals. His present interests are O.R.,
Discrete Mathematics and Communication Networks. He has published around 40 research
papers in various journals.

More Related Content

What's hot

An Improved Approach for Word Ambiguity Removal
An Improved Approach for Word Ambiguity RemovalAn Improved Approach for Word Ambiguity Removal
An Improved Approach for Word Ambiguity RemovalWaqas Tariq
 
Survey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi LanguageSurvey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi LanguageEditor IJCATR
 
DBMS Campus crack Question Prepared by Randhir Kumar
DBMS Campus crack Question Prepared by Randhir KumarDBMS Campus crack Question Prepared by Randhir Kumar
DBMS Campus crack Question Prepared by Randhir KumarRandhir Chouhan
 
Named Entity Recognition for Telugu Using Conditional Random Field
Named Entity Recognition for Telugu Using Conditional Random FieldNamed Entity Recognition for Telugu Using Conditional Random Field
Named Entity Recognition for Telugu Using Conditional Random FieldWaqas Tariq
 
DISCRIMINATION OF ENGLISH TO OTHER INDIAN LANGUAGES (KANNADA AND HINDI) FOR O...
DISCRIMINATION OF ENGLISH TO OTHER INDIAN LANGUAGES (KANNADA AND HINDI) FOR O...DISCRIMINATION OF ENGLISH TO OTHER INDIAN LANGUAGES (KANNADA AND HINDI) FOR O...
DISCRIMINATION OF ENGLISH TO OTHER INDIAN LANGUAGES (KANNADA AND HINDI) FOR O...IJCSEA Journal
 
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...ijnlc
 
Isolated word recognition using lpc & vector quantization
Isolated word recognition using lpc & vector quantizationIsolated word recognition using lpc & vector quantization
Isolated word recognition using lpc & vector quantizationeSAT Publishing House
 
Parameters Optimization for Improving ASR Performance in Adverse Real World N...
Parameters Optimization for Improving ASR Performance in Adverse Real World N...Parameters Optimization for Improving ASR Performance in Adverse Real World N...
Parameters Optimization for Improving ASR Performance in Adverse Real World N...Waqas Tariq
 
THE STRUCTURED COMPACT TAG-SET FOR LUGANDA
THE STRUCTURED COMPACT TAG-SET FOR LUGANDATHE STRUCTURED COMPACT TAG-SET FOR LUGANDA
THE STRUCTURED COMPACT TAG-SET FOR LUGANDAijnlc
 
A survey of named entity recognition in assamese and other indian languages
A survey of named entity recognition in assamese and other indian languagesA survey of named entity recognition in assamese and other indian languages
A survey of named entity recognition in assamese and other indian languagesijnlc
 
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVMHINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVMijnlc
 
Ijartes v1-i1-002
Ijartes v1-i1-002Ijartes v1-i1-002
Ijartes v1-i1-002IJARTES
 
Tamil Morphological Analysis
Tamil Morphological AnalysisTamil Morphological Analysis
Tamil Morphological AnalysisKarthik Sankar
 
HANDLING CHALLENGES IN RULE BASED MACHINE TRANSLATION FROM MARATHI TO ENGLISH
HANDLING CHALLENGES IN RULE BASED MACHINE TRANSLATION FROM MARATHI TO ENGLISHHANDLING CHALLENGES IN RULE BASED MACHINE TRANSLATION FROM MARATHI TO ENGLISH
HANDLING CHALLENGES IN RULE BASED MACHINE TRANSLATION FROM MARATHI TO ENGLISHijnlc
 

What's hot (19)

An Improved Approach for Word Ambiguity Removal
An Improved Approach for Word Ambiguity RemovalAn Improved Approach for Word Ambiguity Removal
An Improved Approach for Word Ambiguity Removal
 
Survey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi LanguageSurvey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi Language
 
DBMS Campus crack Question Prepared by Randhir Kumar
DBMS Campus crack Question Prepared by Randhir KumarDBMS Campus crack Question Prepared by Randhir Kumar
DBMS Campus crack Question Prepared by Randhir Kumar
 
Named Entity Recognition for Telugu Using Conditional Random Field
Named Entity Recognition for Telugu Using Conditional Random FieldNamed Entity Recognition for Telugu Using Conditional Random Field
Named Entity Recognition for Telugu Using Conditional Random Field
 
DISCRIMINATION OF ENGLISH TO OTHER INDIAN LANGUAGES (KANNADA AND HINDI) FOR O...
DISCRIMINATION OF ENGLISH TO OTHER INDIAN LANGUAGES (KANNADA AND HINDI) FOR O...DISCRIMINATION OF ENGLISH TO OTHER INDIAN LANGUAGES (KANNADA AND HINDI) FOR O...
DISCRIMINATION OF ENGLISH TO OTHER INDIAN LANGUAGES (KANNADA AND HINDI) FOR O...
 
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...
 
Isolated word recognition using lpc & vector quantization
Isolated word recognition using lpc & vector quantizationIsolated word recognition using lpc & vector quantization
Isolated word recognition using lpc & vector quantization
 
Parameters Optimization for Improving ASR Performance in Adverse Real World N...
Parameters Optimization for Improving ASR Performance in Adverse Real World N...Parameters Optimization for Improving ASR Performance in Adverse Real World N...
Parameters Optimization for Improving ASR Performance in Adverse Real World N...
 
D2 anandkumar
D2 anandkumarD2 anandkumar
D2 anandkumar
 
THE STRUCTURED COMPACT TAG-SET FOR LUGANDA
THE STRUCTURED COMPACT TAG-SET FOR LUGANDATHE STRUCTURED COMPACT TAG-SET FOR LUGANDA
THE STRUCTURED COMPACT TAG-SET FOR LUGANDA
 
Anandkumar novel approach
Anandkumar novel approachAnandkumar novel approach
Anandkumar novel approach
 
A survey of named entity recognition in assamese and other indian languages
A survey of named entity recognition in assamese and other indian languagesA survey of named entity recognition in assamese and other indian languages
A survey of named entity recognition in assamese and other indian languages
 
C7 agramakirshnan2
C7 agramakirshnan2C7 agramakirshnan2
C7 agramakirshnan2
 
D017422528
D017422528D017422528
D017422528
 
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVMHINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
 
Ijartes v1-i1-002
Ijartes v1-i1-002Ijartes v1-i1-002
Ijartes v1-i1-002
 
Tamil Morphological Analysis
Tamil Morphological AnalysisTamil Morphological Analysis
Tamil Morphological Analysis
 
HANDLING CHALLENGES IN RULE BASED MACHINE TRANSLATION FROM MARATHI TO ENGLISH
HANDLING CHALLENGES IN RULE BASED MACHINE TRANSLATION FROM MARATHI TO ENGLISHHANDLING CHALLENGES IN RULE BASED MACHINE TRANSLATION FROM MARATHI TO ENGLISH
HANDLING CHALLENGES IN RULE BASED MACHINE TRANSLATION FROM MARATHI TO ENGLISH
 
BERT introduction
BERT introductionBERT introduction
BERT introduction
 

Viewers also liked

50 Essential Content Marketing Hacks (Content Marketing World)
50 Essential Content Marketing Hacks (Content Marketing World)50 Essential Content Marketing Hacks (Content Marketing World)
50 Essential Content Marketing Hacks (Content Marketing World)Heinz Marketing Inc
 
Prototyping is an attitude
Prototyping is an attitudePrototyping is an attitude
Prototyping is an attitudeWith Company
 
10 Insightful Quotes On Designing A Better Customer Experience
10 Insightful Quotes On Designing A Better Customer Experience10 Insightful Quotes On Designing A Better Customer Experience
10 Insightful Quotes On Designing A Better Customer ExperienceYuan Wang
 
Learn BEM: CSS Naming Convention
Learn BEM: CSS Naming ConventionLearn BEM: CSS Naming Convention
Learn BEM: CSS Naming ConventionIn a Rocket
 
How to Build a Dynamic Social Media Plan
How to Build a Dynamic Social Media PlanHow to Build a Dynamic Social Media Plan
How to Build a Dynamic Social Media PlanPost Planner
 
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldaba
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika AldabaLightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldaba
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldabaux singapore
 

Viewers also liked (7)

50 Essential Content Marketing Hacks (Content Marketing World)
50 Essential Content Marketing Hacks (Content Marketing World)50 Essential Content Marketing Hacks (Content Marketing World)
50 Essential Content Marketing Hacks (Content Marketing World)
 
Prototyping is an attitude
Prototyping is an attitudePrototyping is an attitude
Prototyping is an attitude
 
10 Insightful Quotes On Designing A Better Customer Experience
10 Insightful Quotes On Designing A Better Customer Experience10 Insightful Quotes On Designing A Better Customer Experience
10 Insightful Quotes On Designing A Better Customer Experience
 
Learn BEM: CSS Naming Convention
Learn BEM: CSS Naming ConventionLearn BEM: CSS Naming Convention
Learn BEM: CSS Naming Convention
 
How to Build a Dynamic Social Media Plan
How to Build a Dynamic Social Media PlanHow to Build a Dynamic Social Media Plan
How to Build a Dynamic Social Media Plan
 
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldaba
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika AldabaLightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldaba
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldaba
 
Succession “Losers”: What Happens to Executives Passed Over for the CEO Job?
Succession “Losers”: What Happens to Executives Passed Over for the CEO Job? Succession “Losers”: What Happens to Executives Passed Over for the CEO Job?
Succession “Losers”: What Happens to Executives Passed Over for the CEO Job?
 

Similar to HANDLING UNKNOWN WORDS IN NAMED ENTITY RECOGNITION USING TRANSLITERATION

Identification and Classification of Named Entities in Indian Languages
Identification and Classification of Named Entities in Indian LanguagesIdentification and Classification of Named Entities in Indian Languages
Identification and Classification of Named Entities in Indian Languageskevig
 
NERHMM: A Tool for Named Entity Recognition Based on Hidden Markov Model
NERHMM: A Tool for Named Entity Recognition Based on Hidden Markov ModelNERHMM: A Tool for Named Entity Recognition Based on Hidden Markov Model
NERHMM: A Tool for Named Entity Recognition Based on Hidden Markov Modelkevig
 
IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...
IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...
IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...IRJET Journal
 
ISSUES AND CHALLENGES IN MARATHI NAMED ENTITY RECOGNITION
ISSUES AND CHALLENGES IN MARATHI NAMED ENTITY RECOGNITIONISSUES AND CHALLENGES IN MARATHI NAMED ENTITY RECOGNITION
ISSUES AND CHALLENGES IN MARATHI NAMED ENTITY RECOGNITIONijnlc
 
AI UNIT 3 - SRCAS JOC.pptx enjoy this ppt
AI UNIT 3 - SRCAS JOC.pptx enjoy this pptAI UNIT 3 - SRCAS JOC.pptx enjoy this ppt
AI UNIT 3 - SRCAS JOC.pptx enjoy this pptpavankalyanadroittec
 
Data Analytics using R with Yelp Dataset
Data Analytics using R with Yelp DatasetData Analytics using R with Yelp Dataset
Data Analytics using R with Yelp DatasetCédric Poottaren
 
STUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGES
STUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGESSTUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGES
STUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGESijistjournal
 
INFORMATION RETRIEVAL FROM TEXT
INFORMATION RETRIEVAL FROM TEXTINFORMATION RETRIEVAL FROM TEXT
INFORMATION RETRIEVAL FROM TEXTijcseit
 
HINDI NAMED ENTITY RECOGNITION BY AGGREGATING RULE BASED HEURISTICS AND HIDDE...
HINDI NAMED ENTITY RECOGNITION BY AGGREGATING RULE BASED HEURISTICS AND HIDDE...HINDI NAMED ENTITY RECOGNITION BY AGGREGATING RULE BASED HEURISTICS AND HIDDE...
HINDI NAMED ENTITY RECOGNITION BY AGGREGATING RULE BASED HEURISTICS AND HIDDE...ijistjournal
 
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...kevig
 
Pos Tagging for Classical Tamil Texts
Pos Tagging for Classical Tamil TextsPos Tagging for Classical Tamil Texts
Pos Tagging for Classical Tamil Textsijcnes
 
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUECOMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUEJournal For Research
 
Natural Language Processing using Text Mining
Natural Language Processing using Text MiningNatural Language Processing using Text Mining
Natural Language Processing using Text MiningSushanti Acharya
 
PART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHOD
PART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHODPART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHOD
PART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHODijait
 
Candidate Link Generation Using Semantic Phermone Swarm
Candidate Link Generation Using Semantic Phermone SwarmCandidate Link Generation Using Semantic Phermone Swarm
Candidate Link Generation Using Semantic Phermone Swarmkevig
 

Similar to HANDLING UNKNOWN WORDS IN NAMED ENTITY RECOGNITION USING TRANSLITERATION (20)

Identification and Classification of Named Entities in Indian Languages
Identification and Classification of Named Entities in Indian LanguagesIdentification and Classification of Named Entities in Indian Languages
Identification and Classification of Named Entities in Indian Languages
 
NERHMM: A Tool for Named Entity Recognition Based on Hidden Markov Model
NERHMM: A Tool for Named Entity Recognition Based on Hidden Markov ModelNERHMM: A Tool for Named Entity Recognition Based on Hidden Markov Model
NERHMM: A Tool for Named Entity Recognition Based on Hidden Markov Model
 
FIRE2014_IIT-P
FIRE2014_IIT-PFIRE2014_IIT-P
FIRE2014_IIT-P
 
IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...
IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...
IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...
 
ISSUES AND CHALLENGES IN MARATHI NAMED ENTITY RECOGNITION
ISSUES AND CHALLENGES IN MARATHI NAMED ENTITY RECOGNITIONISSUES AND CHALLENGES IN MARATHI NAMED ENTITY RECOGNITION
ISSUES AND CHALLENGES IN MARATHI NAMED ENTITY RECOGNITION
 
AI UNIT 3 - SRCAS JOC.pptx enjoy this ppt
AI UNIT 3 - SRCAS JOC.pptx enjoy this pptAI UNIT 3 - SRCAS JOC.pptx enjoy this ppt
AI UNIT 3 - SRCAS JOC.pptx enjoy this ppt
 
Data Analytics using R with Yelp Dataset
Data Analytics using R with Yelp DatasetData Analytics using R with Yelp Dataset
Data Analytics using R with Yelp Dataset
 
D3 dhanalakshmi
D3 dhanalakshmiD3 dhanalakshmi
D3 dhanalakshmi
 
STUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGES
STUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGESSTUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGES
STUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGES
 
INFORMATION RETRIEVAL FROM TEXT
INFORMATION RETRIEVAL FROM TEXTINFORMATION RETRIEVAL FROM TEXT
INFORMATION RETRIEVAL FROM TEXT
 
P-6
P-6P-6
P-6
 
P-6
P-6P-6
P-6
 
Top 10 Must-Know NLP Techniques for Data Scientists
Top 10 Must-Know NLP Techniques for Data ScientistsTop 10 Must-Know NLP Techniques for Data Scientists
Top 10 Must-Know NLP Techniques for Data Scientists
 
HINDI NAMED ENTITY RECOGNITION BY AGGREGATING RULE BASED HEURISTICS AND HIDDE...
HINDI NAMED ENTITY RECOGNITION BY AGGREGATING RULE BASED HEURISTICS AND HIDDE...HINDI NAMED ENTITY RECOGNITION BY AGGREGATING RULE BASED HEURISTICS AND HIDDE...
HINDI NAMED ENTITY RECOGNITION BY AGGREGATING RULE BASED HEURISTICS AND HIDDE...
 
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
 
Pos Tagging for Classical Tamil Texts
Pos Tagging for Classical Tamil TextsPos Tagging for Classical Tamil Texts
Pos Tagging for Classical Tamil Texts
 
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUECOMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
 
Natural Language Processing using Text Mining
Natural Language Processing using Text MiningNatural Language Processing using Text Mining
Natural Language Processing using Text Mining
 
PART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHOD
PART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHODPART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHOD
PART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHOD
 
Candidate Link Generation Using Semantic Phermone Swarm
Candidate Link Generation Using Semantic Phermone SwarmCandidate Link Generation Using Semantic Phermone Swarm
Candidate Link Generation Using Semantic Phermone Swarm
 

International Journal on Natural Language Computing (IJNLC) Vol. 2, No.4, August 2013
DOI : 10.5121/ijnlc.2013.2406

HANDLING UNKNOWN WORDS IN NAMED ENTITY RECOGNITION USING TRANSLITERATION

Deepti Chopra1, Sudha Morwal2 and Dr. G.N. Purohit3
Department of Computer Engineering, Banasthali Vidyapith, Rajasthan, INDIA
deeptichopra11@yahoo.co.in
sudha_morwal@yahoo.co.in

ABSTRACT
Transliteration may be defined as the alteration of text from one language to another. In this paper, we discuss how transliteration is useful in handling unknown words in Named Entity Recognition (NER). We also show some of our results on handling unknown words in NER using transliteration.

KEYWORDS
Transliteration, NER, Unknown words, Performance Metrics

1. INTRODUCTION
Transliteration may be defined as converting text word by word or letter by letter from one language to another. It maps the alphabets or phonetic sounds of text written in one language to the alphabets of another language. Applications of transliteration include Named Entity Recognition, Machine Translation, Information Retrieval and Information Extraction. Named Entity Recognition (NER) is a subtask of Information Extraction in which named entities or proper nouns are located in a large text and classified into various categories. Transliteration plays a crucial role in resolving the problem of unknown words in Named Entity Recognition. If a given word has been trained using example-based learning in one language, then this word can also be transliterated into other languages.
e.g.
नुसरत/PER सेब खाती है
दीपिका/PER कानपुर/CITY रहती है
दीपा/PER नागपुर/CITY रहती है
Above, the Named Entities in Hindi are transliterated into English as: nusrat, deepika, Kanpur, deepa and nagpur.

2. RELATED WORK
Named Entity Recognition (NER) is the process in which Named Entities or proper nouns are identified and then categorized into different Named Entity classes, e.g. Name of Person, Sport, River, Country, State, City, Organization etc. Unknown words in Named Entity Recognition can be handled in many ways.
In one of the approaches, an unknown word is allotted a specific tag, e.g. an unk tag, in order to handle unknown words in NER [1][2]. Capitalization information cannot be used to identify the POS tag, since capitalization can exist in unknown words also. Moreover, capitalization information is useful only for identifying Named Entities in English [3][4].

3. OUR APPROACH
In our approach, we treat the ‘OTHER’ tag as the ‘Not a Named Entity’ tag. Figure 1 depicts our approach during training in NER. We process each token of the training data and check its tag. If the token is not tagged ‘OTHER’, we transliterate the token into different languages and save these transliterated Named Entities along with their tags in a file. If the token is tagged ‘OTHER’, we simply move on to the next token in the sentence. This procedure continues until all the training sentences have been processed.

[Flowchart boxes: START; TRAINING DATA (PROCESS EACH TOKEN); IF NOT OTHER TAG? (YES/NO); TRANSLITERATE TOKEN AND ATTACH ITS TAG; PROCESS NEXT TOKEN; IF END OF SENTENCE REACHED? (YES/NO); EXIT]
Figure 1 Steps taken while performing training in Named Entity Recognition

If we have the sound mapping of one language into another language, then we can easily perform the transliteration task. e.g. If ‘Ram/PER plays/OTHER cricket/OTHER’ is a training sentence, then the transliteration task will transliterate Ram into other languages, since Ram is a Named Entity tagged ‘PER’ and not ‘OTHER’. ‘plays’ and ‘cricket’ are not Named Entities, since they are tagged ‘OTHER’, so transliteration is not performed on them. Our approach is therefore language independent: it is capable of recognizing Named Entities in text written in any natural language, provided that the Named Entity Recognition based system has been trained in one of the languages and that sound mappings exist through which the transliterated Named Entities in the other languages can be obtained.

Figure 2 displays the steps taken while performing testing in a Named Entity Recognition based system. In this approach, all the tokens in a testing sentence are processed one at a time. If the current token is not an unknown entity, then a tag is assigned to it using the Viterbi algorithm, if the Hidden Markov Model approach is used to perform Named Entity Recognition. If the current token is an unknown entity, then the transliteration file generated during the training process is searched. If the unknown entity is found in the transliteration file, then the corresponding Named Entity tag is attached to it; otherwise the ‘OTHER’ tag is allotted to it. This procedure continues until all the testing sentences have been processed.

[Flowchart boxes: TESTING SENTENCE (PROCESS EACH TOKEN); IF AN UNKNOWN ENTITY? (YES/NO); ATTACH TAG TO TOKEN; SEARCH TRANSLITERATION FILE; IF ENTITY EXISTS? (YES/NO); ATTACH NAMED ENTITY TAG; ATTACH ‘OTHER’ TAG TO TOKEN; IF END OF SENTENCE? (YES/NO); EXIT]
Figure 2 Steps taken while performing testing in Named Entity Recognition

Consider a testing sentence: राम क्रिकेट खेलता है
If the training sentence is: Ram/PER plays/OTHER cricket/SPORT, then during training the Named Entities, i.e. Ram and cricket, are transliterated and the transliterated Named Entities are stored in a transliteration file along with their tags. So, during testing, राम and क्रिकेट are given the PER and SPORT tags respectively. The tokens खेलता and है are unknown entities, since they were neither seen during the training process nor exist in the transliteration file.

3. RESULT
In Named Entity Recognition, handling of unknown words is a crucial issue. We have developed code that generates a list of the Named Entities used in the training file. These Named Entities are then transliterated into different languages and stored in the respective transliterated files of the different languages. TABLE 1 displays the transliteration of Named Entities in different languages obtained using our transliteration code.

TABLE 1 Transliteration of Named Entities in different languages
ENGLISH | HINDI | FRENCH | URDU | TELUGU | BENGALI
Ram | राम | Bélier | رام | | গল
Cricket | | Cricket | کرکٹ | | ক্রিকেট খেলা
India | भारत | Inde | بھارت | | ভারত
Ganga | गंगा | Ganga | گنگا | | গঙ্গা

For a given testing sentence, we first check whether all the tokens exist in the already created word list. If training has been performed for all the words, then we can generate the optimal state sequence using the Viterbi algorithm. If one of the words is unknown, i.e. it does not exist in the training file, then we check the transliterated file; if the word exists there, the correct tag can be allotted to it and hence the problem of the unknown word is resolved. If we have performed training in NER in English and a word in Hindi is given during testing, then this word would be treated as an unknown word for which training has not yet been performed. So, we can generate the transliterated list of Named Entities from English to other languages and handle the problem of unknown words in NER. TABLE 2 displays our results obtained while performing NER on a multilingual corpus.
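The training and testing steps described in Figure 1 and Figure 2 can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the authors' code: the names `build_transliteration_table`, `tag_sentence`, `toy_transliterate` and `TOY_MAPPINGS` are hypothetical, the transliteration is a hard-coded toy lookup standing in for real sound mappings, and a plain dictionary lookup stands in for the Viterbi/HMM tagger used for known tokens.

```python
from typing import Callable, Dict, List, Optional, Tuple

TaggedSentence = List[Tuple[str, str]]

# Hypothetical toy "sound mapping": target language -> {source token: transliteration}.
TOY_MAPPINGS: Dict[str, Dict[str, str]] = {
    "hi": {"Ram": "राम", "cricket": "क्रिकेट"},
}


def toy_transliterate(token: str, lang: str) -> Optional[str]:
    """Return the transliteration of token in lang, or None if no mapping is known."""
    return TOY_MAPPINGS.get(lang, {}).get(token)


def build_transliteration_table(
    training: List[TaggedSentence],
    languages: List[str],
    transliterate: Callable[[str, str], Optional[str]],
) -> Dict[str, str]:
    """Training step: transliterate every token not tagged 'OTHER' and keep its tag."""
    table: Dict[str, str] = {}
    for sentence in training:
        for token, tag in sentence:
            if tag == "OTHER":
                continue  # not a Named Entity, move on to the next token
            for lang in languages:
                form = transliterate(token, lang)
                if form is not None:
                    table[form] = tag
    return table


def tag_sentence(
    tokens: List[str],
    known_tags: Dict[str, str],
    table: Dict[str, str],
) -> TaggedSentence:
    """Testing step: known tokens use the trained tagger (here a plain lookup,
    standing in for Viterbi decoding); unknown tokens are searched in the
    transliteration table and fall back to 'OTHER'."""
    tagged: TaggedSentence = []
    for token in tokens:
        if token in known_tags:
            tagged.append((token, known_tags[token]))
        else:
            tagged.append((token, table.get(token, "OTHER")))
    return tagged


if __name__ == "__main__":
    training = [[("Ram", "PER"), ("plays", "OTHER"), ("cricket", "SPORT")]]
    known = {tok: tag for sent in training for tok, tag in sent}
    table = build_transliteration_table(training, ["hi"], toy_transliterate)
    print(tag_sentence("राम क्रिकेट खेलता है".split(), known, table))
```

In this sketch the transliteration table plays the role of the transliteration file: it is filled once during training and only consulted for tokens that the trained tagger has never seen.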
The training and testing files include sentences from the following languages: English, Hindi, Marathi, Punjabi and Urdu. We have performed training on 142 tokens and testing on 212 tokens. The Recall, Precision and F-Measure obtained are 95.80%, 96.3% and 96.04% respectively.

TABLE 2 Unknown words handling in NER having 28 sentences in Training file
Tag set | {Person, City, Sport}
Number of Training sentences | 28 (142 tokens)
Number of Testing sentences | 41 (212 tokens)
Number of sentences tagged wrongly | 0

TAG SET | PER | City | Sport | Not-a-name
According to human frequency of tags (Correct Answer) | 42 | 6 | 1 | 163
Observed (System provided) frequency of tags (Experimental Results) | 41 | 3 | 1 | 167

Recall (R) | R = Correct/(Correct+Incorrect+Spurious) = 209/(209+8+1) = 95.80%
Precision (P) | P = Correct/(Correct+Incorrect+Missing) = 209/(209+8+0) = 96.3%
F-measure | F-Measure = (2*P*R)/(P+R) = (2*96.3*95.8)/(96.3+95.8) = 96.04%

TABLE 3 displays the list of Named Entity tags on which transliteration is performed in different languages. The transliterated Named Entities are stored in a file along with their tags, which can then be used to resolve the problem of unknown entities in Named Entity Recognition.

TABLE 3 Named Entities transliterated into different languages
SNO | TAGS
1 | PER (Name of Person)
2 | OZ (Name of Organization)
3 | CN (Name of Country)
4 | MAG (Name of Magazine)
5 | WEEK (Name of Week)
6 | LOC (Name of Location)
7 | PC (Name of Personal Computer)
8 | MONTH (Name of Month)
9 | CITY (Name of City)
10 | ST (Name of State)
11 | SPORT (Name of Sport)
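The Recall, Precision and F-Measure arithmetic shown in TABLE 2 can be re-computed with a few lines of code. The sketch below is only an illustration: the function name `ner_scores` is hypothetical, and the formulas follow the definitions exactly as stated in this paper (Missing in the Precision denominator, Spurious in the Recall denominator); some MUC-style formulations pair these the other way round.

```python
from typing import Tuple


def ner_scores(correct: int, incorrect: int, missing: int, spurious: int) -> Tuple[float, float, float]:
    """Return (precision, recall, f_measure) as fractions in [0, 1],
    using the definitions given in this paper."""
    precision = correct / (correct + incorrect + missing)
    recall = correct / (correct + incorrect + spurious)
    f_measure = (2 * precision * recall) / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Counts taken from the TABLE 2 calculations:
    # Correct = 209, Incorrect = 8, Missing = 0, Spurious = 1.
    p, r, f = ner_scores(correct=209, incorrect=8, missing=0, spurious=1)
    print(f"Precision = {p:.2%}, Recall = {r:.2%}, F-Measure = {f:.2%}")
```

Evaluated without intermediate rounding, these counts give a Precision of about 96.3%, a Recall of about 95.9% and an F-Measure of about 96.1%. The small differences from the reported 95.80% and 96.04% appear to come from rounding, since the F-Measure in TABLE 2 is computed from the already rounded P and R values.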
4. PERFORMANCE METRICS
Performance Metrics are measures used to estimate the performance of a NER based system. They can be calculated in terms of 3 parameters: Precision, Recall and F-Measure. The output of a NER based system is termed the “response” and the interpretation of a human the “answer key” [5]. Consider the following terms:
1. Correct - If the response is the same as the answer key.
2. Incorrect - If the response is not the same as the answer key.
3. Missing - If the answer key is found to be tagged but the response is not tagged.
4. Spurious - If the response is found to be tagged but the answer key is not tagged. [6]
Hence, we define Precision, Recall and F-Measure as follows: [7][8][9]
Precision (P): Correct / (Correct + Incorrect + Missing)
Recall (R): Correct / (Correct + Incorrect + Spurious)
F-Measure: (2 * P * R) / (P + R)

5. CONCLUSION
In this paper, we have discussed transliteration. We have given our results obtained while performing Named Entity Recognition on a multilingual corpus and handling unknown words using the transliteration approach. We have also discussed Performance Metrics, which are an important measure for judging the performance of a Named Entity Recognition based system.

ACKNOWLEDGEMENT
I would like to thank all those who helped me in accomplishing this task.

REFERENCES
[1] http://www.cse.iitb.ac.in/~cs626-460-2012/seminar_ppts/ner.pdf
[2] http://staff.um.edu.mt/cabe2/publications/nfHmms.pdf
[3] http://acl.ldc.upenn.edu/acl2002/MAIN/pdfs/Main036.pdf
[4] http://www.sigkdd.org/explorations/issues/7-1-2005-06/4-Fu.pdf
[5] Shilpi Srivastava, Mukund Sanglikar & D.C Kothari. “Named Entity Recognition System for Hindi Language: A Hybrid Approach”. International Journal of Computational Linguistics (IJCL), Volume (2) : Issue (1) : 2011. Available at http://cscjournals.org/csc/manuscript/Journals/IJCL/volume2/Issue1/IJCL-19.pdf
[6] B. Sasidhar, P. M. Yohan, Dr. A. Vinaya Babu, Dr. A. Govardhan. “A Survey on Named Entity Recognition in Indian Languages with particular reference to Telugu”. IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 2, March 2011.
[7] Asif Ekbal, Rejwanul Haque, Amitava Das, Venkateswarlu Poka and Sivaji Bandyopadhyay. “Language Independent Named Entity Recognition in Indian Languages”. In Proceedings of the IJCNLP-08 Workshop on NER for South and South East Asian Languages, pages 33–40, Hyderabad, India, January 2008. Available at: http://www.mt-archive.info/IJCNLP-2008-Ekbal.pdf
[8] Darvinder Kaur, Vishal Gupta. “A survey of Named Entity Recognition in English and other Indian Languages”. IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 6, November 2010.
[9] G.V.S. Raju, B. Srinivasu, Dr. S. Viswanadha Raju, K.S.M.V. Kumar. “Named Entity Recognition for Telugu Using Maximum Entropy Model”
About Authors

Deepti Chopra is working as Assistant Professor in the Department of Computer Science at Banasthali University (Rajasthan), India. She received her B.Tech degree in Computer Science and Engineering from Rajasthan College of Engineering for Women, Jaipur, Rajasthan in 2011. She completed her M.Tech in Computer Science and Engineering at Banasthali University, Rajasthan in 2013. Her research interests include Artificial Intelligence, Natural Language Processing, and Information Retrieval. She has published many papers in international journals and conferences.

Sudha Morwal is an active researcher in the field of Natural Language Processing. She is currently working as Associate Professor in the Department of Computer Science at Banasthali University (Rajasthan), India. She has done M.Tech (Computer Science), NET and M.Sc (Computer Science), and her PhD is in progress at Banasthali University (Rajasthan), India. She has published many papers in international conferences and journals.

Dr. G. N. Purohit is a Professor in the Department of Mathematics & Statistics at Banasthali University (Rajasthan). Before joining Banasthali University, he was Professor and Head of the Department of Mathematics, University of Rajasthan, Jaipur. He has been Chief-editor of a research journal and a regular reviewer for many journals. His present interest is in O.R., Discrete Mathematics and Communication networks. He has published around 40 research papers in various journals