SlideShare una empresa de Scribd logo
1 de 11
Descargar para leer sin conexión
International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 4, No. 5, September 2013
DOI : 10.5121/ijaia.2013.4506 77
AN UNSUPERVISED APPROACH TO
DEVELOP IR SYSTEM: THE CASE OF URDU
Mohd. Shahid Husain
Integral University, Lucknow
ABSTRACT
Web Search Engines are best gifts to the mankind by Information and Communication Technologies.
Without the search engines it would have been almost impossible to make the efficient access of the
information available on the web today. They play a very vital role in the accessibility and usability of the
internet based information systems. As the internet users are increasing day by day so is the amount of
information being available on web increasing. But the access of information is not uniform across all the
language communities. Besides English and European languages that constitutes to the 60% of the
information available on the web, there is still a wide range of the information available on the internet in
different languages too. In the past few years the amount of information available in Indian Languages
has also increased. Besides English and few European Languages, there are no tools and techniques
available for the efficient retrieval of this information available on the internet. Especially in the case of
the Indian Languages the research is still in the preliminary steps. There are no sufficient amount of tools
and techniques available for the efficient retrieval of the information for Indian Languages.
As we know that Indian Languages are very resource poor languages in terms of IR test data collection.
So my main focus was mainly on developing the data set for URDU IR, training and testing data for
Stemmer.
We have developed a language independent system to facilitate efficient retrieval of information available
in Urdu language which can be used for other languages as well. The system gives precision of 0.63 and
the recall of the system is 0.8. For this Firstly I have developed an Unsupervised Stemmer for URDU
Language [1] as it is very important in the Information Retrieval.
Keywords: Information Retrieval, Vector Space Model, Stemming, Urdu IR
1. INTRODUCTION
The rapid growth of electronic data has attracted the attention in the research and industry
communities for efficient methods for indexing, analysis and retrieval of information from these
large number of data repositories having wide range of data for a vast domain of applications.
In this era of Information technology, more and more data is now being made available on
online data repositories. Almost every information one need is now available on internet.
English and European Languages basically dominated the web since its inception. However,
now the web is getting multi-lingual. Especially, during last few years, a wide range of
information in Indian regional languages like Hindi, Urdu, Bengali, Oriya, Tamil and Telugu
has been made available on web in the form of e-data. But the access to these data repositories
is very low because the efficient search engines/retrieval systems supporting these languages are
very limited. Hence automatic information processing and retrieval is become an urgent
requirement. Moreover, since India is a country having a wide range of regional languages, in
the Indian context, the IR approach should be such that it can handle multilingual document
collections.
International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 4, No. 5, September 2013
78
A number of information retrieval systems are available to support English and some other
European languages. Work involving development of IR systems for Indian languages is only of
recent interests. Development of such systems is constraint by the lack of the availability of
linguistic resources and tools in these languages. The reported works in this direction for Indian
languages were focused on Hindi, Tamil, Bengali, Marathi and Oriya. But there is no reported
work is done for Urdu language.
There is no sufficient amount of resources available to retrieve information effectively available
on internet in Indian Languages. So there is a need of some efficient tools and techniques to
represent, express, store and retrieve the information available in different languages.
The present work focuses on development of an efficient Information Retrieval system for Urdu
Language.
2.INFORMATION RETRIEVAL
Information Retrieval is the sub domain of text mining and natural language processing. This is
the science in which the software system retrieves the relevant documents or the information in
response to the user query need. The Information Retrieval system match the given user quires
with the data corpus available and rank the documents on the basis of the relevance with the
user need. Then the IR system returns the top ranked documents containing relevant information
to the user query.
IR systems may be monolingual, bi-lingual or multilingual. The main objective of this thesis
work is the development of the mono-lingual information retrieval system for Urdu language.
To retrieve the relevant information on the basis of user query
Fig. 1: typical IR system
• The IR system breaks the query statement and the data corpus in a standard format.
• The query is then matched with the documents presented in the corpus and ranked on
the basis of the relevance with the query.
• Top ranked documents are then retrieved.
International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 4, No. 5, September 2013
79
There are various approaches for converting the query statement and the corpus data in a
standard format like stemming, morphological analysis, Stop word removal, indexing etc.
Similarly there are various techniques or methods for query matching like cosine similarity,
Euclidean distance etc.
The efficiency of any Information Retrieval system depends on the term weighting schemes,
strategies used for indexing the documents and the retrieval model used to develop the IR
system.
2.1 Stemmer
Stemming is the backbone process of any IR system. Stemmers are used for getting base or root
form (i.e. stems) from inflected (or sometimes derived) words. Unlike morphological analyzer,
where the root words have some lexical meaning, it’s not necessary with the case of a stemmer.
A stemmer is used to remove the inflected part of the words to get their root form. Stemming is
used to reduce the overhead of indexing and to improve the performance of an IR system. More
specifically, stemmer increases recall of the search engine, whereas Precision decreases.
However sometimes precision may increases depending upon the information need of the users.
Stemming is the basic process of any query system, because a user who needs some information
on plays may also be interested in documents that contain the word play (without the s).
2.2 Term Frequency
This is a local parameter which indicates the frequency or the count of a term within a
document. This parameter gives the relevance of a document with a user query term on the basis
of how many times that term occurs in that particular document.
Mathematically it can be given as: tfij=nij
Where nij is the frequency or the number of occurrence of term ti in the document dj.
2.3 Document Frequency
This is a global parameter and attempts to include distribution of term across the documents.
This parameter gives the importance of the term across the document corpus. The number of the
documents in the corpus containing the considered term t is called the document frequency. To
normalize, it is divided by the total number of the documents in the corpus.
Mathematically it can be given as: dfi=ni/n
Where ni is the number of documents that contains term ti and the total number of the
documents in the corpus is n.
idf is the inverse of this document frequency.
2.4 The third factor which may affect the weighting function is the length of the
document.
Hence the term weighting function can be represented by a triplet ABC
here A- tf component
B- idf component
C- Length normalizing component
The factor Term frequency within a document i.e. A may have following options:
International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 4, No. 5, September 2013
80
Table 1: different options for considering term frequency
n tf = tfij (Raw term frequency)
b tf = 0 or 1 (binary weight)
a






+=
j
ij
Dinmax tf
tf
0.50.5tf
(Augmented term frequency)
l tf = ln(tfij) + 1.0 Logarithmic term frequency
The options for the factor inverse document frequency i.e. B is:
Table 2: different options for considering inverse document frequency
N Wt=tf No conversion i.e. idf is not taken
T Wt=tf*idf Idf is taken into account
The options for the factor document length i.e. C is:
Table 3: different options for considering document length
N Wij=wt No conversion
C Wij=wt/ sqrt(sum of (wts squared)) Normalized weight
2.5 Indexing
To represent the documents in the corpus and the user query statement indexing is done. That is
the process of transforming document text and given query statement to some representation of
it is known as indexing.
There are different index structures which can be used for indexing. The most commonly used
data structure by IR system is inverted index.
Indexing techniques concerned with the selection of good document descriptors, such as
keywords or terms, to describe information content of the documents.
A good descriptor is one that helps in describing the content of the document and in
discriminating the document from other documents in the collection.
The most widely used method is to represent the query and the document as a set of tokens i.e.
index terms or keywords.
For indexing a document, there are different indexing strategies as given below :
International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 4, No. 5, September 2013
81
2.5.1Character Indexing:
In this scheme the tokens used for representing the documents are the characters present in the
document.
2.5.2Word Indexing:
This approach uses words in the document to represent it.
2.5.3N-gram indexing:
This method breaks the words into n-grams, these n-grams are used to index the documents.
2.5.4Compound Word Indexing:
In this method bi-words or tri-words are used for indexing.
2.6 Information retrieval models
An IR model defines the following aspects of retrieval procedure of a search engine:
a. How the documents and user’s queries are represented
b. How system retrieves relevant documents according to users’ queries &
c. How retrieved documents are ranked.
Any typical IR model comprises of the following:
a. A model for documents
b. A model for queries and
c. Matching function which compares queries to documents.
The IR models can be categorized as:
2.6.1 Classical models of IR:
This is the simplest IR model. It is based on the well recognized and easy to understood
knowledge of mathematics.
Classical models are easy to implement and are very efficient.
The three classical information retrieval models are:
-Boolean
-Vector and
-Probabilistic models
2.6.2 Non-Classical models of IR:
Non-classical information retrieval models are based on principles like information logic model,
situation theory model and interaction model. They are not based on concepts like similarity,
probability, Boolean operations etc. on which classical retrieval models are based on.
2.6.3 Alternative models of IR:
Alternative models are advanced classical IR models. These models make use of specific
techniques from other fields like Cluster model, fuzzy model and latent semantic indexing (LSI)
models.
2.6.4Boolean Retrieval model:
This is the simplest retrieval model which retrieves the information on the basis of the query
given in Boolean expression. Boolean queries are queries that uses And, OR and Not Boolean
operations to join the query terms.
International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 4, No. 5, September 2013
82
The one drawback of Boolean information retrieval model is that it requires Boolean query
instead of free text. The second drawback is that this model cannot rank the documents on the
basis of relevance with the user query. It just gives the document if it contains the query word,
regardless the term count in the document or the actual importance of that query word in the
document.
2.6.5 Vector Space model:
This model represents documents and queries as vectors of features representing terms. Features
are assigned some numerical value that is usually some function of frequency of terms.
In this model, each document d is viewed as a vector of tf×idf values, one component for each
term
So we have a vector space where
a. Terms are axes
b. documents live in this space
Ranking algorithm compute similarity between document and query vectors to yield a retrieval
score to each document. The Postulate is: Documents related to the same information are close
together in the vector space.
2.6.6 Probabilistic retrieval model:
In this model, initialy some set of documents is retrieved by using vectorial model or boolean
model. The user inspects these documents looking for the relevant ones and gives his feed back.
IR system uses this feedback information to refine the search criteria.
This process is repeated, untill user gets the desired information in response to his needs.
2.7.Similarity Measures
To retrieve the most relevant documents with the user information need, the IR system matches
the documents available in the corpus with the given user query. To perform this process
different similarity measures are used. For example Euclidean distance, cosine similarity.
2.7.1 Cosine Similarity
We regard the query as short document. The documents present in the corpus and the query are
represented by the vectors in the vector space with features as axes.
The IR system rank the documents by the closeness of document vectors to the query vectors.
IR system then retrieve the top ranked documents to the user.
Fig. 2: A VSM model representing 3 documents and a query
International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 4, No. 5, September 2013
83
The above diagram shows a vector space model where axes ti and tj are the terms used for
indexing.
The cosine similarity between the document dj and the query vector qk is given as:
),(
),(
1
2
1
2
1
∑∑
∑
==
=
×
×
==
m
i
ij
m
i
ik
ik
m
i
ij
kj
kj
kj
ww
ww
qd
qd
qdsim
2.8.Metrics for IR Evaluation
The aim of any Information Retrieval system is to search document in responce to a user query
relavant to his information need. The performance of IR systems is evaluated on the basis of
how relavent documents it retrieve.
Relevance depends upon a specific user’s judgment. It is subjective in nature. The true
relevance of the retrieved document can be judged by the user only, on the basis of his
information need. For same query statement, the desired information need may differ from user
to user.
Traditionally the evaluation of IR systems has been done on a set of queries and test document
collections. For each test query a set of ranked relavant documents is created manually then the
system result is cross checked by it.
There are many retrieval models/ algorithms/ systems. Different performance metrics are used
to assess how effeciently an IR system retrieve the documents in responce to a users information
need.
Different Criteria's for evaluation of an IR system are:
a. Coverage of the collection
b. Time lag
c. Presentation format
d. User effort
e. Precision
f. Recall
Effectiveness is the performance measure of any IR system which describes, how much the IR
system satisfy a user’s information need by retrieving relevant documents.
Aspects of effectiveness include:
a. Whether the retrieved documents are pertinent to the information need of the user.
b. Whether the retrieved documents are ranked according to the relevance with the user query.
c. Whether the IR system returns a reasonable number of relevant documents present in the
corpus to the user etc.
Fig. 3: trade-off between precision and recall
International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 4, No. 5, September 2013
84
3 OUR APPROACH
In this work, to develop an Information Retrieval system for Urdu language, the following
methods and evaluation parameters are used.
Fig. 4: Architecture of monolingual IR system
3.1 Stemmer:
For Developing Stemmer we have used an unsupervised approach [1] which gives accuracy of
84.2.
3.2 Term Weighting scheme:
For term weighting we have used the tf*idf
3.3 Indexing Scheme:
In this work the query statement and the documents are represented using the word indexing
strategy.
3.4 Retrieval Model:
To implement our IR system we have used the vector space model.
3.5 Encoding Scheme:
As the system focuses on Urdu language, to access the data UTF8 character encoding is used.
3.6 Similarity Measure:
For getting documents which are more closely related to the query i.e. to measure the similarity
between different documents in the corpus and the query statement, the cosine similarity
measure is used.
3.7 Ranking of the document:
For ranking of the retrieved documents in order to their relevance with the query, cosine
similarity values are used. The document having higher cosine value (min angular distance)
with the query will be more similar i.e. contains the query terms more frequently and hence
these documents will be considered more relevant to the user query.
International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 4, No. 5, September 2013
85
4 EXPERIMENT
The data set used in this thesis for the training and testing of the developed Urdu IR system is
taken from Emilie corpus. In this corpus documents are in xml format.
The data set taken from EMILLE corpus a tagged data set consist of documents having
information related to health issues, road safety issues, education issues, legal social issues,
social issues, housing issues etc.
The testing data set consist of documents from various domains such as:
Table 4: dataset specification used for Urdu IR
Domain Number of Documents Number of words
Health 33 223412
Education 8 115264
Housing 8 120327
Legal 8 108055
Social issues 12 146083
Homeopathy 32 527360
Drama 13 135680
Myths 10 202880
Story and Novel 21 300160
Media 15 224000
Science 47 704000
History 33 502400
Politics 21 728320
Psychology 27 555520
Religion 34 556800
Sociology 21 398080
Miscellaneous 48 985374
A Query set consist of 200 queries is prepared manually for training and testing of the
IR system.
Table 5: Sample of Query set
International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 4, No. 5, September 2013
86
5 RESULTS AND DISCUSSIONS
For testing purpose of the developed Information Retrieval system, a test collection of 350
documents have been used. A set of 200 queries was constructed on these 350 documents. This
query set is used to evaluate the developed Urdu IR.
Table 5: results of the developed Urdu IR system testing
As shown in the above table, the system has value of 0.13 as the minimum average precision
and maximum average precision value of the system is 0.63.
Similarly the minimum average recall value for the system is 0.5 and maximum average recall
value was found out to be 0.8.
6 CONCLUSION AND FUTURE WORK
In this paper we have discussed various indexing schemes and IR models. We have used tf*idf
scheme for indexing and to implement the IR system VSM (Vector Space Model) is used.
The experimental result shows that the average recall of the developed IR system is 0.8 with 0.3
precision.
IR is one of the hottest research fields. One can do a lot new research to provide efficient IR
system which can satisfy the user’s information needs. A lot of research is needed to develop
language independent approaches to support IR systems for multilingual data collections.
REFERENCES
[1] Mohd Shahid Husain et. al. “A language Independent Approach to develop Urdu stemmer”.
Proceedings of the second International Conference on Advances in Computing and Information
Technology. 2012.
[2] Rizvi, J et. al. “Modeling case marking system of Urdu-Hindi languages by using semantic
information”. Proceedings of the IEEE International Conference on Natural Language Processing
and Knowledge Engineering (IEEE NLP-KE '05). 2005.
[3] Butt, M. King, T. “Non-Nominative Subjects in Urdu: A Computational Analysis”. Proceedings of
the International Symposium on Non-nominative Subjects, Tokyo, December, pp. 525-548, 2001.
[4] Chen, A. Gey, F. “Building and Arabic Stemmer for Information Retrieval”. Proceedings of the
Text Retrieval Conference, 47, 2002.
[5] R. Wicentowski. "Multilingual Noise-Robust Supervised Morphological Analysis using the Word
Frame Model." In Proceedings of Seventh Meeting of the ACL Special Interest Group on
Computational Phonology (SIGPHON), pp. 70-77, 2004.
[6] Rizvi, Hussain M. “Analysis, Design and Implementation of Urdu Morphological Analyzer”.
SCONEST, 1-7, 2005.
[7] Krovetz, R. “View Morphology as an Inference Process”. In the Proceedings of 5th International
Conference on Research and Development in Information Retrieval, 1993.
[8] Thabet, N. “Stemming the Qur’an”. In the Proceedings of the Workshop on Computational
Approaches to Arabic Script-based Languages, 2004.
[9] Paik, Pauri. “A Simple Stemmer for Inflectional Languages”. FIRE 2008.
[10] Sharifloo, A.A., Shamsfard M. “A Bottom up Approach to Persian Stemming”. IJCNLP, 2008
[11] Kumar, A. and Siddiqui, T. “An Unsupervised Hindi Stemmer with Heuristics Improvements”. In
Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data, 2008.
International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 4, No. 5, September 2013
87
[12] Kumar, M. S. and Murthy, K. N. “Corpus Based Statistical Approach for Stemming Telugu”.
Creation of Lexical Resources for Indian Language Computing and Processing (LRIL), C-DAC,
Mumbai, India, 2007.
[13] Qurat-ul-Ain Akram, Asma Naseer, Sarmad Hussain. “Assas-Band, an Affix-Exception-List Based
Urdu Stemmer”. Proceedings of ACL-IJCNLP 2009.
[14] http://en.wikipedia.org/wiki/Urdu
[15] http://www.bbc.co.uk/languages/other/guide/urdu/steps.shtml
[16] .http://www.andaman.org/BOOK/reprints/weber/rep-weber.htm
[17] Natural Language processing and Information Retrieval by Tanveer Siddiqui, U S Tiwary.
[18] Information retrieval: data structure and algorithms by William B. Frakes, Ricardo Baeza-Yates.
[19] http://www.crulp.org/software/ling_resources.htm
AUTHOR
Mohd. Shahid Husain
M.Tech. from Indian Institute of Information Technology (IIIT-A), Allahabad
with Intelligent System as specialization. Currently pursuing Ph.D. and working
as assistant professor in the department of Information Technology, Integral
University, Lucknow.

Más contenido relacionado

La actualidad más candente

NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATANAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATAijnlc
 
A Review on the Cross and Multilingual Information Retrieval
A Review on the Cross and Multilingual Information RetrievalA Review on the Cross and Multilingual Information Retrieval
A Review on the Cross and Multilingual Information Retrievaldannyijwest
 
Conceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habibConceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habibEl Habib NFAOUI
 
IRJET - Sign Language Converter
IRJET -  	  Sign Language ConverterIRJET -  	  Sign Language Converter
IRJET - Sign Language ConverterIRJET Journal
 
Mining Opinion Features in Customer Reviews
Mining Opinion Features in Customer ReviewsMining Opinion Features in Customer Reviews
Mining Opinion Features in Customer ReviewsIJCERT JOURNAL
 
Speech Processing and Audio feedback for Tamil Alphabets using Artificial Neu...
Speech Processing and Audio feedback for Tamil Alphabets using Artificial Neu...Speech Processing and Audio feedback for Tamil Alphabets using Artificial Neu...
Speech Processing and Audio feedback for Tamil Alphabets using Artificial Neu...DR.P.S.JAGADEESH KUMAR
 
Top 10 cited articles in nlp
Top 10 cited articles in nlpTop 10 cited articles in nlp
Top 10 cited articles in nlpkevig
 
Accessing database using nlp
Accessing database using nlpAccessing database using nlp
Accessing database using nlpeSAT Journals
 
The Process of Information extraction through Natural Language Processing
The Process of Information extraction through Natural Language ProcessingThe Process of Information extraction through Natural Language Processing
The Process of Information extraction through Natural Language ProcessingWaqas Tariq
 
Ijarcet vol-3-issue-1-9-11
Ijarcet vol-3-issue-1-9-11Ijarcet vol-3-issue-1-9-11
Ijarcet vol-3-issue-1-9-11Dhabal Sethi
 
Modern association rule mining methods
Modern association rule mining methodsModern association rule mining methods
Modern association rule mining methodsijcsity
 
Text mining open source tokenization
Text mining open source tokenizationText mining open source tokenization
Text mining open source tokenizationaciijournal
 
An Investigation of Keywords Extraction from Textual Documents using Word2Ve...
 An Investigation of Keywords Extraction from Textual Documents using Word2Ve... An Investigation of Keywords Extraction from Textual Documents using Word2Ve...
An Investigation of Keywords Extraction from Textual Documents using Word2Ve...IJCSIS Research Publications
 
An analysis on Filter for Spam Mail
An analysis on Filter for Spam MailAn analysis on Filter for Spam Mail
An analysis on Filter for Spam MailAM Publications
 
IRJET - BOT Virtual Guide
IRJET -  	  BOT Virtual GuideIRJET -  	  BOT Virtual Guide
IRJET - BOT Virtual GuideIRJET Journal
 
Dictionary based concept mining an application for turkish
Dictionary based concept mining  an application for turkishDictionary based concept mining  an application for turkish
Dictionary based concept mining an application for turkishcsandit
 
Summer Research Project (Anusaaraka) Report
Summer Research Project (Anusaaraka) ReportSummer Research Project (Anusaaraka) Report
Summer Research Project (Anusaaraka) ReportAnwar Jameel
 

La actualidad más candente (20)

Ijet journal
Ijet journalIjet journal
Ijet journal
 
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATANAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA
 
A Review on the Cross and Multilingual Information Retrieval
A Review on the Cross and Multilingual Information RetrievalA Review on the Cross and Multilingual Information Retrieval
A Review on the Cross and Multilingual Information Retrieval
 
Conceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habibConceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habib
 
IRJET - Sign Language Converter
IRJET -  	  Sign Language ConverterIRJET -  	  Sign Language Converter
IRJET - Sign Language Converter
 
Mining Opinion Features in Customer Reviews
Mining Opinion Features in Customer ReviewsMining Opinion Features in Customer Reviews
Mining Opinion Features in Customer Reviews
 
Accessing database using nlp
Accessing database using nlpAccessing database using nlp
Accessing database using nlp
 
Speech Processing and Audio feedback for Tamil Alphabets using Artificial Neu...
Speech Processing and Audio feedback for Tamil Alphabets using Artificial Neu...Speech Processing and Audio feedback for Tamil Alphabets using Artificial Neu...
Speech Processing and Audio feedback for Tamil Alphabets using Artificial Neu...
 
Top 10 cited articles in nlp
Top 10 cited articles in nlpTop 10 cited articles in nlp
Top 10 cited articles in nlp
 
Accessing database using nlp
Accessing database using nlpAccessing database using nlp
Accessing database using nlp
 
The Process of Information extraction through Natural Language Processing
The Process of Information extraction through Natural Language ProcessingThe Process of Information extraction through Natural Language Processing
The Process of Information extraction through Natural Language Processing
 
Ijarcet vol-3-issue-1-9-11
Ijarcet vol-3-issue-1-9-11Ijarcet vol-3-issue-1-9-11
Ijarcet vol-3-issue-1-9-11
 
Modern association rule mining methods
Modern association rule mining methodsModern association rule mining methods
Modern association rule mining methods
 
Text mining open source tokenization
Text mining open source tokenizationText mining open source tokenization
Text mining open source tokenization
 
An Investigation of Keywords Extraction from Textual Documents using Word2Ve...
 An Investigation of Keywords Extraction from Textual Documents using Word2Ve... An Investigation of Keywords Extraction from Textual Documents using Word2Ve...
An Investigation of Keywords Extraction from Textual Documents using Word2Ve...
 
An analysis on Filter for Spam Mail
An analysis on Filter for Spam MailAn analysis on Filter for Spam Mail
An analysis on Filter for Spam Mail
 
Fq2510361043
Fq2510361043Fq2510361043
Fq2510361043
 
IRJET - BOT Virtual Guide
IRJET -  	  BOT Virtual GuideIRJET -  	  BOT Virtual Guide
IRJET - BOT Virtual Guide
 
Dictionary based concept mining an application for turkish
Dictionary based concept mining  an application for turkishDictionary based concept mining  an application for turkish
Dictionary based concept mining an application for turkish
 
Summer Research Project (Anusaaraka) Report
Summer Research Project (Anusaaraka) ReportSummer Research Project (Anusaaraka) Report
Summer Research Project (Anusaaraka) Report
 

Destacado

Selling a disney ticket with iTicket POS
Selling a disney ticket with iTicket POSSelling a disney ticket with iTicket POS
Selling a disney ticket with iTicket POSiTicket POS
 
8 в на выставке абстрактных художников
8 в на выставке абстрактных художников8 в на выставке абстрактных художников
8 в на выставке абстрактных художниковp0sechka
 
Detection of slang words in e data using semi supervised learning
Detection of slang words in e data using semi supervised learningDetection of slang words in e data using semi supervised learning
Detection of slang words in e data using semi supervised learningijaia
 
Los presidentes constitucionales de 1917 hasta la fecha de hoy
Los presidentes constitucionales de 1917 hasta la fecha de hoyLos presidentes constitucionales de 1917 hasta la fecha de hoy
Los presidentes constitucionales de 1917 hasta la fecha de hoyRaulCZ1
 
DOTNET 2013 IEEE CLOUDCOMPUTING PROJECT Attribute based access to scalable me...
DOTNET 2013 IEEE CLOUDCOMPUTING PROJECT Attribute based access to scalable me...DOTNET 2013 IEEE CLOUDCOMPUTING PROJECT Attribute based access to scalable me...
DOTNET 2013 IEEE CLOUDCOMPUTING PROJECT Attribute based access to scalable me...IEEEGLOBALSOFTTECHNOLOGIES
 
Cross lingual similarity discrimination with translation characteristics
Cross lingual similarity discrimination with translation characteristicsCross lingual similarity discrimination with translation characteristics
Cross lingual similarity discrimination with translation characteristicsijaia
 
The newest stats on State of the Social media 2013. [buzzshaper]
The newest stats on State of the Social media 2013.  [buzzshaper]The newest stats on State of the Social media 2013.  [buzzshaper]
The newest stats on State of the Social media 2013. [buzzshaper]BuzzShaper
 
Capture - Day 1 - 09:00 - "Drawing a Line Under the Measurement of Video Adve...
Capture - Day 1 - 09:00 - "Drawing a Line Under the Measurement of Video Adve...Capture - Day 1 - 09:00 - "Drawing a Line Under the Measurement of Video Adve...
Capture - Day 1 - 09:00 - "Drawing a Line Under the Measurement of Video Adve...PerformanceIN
 
Interactieve workshop - maak veiligheid bespreekbaar in jouw organisatie
Interactieve workshop - maak veiligheid bespreekbaar in jouw organisatieInteractieve workshop - maak veiligheid bespreekbaar in jouw organisatie
Interactieve workshop - maak veiligheid bespreekbaar in jouw organisatiet-groep NV
 
Poste Ingenieur Support Client ip-label
Poste Ingenieur Support Client ip-labelPoste Ingenieur Support Client ip-label
Poste Ingenieur Support Client ip-labelJessica Makaké
 
DOTNET 2013 IEEE CLOUDCOMPUTING PROJECT Attribute based access to scalable me...
DOTNET 2013 IEEE CLOUDCOMPUTING PROJECT Attribute based access to scalable me...DOTNET 2013 IEEE CLOUDCOMPUTING PROJECT Attribute based access to scalable me...
DOTNET 2013 IEEE CLOUDCOMPUTING PROJECT Attribute based access to scalable me...IEEEGLOBALSOFTTECHNOLOGIES
 
Doha project
Doha projectDoha project
Doha projectcescyzl
 
Ch 1- Water and Soil, Stone and Metal
Ch 1- Water and Soil, Stone and MetalCh 1- Water and Soil, Stone and Metal
Ch 1- Water and Soil, Stone and MetalCraig Campbell
 
25 Resources for Classicists 2013
25 Resources for Classicists 201325 Resources for Classicists 2013
25 Resources for Classicists 2013lettylib
 
WP-CLI: WordCamp NYC 2015
WP-CLI: WordCamp NYC 2015WP-CLI: WordCamp NYC 2015
WP-CLI: WordCamp NYC 2015Terell Moore
 
Catálogo BMW X1 2016
Catálogo BMW X1 2016Catálogo BMW X1 2016
Catálogo BMW X1 2016rfarias_10
 
Family medicine exam_manual_2009
Family medicine exam_manual_2009Family medicine exam_manual_2009
Family medicine exam_manual_2009Aziz Sabrina
 

Destacado (20)

Selling a disney ticket with iTicket POS
Selling a disney ticket with iTicket POSSelling a disney ticket with iTicket POS
Selling a disney ticket with iTicket POS
 
8 в на выставке абстрактных художников
8 в на выставке абстрактных художников8 в на выставке абстрактных художников
8 в на выставке абстрактных художников
 
Detection of slang words in e data using semi supervised learning
Detection of slang words in e data using semi supervised learningDetection of slang words in e data using semi supervised learning
Detection of slang words in e data using semi supervised learning
 
Avant octubre 2013 PSPV Xirivella
Avant octubre 2013 PSPV XirivellaAvant octubre 2013 PSPV Xirivella
Avant octubre 2013 PSPV Xirivella
 
Los presidentes constitucionales de 1917 hasta la fecha de hoy
Los presidentes constitucionales de 1917 hasta la fecha de hoyLos presidentes constitucionales de 1917 hasta la fecha de hoy
Los presidentes constitucionales de 1917 hasta la fecha de hoy
 
DOTNET 2013 IEEE CLOUDCOMPUTING PROJECT Attribute based access to scalable me...
DOTNET 2013 IEEE CLOUDCOMPUTING PROJECT Attribute based access to scalable me...DOTNET 2013 IEEE CLOUDCOMPUTING PROJECT Attribute based access to scalable me...
DOTNET 2013 IEEE CLOUDCOMPUTING PROJECT Attribute based access to scalable me...
 
Cross lingual similarity discrimination with translation characteristics
Cross lingual similarity discrimination with translation characteristicsCross lingual similarity discrimination with translation characteristics
Cross lingual similarity discrimination with translation characteristics
 
The newest stats on State of the Social media 2013. [buzzshaper]
The newest stats on State of the Social media 2013.  [buzzshaper]The newest stats on State of the Social media 2013.  [buzzshaper]
The newest stats on State of the Social media 2013. [buzzshaper]
 
Capture - Day 1 - 09:00 - "Drawing a Line Under the Measurement of Video Adve...
Capture - Day 1 - 09:00 - "Drawing a Line Under the Measurement of Video Adve...Capture - Day 1 - 09:00 - "Drawing a Line Under the Measurement of Video Adve...
Capture - Day 1 - 09:00 - "Drawing a Line Under the Measurement of Video Adve...
 
Interactieve workshop - maak veiligheid bespreekbaar in jouw organisatie
Interactieve workshop - maak veiligheid bespreekbaar in jouw organisatieInteractieve workshop - maak veiligheid bespreekbaar in jouw organisatie
Interactieve workshop - maak veiligheid bespreekbaar in jouw organisatie
 
Sala
SalaSala
Sala
 
Poste Ingenieur Support Client ip-label
Poste Ingenieur Support Client ip-labelPoste Ingenieur Support Client ip-label
Poste Ingenieur Support Client ip-label
 
DOTNET 2013 IEEE CLOUDCOMPUTING PROJECT Attribute based access to scalable me...
DOTNET 2013 IEEE CLOUDCOMPUTING PROJECT Attribute based access to scalable me...DOTNET 2013 IEEE CLOUDCOMPUTING PROJECT Attribute based access to scalable me...
DOTNET 2013 IEEE CLOUDCOMPUTING PROJECT Attribute based access to scalable me...
 
Young2
Young2Young2
Young2
 
Doha project
Doha projectDoha project
Doha project
 
Ch 1- Water and Soil, Stone and Metal
Ch 1- Water and Soil, Stone and MetalCh 1- Water and Soil, Stone and Metal
Ch 1- Water and Soil, Stone and Metal
 
25 Resources for Classicists 2013
25 Resources for Classicists 201325 Resources for Classicists 2013
25 Resources for Classicists 2013
 
WP-CLI: WordCamp NYC 2015
WP-CLI: WordCamp NYC 2015WP-CLI: WordCamp NYC 2015
WP-CLI: WordCamp NYC 2015
 
Catálogo BMW X1 2016
Catálogo BMW X1 2016Catálogo BMW X1 2016
Catálogo BMW X1 2016
 
Family medicine exam_manual_2009
Family medicine exam_manual_2009Family medicine exam_manual_2009
Family medicine exam_manual_2009
 

Similar a An unsupervised approach to develop ir system the case of urdu

CANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEM
CANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEMCANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEM
CANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEMIRJET Journal
 
Indian Language Text Representation and Categorization Using Supervised Learn...
Indian Language Text Representation and Categorization Using Supervised Learn...Indian Language Text Representation and Categorization Using Supervised Learn...
Indian Language Text Representation and Categorization Using Supervised Learn...ijbuiiir1
 
Survey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi LanguageSurvey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi LanguageEditor IJCATR
 
Marathi-English CLIR using detailed user query and unsupervised corpus-based WSD
Marathi-English CLIR using detailed user query and unsupervised corpus-based WSDMarathi-English CLIR using detailed user query and unsupervised corpus-based WSD
Marathi-English CLIR using detailed user query and unsupervised corpus-based WSDIJERA Editor
 
QUrdPro: Query processing system for Urdu Language
QUrdPro: Query processing system for Urdu LanguageQUrdPro: Query processing system for Urdu Language
QUrdPro: Query processing system for Urdu LanguageIJERA Editor
 
A novel approach for text extraction using effective pattern matching technique
A novel approach for text extraction using effective pattern matching techniqueA novel approach for text extraction using effective pattern matching technique
A novel approach for text extraction using effective pattern matching techniqueeSAT Journals
 
A Dialogue System for Telugu, a Resource-Poor Language
A Dialogue System for Telugu, a Resource-Poor LanguageA Dialogue System for Telugu, a Resource-Poor Language
A Dialogue System for Telugu, a Resource-Poor LanguageSravanthi Mullapudi
 
Information_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_HabibInformation_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_HabibEl Habib NFAOUI
 
BENGALI INFORMATION RETRIEVAL SYSTEM (BIRS)
BENGALI INFORMATION RETRIEVAL SYSTEM (BIRS)BENGALI INFORMATION RETRIEVAL SYSTEM (BIRS)
BENGALI INFORMATION RETRIEVAL SYSTEM (BIRS)kevig
 
INTELLIGENT QUERY PROCESSING IN MALAYALAM
INTELLIGENT QUERY PROCESSING IN MALAYALAMINTELLIGENT QUERY PROCESSING IN MALAYALAM
INTELLIGENT QUERY PROCESSING IN MALAYALAMijcsa
 
An Unsupervised Approach to Develop Stemmer
An Unsupervised Approach to Develop StemmerAn Unsupervised Approach to Develop Stemmer
An Unsupervised Approach to Develop Stemmerkevig
 
Machine translation for Indian Languages
Machine translation for Indian LanguagesMachine translation for Indian Languages
Machine translation for Indian LanguagesNiharika Arora
 
DICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISH
DICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISHDICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISH
DICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISHcscpconf
 
Car-Following Parameters by Means of Cellular Automata in the Case of Evacuation
Car-Following Parameters by Means of Cellular Automata in the Case of EvacuationCar-Following Parameters by Means of Cellular Automata in the Case of Evacuation
Car-Following Parameters by Means of Cellular Automata in the Case of EvacuationCSCJournals
 
Diacritic Oriented Arabic Information Retrieval System
Diacritic Oriented Arabic Information Retrieval SystemDiacritic Oriented Arabic Information Retrieval System
Diacritic Oriented Arabic Information Retrieval SystemCSCJournals
 
Review and Approaches to Develop Legal Assistance for Lawyers and Legal Profe...
Review and Approaches to Develop Legal Assistance for Lawyers and Legal Profe...Review and Approaches to Develop Legal Assistance for Lawyers and Legal Profe...
Review and Approaches to Develop Legal Assistance for Lawyers and Legal Profe...IRJET Journal
 
Cross language information retrieval in indian
Cross language information retrieval in indianCross language information retrieval in indian
Cross language information retrieval in indianeSAT Publishing House
 
IRJET - A Robust Sign Language and Hand Gesture Recognition System using Conv...
IRJET - A Robust Sign Language and Hand Gesture Recognition System using Conv...IRJET - A Robust Sign Language and Hand Gesture Recognition System using Conv...
IRJET - A Robust Sign Language and Hand Gesture Recognition System using Conv...IRJET Journal
 

Similar a An unsupervised approach to develop ir system the case of urdu (20)

CANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEM
CANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEMCANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEM
CANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEM
 
A SURVEY ON VARIOUS CLIR TECHNIQUES
A SURVEY ON VARIOUS CLIR TECHNIQUESA SURVEY ON VARIOUS CLIR TECHNIQUES
A SURVEY ON VARIOUS CLIR TECHNIQUES
 
Indian Language Text Representation and Categorization Using Supervised Learn...
Indian Language Text Representation and Categorization Using Supervised Learn...Indian Language Text Representation and Categorization Using Supervised Learn...
Indian Language Text Representation and Categorization Using Supervised Learn...
 
Survey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi LanguageSurvey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi Language
 
Marathi-English CLIR using detailed user query and unsupervised corpus-based WSD
Marathi-English CLIR using detailed user query and unsupervised corpus-based WSDMarathi-English CLIR using detailed user query and unsupervised corpus-based WSD
Marathi-English CLIR using detailed user query and unsupervised corpus-based WSD
 
QUrdPro: Query processing system for Urdu Language
QUrdPro: Query processing system for Urdu LanguageQUrdPro: Query processing system for Urdu Language
QUrdPro: Query processing system for Urdu Language
 
A novel approach for text extraction using effective pattern matching technique
A novel approach for text extraction using effective pattern matching techniqueA novel approach for text extraction using effective pattern matching technique
A novel approach for text extraction using effective pattern matching technique
 
A Dialogue System for Telugu, a Resource-Poor Language
A Dialogue System for Telugu, a Resource-Poor LanguageA Dialogue System for Telugu, a Resource-Poor Language
A Dialogue System for Telugu, a Resource-Poor Language
 
Information_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_HabibInformation_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_Habib
 
BENGALI INFORMATION RETRIEVAL SYSTEM (BIRS)
BENGALI INFORMATION RETRIEVAL SYSTEM (BIRS)BENGALI INFORMATION RETRIEVAL SYSTEM (BIRS)
BENGALI INFORMATION RETRIEVAL SYSTEM (BIRS)
 
INTELLIGENT QUERY PROCESSING IN MALAYALAM
INTELLIGENT QUERY PROCESSING IN MALAYALAMINTELLIGENT QUERY PROCESSING IN MALAYALAM
INTELLIGENT QUERY PROCESSING IN MALAYALAM
 
[IJET-V2I3P19] Authors: Priyanka Sharma
[IJET-V2I3P19] Authors: Priyanka Sharma[IJET-V2I3P19] Authors: Priyanka Sharma
[IJET-V2I3P19] Authors: Priyanka Sharma
 
An Unsupervised Approach to Develop Stemmer
An Unsupervised Approach to Develop StemmerAn Unsupervised Approach to Develop Stemmer
An Unsupervised Approach to Develop Stemmer
 
Machine translation for Indian Languages
Machine translation for Indian LanguagesMachine translation for Indian Languages
Machine translation for Indian Languages
 
DICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISH
DICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISHDICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISH
DICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISH
 
Car-Following Parameters by Means of Cellular Automata in the Case of Evacuation
Car-Following Parameters by Means of Cellular Automata in the Case of EvacuationCar-Following Parameters by Means of Cellular Automata in the Case of Evacuation
Car-Following Parameters by Means of Cellular Automata in the Case of Evacuation
 
Diacritic Oriented Arabic Information Retrieval System
Diacritic Oriented Arabic Information Retrieval SystemDiacritic Oriented Arabic Information Retrieval System
Diacritic Oriented Arabic Information Retrieval System
 
Review and Approaches to Develop Legal Assistance for Lawyers and Legal Profe...
Review and Approaches to Develop Legal Assistance for Lawyers and Legal Profe...Review and Approaches to Develop Legal Assistance for Lawyers and Legal Profe...
Review and Approaches to Develop Legal Assistance for Lawyers and Legal Profe...
 
Cross language information retrieval in indian
Cross language information retrieval in indianCross language information retrieval in indian
Cross language information retrieval in indian
 
IRJET - A Robust Sign Language and Hand Gesture Recognition System using Conv...
IRJET - A Robust Sign Language and Hand Gesture Recognition System using Conv...IRJET - A Robust Sign Language and Hand Gesture Recognition System using Conv...
IRJET - A Robust Sign Language and Hand Gesture Recognition System using Conv...
 

Último

Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 

Último (20)

Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 

An unsupervised approach to develop ir system the case of urdu

  • 1. International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 4, No. 5, September 2013 DOI : 10.5121/ijaia.2013.4506 77 AN UNSUPERVISED APPROACH TO DEVELOP IR SYSTEM: THE CASE OF URDU Mohd. Shahid Husain Integral University, Lucknow ABSTRACT Web Search Engines are best gifts to the mankind by Information and Communication Technologies. Without the search engines it would have been almost impossible to make the efficient access of the information available on the web today. They play a very vital role in the accessibility and usability of the internet based information systems. As the internet users are increasing day by day so is the amount of information being available on web increasing. But the access of information is not uniform across all the language communities. Besides English and European languages that constitutes to the 60% of the information available on the web, there is still a wide range of the information available on the internet in different languages too. In the past few years the amount of information available in Indian Languages has also increased. Besides English and few European Languages, there are no tools and techniques available for the efficient retrieval of this information available on the internet. Especially in the case of the Indian Languages the research is still in the preliminary steps. There are no sufficient amount of tools and techniques available for the efficient retrieval of the information for Indian Languages. As we know that Indian Languages are very resource poor languages in terms of IR test data collection. So my main focus was mainly on developing the data set for URDU IR, training and testing data for Stemmer. We have developed a language independent system to facilitate efficient retrieval of information available in Urdu language which can be used for other languages as well. The system gives precision of 0.63 and the recall of the system is 0.8. For this Firstly I have developed an Unsupervised Stemmer for URDU Language [1] as it is very important in the Information Retrieval. Keywords: Information Retrieval, Vector Space Model, Stemming, Urdu IR 1. INTRODUCTION The rapid growth of electronic data has attracted the attention in the research and industry communities for efficient methods for indexing, analysis and retrieval of information from these large number of data repositories having wide range of data for a vast domain of applications. In this era of Information technology, more and more data is now being made available on online data repositories. Almost every information one need is now available on internet. English and European Languages basically dominated the web since its inception. However, now the web is getting multi-lingual. Especially, during last few years, a wide range of information in Indian regional languages like Hindi, Urdu, Bengali, Oriya, Tamil and Telugu has been made available on web in the form of e-data. But the access to these data repositories is very low because the efficient search engines/retrieval systems supporting these languages are very limited. Hence automatic information processing and retrieval is become an urgent requirement. Moreover, since India is a country having a wide range of regional languages, in the Indian context, the IR approach should be such that it can handle multilingual document collections.
  • 2. International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 4, No. 5, September 2013 78 A number of information retrieval systems are available to support English and some other European languages. Work involving development of IR systems for Indian languages is only of recent interests. Development of such systems is constraint by the lack of the availability of linguistic resources and tools in these languages. The reported works in this direction for Indian languages were focused on Hindi, Tamil, Bengali, Marathi and Oriya. But there is no reported work is done for Urdu language. There is no sufficient amount of resources available to retrieve information effectively available on internet in Indian Languages. So there is a need of some efficient tools and techniques to represent, express, store and retrieve the information available in different languages. The present work focuses on development of an efficient Information Retrieval system for Urdu Language. 2.INFORMATION RETRIEVAL Information Retrieval is the sub domain of text mining and natural language processing. This is the science in which the software system retrieves the relevant documents or the information in response to the user query need. The Information Retrieval system match the given user quires with the data corpus available and rank the documents on the basis of the relevance with the user need. Then the IR system returns the top ranked documents containing relevant information to the user query. IR systems may be monolingual, bi-lingual or multilingual. The main objective of this thesis work is the development of the mono-lingual information retrieval system for Urdu language. To retrieve the relevant information on the basis of user query Fig. 1: typical IR system • The IR system breaks the query statement and the data corpus in a standard format. • The query is then matched with the documents presented in the corpus and ranked on the basis of the relevance with the query. • Top ranked documents are then retrieved.
  • 3. International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 4, No. 5, September 2013 79 There are various approaches for converting the query statement and the corpus data in a standard format like stemming, morphological analysis, Stop word removal, indexing etc. Similarly there are various techniques or methods for query matching like cosine similarity, Euclidean distance etc. The efficiency of any Information Retrieval system depends on the term weighting schemes, strategies used for indexing the documents and the retrieval model used to develop the IR system. 2.1 Stemmer Stemming is the backbone process of any IR system. Stemmers are used for getting base or root form (i.e. stems) from inflected (or sometimes derived) words. Unlike morphological analyzer, where the root words have some lexical meaning, it’s not necessary with the case of a stemmer. A stemmer is used to remove the inflected part of the words to get their root form. Stemming is used to reduce the overhead of indexing and to improve the performance of an IR system. More specifically, stemmer increases recall of the search engine, whereas Precision decreases. However sometimes precision may increases depending upon the information need of the users. Stemming is the basic process of any query system, because a user who needs some information on plays may also be interested in documents that contain the word play (without the s). 2.2 Term Frequency This is a local parameter which indicates the frequency or the count of a term within a document. This parameter gives the relevance of a document with a user query term on the basis of how many times that term occurs in that particular document. Mathematically it can be given as: tfij=nij Where nij is the frequency or the number of occurrence of term ti in the document dj. 2.3 Document Frequency This is a global parameter and attempts to include distribution of term across the documents. This parameter gives the importance of the term across the document corpus. The number of the documents in the corpus containing the considered term t is called the document frequency. To normalize, it is divided by the total number of the documents in the corpus. Mathematically it can be given as: dfi=ni/n Where ni is the number of documents that contains term ti and the total number of the documents in the corpus is n. idf is the inverse of this document frequency. 2.4 The third factor which may affect the weighting function is the length of the document. Hence the term weighting function can be represented by a triplet ABC here A- tf component B- idf component C- Length normalizing component The factor Term frequency within a document i.e. A may have following options:
  • 4. International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 4, No. 5, September 2013 80 Table 1: different options for considering term frequency n tf = tfij (Raw term frequency) b tf = 0 or 1 (binary weight) a       += j ij Dinmax tf tf 0.50.5tf (Augmented term frequency) l tf = ln(tfij) + 1.0 Logarithmic term frequency The options for the factor inverse document frequency i.e. B is: Table 2: different options for considering inverse document frequency N Wt=tf No conversion i.e. idf is not taken T Wt=tf*idf Idf is taken into account The options for the factor document length i.e. C is: Table 3: different options for considering document length N Wij=wt No conversion C Wij=wt/ sqrt(sum of (wts squared)) Normalized weight 2.5 Indexing To represent the documents in the corpus and the user query statement indexing is done. That is the process of transforming document text and given query statement to some representation of it is known as indexing. There are different index structures which can be used for indexing. The most commonly used data structure by IR system is inverted index. Indexing techniques concerned with the selection of good document descriptors, such as keywords or terms, to describe information content of the documents. A good descriptor is one that helps in describing the content of the document and in discriminating the document from other documents in the collection. The most widely used method is to represent the query and the document as a set of tokens i.e. index terms or keywords. For indexing a document, there are different indexing strategies as given below :
  • 5. International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 4, No. 5, September 2013 81 2.5.1Character Indexing: In this scheme the tokens used for representing the documents are the characters present in the document. 2.5.2Word Indexing: This approach uses words in the document to represent it. 2.5.3N-gram indexing: This method breaks the words into n-grams, these n-grams are used to index the documents. 2.5.4Compound Word Indexing: In this method bi-words or tri-words are used for indexing. 2.6 Information retrieval models An IR model defines the following aspects of retrieval procedure of a search engine: a. How the documents and user’s queries are represented b. How system retrieves relevant documents according to users’ queries & c. How retrieved documents are ranked. Any typical IR model comprises of the following: a. A model for documents b. A model for queries and c. Matching function which compares queries to documents. The IR models can be categorized as: 2.6.1 Classical models of IR: This is the simplest IR model. It is based on the well recognized and easy to understood knowledge of mathematics. Classical models are easy to implement and are very efficient. The three classical information retrieval models are: -Boolean -Vector and -Probabilistic models 2.6.2 Non-Classical models of IR: Non-classical information retrieval models are based on principles like information logic model, situation theory model and interaction model. They are not based on concepts like similarity, probability, Boolean operations etc. on which classical retrieval models are based on. 2.6.3 Alternative models of IR: Alternative models are advanced classical IR models. These models make use of specific techniques from other fields like Cluster model, fuzzy model and latent semantic indexing (LSI) models. 2.6.4Boolean Retrieval model: This is the simplest retrieval model which retrieves the information on the basis of the query given in Boolean expression. Boolean queries are queries that uses And, OR and Not Boolean operations to join the query terms.
  • 6. International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 4, No. 5, September 2013 82 The one drawback of Boolean information retrieval model is that it requires Boolean query instead of free text. The second drawback is that this model cannot rank the documents on the basis of relevance with the user query. It just gives the document if it contains the query word, regardless the term count in the document or the actual importance of that query word in the document. 2.6.5 Vector Space model: This model represents documents and queries as vectors of features representing terms. Features are assigned some numerical value that is usually some function of frequency of terms. In this model, each document d is viewed as a vector of tf×idf values, one component for each term So we have a vector space where a. Terms are axes b. documents live in this space Ranking algorithm compute similarity between document and query vectors to yield a retrieval score to each document. The Postulate is: Documents related to the same information are close together in the vector space. 2.6.6 Probabilistic retrieval model: In this model, initialy some set of documents is retrieved by using vectorial model or boolean model. The user inspects these documents looking for the relevant ones and gives his feed back. IR system uses this feedback information to refine the search criteria. This process is repeated, untill user gets the desired information in response to his needs. 2.7.Similarity Measures To retrieve the most relevant documents with the user information need, the IR system matches the documents available in the corpus with the given user query. To perform this process different similarity measures are used. For example Euclidean distance, cosine similarity. 2.7.1 Cosine Similarity We regard the query as short document. The documents present in the corpus and the query are represented by the vectors in the vector space with features as axes. The IR system rank the documents by the closeness of document vectors to the query vectors. IR system then retrieve the top ranked documents to the user. Fig. 2: A VSM model representing 3 documents and a query
  • 7. International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 4, No. 5, September 2013 83 The above diagram shows a vector space model where axes ti and tj are the terms used for indexing. The cosine similarity between the document dj and the query vector qk is given as: ),( ),( 1 2 1 2 1 ∑∑ ∑ == = × × == m i ij m i ik ik m i ij kj kj kj ww ww qd qd qdsim 2.8.Metrics for IR Evaluation The aim of any Information Retrieval system is to search document in responce to a user query relavant to his information need. The performance of IR systems is evaluated on the basis of how relavent documents it retrieve. Relevance depends upon a specific user’s judgment. It is subjective in nature. The true relevance of the retrieved document can be judged by the user only, on the basis of his information need. For same query statement, the desired information need may differ from user to user. Traditionally the evaluation of IR systems has been done on a set of queries and test document collections. For each test query a set of ranked relavant documents is created manually then the system result is cross checked by it. There are many retrieval models/ algorithms/ systems. Different performance metrics are used to assess how effeciently an IR system retrieve the documents in responce to a users information need. Different Criteria's for evaluation of an IR system are: a. Coverage of the collection b. Time lag c. Presentation format d. User effort e. Precision f. Recall Effectiveness is the performance measure of any IR system which describes, how much the IR system satisfy a user’s information need by retrieving relevant documents. Aspects of effectiveness include: a. Whether the retrieved documents are pertinent to the information need of the user. b. Whether the retrieved documents are ranked according to the relevance with the user query. c. Whether the IR system returns a reasonable number of relevant documents present in the corpus to the user etc. Fig. 3: trade-off between precision and recall
  • 8. International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 4, No. 5, September 2013 84 3 OUR APPROACH In this work, to develop an Information Retrieval system for Urdu language, the following methods and evaluation parameters are used. Fig. 4: Architecture of monolingual IR system 3.1 Stemmer: For Developing Stemmer we have used an unsupervised approach [1] which gives accuracy of 84.2. 3.2 Term Weighting scheme: For term weighting we have used the tf*idf 3.3 Indexing Scheme: In this work the query statement and the documents are represented using the word indexing strategy. 3.4 Retrieval Model: To implement our IR system we have used the vector space model. 3.5 Encoding Scheme: As the system focuses on Urdu language, to access the data UTF8 character encoding is used. 3.6 Similarity Measure: For getting documents which are more closely related to the query i.e. to measure the similarity between different documents in the corpus and the query statement, the cosine similarity measure is used. 3.7 Ranking of the document: For ranking of the retrieved documents in order to their relevance with the query, cosine similarity values are used. The document having higher cosine value (min angular distance) with the query will be more similar i.e. contains the query terms more frequently and hence these documents will be considered more relevant to the user query.
  • 9. International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 4, No. 5, September 2013 85 4 EXPERIMENT The data set used in this thesis for the training and testing of the developed Urdu IR system is taken from Emilie corpus. In this corpus documents are in xml format. The data set taken from EMILLE corpus a tagged data set consist of documents having information related to health issues, road safety issues, education issues, legal social issues, social issues, housing issues etc. The testing data set consist of documents from various domains such as: Table 4: dataset specification used for Urdu IR Domain Number of Documents Number of words Health 33 223412 Education 8 115264 Housing 8 120327 Legal 8 108055 Social issues 12 146083 Homeopathy 32 527360 Drama 13 135680 Myths 10 202880 Story and Novel 21 300160 Media 15 224000 Science 47 704000 History 33 502400 Politics 21 728320 Psychology 27 555520 Religion 34 556800 Sociology 21 398080 Miscellaneous 48 985374 A Query set consist of 200 queries is prepared manually for training and testing of the IR system. Table 5: Sample of Query set
  • 10. International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 4, No. 5, September 2013 86 5 RESULTS AND DISCUSSIONS For testing purpose of the developed Information Retrieval system, a test collection of 350 documents have been used. A set of 200 queries was constructed on these 350 documents. This query set is used to evaluate the developed Urdu IR. Table 5: results of the developed Urdu IR system testing As shown in the above table, the system has value of 0.13 as the minimum average precision and maximum average precision value of the system is 0.63. Similarly the minimum average recall value for the system is 0.5 and maximum average recall value was found out to be 0.8. 6 CONCLUSION AND FUTURE WORK In this paper we have discussed various indexing schemes and IR models. We have used tf*idf scheme for indexing and to implement the IR system VSM (Vector Space Model) is used. The experimental result shows that the average recall of the developed IR system is 0.8 with 0.3 precision. IR is one of the hottest research fields. One can do a lot new research to provide efficient IR system which can satisfy the user’s information needs. A lot of research is needed to develop language independent approaches to support IR systems for multilingual data collections. REFERENCES [1] Mohd Shahid Husain et. al. “A language Independent Approach to develop Urdu stemmer”. Proceedings of the second International Conference on Advances in Computing and Information Technology. 2012. [2] Rizvi, J et. al. “Modeling case marking system of Urdu-Hindi languages by using semantic information”. Proceedings of the IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE '05). 2005. [3] Butt, M. King, T. “Non-Nominative Subjects in Urdu: A Computational Analysis”. Proceedings of the International Symposium on Non-nominative Subjects, Tokyo, December, pp. 525-548, 2001. [4] Chen, A. Gey, F. “Building and Arabic Stemmer for Information Retrieval”. Proceedings of the Text Retrieval Conference, 47, 2002. [5] R. Wicentowski. "Multilingual Noise-Robust Supervised Morphological Analysis using the Word Frame Model." In Proceedings of Seventh Meeting of the ACL Special Interest Group on Computational Phonology (SIGPHON), pp. 70-77, 2004. [6] Rizvi, Hussain M. “Analysis, Design and Implementation of Urdu Morphological Analyzer”. SCONEST, 1-7, 2005. [7] Krovetz, R. “View Morphology as an Inference Process”. In the Proceedings of 5th International Conference on Research and Development in Information Retrieval, 1993. [8] Thabet, N. “Stemming the Qur’an”. In the Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages, 2004. [9] Paik, Pauri. “A Simple Stemmer for Inflectional Languages”. FIRE 2008. [10] Sharifloo, A.A., Shamsfard M. “A Bottom up Approach to Persian Stemming”. IJCNLP, 2008 [11] Kumar, A. and Siddiqui, T. “An Unsupervised Hindi Stemmer with Heuristics Improvements”. In Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data, 2008.
  • 11. International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 4, No. 5, September 2013 87 [12] Kumar, M. S. and Murthy, K. N. “Corpus Based Statistical Approach for Stemming Telugu”. Creation of Lexical Resources for Indian Language Computing and Processing (LRIL), C-DAC, Mumbai, India, 2007. [13] Qurat-ul-Ain Akram, Asma Naseer, Sarmad Hussain. “Assas-Band, an Affix-Exception-List Based Urdu Stemmer”. Proceedings of ACL-IJCNLP 2009. [14] http://en.wikipedia.org/wiki/Urdu [15] http://www.bbc.co.uk/languages/other/guide/urdu/steps.shtml [16] .http://www.andaman.org/BOOK/reprints/weber/rep-weber.htm [17] Natural Language processing and Information Retrieval by Tanveer Siddiqui, U S Tiwary. [18] Information retrieval: data structure and algorithms by William B. Frakes, Ricardo Baeza-Yates. [19] http://www.crulp.org/software/ling_resources.htm AUTHOR Mohd. Shahid Husain M.Tech. from Indian Institute of Information Technology (IIIT-A), Allahabad with Intelligent System as specialization. Currently pursuing Ph.D. and working as assistant professor in the department of Information Technology, Integral University, Lucknow.