SlideShare una empresa de Scribd logo
1 de 10
Descargar para leer sin conexión
International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.3, No.2, April 2013
DOI : 10.5121/ijcsea.2013.3201 1
SIMILAR THESAURUS BASED ON ARABIC
DOCUMENT: AN OVERVIEW AND
COMPARISON
Essam S. Hanandeh,
Department of Computer Information System, Zarqa University, Zarqa, Jordan
Hanandeh@zu.edu.jo
ABSTRACT
The massive grow of the modern information retrieval system (IRS), especially in natural languages
becomes more difficult. The search in Arabic languages, as natural language, is not good enough yet. This
paper will try to build similar thesaurus based on Arabic language in two mechanisms, the first one is full
word mechanisms and the other is stemmed mechanisms, and then to compare between them.
The comparison made by this study proves that the similar thesaurus using stemmed mechanisms get more
better results than using traditional in the same mechanisms and similar thesaurus improved more the
recall and precision than traditional information retrieval system at recall and precision levels.
KEYWORDS
Similarity thesaurus, Recall, Precision, information retrieval, Traditional
1. INTRODUCTION
A thesaurus (plural: thesauri) is a valuable tool in IR, in both indexing and in searching processes.
It is used as a controlled vocabulary and as a means for expanding or altering queries (query
expansion)[10]. Most thesauri that users encounter are manually constructed by domain experts
and/or experts at document description. Manual thesaurus construction is a time-consuming and
quite expensive process, and the results are bound to be more or less subjective since the person
creating the thesaurus make choices that affect the structure of the thesaurus. There is a need for
methods of automatically construct thesauri, which in addition to the improvements in time and
cost aspects can result in more objective thesauri that are easier to update. These developed
thesauri have been designed to comply with international and Arabic standards, and is capable of
performing all tasks & duties needed in thesauri building. It automates most tasks in the building
and maintenance of thesauri and insures integrity of its structures and its relations with database
materials.
International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.3, No.2, April 2013
2
2. INFORMATION SYSTEM
Information retrieval exhibits similarity to many other areas of information processing. The most
important computer-based information systems today are the management information systems
(MIS), data base management systems (DBMS), decision support systems (DSS), question-
answering systems (QA), as well as information retrieval system (IRS) [8].
Information Retrieval (IR) is best understood if one remembers that the information consists of
documents. In that context, information retrieval deals with the representation, storage, and access
to documents or representatives of documents [10]. Information Retrieval systems have an
important role in the studies of libraries and information, and they are really considered the top of
this field. They get use of facts and ideas achieved in all studies, theoretical or practical, such as
indexing, classifying, subjective analysis, concluding, computers, bibliography and others.
As we know that half of the science is in its organization, and the main objective of this
specialization is to supply and afford the suitable and sufficient information in the suitable time to
the suitable person or researcher.
3. REPRESENTATION OF DOCUMENTS AND QUERIES:
The following section describes the steps of representing the documents and queries
automatically:
3.1 Filtering:
In filtering the document collection, these filtering processes consist mainly of:
3.1.1 Eliminating Stop words:
Stop word is a word that occurs so frequently in documents in the collection that it is useless for
purposes of retrieval [3], Elimination of stop words reduces the size of the indexing structure and
thus increases the performance of the system and enables it to retrieve more relevant documents.
Stop words in Arabic include some of grammatical links such as the definite article (‫,)اﻟـ‬ attached
and separated prepositions, conjunctions, interrogative words, negative words, exclamations ,
calling letters, adverbs of time and place They also include all the pronouns, demonstratives,
subject and object pronouns, the Five Distinctive Nouns, some numbers, additions and verbs.
Stop words may be separated or attached ones in a form of prefixes or suffixes [3].
3.1.2 Stemming:
Stemming is the deletion of prefixes and suffixes (getting the root of the word). The root means:
the part of the word that is left after deletion of prefixes, suffixes and Infixes [4].
Stems are thought to be useful for improving retrieval performance because they reduce variants
of the same root word to a common concept. Furthermore, stemming has the secondary effect of
International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.3, No.2, April 2013
3
reducing the size of the indexing structure because the number of distinct index terms is reduced
[3].
3.1.3 Indexing:
Selection of index terms from the collection of filtered terms as seen in the construction
processes.
Indexing is defined as the process of choosing a term or a number of terms that can represent what
the document contains. These terms have been called (Index terms) [8]. Indexing can be
performed either manually (Manual Indexing) or through using computers software and programs
(automatic Indexing) [6]. To decide the location of accuracy of the keywords in a certain text, the
researcher will generate the inverted files. This file will contain all the keywords that the text
contains, accompanied by the number of each paragraph containing such words. The paragraphs
may be a sentence, a paragraph, a whole page of a document or the complete document [3].
4. ALGORITHM
Phase 1: preparing documents:
1 Use vector space model to put text of documents and query in vectors.
2 Normalization.
 Removing stop words those were collected by Al-Shalabi, et al [2], and they gained
98% success in distinguishing in addition to deleting some signs appeared. (stop
words are the words that occur so frequently in documents in the collection that it is
useless for purposes of retrieval [9]), Elimination of stop words reduces the size of
the indexing structure and thus increases the performance of the system and enables it
to retrieve more relevant documents.
 Deleting punctuation marks, commas, follow stops, especial signs, numbers (contents
that has no meanings).
3 Stemming : the following stemming algorithm as in [11] with a little bet modification :
Let T denote the set of characters of the Arabic surface Full word
Let Li denote the position of letter i in term T
Let Stem denote the term after stemming in each step
Let D denote the set of definite articles ( ‫)ال‬
Let S denote the set of suffixes
S ={ ،‫ه،ة‬ ،‫ك‬ ،‫و‬ ،‫ي‬ ،‫ن‬ ،‫ا‬ ‫ت‬
،‫ان‬ ،‫ﯾﻦ‬ ،‫ون‬ ،‫ات‬ ،‫ھﻢ‬ ،‫ھﻦ‬ ،‫ھﺎ‬ ،‫ﻛﻢ‬ ،‫ﻛﻦ‬ ،‫ﻧﺎ‬ ،‫وا‬ ،‫ﺗﻢ‬ ،‫ﻧﻲ‬ ،‫ﺗﻦ‬ ،‫ﺗﮫ‬ ،‫ﯾﮫ‬ ،‫ﻣﺎ‬ ،‫ﯾﺎ‬ ،‫ﺗﺎ‬ ،‫ﺗﻚ‬ ،‫ري‬ ،‫ﻟﻲ‬ ،‫ار‬
،‫ﯾﺮ‬ ،‫ﺗﻲ‬ ،‫ﯾﻞ‬ ‫ﯾﺔ‬
،‫ﻟﯿﻦ‬ ،‫ﯾﺎن‬ ،‫ﯾﺘﺶ‬ ،‫ﯾﻮن‬ ،‫رات‬ ،‫ﻣﺎن‬ ،‫رﯾﻦ‬ ‫رھﺎ‬، ،‫ﯾﻨﺎ‬ ‫}ﻟﮭﺎ‬
Let P denote the set of prefixes
P={ ،‫ي‬ ‫ت‬ ، ،‫ن‬ ،‫ب‬ ‫ل‬
،‫ال‬ ،‫ﻟﻞ‬ ،‫ﺳﻲ‬ ،‫ﺳﺎ‬ ،‫ﺳﺖ‬ ،‫ﺳﻦ‬ ‫ﻛﺎ‬ ، ‫ﻓﺎ‬ ، ‫ﺑﺎ‬ ، ،‫ﻟﻲ‬ ‫ﻟﺖ‬ ، ،‫ﻟﻦ‬ ،‫ﻓﺖ‬ ‫ﻓﻲ‬ ، ،‫ﻓﻦ‬ ‫اس‬
،‫اﻟﺢ‬ ،‫ﻣﺎل‬ ،‫ﻻل‬ ،‫اﻟﻢ‬ ،‫اﻻ‬ ،‫اﻟﺲ‬ ،‫اﻟﻊ‬ ،‫ﻟﻠﻢ‬ ،‫اﻟﻚ‬ ‫اﻟﻒ‬ }
Let n is the total number of characters in the Arabic word
Step 1: Remove any diacritic in T
International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.3, No.2, April 2013
4
Step2: If the length of T is > 3 characters then,
Remove the prefix Waw “ ‫”و‬ in position L1
Step 3: Normalize ‫آ‬ ,‫إ‬ ,‫أ‬ of T to ‫ا‬ (plain alif)
Step 4: Normalize ‫ى‬ in Ln of T to ‫ي‬
Replace the sequence of ‫ى‬ in Ln-1 and ‫ء‬ in Ln to ‫ئ‬
Replace the sequence of ‫ي‬ in Ln-1 and ‫ء‬ in Ln to ‫ئ‬
Normalize ‫ه‬ in Ln of T to ‫ة‬
Step 5: For all variations of D (‫)ال‬ do,
Locate the definite article Di in T
If Di in T matches Di = Di + Characters in T ahead of Di
Stem = T – Di
Step 6: If the length of Stem is > 3 characters then,
For all variations of S, obtain the most frequent suffix,
Match the region of Si to longest suffix in Stem
If length of (Stem -Si ) >= to 3 char then,
Stem = Stem – Si
Step 7: If the length of Stem is > 3 characters then,
For all variations of P do
Match the region of Pi in Stem
If the length of (Stem -Pi ) > 3 characters then,
Stem = Stem – Pi
Step 8: Return the Stem
Phase 2: building a traditional IRS.
4 Selection of index terms from the collection of filtered terms. Ricardo Baeza-Yates and
Berthier Ribeiro-Neto in [9], show that the inverted file (or inverted index) is a word
oriented mechanism for indexing a text collection in order to speed up the searching task,
Index terms can be Individual words, group of words, or phrases, but most of them are
single words [9] for this reason researcher choose a single words (i.e., single term) as
index terms in this work.
This phase includes an affecting decision, to repeat the required word within the document to be
an "Index Term". If it only appears once, can a word be used as an Index Term? Or researcher
must use words that repeat many times in the same document?
In this thesis the index terms should be the words that repeated from (2-7) times in the text. After
studying some files and taking the average number of occurrence of some words, they found in
[7] that the best index terms to be used are those repeated within the average. (not too little, nor
too much).
It is expected that the use of a controlled vocabulary leads to an improvement in retrieval
performance. So we ignore the terms that appear in most documents in the collection (i.e., has
high frequencies), and the terms that appear only once in a document (i.e., terms that low
frequencies).
1 Creating the inverted file based on the stemmed words of each documents. (The stemmed
words technique used is suffix prefix removal).
International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.3, No.2, April 2013
5
2 Compute the frequencies of each index term in each document (tf).
3 Compute the normalized frequencies of terms in each document by using the
following formula
jll
ji
ji
freq
freq
f
,
,
,
max
=
(1)
1 Compute the inverse document frequency, for each index term Ki in a document dj, as
follows
i
i
n
N
idf log=
(2)
Where N is the total number of documents in the collections, and ni is the number
of documents in which index term ki appears.
2 Calculate the weight of each term in a document by multiplying the normalized
frequencies with inverse document frequencies as follows
ijiji idffw ∗= ,,
(3)
After these steps, we have an inverted file that contains index terms (i.e., words) and terms
frequency and the weight of each term in a document.
Phase 3: Building Similar Thesaurus:
In this paper, the researcher uses Cosine equation, as it is the most common equation in building
the similarity thesaurus, and the threshold similarity was a variable to be entered while the system
working.
∑∑
∑
==
=
=
n
i
ki
n
i
ji
ki
n
i
ji
ww
ww
1
2
,
1
2
,
,
1
,
kj,
*
)*(
SsimilarityCosine
(4)
All the results were between 0 and 1 as (0<=Wi,k <=1) & (0<=Wi,j <=1)
5. EXPERIMENTS AND RESULTS
This study aims to reinforcing IRS depending on 242 Arabic abstract documents that used by
(Hmeidi & Kanaan, 1997) in [5], also to realize the importance of using stemmed words in these
systems instead of full words. All these abstracts involve computer science and information
system.
To achieve this aim, the researcher designed and built an automatic information retrieval system
from scratch to handle Arabic text. Work on these results that we got after applying 59 queries
International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.3, No.2, April 2013
6
from the Relevance Judgments documents began and results were analyzed using the Recall and
Precision criteria. After that, Average of Recall and Precision were calculated.
Researcher has constructed an automatic stemmed words and full word index using inverted file
technique. Depending on these indexing words researcher has built three information retrieval
systems; in the first system, the researcher has used a Traditional Information Retrieval system
using a term frequency-inverse document frequency (tf-idf) for index term weights. In the second
one, the researcher used Similar Thesaurus by using Vector Space Model with Cosine formula
using a term frequency-inverse document frequency (tf-idf) for index term weights, and compare
between the similar measurements to find out the best that will be used in building the Similar
thesaurus.
3.1 Results
Table 1 shows the effect of using Traditional (full words, stemmed) and using Similarity thesaurus(full
words, stemmed).
Table 1 :Number of Retrieved, Relevant, Irrelevant documents using Traditional and when
using similarity thesaurus
Retrieved Relevant Irrelevant
Traditional-Full
Words
1706 763 943
Traditional -Stemmed
words
2399 1022 1377
Thesaurus -Full
Words
1704 771 933
Thesaurus -Stemmed
words
2029 991 1038
Figure 1 & Figure 2: Comparison values of the Average Recall Precision when using Traditional and when
using similarity thesaurus
Traditional (Recall Docs)
0
500
1000
1500
2000
2500
3000
Retrieved Relevant Irrelevant
No.ofDocs
Full Words Stemmed words
Figure 1: relevant and irrelevant shape from retrieved document in Tradition system
International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.3, No.2, April 2013
7
Similarity Thesaurus (Recall Docs)
0
500
1000
1500
2000
2500
3000
Retrieved Relevant Irrelevant
No.ofDocs
Full Words Stemmed words
Figure 2: relevant and irrelevant shape from retrieved document in the case of similarity thesaurus
retrieving system
Table 2, Figure 3, and Figure 4 shows the percentage of the relevant retrieved documents from
all the relevant documents in the collection, when using Traditional and when using similarity
thesaurus
.
Table 2: shows the percentage of the relevant retrieved documents
% of Relevant Docs that Retrieved
Traditional-Full
Words
46.07487923
Traditional -
Stemmed words
61.71497585
Thesaurus -Full
Words
46.557971
Thesaurus -Stemmed
words
62.681159
0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
% R e l e v a n t D o c s t h a t R e
F u l l w o r d s S t e m m e d w o r d s
Figure 3: % of retrieved document in Traditional system
International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.3, No.2, April 2013
8
S i m i l a r i t y
0
2 0
4 0
6 0
8 0
% R e l e v a n t D o c s t h a t R e t r i e v e d
F u l l w o r d s S t e m m e d w o r d s
Figure 4: % of retrieved document in thesaurus system
Table 3 shows percentage when using Traditional and when using similarity thesaurus
Table 3: Percentage of using Traditional & similarity thesaurus
Traditional Similarity
Full words 46.07487923 46.55797101
Stemmed words 61.71497585 62.68115942
Table 4: shows how much better were the results of using Traditional and when using similarity
thesaurus
Average Recall Precision
Recall Roots with using
Similarity
Thesaurus
Roots with using
Traditional retrieving
% of Improvement for
using Association
Thesaurus over Traditional
retrieving
0 0.908 0.917966102 -1.00%
0.1 0.87 0.875762712 -0.58%
0.2 0.810178571 0.785762712 2.44%
0.3 0.709464286 0.695254237 1.42%
0.4 0.664821429 0.626237288 3.86%
0.5 0.541428571 0.523389831 1.80%
0.6 0.438571429 0.442542373 -0.40%
0.7 0.325357143 0.290847458 3.45%
0.8 0.251428571 0.198305085 5.31%
0.9 0.13875 0.084745763 5.40%
q1 `2q2 0.047288136 0.91%
Table 4: Effect of using Traditional and when using similarity thesaurus
Table 5 shows the effect of using the stemmed words for information retrieving were always
better than using Full words, and ensure that using thesauri is much better than using Traditional
information retrieval.
International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.3, No.2, April 2013
9
Table 5: Average of all the Relative work
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Traditional-
Full Words
0.91 0.87 0.81 0.71 0.66 0.54 0.44 0.33 0.25 0.14
Traditional -
Stemmed
words
0.92 0.88 0.79 0.7 0.63 0.52 0.44 0.29 0.2 0.08
Thesaurus -
Full Words
0.92 0.8 0.68 0.54 0.4 0.33 0.21 0.13 0.11 0.07
Thesaurus -
Stemmed
words
0.9 0.82 0.65 0.49 0.39 0.26 0.21 0.12 0.08 0.04
Averages
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Recall
Precision
Simi with Stemming Trad with Stemming
Simi with fullword Trad with fullword
Figure 5: A comparison between the values of average Recall Precision with all cases
6. CONCLUSION:-
In this paper, researcher built similar thesaurus in tow mechanisms (full word, and stemmed) ,
researcher found out that the results for retrieval information in used stemmed word get better
result than using full word in case of used traditional search, but when researcher used the similar
thesaurus in both mechanisms, got better result of stemmed than full word.. Finally, the results
when using stemmed in similar thesaurus got better result over than using stemmed in traditional
search.
International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.3, No.2, April 2013
10
7. FUTURE WORK:-
In this study i built the similar thesaurus in both mechanisms full word and stemming word. I
hope in the future to build an associative thesaurus and compare between them to know what is
better for retrieval information.
REFERENCES
[1] Adriani, M. and Croft, W. “Retrieval Effectiveness of Various Indexing Techniques on Indonesian
News Articles”, 1997.
[2] Al-Shalabi, R. Kannan, G., Al-Jaam, J., Hasnah A., and Helat, E., “Stop-word Removal Algorithm for
Arabic Language”, processing of the 1st International Conference on Information & Communication
Technologies: from theory to Applications-ICTTA, Damascus, 2004.
[3] baeza-yates R.,and Rierio-neto B., “Modern Information Retrieval” , Addison-Wesley,New-
York,1999.
[4] Darwish, K., “Building a Shallow Arabic Morphological Analyzer in one Day”, Acl Workshop on
Computational Approaches to Semitic Language, PP 47-57, 2002.
[5] Kanaan, G. “Comparing Automatic Statistical and Syntactic Phrase Indexing for Arabic Information
Retrieval”, Ph.D.Thesis, University of Illinois, Chicago, USA, 1997.
[6] Monica Lassi ”Automatic Thesaurus Construction”, A paper written within the GSLT course
Linguistic Resource, autumn 2002.
[7]Al-Shalabi, R. Kannan, G., Al-Jaam, J., Hasnah A., and Helat, E., “Stop-word Removal Algorithm for
Arabic Language”, processing of the 1st International Conference on Information & Communication
Technologies: from theory to Applications-ICTTA, Damascus, 2004.
[8]Kanaan, G. Ghassan and Wedyan, M. (2006). Constructing an Automatic Thesaurus to Enhance Arabic
Information Retrieval System. The 2nd Jordanian International Conference on Computer Science and
Engineering, JICCSE 2006, Salt, Jordan. 89-97.
[9]Smeaton, A.F., Van Rijsbergen, C.J., The Retrieval Effects of Query Expansion on Feedback Document
Retrieval System, The Computer Journal, 26(3), p239-46, 1983.
[10]T. R. Addis, Machine Understanding of Natural Language, International Journal of Man-Machine
Studies, Vol. 9, No. 2, March 1977, pp. 207-222.
[11]Aljlayl, M, and Frieder, O, "on Arabic Search: Improving the Retrieval Effectiveness via a Light
Stemming Approach" ACM Conference on Information and Knowledge Management, Mcelean, VA ,
November, 2002
AUTHORS
Assistant Prof. in Zarqa University , jordan

Más contenido relacionado

La actualidad más candente

Chinese Word Segmentation in MSR-NLP
Chinese Word Segmentation in MSR-NLPChinese Word Segmentation in MSR-NLP
Chinese Word Segmentation in MSR-NLPAndi Wu
 
Boolean Retrieval
Boolean RetrievalBoolean Retrieval
Boolean Retrievalmghgk
 
A survey of named entity recognition in assamese and other indian languages
A survey of named entity recognition in assamese and other indian languagesA survey of named entity recognition in assamese and other indian languages
A survey of named entity recognition in assamese and other indian languagesijnlc
 
Text Mining at Feature Level: A Review
Text Mining at Feature Level: A ReviewText Mining at Feature Level: A Review
Text Mining at Feature Level: A ReviewINFOGAIN PUBLICATION
 
Topic Modeling - NLP
Topic Modeling - NLPTopic Modeling - NLP
Topic Modeling - NLPRupak Roy
 
Information extraction using discourse
Information extraction using discourseInformation extraction using discourse
Information extraction using discourseijitcs
 
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...cscpconf
 
Introduction to Text Mining
Introduction to Text Mining Introduction to Text Mining
Introduction to Text Mining Rupak Roy
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articlesijma
 
A domain specific automatic text summarization using fuzzy logic
A domain specific automatic text summarization using fuzzy logicA domain specific automatic text summarization using fuzzy logic
A domain specific automatic text summarization using fuzzy logicIAEME Publication
 
Statistical Named Entity Recognition for Hungarian – analysis ...
Statistical Named Entity Recognition for Hungarian – analysis ...Statistical Named Entity Recognition for Hungarian – analysis ...
Statistical Named Entity Recognition for Hungarian – analysis ...butest
 
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...cseij
 
G04124041046
G04124041046G04124041046
G04124041046IOSR-JEN
 
Lecture Notes in Computer Science:
Lecture Notes in Computer Science:Lecture Notes in Computer Science:
Lecture Notes in Computer Science:butest
 

La actualidad más candente (20)

Chinese Word Segmentation in MSR-NLP
Chinese Word Segmentation in MSR-NLPChinese Word Segmentation in MSR-NLP
Chinese Word Segmentation in MSR-NLP
 
Boolean Retrieval
Boolean RetrievalBoolean Retrieval
Boolean Retrieval
 
A survey of named entity recognition in assamese and other indian languages
A survey of named entity recognition in assamese and other indian languagesA survey of named entity recognition in assamese and other indian languages
A survey of named entity recognition in assamese and other indian languages
 
Text Mining at Feature Level: A Review
Text Mining at Feature Level: A ReviewText Mining at Feature Level: A Review
Text Mining at Feature Level: A Review
 
Topic Modeling - NLP
Topic Modeling - NLPTopic Modeling - NLP
Topic Modeling - NLP
 
Information extraction using discourse
Information extraction using discourseInformation extraction using discourse
Information extraction using discourse
 
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...
 
Does sizematter
Does sizematterDoes sizematter
Does sizematter
 
Introduction to Text Mining
Introduction to Text Mining Introduction to Text Mining
Introduction to Text Mining
 
Ir 02
Ir   02Ir   02
Ir 02
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articles
 
P33077080
P33077080P33077080
P33077080
 
Ir 03
Ir   03Ir   03
Ir 03
 
A domain specific automatic text summarization using fuzzy logic
A domain specific automatic text summarization using fuzzy logicA domain specific automatic text summarization using fuzzy logic
A domain specific automatic text summarization using fuzzy logic
 
Statistical Named Entity Recognition for Hungarian – analysis ...
Statistical Named Entity Recognition for Hungarian – analysis ...Statistical Named Entity Recognition for Hungarian – analysis ...
Statistical Named Entity Recognition for Hungarian – analysis ...
 
Term weighting
Term weightingTerm weighting
Term weighting
 
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
 
Some Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBASome Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBA
 
G04124041046
G04124041046G04124041046
G04124041046
 
Lecture Notes in Computer Science:
Lecture Notes in Computer Science:Lecture Notes in Computer Science:
Lecture Notes in Computer Science:
 

Destacado

Universidad complejidad calidadacademica_pertinencia_social
Universidad complejidad calidadacademica_pertinencia_socialUniversidad complejidad calidadacademica_pertinencia_social
Universidad complejidad calidadacademica_pertinencia_socialunprg
 
Standards Educacion
Standards EducacionStandards Educacion
Standards EducacionMPPE
 
Daily equity-report by epicreserach 14 may 2013
Daily equity-report by epicreserach 14 may 2013Daily equity-report by epicreserach 14 may 2013
Daily equity-report by epicreserach 14 may 2013Epic Daily Report
 
Strategic management chapter 5 and 6 note for bba vii
Strategic management chapter 5 and 6 note for bba viiStrategic management chapter 5 and 6 note for bba vii
Strategic management chapter 5 and 6 note for bba viiSanjeev Bhandari
 
Q3 presentation final
Q3 presentation finalQ3 presentation final
Q3 presentation finalTele2
 
Capbussem en els rius de més enllà
Capbussem en els rius de més enllàCapbussem en els rius de més enllà
Capbussem en els rius de més enllàml2063
 
Fenilcetonuria genética
Fenilcetonuria genética Fenilcetonuria genética
Fenilcetonuria genética Miguel PT
 
Project Management internal Sharing
Project Management internal SharingProject Management internal Sharing
Project Management internal SharingAldo Santoso
 

Destacado (12)

Free visa for the vietnamese
Free visa for the vietnameseFree visa for the vietnamese
Free visa for the vietnamese
 
Universidad complejidad calidadacademica_pertinencia_social
Universidad complejidad calidadacademica_pertinencia_socialUniversidad complejidad calidadacademica_pertinencia_social
Universidad complejidad calidadacademica_pertinencia_social
 
Trendy Technology and Social Media for EGAT Executive
Trendy Technology and Social Media for EGAT ExecutiveTrendy Technology and Social Media for EGAT Executive
Trendy Technology and Social Media for EGAT Executive
 
SéSé
 
Standards Educacion
Standards EducacionStandards Educacion
Standards Educacion
 
Daily equity-report by epicreserach 14 may 2013
Daily equity-report by epicreserach 14 may 2013Daily equity-report by epicreserach 14 may 2013
Daily equity-report by epicreserach 14 may 2013
 
Strategic management chapter 5 and 6 note for bba vii
Strategic management chapter 5 and 6 note for bba viiStrategic management chapter 5 and 6 note for bba vii
Strategic management chapter 5 and 6 note for bba vii
 
Q3 presentation final
Q3 presentation finalQ3 presentation final
Q3 presentation final
 
Capbussem en els rius de més enllà
Capbussem en els rius de més enllàCapbussem en els rius de més enllà
Capbussem en els rius de més enllà
 
Fenilcetonuria genética
Fenilcetonuria genética Fenilcetonuria genética
Fenilcetonuria genética
 
Project Management internal Sharing
Project Management internal SharingProject Management internal Sharing
Project Management internal Sharing
 
Indicadores itzel
Indicadores itzelIndicadores itzel
Indicadores itzel
 

Similar a SIMILAR THESAURUS BASED ON ARABIC DOCUMENT: AN OVERVIEW AND COMPARISON

AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...cscpconf
 
Ijarcet vol-3-issue-1-9-11
Ijarcet vol-3-issue-1-9-11Ijarcet vol-3-issue-1-9-11
Ijarcet vol-3-issue-1-9-11Dhabal Sethi
 
Answer extraction and passage retrieval for
Answer extraction and passage retrieval forAnswer extraction and passage retrieval for
Answer extraction and passage retrieval forWaheeb Ahmed
 
CANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEM
CANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEMCANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEM
CANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEMIRJET Journal
 
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)ijnlc
 
Survey On Building A Database Driven Reverse Dictionary
Survey On Building A Database Driven Reverse DictionarySurvey On Building A Database Driven Reverse Dictionary
Survey On Building A Database Driven Reverse DictionaryEditor IJMTER
 
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVALONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVALijaia
 
EFFECTIVE ARABIC STEMMER BASED HYBRID APPROACH FOR ARABIC TEXT CATEGORIZATION
EFFECTIVE ARABIC STEMMER BASED HYBRID APPROACH FOR ARABIC TEXT CATEGORIZATIONEFFECTIVE ARABIC STEMMER BASED HYBRID APPROACH FOR ARABIC TEXT CATEGORIZATION
EFFECTIVE ARABIC STEMMER BASED HYBRID APPROACH FOR ARABIC TEXT CATEGORIZATIONIJDKP
 
A Novel Approach for Keyword extraction in learning objects using text mining
A Novel Approach for Keyword extraction in learning objects using text miningA Novel Approach for Keyword extraction in learning objects using text mining
A Novel Approach for Keyword extraction in learning objects using text miningIJSRD
 
Automatic document clustering
Automatic document clusteringAutomatic document clustering
Automatic document clusteringIAEME Publication
 
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION cscpconf
 
Car-Following Parameters by Means of Cellular Automata in the Case of Evacuation
Car-Following Parameters by Means of Cellular Automata in the Case of EvacuationCar-Following Parameters by Means of Cellular Automata in the Case of Evacuation
Car-Following Parameters by Means of Cellular Automata in the Case of EvacuationCSCJournals
 
Diacritic Oriented Arabic Information Retrieval System
Diacritic Oriented Arabic Information Retrieval SystemDiacritic Oriented Arabic Information Retrieval System
Diacritic Oriented Arabic Information Retrieval SystemCSCJournals
 
An Evaluation and Overview of Indices Based on Arabic Documents
An Evaluation and Overview of Indices Based on Arabic DocumentsAn Evaluation and Overview of Indices Based on Arabic Documents
An Evaluation and Overview of Indices Based on Arabic DocumentsIJCSEA Journal
 
An Evaluation and Overview of Indices Based on Arabic Documents
An Evaluation and Overview of Indices Based on Arabic DocumentsAn Evaluation and Overview of Indices Based on Arabic Documents
An Evaluation and Overview of Indices Based on Arabic DocumentsIJCSEA Journal
 
An evaluation and overview of indices
An evaluation and overview of indicesAn evaluation and overview of indices
An evaluation and overview of indicesIJCSEA Journal
 
Stemming is one of several text normalization techniques that converts raw te...
Stemming is one of several text normalization techniques that converts raw te...Stemming is one of several text normalization techniques that converts raw te...
Stemming is one of several text normalization techniques that converts raw te...NALESVPMEngg
 
Survey on Text Classification
Survey on Text ClassificationSurvey on Text Classification
Survey on Text ClassificationAM Publications
 
Enhancing the labelling technique of
Enhancing the labelling technique ofEnhancing the labelling technique of
Enhancing the labelling technique ofIJDKP
 

Similar a SIMILAR THESAURUS BASED ON ARABIC DOCUMENT: AN OVERVIEW AND COMPARISON (20)

AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
 
Ijarcet vol-3-issue-1-9-11
Ijarcet vol-3-issue-1-9-11Ijarcet vol-3-issue-1-9-11
Ijarcet vol-3-issue-1-9-11
 
Answer extraction and passage retrieval for
Answer extraction and passage retrieval forAnswer extraction and passage retrieval for
Answer extraction and passage retrieval for
 
CANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEM
CANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEMCANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEM
CANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEM
 
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)
 
Survey On Building A Database Driven Reverse Dictionary
Survey On Building A Database Driven Reverse DictionarySurvey On Building A Database Driven Reverse Dictionary
Survey On Building A Database Driven Reverse Dictionary
 
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVALONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
 
EFFECTIVE ARABIC STEMMER BASED HYBRID APPROACH FOR ARABIC TEXT CATEGORIZATION
EFFECTIVE ARABIC STEMMER BASED HYBRID APPROACH FOR ARABIC TEXT CATEGORIZATIONEFFECTIVE ARABIC STEMMER BASED HYBRID APPROACH FOR ARABIC TEXT CATEGORIZATION
EFFECTIVE ARABIC STEMMER BASED HYBRID APPROACH FOR ARABIC TEXT CATEGORIZATION
 
A Novel Approach for Keyword extraction in learning objects using text mining
A Novel Approach for Keyword extraction in learning objects using text miningA Novel Approach for Keyword extraction in learning objects using text mining
A Novel Approach for Keyword extraction in learning objects using text mining
 
Automatic document clustering
Automatic document clusteringAutomatic document clustering
Automatic document clustering
 
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
 
Car-Following Parameters by Means of Cellular Automata in the Case of Evacuation
Car-Following Parameters by Means of Cellular Automata in the Case of EvacuationCar-Following Parameters by Means of Cellular Automata in the Case of Evacuation
Car-Following Parameters by Means of Cellular Automata in the Case of Evacuation
 
Diacritic Oriented Arabic Information Retrieval System
Diacritic Oriented Arabic Information Retrieval SystemDiacritic Oriented Arabic Information Retrieval System
Diacritic Oriented Arabic Information Retrieval System
 
An Evaluation and Overview of Indices Based on Arabic Documents
An Evaluation and Overview of Indices Based on Arabic DocumentsAn Evaluation and Overview of Indices Based on Arabic Documents
An Evaluation and Overview of Indices Based on Arabic Documents
 
An Evaluation and Overview of Indices Based on Arabic Documents
An Evaluation and Overview of Indices Based on Arabic DocumentsAn Evaluation and Overview of Indices Based on Arabic Documents
An Evaluation and Overview of Indices Based on Arabic Documents
 
An evaluation and overview of indices
An evaluation and overview of indicesAn evaluation and overview of indices
An evaluation and overview of indices
 
Stemming is one of several text normalization techniques that converts raw te...
Stemming is one of several text normalization techniques that converts raw te...Stemming is one of several text normalization techniques that converts raw te...
Stemming is one of several text normalization techniques that converts raw te...
 
Survey on Text Classification
Survey on Text ClassificationSurvey on Text Classification
Survey on Text Classification
 
Enhancing the labelling technique of
Enhancing the labelling technique ofEnhancing the labelling technique of
Enhancing the labelling technique of
 
Cc35451454
Cc35451454Cc35451454
Cc35451454
 

Último

Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 

Último (20)

Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 

SIMILAR THESAURUS BASED ON ARABIC DOCUMENT: AN OVERVIEW AND COMPARISON

  • 1. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.3, No.2, April 2013 DOI : 10.5121/ijcsea.2013.3201 1 SIMILAR THESAURUS BASED ON ARABIC DOCUMENT: AN OVERVIEW AND COMPARISON Essam S. Hanandeh, Department of Computer Information System, Zarqa University, Zarqa, Jordan Hanandeh@zu.edu.jo ABSTRACT The massive grow of the modern information retrieval system (IRS), especially in natural languages becomes more difficult. The search in Arabic languages, as natural language, is not good enough yet. This paper will try to build similar thesaurus based on Arabic language in two mechanisms, the first one is full word mechanisms and the other is stemmed mechanisms, and then to compare between them. The comparison made by this study proves that the similar thesaurus using stemmed mechanisms get more better results than using traditional in the same mechanisms and similar thesaurus improved more the recall and precision than traditional information retrieval system at recall and precision levels. KEYWORDS Similarity thesaurus, Recall, Precision, information retrieval, Traditional 1. INTRODUCTION A thesaurus (plural: thesauri) is a valuable tool in IR, in both indexing and in searching processes. It is used as a controlled vocabulary and as a means for expanding or altering queries (query expansion)[10]. Most thesauri that users encounter are manually constructed by domain experts and/or experts at document description. Manual thesaurus construction is a time-consuming and quite expensive process, and the results are bound to be more or less subjective since the person creating the thesaurus make choices that affect the structure of the thesaurus. There is a need for methods of automatically construct thesauri, which in addition to the improvements in time and cost aspects can result in more objective thesauri that are easier to update. These developed thesauri have been designed to comply with international and Arabic standards, and is capable of performing all tasks & duties needed in thesauri building. It automates most tasks in the building and maintenance of thesauri and insures integrity of its structures and its relations with database materials.
  • 2. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.3, No.2, April 2013 2 2. INFORMATION SYSTEM Information retrieval exhibits similarity to many other areas of information processing. The most important computer-based information systems today are the management information systems (MIS), data base management systems (DBMS), decision support systems (DSS), question- answering systems (QA), as well as information retrieval system (IRS) [8]. Information Retrieval (IR) is best understood if one remembers that the information consists of documents. In that context, information retrieval deals with the representation, storage, and access to documents or representatives of documents [10]. Information Retrieval systems have an important role in the studies of libraries and information, and they are really considered the top of this field. They get use of facts and ideas achieved in all studies, theoretical or practical, such as indexing, classifying, subjective analysis, concluding, computers, bibliography and others. As we know that half of the science is in its organization, and the main objective of this specialization is to supply and afford the suitable and sufficient information in the suitable time to the suitable person or researcher. 3. REPRESENTATION OF DOCUMENTS AND QUERIES: The following section describes the steps of representing the documents and queries automatically: 3.1 Filtering: In filtering the document collection, these filtering processes consist mainly of: 3.1.1 Eliminating Stop words: Stop word is a word that occurs so frequently in documents in the collection that it is useless for purposes of retrieval [3], Elimination of stop words reduces the size of the indexing structure and thus increases the performance of the system and enables it to retrieve more relevant documents. Stop words in Arabic include some of grammatical links such as the definite article (‫,)اﻟـ‬ attached and separated prepositions, conjunctions, interrogative words, negative words, exclamations , calling letters, adverbs of time and place They also include all the pronouns, demonstratives, subject and object pronouns, the Five Distinctive Nouns, some numbers, additions and verbs. Stop words may be separated or attached ones in a form of prefixes or suffixes [3]. 3.1.2 Stemming: Stemming is the deletion of prefixes and suffixes (getting the root of the word). The root means: the part of the word that is left after deletion of prefixes, suffixes and Infixes [4]. Stems are thought to be useful for improving retrieval performance because they reduce variants of the same root word to a common concept. Furthermore, stemming has the secondary effect of
  • 3. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.3, No.2, April 2013 3 reducing the size of the indexing structure because the number of distinct index terms is reduced [3]. 3.1.3 Indexing: Selection of index terms from the collection of filtered terms as seen in the construction processes. Indexing is defined as the process of choosing a term or a number of terms that can represent what the document contains. These terms have been called (Index terms) [8]. Indexing can be performed either manually (Manual Indexing) or through using computers software and programs (automatic Indexing) [6]. To decide the location of accuracy of the keywords in a certain text, the researcher will generate the inverted files. This file will contain all the keywords that the text contains, accompanied by the number of each paragraph containing such words. The paragraphs may be a sentence, a paragraph, a whole page of a document or the complete document [3]. 4. ALGORITHM Phase 1: preparing documents: 1 Use vector space model to put text of documents and query in vectors. 2 Normalization.  Removing stop words those were collected by Al-Shalabi, et al [2], and they gained 98% success in distinguishing in addition to deleting some signs appeared. (stop words are the words that occur so frequently in documents in the collection that it is useless for purposes of retrieval [9]), Elimination of stop words reduces the size of the indexing structure and thus increases the performance of the system and enables it to retrieve more relevant documents.  Deleting punctuation marks, commas, follow stops, especial signs, numbers (contents that has no meanings). 3 Stemming : the following stemming algorithm as in [11] with a little bet modification : Let T denote the set of characters of the Arabic surface Full word Let Li denote the position of letter i in term T Let Stem denote the term after stemming in each step Let D denote the set of definite articles ( ‫)ال‬ Let S denote the set of suffixes S ={ ،‫ه،ة‬ ،‫ك‬ ،‫و‬ ،‫ي‬ ،‫ن‬ ،‫ا‬ ‫ت‬ ،‫ان‬ ،‫ﯾﻦ‬ ،‫ون‬ ،‫ات‬ ،‫ھﻢ‬ ،‫ھﻦ‬ ،‫ھﺎ‬ ،‫ﻛﻢ‬ ،‫ﻛﻦ‬ ،‫ﻧﺎ‬ ،‫وا‬ ،‫ﺗﻢ‬ ،‫ﻧﻲ‬ ،‫ﺗﻦ‬ ،‫ﺗﮫ‬ ،‫ﯾﮫ‬ ،‫ﻣﺎ‬ ،‫ﯾﺎ‬ ،‫ﺗﺎ‬ ،‫ﺗﻚ‬ ،‫ري‬ ،‫ﻟﻲ‬ ،‫ار‬ ،‫ﯾﺮ‬ ،‫ﺗﻲ‬ ،‫ﯾﻞ‬ ‫ﯾﺔ‬ ،‫ﻟﯿﻦ‬ ،‫ﯾﺎن‬ ،‫ﯾﺘﺶ‬ ،‫ﯾﻮن‬ ،‫رات‬ ،‫ﻣﺎن‬ ،‫رﯾﻦ‬ ‫رھﺎ‬، ،‫ﯾﻨﺎ‬ ‫}ﻟﮭﺎ‬ Let P denote the set of prefixes P={ ،‫ي‬ ‫ت‬ ، ،‫ن‬ ،‫ب‬ ‫ل‬ ،‫ال‬ ،‫ﻟﻞ‬ ،‫ﺳﻲ‬ ،‫ﺳﺎ‬ ،‫ﺳﺖ‬ ،‫ﺳﻦ‬ ‫ﻛﺎ‬ ، ‫ﻓﺎ‬ ، ‫ﺑﺎ‬ ، ،‫ﻟﻲ‬ ‫ﻟﺖ‬ ، ،‫ﻟﻦ‬ ،‫ﻓﺖ‬ ‫ﻓﻲ‬ ، ،‫ﻓﻦ‬ ‫اس‬ ،‫اﻟﺢ‬ ،‫ﻣﺎل‬ ،‫ﻻل‬ ،‫اﻟﻢ‬ ،‫اﻻ‬ ،‫اﻟﺲ‬ ،‫اﻟﻊ‬ ،‫ﻟﻠﻢ‬ ،‫اﻟﻚ‬ ‫اﻟﻒ‬ } Let n is the total number of characters in the Arabic word Step 1: Remove any diacritic in T
  • 4. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.3, No.2, April 2013 4 Step2: If the length of T is > 3 characters then, Remove the prefix Waw “ ‫”و‬ in position L1 Step 3: Normalize ‫آ‬ ,‫إ‬ ,‫أ‬ of T to ‫ا‬ (plain alif) Step 4: Normalize ‫ى‬ in Ln of T to ‫ي‬ Replace the sequence of ‫ى‬ in Ln-1 and ‫ء‬ in Ln to ‫ئ‬ Replace the sequence of ‫ي‬ in Ln-1 and ‫ء‬ in Ln to ‫ئ‬ Normalize ‫ه‬ in Ln of T to ‫ة‬ Step 5: For all variations of D (‫)ال‬ do, Locate the definite article Di in T If Di in T matches Di = Di + Characters in T ahead of Di Stem = T – Di Step 6: If the length of Stem is > 3 characters then, For all variations of S, obtain the most frequent suffix, Match the region of Si to longest suffix in Stem If length of (Stem -Si ) >= to 3 char then, Stem = Stem – Si Step 7: If the length of Stem is > 3 characters then, For all variations of P do Match the region of Pi in Stem If the length of (Stem -Pi ) > 3 characters then, Stem = Stem – Pi Step 8: Return the Stem Phase 2: building a traditional IRS. 4 Selection of index terms from the collection of filtered terms. Ricardo Baeza-Yates and Berthier Ribeiro-Neto in [9], show that the inverted file (or inverted index) is a word oriented mechanism for indexing a text collection in order to speed up the searching task, Index terms can be Individual words, group of words, or phrases, but most of them are single words [9] for this reason researcher choose a single words (i.e., single term) as index terms in this work. This phase includes an affecting decision, to repeat the required word within the document to be an "Index Term". If it only appears once, can a word be used as an Index Term? Or researcher must use words that repeat many times in the same document? In this thesis the index terms should be the words that repeated from (2-7) times in the text. After studying some files and taking the average number of occurrence of some words, they found in [7] that the best index terms to be used are those repeated within the average. (not too little, nor too much). It is expected that the use of a controlled vocabulary leads to an improvement in retrieval performance. So we ignore the terms that appear in most documents in the collection (i.e., has high frequencies), and the terms that appear only once in a document (i.e., terms that low frequencies). 1 Creating the inverted file based on the stemmed words of each documents. (The stemmed words technique used is suffix prefix removal).
  • 5. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.3, No.2, April 2013 5 2 Compute the frequencies of each index term in each document (tf). 3 Compute the normalized frequencies of terms in each document by using the following formula jll ji ji freq freq f , , , max = (1) 1 Compute the inverse document frequency, for each index term Ki in a document dj, as follows i i n N idf log= (2) Where N is the total number of documents in the collections, and ni is the number of documents in which index term ki appears. 2 Calculate the weight of each term in a document by multiplying the normalized frequencies with inverse document frequencies as follows ijiji idffw ∗= ,, (3) After these steps, we have an inverted file that contains index terms (i.e., words) and terms frequency and the weight of each term in a document. Phase 3: Building Similar Thesaurus: In this paper, the researcher uses Cosine equation, as it is the most common equation in building the similarity thesaurus, and the threshold similarity was a variable to be entered while the system working. ∑∑ ∑ == = = n i ki n i ji ki n i ji ww ww 1 2 , 1 2 , , 1 , kj, * )*( SsimilarityCosine (4) All the results were between 0 and 1 as (0<=Wi,k <=1) & (0<=Wi,j <=1) 5. EXPERIMENTS AND RESULTS This study aims to reinforcing IRS depending on 242 Arabic abstract documents that used by (Hmeidi & Kanaan, 1997) in [5], also to realize the importance of using stemmed words in these systems instead of full words. All these abstracts involve computer science and information system. To achieve this aim, the researcher designed and built an automatic information retrieval system from scratch to handle Arabic text. Work on these results that we got after applying 59 queries
  • 6. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.3, No.2, April 2013 6 from the Relevance Judgments documents began and results were analyzed using the Recall and Precision criteria. After that, Average of Recall and Precision were calculated. Researcher has constructed an automatic stemmed words and full word index using inverted file technique. Depending on these indexing words researcher has built three information retrieval systems; in the first system, the researcher has used a Traditional Information Retrieval system using a term frequency-inverse document frequency (tf-idf) for index term weights. In the second one, the researcher used Similar Thesaurus by using Vector Space Model with Cosine formula using a term frequency-inverse document frequency (tf-idf) for index term weights, and compare between the similar measurements to find out the best that will be used in building the Similar thesaurus. 3.1 Results Table 1 shows the effect of using Traditional (full words, stemmed) and using Similarity thesaurus(full words, stemmed). Table 1 :Number of Retrieved, Relevant, Irrelevant documents using Traditional and when using similarity thesaurus Retrieved Relevant Irrelevant Traditional-Full Words 1706 763 943 Traditional -Stemmed words 2399 1022 1377 Thesaurus -Full Words 1704 771 933 Thesaurus -Stemmed words 2029 991 1038 Figure 1 & Figure 2: Comparison values of the Average Recall Precision when using Traditional and when using similarity thesaurus Traditional (Recall Docs) 0 500 1000 1500 2000 2500 3000 Retrieved Relevant Irrelevant No.ofDocs Full Words Stemmed words Figure 1: relevant and irrelevant shape from retrieved document in Tradition system
  • 7. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.3, No.2, April 2013 7 Similarity Thesaurus (Recall Docs) 0 500 1000 1500 2000 2500 3000 Retrieved Relevant Irrelevant No.ofDocs Full Words Stemmed words Figure 2: relevant and irrelevant shape from retrieved document in the case of similarity thesaurus retrieving system Table 2, Figure 3, and Figure 4 shows the percentage of the relevant retrieved documents from all the relevant documents in the collection, when using Traditional and when using similarity thesaurus . Table 2: shows the percentage of the relevant retrieved documents % of Relevant Docs that Retrieved Traditional-Full Words 46.07487923 Traditional - Stemmed words 61.71497585 Thesaurus -Full Words 46.557971 Thesaurus -Stemmed words 62.681159 0 1 0 2 0 3 0 4 0 5 0 6 0 7 0 % R e l e v a n t D o c s t h a t R e F u l l w o r d s S t e m m e d w o r d s Figure 3: % of retrieved document in Traditional system
  • 8. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.3, No.2, April 2013 8 S i m i l a r i t y 0 2 0 4 0 6 0 8 0 % R e l e v a n t D o c s t h a t R e t r i e v e d F u l l w o r d s S t e m m e d w o r d s Figure 4: % of retrieved document in thesaurus system Table 3 shows percentage when using Traditional and when using similarity thesaurus Table 3: Percentage of using Traditional & similarity thesaurus Traditional Similarity Full words 46.07487923 46.55797101 Stemmed words 61.71497585 62.68115942 Table 4: shows how much better were the results of using Traditional and when using similarity thesaurus Average Recall Precision Recall Roots with using Similarity Thesaurus Roots with using Traditional retrieving % of Improvement for using Association Thesaurus over Traditional retrieving 0 0.908 0.917966102 -1.00% 0.1 0.87 0.875762712 -0.58% 0.2 0.810178571 0.785762712 2.44% 0.3 0.709464286 0.695254237 1.42% 0.4 0.664821429 0.626237288 3.86% 0.5 0.541428571 0.523389831 1.80% 0.6 0.438571429 0.442542373 -0.40% 0.7 0.325357143 0.290847458 3.45% 0.8 0.251428571 0.198305085 5.31% 0.9 0.13875 0.084745763 5.40% q1 `2q2 0.047288136 0.91% Table 4: Effect of using Traditional and when using similarity thesaurus Table 5 shows the effect of using the stemmed words for information retrieving were always better than using Full words, and ensure that using thesauri is much better than using Traditional information retrieval.
  • 9. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.3, No.2, April 2013 9 Table 5: Average of all the Relative work 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Traditional- Full Words 0.91 0.87 0.81 0.71 0.66 0.54 0.44 0.33 0.25 0.14 Traditional - Stemmed words 0.92 0.88 0.79 0.7 0.63 0.52 0.44 0.29 0.2 0.08 Thesaurus - Full Words 0.92 0.8 0.68 0.54 0.4 0.33 0.21 0.13 0.11 0.07 Thesaurus - Stemmed words 0.9 0.82 0.65 0.49 0.39 0.26 0.21 0.12 0.08 0.04 Averages 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Recall Precision Simi with Stemming Trad with Stemming Simi with fullword Trad with fullword Figure 5: A comparison between the values of average Recall Precision with all cases 6. CONCLUSION:- In this paper, researcher built similar thesaurus in tow mechanisms (full word, and stemmed) , researcher found out that the results for retrieval information in used stemmed word get better result than using full word in case of used traditional search, but when researcher used the similar thesaurus in both mechanisms, got better result of stemmed than full word.. Finally, the results when using stemmed in similar thesaurus got better result over than using stemmed in traditional search.
  • 10. International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.3, No.2, April 2013 10 7. FUTURE WORK:- In this study i built the similar thesaurus in both mechanisms full word and stemming word. I hope in the future to build an associative thesaurus and compare between them to know what is better for retrieval information. REFERENCES [1] Adriani, M. and Croft, W. “Retrieval Effectiveness of Various Indexing Techniques on Indonesian News Articles”, 1997. [2] Al-Shalabi, R. Kannan, G., Al-Jaam, J., Hasnah A., and Helat, E., “Stop-word Removal Algorithm for Arabic Language”, processing of the 1st International Conference on Information & Communication Technologies: from theory to Applications-ICTTA, Damascus, 2004. [3] baeza-yates R.,and Rierio-neto B., “Modern Information Retrieval” , Addison-Wesley,New- York,1999. [4] Darwish, K., “Building a Shallow Arabic Morphological Analyzer in one Day”, Acl Workshop on Computational Approaches to Semitic Language, PP 47-57, 2002. [5] Kanaan, G. “Comparing Automatic Statistical and Syntactic Phrase Indexing for Arabic Information Retrieval”, Ph.D.Thesis, University of Illinois, Chicago, USA, 1997. [6] Monica Lassi ”Automatic Thesaurus Construction”, A paper written within the GSLT course Linguistic Resource, autumn 2002. [7]Al-Shalabi, R. Kannan, G., Al-Jaam, J., Hasnah A., and Helat, E., “Stop-word Removal Algorithm for Arabic Language”, processing of the 1st International Conference on Information & Communication Technologies: from theory to Applications-ICTTA, Damascus, 2004. [8]Kanaan, G. Ghassan and Wedyan, M. (2006). Constructing an Automatic Thesaurus to Enhance Arabic Information Retrieval System. The 2nd Jordanian International Conference on Computer Science and Engineering, JICCSE 2006, Salt, Jordan. 89-97. [9]Smeaton, A.F., Van Rijsbergen, C.J., The Retrieval Effects of Query Expansion on Feedback Document Retrieval System, The Computer Journal, 26(3), p239-46, 1983. [10]T. R. Addis, Machine Understanding of Natural Language, International Journal of Man-Machine Studies, Vol. 9, No. 2, March 1977, pp. 207-222. [11]Aljlayl, M, and Frieder, O, "on Arabic Search: Improving the Retrieval Effectiveness via a Light Stemming Approach" ACM Conference on Information and Knowledge Management, Mcelean, VA , November, 2002 AUTHORS Assistant Prof. in Zarqa University , jordan