Safa’a I. Hajeer / International Journal of Engineering Research and Applications (IJERA)
ISSN: 2248-9622, www.ijera.com
Vol. 2, Issue 4, July-August 2012, pp. 2085-2090
Vector Space Model: Comparison Between Euclidean Distance &
Cosine Measure On Arabic Documents
Safa’a I. Hajeer
Department of Computer Information Systems
Abstract
The purpose of this research is to give an idea of the Euclidean distance and the cosine measure based on an Arabic document collection, and to present the points of comparison between those measures. The most common points of comparison concern the system's performance under the two measures, paying attention to time, space, and the recall/precision evaluation measures. The time measure represents the time needed to retrieve the documents relevant to a specific query. Space, on the other hand, represents the memory capacity needed to reach the results. Finally, recall/precision shows the effectiveness of the system.
A collection of 242 Arabic abstracts from the proceedings of the Saudi Arabian National Computer Conferences was used, and the system was tested with many sample queries to verify that it works correctly. Note: the index terms are full words, indexed after removing the stop words, to reach better results through exact matches.
The results show that the Euclidean distance is exact, while the cosine measure, although theoretically exact, suffers from rounding errors. The time complexity of the two measures was found to be the same, O(N); note, however, that calculating the cosine measure equation needs more time and space than the Euclidean distance, even though the asymptotic time complexity is the same for both.

Keywords: IR, Vector space model, ranking algorithm, Euclidean distance, Cosine measure.

Introduction
Information Retrieval (IR) is a field devoted primarily to efficient, automated indexing and retrieval of documents. There are a variety of sophisticated techniques for quickly searching documents with little or no human intervention [6]. Traditional information retrieval systems usually adopt index terms to index and retrieve documents. An index term is a keyword (or group of related words) which has some meaning of its own (i.e. which usually has the semantics of a noun) [1]. Put simply, it is a word that appears in the text of a document in a collection, and it is fundamental to the information retrieval task of helping users find the information they need in an easy way.
So the core issue for information retrieval systems is predicting which documents are relevant and which are not. Such a decision is usually dependent on a ranking algorithm which attempts to establish a simple ordering of the documents retrieved. Documents appearing at the top of this ordering are considered to be more likely to be relevant.
A ranking algorithm operates according to basic premises (evidences, principles, ideas, foundations, grounds) regarding the notion of document relevance. Distinct sets of premises yield distinct information retrieval models. One of the IR models is the vector space model.
The Vector Space Model (VSM) is a popular approach to information retrieval system implementation. It is based on the idea of representing documents by vectors (arrays of numbers) in a high-dimensional vector space. To look more deeply into the vector space model, read the next section.

Vector Space Model
The vector model defines a vector that represents each document, and a vector that represents the query [9]. There is one component in each vector for every distinct term that occurs in the document collection. Once the vectors are constructed, the distance between the vectors, or the size of the angle between the vectors, is used to compute a similarity coefficient [2].
Consider a document collection with only two distinct terms, a and b. All vectors contain only two components. The first component represents occurrences of a, and the second represents occurrences of b. The simplest means of constructing a vector is to place a one in the corresponding vector component if the term appears, and a zero if the term does not appear. Consider a document D1 that contains two occurrences of term a and zero occurrences of term b. The vector <1, 0> represents this document using a binary representation. This binary representation can be used to produce a similarity coefficient, but it does not take into account the frequency of a term within a document. By extending the representation to include a count of the number of occurrences of the terms in each component, these frequencies can be considered [2].
Early work in the field used manually assigned weights. Similarity coefficients that employed automatically assigned weights were compared to manually assigned weights. Repeatedly, it was shown that automatically assigned weights perform at least as well as manually assigned weights [9].
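The binary and frequency-based representations described above can be illustrated with a minimal Python sketch. This is not the system's actual code; the helper names and the toy document are invented for the example:

```python
# Build binary and term-frequency vectors for a toy two-term vocabulary.

def binary_vector(doc_terms, vocabulary):
    """1 if the term appears in the document at all, else 0."""
    return [1 if term in doc_terms else 0 for term in vocabulary]

def frequency_vector(doc_terms, vocabulary):
    """Number of occurrences of each vocabulary term in the document."""
    return [doc_terms.count(term) for term in vocabulary]

vocabulary = ["a", "b"]
d1 = ["a", "a"]  # document D1: two occurrences of "a", none of "b"

print(binary_vector(d1, vocabulary))     # [1, 0]
print(frequency_vector(d1, vocabulary))  # [2, 0]
```

The binary form records only presence, so D1 maps to <1, 0>; the frequency form keeps the count, mapping D1 to <2, 0>, which is the extension the text describes.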
Euclidean Distance
Around 300 BC, the Greek mathematician Euclid laid down the rules of what has now come to be called "Euclidean geometry", which is the study of the relationships between angles and distances in space [4].
Euclidean distance is a measure used to find the distance between the query and the document. The reader can find the Euclidean distance formula later on (in The Algorithm section).

Cosine Measure
The cosine measure is one of the similarity measures of the VSM; it indicates which documents are closest to the query (the user request). A cosine value of zero means that the query and document vectors are orthogonal to each other: there is no match, or the term simply does not exist in the document being considered. The reader can find the cosine measure formula later on (in The Algorithm section).

The Algorithm:
1- The frequency of each term is found and a preliminary document vector is formed.
2- The document vectors are normalized using the formula below; each co-ordinate of the vector becomes:
co-ordinate = ( term 1 frequency ) / sqrt[ ( term 1 freq )^2 + ( term 2 freq )^2 + ... + ( term n freq )^2 ]
3- Similarly, all the other co-ordinates of the document vector and of the query are calculated (note: the query is also treated as a document). Note that the co-ordinate = the weight of the term.
4- Find the Euclidean distance. The Euclidean distance formula is used to calculate the distance between the query and the document:
E.D(D, Q) = sqrt[ ∑ ( tk - qk )^2 ]
5- Find the cosine measure. The cosine measure formula is used to calculate the cosine measure between the query and the document:
C.M(D, Q) = ∑( tk * qk ) / [ sqrt( ∑ tk^2 ) * sqrt( ∑ qk^2 ) ]
6- The results are then ranked. For the Euclidean distance (E.D), ranking is done from lowest distance to highest distance (i.e. the lowest E.D is placed first). For the cosine measure (C.M), ranking is done from highest value to lowest value (i.e. the highest C.M is placed first). The reason for this: if the angle between the vectors is small, they are said to be near each other, and a small angle means a high cosine value (e.g. cos 0° = 1).
7- At last, check the effectiveness of the system by measuring the accuracy (recall/precision).

Measuring Accuracy
Accuracy of an information retrieval system is commonly measured using the metrics of precision and recall, defined in Equation 1 below. Precision measures how much junk (how many non-relevant documents) is returned for each relevant one. Recall measures how many of the relevant documents were found, no matter what else was found. These measures assume prior knowledge of which documents are relevant to each query [2].

Precision = number of relevant documents retrieved / total number of retrieved documents

Recall = number of relevant documents retrieved / total number of relevant documents

Equation 1: Precision and recall definition
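Steps 2 and 4-6 of the algorithm above can be sketched in Python. This is a minimal illustration rather than the system's actual implementation; the three-term vectors and document names are invented for the example:

```python
import math

def normalize(vec):
    """Step 2: divide each component by the vector's Euclidean length."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec] if length else list(vec)

def euclidean_distance(d, q):
    """Step 4: E.D(D, Q) = sqrt(sum((tk - qk)^2))."""
    return math.sqrt(sum((t - s) ** 2 for t, s in zip(d, q)))

def cosine_measure(d, q):
    """Step 5: C.M(D, Q) = sum(tk*qk) / (sqrt(sum(tk^2)) * sqrt(sum(qk^2)))."""
    dot = sum(t * s for t, s in zip(d, q))
    norm_d = math.sqrt(sum(t * t for t in d))
    norm_q = math.sqrt(sum(s * s for s in q))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

# Step 6: rank documents -- ascending by E.D., descending by C.M.
docs = {"D1": [2.0, 0.0, 1.0], "D2": [0.0, 3.0, 1.0]}
query = [1.0, 0.0, 1.0]
by_ed = sorted(docs, key=lambda n: euclidean_distance(normalize(docs[n]), normalize(query)))
by_cm = sorted(docs, key=lambda n: cosine_measure(docs[n], query), reverse=True)
print(by_ed, by_cm)  # ['D1', 'D2'] ['D1', 'D2']
```

In this toy run both measures rank D1 first; once the vectors are normalized, the two rankings coincide, which is consistent with the observation in the Conclusion that for normalized data the two measures behave very similarly.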
Results
We can see the results reached from the comparison:

Measure Name: Cosine Measure
Description: A simple brute-force algorithm that calculates the cosine measure between the query profile and all profiles in the system, and then returns a list of the best matching documents. For the cosine measure to work, all profiles must be of unit length, and therefore they must be normalized first. This can be done as a one-time preprocessing step, so it does not impose too severe an overhead. Unfortunately, reduction does not preserve unit length, and therefore every time a profile is reduced it needs to be normalized again. These repeated normalization operations add computational cost and reduce the accuracy of this method.
Data Structure: Linked list of all document profiles
Accuracy (Theoretical): Theoretically exact, but suffers from rounding errors because of excessive normalizing steps
Theoretical Time Complexity: O(N)

Measure Name: Euclidean Distance
Description: A very simple brute-force algorithm that calculates the Euclidean distance from the query profile to all profiles in the system, and then returns a list of the best matching ones.
Data Structure: Linked list of all document profiles
Accuracy (Theoretical): Exact
Theoretical Time Complexity: O(N)
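The precision and recall metrics defined in Equation 1 can be computed directly, as in the following sketch; the document IDs and counts here are invented for illustration, not taken from the actual test collection:

```python
def precision_recall(retrieved, relevant):
    """Equation 1: precision and recall over sets of document IDs."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: 4 documents retrieved, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
p, r = precision_recall(["d1", "d2", "d3", "d4"],
                        ["d1", "d2", "d3", "d5", "d6", "d7"])
print(p, r)  # 0.75 0.5
```

Computing these two numbers for each of the sample queries, and averaging, gives the recall/precision comparison plotted in Figure 1.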
To focus on accuracy in this information retrieval system, see the graph:

[Figure 1: Recall/precision using 50 queries as a sample (precision versus recall for the cosine measure and the Euclidean distance)]

To know more about the space used by the system during the testing on the documents, see the graph below:

[Figure 2: Space requirement (size in KB versus number of documents, for the cosine measure and the Euclidean distance)]

Conclusion
Through the theoretical analysis and the experimental results: for normalized data, the Euclidean distance and the cosine measure become even more similar. The cosine measure suffers from rounding error because of excessive normalizing steps. Note that the repeated normalization operations add computational cost, so what is sought is data that is already normalized, with no need to perform any normalization operation on it, in order to reach the best results without any rounding error.
In future work, the plan is to extend this project to analyze other distance measures, such as the Manhattan distance, Chebyshev distance, power distance, etc.

References:
[1] Baeza-Yates, Ricardo, Ribeiro-Neto, Berthier (1999), Modern Information Retrieval, New York, USA.
[2] Chowdhury, Abdur, McCabe, M. Catherine (1993), Improving Information Retrieval Systems using Part of Speech Tagging.
[3] Deitel (2003), Java How to Program, USA, fifth edition.
[4] Euclidean Space - Wikipedia (2012), the free encyclopedia, http://en.wikipedia.org/wiki/Euclidean_space, available on: July 2012.
[5] Salton, G., McGill, M. J. (1983), Introduction to Modern Information Retrieval, McGraw-Hill Inc.
[6] Information Retrieval & Computational Geometry (2012), www.ddj.com/dept/architect/184405928, available on: July 2012.
[7] Information Retrieval - Wikipedia (2012), the free encyclopedia, http://en.wikipedia.org/wiki/Information_retrieval, available on: Jan. 2012.
[8] Qian, Gang, Sural, Shamik, Gu, Yuelong, Pramanik, Sakti (2004), Similarity between Euclidean and cosine angle distance for nearest neighbor queries, in Proceedings of the 19th Annual ACM Symposium on Applied Computing (SAC 2004), Nicosia, Cyprus.
[9] Salton, G., Wong, A., and Yang, C. S. (1975), A Vector Space Model for Automatic Indexing, Communications of the ACM, Vol. 18, Issue 11, pp. 613-620, New York, USA.

Appendix A
The following are the queries used to test the system (English glosses are given in parentheses where the Arabic text is legible):
Q1: ًٌاسرخذاَ اٌحاسة اال
Q2: اسرشظاع اٌّؼٍِٛاخ
Q3: االداسج ٚ اٌرخطٍظ
Q4: ٍٍُاٌرذسٌة ٚ اٌرؼ
Q5: اٌرشٍِض ٚ اٌرشفٍش
Q6: اٌرؼٍٍُ تّساػذج اٌحاسة
Q7: اٌرؼٍٍُ تٛاسطح اٌحاسة
Q8: ًٌاٌحاسة اال
Q9: اٌحاسثاخ اٌصغٍشج
Q10: اٌحاسثاخ اٌّرٕاٍ٘ح اٌصغش
Q11: ٍٍُاٌحاسٛب ٚ اٌرؼ
Q12: اٌحط ٚ اٌؼّشج
Q13: ًاٌحشف اٌؼشت
Q14: اٌخطح اٌٛطٍٕح ٌٍّؼٍِٛاخ
Q15: ًاٌخٍٍط اٌؼشت
Q16: اٌذٚائش اٌّرىاٍِح
Q17: ًاٌزواء االصطٕاػ
Q18: ًٌاٌزواء اال
Q19: ًاٌؼاٌُ اٌؼشت
Q20: ٌُاٌمشاْ اٌىش
Q21: اٌىٍّاخ اٌؼشتٍح
Q22: اٌٍغاخ اٌطثٍؼٍح
Q23: اٌٍغح اٌؼشتٍح
Q24: اٌّذسسح االٌىرشٍٚٔح
Q25: اٌٍّّىح اٌؼشتٍح اٌسؼٛدٌح
Q26: اٌّٛاسد اٌثششٌح
Q27: ًإٌص اٌؼشت
Q28: امن المعلومات (information security)
Q29: انظمة الحاسبات االلية (computer systems)
Q30: برامج الحاسب االلي (computer programs)
Q31: برمجة الحاسبات االلية (computer programming)
Q32: بنوك المعلومات (information banks)
Q33: تدريس مواد الحاسب (teaching computer courses)
Q34: تركيب الجملة العربية (structure of the Arabic sentence)
Q35: تطبيقات الكمبيوتر (computer applications)
Q36: تعريب البرامج (Arabization of software)
Q37: تعريب الحاسبات االلية (Arabization of computers)
Q38: تعريب الحاسوب (Arabization of the computer)
Q39: تعليم الكمبيوتر (computer education)
Q40: تقنية المعلومات (information technology)
Q41: تمييز االشكال بواسطة الحاسب االلي (pattern recognition by computer)
Q42: جامعة الملك سعود (King Saud University)
Q43: جامعة الملك عبد العزيز (King Abdulaziz University)
Q44: جمعية الحاسبات السعودية (Saudi Computer Society)
Q45: شبكات الحاسب االلي (computer networks)
Q46: شبكة اتصاالت الحاسبات (computer communications network)
Q47: شبكة االتصاالت (the communications network)
Q48: علوم الحاسب و المعلومات (computer and information science)
Q49: قواعد البيانات (databases)
Q50: قواعد المعلومات (information bases)