Vsm lsi
1. Vector Space Model & Latent Semantic Indexing
Ryan Reck
November 18, 2008
2. 1 Introduction
2 Vector Space Model
3 Latent Semantic Indexing
4 Applications of VSM & LSI
5 Comparison: VSM vs. LSI
6 Conclusion
7 References
3. Introduction
What are VSM & LSI?
VSM & LSI are techniques from information retrieval for managing
documents based on their content.
4. Vector Space Model
Models each document as a vector in a multi-dimensional space.
Similar documents are closer together; the angle between two
vectors can be interpreted as the similarity of the two documents.
Queries are translated into the same vector space, and the nearest
documents (by distance between points, or by angle between
vectors) are returned as the desired documents.
Originated from the SMART Information Retrieval project at
Cornell University. First published in 1975 [2].
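The ranking idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the SMART system's implementation: the helper names (to_vector, cosine, rank), the whitespace tokenizer, the use of raw term counts as vector components, and the three sample documents are all assumptions made for the example.

```python
import math
from collections import Counter

def to_vector(text):
    """Map text to a sparse term-count vector (term -> count)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, documents):
    """Return (score, document) pairs, most similar first."""
    q = to_vector(query)
    scored = [(cosine(q, to_vector(doc)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical sample collection.
docs = [
    "the laptop computer has a fast processor",
    "the elevator stopped between floors",
    "a computer needs memory and a processor",
]
ranking = rank("computer processor", docs)
```

With these sample documents the laptop document ranks first, and the elevator document scores 0.0 because it shares no terms with the query.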
6. Vector Space Model
Calculating Term Weights
VSM introduced the Term Frequency - Inverse Document
Frequency method of calculating term weights.
TF-IDF gives greater weight to less common terms, and less
weight to common ones, since rare terms will better
distinguish documents than common terms.
$$ w_{t,d} = \mathrm{tf}_{t} \cdot \log \frac{|D|}{|\{t \in d\}|} $$
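As a rough sketch of the weighting formula, the snippet below reads the term frequency as the number of times a term occurs in a document, |D| as the total number of documents, and the denominator as the number of documents that contain the term. The function name tf_idf, the whitespace tokenizer, the natural logarithm, and the sample documents are assumptions for illustration only.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document.

    weight = (count of term in document)
             * log(number of documents / documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical sample collection.
docs = [
    "the laptop computer",
    "the elevator",
    "the computer",
]
weights = tf_idf(docs)
```

In this toy collection "the" appears in every document, so its weight is log(3/3) = 0, while "laptop" appears in only one document and gets the largest weight in the first document.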
7. Latent Semantic Indexing
Builds on the Vector Space Model.
Extracts concepts from the term-document matrix.
Combines correlated dimensions into a single aggregate
dimension.
This allows the documents to be indexed by concept instead
of simple terms.
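One common way to combine correlated dimensions is a truncated singular value decomposition (SVD) of the term-document matrix. The slides do not say which factorization they have in mind, so treat the sketch below as one plausible reading. The toy matrix, the choice k = 2, and all variable names are assumptions for illustration; the code uses NumPy's np.linalg.svd.

```python
import numpy as np

# Hypothetical toy term-document matrix: rows are terms, columns are documents.
terms = ["computer", "laptop", "elevator", "floor"]
A = np.array([
    [2.0, 1.0, 0.0, 0.0],   # computer
    [1.0, 2.0, 0.0, 0.0],   # laptop
    [0.0, 0.0, 4.0, 2.0],   # elevator
    [0.0, 0.0, 2.0, 4.0],   # floor
])

k = 2  # number of concepts to keep (an assumed value for this toy data)

# Full SVD, then keep only the k largest singular values and their vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Each column of doc_concepts is one document in k concept coordinates.
doc_concepts = np.diag(s_k) @ Vt_k

def cosine(u, v):
    """Cosine of the angle between two dense vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Documents 0 and 1 use "computer" and "laptop" in different proportions.
sim_terms = cosine(A[:, 0], A[:, 1])                          # term space
sim_concepts = cosine(doc_concepts[:, 0], doc_concepts[:, 1])  # concept space
```

Each column of U_k holds the term weights that define one concept. In this toy data the first two documents are similar but not identical in term space, and they point in the same direction once "computer" and "laptop" are merged into a single concept dimension.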
8. Latent Semantic Indexing
Example
Good Example
{computer, laptop} -> {1.2 * computer + 0.9 * laptop}
Realistic Example
{computer, elevator} -> {1.2 * computer + 0.9 * elevator}
9. Applications of VSM & LSI
VSM, or some variation of it, is almost universal.
Search Engines
Apache Lucene
10. Comparison: VSM vs. LSI
Advantages of LSI
Handles synonymy and polysemy directly
Can match documents using differing vocabularies.
Can even match across different languages, after some
translated documents have been indexed [1].
Advantages of VSM
Much simpler, but still performs well
Handles new documents more easily; LSI's dimension
reduction can cause problems when documents are added.
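To make the new-document issue concrete, the sketch below uses "folding in", a standard shortcut for placing a new document in an existing LSI concept space without recomputing the decomposition. The slides do not name this technique, so it is shown here only as an assumed illustration; the toy matrix, k = 2, and the fold_in helper are hypothetical.

```python
import numpy as np

# Hypothetical toy term-document matrix: rows are terms, columns are documents.
terms = ["computer", "laptop", "elevator", "floor"]
A = np.array([
    [2.0, 1.0, 0.0, 0.0],   # computer
    [1.0, 2.0, 0.0, 0.0],   # laptop
    [0.0, 0.0, 4.0, 2.0],   # elevator
    [0.0, 0.0, 2.0, 4.0],   # floor
])

k = 2  # number of concepts kept (assumed)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]  # term-to-concept mapping, fixed once the index is built

def fold_in(term_counts):
    """Project a new document onto the existing k concepts.

    Terms that were not in the original vocabulary are silently dropped,
    and the concept axes themselves are not updated.
    """
    d = np.array([float(term_counts.get(t, 0)) for t in terms])
    return U_k.T @ d

# "escalator" was never seen when the index was built, so it is ignored.
with_new_term = fold_in({"laptop": 3, "escalator": 2})
without_new_term = fold_in({"laptop": 3})
```

The two projections are identical: the unseen term contributes nothing, and the concepts stay as they were until the decomposition is recomputed. A plain VSM index, by contrast, can simply add a new dimension for the new term.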
11. Conclusion
VSM and LSI are both good ways to index and compare
documents. VSM is pretty basic but still gets the job done. LSI
provides a more complex system, but it can do a very good job,
even under extreme circumstances, like multi-language datasets.
12. References
[1] Dumais, S. T., Letsche, T. A., Littman, M. L., and
Landauer, T. K.
Automatic cross-language retrieval using latent semantic
indexing.
In AAAI Symposium on Cross-Language Text and Speech
Retrieval. American Association for Artificial Intelligence,
March 1997.
[2] Salton, G., Wong, A., and Yang, C. S.
A vector space model for automatic indexing.
Commun. ACM 18, 11 (1975), 613–620.
[3] Latent semantic indexing, 2008.
http://en.wikipedia.com/wiki/Latent_semantic_indexing
[4] Vector space model, 2008.
http://en.wikipedia.com/wiki/Vector_space_model