Vector Space Model & Latent Semantic
              Indexing

              Ryan Reck


           November 18, 2008
1 Introduction

2 Vector Space Model

3 Latent Semantic Indexing

4 Applications of VSM & LSI

5 Comparison: VSM vs. LSI

6 Conclusion

7 References
Introduction
What are VSM & LSI?




    The Vector Space Model (VSM) and Latent Semantic Indexing (LSI) are
    information retrieval techniques for indexing and comparing
    documents based on their content.
Vector Space Model




      Models each document as a vector in a multi-dimensional space,
      with one dimension per term.
      Similar documents lie closer together; the angle between two
      vectors can be interpreted as the similarity of the two documents.
      Queries are translated into the same vector space, and the nearest
      documents (by distance between points, or by vector angle) are
      returned as the desired documents.
      Originated in the SMART Information Retrieval project at
      Cornell University. The model was first published in 1975 [2].
Vector Space Model
Example




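          doc_1 = \langle tf_{1,1}, tf_{2,1}, tf_{3,1}, \ldots, tf_{n,1} \rangle
          doc_2 = \langle tf_{1,2}, tf_{2,2}, tf_{3,2}, \ldots, tf_{n,2} \rangle
          sim(doc_1, doc_2) = \cos(\theta) = \frac{doc_1 \cdot doc_2}{\|doc_1\| \, \|doc_2\|}

          Here tf_{t,d} is the frequency of term t in document d. The
          plain dot product doc_1 \cdot doc_2 equals \cos(\theta) only
          when both vectors have been normalized to unit length.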
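
The similarity above, and the query matching described on the previous slide, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy corpus, the whitespace tokenizer, and the helper names `tf_vector`, `cosine_similarity`, and `rank` are hypothetical and do not come from the original slides.

```python
import math
from collections import Counter

def tf_vector(text, vocabulary):
    """Return raw term counts for `text`, ordered by `vocabulary`."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| |v|); 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy corpus, not taken from the slides.
docs = {
    "doc1": "laptop computer review",
    "doc2": "computer hardware news",
    "doc3": "elevator repair manual",
}
vocabulary = sorted({word for text in docs.values() for word in text.split()})
vectors = {name: tf_vector(text, vocabulary) for name, text in docs.items()}

def rank(query):
    """Map the query into the same vector space and sort documents by cosine."""
    q = tf_vector(query, vocabulary)
    scores = {name: cosine_similarity(q, vec) for name, vec in vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(rank("elevator repair"))
```

Treating the query as just another short document keeps the ranking step identical to the document-to-document comparison.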
Vector Space Model
Calculating Term Weights




         VSM introduced the Term Frequency - Inverse Document
         Frequency (TF-IDF) method of calculating term weights.
         TF-IDF gives greater weight to rare terms and less weight to
         common ones, since rare terms distinguish documents better
         than common terms do.
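         w_{t,d} = tf_{t,d} \cdot \log\left( \frac{|D|}{|\{d' \in D : t \in d'\}|} \right)

         where |D| is the number of documents in the collection and the
         denominator is the number of documents that contain term t.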
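
As a rough sketch of the weighting above, the following Python computes tf * log(|D| / df) for a small collection. The sample documents, the whitespace tokenizer, the natural logarithm, and the function name `tf_idf` are assumptions made for illustration; real systems often use smoothed or normalized variants.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(|D| / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical sample documents, not taken from the slides.
docs = ["the computer", "the laptop", "the elevator the stairs"]
weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    print(doc, w)
```

A term that occurs in every document, such as "the" here, gets log(|D| / |D|) = 0 and so contributes nothing to similarity, no matter how often it appears.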
Latent Semantic Indexing




      Built on top of the Vector Space Model.
      Extracts concepts from the term-document matrix.
          Combines correlated dimensions into a single aggregate
          dimension.
      This allows documents to be indexed by concept instead
      of by individual terms.
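
To make "combining correlated dimensions" concrete, here is a minimal pure-Python sketch that extracts only the single strongest concept from a tiny term-document matrix. LSI is normally computed with a truncated singular value decomposition that keeps several concepts; this sketch swaps in plain power iteration on A A^T to approximate just the top one. The matrix values and the helper names `matvec`, `transpose`, and `top_concept` are hypothetical.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def top_concept(term_doc, iterations=100):
    """Approximate the dominant left singular vector of the term-document
    matrix A by power iteration on A A^T (rank-1 only, for illustration)."""
    doc_term = transpose(term_doc)
    u = [1.0] * len(term_doc)
    for _ in range(iterations):
        # One step of u <- A (A^T u), then rescale to unit length.
        u = matvec(term_doc, matvec(doc_term, u))
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]
    return u

# Hypothetical counts: rows are terms, columns are four documents.
terms = ["computer", "laptop", "elevator"]
term_doc = [
    [2, 1, 1, 0],  # computer
    [1, 2, 1, 0],  # laptop
    [0, 0, 0, 1],  # elevator
]

concept = top_concept(term_doc)
# Coordinate of each document along the extracted concept.
doc_scores = matvec(transpose(term_doc), concept)

print(" + ".join(f"{w:.2f} * {t}" for w, t in zip(concept, terms)))
print([round(s, 2) for s in doc_scores])
```

In this toy matrix "computer" and "laptop" co-occur, so they receive equal weight in the concept while "elevator" gets essentially none, and the fourth document scores near zero on it.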
Latent Semantic Indexing
Example




    Good Example
    {computer, laptop} -> {1.2 * computer + 0.9 * laptop}

    Realistic Example
    {computer, elevator} -> {1.2 * computer + 0.9 * elevator}
Applications of VSM & LSI




      VSM, or some variation of it, is used almost universally.
      Search Engines
          Apache Lucene
Comparison: VSM vs. LSI


  Advantages of LSI

      Handles synonymy and polysemy directly.
      Can match documents that use differing vocabularies.
      Can even match across different languages, after some
      translated documents have been processed [1].


  Advantages of VSM

      Much simpler, but still performs well.
      Handles new documents more easily; LSI's dimension
      reduction can cause problems with this.
Conclusion




   VSM and LSI are both good ways to index and compare
   documents. VSM is pretty basic but still gets the job done. LSI
   provides a more complex system, but it can do a very good job,
   even under extreme circumstances, like multi-language datasets.
References

      [1] Dumais, S. T., Letsche, T. A., Littman, M. L., and
          Landauer, T. K.
          Automatic cross-language retrieval using latent semantic
          indexing.
          In AAAI Symposium on Cross-Language Text and Speech
          Retrieval. American Association for Artificial Intelligence,
          March 1997.
      [2] Salton, G., Wong, A., and Yang, C. S.
          A vector space model for automatic indexing.
          Commun. ACM 18, 11 (1975), 613–620.
      [3] Latent semantic indexing, 2008.
          http://en.wikipedia.com/wiki/Latent_semantic_indexing.
      [4] Vector space model, 2008.
          http://en.wikipedia.com/wiki/Vector_space_model.
