Machine Learning & Support Vector Machines
Lecture 9
Sean A. Golliher
Let a, b be two events.

  $p(a \mid b)\,p(b) = p(a \cap b) = p(b \mid a)\,p(a)$

  $p(a \mid b) = \dfrac{p(b \mid a)\,p(a)}{p(b)}$
Let D be a document in the collection.
Let R represent relevance of a document w.r.t. given (fixed)
query and let NR represent non-relevance.

Need to find p(R|D) - probability that a retrieved document D
is relevant.
  $p(R \mid D) = \dfrac{p(D \mid R)\,p(R)}{p(D)}$

  $p(NR \mid D) = \dfrac{p(D \mid NR)\,p(NR)}{p(D)}$

p(R), p(NR) - prior probability of retrieving a (non-)relevant document.
p(D|R), p(D|NR) - probability that if a relevant (non-relevant)
document is retrieved, it is D.
   Suppose we have a vector representing the presence and
    absence of terms (1,0,0,1,1). Terms 1, 4, & 5 are present.
   What is the probability of this document occurring in the
    relevant set?
   $p_i$ is the probability that term i occurs in a document from
    the relevant set. $(1 - p_i)$ is the probability that term i does
    not occur in a document from the relevant set.
   This gives us: $p_1 (1 - p_2)(1 - p_3)\,p_4\,p_5$
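
The product above can be sketched in a few lines of Python. The function name `doc_probability` and the per-term probabilities are hypothetical values chosen only for illustration; the slide gives no numbers for them.

```python
def doc_probability(term_vector, p):
    """Probability of a binary term vector, assuming terms occur independently.

    term_vector: sequence of 0/1 flags (1 = term present in the document)
    p: p[i] is the probability that term i occurs in a relevant document
    """
    if len(term_vector) != len(p):
        raise ValueError("term_vector and p must have the same length")
    prob = 1.0
    for present, p_i in zip(term_vector, p):
        # Present terms contribute p_i, absent terms contribute (1 - p_i).
        prob *= p_i if present else (1.0 - p_i)
    return prob


# Hypothetical term probabilities (not from the lecture).
p = [0.5, 0.2, 0.1, 0.8, 0.4]
doc = (1, 0, 0, 1, 1)  # terms 1, 4 and 5 are present

print(doc_probability(doc, p))  # p1 * (1-p2) * (1-p3) * p4 * p5
```

The independence assumption is what allows the joint probability to be written as a simple product over terms.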
   Popular and effective ranking algorithm
    based on binary independence model
     adds document and query term weights



     k1, k2 and K are parameters whose values are set
      empirically
     dl is doc length
     Typical TREC value for k1 is 1.2, k2 varies from 0
      to 1000, b = 0.75
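
The parameters k1, k2, K and b match the BM25 ranking function, so the scoring formula is presumably the commonly cited BM25 form written out below. This is an assumption, since the formula itself is not in the transcript. Here $r_i$ and $R$ are relevance counts, $n_i$ is the number of documents containing term i, $N$ is the collection size, $f_i$ and $qf_i$ are the frequencies of term i in the document and in the query, and $avdl$ is the average document length.

```latex
\[
\sum_{i \in Q}
\log \frac{(r_i + 0.5)/(R - r_i + 0.5)}
          {(n_i - r_i + 0.5)/(N - n_i - R + r_i + 0.5)}
\cdot \frac{(k_1 + 1)\, f_i}{K + f_i}
\cdot \frac{(k_2 + 1)\, qf_i}{k_2 + qf_i},
\qquad
K = k_1 \left( (1 - b) + b \cdot \frac{dl}{avdl} \right)
\]
```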
   Query with two terms, “president lincoln” (qf = 1, where qf is
    the frequency of term i in the query)
   No relevance information (r and R are zero)
   N = 500,000 documents
   “president” occurs in 40,000 documents (n1 = 40,000)
   “lincoln” occurs in 300 documents (n2 = 300)
   “president” occurs 15 times in doc (f1 = 15)
   “lincoln” occurs 25 times (f2 = 25)
   document length is 90% of the average length (dl/avdl = 0.9)
   k1 = 1.2, b = 0.75, and k2 = 100
   K = 1.2 · (0.25 + 0.75 · 0.9) = 1.11
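
The example can be worked through numerically with a short sketch. It assumes the scoring function is the commonly cited BM25 form (a log relevance/idf weight times a document term-frequency factor times a query term-frequency factor) and that the logarithm is natural; neither is stated in the transcript. The function names `length_norm` and `bm25_term` are hypothetical.

```python
import math


def length_norm(k1, b, dl_over_avdl):
    """K = k1 * ((1 - b) + b * dl/avdl), matching the K computed in the example."""
    return k1 * ((1 - b) + b * dl_over_avdl)


def bm25_term(n, f, qf, N, k1, k2, K, r=0, R=0):
    """Contribution of one query term (assumed BM25 form, natural log)."""
    weight = math.log(
        ((r + 0.5) / (R - r + 0.5))
        / ((n - r + 0.5) / (N - n - R + r + 0.5))
    )
    doc_tf = (k1 + 1) * f / (K + f)        # document term-frequency factor
    query_tf = (k2 + 1) * qf / (k2 + qf)   # query term-frequency factor
    return weight * doc_tf * query_tf


N = 500_000
k1, k2, b = 1.2, 100, 0.75
K = length_norm(k1, b, 0.9)

# (n_i, f_i, qf_i) for "president" and "lincoln"
terms = [(40_000, 15, 1), (300, 25, 1)]
score = sum(bm25_term(n, f, qf, N, k1, k2, K) for n, f, qf in terms)

print(round(K, 2), round(score, 2))
```

Under these assumptions the rarer term “lincoln” contributes roughly three times as much to the score as “president”, even though their in-document frequencies are of the same order.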
   Unigram language model (simplest form)
     probability distribution over the words in a
      language
     generation of text consists of pulling words out of
      a “bucket” according to the probability distribution
      and replacing them
   N-gram language model
     some applications use bigram and trigram
      language models where probabilities depend on
      previous words
     Based on previous n-1 words
   A topic in a document or query can be
    represented as a language model
     i.e., words that tend to occur often when
     discussing a topic will have high probabilities in
     the corresponding language model
   Rank documents by the probability that the query
    could be generated by the document language
    model (i.e. same topic) P(Q|D)
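   Assuming a uniform document prior and a unigram model:
     $P(Q \mid D) = \prod_{i} P(q_i \mid D)$
   Obvious estimate for unigram probabilities is
     $P(q_i \mid D) = \dfrac{f_{q_i,D}}{|D|}$
     $f_{q_i,D}$ is the number of times word $q_i$ occurs in document D.
      $|D|$ is the number of words in the document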
     If query words are missing from document, score
      will be zero
     Missing 1 out of 4 query words same as missing 3
      out of 4. Not good for long queries!
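
A minimal sketch of this unsmoothed estimate shows the zero-score problem directly. The function name `query_likelihood_mle` and the toy document are hypothetical, chosen only for illustration.

```python
from collections import Counter


def query_likelihood_mle(query_terms, doc_terms):
    """P(Q|D) as a product of unsmoothed estimates count(q, D) / |D|."""
    if not doc_terms:
        raise ValueError("document must contain at least one word")
    counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 1.0
    for q in query_terms:
        # A query word missing from the document contributes a factor of 0.
        score *= counts[q] / doc_len
    return score


# Toy document (hypothetical).
doc = "president lincoln gave the gettysburg address".split()

print(query_likelihood_mle(["president", "lincoln"], doc))  # both words present
print(query_likelihood_mle(["president", "kennedy"], doc))  # one word missing
```

The second query scores exactly zero no matter how well the remaining words match, which is the weakness noted above.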
   Document texts are a sample from the
    language model
     Missing words should not have zero probability of
     occurring (calculating probability query could be
     generated from document)
   Smoothing is a technique for estimating
    probabilities for missing (or unseen) words
     lower (or discount) the probability estimates for
      words that are seen in the document text
     assign that “left-over” probability to the estimates
      for the words that are not seen in the text
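
One common way to do this is linear interpolation with a collection-wide language model (often called Jelinek-Mercer smoothing). The slide does not name a specific method, so the technique, the function name `smoothed_query_likelihood`, the weight `lam` and the toy data below are all assumptions for illustration.

```python
from collections import Counter


def smoothed_query_likelihood(query_terms, doc_terms, collection_terms, lam=0.5):
    """P(Q|D) with each word probability interpolated with a collection model.

    P(q|D) = (1 - lam) * count(q, D)/|D| + lam * count(q, C)/|C|
    """
    if not doc_terms or not collection_terms:
        raise ValueError("document and collection must be non-empty")
    doc_counts = Counter(doc_terms)
    coll_counts = Counter(collection_terms)
    doc_len, coll_len = len(doc_terms), len(collection_terms)
    score = 1.0
    for q in query_terms:
        p_doc = doc_counts[q] / doc_len      # discounted by (1 - lam)
        p_coll = coll_counts[q] / coll_len   # supplies the "left-over" mass
        score *= (1 - lam) * p_doc + lam * p_coll
    return score


# Toy data (hypothetical): the collection is this document plus one other.
doc = ["president", "lincoln", "gave", "the", "address"]
collection = doc + ["president", "kennedy", "gave", "a", "speech"]

# "kennedy" is missing from doc, yet the score is no longer zero.
print(smoothed_query_likelihood(["president", "kennedy"], doc, collection))
```

With `lam = 0` this reduces to the unsmoothed estimate; larger values of `lam` shift more probability toward words seen only elsewhere in the collection.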
   Informational
     Finding information about some topic which may be on one or
       more web pages
     Topical search
   Navigational
     finding a particular web page that the user has either seen before
       or is assumed to exist
   Transactional
     finding a site where a task such as shopping or downloading
       music can be performed

    Broder (2002) http://www.sigir.org/forum/F2002/broder.pdf
 For effective navigational and transactional
  search, need to combine features that reflect
  user relevance
 Commercial web search engines combine
  evidence from hundreds of features to
  generate a ranking score for a web page
     page content, page metadata, anchor text, links
      (e.g., PageRank), and user behavior (click logs)
     page metadata – e.g., “age”, how often it is
      updated, the URL of the page, the domain name
      of its site, and the amount of text content
   SEO: understanding the relative importance
    of features used in search and how they can
    be optimized to obtain better search rankings
    for a web page
     e.g., improve the text used in the title tag, improve
      the text in heading tags, make sure that the
      domain name and URL contain important
      keywords, and try to improve the anchor text and
      link structure
     Some of these techniques are regarded as not
      appropriate by search engine companies
   Toolkit, written in Java, for experimenting with text.

   http://www.galagosearch.org/quick-start.html
   Considerable interaction between these
    fields
     Arthur Samuel: 1959 – Checkers game. World’s
     first self-learning program. IBM 701.
   Web query logs have generated new wave of
    research
     e.g., “Learning to Rank”
   Supervised Learning
     Regression analysis
   Classification Problems
     Support Vector Machines (SVM)
   Unsupervised Learning
     http://www.youtube.com/watch?v=GWWIn29ZV4Q
 Reinforcement Learning
 Learning Theory
     How much training data do we need?
     How accurately can we predict an event to 99%
     accuracy?
 Papers: Boser et al., 1992
 Standard SVM [Cortes and Vapnik, 1995]
Lecture 9 - Machine Learning and Support Vector Machines (SVM)

Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 

Lecture 9 - Machine Learning and Support Vector Machines (SVM)

  • 1. Machine Learning & Support Vector Machines Lecture 9 Sean A. Golliher
  • 2. Let a, b be two events. $p(a \mid b)\,p(b) = p(a \cap b) = p(b \mid a)\,p(a)$, so $p(a \mid b) = \frac{p(b \mid a)\,p(a)}{p(b)}$.
  • 3. Let D be a document in the collection. Let R represent relevance of a document w.r.t. a given (fixed) query and let NR represent non-relevance. Need to find p(R|D), the probability that a retrieved document D is relevant.
      • $p(R \mid D) = \frac{p(D \mid R)\,p(R)}{p(D)}$
      • $p(NR \mid D) = \frac{p(D \mid NR)\,p(NR)}{p(D)}$
      • p(R), p(NR): prior probability of retrieving a (non-)relevant document
      • p(D|R), p(D|NR): probability that if a relevant (non-relevant) document is retrieved, it is D
  • 4. Suppose we have a vector representing the presence and absence of terms (1,0,0,1,1). Terms 1, 4, & 5 are present.
      • What is the probability of this document occurring in the relevant set?
      • $p_i$ is the probability that term i occurs in a document from the relevant set. $(1 - p_i)$ is the probability that the term does not occur in it.
      • This gives us: $p_1 \times (1 - p_2) \times (1 - p_3) \times p_4 \times p_5$
  • 5.
  • 6. Popular and effective ranking algorithm based on binary independence model
      • adds document and query term weights
      • k1, k2 and K are parameters whose values are set empirically
      • dl is doc length
      • Typical TREC value for k1 is 1.2, k2 varies from 0 to 1000, b = 0.75
  • 7. Query with two terms, “president lincoln” (qf = 1, where qf is the frequency of term i in the query)
      • No relevance information (r and R are zero)
      • N = 500,000 documents
      • “president” occurs in 40,000 documents (n1 = 40,000)
      • “lincoln” occurs in 300 documents (n2 = 300)
      • “president” occurs 15 times in doc (f1 = 15)
      • “lincoln” occurs 25 times (f2 = 25)
      • document length is 90% of the average length (dl/avdl = 0.9)
      • k1 = 1.2, b = 0.75, and k2 = 100
      • K = 1.2 · (0.25 + 0.75 · 0.9) = 1.11
  • 8.
  • 9. Language models
      • Unigram language model (simplest form)
          • probability distribution over the words in a language
          • generation of text consists of pulling words out of a “bucket” according to the probability distribution and replacing them
      • N-gram language model
          • some applications use bigram and trigram language models where probabilities depend on previous words
          • based on previous n-1 words
  • 10. A topic in a document or query can be represented as a language model
      • i.e., words that tend to occur often when discussing a topic will have high probabilities in the corresponding language model
  • 11. Rank documents by the probability P(Q|D) that the query could be generated by the document language model (i.e. same topic)
      • Assuming uniform, unigram model
  • 12. Obvious estimate for unigram probabilities is $P(q_i \mid D) = \frac{f_{q_i,D}}{|D|}$
      • $f_{q_i,D}$ is the number of times word $q_i$ occurs in document D; $|D|$ is the number of words in the document
      • If query words are missing from the document, the score will be zero
      • Missing 1 out of 4 query words is the same as missing 3 out of 4. Not good for long queries!
  • 13. Document texts are a sample from the language model
      • Missing words should not have zero probability of occurring (when calculating the probability that the query could be generated from the document)
      • Smoothing is a technique for estimating probabilities for missing (or unseen) words
          • lower (or discount) the probability estimates for words that are seen in the document text
          • assign that “left-over” probability to the estimates for the words that are not seen in the text
  • 14. Types of web search, from Broder (2002), http://www.sigir.org/forum/F2002/broder.pdf
      • Informational
          • finding information about some topic which may be on one or more web pages
          • topical search
      • Navigational
          • finding a particular web page that the user has either seen before or is assumed to exist
      • Transactional
          • finding a site where a task such as shopping or downloading music can be performed
  • 15. For effective navigational and transactional search, need to combine features that reflect user relevance
      • Commercial web search engines combine evidence from hundreds of features to generate a ranking score for a web page
          • page content, page metadata, anchor text, links (e.g., PageRank), and user behavior (click logs)
          • page metadata: e.g., “age”, how often it is updated, the URL of the page, the domain name of its site, and the amount of text content
  • 16. SEO: understanding the relative importance of features used in search and how they can be optimized to obtain better search rankings for a web page
      • e.g., improve the text used in the title tag, improve the text in heading tags, make sure that the domain name and URL contain important keywords, and try to improve the anchor text and link structure
      • Some of these techniques are regarded as not appropriate by search engine companies
  • 17. Toolkit, written in Java, for experimenting with text.
      • http://www.galagosearch.org/quick-start.html
  • 18.
  • 19. Considerable interaction between these fields
      • Arthur Samuel, 1959: checkers game. World’s first self-learning program. IBM 701.
      • Web query logs have generated a new wave of research
          • e.g., “Learning to Rank”
  • 20. Types of machine learning
      • Supervised Learning
          • Regression analysis
          • Classification Problems
          • Support Vector Machines (SVM)
      • Unsupervised Learning
          • http://www.youtube.com/watch?v=GWWIn29ZV4Q
      • Reinforcement Learning
      • Learning Theory
          • How much training data do we need?
          • How accurately can we predict an event to 99% accuracy?
  • 21. Papers: Boser et al., 1992
      • Standard SVM [Cortes and Vapnik, 1995]
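
The worked example on slides 6 and 7 can be checked numerically. The scoring formula itself did not survive extraction, so the sketch below assumes the standard BM25 form: for each query term, log[((r + 0.5)/(R − r + 0.5)) / ((n − r + 0.5)/(N − n − R + r + 0.5))] · ((k1 + 1)f/(K + f)) · ((k2 + 1)qf/(k2 + qf)), with K = k1((1 − b) + b · dl/avdl), which matches the K = 1.11 on slide 7. The function name `bm25_term` and its argument order are illustrative choices, not part of the lecture.

```python
import math


def bm25_term(n, f, qf, N, dl_over_avdl, k1=1.2, k2=100.0, b=0.75, r=0, R=0):
    """Contribution of one query term to a document's BM25 score.

    Assumes the standard BM25 form (not recovered from the slide itself).
    n: documents containing the term, f: term frequency in the document,
    qf: term frequency in the query, N: collection size,
    r / R: relevant documents containing the term / all relevant documents.
    """
    # Length normalisation: K grows with the document's relative length.
    K = k1 * ((1 - b) + b * dl_over_avdl)
    # Relevance weight; with r = R = 0 it behaves like an IDF term.
    weight = math.log(
        ((r + 0.5) / (R - r + 0.5)) / ((n - r + 0.5) / (N - n - R + r + 0.5))
    )
    doc_tf = (k1 + 1) * f / (K + f)          # saturating document term frequency
    query_tf = (k2 + 1) * qf / (k2 + qf)     # saturating query term frequency
    return weight * doc_tf * query_tf


# Values from slide 7: query "president lincoln", N = 500,000, dl/avdl = 0.9.
N = 500_000
president = bm25_term(n=40_000, f=15, qf=1, N=N, dl_over_avdl=0.9)
lincoln = bm25_term(n=300, f=25, qf=1, N=N, dl_over_avdl=0.9)
score = president + lincoln
print(f"president: {president:.2f}  lincoln: {lincoln:.2f}  total: {score:.2f}")
```

With natural logarithms the total comes out at about 20.6, and most of it comes from “lincoln”, the rarer term: its relevance weight is roughly three times that of “president”, while the two term-frequency factors are nearly equal because they saturate.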

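The zero-probability problem on slides 12 and 13 can be sketched as follows. The slides do not name a smoothing method, so this example assumes linear interpolation with a collection model (Jelinek-Mercer smoothing), which is one common choice. The function name `query_log_likelihood`, the parameter `lam`, and the toy texts are all made up for illustration.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection, lam=0.0):
    """log P(Q|D) under a unigram model.

    lam = 0 gives the unsmoothed estimate f(q_i, D) / |D|.
    lam > 0 mixes in the collection model (Jelinek-Mercer smoothing,
    an assumed choice; the slides do not specify the method).
    """
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    total = 0.0
    for word in query:
        p_doc = doc_tf[word] / len(doc)
        p_coll = coll_tf[word] / len(collection)
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # One missing word drives the whole product to zero.
            return float("-inf")
        total += math.log(p)
    return total


# Toy data, invented for this sketch.
doc = "president lincoln was the sixteenth president".split()
collection = doc + "the civil war began in april".split()
query = ["president", "lincoln", "war"]

unsmoothed = query_log_likelihood(query, doc, collection)
smoothed = query_log_likelihood(query, doc, collection, lam=0.5)
print(unsmoothed, smoothed)
```

Without smoothing the document scores negative infinity because “war” never occurs in it, no matter how well the other two words match. With smoothing the score is finite, so documents missing one query word can still be ranked against documents missing several.
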
Editor's Notes

  1. Di = 1 product over the terms that have value 1. Example in the index if the phrase appeared in the document it would have a one. Si = denominator P(D|NR).
  2. http://www.miislita.com/information-retrieval-tutorial/okapi-bm25-tutorial.pdf … BM stands for Best Match. Developed in the 1980s. K normalizes by document length. b regulates the impact of the length normalization. b = 0.75 was found to be effective.
  3. Summation over all terms in the query. Scoring a single document in the collection to see how it matches a query.
  4. Language models are used in speech recognition, machine learning, etc.
  5. Di = 1 product over the terms that have value 1. Example in the index if the phrase appeared in the document it would have a one. Qi is query word and there are n words in the query
  6. For example, if we have a language model representing a document about computer games, the document should have a non-zero probability for the word RPG (role-playing game) even if the word does not appear in the document. The question is how much weight you give a document if it has ALL the words. Is it really MORE relevant because the word appeared in the document?
  7. Taxonomy – Identifying and classifying things into groups or classes.
  8. Di = 1 product over the terms that have value 1. Example in the index if the phrase appeared in the document it would have a one. Qi is query word and there are n words in the query
  9. In this case we can use density and frequency…
  10. Trying to maximize the width of the tube. If a point is on the right it is relevant; if it is on the left it is not. Then we define a decision function. How do we find the optimum? If we use the dotted line as our model, we just check whether the data is on the right- or left-hand side. Find a separating hyperplane. We are going to train this function until we get a good predictive model. Finding the general hyperplane w^T x + b = 0. Once we find w and b we can make predictions: a sample x_i is classified as positive if w^T x_i + b > 0. We will combine the two inequalities next.
  11. Distance between two parallel lines is given by.
  12. The subtraction of epsilon guarantees a separation in the data. C is a term for training errors.
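
Notes 10 and 11 can be illustrated with a small sketch of the decision function and the width of the “tube”. It assumes the usual canonical scaling, in which the two margin lines are w^T x + b = +1 and w^T x + b = −1, so the distance between them is 2/‖w‖. The notes do not state this scaling explicitly, and the weights and sample points below are invented rather than learned from data.

```python
import math


def decision_value(w, b, x):
    """w^T x + b for a single sample x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """+1 on the positive side of the hyperplane, -1 otherwise."""
    return 1 if decision_value(w, b, x) > 0 else -1


def margin_width(w):
    """Distance between w^T x + b = +1 and w^T x + b = -1 (assumed scaling)."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical hyperplane; a real SVM would obtain w and b by training.
w, b = (3.0, 4.0), -5.0
print(classify(w, b, (3.0, 4.0)))   # point on the positive side
print(classify(w, b, (0.0, 0.0)))   # point on the negative side
print(margin_width(w))
```

Because the width is 2/‖w‖, maximizing the tube is the same as minimizing ‖w‖ subject to every training sample lying on the correct side of its margin line.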