This document covers probabilistic information retrieval and machine learning, including support vector machines. It provides examples of using probabilities to determine the likelihood of a document being relevant given certain query terms. It also discusses language models and smoothing techniques used in document ranking. Finally, it briefly outlines different types of machine learning problems and algorithms, such as supervised learning, classification, and reinforcement learning.
2. Let a, b be two events.
p(a | b)p(b) = p(a ∩ b) = p(b | a)p(a)
p(a | b) = p(b | a)p(a) / p(b)
p(a | b)p(b) = p(b | a)p(a)
3. Let D be a document in the collection.
Let R represent relevance of a document w.r.t. given (fixed)
query and let NR represent non-relevance.
Need to find p(R|D) - probability that a retrieved document D
is relevant.
p(R | D) = p(D | R)p(R) / p(D)
p(NR | D) = p(D | NR)p(NR) / p(D)
p(R), p(NR) - prior probability of retrieving a (non-)relevant document
p(D | R), p(D | NR) - probability that if a relevant (non-relevant) document is retrieved, it is D.
4. Suppose we have a vector representing the presence and
absence of terms (1,0,0,1,1). Terms 1, 4, & 5 are present.
What is the probability of this document occurring in the
relevant set?
pi is the probability that term i occurs in the relevant
set. (1 - pi) is the probability that the term does not
occur in the relevant set.
This gives us: p1 x (1-p2) x (1-p3) x p4 x p5
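A minimal Python sketch of this product (my own illustration; the pi values below are made up and term independence is assumed):

```python
# Sketch: P(D | R) for a binary term-presence vector under the binary
# independence model, assuming terms occur independently.
# p[i] is an illustrative probability that term i+1 occurs in a relevant document.

def presence_vector_probability(doc_vector, p):
    prob = 1.0
    for present, p_i in zip(doc_vector, p):
        prob *= p_i if present else (1.0 - p_i)
    return prob

p = [0.8, 0.3, 0.1, 0.6, 0.9]                             # made-up values
print(presence_vector_probability([1, 0, 0, 1, 1], p))    # p1*(1-p2)*(1-p3)*p4*p5
```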
5.
BM25: a popular and effective ranking algorithm
based on the binary independence model
adds document and query term weights
k1, k2 and K are parameters whose values are set
empirically
dl is doc length
Typical TREC value for k1 is 1.2, k2 varies from 0
to 1000, b = 0.75
Query with two terms, “president lincoln” (qfi = 1 for both,
where qfi is the frequency of term i in the query)
No relevance information (r and R are zero)
N = 500,000 documents
“president” occurs in 40,000 documents (n1 = 40,000)
“lincoln” occurs in 300 documents (n2 = 300)
“president” occurs 15 times in doc (f1 = 15)
“lincoln” occurs 25 times (f2 = 25)
document length is 90% of the average length (dl/avdl
= .9)
k1 = 1.2, b = 0.75, and k2 = 100
K = 1.2 · (0.25 + 0.75 · 0.9) = 1.11
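A small Python sketch of the full score for this example, using the BM25 form with no relevance information (r = R = 0); the function and variable names are mine:

```python
import math

N = 500_000                              # documents in the collection
k1, k2, b = 1.2, 100, 0.75
K = k1 * ((1 - b) + b * 0.9)             # 1.2 * (0.25 + 0.75 * 0.9) = 1.11

def bm25_term(n_i, f_i, qf_i):
    """One query term's contribution: idf-like part * document tf part * query tf part."""
    idf = math.log((N - n_i + 0.5) / (n_i + 0.5))
    doc_part = ((k1 + 1) * f_i) / (K + f_i)
    query_part = ((k2 + 1) * qf_i) / (k2 + qf_i)
    return idf * doc_part * query_part

score = bm25_term(40_000, 15, 1) + bm25_term(300, 25, 1)   # "president" + "lincoln"
print(round(score, 2))                                      # roughly 20.6 with these numbers
```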
8.
9. Unigram language model (simplest form)
probability distribution over the words in a
language
generation of text consists of pulling words out of
a “bucket” according to the probability distribution
and replacing them
N-gram language model
some applications use bigram and trigram language
models, where probabilities depend on previous words
an n-gram model conditions each word on the previous n-1 words
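A toy sketch (my own example) of estimating unigram and bigram probabilities from raw counts:

```python
from collections import Counter

words = "the cat sat on the mat the cat slept".split()   # tiny made-up "corpus"

unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))

def p_unigram(w):
    return unigrams[w] / len(words)

def p_bigram(w, prev):
    """P(w | prev): bigram probability conditioned on the previous word."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_unigram("cat"))        # 2/9
print(p_bigram("cat", "the"))  # 2/3: "the" is followed by "cat" in 2 of its 3 occurrences
```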
10. A topic in a document or query can be
represented as a language model
i.e., words that tend to occur often when
discussing a topic will have high probabilities in
the corresponding language model
11. Rank documents by the probability that the query
could be generated by the document language
model (i.e. same topic) P(Q|D)
Assuming a uniform prior over documents and a unigram model
Obvious estimate for unigram probabilities is
p(qi | D) = fqi,D / |D|
fqi,D is the number of times query word qi occurs in document D;
|D| is the number of words in the document
If query words are missing from document, score
will be zero
Missing 1 out of 4 query words produces the same zero score as
missing 3 out of 4. Not good for long queries!
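A quick sketch (my own code) of this maximum-likelihood scoring and the zero-score problem it creates:

```python
# p(q_i | D) = f_{q_i,D} / |D|, multiplied over the query words.
# A single missing query word drives the whole score to zero.
def query_likelihood(query, doc_words):
    score = 1.0
    for q in query:
        score *= doc_words.count(q) / len(doc_words)
    return score

doc = "president lincoln was the sixteenth president".split()
print(query_likelihood(["president", "lincoln"], doc))   # (2/6) * (1/6) > 0
print(query_likelihood(["president", "kennedy"], doc))   # 0.0: "kennedy" never occurs in doc
```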
13. Document texts are a sample from the
language model
Missing words should not have zero probability of
occurring (when calculating the probability that the query
could be generated from the document)
Smoothing is a technique for estimating
probabilities for missing (or unseen) words
lower (or discount) the probability estimates for
words that are seen in the document text
assign that “left-over” probability to the estimates
for the words that are not seen in the text
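One common way to do this (the slides do not commit to a specific formula) is Jelinek-Mercer smoothing, which mixes the document estimate with a collection-wide estimate; a sketch with made-up texts and an illustrative mixing weight:

```python
# Jelinek-Mercer smoothing: (1 - lam) * p(word | document) + lam * p(word | collection)
def smoothed_prob(word, doc_words, collection_words, lam=0.5):
    p_doc = doc_words.count(word) / len(doc_words)
    p_coll = collection_words.count(word) / len(collection_words)
    return (1 - lam) * p_doc + lam * p_coll

doc = "president lincoln was the sixteenth president".split()
collection = ("president lincoln was the sixteenth president "
              "kennedy was also a president of the united states").split()

# "kennedy" is unseen in the document but now gets a small non-zero probability
print(smoothed_prob("kennedy", doc, collection))
```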
14. Informational
Finding information about some topic which may be on one or
more web pages
Topical search
Navigational
finding a particular web page that the user has either seen before
or is assumed to exist
Transactional
finding a site where a task such as shopping or downloading
music can be performed
Broder (2002) http://www.sigir.org/forum/F2002/broder.pdf
15. For effective navigational and transactional
search, need to combine features that reflect
user relevance
Commercial web search engines combine
evidence from hundreds of features to
generate a ranking score for a web page
page content, page metadata, anchor text, links
(e.g., PageRank), and user behavior (click logs)
page metadata – e.g., “age”, how often it is
updated, the URL of the page, the domain name
of its site, and the amount of text content
16. SEO: understanding the relative importance
of features used in search and how they can
be optimized to obtain better search rankings
for a web page
e.g., improve the text used in the title tag, improve
the text in heading tags, make sure that the
domain name and URL contain important
keywords, and try to improve the anchor text and
link structure
Some of these techniques are regarded as not
appropriate by search engine companies
Galago: a toolkit, written in Java, for experimenting with text search.
http://www.galagosearch.org/quick-start.html
18.
19. Considerable interaction between these
fields
Arthur Samuel: 1959 – Checkers game. World’s
first self-learning program. IBM701.
Web query logs have generated new wave of
research
e.g., “Learning to Rank”
20. Supervised Learning
Regression analysis
Classification Problems
Support Vector Machines (SVM)
Unsupervised Learning
http://www.youtube.com/watch?v=GWWIn29ZV4Q
Reinforcement Learning
Learning Theory
How much training data do we need?
How accurately can we predict an event (e.g., can we
reach 99% accuracy)?
21. Papers: Boser et al., 1992
Standard SVM [Cortes and Vapnik, 1995]
Editor's Notes
The product is over the terms with di = 1, i.e., terms that have value 1; for example, in the index, if the term appeared in the document its entry would be a one. si is the corresponding probability in the denominator, P(D|NR).
http://www.miislita.com/information-retrieval-tutorial/okapi-bm25-tutorial.pdf … BM25 stands for Best Match; developed in the 1980s. K normalizes by document length, and b regulates the impact of the length normalization; b = 0.75 was found to be effective.
Summation over all terms in the query. Scoring a single document in the collection to see how it matches a query.
Language models are used in speech recognition, machine learning, etc.
The product is over the terms with di = 1, i.e., terms that have value 1; for example, in the index, if the term appeared in the document its entry would be a one. qi is a query word and there are n words in the query.
For example… if we have a language model representing a document about computer games, the document should have a non-zero probability for the word RPG (role-playing game) even if the word does not appear in the document. The question is how much weight to give a document if it has ALL the query words. Is it really MORE relevant just because every word appeared in the document?
Taxonomy – Identifying and classifying things into groups or classes.
In this case we can use density and frequency…
Trying to maximize the width of the tube (the margin). If a point is on the right it is relevant; if it is on the left it is not. Then we define a decision function. How do we find the optimum? If we use the dotted line as our model, we just check whether the data is on the right or left hand side. Find a separating hyperplane. We are going to train this function until we get a good predictive model. We are finding a general hyperplane w^T x + b = 0. Once we find w and b we can make predictions: for a sample xi, we predict the positive class if w^T xi + b > 0. We will combine the two inequalities next.
The distance between the two parallel lines (the margin) is 2/||w||.
The subtraction of epsilon guarantees a separation in the data. C is a penalty term for training errors.
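As a concrete illustration of the separating-hyperplane idea in these notes, a small sketch using scikit-learn's linear SVM (my own toy data; not code from the presentation):

```python
import numpy as np
from sklearn.svm import SVC

# Two tiny made-up classes; fit a hyperplane w^T x + b = 0 and predict by its sign.
X = np.array([[1.0, 2.0], [2.0, 3.0], [2.5, 1.5],    # class 0
              [6.0, 5.0], [7.0, 7.5], [8.0, 6.0]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)   # C penalizes training errors (soft margin)
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print(clf.predict([[2.0, 2.0], [7.0, 6.0]]))   # which side of the hyperplane each point falls on
```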