2. Words and Word Classes
Words are classified into categories called parts of speech (also known as word classes or lexical categories).
3. Part of Speech
NN   noun         student, chair, proof, mechanism
VB   verb         study, increase, produce
JJ   adjective    large, high, tall, few
RB   adverb       carefully, slowly, uniformly
IN   preposition  in, on, to, of
PRP  pronoun      I, me, they
DT   determiner   the, a, an, this, those
Word classes divide into open classes (noun, verb, adjective, adverb), which readily admit new members, and closed classes (preposition, pronoun, determiner), which rarely do.
4. Part-of-Speech Tagging
Part-of-speech tagging is the process of assigning a part of speech (noun, verb, pronoun, preposition, adverb, adjective, etc.) to each word in a sentence.
A POS tagger takes words and a tag set as input and produces a POS tag for each word.
5. Speech/NN sounds/NNS were/VBD sampled/VBN by/IN a/DT microphone/NN.
Another possible tagging for the sentence is:
Speech/NN sounds/VBZ were/VBD sampled/VBN by/IN a/DT microphone/NN.
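For comparison, an off-the-shelf tagger can be tried directly; a minimal sketch using NLTK (assuming the library and its tokenizer/tagger models are installed; model names can vary across NLTK versions):

```python
import nltk

# One-time model downloads (names may differ in newer NLTK releases).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("Speech sounds were sampled by a microphone.")
print(nltk.pos_tag(tokens))
# Expected output (roughly): [('Speech', 'NN'), ('sounds', 'NNS'),
#   ('were', 'VBD'), ('sampled', 'VBN'), ('by', 'IN'), ('a', 'DT'),
#   ('microphone', 'NN'), ('.', '.')]
```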
6. Part-of-Speech Tagging Methods
Rule-based (linguistic)
Stochastic (data-driven)
TBL (Transformation-Based Learning)
7. Rule-based (linguistic)
Steps:
1. Dictionary lookup: list the potential tags for each word.
2. Hand-coded rules: eliminate the tags that are impossible in context.
Example: The show must go on.
Step 1: show → NN, VB
Step 2: discard the incorrect tag
Rule: IF the preceding word is a determiner THEN eliminate the VB tag.
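A minimal sketch of these two steps in Python (the tiny dictionary and the single rule are illustrative, not taken from a real tagger):

```python
# Step 1: dictionary lookup of potential tags; Step 2: hand-coded
# rules that discard impossible tags. Toy dictionary, illustrative only.
DICTIONARY = {
    "the": {"DT"},
    "show": {"NN", "VB"},   # ambiguous: noun or verb
    "must": {"MD"},
    "go": {"VB"},
    "on": {"IN", "RP"},
}

def tag(words):
    # Step 1: look up all potential tags for each word.
    candidates = [set(DICTIONARY.get(w.lower(), {"NN"})) for w in words]
    # Step 2: rule -- IF preceding word is a determiner THEN eliminate VB.
    for i in range(1, len(words)):
        if "DT" in candidates[i - 1] and len(candidates[i]) > 1:
            candidates[i].discard("VB")
    return list(zip(words, candidates))

print(tag("The show must go on".split()))
# 'show' ends up as {'NN'}: the VB reading is ruled out after "The"
```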
8. Morphological information
IF the word ends in -ing and the preceding word is a verb, THEN label it a verb (VB).
Capitalization information can be used in the same way (e.g., a capitalized word in mid-sentence is likely a proper noun).
10. Stochastic Tagger
The standard stochastic tagging algorithm is the Hidden Markov Model (HMM) tagger.
A Markov model applies the simplifying assumption that the probability of a chain of symbols can be approximated in terms of its parts, or n-grams.
The simplest n-gram model is the unigram model, which assigns the most likely tag (part of speech) to each token.
11. The unigram model requires tagged data from which to gather the most-likely-tag statistics. The only context the unigram tagger uses is the text of the word itself. For example, it will assign the tag JJ to every occurrence of fast if fast is used as an adjective more frequently than as a noun, verb, or adverb:
(1) She had a fast.
(2) Muslim fast during Ramadan.
(3) Those who are injured need medical help fast.
We would expect more accurate predictions if we took more context into account when making a tagging decision.
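A unigram tagger can be built directly from word-tag counts; a sketch with a toy corpus standing in for real tagged data:

```python
from collections import Counter, defaultdict

# Toy tagged data standing in for a real training corpus.
tagged_corpus = [
    ("a", "DT"), ("fast", "JJ"), ("car", "NN"),
    ("a", "DT"), ("fast", "JJ"), ("train", "NN"),
    ("the", "DT"), ("fast", "NN"),   # 'fast' as a noun, once
]

# Count how often each word carries each tag ...
counts = defaultdict(Counter)
for word, tag in tagged_corpus:
    counts[word][tag] += 1

# ... then always assign the word's most frequent tag.
unigram_tag = {w: c.most_common(1)[0][0] for w, c in counts.items()}

print(unigram_tag["fast"])   # 'JJ' -- chosen regardless of context
```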
12. A bi-gram tagger uses the current word and the tag of the previous word in the tagging process.
Since the tag sequence "DT NN" is more likely than the tag sequence "DT JJ", a bi-gram model will assign the correct tag to the word fast in sentence (1).
Similarly, it is more likely that an adverb (rather than a noun or an adjective) follows a verb. Hence, in sentence (3), the tag assigned to fast will be RB (adverb).
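The transition statistics behind such decisions can be estimated by counting adjacent tag pairs in tagged data; a sketch on toy tag sequences (illustrative, not a real corpus):

```python
from collections import Counter

# Toy tag sequences standing in for a tagged training corpus.
tag_sequences = [
    ["DT", "NN", "VBD", "RB"],
    ["DT", "NN", "VBZ", "JJ"],
    ["DT", "JJ", "NN", "VBD"],
]

bigrams, unigrams = Counter(), Counter()
for seq in tag_sequences:
    for prev, cur in zip(seq, seq[1:]):
        bigrams[(prev, cur)] += 1
        unigrams[prev] += 1

def p(cur, prev):
    # P(cur | prev) = count(prev, cur) / count(prev)
    return bigrams[(prev, cur)] / unigrams[prev]

print(p("NN", "DT"), p("JJ", "DT"))   # DT -> NN beats DT -> JJ here
```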
13. N-gram Model
An n-gram model considers the current word and the tags of the previous n-1 words when assigning a tag to a word.
Fig.: Context used by a tri-gram model.
14. HMM Tagger
Given a sequence of words (a sentence), the objective is to find the most probable tag sequence for the sentence.
Let W be the sequence of words:
W = w1, w2, ..., wn
The task is to find the tag sequence
T = t1, t2, ..., tn
which maximizes P(T|W), i.e.,
T' = argmaxT P(T|W)
15. Applying Bayes' rule, P(T|W) can be rewritten as:
P(T|W) = P(W|T) * P(T) / P(W)
Since the probability of the word sequence, P(W), is the same for every candidate tag sequence, we can drop it. The expression for the most likely tag sequence becomes:
T' = argmaxT P(W|T) * P(T)
16. By the chain rule, the probability of a tag sequence decomposes exactly as:
P(T) = P(t1) * P(t2|t1) * P(t3|t1 t2) * ... * P(tn|t1 ... tn-1)
The Markov assumption approximates each conditional with its n-gram; for a bi-gram model, P(ti|t1 ... ti-1) ≈ P(ti|ti-1).
P(W|T) is the probability of seeing a word sequence given a tag sequence.
For example: what is the probability of seeing 'The egg is rotten' given 'DT NN VBZ JJ'?
17. We make the following two assumptions:
1. The words are independent of each other, and
2. The probability of a word depends only on its tag.
Using these assumptions, P(W|T) can be expressed as:
P(W|T) = P(w1|t1) * P(w2|t2) * ... * P(wn|tn)
i.e., the product of P(wi|ti) over i = 1, ..., n.
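Putting P(T) and P(W|T) together, the most probable tag sequence under a bi-gram HMM can be found with the Viterbi algorithm. A compact sketch follows; the toy probabilities are illustrative, not corpus estimates:

```python
import math

def logp(x):
    # Log-probability with a floor so zero probabilities don't blow up.
    return math.log(x) if x > 0 else float("-inf")

def viterbi(words, tags, start, trans, emit):
    # best[t] = (log-prob of the best tag path ending in t, that path)
    best = {t: (logp(start[t]) + logp(emit[t].get(words[0], 0)), [t]) for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            # Pick the previous tag maximizing P(T) * P(W|T) so far.
            prev = max(tags, key=lambda q: best[q][0] + logp(trans[q][t]))
            score = best[prev][0] + logp(trans[prev][t]) + logp(emit[t].get(w, 0))
            new[t] = (score, best[prev][1] + [t])
        best = new
    return max(best.values())[1]

# Toy parameters (illustrative values only).
tags = ["DT", "NN", "VB"]
start = {"DT": 0.6, "NN": 0.3, "VB": 0.1}              # P(t1)
trans = {"DT": {"DT": 0.05, "NN": 0.85, "VB": 0.10},   # P(ti | ti-1)
         "NN": {"DT": 0.10, "NN": 0.30, "VB": 0.60},
         "VB": {"DT": 0.50, "NN": 0.30, "VB": 0.20}}
emit = {"DT": {"the": 0.7},                            # P(wi | ti)
        "NN": {"show": 0.6},
        "VB": {"show": 0.3, "goes": 0.7}}

print(viterbi(["the", "show", "goes"], tags, start, trans, emit))
# ['DT', 'NN', 'VB']
```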
22. Brill Tagger
Initial state: each word is assigned its most likely tag.
Transformation: the text is then passed through an ordered list of transformations. Each transformation is a pair of a rewrite rule and a contextual condition.
23. Learning Rules
Rules are learned in the following manner:
1. Each rule, i.e. each possible transformation, is applied to each matching word-tag pair.
2. The number of tagging errors is measured against the correct sequences of the training corpus ("truth").
3. The transformation which yields the greatest error reduction is chosen.
4. Learning stops when no transformation can be found that, if applied, reduces errors beyond some given threshold.
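A sketch of this greedy loop, with a deliberately tiny rule space ("change tag a to tag b if the previous tag is c"); a real Brill tagger instantiates far more templates:

```python
def errors(tags, truth):
    # Tagging errors against the gold-standard sequence ("truth").
    return sum(a != b for a, b in zip(tags, truth))

def apply_rule(tags, rule):
    a, b, prev = rule   # change a to b if the previous tag is prev
    return [b if t == a and i > 0 and tags[i - 1] == prev else t
            for i, t in enumerate(tags)]

def learn(tags, truth, candidates, threshold=1):
    learned = []
    while True:
        # Try every candidate rule; keep the one that fixes the most errors.
        best = min(candidates, key=lambda r: errors(apply_rule(tags, r), truth))
        gain = errors(tags, truth) - errors(apply_rule(tags, best), truth)
        if gain < threshold:
            return learned           # stop: no rule helps enough
        learned.append(best)
        tags = apply_rule(tags, best)

initial = ["DT", "VB"]               # initial-state tags for "the show"
truth   = ["DT", "NN"]               # gold standard
rules   = [("VB", "NN", "DT"), ("NN", "VB", "DT")]
print(learn(initial, truth, rules))  # [('VB', 'NN', 'DT')]
```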
24. • The set of possible 'transforms' is infinite, e.g., "transform NN to VB if the previous word was MicrosoftWindoze & the word braindead occurs between 17 and 158 words before that"
• To limit the search: start with a small set of abstracted transforms, or templates
27. Lexicalized Transformations
Brill complements the rule schemes with so-called lexicalized rules, which refer to particular words in the condition part of the transformation:
Change a to b if
1. the preceding (following, current) word is C
2. the preceding (following, current) word is C and the preceding (following) word is tagged d
etc.
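As a data structure, a lexicalized rule differs from a purely tag-based one only in testing a word in its condition; a hypothetical representation:

```python
# A lexicalized transformation: the condition tests a concrete word.
# Illustrative rule: change VB to NN if the preceding word is "the".
rule = {"from": "VB", "to": "NN", "condition": ("prev_word", "the")}

def applies(rule, words, tags, i):
    kind, value = rule["condition"]
    return (tags[i] == rule["from"]
            and kind == "prev_word"
            and i > 0 and words[i - 1] == value)

print(applies(rule, ["the", "show"], ["DT", "VB"], 1))   # True
```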
28. Unknown Words
In handling unknown words, a POS tagger can adopt the following strategies:
1. assign all possible tags to the unknown word
2. assign the most probable tag to the unknown word
3. assume the same tag distribution as 'things seen once' (an estimator for 'things never seen')
4. use word features, i.e. how the word is spelled (prefixes, suffixes, word length, capitalization), to guess a (set of) word class(es) -- the most powerful strategy
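A sketch of the feature-based strategy (the suffix lists and guessed tag sets are illustrative, not exhaustive):

```python
# Guess a set of tags for an unknown word from its spelling alone.
def guess_tags(word):
    if word[0].isupper():
        return {"NNP"}                      # capitalized: likely proper noun
    if word.endswith(("ion", "ment", "ness", "ity")):
        return {"NN"}                       # derivational noun endings
    if word.endswith("ing"):
        return {"VBG", "NN", "JJ"}          # gerund / noun / adjective
    if word.endswith("ly"):
        return {"RB"}                       # most -ly words are adverbs
    if word.endswith("ed"):
        return {"VBD", "VBN"}               # past tense / past participle
    return {"NN", "VB", "JJ"}               # fall back to open classes

print(guess_tags("grokking"))   # e.g. {'VBG', 'NN', 'JJ'} (set order varies)
```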
29. Most Powerful Unknown-Word Detectors
32 derivational endings (-ion, etc.); capitalization; hyphenation.
More generally: one should use morphological analysis (and some kind of machine-learning approach)!