Adding morphological information to a connectionist Part-Of-Speech tagger


F. Zamora-Martínez M.J. Castro-Bleda S. España-Boquera
S. Tortajada-Velert

Departamento de Sistemas Informáticos y Computación
Universidad Politécnica de Valencia, Spain

Escuela Superior de Enseñanzas Técnicas
Universidad CEU-Cardenal Herrera, Alfara del Patriarca, Valencia, Spain

CAEPIA 2009, 10-12 November 2009, Sevilla

Index

1 POS tagging

2 Probabilistic tagging

3 Connectionist tagging

4 The Penn Treebank Corpus

5 The connectionist POS taggers

6 Conclusions




What is Part-Of-Speech (POS) tagging?

T = {τ1 , τ2 , . . . , τk }: a set of POS tags
Ω = {ω1 , ω2 , . . . , ωm }: the vocabulary of the application

The goal of a Part-Of-Speech tagger is to associate each word in a text
with its correct lexical-syntactic category (represented by a tag).

Example
The  grand  jury  commented  on  a   number  of  other  topics
DT   JJ     NN    VBD        IN  DT  NN      IN  JJ     NNS


Ambiguity and applications

Words often have more than one POS tag: lower
Europe proposed lower rate increases . . . = JJR
To push the pound even lower . . . = RBR
. . . should be able to lower long-term . . . = VB

Ambiguity!!!

Applications: speech synthesis, speech recognition, information
retrieval, word-sense disambiguation, machine translation, ...


How hard is POS tagging? Measuring ambiguity

Penn Treebank (45-tag corpus)
Unambiguous (1 tag)     36,678 (84%)
Ambiguous (2-7 tags)     7,088 (16%)
Details:
  2 tags    5,475
  3 tags    1,313 (lower)
  4 tags      250
  5 tags       41
  6 tags        7
  7 tags        2 (bet, open)

A simple baseline that assigns each word its most common tag
already reaches 90% accuracy!
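
The most-common-tag baseline mentioned above can be sketched in a few lines. This is only an illustration: the toy corpus, the function names and the fallback tag `NN` for unseen words are assumptions, not part of the presented system.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Count tags per word and keep the most frequent tag of each word."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}

def tag_baseline(words, most_frequent, default_tag="NN"):
    """Tag each word with its most frequent tag; unseen words get default_tag (assumption)."""
    return [most_frequent.get(word, default_tag) for word in words]

# Toy training data (hypothetical), reusing the ambiguous word "lower".
toy_corpus = [
    [("Europe", "NNP"), ("proposed", "VBD"), ("lower", "JJR"),
     ("rate", "NN"), ("increases", "NNS")],
    [("lower", "JJR"), ("rate", "NN")],
    [("to", "TO"), ("lower", "VB"), ("rate", "NN")],
]
model = train_baseline(toy_corpus)
predicted = tag_baseline(["lower", "rate", "increases"], model)
```

In this toy data "lower" is tagged JJR twice and VB once, so the baseline always answers JJR and gets the verb reading wrong, which is exactly the kind of error a context-aware tagger tries to remove.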


Unknown Words

How can one assign a tag to a given word if that word is unknown to
the tagger?

Unknown words are the hardest problem for POS tagging!




Probabilistic model

We are given a sentence: what is the best sequence of tags which
corresponds to the sequence of words?

Probabilistic view: Consider all possible sequences of tags and out of
this universe of sequences, choose the tag sequence which is most
probable given the observation sequence of words.

\[ \hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n) = \operatorname*{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n). \]


Probabilistic model: Simplifications

To simplify:
1 Words are independent of each other, and a word's identity depends
only on its tag → lexical probabilities:
\[ P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i) \]

2 The probability of a tag depends only on its predecessor tags
(bigram, trigram, ...) → contextual probabilities (bigram case):
\[ P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}). \]


Probabilistic model: Limitations

With these assumptions, a typical probabilistic model is expressed as:
\[ \hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n) \approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1}), \]
where \(\hat{t}_1^n\) is the best estimation of POS tags for the given sentence
\(w_1^n = w_1 w_2 \ldots w_n\), and considering that \(P(t_1 \mid t_0) = 1\).

1 It does not model long-distance relationships.
2 The contextual information covers only the context on the left;
the context on the right is not considered.
Both limitations can be overcome using ANN models.
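
For reference, the argmax of the bigram model above is usually computed with Viterbi decoding. The slides do not say which decoder was used, so the following is a generic sketch: the function name, the toy probabilities and the floor value for missing entries are assumptions. It follows the convention P(t_1 | t_0) = 1 stated above.

```python
import math

def viterbi(words, tags, lexical, contextual, floor=1e-12):
    """Return the tag sequence maximizing prod_i P(w_i|t_i) P(t_i|t_{i-1}).

    lexical[(word, tag)] and contextual[(prev_tag, tag)] hold probabilities;
    missing entries fall back to a tiny floor (an assumption of this sketch).
    """
    def logp(table, key):
        return math.log(table.get(key, floor))

    # best[i][t] = (log-score of the best path ending in tag t at position i, backpointer)
    # First position: P(t_1 | t_0) = 1, so only the lexical term counts.
    best = [{t: (logp(lexical, (words[0], t)), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p][0] + logp(contextual, (p, t))) for p in tags),
                key=lambda item: item[1],
            )
            column[t] = (score + logp(lexical, (words[i], t)), prev)
        best.append(column)

    # Follow the backpointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]

# Toy probabilities (hypothetical values).
tags = ["TO", "VB", "JJR", "NNS"]
lexical = {("to", "TO"): 1.0, ("lower", "VB"): 0.3,
           ("lower", "JJR"): 0.7, ("rates", "NNS"): 1.0}
contextual = {("TO", "VB"): 0.8, ("TO", "JJR"): 0.05,
              ("VB", "NNS"): 0.4, ("JJR", "NNS"): 0.5}
decoded = viterbi(["to", "lower", "rates"], tags, lexical, contextual)
```

Although JJR is the more likely tag for "lower" in isolation, the left context TO makes VB win. Note that only the left context is used, which is the second limitation listed above.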




Basic connectionist model

Europe   proposed   lower   rate   increases
NNP      VBD        ?????   NN     NNS

MLPs as POS tag classifiers:
MLP Input:
w_i = lower: the ambiguous input word, locally coded (one-hot) → projection layer
c_i = NNP, VBD, NN, NNS: the tags of the words surrounding the
ambiguous word to be tagged (past and future context), locally coded
MLP Output:
the probability of each tag given the input:
Pr(JJR|input)=0.6, Pr(RBR|input)=0.2, Pr(VB|input)=0.1, . . .
Therefore, the network learns the following mapping:
\[ F(w_i, c_i, t_i, \Theta) = \Pr_{\Theta}(t_i \mid w_i, c_i) \]
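
The local coding of the MLP input can be illustrated as follows. The tag list, the three-word vocabulary and the tiny 2-dimensional projection matrix are toy assumptions; the point is only that a locally coded word selects one row of the projection matrix.

```python
TAGS = ["NNP", "VBD", "JJR", "RBR", "VB", "NN", "NNS"]  # toy tag set (assumption)
VOCAB = ["lower", "rate", "increases"]                   # toy ambiguous vocabulary (assumption)

def one_hot(index, size):
    """Local coding: a vector of zeros with a single 1 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def encode_input(word, context_tags):
    """Return the locally coded word and the concatenated codes of the context tags."""
    word_code = one_hot(VOCAB.index(word), len(VOCAB))
    context_code = []
    for tag in context_tags:
        context_code.extend(one_hot(TAGS.index(tag), len(TAGS)))
    return word_code, context_code

def project(word_code, projection):
    """Multiply the word code by the projection matrix (rows = words)."""
    dim = len(projection[0])
    return [sum(word_code[r] * projection[r][c] for r in range(len(word_code)))
            for c in range(dim)]

# Hypothetical projection matrix mapping each of the 3 words to 2 dimensions.
projection = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
word_code, context_code = encode_input("lower", ["NNP", "VBD", "NN", "NNS"])
projected = project(word_code, projection)
```

Here `projected` equals the first row of the matrix, so the projection layer acts as a learned lookup table of dense word representations.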


Morphologically extended connectionist model

Europe    proposed   lower       rate      increases
NNP-Cap   VBD-NCap   ?????       NN-NCap   NNS-NCap
                     NCap, -er

MLPs as POS tag classifiers:
MLP Input:
w_i = lower: the ambiguous input word, locally coded → projection layer
m_i = NCap, -er: morphological info related to the ambiguous input word
c_i = NNP-Cap, VBD-NCap, NN-NCap, NNS-NCap: the tags of the
words surrounding the ambiguous word to be tagged (past and
future context), extended with morphological information, locally coded
MLP Output:
the probability of each tag given the input.
Therefore, the network learns the following mapping:
\[ F(w_i, m_i, c_i, t_i, \Theta) = \Pr_{\Theta}(t_i \mid w_i, m_i, c_i). \]


And what about Unknown Words?

When evaluating the model, some words have never been seen during
training; they belong neither to the vocabulary of known ambiguous
words nor to the vocabulary of known non-ambiguous words →
“Unknown words”: the hardest problem for the network to tag correctly.

Proposed solution
A combination of two specialized models:
MLPKnow : the MLP specialized in known ambiguous words
MLPUnk : the MLP specialized in unknown words


MLPKnow for known ambiguous words

w_i : the known ambiguous input word, locally coded at the input of the projection layer
m_i : morphological info related to the ambiguous input word
Context: two labels of past context and one label of future context, extended with morphological info.

\[ F_{Know}(w_i, m_i, c_i, t_i, \Theta_K) = \Pr_{\Theta_K}(t_i \mid w_i, m_i, c_i). \]


MLPUnk for unknown words
m_i : morphological info of the unknown word (the same as for MLPKnow)
s_i : more specific morphological info of the unknown word (different from MLPKnow)
Context: three labels of past context and one label of future context, extended with morphological info.

\[ F_{Unk}(m_i, s_i, c_i, t_i, \Theta_U) = \Pr_{\Theta_U}(t_i \mid m_i, s_i, c_i), \]
where s_i corresponds to additional morphological information related
to the i-th input word when it is unknown.

T_{w_i} table with the POS tags

Word            POS tags
minutes         NNS, NNPS
magnification   NN
strikes         NNS, VBZ
size            VBP, NN
layoff          NN
cohens          NNPS
...             ...

T_minutes = {NNS, NNPS}      Known ambiguous word
T_magnification = {NN}       Known non-ambiguous word


Final connectionist model

For each possible known word (ambiguous and non-ambiguous) we
have a T_{w_i} table with the POS tags observed in training for word w_i:

\[
F(w_i, m_i, s_i, c_i, t_i, \Theta_K, \Theta_U) =
\begin{cases}
0 & \text{if } t_i \notin T_{w_i}, \\
1 & \text{if } T_{w_i} = \{t_i\}, \\
F_{Know}(w_i, m_i, c_i, t_i, \Theta_K) & \text{if } w_i \in \Omega' \wedge t_i \in T_{w_i}, \\
F_{Unk}(m_i, s_i, c_i, t_i, \Theta_U) & \text{otherwise,}
\end{cases}
\]

where \(\Omega'\) is the vocabulary of ambiguous words.

\[ \hat{t}_1^n = \operatorname*{argmax}_{t_1^n} \Pr(t_1^n \mid w_1^n) \approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} F(w_i, m_i, s_i, c_i, t_i, \Theta_K, \Theta_U) \]
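
The case analysis of the combined function F can be read as a simple dispatch. In the sketch below, `f_know` and `f_unk` are stand-ins for the outputs of the two networks (their morphological and context arguments are omitted for brevity), and applying the first two cases only to words that have a tag table is an assumption of this sketch.

```python
def combined_score(word, tag, tag_table, ambiguous_vocab, f_know, f_unk):
    """Score a (word, tag) pair following the four cases of F.

    tag_table maps each known word to the set of tags observed in training.
    f_know and f_unk are callables standing in for MLPKnow and MLPUnk.
    """
    observed = tag_table.get(word)
    if observed is not None:          # known word (assumption: cases 1-3 need a table)
        if tag not in observed:
            return 0.0                # tag never seen with this word
        if len(observed) == 1:
            return 1.0                # known non-ambiguous word
        if word in ambiguous_vocab:
            return f_know(word, tag)  # known ambiguous word
    return f_unk(word, tag)           # any other case

# Toy tables taken from the examples above; the scores are hypothetical.
tag_table = {"minutes": {"NNS", "NNPS"}, "magnification": {"NN"}}
ambiguous_vocab = {"minutes"}
f_know = lambda word, tag: {"NNS": 0.9, "NNPS": 0.1}[tag]
f_unk = lambda word, tag: 0.25

scores = [
    combined_score("magnification", "NN", tag_table, ambiguous_vocab, f_know, f_unk),
    combined_score("magnification", "VB", tag_table, ambiguous_vocab, f_know, f_unk),
    combined_score("minutes", "NNS", tag_table, ambiguous_vocab, f_know, f_unk),
    combined_score("layoffs", "NN", tag_table, ambiguous_vocab, f_know, f_unk),
]
```

Only the third and fourth calls would reach a network in a real system; the first two are resolved by table lookup alone.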




The Penn Treebank Corpus

This corpus consists of a set of English texts from the Wall Street
Journal, distributed in 25 directories of 100 files each, with
several sentences per file.
The total number of words is about one million, of which 49 000
are distinct.
The whole corpus was labeled with POS and syntactic tags.
The POS tag labeling consists of a set of 45 different categories.
Two more tags were added to mark the beginning and the end of a
sentence, resulting in a total of 47 different POS tags.


The Penn Treebank Corpus: Partitions

Dataset    Directory   Num. of sentences   Num. of words   Vocabulary size
Training   00-18       38 219              912 344         34 064
Tuning     19-21       5 527               131 768         12 389
Test       22-24       5 462               129 654         11 548
Total      00-24       49 208              1 173 766       38 452


The Penn Treebank Corpus: Preprocessing

Huge corpus with many words in the ambiguous vocabulary. Preprocessing
to reduce the vocabulary:
The training set was split into ten random partitions of equal size.
Words that appeared in only one partition were treated as unknown words.
For each word, POS tags that accounted for less than 1% of its tag
occurrences were eliminated (treated as tagging errors).
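
Both preprocessing steps can be sketched as below. The slides do not specify whether partitions are formed over sentences or words, nor how ties at the 1% threshold are handled, so partitioning by sentence and keeping tags with a share of at least 1% are assumptions, as are the function names.

```python
import random
from collections import Counter, defaultdict

def words_in_single_partition(sentences, n_partitions=10, seed=0):
    """Split sentences into random, (near-)equal partitions and return the
    words that occur in exactly one of them (treated as unknown words)."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    partitions_per_word = defaultdict(set)
    for index, sentence in enumerate(shuffled):
        partition = index % n_partitions
        for word, _tag in sentence:
            partitions_per_word[word].add(partition)
    return {word for word, parts in partitions_per_word.items() if len(parts) == 1}

def filter_rare_tags(sentences, min_share=0.01):
    """Build the per-word tag table, dropping tags below min_share of the
    word's occurrences (assumed to be annotation errors)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    table = {}
    for word, tag_counts in counts.items():
        total = sum(tag_counts.values())
        table[word] = {tag for tag, c in tag_counts.items() if c / total >= min_share}
    return table

# Toy data (hypothetical): "the" occurs everywhere, "rare" in one sentence only.
toy_sentences = [[("the", "DT")] for _ in range(20)] + [[("the", "DT"), ("rare", "NN")]]
unknown_candidates = words_in_single_partition(toy_sentences)

# "size" is tagged VBP once in 200 occurrences (0.5%), so VBP is dropped.
toy_tagged = ([[("size", "NN")]] * 199 + [[("size", "VBP")]]
              + [[("strikes", "NNS")], [("strikes", "VBZ")]])
tag_table = filter_rare_tags(toy_tagged)
```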


The Penn Treebank Corpus: Morph. information

Two morphological preprocessing filters:
Removing the prefixes from compound words (using a set of the
125 most common English prefixes). In this way, some unknown
words were converted to known words.
Example
pre-, electro-, tele-, . . .

All cardinal and ordinal numbers (except “one” and “second”,
which are polysemous) were replaced with the special token *CD*.
Example
twenty-years-old ⇒ *CD*-years-old
post-1987 ⇒ post-*CD*

The Penn Treebank Corpus: Morph. information

Morphological information added to the MLPs:
Three input units ⇒ the input word has an initial capital letter, is all
caps, or has a subset of its letters capitalized. This is an important
morphological characteristic, and it was also added to the POS tags of
the context (both MLPs).
A unit indicating whether the word contains a dash “-” (both MLPs).
A unit indicating whether the word contains a period “.” (both MLPs).
Suffix analysis to deal with unknown words (only MLPUnk ):
Compute the probability distribution of tags for suffixes of length
less than or equal to 10 ⇒ 709 suffixes found.
An agglomerative hierarchical clustering process was applied, and
a set of clusters was chosen empirically.
Finally, a set of the 21 most common grammatical suffixes was added.
MLPUnk needs 209 units to encode the presence of suffixes in words.
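
The input units described above can be sketched as a small feature extractor. The exact definition of the three capitalization units is an interpretation of the slide, and the five-suffix list is hypothetical (the 209 suffix units of MLPUnk are not enumerated).

```python
SUFFIXES = ("ing", "ed", "er", "ly", "s")  # hypothetical stand-in for the 209 suffix units

def capitalization_units(word):
    """Three units: [initial capital only, all caps, another subset capitalized]."""
    letters = [ch for ch in word if ch.isalpha()]
    uppers = [ch for ch in letters if ch.isupper()]
    if not uppers:
        return [0, 0, 0]
    if word[0].isupper() and len(uppers) == 1:
        return [1, 0, 0]
    if len(uppers) == len(letters):
        return [0, 1, 0]
    return [0, 0, 1]

def morph_features(word):
    """m_i: capitalization units plus one unit for a dash and one for a period."""
    return capitalization_units(word) + [int("-" in word), int("." in word)]

def suffix_units(word):
    """s_i: one unit per suffix, set when the word ends with that suffix."""
    lowered = word.lower()
    return [int(lowered.endswith(s) and len(lowered) > len(s)) for s in SUFFIXES]

features = {w: morph_features(w) for w in ["Europe", "lower", "U.S.", "twenty-first"]}
lower_suffixes = suffix_units("lower")
```

For "lower" the capitalization units are all zero (NCap) and the `er` suffix unit fires, which matches the "NCap, -er" annotation used in the earlier example.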


The Penn Treebank Corpus: after preprocessing

Dataset    Num. of words   Unambiguous   Ambiguous   Unknown
Training   912 344         549 272       361 704     1 368
Tuning     131 768         77 347        51 292      3 129
Test       129 654         75 758        51 315      2 581
Total      1 173 766       702 377       464 311     7 078

Vocabulary in Training

6 239 ambiguous words.
25 798 unambiguous words.




The connectionist POS taggers
Projection layer.
Error backpropagation algorithm for training.
The topology and training parameters of the multilayer perceptrons
were selected in previous experiments.
For the experiments we used a pattern recognition toolkit developed
by our research group.
MLPKnow was trained with the words of the ambiguous vocabulary.
MLPUnk was trained with words that appear fewer than 4 times.
Parameter                 MLPKnow                     MLPUnk
Input layer size          |T + M|(p + f) + 50 + |M|   |T + M|(p + f) + |M| + |S|
Output layer size         |T|                         |T|
Projection layer size     |Ω'| → 50                   –
Hidden layer(s) size      100-75                      175-100
Hidden layer act. func.   Hyperbolic tangent (both)
Output layer act. func.   Softmax (both)
Learning rate             0.005 (both)
Momentum                  0.001 (both)
Weight decay              0.0000001 (both)
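
To make the table concrete, here is a pure-Python sketch of a forward pass with the MLPKnow layer sizes (projection of 50, hidden layers of 100 and 75 units, tanh, softmax over 47 tags). It is not the authors' toolkit. The values of |T + M| and |M|, the random initialization and the choice p = 3, f = 4 are assumptions made only so that the example runs.

```python
import math
import random

def dense(inputs, weights, biases):
    """Affine layer: one weight row and one bias per output unit."""
    return [sum(x * w for x, w in zip(inputs, row)) + b
            for row, b in zip(weights, biases)]

def softmax(values):
    """Numerically stable softmax."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases (initialization scheme is an assumption)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def mlp_know_input_size(context_units, p, f, n_morph, projection=50):
    """Input layer size from the table: |T + M|(p + f) + 50 + |M|."""
    return context_units * (p + f) + projection + n_morph

def forward(inputs, layers):
    """tanh hidden layers followed by a softmax output layer."""
    hidden = inputs
    for weights, biases in layers[:-1]:
        hidden = [math.tanh(v) for v in dense(hidden, weights, biases)]
    weights, biases = layers[-1]
    return softmax(dense(hidden, weights, biases))

rng = random.Random(0)
N_TAGS = 47          # 45 POS tags plus sentence begin and end
CONTEXT_UNITS = 50   # |T + M| per context position: assumed value
N_MORPH = 5          # |M|: assumed as 3 capitalization units + dash + period
n_in = mlp_know_input_size(CONTEXT_UNITS, p=3, f=4, n_morph=N_MORPH)
sizes = [n_in, 100, 75, N_TAGS]
layers = [init_layer(rng, a, b) for a, b in zip(sizes, sizes[1:])]
probabilities = forward([0.0] * n_in, layers)
```

Training (backpropagation with the learning rate, momentum and weight decay listed in the table) is left out of the sketch.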


Performance on the tuning set

POS tagging error rate for the tuning set varying the context (p is the
past context, and f is the future context).

MLPKnow error (%)
         Future
Past     2       3       4       5
2        6.30    6.26    6.25    6.31
3        6.28    6.22    6.20    6.31
4        6.28    6.27    6.28    6.31

MLPUnk error (%)
         Future
Past     1        2        3
1        12.56    12.46    12.40
2        12.27    12.08    12.37
3        12.59    11.95    12.24
4        12.72    12.34    12.46


Test POS tagging performance

POS tagging error rate for the tuning and test sets for the global
system. Comparison of our connectionist system with morphological
information versus our previous system without morphological
information.

Partition   With morph. info.   Without morph. info.
Tuning      3.2%                4.2%
Test        3.3%                4.3%
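
As a reading aid, the absolute gain of 1.0 point in both partitions can also be expressed as a relative error reduction. This is plain arithmetic on the two rows above, not a figure reported in the slides:

```latex
\[
\frac{4.2 - 3.2}{4.2} \approx 23.8\% \ \text{(tuning)}, \qquad
\frac{4.3 - 3.3}{4.3} \approx 23.3\% \ \text{(test)}.
\]
```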




Conclusions: Comparison with other tagging systems
POS tagging error rate (%) for the test set. KnownAmb refers to the
disambiguation error for known ambiguous words. Unknown refers to the
POS tag error for unknown words. Total is the total POS tag error over
ambiguous, non-ambiguous, and unknown words.

Model          KnownAmb   Unknown   Total
SVMs           6.1        11.0      2.8
MT             -          23.5      3.5
TnT            7.8        14.1      3.5
NetTagger      -          -         3.8
HMM Tagger     -          -         5.8
RANN           -          -         8.0
Our approach   6.7        10.3      3.3

Results are comparable with state-of-the-art systems.


Conclusions: Future work

Increase the amount of morphological information.
Test the models in a graph-based approach.
Introduce a language model of POS tags to improve the results.


Thank you!

