In this paper, we describe our recent advances on a novel approach to Part-Of-Speech tagging based on neural networks. Multilayer perceptrons are used following corpus-based learning from contextual, lexical and morphological information. The Penn Treebank corpus has been used for the training and evaluation of the tagging system. The results show that the connectionist approach is feasible and comparable with other approaches.
Adding morphological information to a connectionist Part-Of-Speech tagger
1. Adding morphological information to a connectionist
Part-Of-Speech tagger
F. Zamora-Martínez M.J. Castro-Bleda S. España-Boquera
S. Tortajada-Velert
Departamento de Sistemas Informáticos y Computación
Universidad Politécnica de Valencia, Spain
Escuela Superior de Enseñanzas Técnicas
Universidad CEU-Cardenal Herrera, Alfara del Patriarca, Valencia, Spain
10-12 November 2009, Sevilla
F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 1 / 33
2. Index
1 POS tagging
2 Probabilistic tagging
3 Connectionist tagging
4 The Penn Treebank Corpus
5 The connectionist POS taggers
6 Conclusions
4. What is Part-Of-Speech (POS) tagging?
T = {τ1 , τ2 , . . . , τk }: a set of POS tags
Ω = {ω1 , ω2 , . . . , ωm }: the vocabulary of the application
The goal of a Part-Of-Speech tagger is to associate each word in a text
with its correct lexical-syntactic category (represented by a tag).
Example
The grand jury commented on a number of other topics
DT JJ NN VBD IN DT NN IN JJ NNS
5. Ambiguity and applications
Words often have more than one POS tag: lower
Europe proposed lower rate increases . . . = JJR
To push the pound even lower . . . = RBR
. . . should be able to lower long-term . . . = VB
Ambiguity!!!
Applications: speech synthesis, speech recognition, information
retrieval, word-sense disambiguation, machine translation, ...
6. How hard is POS tagging? Measuring ambiguity
Penn Treebank (45-tag corpus)
Unambiguous (1 tag) 36,678 (84%)
Ambiguous (2-7 tags) 7,088 (16%)
Details: 2 tags 5,475
3 tags 1,313 (lower)
4 tags 250
5 tags 41
6 tags 7
7 tags 2 (bet, open)
A simple approach which assigns only the most common tag to each
word performs with 90% accuracy!
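The most-common-tag baseline mentioned above can be sketched in a few lines. This is only an illustration on toy data: the function names, the toy sentences, and the default tag for unseen words are assumptions of the sketch, not part of the presented system.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Count the tags seen with each word and keep the most frequent one."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_baseline(model, words, default_tag="NN"):
    # default_tag for unseen words is an assumption of this sketch
    return [model.get(word, default_tag) for word in words]

# Toy training data (hypothetical, loosely based on the "lower" examples)
train = [
    [("Europe", "NNP"), ("proposed", "VBD"), ("lower", "JJR"),
     ("rate", "NN"), ("increases", "NNS")],
    [("push", "VB"), ("the", "DT"), ("pound", "NN"),
     ("even", "RB"), ("lower", "RBR")],
    [("lower", "JJR"), ("rate", "NN")],
]
baseline = train_baseline(train)
print(tag_baseline(baseline, ["the", "lower", "rate"]))
```

Such a tagger ignores context entirely, which is why every occurrence of an ambiguous word such as "lower" receives the same tag.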
7. Unknown Words
How can one assign a tag to a given word if that word is unknown to
the tagger?
Unknown words are the hardest problem for POS tagging!
9. Probabilistic model
We are given a sentence: what is the best sequence of tags which
corresponds to the sequence of words?
Probabilistic view: Consider all possible sequences of tags and out of
this universe of sequences, choose the tag sequence which is most
probable given the observation sequence of words.
\[ \hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n) = \operatorname*{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n). \]
10. Probabilistic model: Simplifications
To simplify:
1 Words are independent of each other, and a word's identity depends only on its tag → lexical probabilities:
\[ P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i) \]
2 The probability of a tag depends only on the preceding tag or tags (bigram, trigram, ...) → contextual probabilities:
\[ P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}). \]
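As an illustration, both probability tables can be estimated from a tagged corpus by relative frequency. This is a minimal sketch under the bigram assumption; the function name, the toy corpus, and the absence of smoothing are assumptions of the sketch. The first tag of each sentence contributes no transition, in line with treating P(t_1 | t_0) as 1.

```python
from collections import Counter

def estimate_probabilities(tagged_sentences):
    """Relative-frequency estimates of P(word | tag) and P(tag | previous tag)."""
    emit = Counter()        # (tag, word) counts
    trans = Counter()       # (previous tag, tag) counts
    tag_count = Counter()   # how often each tag occurs
    prev_count = Counter()  # how often each tag is followed by another tag
    for sentence in tagged_sentences:
        prev = None
        for word, tag in sentence:
            emit[(tag, word)] += 1
            tag_count[tag] += 1
            if prev is not None:
                trans[(prev, tag)] += 1
                prev_count[prev] += 1
            prev = tag
    lexical = {(w, t): c / tag_count[t] for (t, w), c in emit.items()}
    contextual = {(t, p): c / prev_count[p] for (p, t), c in trans.items()}
    return lexical, contextual

# Toy corpus (hypothetical)
corpus = [
    [("the", "DT"), ("jury", "NN")],
    [("the", "DT"), ("number", "NN")],
]
lexical, contextual = estimate_probabilities(corpus)
print(lexical[("jury", "NN")], contextual[("NN", "DT")])
```

Here `lexical[(word, tag)]` stands for P(w_i | t_i) and `contextual[(tag, prev)]` for P(t_i | t_{i-1}).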
11. Probabilistic model: Limitations
With these assumptions, a typical probabilistic model is expressed as:
\[ \hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n) \approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1}), \]
where $\hat{t}_1^n$ is the best estimation of POS tags for the given sentence $w_1^n = w_1 w_2 \ldots w_n$, considering that $P(t_1 \mid t_0) = 1$.
1 It does not model long-distance relationships.
2 The contextual information takes into account the context on the
left while the context on the right is not considered.
Both limitations can be overcome using ANN models.
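For reference, the argmax of the bigram model on this slide is usually computed with Viterbi dynamic programming. The sketch below is one possible implementation, not the authors' code: the table layout, the floor probability for missing entries, and the toy numbers are all assumptions.

```python
import math

def viterbi(words, tags, lexical, contextual, floor=1e-12):
    """Most probable tag sequence under the bigram model.

    lexical[(word, tag)] = P(word | tag); contextual[(tag, prev)] = P(tag | prev).
    Missing entries get a small floor probability (an assumption of this sketch).
    """
    if not words:
        return []

    def logp(table, key):
        return math.log(table.get(key, floor))

    # P(t_1 | t_0) = 1, so the first column uses lexical probabilities only
    best = [{t: logp(lexical, (words[0], t)) for t in tags}]
    back = []
    for word in words[1:]:
        column, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p] + logp(contextual, (t, p)))
            column[t] = (best[-1][prev] + logp(contextual, (t, prev))
                         + logp(lexical, (word, t)))
            pointers[t] = prev
        best.append(column)
        back.append(pointers)
    # Follow the back-pointers from the best final tag
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy tables (hypothetical values)
toy_lexical = {("the", "DT"): 0.9, ("rate", "NN"): 0.5, ("rate", "VB"): 0.5}
toy_contextual = {("NN", "DT"): 0.8, ("VB", "DT"): 0.1}
print(viterbi(["the", "rate"], ["DT", "NN", "VB"], toy_lexical, toy_contextual))
```

The recursion only looks one tag to the left, which makes the two limitations listed above concrete: no long-distance dependencies and no right-hand context.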
13. Basic connectionist model
Europe proposed lower rate increases
NNP VBD ????? NN NNS
MLPs as POS tag classifiers:
MLP Input:
wi = lower: the ambiguous input word, local coding → projection layer
ci = NNP, VBD, NN, NNS: the tags of the words surrounding the
ambiguous word to be tagged (past and future context), local coding
MLP Output:
the probability of each tag given the input:
Pr(JJR|input)=0.6, Pr(RBR|input)=0.2, Pr(VB|input)=0.1, . . .
Therefore, the network learnt the following mapping:
\[ F(w_i, c_i, t_i, \Theta) = \Pr_{\Theta}(t_i \mid w_i, c_i) \]
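Local coding of the context can be illustrated with one-hot vectors, one per context tag, concatenated into the input of the network. This is a rough sketch: the helper names and the five-tag toy tagset are assumptions, and the word itself (also locally coded, then mapped by the projection layer) is left out.

```python
def one_hot(index, size):
    """Local coding: a vector of zeros with a single 1 at position index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def encode_context(context_tags, tagset):
    """Concatenate one local code per context tag."""
    index = {tag: i for i, tag in enumerate(tagset)}
    units = []
    for tag in context_tags:
        units.extend(one_hot(index[tag], len(tagset)))
    return units

toy_tagset = ["NNP", "VBD", "NN", "NNS", "JJR"]  # toy subset of the tag set
past, future = ["NNP", "VBD"], ["NN", "NNS"]     # context of "lower"
c_i = encode_context(past + future, toy_tagset)
print(len(c_i), sum(c_i))
```

With p past tags and f future tags, the context part of the input has |T|(p + f) units, of which exactly p + f are active.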
14. Morphological extended connectionist model
Europe proposed lower rate increases
NNP-Cap VBD-NCap ????? NN-NCap NNS-NCap
NCap, -er
MLPs as POS tag classifiers:
MLP Input:
wi = lower: the ambiguous input word, local coding → projection layer
mi = NCap, -er: morphological info related to the ambiguous input word.
ci = NNP-Cap, VBD-NCap, NN-NCap, NNS-NCap: the tags of the
words surrounding the ambiguous word to be tagged (past and
future context) extended with morphological information, local coding
MLP Output:
the probability of each tag given the input.
Therefore, the network learnt the following mapping:
\[ F(w_i, m_i, c_i, t_i, \Theta) = \Pr_{\Theta}(t_i \mid w_i, m_i, c_i) \]
15. And what about Unknown Words?
When evaluating the model, some words have never been seen during
training; they belong neither to the vocabulary of known ambiguous
words nor to the vocabulary of known non-ambiguous words →
“Unknown words”: the hardest problem for the network to tag correctly.
Proposed solution
A combination of two specialized models:
MLPKnow : the MLP specialized in known ambiguous words
MLPUnk : the MLP specialized in unknown words
16. MLPKnow for known ambiguous words
wi : known ambiguous input word, locally coded at the input of the projection layer
mi : morphological info related to the ambiguous input word
Context: two labels of past context and one label of future context, extended with morphological info.
\[ F_{Know}(w_i, m_i, c_i, t_i, \Theta_K) = \Pr_{\Theta_K}(t_i \mid w_i, m_i, c_i). \]
17. MLPUnk for unknown words
mi : morphological info related to the unknown input word (the same as for MLPKnow)
si : more specific morphological info related to the unknown input word (different from MLPKnow)
Context: three labels of past context and one label of future context, extended with morphological info.
\[ F_{Unk}(m_i, s_i, c_i, t_i, \Theta_U) = \Pr_{\Theta_U}(t_i \mid m_i, s_i, c_i), \]
where $s_i$ corresponds to additional morphological information related
to the unknown i-th input word.
18. Twi table with the POS tags
minutes NNS, NNPS
magnification NN
strikes NNS, VBZ
size VBP, NN
layoff NN
cohens NNPS
... ...
T_minutes = {NNS, NNPS}: known ambiguous word
T_magnification = {NN}: known non-ambiguous word
19. Final connectionist model
For each possible known word (ambiguous and non-ambiguous) we
have a $T_{w_i}$ table with the POS tags observed in training for word $w_i$:
\[
F(w_i, m_i, s_i, c_i, t_i, \Theta_K, \Theta_U) =
\begin{cases}
0 & \text{if } t_i \notin T_{w_i},\\
1 & \text{if } T_{w_i} = \{t_i\},\\
F_{Know}(w_i, m_i, c_i, t_i, \Theta_K) & \text{if } w_i \in \Omega \wedge t_i \in T_{w_i},\\
F_{Unk}(m_i, s_i, c_i, t_i, \Theta_U) & \text{in other case,}
\end{cases}
\]
where Ω is the vocabulary of ambiguous words.
\[ \hat{t}_1^n = \operatorname*{argmax}_{t_1^n} \Pr(t_1^n \mid w_1^n) \approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} F(w_i, m_i, s_i, c_i, t_i, \Theta_K, \Theta_U) \]
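The case analysis of the combined model can be sketched as a small dispatch function. Everything here is illustrative: `f_know` and `f_unk` are hypothetical callables standing in for the outputs of the two MLPs (the morphological and context inputs are hidden inside them), and the sketch assumes that the first two cases apply only to words that have a tag table.

```python
def combined_score(word, tag, tag_table, ambiguous_vocab, f_know, f_unk):
    """Piecewise combination of the two specialized models.

    tag_table maps each known word to the set of tags seen in training.
    f_know and f_unk are hypothetical stand-ins for the two MLPs.
    """
    if word in tag_table:
        allowed = tag_table[word]
        if tag not in allowed:
            return 0.0               # tag never observed with this word
        if len(allowed) == 1:
            return 1.0               # known non-ambiguous word
        if word in ambiguous_vocab:
            return f_know(word, tag) # known ambiguous word
    return f_unk(word, tag)          # any other case: treated as unknown

# Toy tag table taken from the examples on the previous slide
toy_table = {"minutes": {"NNS", "NNPS"}, "magnification": {"NN"}}
toy_ambiguous = {"minutes"}

def know_stub(word, tag):
    return 0.7  # hypothetical MLPKnow output

def unk_stub(word, tag):
    return 0.2  # hypothetical MLPUnk output

print(combined_score("minutes", "NNS", toy_table, toy_ambiguous, know_stub, unk_stub))
```

Only ambiguous and unknown words reach a network; every non-ambiguous known word is resolved by table lookup.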
21. The Penn Treebank Corpus
This corpus consists of a set of English texts from the Wall Street
Journal distributed in 25 directories, each containing 100 files with
several sentences each.
The total number of words is about one million, of which 49 000 are
distinct.
The whole corpus was labeled with POS and syntactic tags.
The POS tag labeling consists of a set of 45 different categories.
Two more tags were added to mark the beginning and end of a
sentence, resulting in a total of 47 different POS tags.
22. The Penn Treebank Corpus: Partitions
Dataset    Directories   Num. of sentences   Num. of words   Vocabulary size
Training   00-18         38 219              912 344         34 064
Tuning     19-21         5 527               131 768         12 389
Test       22-24         5 462               129 654         11 548
Total      00-24         49 208              1 173 766       38 452
23. The Penn Treebank Corpus: Preprocess
Huge corpus with a lot of words in the ambiguous vocabulary.
Preprocessing to reduce the vocabulary:
The training set was split into ten random partitions of equal size.
Words that appeared in just one partition were considered unknown words.
POS tags that accounted for less than 1% of a word's tag occurrences
were eliminated (treated as tagging errors).
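The two preprocessing steps might look roughly as follows. This is a sketch under stated assumptions: partitions are formed over sentences, "equal size" is approximated by round-robin assignment after shuffling, and the 1% rule is read as a threshold on the relative frequency of a tag among all occurrences of the word. The function names and toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def words_in_one_partition(sentences, n_parts=10, seed=0):
    """Words whose occurrences all fall into a single random partition."""
    order = list(range(len(sentences)))
    random.Random(seed).shuffle(order)
    seen_in = defaultdict(set)
    for position, idx in enumerate(order):
        part = position % n_parts  # round-robin gives near-equal sizes
        for word, _tag in sentences[idx]:
            seen_in[word].add(part)
    return {word for word, parts in seen_in.items() if len(parts) == 1}

def prune_rare_tags(sentences, threshold=0.01):
    """Tag table per word, without tags below the relative-frequency threshold."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    table = {}
    for word, tags in counts.items():
        total = sum(tags.values())
        table[word] = {t for t, c in tags.items() if c / total >= threshold}
    return table

# Toy data (hypothetical): "rare" occurs in a single sentence
toy_sentences = [[("the", "DT")] for _ in range(9)] + [[("the", "DT"), ("rare", "NN")]]
print(words_in_one_partition(toy_sentences))

noisy = [[("lower", "JJR")]] * 199 + [[("lower", "NN")]]
print(prune_rare_tags(noisy))
```

Words returned by the first function would then be handled as unknown words during training, which is how the unknown-word model gets training examples at all.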
24. The Penn Treebank Corpus: Morph. information
Two morphological preprocessing filters:
Deleting the prefixes from compound words (using a set of the
125 most common English prefixes). In this way, some unknown
words were converted to known words.
Example
pre-, electro-, tele-, . . .
All the cardinal and ordinal numbers (except “one” and “second”,
which are polysemous) were replaced with the special token *CD*.
Example
twenty-years-old ⇒ *CD*-years-old
post-1987 ⇒ post-*CD*
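Both filters can be approximated with simple string handling. The sketch below is illustrative only: the prefix tuple holds just the three examples from the slide rather than the full set of 125, the number-word set is a toy subset, and stripping a prefix only when the remainder is a known word is an assumption about how the filter was applied.

```python
import re

PREFIXES = ("pre", "electro", "tele")  # the slide's examples; the real set has 125
NUMBER_WORDS = {"two", "three", "ten", "twenty", "third", "tenth"}  # toy subset

def replace_numbers(token):
    """Replace numeric parts of a (possibly hyphenated) token with *CD*."""
    parts = []
    for part in token.split("-"):
        is_digits = re.fullmatch(r"\d+([.,]\d+)*", part) is not None
        if is_digits or part.lower() in NUMBER_WORDS:
            parts.append("*CD*")
        else:
            parts.append(part)  # "one" and "second" are deliberately left alone
    return "-".join(parts)

def strip_prefix(token, known_vocab):
    """Drop a leading prefix when that turns an unseen token into a known word."""
    if token in known_vocab:
        return token
    for prefix in PREFIXES:
        for lead in (prefix + "-", prefix):
            if token.startswith(lead) and token[len(lead):] in known_vocab:
                return token[len(lead):]
    return token

print(replace_numbers("twenty-years-old"), replace_numbers("post-1987"))
print(strip_prefix("pre-election", {"election"}))
```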
25. The Penn Treebank Corpus: Morph. information
Morphological information added to the MLPs:
Three input units ⇒ the input word has its first letter capitalized, is
all caps, or has a subset of its letters capitalized. This is an important
morphological characteristic, and it was also added to the POS tags of
the context (both MLPs).
A unit indicating whether the word contains a dash “-” (both MLPs).
A unit indicating whether the word contains a period “.” (both MLPs).
Suffix analysis to deal with unknown words (only MLPUnk ):
Compute the probability distribution of tags for suffixes of length
less than or equal to 10 ⇒ 709 suffixes found.
An agglomerative hierarchical clustering process was followed, and
an empirical set of clusters was chosen.
Finally, a set of the 21 most common grammatical suffixes was
added.
MLPUnk needs 209 units to take into account the presence of
suffixes in words.
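The morphological units and the first step of the suffix analysis can be sketched as follows. The exact definition of the three capitalization units (treated here as mutually exclusive) and the choice to count the whole word as its own longest suffix are assumptions of the sketch; the clustering step is not shown.

```python
from collections import Counter, defaultdict

def morph_units(word):
    """Five binary units: first capital, all caps, other caps, dash, period."""
    letters = [ch for ch in word if ch.isalpha()]
    n_upper = sum(ch.isupper() for ch in letters)
    all_caps = bool(letters) and n_upper == len(letters)
    first_cap = word[:1].isupper() and not all_caps
    some_caps = n_upper > 0 and not all_caps and not first_cap
    return [float(first_cap), float(all_caps), float(some_caps),
            float("-" in word), float("." in word)]

def suffix_tag_distribution(tagged_words, max_len=10):
    """Relative frequency of each tag for every suffix up to max_len characters."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        for k in range(1, min(max_len, len(word)) + 1):
            counts[word[-k:].lower()][tag] += 1
    return {suffix: {t: c / sum(tags.values()) for t, c in tags.items()}
            for suffix, tags in counts.items()}

print(morph_units("Europe"), morph_units("U.S."))
suffix_dist = suffix_tag_distribution(
    [("lower", "JJR"), ("higher", "JJR"), ("worker", "NN")])  # toy data
print(suffix_dist["er"])
```

Suffixes with similar tag distributions are natural candidates for merging, which is the role the hierarchical clustering plays in the slide.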
26. The Penn Treebank Corpus: after preprocessing
Dataset    Num. of words   Unambiguous   Ambiguous   Unknown
Training   912 344         549 272       361 704     1 368
Tuning     131 768         77 347        51 292      3 129
Test       129 654         75 758        51 315      2 581
Total      1 173 766       702 377       464 311     7 078
Vocabulary in Training
6 239 ambiguous words.
25 798 unambiguous words were obtained.
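A quick arithmetic check of the table above: each row should add up to its word count, and the shares of ambiguous and unknown words follow directly. The figures are copied from the table; the variable and function names are just for this sketch.

```python
# (total words, unambiguous, ambiguous, unknown), copied from the table
rows = {
    "Training": (912344, 549272, 361704, 1368),
    "Tuning": (131768, 77347, 51292, 3129),
    "Test": (129654, 75758, 51315, 2581),
    "Total": (1173766, 702377, 464311, 7078),
}

def shares(row):
    """Percentage of ambiguous and unknown words in a row."""
    total, _unamb, amb, unk = row
    return 100 * amb / total, 100 * unk / total

for name, row in rows.items():
    total, unamb, amb, unk = row
    assert unamb + amb + unk == total  # every row is internally consistent
    amb_pct, unk_pct = shares(row)
    print(f"{name}: ambiguous {amb_pct:.1f}%, unknown {unk_pct:.2f}%")
```

Close to 40% of the test words are ambiguous and about 2% are unknown, so the two MLPs together decide the tag for well under half of the running words.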
28. The connectionist POS taggers
Projection layer.
Error backpropagation algorithm for training.
The topology and training parameters of the multilayer perceptrons
were selected in previous experiments.
For the experiments we have used a toolkit for pattern recognition
tasks developed by our research group.
MLPKnow trained with ambiguous vocabulary words.
MLPUnk trained with words that appear less than 4 times.
Parameter                 MLPKnow                       MLPUnk
Input layer size          |T + M|(p + f) + 50 + |M|     |T + M|(p + f) + |M| + |S|
Output layer size         |T|                           |T|
Projection layer size     |Ω| → 50                      –
Hidden layer(s) size      100-75                        175-100
Hidden layer act. func.   Hyperbolic tangent (both MLPs)
Output layer act. func.   Softmax (both MLPs)
Learning rate             0.005 (both MLPs)
Momentum                  0.001 (both MLPs)
Weight decay              0.0000001 (both MLPs)
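The input layer sizes in the table are simple functions of the context lengths. The sketch below just evaluates the two formulas; the value used for |T + M| is purely hypothetical, |M| = 5 assumes the three capitalization units plus the dash and period units, and |S| = 209 is the number of suffix units given earlier.

```python
def input_size_know(tm, p, f, m, projection=50):
    """|T + M|(p + f) + 50 + |M|: context codes, projected word, morphology."""
    return tm * (p + f) + projection + m

def input_size_unk(tm, p, f, m, s):
    """|T + M|(p + f) + |M| + |S|: context codes, morphology, suffix units."""
    return tm * (p + f) + m + s

TM = 50  # hypothetical size of one context tag code with its morphological units
M = 5    # assumed: three capitalization units + dash + period
S = 209  # suffix units needed by MLPUnk

print(input_size_know(TM, p=2, f=1, m=M))
print(input_size_unk(TM, p=3, f=1, m=M, s=S))
```

The formulas show that each extra context position costs |T + M| input units, so widening the context grows the input layer linearly.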
29. Performance on the tuning set
POS tagging error rate for the tuning set varying the context (p is the
past context, and f is the future context).
MLPUnk error
Past \ Future   1       2       3
1               12.56   12.46   12.40
2               12.27   12.08   12.37
3               12.59   11.95   12.24
4               12.72   12.34   12.46
MLPKnow error
Past \ Future   2      3      4      5
2               6.30   6.26   6.25   6.31
3               6.28   6.22   6.20   6.31
4               6.28   6.27   6.28   6.31
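Choosing the context sizes from these tuning results amounts to taking the cell with the lowest error in each table. The dictionaries below assume the reading rows = past context, columns = future context, with the values around 12 belonging to MLPUnk and those around 6 to MLPKnow; the names are just for this sketch.

```python
# Keys are (past, future); values are tuning-set error rates from the slide
unk_error = {
    (1, 1): 12.56, (1, 2): 12.46, (1, 3): 12.40,
    (2, 1): 12.27, (2, 2): 12.08, (2, 3): 12.37,
    (3, 1): 12.59, (3, 2): 11.95, (3, 3): 12.24,
    (4, 1): 12.72, (4, 2): 12.34, (4, 3): 12.46,
}
know_error = {
    (2, 2): 6.30, (2, 3): 6.26, (2, 4): 6.25, (2, 5): 6.31,
    (3, 2): 6.28, (3, 3): 6.22, (3, 4): 6.20, (3, 5): 6.31,
    (4, 2): 6.28, (4, 3): 6.27, (4, 4): 6.28, (4, 5): 6.31,
}

def best_context(errors):
    """Return the (past, future) pair with the lowest error."""
    return min(errors, key=errors.get)

print(best_context(unk_error), best_context(know_error))
```

The differences between neighbouring cells are small, so the choice of context is not very sensitive around the minimum.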
30. Test POS tagging performance
POS tagging error rate for the tuning and test sets for the global
system. Comparison of our connectionist system with morphological
information versus our previous system without morphological
information.
Partition   With morph. info.   Without morph. info.
Tuning      3.2%                4.2%
Test        3.3%                4.3%
32. Conclusions: Comparison with other tagging systems
POS tagging error rate for the test set. KnownAmb refers to the
disambiguation error for known ambiguous words. Unknown refers to the
POS tag error for unknown words. Total is the total POS tag error, with
ambiguous, non-ambiguous, and unknown words.
Model          KnownAmb   Unknown   Total
SVMs           6.1        11.0      2.8
MT             -          23.5      3.5
TnT            7.8        14.1      3.5
NetTagger      -          -         3.8
HMM Tagger     -          -         5.8
RANN           -          -         8.0
Our approach   6.7        10.3      3.3
Results are comparable with state-of-the-art systems.
33. Conclusions: Future work
Increase the amount of morphological information.
Test the models in a graph-based approach.
Introduce a language model of POS tags to improve the results.
34. Thank you!