Jessica Hullman's Natural Language Processing Project Using Support Vector Machines
Jessica Hullman
Natural Language Processing, Fall 2006
Professor Rich Thomason
December 15, 2006.
Abstract
I modeled my project after the implementation of supervised word sense disambiguation
with Support Vector Machines by Lee, Ng, and Chia. The authors participated in the
English lexical sample task of the word sense disambiguation competition Senseval-3,
using the part of speech (POS) of neighboring words, single words in the surrounding
context, local collocations, and syntactic relations as knowledge sources for the
machine learning technique of Support Vector Machines (SVM). This paper details the
first section of my project, in which I modify the POS portion of their implementation
using the identically formatted Senseval-2 data. I scored my performance on the
accuracy of the sense assignments made by the SVM and obtained a mean accuracy of
87%, with a standard deviation of 15% and a median of 92%.
This paper has five parts: Introduction, Support Vector Machines, Method, Evaluation,
and Possible Improvements. These are followed by a bibliography and a section called
“Programs” which outlines in more detail how the experiment proceeded. I would like to
acknowledge the following individuals who helped me (particularly in designing a couple
of the more complicated programs): Robert Finn, Joshua Gerrish, and Rich Thomason.
Introduction
Word sense disambiguation, an area of considerable research in Computational
Linguistics, refers to the problem of differentiating the various meanings of a word. A
word is described as polysemous if it has multiple meanings. For example, the word
“bar” has senses such as “a long piece of wood, metal etc. used as a support”, “a barrier
of any kind”, and “a plea arresting an action or claim”; the goal is to identify the correct
sense of “bar” in a given sentence.
The problem of disambiguation can be described as AI-complete, in that some
representation of common sense and real-world knowledge is required before it can be
resolved (Lecture). Disambiguating a given word involves two steps: first, determining
all of the different senses of the word, as well as which words are to be considered along
with it; and second, determining a means by which to assign each occurrence of the
word to the correct sense. Several major sources of information are typically used: the
word’s context, as well as external knowledge sources including lexicons (Ide and
Veronis, 1998, pg. 3).
WordNet, a large lexical database of English developed under George A. Miller, is the
best known of the external knowledge sources. Nouns, verbs, adjectives, and adverbs are
grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept;
different senses of a word are therefore in different synsets (WordNet,
http://wordnet.princeton.edu/). The meaning of each synset is further clarified with a
short definition.
Context-based methods (also called data-driven or corpus-based methods) use knowledge
about previously disambiguated instances of the word within corpora (Ide and Veronis,
1998, pg. 3). The distinction between lexicon-driven, knowledge-based methods and
corpus-based methods often coincides with the distinction between supervised and
unsupervised learning (supervised referring to a task in which the sense label of each
training instance is known, unsupervised to one in which it is not). Unsupervised methods
outline a clustering task, in which an external knowledge source such as a dictionary or
lexicon is used to seed the system, which then augments the labeled instances by learning
from unlabeled instances. Supervised learning, on the other hand, can be seen as a
classification task in which a function is induced from labeled data points.
Numerous issues arise with regard to word sense disambiguation. WordNet’s numerous
synsets per word raise one of the most prevalent: determining the appropriate degree of
sense granularity for a given task. Several authors (e.g. Slator and Wilks, 1987) have
remarked that the sense divisions one finds in dictionaries are often too fine for the
purposes of NLP work; WordNet’s sense distinctions have been criticized, for example,
for being more fine-grained than what may be needed in most natural language processing
applications (Ide and Veronis, 1998, pg. 13). Overly fine sense distinctions create
practical difficulties for automated WSD by requiring sense choices that are extremely
difficult even for expert lexicographers.
The problem of data sparseness is also severe. Very large amounts of text are needed
for supervised methods to ensure that all of the possible senses of a word are represented.
Producing corpora hand-labeled for senses, however, is an expensive, time-consuming
task, and the results are often less than satisfactory; there is often a fair amount of
disparity among human taggers regarding the finer sense distinctions of a word.
Natural language processing tasks in which word sense disambiguation is a relevant
concern include information retrieval, machine translation, and speech processing.
Beyond the issues of granularity, evaluating WSD systems outside of these tasks remains
a well-documented problem, arising not only from the substantial differences in test
conditions across studies, but also from differences in test words and variance in the
criteria for evaluating the correctness of a sense assignment.
The Senseval competition arose out of this need for accepted evaluation standards.
Senseval uses in vitro evaluation, in which a system’s output for a given input is scored
using precision and recall (versus in vivo evaluation, in which results are evaluated in
terms of their contribution to the overall performance of a system for a given
application) (Ide and Veronis, 1998, pg. 25). While somewhat artificial, the reasoning
behind the Senseval competition, and thus behind my project, is that close examination
of the problems that arise in word sense disambiguation will best improve the methods
used.
Within the Senseval competition, participants can compete in tasks including translation
as well as language-specific disambiguation. The English tasks in Senseval include an
English all-words task and the English lexical sample task; my project is concerned with
the latter. In the lexical sample task, evaluation is based on how well a system
disambiguates word-class-specific (for example, all noun) instances in the test data for a
sampling of words drawn from the WordNet lexicon. Tagging algorithms are expected to
assign probabilities to the possible tags they output.
To date, three Senseval competitions have been held; this project uses Senseval-2 data.
The corpus for the Senseval-2 English tasks comprises sentences from the British
National Corpus 2, the Penn Treebank, and the web, and is provided in XML format. I
used only this corpus in training my system.
Machine Learning Using Support Vector Machines (SVM)
In recent years, linear classification methods have increased in popularity with regard to
supervised learning tasks. A linear classifier is simply a classifier that bases its
classification decision on a linear function of its inputs. In other words, given that the
input to the classifier is a real feature vector x, the estimated output score (or
probability) is

y = f(w · x)

where w is a real vector of weights and f is a function that converts the dot product of
the two vectors into the desired output (Wikipedia, Linear classifier,
http://en.wikipedia.org/wiki/Linear_classifier).
In general, linear classifiers are fast and work well when the number of dimensions of the
input vector is very large; in document classification, for example, each element of the
input vector is typically the count of a word in a document. Linear classifiers can be
divided into generative models, which model conditional density functions, and
discriminative models, which attempt to maximize the quality of the output on a training
set. While common generative methods like Bayesian classification handle missing data
well, discriminative training methods, including the perceptron and Support Vector
Machines, generally yield higher accuracy (Wikipedia, Linear classifier,
http://en.wikipedia.org/wiki/Linear_classifier).
For a binary classification problem, f is a simple function mapping all values above a
certain threshold to one class and all other values to a second class (i.e., “yes” and “no”).
One can visualize the operation of a linear classifier as splitting a high-dimensional input
space with a hyperplane: all points on one side of the plane belong to the first class, and
all points on the other side belong to the second class.
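As a concrete illustration, the following Perl sketch classifies a single input vector using fixed, invented weights; in a real system the weight vector would be learned from training data.

#!/usr/bin/perl
use strict;
use warnings;

# Invented weight and input vectors, for illustration only.
my @w = (0.4, -0.2, 0.7);   # weight vector (learned, in practice)
my @x = (1, 0, 1);          # binary input feature vector

# The output score is the dot product w . x (here f is the identity).
my $score = 0;
$score += $w[$_] * $x[$_] for 0 .. $#w;

# Thresholding at zero splits the input space with a hyperplane.
my $class = $score > 0 ? '+1' : '-1';
print "score = $score, class = $class\n";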
The SVM is a binary classification learning method that categorizes data by constructing
a hyperplane, using optimization, between training instances mapped in a feature space
(Schölkopf and Smola, 2002). Because Lee, Ng, and Chia built one binary classifier for
each sense class, I opted to do the same. Like the authors, I converted nominal features
with numerous possible values into the corresponding number of binary (0 or 1) features.
In this scheme, if a nominal feature takes the nth value, then the corresponding (nth)
binary feature is set to 1 and all of the others are set to 0 (Witten and Frank, 2000).
The software I used is SVMlight, an implementation of SVMs in C. SVMlight solves
classification and regression problems, as well as ranking problems by learning a ranking
function. It handles many thousands of support vectors and hundreds of thousands of
training examples. SVMlight is an implementation of Vapnik's Support Vector Machine
(Vapnik, 1995), and the algorithms it uses are described in Joachims (1999). In the
ranking setting, the goal is to learn a function from preference examples so that it orders
a new set of objects as accurately as possible; such ranking problems occur naturally in
applications like search engines and recommender systems. The code has been used on a
wide range of problems, including text classification, image recognition tasks,
bioinformatics, and medical applications (SVMlight, http://svmlight.joachims.org/).
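For concreteness, a classification training file for SVMlight contains one instance per line: a +1 or -1 label followed by feature:value pairs in increasing feature order, with only nonzero features listed. The feature numbers below are invented for illustration:

+1 2:1 46:1 135:1 181:1 226:1 316:1
-1 7:1 50:1 99:1 140:1 230:1 271:1

Files in this format are passed to the svm_learn and svm_classify commands described under “Programs” below.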
Method
Features: POS of Neighboring Words
The decision of which features to use largely determines the project. Like Lee, Ng, and
Chia, the features I used were the parts of speech (POS) of neighboring words. The first
step involved deciding how many words ahead of and behind the given word to consider
for POS information. Both because Lee, Ng, and Chia used a three-word window, and
because research has shown that a window of more than k = 3 or 4 words is unnecessary
(Yarowsky, 1994a and b), I opted to use a three-word window.
As an example, given the training corpus sentence “As the leaves grow, train them
through the bars for a lovely effect,” with “train” as the word to be disambiguated, the
input vector corresponding to < P-3, P-2, P-1, P0, P1, P2, P3 > is set to
< DT, NNS, VB, VB, PRP, IN, DT >. I converted all nominal features with numerous
possible values into the corresponding number of binary features. This results in an input
vector resembling

< 01000…, 00001…, 00000…, 00000…, 10000…, 00000…, 00000… >,

where each place in the vector corresponds to a 45-digit string of 0’s, with a 1 in the
place corresponding to that particular tag.
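A minimal Perl sketch of this conversion is given below; the tag table is abbreviated to seven tags, whereas the real table covers all 45 Penn Treebank tags.

#!/usr/bin/perl
use strict;
use warnings;

# Abbreviated tag table; the real table holds all 45 Penn Treebank tags.
my @tags = qw(CC CD DT IN NNS PRP VB);
my %tag_index = map { $tags[$_] => $_ } 0 .. $#tags;

# POS window < P-3, ..., P3 > from the example sentence above.
my @window = qw(DT NNS VB VB PRP IN DT);

# Each window position becomes a binary string with a single 1 in the
# slot for that tag; a tag missing from the table yields all zeros.
my @vector;
for my $tag (@window) {
    my @bits = (0) x @tags;
    $bits[ $tag_index{$tag} ] = 1 if exists $tag_index{$tag};
    push @vector, join('', @bits);
}
print join(', ', @vector), "\n";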
Senseval provides the corpus already divided into a training and a test set. My first step
involved parsing the XML format of the corpus. For this I used XML::Twig, a non-event-
based XML parser that provides an easy-to-access tree interface (XML::Twig,
http://xmltwig.com/xmltwig/). While Twig made the initial parsing task much more
efficient in terms of programming, I still had to write a program to insert spacing where
the parser removed certain tags.
The accuracy of the POS tagger used in a word sense disambiguation task is a limiting
factor. My next step being to POS-tag the corpus, I opted to use the Brill tagger, an
error-driven, transformation-based tagger that works by first tagging a corpus based on
the broadest of a set of tagging rules, then applying successively more specific rules,
repeating this process until some stopping criterion is reached (Jurafsky and Martin,
2006). I chose the tagger for its accuracy of 95-97% (Brill Tagger,
http://www.ling.gu.se/~lager/mogul/brill-tagger/index.html).
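To make the transformation-based idea concrete, the toy Perl sketch below assigns each word its most frequent tag and then applies a single invented contextual rule; the real tagger learns and applies hundreds of such rules.

#!/usr/bin/perl
use strict;
use warnings;

# Initial state: tag each word with its most frequent tag (toy lexicon).
my @words = ('the', 'leaves', 'grow', ',', 'train', 'them');
my %lexicon = ('the' => 'DT', 'leaves' => 'NNS', 'grow' => 'VBP',
               ','   => ',',  'train'  => 'NN',  'them' => 'PRP');
my @tags = map { $lexicon{$_} || 'NN' } @words;

# One invented contextual rule: retag NN as VB when it follows a comma.
my @rules = ( { from => 'NN', to => 'VB', prev => ',' } );
for my $r (@rules) {
    for my $i (1 .. $#tags) {
        $tags[$i] = $r->{to}
            if $tags[$i] eq $r->{from} && $tags[$i - 1] eq $r->{prev};
    }
}
print "@tags\n";   # DT NNS VBP , VB PRP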
Next, I needed to substitute strings for certain characters output by the Brill tagger,
because Perl treats these characters as special. I then needed to extract the POS
information for the given words from the output of this program and convert it to the
format needed by the SVM, namely vectors of zeros and ones. To do this I used a
program that creates a table corresponding to the 45 parts of speech and then reads
through the parsed, POS-tagged corpora, keeping track of when it reaches a new instance
of a word. It then reads through the context associated with each instance, keeping track
of when it gets to the given word. When the word to be disambiguated is reached, the
POS’s of the three words before and after it are converted into a vector of seven 45-digit
strings, with each place in a string corresponding to a POS. For each POS in the vector, a
one is inserted in the place corresponding to that POS, while all the other places remain
0’s. A separate file is created for each word.
After this, the corpora (now in the form of separate files for each word) needed to be
separated into files corresponding to each separate word sense, so that the SVM could be
run once for each particular sense of a word.
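A Perl sketch of this per-sense splitting is given below; the sense ids and feature strings are invented, and the real program (“SVM_input.pl”, described under “Programs”) reads its instances from the per-word files.

#!/usr/bin/perl
use strict;
use warnings;

# Invented instances: [ sense id, encoded SVMlight feature string ].
my @instances = (
    [ 'bar_sense_1', '2:1 46:1 135:1' ],
    [ 'bar_sense_2', '7:1 50:1 99:1'  ],
    [ 'bar_sense_1', '3:1 48:1 140:1' ],
);

# One output file per sense: +1 for that sense, -1 for every other one.
my %senses = map { $_->[0] => 1 } @instances;
for my $sense (keys %senses) {
    open my $out, '>', "${sense}_SVM_prepared_input.txt" or die $!;
    for my $inst (@instances) {
        my $label = $inst->[0] eq $sense ? '+1' : '-1';
        print {$out} "$label $inst->[1]\n";
    }
    close $out;
}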
Evaluation
I evaluated my project using the evaluation module built into the SVM software.
Provided that the correct answers are supplied with the test data, the SVM outputs
statistics on the accuracy, precision, and recall of its sense assignments. I evaluated my
project on the accuracy of the sense assignments, obtaining a mean of 87%, with a
median of 92% and a standard deviation of 15%.
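The statistics that the classifier reports can be reproduced from its prediction file; the Perl sketch below shows the computation on invented gold labels and scores, predicting +1 whenever a score falls on the positive side of the hyperplane.

#!/usr/bin/perl
use strict;
use warnings;

# Invented gold labels and SVM output scores for five test instances.
my @gold  = (1, 1, -1, -1, 1);
my @score = (0.8, -0.3, -1.2, 0.4, 1.5);

my ($tp, $fp, $tn, $fn) = (0, 0, 0, 0);
for my $i (0 .. $#gold) {
    my $pred = $score[$i] > 0 ? 1 : -1;
    if    ($pred == 1  && $gold[$i] == 1)  { $tp++ }
    elsif ($pred == 1  && $gold[$i] == -1) { $fp++ }
    elsif ($pred == -1 && $gold[$i] == -1) { $tn++ }
    else                                   { $fn++ }
}
printf "accuracy  = %.2f\n", ($tp + $tn) / @gold;   # correct over total
printf "precision = %.2f\n", $tp / ($tp + $fp);
printf "recall    = %.2f\n", $tp / ($tp + $fn);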
Possible Improvements
There are several minor improvements that might considerably influence my results.
Most importantly, to compare this project accurately against the one it was modeled on
(Lee, Ng, and Chia), I would need to use the Senseval-3 data (now in the public domain)
as well as the Senseval scoring software.
Running a sentence segmentation program on the corpora before POS tagging them
would have allowed me to track where in the sentence the word to be disambiguated
occurred. Currently, my project tracks POS information across sentence boundaries.
Like Lee, Ng, and Chia, I built one binary classifier for each sense (meaning) of a word.
However, I might instead have run the SVM using a step-wise reduction method, in
which a binary classifier is first built over all instances of a word; then, as the SVM
eliminates one class at a time, its instances are removed from the input data file and a
new classifier is built for the remaining instances. This method would be more
computationally efficient, but whether it would improve the accuracy remains to be seen.
Bibliography
Ide, Nancy and Jean Veronis. (1998). “Word sense disambiguation: The state of the art.”
In Computational Linguistics, 24(1).
Joachims, Thorsten. (1999). “Transductive inference for text classification using support
vector machines.” Universität Dortmund, Dortmund, Germany.
Jurafsky, Daniel, and James H. Martin. Speech and Language Processing: An
introduction to natural language processing, computational linguistics, and speech
recognition, 2nd edition. (Online version).
http://www.cs.colorado.edu/~martin/slp.html
Lee, Yoong Keok and Hwee Tou Ng. (2002). “An empirical evaluation of knowledge
sources and learning algorithms for word sense disambiguation.” In Proceedings
of the Conference on Empirical Methods in Natural Language Processing
(EMNLP).
Schölkopf, Bernhard and Alex Smola. (2002). Learning with kernels. MIT Press,
Cambridge, MA.
Vapnik, Vladimir N. (1995). The nature of statistical learning theory. Springer-Verlag,
New York.
Witten, Ian H. and Eibe Frank. (2000). Data mining: Practical machine learning tools
and techniques with java implementations. Morgan Kaufman, San Francisco.
Yarowsky, D. (1995). “A comparison of corpus-based techniques for restoring accents
in Spanish and French text.” Proceedings of the 2nd Annual Workshop on Very
Large Text Corpora. Las Cruces.
Programs
All programs can be found (and are to be run) from /data0/users/rthomaso/tmp/hullman
on tangra. All files created within the referenced programs output to this directory. This
sequence of commands/programs was run first on the training data, then on the test data.
To re-run it, some of the pathnames specified in the programs may need to be changed
back (they are currently set to run on the test data); the actual running of this sequence
gets rather complicated.
Steps:
1. Run “xml_spacer_2.pl”
This program inserts a space in the original XML corpora so that the next program,
which parses it, does not run two words together without a space between them. A
sketch of the kind of fix involved is shown below.
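The filenames here are placeholders, and the actual tag set handled by “xml_spacer_2.pl” may differ; the essential fix can be sketched as a one-line substitution that puts a space before every tag, so that stripping the tags later cannot fuse the words on either side:

perl -pe 's/</ </g' senseval_original.xml > senseval_spaced.xml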
2. Run “senseval_parse2.pl”
This program calls up XML::Twig and parses the corpora, outputting a file
“senseval_data_spaced.txt”.
3. Run the following three commands:
cd /data2/tools/RULE_BASED_TAGGER_V1.14/Bin_and_Data
export PATH=$PATH:/data2/tools/RULE_BASED_TAGGER_V1.14/Bin_and_Data
/data2/tools/RULE_BASED_TAGGER_V1.14/Bin_and_Data/tagger LEXICON /data0/users/rthomaso/tmp/hullman/senseval_data_spaced.txt BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE > /data0/users/rthomaso/tmp/hullman/senseval_tagged.txt
These three commands run the Brill tagger on the parsed corpora, outputting the
results to a file called senseval_tagged.txt.
4. Run “POS_substitution.pl”
This program takes the POS-tagged corpora and substitutes strings for problematic
characters output by the POS tagger, including $, (, ), #, ., ;, and commas. It outputs
a file “senseval_tagged_POS_substituted.txt”. A sketch of this substitution follows.
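The placeholder strings below are invented, and the real mapping used by “POS_substitution.pl” may differ; the substitution amounts to a lookup table applied line by line:

#!/usr/bin/perl
use strict;
use warnings;

# Map characters that are special to Perl onto plain placeholder strings.
my %subst = (
    '$' => 'DOLLAR', '(' => 'LPAREN', ')' => 'RPAREN',
    '#' => 'POUND',  '.' => 'PERIOD', ';' => 'SEMI', ',' => 'COMMA',
);
my $pattern = join '|', map { quotemeta } keys %subst;

while (my $line = <>) {
    $line =~ s/($pattern)/$subst{$1}/g;
    print $line;
}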
5. Run “tag_parser.pl”
This program and the next constitute the bulk of the project. This one reads in the POS-
tagged, character-substituted output file from the previous program, tracking when it
gets to a new instance of a word. It then reads in the POS’s, and when it gets to the given
word, creates a vector of the POS of that word and of the three words before and after it.
For each POS in that vector, a 45-digit string is created, with a one inserted in the place
in that string corresponding to the part of speech of the word. A separate file is created
for every word, along with an indication of which particular vector corresponds to the
POS of the given word (out of all of the other words in the context). Each of these files
ends in “_SVM_input.txt”.
6. Run “SVM_input.pl”:
This splits the data into separate files corresponding to each instance and sense id
of the word. The files it outputs for each instance/sense id are formatted for input
into SVMlight; each ends in “_SVM_prepared_input.txt”.
7. Run SVM from /data0/tools/svmlight using two commands: first
“./svm_learn input_file model_file”, then “./svm_classify input_file model_file
output_file”.
The first command runs the learning module on the training data file
(corresponding to one sense of a word) and outputs a model file (parameters)
which the SVM classifying module takes in in order to make a prediction.
The classifying module takes in an input file (corresponding to one sense of a
word in the test data) and the model file created by running the learning module
on the training data file for that sense, and outputs a file with a prediction (in the
form of a one or a negative one, depending on which side of the hyperplane the
instance falls on) as well as statistics including the accuracy, precision, and recall
of the classification.
*Because the Senseval test data did not supply the correct answers within the XML
corpora, and because this information was needed to evaluate the SVM, I needed
to call up a separate Senseval file of answers for the test data and insert this
information into the test corpora (see “test_sense_id.plx”).