Jessica Hullman
Natural Language Processing, Fall 2006
Professor Rich Thomason
December 15, 2006.


Abstract

I modeled my project after the implementation of supervised word sense disambiguation
with Support Vector Machines by Lee, Ng, and Chia. The authors participated in the
Senseval-3 word sense disambiguation competition in the English lexical sample task,
using the parts of speech (POS) of neighboring words, single words in the
surrounding context, local collocations, and syntactic relations as features for the
machine learning technique of Support Vector Machines (SVM). This paper details the
first section of my project, in which I adapt the POS portion of their implementation
to the identically formatted Senseval-2 data. I scored my performance on the
accuracy of the sense assignments made by the SVM and obtained a mean accuracy of
87%, with a standard deviation of 15% and a median of 92%.

This paper has five parts: Introduction, Support Vector Machines, Method, Evaluation,
and Possible Improvements. These are followed by a bibliography and a section called
“Programs” which outlines more specifically how the experiment proceeded. I would
like to acknowledge the following individuals who helped me (particularly in
designing a couple of the more complicated programs): Robert Finn, Joshua Gerrish,
and Rich Thomason.


Introduction

Word sense disambiguation (WSD), an area of considerable research in computational
linguistics, refers to the problem of differentiating the various meanings of a word. A
word is described as polysemous if it has multiple meanings; for example, the word
“bar” has senses such as “a long piece of wood, metal etc. used as a support”, “a
barrier of any kind”, and “a plea arresting an action or claim”. The goal is to
identify the correct sense of “bar” in a given sentence.

The problem of disambiguation can be described as AI-complete in that some
representation of common sense and real-world knowledge is required before it can be
resolved (Lecture). Disambiguating a given word involves two steps: first, all of the
different senses of the word, as well as the words to be considered along with the given
word, must be determined; second, a means must be determined by which to assign
each occurrence of the word to the correct sense. Several major sources of information
are typically used: the word’s context, as well as external knowledge sources such as
lexicons (Ide and Veronis, 1998, pg. 3).
WordNet, a large lexical database of English developed under George A. Miller, is the
best known of the external knowledge sources. Nouns, verbs, adjectives, and adverbs are
grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept;
different senses of a word are therefore in different synsets (WordNet,
http://wordnet.princeton.edu/). The meaning of each synset is further clarified with a
short definition.

Context-based methods (also called data-driven or corpus-based methods) use knowledge
about previously disambiguated instances of the word within corpora (Ide and
Veronis, 1998, pg. 3). The distinction between lexicon-driven, knowledge-based methods
and corpus-based methods often corresponds to the distinction between supervised and
unsupervised learning (supervised referring to a task in which the sense label of each
training instance is known; unsupervised, one in which it is not). Unsupervised methods
outline a clustering task, in which an external knowledge source such as a dictionary or
lexicon is used to seed the system, which then augments the labeled instances by learning
from unlabeled ones. Supervised learning, on the other hand, can be seen as a
classification task in which a function is deduced from labeled data points.

Numerous issues arise with regard to word sense disambiguation. WordNet’s numerous
synsets per word raise one of the most prevalent of these: determining the appropriate
degree of sense granularity for a given task. Several authors (e.g. Slator and Wilks, 1987)
have remarked that the sense divisions one finds in dictionaries are often too fine for the
purposes of NLP work; WordNet’s sense distinctions have been criticized, for example,
for being more fine-grained than what may be needed in most natural language processing
applications (Ide and Veronis, 1998, pg. 13). Overly fine sense distinctions create
practical difficulties for automated WSD by requiring sense choices that are
extremely difficult, even for expert lexicographers.

The problem of data sparseness is also severe. Very large amounts of text are needed
for supervised methods to ensure that all of the possible senses of a word are represented.
Producing corpora hand-labeled for senses, however, is an expensive, time-consuming
task, and the results are often less than satisfactory: there is often a fair amount of
disparity among human taggers regarding the finer sense distinctions of a word.

Natural language processing tasks in which word sense disambiguation is a relevant
concern include information retrieval, machine translation, and speech processing.
Beyond the issue of granularity, evaluating WSD systems outside of these tasks remains
a well-documented problem, arising not only from the substantial differences in test
conditions across studies, but also from differences in the test words used and variance
in the criteria for evaluating the correctness of a sense assignment.

The SENSEVAL competition arose out of this need for accepted evaluation standards.
SENSEVAL uses in vitro evaluation, which compares a system’s output for a
given input using precision and recall (versus in vivo evaluation, in which results are
evaluated in terms of their contribution to the overall performance of a system for a given
application) (Ide and Veronis, 1998, pg. 25). While somewhat artificial, the reasoning
behind the Senseval competition, and thus behind my project, is that close
examination of the problems that arise in word sense disambiguation will best improve
the methods used.

Within the Senseval competition, participants can compete in tasks including translation
as well as language-specific disambiguation. The English tasks in Senseval include an
English all-words task and an English lexical sample task; my project is concerned with
the latter. In the lexical sample task, evaluation is based on how well a system
disambiguates word-class-specific (for example, all noun) instances in the test data for a
sample of words drawn from the WordNet lexicon. Tagging algorithms are expected to
assign probabilities to the possible tags they output.

To date, three Senseval competitions have been held; this project uses Senseval-2 data.
The corpus for the Senseval-2 English tasks comprises sentences from the British
National Corpus 2, the Penn Treebank, and the web, and is provided in XML format. I
used only this corpus in training my system.


Machine Learning Using Support Vector Machines (SVM)


In recent years, linear classification methods have increased in popularity
with regard to supervised learning tasks. A linear classifier is simply a classifier that
bases its classification decision on a linear function of its inputs. In other words,
given that the input to the classifier is a real feature vector x, the estimated
output score (or probability) is

y = f(w · x)

where w is a real vector of weights and f is a function that converts the dot product of
the two vectors into the desired output (Wikipedia, Linear classifier,
http://en.wikipedia.org/wiki/Linear_classifier).

In general, linear classifiers are fast and work well when the number of dimensions of the
input vector is very large; in document classification, for example, each element in the
input vector is typically the count of a word in a document. Linear classifiers can be
divided into generative models, which model conditional density functions, and
discriminative models, which attempt to maximize the quality of the output on a
training set. While common generative methods like Bayesian classification handle
missing data well, discriminative methods, including the perceptron and Support
Vector Machines, generally yield higher accuracy (Wikipedia, Linear classifier,
http://en.wikipedia.org/wiki/Linear_classifier).

For a binary classification problem, f is a simple function mapping all values above a
certain threshold to one class and all other values to a second class (i.e., “yes” and “no”).
One can visualize the operation of a linear classifier as splitting a high-dimensional input
space with a hyperplane: all points on one side of the plane belong to the first class, and
all points on the other side belong to the second class.
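As a toy sketch (in Python, rather than the Perl used for this project), a linear binary classifier reduces to exactly this: a dot product with a weight vector, followed by a threshold; the weights below are made-up values, not learned ones.

```python
# Minimal linear binary classifier: score = w . x, thresholded at 0.
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def classify(w, x, threshold=0.0):
    """Return +1 if the point falls on one side of the hyperplane, -1 otherwise."""
    return 1 if dot(w, x) > threshold else -1

w = [0.5, -1.0, 0.25]           # illustrative weight vector (not learned)
print(classify(w, [2, 0, 1]))   # score 1.25 -> class +1
print(classify(w, [0, 2, 0]))   # score -2.0 -> class -1
```

An SVM learns w (and the threshold) so that the separating hyperplane has maximum margin, but the decision rule at classification time has this same form.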




The SVM is a binary classification learning method that categorizes data by constructing
a hyperplane, using optimization, between training instances mapped into a feature space
(Schölkopf and Smola, 2002). Because Lee, Ng, and Chia built one binary classifier for
each sense class, I opted to do the same. Like the authors, I converted nominal features
with numerous possible values into the corresponding number of binary (0 or 1) features.
In this scheme, if a nominal feature takes its nth value, then the corresponding (nth)
binary feature is set to 1 and all of the others are set to 0 (Witten and Frank, 2000).

The software I used is SVMlight, an implementation of SVMs in C. SVMlight solves
classification, regression, and ranking problems, the last by learning a ranking
function. It handles many thousands of support vectors and hundreds of thousands of
training examples. SVMlight is an implementation of Vapnik's Support Vector Machine
(Vapnik, 1995), and the algorithms used in it are described in (Joachims, 1999).

In the ranking setting, the goal is to learn a function from preference examples so that it
orders a new set of objects as accurately as possible. Such ranking problems naturally
occur in applications like search engines and recommender systems. The code has been
used on a large range of problems, including text classification, image recognition tasks,
bioinformatics, and medical applications (SVMlight, http://svmlight.joachims.org/).


Method
Features: POS of Neighboring Words

The decision of which features to use determines the project. Like Lee, Ng, and Chia, the
features I used were the parts of speech (POS) of neighboring words. The first step
involved deciding how many words before and after the given word to consider for
POS information. Both because Lee, Ng, and Chia used a three-word window, and
because research has shown that a window larger than k = 3 or 4 is unnecessary
(Yarowsky, 1994a and b), I opted to use a three-word window.
As an example, given the training corpus sentence “As the leaves grow, train them
through the bars for a lovely effect,” the input vector, corresponding to
< P-3, P-2, P-1, P0, P1, P2, P3 >, is set to < DT, NNS, VB, VB, PRP, IN, DT >. I converted all
nominal features with numerous possible values into the corresponding number of binary
features. This results in an input vector resembling
< 01000…, 00001…, 00000…, 00000…, 10000…, 00000…, 00000… >,
wherein each place in the vector corresponds to a 45-digit string of 0’s, with a 1 in the
place corresponding to that particular tag.
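The nominal-to-binary conversion can be sketched as follows (in Python, for illustration only; the project's programs were written in Perl). The tag list here is a truncated stand-in for the full 45-tag Penn Treebank set.

```python
# Convert one nominal POS value into a binary string: all 0's except a
# single 1 at the position of that tag in the tagset.
TAGS = ["CC", "CD", "DT", "EX", "FW", "IN", "JJ", "NN", "NNS", "PRP", "VB"]  # stand-in; real set has 45 tags

def one_hot(tag, tagset=TAGS):
    bits = ["0"] * len(tagset)
    bits[tagset.index(tag)] = "1"   # nth value -> nth binary feature set to 1
    return "".join(bits)

print(one_hot("DT"))  # -> "00100000000"
```

With the full 45-tag set, each of the seven window positions expands into a 45-digit string exactly as in the vector shown above.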

Senseval provides the corpus already divided into a training and a test set. My first step
involved parsing the XML format of the corpus. For this I used XML::Twig, a non-event-
based XML parser that provides an easy-to-access tree interface (XML::Twig,
http://xmltwig.com/xmltwig/). While Twig made the initial parsing task much more
efficient in terms of programming, I still had to develop a program to insert spacing
where the parser removed certain tags.
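For illustration, the same tree-style extraction can be sketched with Python's standard xml.etree module (the project itself used Perl's XML::Twig); the element names <instance>, <context>, and <head> below follow the general shape of the Senseval lexical sample markup and should be treated as assumptions.

```python
# Sketch: pull each tagged instance's context text out of Senseval-style XML.
import xml.etree.ElementTree as ET

doc = """<lexelt item="bar.n">
  <instance id="bar.1">
    <context>train them through the <head>bars</head> for a lovely effect</context>
  </instance>
</lexelt>"""

root = ET.fromstring(doc)
for inst in root.iter("instance"):
    ctx = inst.find("context")
    # itertext() flattens the mixed content (text around and inside <head>)
    # back into plain words; note the spacing must survive tag removal.
    text = " ".join("".join(ctx.itertext()).split())
    print(inst.get("id"), "->", text)
```

The spacing problem mentioned above shows up here: when a tag like <head> is stripped, adjacent text nodes can run together unless a space is inserted, which is what xml_spacer_2.pl (described under Programs) handles.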

The accuracy of the POS tagger used in a word sense disambiguation task is a limiting
factor. My next step being to POS-tag the corpus, I opted to use the Brill tagger, an
error-driven, transformation-based tagger that works by first tagging a corpus using the
broadest of a set of tagging rules, then applying progressively more specific rules,
repeating this process until some stopping criterion is reached (Jurafsky and Martin,
2006). I chose the tagger for its accuracy of 95-97% (Brill Tagger,
http://www.ling.gu.se/~lager/mogul/brill-tagger/index.html).

Next, I needed to substitute certain characters output by the Brill tagger, because these
characters have special meanings in Perl. I then needed to extract the POS information
for the given words from the output of this program and convert the information to the
format needed by the SVM, namely vectors of zeros and ones. To do this I used a
program that created a table corresponding to the 45 parts of speech and then read
through the parsed, POS-tagged corpora, keeping track of when it reached a new instance
of a word. It then reads through the context associated with each instance, keeping track
of when it gets to the given word. When the word to be disambiguated is reached, the
POS’s of the three words before and after it are converted into a vector of seven 45-digit
strings, with each place in a string corresponding to a POS. For each POS in the vector, a
one is inserted in the place corresponding to that POS, while all the other places remain
0’s. A separate file is created for each word.
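The windowing step can be sketched as follows (Python for illustration; the actual program, tag_parser.pl, is in Perl). The padding tag and the truncated tagset are assumptions for the sketch; they stand in for however the real program handles sentence edges and the full 45-tag set.

```python
# Sketch: given the POS tags of a context and the index of the target word,
# collect the POS of the target plus the three words on either side, then
# expand each tag into a binary string.
TAGS = ["DT", "IN", "JJ", "NN", "NNS", "PRP", "VB", "<PAD>"]  # stand-in for the 45-tag set

def pos_window(tags, i, k=3, pad="<PAD>"):
    """Return the 2k+1 tags centered on position i, padding at the edges."""
    padded = [pad] * k + tags + [pad] * k
    return padded[i : i + 2 * k + 1]

def to_binary(window, tagset=TAGS):
    """One binary string per window position: a 1 at the tag's index, 0 elsewhere."""
    vec = []
    for tag in window:
        bits = ["0"] * len(tagset)
        bits[tagset.index(tag)] = "1"
        vec.append("".join(bits))
    return vec

# "train them through the bars for a lovely effect", target "bars" at index 4
tags = ["VB", "PRP", "IN", "DT", "NNS", "IN", "DT", "JJ", "NN"]
window = pos_window(tags, 4)
print(window)  # ['PRP', 'IN', 'DT', 'NNS', 'IN', 'DT', 'JJ']
```

With the real 45-tag table, to_binary produces the vector of seven 45-digit strings described above.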

After this, the corpora (now in the form of separate files for each word) needed to be
separated into files corresponding to each separate word sense, so that the SVM could be
run once for each particular sense of a word.

Evaluation

I evaluated my project using the evaluation module built into the SVM software.
Provided that the correct answers are supplied with the test data, the SVM outputs
statistics on the accuracy, precision, and recall of its sense assignments. I evaluated my
project on the accuracy of the sense assignments, obtaining a mean of 87%, with a
median of 92% and a standard deviation of 15%.
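For clarity, the accuracy figure reported by the classifier is simply the fraction of test instances whose predicted side of the hyperplane (+1 or -1) matches the gold label; a sketch with made-up labels:

```python
# Accuracy = correct predictions / total predictions.
def accuracy(gold, predicted):
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)

gold      = [1, 1, -1, -1, 1]   # hypothetical gold labels for one sense
predicted = [1, -1, -1, -1, 1]  # hypothetical SVM outputs
print(accuracy(gold, predicted))  # -> 0.8
```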

Possible Improvements

There are several improvements that might considerably influence my results.
Most importantly, to compare this project accurately to the one it was modeled on
(Lee, Ng, and Chia), I would need to use the Senseval-3 data (now in the public domain)
as well as the Senseval scoring software.

Running a sentence segmentation program on the corpora before POS tagging
would have allowed me to track where in the sentence the word to be disambiguated
occurred. Currently, my project tracks POS information across sentence boundaries.

Like Lee, Ng, and Chia, I built one binary classifier for each sense (meaning) of a
word. However, I might instead have run the SVM using a step-wise reduction method,
in which a binary classifier is first built for all instances of a word; then, as one sense at a
time is eliminated by the SVM, it is removed from the input data file and a new classifier
is built for the remaining instances. This method would be more computationally
efficient, but whether it would improve the accuracy remains to be seen.

Bibliography



Ide, Nancy and Jean Veronis. (1998). “Word sense disambiguation: The state of the art.”
       In Computational Linguistics, 24(1).

Joachims, Thorsten. (1999). “Transductive inference for text classification using support
       vector machines.” Universität Dortmund, Dortmund, Germany.

Jurafsky, Daniel, and James H. Martin. Speech and Language Processing: An
       introduction to natural language processing, computational linguistics, and speech
       recognition, 2nd edition. (Online version).
       http://www.cs.colorado.edu/~martin/slp.html

Lee, Yoong Keok and Hwee Tou Ng. (2002). “An empirical evaluation of knowledge
       sources and learning algorithms for word sense disambiguation.” In Proceedings
       of the Conference on Empirical Methods in Natural Language Processing
       (EMNLP).

Schölkopf, Bernhard and Alex Smola. (2002). Learning with kernels. MIT Press,
      Cambridge, MA.

Vapnik, Vladimir N. (1995). The nature of statistical learning theory. Springer-Verlag,
      New York.

Witten, Ian H. and Eibe Frank. (2000). Data mining: Practical machine learning tools
       and techniques with java implementations. Morgan Kaufman, San Francisco.

Yarowsky, D. (1995). “A comparison of corpus-based techniques for restoring accents
      in Spanish and French text.” Proceedings of the 2nd Annual Workshop on Very
      Large Text Corpora. Las Cruces.

Programs


All programs can be found (and are to be run) from /data0/users/rthomaso/tmp/hullman
on tangra. All files created by the referenced programs output to this directory. This
sequence of commands/programs was run first on the training data, then on the test data.
To re-run it, some of the pathnames specified in the programs may need to be changed
back (they are currently set to run on the test data); the actual running of this sequence
gets rather complicated.

Steps:
   1. Run “xml_spacer_2.pl”

       This program inserts spaces in the original XML corpora so that the next program,
       which parses them, does not run two words together without a space between them.

   2. Run “senseval_parse2.pl”

       This program calls up XML::Twig and parses the corpora, outputting a file
       “senseval_data_spaced.txt”.

   3. Run the following three commands:

       cd /data2/tools/RULE_BASED_TAGGER_V1.14/Bin_and_Data

       export
       PATH=$PATH:/data2/tools/RULE_BASED_TAGGER_V1.14/Bin_and_Data

       /data2/tools/RULE_BASED_TAGGER_V1.14/Bin_and_Data/tagger LEXICON /
       data0/users/rthomaso/tmp/hullman/senseval_data_spaced.txt BIGRAMS
       LEXICALRULEFILE CONTEXTUALRULEFILE > /data0/users/rthomaso/tmp/
       hullman/senseval_tagged.txt

       These three commands run the Brill tagger on the parsed corpora, outputting the
       results to a file called senseval_tagged.txt.

   4. Run “POS_substitution.pl”

       This program takes the POS-tagged corpora and substitutes strings for
       problematic characters output by the POS tagger, including $, (, ), #, ., ;, and
       commas. Outputs a file “senseval_tagged_POS_substituted.txt”.

5. Run “tag_parser.pl”.

   This program and the next do the bulk of the project’s work. This one reads in the
   POS-tagged, character-substituted output file from the previous program, tracking
   when it gets to a new instance of a word. It then reads in the POS’s, and when it
   gets to the given word, creates the vector of the POS of that word as well as of the
   three words before and after it. For each POS in that vector, a 45-digit string is
   created, with a one inserted in the place in that string corresponding to the part of
   speech of the word. A separate file is created for every word, along with an
   indication of which particular vector corresponds to the POS of the given word
   (out of all of the other words in the context). Each of these files ends in
   “_SVM_input.txt”.

6. Run “SVM_input.pl”:

   This splits the data into separate files corresponding to each instance and sense id
   of the word. The files it outputs for each instance/sense id are formatted for input
   into SVMlight; each ends in “_SVM_prepared_input.txt”.
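SVMlight expects each training or test example on one line: a ±1 target label followed by ascending, 1-based feature:value pairs for the non-zero features only (sparse format). As a hedged sketch of that conversion (this function is my own illustration, not the logic of SVM_input.pl):

```python
# Hypothetical sketch: turn a label and the binary feature strings into one
# SVMlight input line, listing only the features whose value is 1.
def svmlight_line(label, binary_strings):
    bits = "".join(binary_strings)            # concatenate the 45-digit strings
    feats = ["%d:1" % (i + 1) for i, b in enumerate(bits) if b == "1"]
    return ("%+d " % label) + " ".join(feats)  # e.g. "+1 2:1 6:1"

print(svmlight_line(1, ["010", "001"]))  # -> "+1 2:1 6:1"
```

The sparse format matters here: with 7 x 45 mostly-zero features per instance, only seven index:1 pairs actually appear on each line.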

7. Run SVM from /data0/tools/svmlight using two commands (either
   “./svm_learn input_file model_file” or “./svm_classify input_file model_file
   output_file”).

   The first command runs the learning module on the training data file
   (corresponding to one sense of a word) and outputs a model file (parameters)
   which the SVM classifying module takes in to make a prediction.


   The classifying module takes in an input file (corresponding to one sense of a
   word in the test data) and the model file created by running the learning module
   on the training data file for that sense, and outputs a file with a prediction (in the
   form of a one or a negative one, depending on which side of the hyperplane the
   instance falls on), as well as statistics including the accuracy, precision, and
   recall of the classification.


   *Because the Senseval test data did not supply the correct answers within the
   XML corpora, and because this information was needed to evaluate the SVM, I
   needed to call up a separate Senseval file of answers for the test data and insert
   this information into the test corpora (see “test_sense_id.plx”).

THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
 
228-SE3001_2
228-SE3001_2228-SE3001_2
228-SE3001_2
 
Analysis of Opinionated Text for Opinion Mining
Analysis of Opinionated Text for Opinion MiningAnalysis of Opinionated Text for Opinion Mining
Analysis of Opinionated Text for Opinion Mining
 
Multi label classification of
Multi label classification ofMulti label classification of
Multi label classification of
 
G04124041046
G04124041046G04124041046
G04124041046
 
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUECOMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
 
A N H YBRID A PPROACH TO W ORD S ENSE D ISAMBIGUATION W ITH A ND W ITH...
A N H YBRID  A PPROACH TO  W ORD  S ENSE  D ISAMBIGUATION  W ITH  A ND  W ITH...A N H YBRID  A PPROACH TO  W ORD  S ENSE  D ISAMBIGUATION  W ITH  A ND  W ITH...
A N H YBRID A PPROACH TO W ORD S ENSE D ISAMBIGUATION W ITH A ND W ITH...
 
Supervised Approach to Extract Sentiments from Unstructured Text
Supervised Approach to Extract Sentiments from Unstructured TextSupervised Approach to Extract Sentiments from Unstructured Text
Supervised Approach to Extract Sentiments from Unstructured Text
 
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCEDETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
 
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali TextChunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
 
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali TextChunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
 
A neural probabilistic language model
A neural probabilistic language modelA neural probabilistic language model
A neural probabilistic language model
 
Effect of word embedding vector dimensionality on sentiment analysis through ...
Effect of word embedding vector dimensionality on sentiment analysis through ...Effect of word embedding vector dimensionality on sentiment analysis through ...
Effect of word embedding vector dimensionality on sentiment analysis through ...
 
Tracing Requirements as a Problem of Machine Learning
Tracing Requirements as a Problem of Machine Learning Tracing Requirements as a Problem of Machine Learning
Tracing Requirements as a Problem of Machine Learning
 
A hybrid composite features based sentence level sentiment analyzer
A hybrid composite features based sentence level sentiment analyzerA hybrid composite features based sentence level sentiment analyzer
A hybrid composite features based sentence level sentiment analyzer
 
Continuous bag of words cbow word2vec word embedding work .pdf
Continuous bag of words cbow word2vec word embedding work .pdfContinuous bag of words cbow word2vec word embedding work .pdf
Continuous bag of words cbow word2vec word embedding work .pdf
 
Automatic classification of bengali sentences based on sense definitions pres...
Automatic classification of bengali sentences based on sense definitions pres...Automatic classification of bengali sentences based on sense definitions pres...
Automatic classification of bengali sentences based on sense definitions pres...
 
Natural Language Processing Through Different Classes of Machine Learning
Natural Language Processing Through Different Classes of Machine LearningNatural Language Processing Through Different Classes of Machine Learning
Natural Language Processing Through Different Classes of Machine Learning
 
A supervised word sense disambiguation method using ontology and context know...
A supervised word sense disambiguation method using ontology and context know...A supervised word sense disambiguation method using ontology and context know...
A supervised word sense disambiguation method using ontology and context know...
 
AN EMPIRICAL STUDY OF WORD SENSE DISAMBIGUATION
AN EMPIRICAL STUDY OF WORD SENSE DISAMBIGUATIONAN EMPIRICAL STUDY OF WORD SENSE DISAMBIGUATION
AN EMPIRICAL STUDY OF WORD SENSE DISAMBIGUATION
 

More from butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEbutest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

More from butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

Jessica Hullman's Natural Language Processing Project Using Support Vector Machines

resolved (Lecture). Disambiguating a given word involves two steps. First, all of the different senses of the word, as well as the surrounding words to be considered along with it, must be determined; second, a means must be established for assigning each occurrence of the word to the correct sense. Several major sources of information are typically used: the word’s context, and external knowledge sources such as lexicons (Ide and Veronis, 1998, pg. 3).
WordNet, a large lexical database of English developed under George A. Miller, is the best known of the external knowledge sources. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept; different senses of a word are therefore in different synsets (WordNet, http://wordnet.princeton.edu/). The meaning of each synset is further clarified with a short definition.

Context-based methods (also called data-driven or corpus-based methods) use knowledge about previously disambiguated instances of the word within corpora (Ide and Veronis, 1998, pg. 3). The distinction between lexicon-driven, knowledge-based methods and corpus-based methods is often the same as the distinction between supervised and unsupervised learning (supervised referring to a task in which the sense label of each training instance is known, unsupervised to one in which it is not). Unsupervised methods outline a clustering task, in which the external knowledge source of a dictionary or lexicon is used to seed the system, which then augments the labeled instances by learning from unlabeled instances. Supervised learning, on the other hand, can be seen as a classification task in which a function is deduced from data points.

Numerous issues arise with regard to word sense disambiguation. WordNet’s numerous synsets per word bring up one of the most prevalent: determining the appropriate degree of sense granularity for a given task. Several authors (e.g., Slator and Wilks, 1987) have remarked that the sense divisions found in dictionaries are often too fine for the purposes of NLP work; WordNet’s sense distinctions have been criticized, for example, for being more fine-grained than what may be needed in most natural language processing applications (Ide and Veronis, 1998, pg. 13).
Overly fine sense distinctions create practical difficulties for automated WSD by requiring sense choices that are extremely difficult even for expert lexicographers, and the problem of data sparseness becomes severe. Very large amounts of text are needed for supervised methods to ensure that all of the possible senses of a word are represented. Producing corpora hand-labeled for senses, however, is an expensive, time-consuming task, and the results are often less than satisfactory: there is often a fair amount of disparity among human taggers regarding the finer sense distinctions of a word.

Natural Language Processing tasks in which word sense disambiguation is a relevant concern include information retrieval, machine translation, and speech processing. Despite the issues of granularity, evaluating WSD systems outside of these tasks remains a well-documented problem, arising not only from the substantial differences in test conditions across studies, but also from the differences in test words and the variance in the criteria for evaluating the correctness of a sense assignment. The SENSEVAL competition arose out of this need for accepted evaluation standards. SENSEVAL uses in vitro evaluation, which involves comparing a system’s output for a given input using precision and recall (versus in vivo evaluation, in which results are evaluated in terms of their contribution to the overall performance of a system for a given application) (Ide and Veronis, 1998, pg. 25). While somewhat artificial, the reasoning
behind the Senseval competition, and thus that behind my project, is that close examination of the problems that arise in word sense disambiguation will best improve the methods used.

Within the Senseval competition, participants can compete in tasks including translation as well as language-specific disambiguation. English tasks in Senseval include an English all-words task and the English lexical sample task; my project is concerned with the latter. In the lexical sample task, evaluation is based on how well a system disambiguates word-class-specific (for example, all noun) instances in the test data of a sampling of words drawn from the WordNet lexicon. Tagging algorithms are expected to assign probabilities to the possible tags they output. To date, three Senseval competitions have been held; this project uses Senseval-2 data. The corpus for the Senseval-2 English tasks is comprised of sentences from the British National Corpus 2, the Penn Treebank, and the web, and is provided in XML format. I used only this corpus in training my system.

Machine Learning Using Support Vector Machines (SVM)

In recent years, linear classifiers have increased in popularity for supervised learning tasks. A linear classifier is simply a classifier that bases its classification decision on a linear function of its inputs. In other words, given an input feature vector x (a real vector), the estimated output score (or probability) is y = f(w · x), where w is a real vector of weights and f is a function that converts the dot product of the two vectors into the desired output (Wikipedia, Linear classifier, http://en.wikipedia.org/wiki/Linear_classifier). In general, linear classifiers are fast and work well when the number of dimensions of the input vector is very large; in document classification, for example, each element in the input vector is typically the count of a word in a document.
Linear classifiers can be divided into generative models, which model conditional density functions, and discriminative models, which attempt to maximize the quality of the output on a training set. While common generative methods like Bayesian classification handle missing data well, discriminative training methods, including the perceptron and Support Vector Machines, generally yield higher accuracy (Wikipedia, Linear classifier, http://en.wikipedia.org/wiki/Linear_classifier).
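As a minimal illustration of the decision rule above (a Python sketch, not part of the original project; the function name, weights, and inputs are invented for the example), a binary linear classifier computes the dot product w · x and thresholds it:

```python
# Sketch of a binary linear classifier's decision rule: compute the
# weighted sum w . x, then map scores above a threshold to one class
# and all other scores to the second class.

def linear_classify(w, x, threshold=0.0):
    """Return +1 if w . x exceeds the threshold, else -1."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score > threshold else -1

# A toy 4-dimensional example with hand-picked (not learned) weights.
w = [0.5, -1.0, 0.25, 2.0]
print(linear_classify(w, [1, 0, 1, 1]))   # score = 2.75 -> +1
print(linear_classify(w, [0, 1, 0, 0]))   # score = -1.0 -> -1
```

In a real system the weights would be set by the learning algorithm (here, the SVM's optimization) rather than by hand.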
For a binary classification problem, f is a simple function mapping all values above a certain threshold to one class and all other values to a second class (i.e., “yes” and “no”). One can visualize the operation of a linear classifier as splitting a high-dimensional input space with a hyperplane: all points on one side of the plane belong to the first class, and all points on the other side belong to the second class. The SVM is a binary classification learning method that categorizes data by constructing a hyperplane, using optimization, between training instances mapped in a feature space (Schölkopf and Smola, 2002).

Because Lee, Ng, and Chia built one binary classifier for each sense class, I opted to do the same. Like the authors, I converted nominal features with numerous possible values into the corresponding number of binary (0 or 1) features. In this scheme, if a nominal feature takes the nth value, then the corresponding (nth) binary feature is set to 1 and all of the others are set to 0 (Witten and Frank, 2000).

The software I used is SVMlight, an implementation of SVMs in C. SVMlight solves classification, regression, and ranking problems (the last by learning a ranking function), and handles many thousands of support vectors and several hundred thousand training examples. SVMlight is an implementation of Vapnik’s Support Vector Machine (Vapnik, 1995), and the algorithms it uses are described in Joachims (1999). For ranking, the goal is to learn a function from preference examples so that it orders a new set of objects as accurately as possible; such ranking problems naturally occur in applications like search engines and recommender systems. The code has been used on a large range of problems, including text classification, image recognition, bioinformatics, and medical applications (SVMlight, http://svmlight.joachims.org/).

Method

Features: POS of Neighboring Words

The decision of which features to use determines the project.
Like Lee, Ng, and Chia, the features I used were the Parts of Speech (POS) of neighboring words. The first step involved deciding how many words ahead of and behind the given word to consider for POS information. Both because Lee, Ng, and Chia used a three-word window, and because research has shown that a window of more than k = 3 or 4 is unnecessary (Yarowsky, 1994a and b), I opted to use a three-word window.
As an example, given the training corpus sentence “As the leaves grow, train them through the bars for a lovely effect,” the input vector corresponding to < P-3, P-2, P-1, P0, P1, P2, P3 > is set to < DT, NNS, VB, VB, PRP, IN, DT >. I converted all nominal features with numerous possible values into the corresponding number of binary features. This results in an input vector resembling < 01000…, 00001…, 00000…, 00000…, 10000…, 00000…, 00000… >, wherein each place in the vector corresponds to a 45-digit string of 0’s, with a 1 in the place corresponding to that particular tag.

Senseval provides the corpus already divided into a training and a test set. My first step involved parsing the XML format of the corpus. For this I used XML::Twig, a non-event-based XML parser that provides an easy-to-access tree interface (XML::Twig, http://xmltwig.com/xmltwig/). While Twig made the initial parsing task much more efficient in terms of programming, I still needed to develop a program to insert spacing where the parser removed certain tags.

The accuracy of the POS tagger used in a word sense disambiguation task is a limiting factor. My next step being to POS-tag the corpus, I opted to use the Brill Tagger, an error-driven, transformation-based tagger that works by first tagging a corpus using the broadest of a set of tagging rules, then applying progressively more specific rules, repeating this process until some stopping criterion is reached (Jurafsky and Martin, 2006). I chose the tagger for its accuracy of 95-97% (Brill Tagger, http://www.ling.gu.se/~lager/mogul/brill-tagger/index.html).

Next, I needed to substitute certain characters output by the Brill tagger, because these characters have special meaning in Perl. I then needed to extract the POS information of the given words from the output of this program and convert the information into the format needed by the SVM, namely vectors of zeros and ones.
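The window-to-binary-features conversion above can be sketched as follows (a Python sketch under stated assumptions, not the original Perl program; for brevity the tag list here is a truncated subset of the 45-tag Penn Treebank set used in the project):

```python
# Convert a window of POS tags into one-hot binary strings: each position
# becomes a string of 0's as long as the tagset, with a 1 at the index of
# that position's tag.

TAGS = ["CC", "CD", "DT", "EX", "IN", "NNS", "PRP", "VB"]  # truncated tagset

def encode_window(window):
    """Map each POS tag in the window to a one-hot 0/1 string over TAGS."""
    vectors = []
    for tag in window:
        bits = ["0"] * len(TAGS)
        bits[TAGS.index(tag)] = "1"   # the tag's position gets the 1
        vectors.append("".join(bits))
    return vectors

# The seven-tag window from the "bars" example sentence above.
window = ["DT", "NNS", "VB", "VB", "PRP", "IN", "DT"]
print(encode_window(window))
```

With the full 45-tag set, each of the seven positions would be a 45-digit string, as described above.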
To do this I used a program that created a table corresponding to the 45 parts of speech and then read through the parsed, POS-tagged corpora, keeping track of when it reached a new instance of a word. It then reads through the context associated with each instance, keeping track of when it reaches the given word. When the word to be disambiguated is reached, the POS tags of the three words before and after it are converted into a vector of seven 45-digit strings, with each place in a string corresponding to a POS. For each POS in the vector, a one is inserted in the place corresponding to that POS, while all the other places remain 0’s. A separate file is created for each word.

After this, the corpora (now in the form of separate files for each word) needed to be separated into files corresponding to each word sense, so that the SVM could be run once for each particular sense of a word.

Evaluation

I evaluated my project using the evaluation module built into the SVM software. Provided that the correct answers are supplied with the test data, the SVM outputs statistics on the accuracy, precision, and recall of its sense assignments. I evaluated my
project on the accuracy of the sense assignments, getting an average of 87%, with a median of 92% and a standard deviation of 15%.

Possible Improvements

There are multiple minor improvements which might considerably influence my results. Most importantly, to accurately compare this project to the one it was modeled on (Lee, Ng, and Chia), I would need to use the Senseval-3 data (now in the public domain) as well as the Senseval scoring software.

Running a sentence segmentation program on the corpora before POS tagging would have allowed me to track where in the sentence the word to be disambiguated occurred; currently, my project tracks POS information across sentence boundaries.

Like Lee, Ng, and Chia, I built one binary classifier for each sense (meaning) of a word. However, I might instead have run the SVM using a step-wise reduction method, in which a binary classifier is first built for all senses of a word; then, as one sense at a time is eliminated by the SVM, it is removed from the input data file and a new classifier is built for the remaining senses. This method would be more computationally efficient, but whether it would improve accuracy remains to be seen.
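The one-classifier-per-sense scheme used here can be sketched as building, for each sense, a binary training set in which that sense's instances are labeled +1 and all other instances of the same word -1 (a hedged Python sketch; the function and sense-id names are illustrative, not from the project):

```python
# One-vs-rest dataset construction for per-sense binary classifiers.

def per_sense_datasets(instances):
    """instances: list of (feature_vector, sense_id) pairs for one word.
    Returns {sense_id: [(features, +1 or -1), ...]} for binary training."""
    senses = {sense for _, sense in instances}
    datasets = {}
    for sense in senses:
        datasets[sense] = [(x, 1 if s == sense else -1)
                           for x, s in instances]
    return datasets

# Toy example: three instances of one word with two senses.
data = [([1, 0], "bar%1"), ([0, 1], "bar%2"), ([1, 1], "bar%1")]
sets = per_sense_datasets(data)
print(sets["bar%1"])  # [([1, 0], 1), ([0, 1], -1), ([1, 1], 1)]
```

Each resulting dataset would then be written out and passed to one run of the SVM learner.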
Bibliography

Ide, Nancy and Jean Veronis. (1998). “Word sense disambiguation: The state of the art.” In Computational Linguistics, 24(1).

Joachims, Thorsten. (1999). “Transductive inference for text classification using support vector machines.” Universität Dortmund, Dortmund, Germany.

Jurafsky, Daniel, and James H. Martin. Speech and Language Processing: An introduction to natural language processing, computational linguistics, and speech recognition, 2nd edition. (Online version). http://www.cs.colorado.edu/~martin/slp.html

Lee, Yoong Keok and Hwee Tou Ng. (2002). “An empirical evaluation of knowledge sources and learning algorithms for word sense disambiguation.” In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Schölkopf, Bernhard and Alex Smola. (2002). Learning with Kernels. MIT Press, Cambridge, MA.

Vapnik, Vladimir N. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, New York.

Witten, Ian H. and Eibe Frank. (2000). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco.

Yarowsky, D. (1995). “A comparison of corpus-based techniques for restoring accents in Spanish and French text.” Proceedings of the 2nd Annual Workshop on Very Large Text Corpora. Las Cruces.
Programs

All programs can be found (and are to be run) from /data0/users/rthomaso/tmp/hullman on tangra. All files created by the referenced programs output to this directory. This sequence of commands/programs was run first on the training data, then on the test data. To re-run it, some of the pathnames specified in the programs may need to be changed back (they are currently set to run on the test data); the actual running of this sequence gets rather complicated.

Steps:

1. Run “xml_spacer_2.pl”. This program inserts a space in the original XML corpora so that the next program, which parses it, does not abut two words together without a space between them.

2. Run “senseval_parse2.pl”. This program calls up XML::Twig and parses the corpora, outputting a file “senseval_data_spaced.txt”.

3. Run the following three commands:

cd /data2/tools/RULE_BASED_TAGGER_V1.14/Bin_and_Data
export PATH=$PATH:/data2/tools/RULE_BASED_TAGGER_V1.14/Bin_and_Data
/data2/tools/RULE_BASED_TAGGER_V1.14/Bin_and_Data/tagger LEXICON /data0/users/rthomaso/tmp/hullman/senseval_data_spaced.txt BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE > /data0/users/rthomaso/tmp/hullman/senseval_tagged.txt

These commands run the Brill tagger on the parsed corpora, outputting the results to a file called senseval_tagged.txt.

4. Run “POS_substitution.pl”. This program takes the POS-tagged corpora and substitutes strings for problematic characters output by the POS tagger, including $, (, ), #, ., ;, and commas. It outputs a file “senseval_tagged_POS_substituted.txt”.
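The character-substitution step (step 4) can be sketched as a simple replacement pass (a Python sketch; the original POS_substitution.pl was a Perl program, and the placeholder strings here are invented assumptions, not the ones the project actually used):

```python
# Replace characters that cause trouble downstream with placeholder strings.
# Placeholder names are illustrative; only a few of the substituted
# characters ($, (, ), #, ., ;, comma) are shown.

SUBSTITUTIONS = {
    "$": "DOLLARSIGN",
    "(": "LPAREN",
    ")": "RPAREN",
    "#": "POUNDSIGN",
}

def substitute(line):
    """Return the line with each problematic character replaced."""
    for char, placeholder in SUBSTITUTIONS.items():
        line = line.replace(char, placeholder)
    return line

print(substitute("($ 100)"))  # LPARENDOLLARSIGN 100RPAREN
```

A later program could invert the mapping if the original characters were ever needed again.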
5. Run “tag_parser.pl”. This program and the next are the bulk of the project. This one reads in the POS-tagged, character-substituted output file from the previous program, tracking when it gets to a new instance of a word. It then reads in the POS tags, and when it gets to the given word, creates the vector of the POS of that word as well as of the three words before and after it. For each POS in that vector, a 45-digit string is created, with a one inserted in the place corresponding to the part of speech of the word. A separate file is created for every word, along with an indication of which particular vector corresponds to the POS of the given word (out of all of the other words in the context). Each of these files ends in “_SVM_input.txt”.

6. Run “SVM_input.pl”. This splits the data into separate files corresponding to each instance and sense id of the word. The files it outputs for each instance/sense id are formatted for input into SVMlight; each ends in “_SVM_prepared_input.txt”.

7. Run the SVM from /data0/tools/svmlight using two commands (either “./svm_learn input_file model_file” or “./svm_classify input_file model_file output_file”). The first command runs the learning module on a training data file (corresponding to one sense of a word) and outputs a model file (parameters), which the SVM classifying module takes in in order to make a prediction. The classifying module takes in an input file (corresponding to one sense of a word in the test data) and the model file created by running the learning module on the training data file for that sense, and outputs a file with a prediction (in the form of a one or negative one, depending on which side of the hyperplane the instance falls on), as well as statistics including the accuracy, precision, and recall of the classification.
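The files prepared for SVMlight in step 6 must follow its sparse input format, one example per line: a label followed by ascending 1-based index:value pairs for the nonzero features. A minimal sketch of emitting such a line from a binary feature vector (a Python sketch; the paper does not show the actual lines SVM_input.pl wrote):

```python
# Write one training example in SVMlight's sparse format:
# "<label> <index>:<value> ...", with 1-based indices and only
# nonzero features listed.

def svmlight_line(label, features):
    """features: list of 0/1 feature values for one example."""
    pairs = [f"{i + 1}:{v}" for i, v in enumerate(features) if v != 0]
    return f"{label} " + " ".join(pairs)

print(svmlight_line(1, [0, 1, 0, 0, 1]))   # 1 2:1 5:1
print(svmlight_line(-1, [1, 0, 0]))        # -1 1:1
```

The sparse format matters here: with seven 45-digit one-hot strings, each example has 315 features but only seven nonzero entries.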
*Because the Senseval test data did not supply the correct answers within the XML corpora, and because this information was needed to evaluate the SVM, I needed to call up a separate Senseval file of answers for the test data and insert this information into the test corpora (see “test_sense_id.plx”).